Running Real Jobs

Before we can start running jobs on the compute nodes, we need to start the local job manager (slurmd) on each of our compute nodes. We had previously started it on the head node as a test, and the startup file seemed to work fine, so let's copy that same startup file out to all the compute nodes. Instead of doing these one at a time, we'll use pdcp (part of the pdsh package) to do the copies in parallel. We also need the munge service on each node, so let's copy that startup file as well.

baker:~ # pdcp -a /usr/lib/systemd/system/slurmd.service /usr/lib/systemd/system/slurmd.service
baker:~ # pdsh -a ls -l /usr/lib/systemd/system/slurmd.service
compute04: -rw-r--r-- 1 root root 470 Nov 18 16:04 /usr/lib/systemd/system/slurmd.service
compute02: -rw-r--r-- 1 root root 470 Nov 18 16:04 /usr/lib/systemd/system/slurmd.service
compute01: -rw-r--r-- 1 root root 470 Nov 18 16:04 /usr/lib/systemd/system/slurmd.service
compute03: -rw-r--r-- 1 root root 470 Nov 18 16:04 /usr/lib/systemd/system/slurmd.service
baker:~ # pdcp -a /usr/lib/systemd/system/munge.service /usr/lib/systemd/system/munge.service
baker:~ # pdsh -a ls -l /usr/lib/systemd/system/munge.service
compute04: -rw-r--r-- 1 root root 320 Nov 18 17:05 /usr/lib/systemd/system/munge.service
compute03: -rw-r--r-- 1 root root 320 Nov 18 17:05 /usr/lib/systemd/system/munge.service
compute02: -rw-r--r-- 1 root root 320 Nov 18 17:05 /usr/lib/systemd/system/munge.service
compute01: -rw-r--r-- 1 root root 320 Nov 18 17:05 /usr/lib/systemd/system/munge.service
baker:~ #

Since the pdcp command didn't give us any output, I also ran a quick check to make sure the files actually landed on the compute nodes. Everything looks good so far.
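
A handy trick for this kind of check is to pipe the pdsh output through dshbak, which comes with the pdsh package and collapses identical output from different nodes into a single block. Here's a quick sketch that compares checksums instead of just file sizes, on the assumption that the service files should be byte-for-byte identical on every compute node:

# dshbak -c folds together nodes that report the same checksum
pdsh -a md5sum /usr/lib/systemd/system/slurmd.service | dshbak -c
pdsh -a md5sum /usr/lib/systemd/system/munge.service | dshbak -c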

We also need some directories where munge and slurmd can write their log and state files, so let's get those set up, too. This is the same procedure we used on the head node; we're just doing it in parallel on each compute node now.

baker:~ # pdsh -a mkdir /var/log/munge
baker:~ # pdsh -a chown munge:munge /var/log/munge
baker:~ # pdsh -a chmod 0700 /var/log/munge
baker:~ # pdsh -a mkdir /var/lib/munge
baker:~ # pdsh -a chown munge:munge /var/lib/munge
baker:~ # pdsh -a chmod 0711 /var/lib/munge
baker:~ # pdsh -a mkdir /etc/munge
baker:~ # pdsh -a chown munge:munge /etc/munge
baker:~ # pdsh -a chmod 0700 /etc/munge
baker:~ # pdsh -a mkdir /var/run/munge
baker:~ # pdsh -a chown munge:munge /var/run/munge
baker:~ # pdsh -a chmod 0755 /var/run/munge
baker:~ #
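
If you ever need to repeat this on a freshly imaged node, the same steps collapse nicely into a short snippet. This is just a sketch of the commands above, wrapped in a loop so it's easy to rerun:

# Hypothetical helper snippet: recreate the munge directories on every node in one pass
for dir in /var/log/munge /var/lib/munge /etc/munge /var/run/munge; do
    pdsh -a mkdir -p "$dir"
    pdsh -a chown munge:munge "$dir"
done
pdsh -a chmod 0700 /var/log/munge /etc/munge   # logs and key material stay private
pdsh -a chmod 0711 /var/lib/munge              # seed directory
pdsh -a chmod 0755 /var/run/munge              # runtime socket directory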

The munge service also has a shared secret that's stored as a seed file under /var/lib/munge. The munge process has to use the same seed file on each machine, so we will copy the seed file from the head node out to each compute node.

baker:~ # pdcp -a /var/lib/munge/munge.seed /var/lib/munge/munge.seed
baker:~ # pdsh -a ls -l /var/lib/munge/munge.seed
compute04: -rw------- 1 root root 1024 Nov 18 17:27 /var/lib/munge/munge.seed
compute01: -rw------- 1 root root 1024 Nov 18 17:27 /var/lib/munge/munge.seed
compute02: -rw------- 1 root root 1024 Nov 18 17:27 /var/lib/munge/munge.seed
compute03: -rw------- 1 root root 1024 Nov 18 17:27 /var/lib/munge/munge.seed
baker:~ # pdsh -a chown munge:munge /var/lib/munge/munge.seed
baker:~ # pdsh -a ls -l /var/lib/munge/munge.seed
compute04: -rw------- 1 munge munge 1024 Nov 18 17:27 /var/lib/munge/munge.seed
compute03: -rw------- 1 munge munge 1024 Nov 18 17:27 /var/lib/munge/munge.seed
compute01: -rw------- 1 munge munge 1024 Nov 18 17:27 /var/lib/munge/munge.seed
compute02: -rw------- 1 munge munge 1024 Nov 18 17:27 /var/lib/munge/munge.seed
baker:~ #

The pdcp command didn't preserve the ownership of the file, so I changed the ownership to the munge user. We should be okay now for the munge service.
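
It doesn't hurt to confirm the ownership and permissions in one shot rather than eyeballing four lines of ls output. A quick sketch using stat, again with dshbak collapsing identical results:

# Prints owner:group, octal mode, and file name; all four nodes should agree
pdsh -a "stat -c '%U:%G %a %n' /var/lib/munge/munge.seed" | dshbak -c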

The job manager also requires the NUMA library, so let's use pdsh to install it on the compute nodes before we go any further.

baker:~ # pdsh -a zypper install -y libnuma1
compute04: Loading repository data...
compute04: Reading installed packages...
compute04: Resolving package dependencies...
compute04: 
compute04: The following NEW package is going to be installed:
compute04:   libnuma1
compute04: 
compute04: 1 new package to install.
compute04: Overall download size: 29.8 KiB. Already cached: 0 B. After the operation, additional 67.1 KiB will be used.
compute04: Continue? [y/n/...? shows all options] (y): y
compute04: Retrieving package libnuma1-2.0.11-lp150.2.1.aarch64 (1/1),  29.8 KiB ( 67.1 KiB unpacked)
compute04: Retrieving: libnuma1-2.0.11-lp150.2.1.aarch64.rpm [done]
compute04: Checking for file conflicts: [...done]
compute04: (1/1) Installing: libnuma1-2.0.11-lp150.2.1.aarch64 [.....done]
compute03: Retrieving repository 'openSUSE-Ports-Leap-15.0-Update' metadata [.................................done]
compute02: Retrieving repository 'openSUSE-Ports-Leap-15.0-Update' metadata [..................................done]
compute01: Retrieving repository 'openSUSE-Ports-Leap-15.0-Update' metadata [...................................done]
compute01: Building repository 'openSUSE-Ports-Leap-15.0-Update' cache [....done]
compute03: Building repository 'openSUSE-Ports-Leap-15.0-Update' cache [....done]
compute01: Loading repository data...
compute02: Building repository 'openSUSE-Ports-Leap-15.0-Update' cache [....done]
compute03: Loading repository data...
compute02: Loading repository data...
compute01: Reading installed packages...
compute03: Reading installed packages...
compute02: Reading installed packages...
compute01: Resolving package dependencies...
compute01: 
compute01: The following NEW package is going to be installed:
compute01:   libnuma1
compute01: 
compute01: 1 new package to install.
compute01: Overall download size: 29.8 KiB. Already cached: 0 B. After the operation, additional 67.1 KiB will be used.
compute01: Continue? [y/n/...? shows all options] (y): y
compute01: Retrieving package libnuma1-2.0.11-lp150.2.1.aarch64 (1/1),  29.8 KiB ( 67.1 KiB unpacked)
compute03: Resolving package dependencies...
compute03: 
compute03: The following NEW package is going to be installed:
compute03:   libnuma1
compute03: 
compute03: 1 new package to install.
compute03: Overall download size: 29.8 KiB. Already cached: 0 B. After the operation, additional 67.1 KiB will be used.
compute03: Continue? [y/n/...? shows all options] (y): y
compute03: Retrieving package libnuma1-2.0.11-lp150.2.1.aarch64 (1/1),  29.8 KiB ( 67.1 KiB unpacked)
compute02: Resolving package dependencies...
compute01: Retrieving: libnuma1-2.0.11-lp150.2.1.aarch64.rpm [.done (2.5 KiB/s)]
compute01: Checking for file conflicts: [...done]
compute02: 
compute02: The following NEW package is going to be installed:
compute02:   libnuma1
compute02: 
compute02: 1 new package to install.
compute02: Overall download size: 29.8 KiB. Already cached: 0 B. After the operation, additional 67.1 KiB will be used.
compute02: Continue? [y/n/...? shows all options] (y): y
compute01: (1/1) Installing: libnuma1-2.0.11-lp150.2.1.aarch64 [.....done]
compute02: Retrieving package libnuma1-2.0.11-lp150.2.1.aarch64 (1/1),  29.8 KiB ( 67.1 KiB unpacked)
compute03: Retrieving: libnuma1-2.0.11-lp150.2.1.aarch64.rpm [.done]
compute03: Checking for file conflicts: [...done]
compute03: (1/1) Installing: libnuma1-2.0.11-lp150.2.1.aarch64 [.....done]
compute02: Retrieving: libnuma1-2.0.11-lp150.2.1.aarch64.rpm [done]
compute02: Checking for file conflicts: [...done]
compute02: (1/1) Installing: libnuma1-2.0.11-lp150.2.1.aarch64 [.....done]
baker:~ #
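
With four nodes' worth of interleaved zypper output, it's easy to miss a failure, so a quick confirmation that the package really landed everywhere is cheap insurance. A sketch:

# Any node that missed the package will report "package libnuma1 is not installed"
pdsh -a rpm -q libnuma1 | dshbak -c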

When we were testing our job scheduler, we started the local job manager on the head node just to make sure that things were working. On a production cluster, though, we don't want to run jobs on the head node, so let's shut down the local job manager there and start the munge and slurmd services on each compute node.
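
While we're at it, it's worth making sure both services come back on their own after a reboot; otherwise the compute nodes will sit idle until someone starts slurmd by hand. A minimal sketch, assuming the unit files we copied earlier include the usual [Install] section:

# Pick up the newly copied unit files, then enable them for the next boot
pdsh -a systemctl daemon-reload
pdsh -a systemctl enable munge slurmd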

baker:~ # systemctl stop slurmd
baker:~ # pdsh -a systemctl start munge
baker:~ # pdsh -a systemctl start slurmd
baker:~ # sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
prod*        up   infinite      1  idle* baker
prod*        up   infinite      4   idle compute[01-04]
baker:~ #

This looks almost correct. Our four compute nodes are up and idle, waiting for jobs. The head node baker still shows up as idle, even though we shut down its local job manager. You'll notice that its state is listed with an asterisk, though: the compute nodes are listed as idle, while the head node is listed as idle*. The asterisk means the node was previously online and accepting jobs, but the scheduler has lost contact with it. Eventually the scheduler will decide that the node really is down. Right now, it's being optimistic that the node might come back online, so let's explicitly mark the node as offline in the scheduler so it doesn't show up as potentially available.

baker:~ # scontrol update nodename=baker state=down reason=offline
baker:~ # sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
prod*        up   infinite      1  down* baker
prod*        up   infinite      4   idle compute[01-04]
baker:~ #

Now the node is explicitly down. It's still marked with an asterisk because the scheduler still can't talk to it, but things are in good enough shape to move on and test our scheduler. To do that, we'll write a simple job script that tells the job manager how to run our job, and then submit it to the scheduler. Copy the following commands into a file and save it as hostname.sh.

#!/bin/bash
#
#SBATCH --job-name=hostname
#SBATCH -N 1
/bin/hostname

Now you can submit it to the scheduler with the sbatch command. We didn't tell the scheduler where to store the output, so it will put the output in a file named after the job number. Let's give it a try and see what we get.

admin@baker:~> module add slurm
admin@baker:~> sbatch hostname.sh
Submitted batch job 12
admin@baker:~> ls
backup  bin  hostname.sh  slurm-12.out  test
admin@baker:~> cat slurm-12.out
/apps/modules-4.2.0/init/bash: line 37: /usr/bin/tclsh: No such file or directory
compute01
admin@baker:~>

Again, this is promising. There are a few things to note here. First, the output went to a file called slurm-12.out, named after our job number, which was 12. Second, we seem to be missing another piece of software, in this case Tcl (Tool Command Language), a scripting language. Nevertheless, our job appears to have run correctly. Third, the actual output of our job, which just prints the hostname, shows that it ran on compute node compute01. Despite the missing package, it looks like our cluster is mostly working.
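
If you'd rather not hunt for output files named only by job number, sbatch lets you set the file names explicitly. Here's a sketch of the same job script with output and error files spelled out; %x expands to the job name and %j to the job ID:

#!/bin/bash
#
#SBATCH --job-name=hostname
#SBATCH -N 1
# With these directives the output lands in, e.g., hostname-12.out
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err
/bin/hostname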

Let's fix the tcl issue, and then we can try running our job again.

baker:~ # pdsh -a zypper install -y tcl
compute01: Loading repository data...
compute04: Loading repository data...
compute03: Loading repository data...
compute02: Loading repository data...
compute01: Reading installed packages...
compute02: Reading installed packages...
compute04: Reading installed packages...
compute03: Reading installed packages...
compute01: Resolving package dependencies...
compute04: Resolving package dependencies...
compute02: Resolving package dependencies...
compute03: Resolving package dependencies...
compute01: 
compute01: The following NEW package is going to be installed:
compute01:   tcl
compute01: 
compute01: 1 new package to install.
compute01: Overall download size: 2.7 MiB. Already cached: 0 B. After the operation, additional 8.3 MiB will be used.
compute01: Continue? [y/n/...? shows all options] (y): y
compute04: 
compute04: The following NEW package is going to be installed:
compute04:   tcl
compute04: 
compute04: 1 new package to install.
compute02: 
compute04: Overall download size: 2.7 MiB. Already cached: 0 B. After the operation, additional 8.3 MiB will be used.
compute04: Continue? [y/n/...? shows all options] (y): y
compute02: The following NEW package is going to be installed:
compute02:   tcl
compute02: 
compute02: 1 new package to install.
compute02: Overall download size: 2.7 MiB. Already cached: 0 B. After the operation, additional 8.3 MiB will be used.
compute02: Continue? [y/n/...? shows all options] (y): y
compute01: Retrieving package tcl-8.6.7-lp150.4.4.aarch64 (1/1),   2.7 MiB (  8.3 MiB unpacked)
compute04: Retrieving package tcl-8.6.7-lp150.4.4.aarch64 (1/1),   2.7 MiB (  8.3 MiB unpacked)
compute02: Retrieving package tcl-8.6.7-lp150.4.4.aarch64 (1/1),   2.7 MiB (  8.3 MiB unpacked)
compute03: 
compute03: The following NEW package is going to be installed:
compute03:   tcl
compute03: 
compute03: 1 new package to install.
compute03: Overall download size: 2.7 MiB. Already cached: 0 B. After the operation, additional 8.3 MiB will be used.
compute03: Continue? [y/n/...? shows all options] (y): y
compute03: Retrieving package tcl-8.6.7-lp150.4.4.aarch64 (1/1),   2.7 MiB (  8.3 MiB unpacked)
compute04: Retrieving: tcl-8.6.7-lp150.4.4.aarch64.rpm [..done (1.4 MiB/s)]
compute01: Retrieving: tcl-8.6.7-lp150.4.4.aarch64.rpm [..done (1.5 MiB/s)]
compute02: Retrieving: tcl-8.6.7-lp150.4.4.aarch64.rpm [..done (1.3 MiB/s)]
compute01: Checking for file conflicts: [......done]
compute04: Checking for file conflicts: [......done]
compute02: Checking for file conflicts: [......done]
compute03: Retrieving: tcl-8.6.7-lp150.4.4.aarch64.rpm [..done (1.3 MiB/s)]
compute03: Checking for file conflicts: [......done]
compute01: (1/1) Installing: tcl-8.6.7-lp150.4.4.aarch64 [............done]
compute02: (1/1) Installing: tcl-8.6.7-lp150.4.4.aarch64 [............done]
compute04: (1/1) Installing: tcl-8.6.7-lp150.4.4.aarch64 [............done]
compute03: (1/1) Installing: tcl-8.6.7-lp150.4.4.aarch64 [............done]
baker:~ #

That looks good. Now let's submit our job a few times to make sure we can use all the compute nodes.

admin@baker:~> sbatch hostname.sh 
Submitted batch job 13
admin@baker:~> sbatch hostname.sh 
Submitted batch job 14
admin@baker:~> sbatch hostname.sh 
Submitted batch job 15
admin@baker:~> sbatch hostname.sh 
Submitted batch job 16
admin@baker:~> cat slurm-13.out
compute01
admin@baker:~> cat slurm-14.out
compute01
admin@baker:~> cat slurm-15.out
compute01
admin@baker:~> cat slurm-16.out
compute01
admin@baker:~>

Well, at least we were able to get rid of the error message from Tcl. However, every job ran on the same compute node. The problem is that our job runs so quickly that it finishes, and the first compute node becomes available again, before the next job even starts. Just to make sure the scheduler will use all the nodes, let's add a sleep command to our job script and then submit a batch of identical jobs to see where they land. We should be able to get a job on every node. With the sleep command added, my job script now looks like this:

#!/bin/bash
#
#SBATCH --job-name=hostname
#SBATCH -N 1
/bin/hostname
sleep 10
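
As an aside, Slurm has a built-in way to queue many copies of the same job: a job array. Instead of the shell loop used below, a single sbatch of a script like this sketch would queue twenty tasks, with %A expanding to the array's master job ID and %a to the task index in the output file name:

#!/bin/bash
#
#SBATCH --job-name=hostname
#SBATCH -N 1
# One submission, twenty tasks; output files are named hostname-<jobid>_<taskid>.out
#SBATCH --array=1-20
#SBATCH --output=%x-%A_%a.out
/bin/hostname
sleep 10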

Let's erase all the previous job output, submit a bunch of jobs, and see what we get for output.

admin@baker:~> for job in `seq 1 20`; do sbatch hostname.sh; done
Submitted batch job 17
Submitted batch job 18
Submitted batch job 19
Submitted batch job 20
Submitted batch job 21
Submitted batch job 22
Submitted batch job 23
Submitted batch job 24
Submitted batch job 25
Submitted batch job 26
Submitted batch job 27
Submitted batch job 28
Submitted batch job 29
Submitted batch job 30
Submitted batch job 31
Submitted batch job 32
Submitted batch job 33
Submitted batch job 34
Submitted batch job 35
Submitted batch job 36
admin@baker:~> squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                33      prod hostname    admin PD       0:00      1 (Priority)
                34      prod hostname    admin PD       0:00      1 (Priority)
                35      prod hostname    admin PD       0:00      1 (Priority)
                36      prod hostname    admin PD       0:00      1 (Priority)
                17      prod hostname    admin  R       0:03      1 compute01
                18      prod hostname    admin  R       0:03      1 compute01
                19      prod hostname    admin  R       0:03      1 compute01
                20      prod hostname    admin  R       0:03      1 compute01
                21      prod hostname    admin  R       0:03      1 compute02
                22      prod hostname    admin  R       0:03      1 compute02
                23      prod hostname    admin  R       0:03      1 compute02
                24      prod hostname    admin  R       0:03      1 compute02
                25      prod hostname    admin  R       0:03      1 compute03
                26      prod hostname    admin  R       0:03      1 compute03
                27      prod hostname    admin  R       0:02      1 compute03
                28      prod hostname    admin  R       0:02      1 compute03
                29      prod hostname    admin  R       0:02      1 compute04
                30      prod hostname    admin  R       0:02      1 compute04
                31      prod hostname    admin  R       0:02      1 compute04
                32      prod hostname    admin  R       0:02      1 compute04
admin@baker:~> squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
admin@baker:~>

The squeue command shows the job queue, that is, the list of jobs waiting and running. From the output we can see that each job was scheduled onto a single core of one of the compute nodes. Since each node has four cores, each node ran four jobs at a time. With 20 jobs submitted at once, that gave us 16 jobs running and four jobs pending (waiting for a compute node to become available). The job queue cleared out after about 20 seconds. Let's check the output from the jobs.

admin@baker:~> cat slurm-*.out
compute01
compute01
compute01
compute01
compute02
compute02
compute02
compute02
compute03
compute03
compute03
compute03
compute04
compute04
compute04
compute04
compute01
compute01
compute01
compute03
admin@baker:~>

We can see from the output that jobs got scheduled on all of the compute nodes, so things are looking great. The real value of a cluster, though, is in running parallel jobs across multiple nodes. In the next section, we'll configure MPI and run a real parallel job.