Benchmarking is a broad topic. Traditionally, we run a program called HPL (High Performance Linpack), which measures performance by solving a random dense linear system. HPL is the traditional measure of performance for a cluster, and is still used for the TOP500 rankings of the fastest clusters in the world.
More recently, a newer benchmark called HPCG (High Performance Conjugate Gradients) has been developed. Its testing is a little broader than HPL's, in an attempt to better represent the performance of a cluster on real-world applications.
For our purposes, let's start simple. Right now, we're running a "Pure Pi" cluster. We've overloaded our head node as a file server, job manager master, and login node. This isn't something we usually do in the real world unless we're severely budget constrained. Nevertheless, let's see how we perform in this configuration.
In addition to the raw processing power of the cluster, there are other benchmarks that we should look at. Another common benchmark is how fast we can dispatch jobs. This is a common metric for a few problem domains, such as weather forecasting. When we need to run large numbers of simulations to determine the probability of tornadoes forming, project the path of a hurricane or typhoon, estimate storm surge, or solve similar time-critical problems, the ability to dispatch large numbers of jobs very quickly becomes critical. Let's start by benchmarking how fast we can dispatch jobs.
Although HPL and HPCG are fairly well defined in terms of how you run them, calculating how fast you can dispatch jobs is much less well defined. Let's start with a simple job that just sleeps for 0 seconds and see how long it takes to dispatch 1000 jobs. Here's the job script:
#!/bin/sh
#SBATCH --job-name sleep
#SBATCH -o sleep.out.%J
#SBATCH -n 1
sleep 0
Let's save that job script as sleep.sh and give it a try to see how long it takes. Here's a first try with this job script:
admin@baker:~/dev> date; for i in `seq 1 1000`; do sbatch sleep.sh; done; date; squeue
Tue Nov 20 23:25:55 CST 2018
Submitted batch job 45
Submitted batch job 46
...
Submitted batch job 1042
Submitted batch job 1043
Submitted batch job 1044
Tue Nov 20 23:26:48 CST 2018
JOBID PARTITION NAME  USER  ST TIME NODES NODELIST(REASON)
1038  prod      sleep admin R  0:01 1     compute01
1039  prod      sleep admin R  0:01 1     compute01
1041  prod      sleep admin R  0:01 1     compute02
1042  prod      sleep admin R  0:01 1     compute01
1043  prod      sleep admin R  0:00 1     compute01
1044  prod      sleep admin R  0:00 1     compute02
admin@baker:~/dev>
I've truncated the output because it's over 1000 lines of mostly the same data. There are a couple of things we can note from this output, though. It took 53 seconds to dispatch 1000 jobs. If we submitted 1 million jobs at the same rate, it would take a little under 15 hours. That seems pretty reasonable for a lowly Raspberry Pi. However, as you can see at the end of the command, I printed out the job queue and there were still six jobs in it. They had been dispatched and were running, but they hadn't completed. This is actually a good sign; it means we're dispatching the jobs faster than the compute nodes can run them (although not much faster). Let's dig a little deeper.
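The extrapolation is plain proportional scaling, and it's handy to have as a helper when we rerun the test. Here's a minimal sketch. The function name extrapolate_hours is my own invention, and it assumes the submission rate stays constant as the job count grows, which may not hold on a real system.

```sh
# Hypothetical helper: hours needed to submit <target_jobs> at the measured rate.
# usage: extrapolate_hours <elapsed_seconds> <jobs_submitted> <target_jobs>
extrapolate_hours() {
    awk -v secs="$1" -v jobs="$2" -v target="$3" \
        'BEGIN { printf "%.2f\n", secs / jobs * target / 3600 }'
}

# The serial loop above: 1000 jobs in 53 seconds, scaled to 1 million jobs.
extrapolate_hours 53 1000 1000000   # prints 14.72
```

Keeping the measured time and job count as arguments means the same helper works for any later run.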
Our cluster runs jobs in parallel, but we're submitting jobs serially. With a large user base, this isn't how the job manager typically receives jobs. The job manager receives multiple jobs from multiple users, effectively receiving them (mostly) in parallel. Let's run the same job, but this time we'll just send every submission to the background and let the job manager deal with the flood of jobs.
Let's put our submit loop into a script so it's a bit easier to modify. Save the following file as submit.sh so we can tweak it as needed.
#!/bin/bash
date
for i in `seq 1 1000`
do
    sbatch sleep.sh &
done
date
squeue
This script submits the same sleep job as our command-line loop, but runs every submission in the background, so we're not waiting for sbatch to complete before the next job is submitted. If you run this script with bash submit.sh, you'll see that although it seems to start okay, it quickly starts failing with messages like this:
sbatch: error: Batch job submission failed: Socket timed out on send/recv operation
sbatch: error: Batch job submission failed: Socket timed out on send/recv operation
sbatch: error: Batch job submission failed: Socket timed out on send/recv operation
sbatch: error: Batch job submission failed: Socket timed out on send/recv operation
When our script finishes, you will continue to get messages about jobs being submitted, since we sent every job into the background. Furthermore, we still have a lot of jobs queued when our script completes, so our timing numbers are questionable. It seems we need to rethink our approach to benchmarking job submission times.
Let's deal with the timeout problem first. Problems like this are often caused by exceeding some system limit. The system places restrictions on a number of resources to prevent a single process from overloading the server. We can check our limits with the ulimit -a command:
admin@baker:~/dev> ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2483
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
admin@baker:~/dev>
Things here don't look too bad. Most of the system restrictions are set to unlimited. We submitted 1000 jobs, and the only limit we're close to is open files. I'm sure we're not hitting that limit, so let's look at the job manager.
If you look at our scheduler configuration file (/apps/slurm-19.05/etc/slurm.conf) and just search for the keyword Timeout, you can find a handful of parameters with default values. The one that looks promising to me is #MessageTimeout=10. As a first try, let's uncomment that line and increase the value by an order of magnitude, from 10 to 100. You should be careful about making arbitrary changes like this, as they may have unintended consequences. This is just a test, but if it works, we can go back and figure out a reasonable value for this parameter.
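If you'd rather script the change than edit the file by hand, something like the following sketch would do it. The helper name set_message_timeout is made up for illustration, and it assumes the parameter appears on a line of its own, either commented out as #MessageTimeout=10 or already active. It prints the modified configuration to standard output instead of editing in place, so you can review the difference before replacing the real file.

```sh
# Hypothetical helper: print a copy of a slurm.conf-style file with
# MessageTimeout set to <seconds>, whether or not the line was commented out.
# usage: set_message_timeout <config_file> <seconds>
set_message_timeout() {
    sed -E "s/^#?MessageTimeout=.*/MessageTimeout=$2/" "$1"
}

# Example (paths are up to you):
#   set_message_timeout slurm.conf 100 > slurm.conf.new
#   diff slurm.conf slurm.conf.new
```

Writing to a new file first fits the warning above: it keeps an arbitrary test value from silently overwriting a working configuration.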
Save the configuration file with the new value, and restart the scheduler with systemctl stop slurmctld; systemctl start slurmctld. Most of the parameters in the configuration file can be updated with the scontrol reconfigure command, but some require the scheduler to be stopped and started again. If you're not running a production system and you're not sure which parameters can be updated on the fly, it's usually safer just to stop and restart the scheduler.
After restarting the scheduler, try running the job submission script again. This time, all the jobs will get scheduled without any problems, so we've correctly identified the parameter that was causing problems before. To determine a reasonable value, you can take a look at the output from the sdiag command which will show lots of statistics from the scheduler. Look at the section labeled Remote Procedure Call statistics by message type and specifically the average time listed for REQUEST_SUBMIT_BATCH_JOB.
That's one problem down. If we look at the times reported from our script, it shows that it finished in 14 seconds. That's a great improvement over our first try which took 53 seconds, but it's not really accurate. Since we sent all our job submissions into the background, our script finished long before all the jobs were submitted. We can check the time of the final job output file, but that will give us the amount of time it took to submit and run all the jobs. We're also not guaranteed that the final job was really the last job scheduled. We'll have to revisit this. If we look at the total time to submit and run the jobs, we did significantly better, so it's still a win. With the updated numbers, we can schedule 1 million jobs in just under 4 hours. That seems pretty impressive, but we have nothing to compare it to. Let's make a note of our numbers and move on to some real performance testing. (Don't forget to clean up all those output files!)
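One way to tighten up the submission timing is to wait for every background sbatch process to return before taking the end time. The sketch below is only that, a sketch: time_parallel is a name I made up, the resolution is whole seconds, and it measures how long it takes for all submissions to be accepted (or rejected) by the scheduler, not how long the jobs take to run.

```sh
# Hypothetical helper: launch <count> copies of a command in the background,
# wait for every one of them to return, and print the elapsed seconds.
# usage: time_parallel <count> <command> [args...]
time_parallel() {
    count=$1
    shift
    start=$(date +%s)
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@" >&2 &      # send the command's output to stderr, keep stdout clean
        i=$((i + 1))
    done
    wait                # block until all background commands have exited
    end=$(date +%s)
    echo "$((end - start))"
}

# Example (requires Slurm): time_parallel 1000 sbatch sleep.sh
```

Because the command to run is passed in as arguments, the same helper can time any other submission command without changes.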
For benchmarking performance of our compute nodes, let's start with the traditional HPL benchmark. You can download the latest version of HPL from Netlib. Here's how I pulled it down and unpacked it:
admin@baker:~> cd /apps/source/
admin@baker:/apps/source> wget http://www.netlib.org/benchmark/hpl/hpl-2.2.tar.gz
--2018-11-21 23:15:51-- http://www.netlib.org/benchmark/hpl/hpl-2.2.tar.gz
Resolving www.netlib.org (www.netlib.org)... 18.104.22.168
Connecting to www.netlib.org (www.netlib.org)|22.214.171.124|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 539979 (527K) [application/x-gzip]
Saving to: ‘hpl-2.2.tar.gz’
hpl-2.2.tar.gz 100%[================================================>] 527.32K 609KB/s in 0.9s
2018-11-21 23:15:52 (609 KB/s) - ‘hpl-2.2.tar.gz’ saved [539979/539979]
admin@baker:/apps/source> tar -xf hpl-2.2.tar.gz
admin@baker:/apps/source> ls
hpl-2.2         modules-4.2.0.tar.gz  openmpi-3.1.1.tar.gz   pdsh-2.33         slurm-19.05.tar.z
hpl-2.2.tar.gz  munge-0.5.13          openmpi-3.1.2          pdsh-2.33.tar.gz
modules-4.2.0   munge-0.5.13.tar.xz   openmpi-3.1.2.tar.bz2  slurm-19.05
admin@baker:/apps/source> cd hpl-2.2/
admin@baker:/apps/source/hpl-2.2>
Now that we have the code, we still need some supporting software to build it. We already have our MPI library, but we still need a BLAS (Basic Linear Algebra Subprograms) library. We have options for BLAS, too. The easiest solution is to just use the implementation provided by openSUSE. We can look at other options later. Install the cblas libraries with zypper:
baker:~ # zypper install cblas-devel
Retrieving repository 'openSUSE-Ports-Leap-15.0-Update' metadata ..........................[done]
Building repository 'openSUSE-Ports-Leap-15.0-Update' cache ...............................[done]
Retrieving repository 'Packman Repository' metadata .......................................[done]
Building repository 'Packman Repository' cache ............................................[done]
Loading repository data...
Reading installed packages...
Resolving package dependencies...
The following 2 NEW packages are going to be installed:
cblas-devel libcblas3
2 new packages to install.
Overall download size: 41.8 KiB. Already cached: 0 B. After the operation, additional 193.5 KiB will be used.
Continue? [y/n/...? shows all options] (y):
Retrieving package libcblas3-20110120-lp150.1.3.aarch64 (1/2), 28.0 KiB (131.3 KiB unpacked)
Retrieving: libcblas3-20110120-lp150.1.3.aarch64.rpm ......................................[done]
Retrieving package cblas-devel-20110120-lp150.1.3.aarch64 (2/2), 13.8 KiB ( 62.2 KiB unpacked)
Retrieving: cblas-devel-20110120-lp150.1.3.aarch64.rpm ....................................[done]
Checking for file conflicts: ..............................................................[done]
(1/2) Installing: libcblas3-20110120-lp150.1.3.aarch64 ....................................[done]
Additional rpm output:
update-alternatives: using /usr/lib64/blas/libcblas.so.3 to provide /usr/lib64/libcblas.so.3 (libcblas.so.3) in auto mode
(2/2) Installing: cblas-devel-20110120-lp150.1.3.aarch64 ..................................[done]
Executing %posttrans scripts ..............................................................[done]
baker:~ #
Now let's go back and build HPL. We need an architecture file to tell HPL how to build the software and where all the dependencies are. There are some sample architecture files under the setup/ directory. You won't find anything for the Pi, or even the ARM architecture, in there. We'll have to build our own.
There is a small utility in the setup/ directory called make_generic that will produce a mostly blank architecture file. Let's run that, and then customize the result for our environment.
admin@baker:/apps/source/hpl-2.2/setup> sh make_generic
admin@baker:/apps/source/hpl-2.2/setup> ls
Make.FreeBSD_PIV_CBLAS   Make.Linux_ATHLON_VSIPL  Make.Linux_PII_VSIPL_gm  Make.SUN4SOL2-g_VSIPL
make_generic             Make.Linux_Intel64       Make.MacOSX_Accelerate   Make.T3E_FBLAS
Make.HPUX_FBLAS          Make.Linux_PII_CBLAS     Make.PWR2_FBLAS          Make.Tru64_FBLAS
Make.I860_FBLAS          Make.Linux_PII_CBLAS_gm  Make.PWR3_FBLAS          Make.Tru64_FBLAS_elan
Make.IRIX_FBLAS          Make.Linux_PII_FBLAS     Make.PWRPC_FBLAS         Make.UNKNOWN
Make.Linux_ATHLON_CBLAS  Make.Linux_PII_FBLAS_gm  Make.SUN4SOL2_FBLAS      Make.UNKNOWN.in
Make.Linux_ATHLON_FBLAS  Make.Linux_PII_VSIPL     Make.SUN4SOL2-g_FBLAS
admin@baker:/apps/source/hpl-2.2/setup> cp Make.UNKNOWN ../Make.ARM
admin@baker:/apps/source/hpl-2.2/setup> cd ..
Now edit Make.ARM and change the following definitions:
ARCH = ARM
HOME = /apps/source
TOPdir = $(HOME)/hpl-2.2
MPdir = /apps/openmpi-3.1.2
MPinc = -I $(MPdir)/include
MPlib = $(MPdir)/lib64/libmpi.so
LAdir = /usr/lib64
LAinc = -I /usr/include
LAlib = /usr/lib64/blas/libblas.so.3 /usr/lib64/blas/libcblas.so.3
HPL_OPTS = -DHPL_CALL_CBLAS
Now we can build HPL. The full output from the make command is saved to the file make.log:
admin@baker:/apps/source/hpl-2.2> make arch=ARM |& tee make.log
If you watch the output from the make command, you'll see some warnings go by. Those are safe to ignore. You can run the HPL benchmark directly, but it will only run on a single core. To get multiple cores and multiple nodes, we need to use MPI. Let's start by benchmarking a single node (our management node) to see how we do. We need to edit the HPL.dat file to tune the parameters for the run we want. We can make some fairly good estimates on the parameters to get a decent run, but tuning for maximum performance takes some work. It's part mathematics, part experimentation, and part art form. For starters, try the following HPL.dat file:
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
7000         Ns
1            # of NBs
128          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
2            Ps
2            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
This file is a reasonable start for a Raspberry Pi 3 Model B+, which has four cores and 1GB of memory. Here's a test run using this data file:
admin@baker:/apps/source/hpl-2.2.atlas/bin/ARM> module add openmpi
admin@baker:/apps/source/hpl-2.2.atlas/bin/ARM> mpirun -n 4 ./xhpl
================================================================================
HPLinpack 2.2 -- High-Performance Linpack benchmark -- February 24, 2016
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N      : 7000
NB     : 128
PMAP   : Row-major process mapping
P      : 2
Q      : 2
PFACT  : Right
NBMIN  : 4
NDIV   : 2
RFACT  : Crout
BCAST  : 1ringM
DEPTH  : 1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4        7000   128     2     2             783.15              2.921e-01
HPL_pdgesv() start time Sat Dec 29 19:23:59 2018
HPL_pdgesv() end time   Sat Dec 29 19:37:02 2018
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0052147 ...... PASSED
================================================================================
Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================
admin@baker:/apps/source/hpl-2.2.atlas/bin/ARM>
For that run, we clocked our Raspberry Pi at 292.1 MFLOPS (0.2921 GFLOPS). Is that a decent number? The theoretical peak performance (Rpeak) of a processor is:
(CPU Speed) * (number of cores) * (CPU instructions per clock cycle)
In theory, a Raspberry Pi 3 Model B+ should have a maximum HPL performance number of 11.2 GFLOPS, based on 4 cores at 1.4 GHz and 2 instructions per clock cycle. This is a bit misleading, though, because although the Pi can reportedly perform two integer calculations per cycle, floating point operations may take longer. In the best case, we can do one floating point operation every two cycles, and it may take 8 cycles in the worst case. If we assume the best case, our "instructions per cycle" multiplier goes from 2 to 0.5, and our theoretical performance drops to 2.8 GFLOPS. For the HPL benchmark, we can assume we'll be close to the best case numbers, since we're doing primarily floating point operations and the processor will be able to pipeline them.
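The Rpeak arithmetic is easy to check from the shell. This is a small sketch of the formula above; the function name rpeak_gflops is my own, and the inputs are the clock speed, core count, and per-cycle multipliers quoted in the text, not measured values.

```sh
# Hypothetical helper implementing the Rpeak formula:
#   (CPU speed in GHz) * (number of cores) * (operations per clock cycle)
# usage: rpeak_gflops <clock_ghz> <cores> <ops_per_cycle>
rpeak_gflops() {
    awk -v ghz="$1" -v cores="$2" -v opc="$3" \
        'BEGIN { printf "%.2f\n", ghz * cores * opc }'
}

rpeak_gflops 1.4 4 2     # two operations per cycle: prints 11.20
rpeak_gflops 1.4 4 0.5   # one floating point op every two cycles: prints 2.80
```

Changing the last argument is all it takes to see how sensitive the theoretical peak is to the per-cycle assumption.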
From our HPL run above, we hit 0.292 GFLOPS, or just over 10% of our theoretical peak. Even without tweaking the HPL.dat file, this seems unreasonably low. It's time to do a little digging.
Our first suspects for improvement are the external libraries we rely on, namely our MPI implementation and our BLAS libraries. Traditionally, MPICH does better than Open MPI for jobs constrained to a single node, so that's one possibility, but it wouldn't account for the missing 90% of our performance. We've used the BLAS libraries provided by openSUSE, and they're usually pretty good at optimizing their code. Nevertheless, let's try replacing the BLAS code as a start.
Another option for BLAS is the ATLAS library. It hasn't seen a lot of traffic over the past few years, but it's interesting because it automatically optimizes itself for your platform. As you might expect, this takes some time. I'm not going to cover the entire build here (it took over 24 hours on my Pi), but the build is fairly straightforward, and produces some interesting performance numbers as it optimizes the library. You can rebuild HPL using ATLAS by replacing the default values in Make.ARM with the following values:
ARCH = ARM
HOME = /apps/source
TOPdir = $(HOME)/hpl-2.2
MPdir = /apps/openmpi-3.1.2
MPinc = -I $(MPdir)/include
MPlib = $(MPdir)/lib64/libmpi.so
LAdir = /apps/source/ATLAS
LAinc = -I $(LAdir)/include
LAlib = $(LAdir)/ATLAS.arm/lib/libcblas.a $(LAdir)/ATLAS.arm/lib/libatlas.a
HPL_OPTS = -DHPL_CALL_CBLAS
Now you can run "make arch=ARM clean; make arch=ARM" to get a new version of HPL built with the ATLAS libraries. When you run the HPL benchmark again, you should get a result similar to this:
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4        7000   128     2     2             117.26              1.951e+00
HPL_pdgesv() start time Sat Dec 29 18:21:26 2018
HPL_pdgesv() end time   Sat Dec 29 18:23:23 2018
Now we're seeing almost 2 GFLOPS, so we're just under 70% of peak. That's a great improvement, but we're still a ways from 2.8. We may be limited by memory bandwidth, CPU governing, processor contention from other processes (our job scheduler, database, etc.), or a host of other issues. We can do some digging and tweaking to improve this number, but for now, let's use this as our base timing, and move on to benchmarking the entire cluster.
Benchmarking for the cluster as a whole isn't much different than benchmarking for a single node. Our problem size and available memory will go up, so we need to build a new HPL.dat file, and we'll be running across multiple nodes, so we'll need to build a job script.
What can we expect for performance out of an HPL run across four nodes? Ideally, we would get four times the performance of a single node. That won't happen, however. Since the nodes need to communicate over the network, our network is going to slow us down significantly. Nevertheless, our target is 11.2 GFLOPS: four nodes at the 2.8 GFLOPS theoretical peak we calculated for a single node.
For starters, let's build a simple job script. Once we have the job script, we can tweak the parameters in the HPL.dat file and rerun the benchmark until we're happy with the results. A sample job script is below. The only things we strictly need to do in the job script are tell the job manager to allocate 16 processor cores (all four nodes of our cluster) for our job, load the openmpi module, and call xhpl. I've added a few extra lines to the script to print out the nodes we're running on, and the start and end times of the run. As a general rule, these are good things to have in your output file, and you (and your users) should add them to every job script.
#!/bin/bash
#
#SBATCH --job-name=HPL
#SBATCH -o hplrun.%J.out
#SBATCH -n 16
echo "Running HPL on nodes:" $SLURM_JOB_NODELIST
echo -n "Run started at "
/bin/date
module add openmpi
mpirun ./xhpl
echo -n "Run finished at "
/bin/date
Now we need to build a new HPL.dat file. We'll need to enlarge the process grid to reflect 16 cores (P=4, Q=4), and increase the Ns parameter to reflect the memory of all four nodes. Adjusting the Ns parameter is a hit-or-miss exercise. We want to get it as large as possible without running out of real memory. Our performance will go up as Ns goes up, but will fall off dramatically as soon as we start swapping. If we push too far, it's also easy to crash our nodes.
Starting with four times the number we used for Ns in our single-node run will crash our cluster nodes. I've done some experimenting with different problem sizes, and the following HPL.dat file is a reasonable place to start:
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
18000        Ns
1            # of NBs
128          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
4            Ps
4            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
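The reason that quadrupling Ns fails is that the matrix is N by N, so its memory footprint grows with the square of Ns. A quick sketch of that arithmetic follows. The helper name matrix_gib is mine, it assumes 8-byte double precision elements, and it counts only the coefficient matrix, so the real footprint of a run (HPL workspace, MPI buffers, the operating system) will be somewhat larger.

```sh
# Hypothetical helper: memory needed for an N x N matrix of 8-byte doubles, in GiB.
# usage: matrix_gib <N>
matrix_gib() {
    awk -v n="$1" 'BEGIN { printf "%.2f\n", n * n * 8 / (1024 * 1024 * 1024) }'
}

matrix_gib 7000    # single-node run: prints 0.37
matrix_gib 28000   # four times the single-node Ns: prints 5.84
matrix_gib 18000   # the starting point above: prints 2.41
```

With 1GB per node, four nodes give us roughly 4GB in total, so 5.84 GiB cannot fit, while 2.41 GiB leaves some headroom on each node.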
With a little tweaking of the parameters, you can narrow down the optimal value a bit. Here are the runs that I did:
WR11C2R4       18000   128     4     4             637.87              6.096e+00
WR11C2R4       19000   128     4     4             733.06              6.239e+00
WR11C2R4       19500   128     4     4             794.60              6.222e+00
WR11C2R4       20000   128     4     4            1512.69              3.526e+00
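If you end up with a pile of runs, picking out the fastest one is easy to script. The sketch below assumes result lines in the format shown above (T/V, N, NB, P, Q, Time, Gflops, in that order) and that every result line starts with WR, as it does in these runs. The function name best_run is made up for illustration.

```sh
# Hypothetical helper: read HPL result lines on stdin
# (fields: T/V N NB P Q Time Gflops) and print N and Gflops of the fastest run.
best_run() {
    awk '$7 + 0 > best { best = $7 + 0; n = $2; g = $7 } END { print n, g }'
}

# Example: collect the result lines from the job output files, then pick the best.
#   grep -h '^WR' hplrun.*.out | best_run
```

Fed the four lines above, it picks the 19000 run, which matches what you'd pick by eye.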
With a little experimentation, the best performance we've achieved is 6.239 GFLOPS, or 55.7% of our target. In general, without a lot of work, you can get 60% of expected performance with gigabit Ethernet and 80% of expected performance with 40Gb InfiniBand. Since we're running on 100Mb Ethernet on a very cheap switch, I think this is a pretty good number.
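To keep the percentages straight across runs, here's a small sketch that computes efficiency as measured GFLOPS over theoretical peak. The name efficiency_pct is my own, and the peak figures are the best-case estimates from earlier (2.8 GFLOPS per node, 11.2 GFLOPS for four nodes), so the percentages are only as good as those assumptions.

```sh
# Hypothetical helper: measured performance as a percentage of theoretical peak.
# usage: efficiency_pct <measured_gflops> <peak_gflops>
efficiency_pct() {
    awk -v measured="$1" -v peak="$2" \
        'BEGIN { printf "%.1f\n", measured / peak * 100 }'
}

efficiency_pct 0.2921 2.8   # single node, openSUSE BLAS: prints 10.4
efficiency_pct 1.951 2.8    # single node, ATLAS: prints 69.7
efficiency_pct 6.239 11.2   # four nodes: prints 55.7
```

Seen side by side, the drop from about 70% on one node to about 56% on four is a rough measure of what the network is costing us.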
What now? We should go back and fix all the things we glossed over. We should optimize our installations, and turn off unneeded services. We should automate customizations on the compute nodes with Ansible, Chef, Puppet, etc. We can explore PXE booting, which is supposed to be available on a limited basis on the Pi 3 B+. We can do some benchmarking using an external head node with a little more power to see how we do with a mixed architecture system.
We can use this platform to do some testing for real world scenarios, which I'll do in the HPC for Science section of this web site.
More explorations of this platform are forthcoming.