Cleaning up our implementation

There are a number of items that we skipped over or fudged to get things working, and now is the time to go back and fix them, so you have a fully functional cluster that closely mimics a real-world implementation. With a reasonably complete cluster implementation, you can use it for testing different job managers, scheduling algorithms, parallel file systems, and so on.

First and foremost, we need a real authentication system. Copying password files between nodes is a quick and dirty way to get up and running, but it's impractical in the real world. There are a few options for authentication, but you should probably look at OpenLDAP as your solution. Setting up an LDAP server is a topic of its own, and a bit outside the scope of this tutorial, so I won't go into detail here. There are guides available on the Internet that can help with setting this up.

Along similar lines, we should probably set up an authoritative DNS server for our cluster nodes. Since the compute nodes aren't (typically) accessible from outside the cluster, it doesn't really make sense to use our site DNS server as an authoritative source for the compute node names and IP addresses. While we're at it, we should probably set up our own DHCP and NTP servers as well. Again, setting up these services is a little outside the scope of building a cluster, but none of them is very complicated, and there are plenty of references available online.

Tune your network! This probably should be part of the main tutorial, since it's critical for the efficient operation of your cluster. I left it out of the main build instructions because our Pi cluster has very low network performance to begin with, and we're unlikely to get any serious gains from tuning. We can still poke around a bit and see what works for our 100Mb/s network.

The problem we're trying to address is that Linux is tuned for "typical" network performance, but a cluster for running parallel jobs is not typical. Even though Linux will adjust some parameters dynamically, we want to start with a very high performance profile so we don't have to wait for the kernel to adjust buffer sizes. We're also memory constrained on the Pi platform (only 1GB of memory per node), so we don't want our network buffers too large; that would take valuable memory away from our calculations.

For starters, let's just take a guess at what may increase our performance, based on tuning parameters from some other clusters I've worked on. Copy the following into a file, and then run it on each node and the head node. (Before we commit to this, we want to make sure we're not making things worse.)

sysctl -w net.ipv4.tcp_timestamps=0
sysctl -w net.ipv4.tcp_sack=1
sysctl -w net.core.netdev_max_backlog=2500
sysctl -w net.core.rmem_max=2097152
sysctl -w net.core.wmem_max=2097152
sysctl -w net.core.rmem_default=2097152
sysctl -w net.core.wmem_default=2097152
sysctl -w net.core.optmem_max=204800
sysctl -w net.ipv4.tcp_rmem="2097152 2097152 2097152"
sysctl -w net.ipv4.tcp_wmem="2097152 2097152 2097152"
sysctl -w net.ipv4.tcp_low_latency=1
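
Before running the tuning script, it may help to record the current values so they can be put back if the results turn out worse. The following is a minimal sketch, not part of the original build: the function name snapshot_sysctl, the PROC_SYS override, and the baseline file name are my own assumptions. It reads the same keys straight from /proc/sys and prints them in the "key = value" form that sysctl -p can load.

```shell
#!/bin/sh
# Keys changed by the tuning script (copied from the list above).
TUNING_KEYS="net.ipv4.tcp_timestamps net.ipv4.tcp_sack
net.core.netdev_max_backlog net.core.rmem_max net.core.wmem_max
net.core.rmem_default net.core.wmem_default net.core.optmem_max
net.ipv4.tcp_rmem net.ipv4.tcp_wmem net.ipv4.tcp_low_latency"

# Print "key = value" for every key that exists on this kernel.
# PROC_SYS is a hypothetical override so the function can be tried
# against a copy of the tree; it defaults to the real /proc/sys.
snapshot_sysctl() {
    root="${PROC_SYS:-/proc/sys}"
    for key in $TUNING_KEYS; do
        path="$root/$(printf '%s' "$key" | tr . /)"
        # Skip keys this kernel does not provide.
        [ -r "$path" ] || continue
        # Multi-value keys such as tcp_rmem are tab separated in /proc;
        # turn the tabs into single spaces.
        printf '%s = %s\n' "$key" "$(tr -s '\t' ' ' < "$path")"
    done
}

# Example (hypothetical file name):
#   snapshot_sysctl > sysctl-baseline.conf    # before tuning
#   sysctl -p sysctl-baseline.conf            # to put the old values back
```

Keys that a given kernel no longer exposes are skipped rather than treated as errors, so the same sketch should work across the head node and the compute nodes even if their kernels differ.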

Start by benchmarking your cluster without any tuning to get a performance baseline. Then run this script on all your nodes, and run your benchmark again to see if we made things better or worse. The following are the results from three runs without any network tuning:

T/V                N    NB     P     Q               Time                 Gflops
WR11C2R4       19000   128     4     4             725.21              6.306e+00
WR11C2R4       19000   128     4     4             729.92              6.265e+00
WR11C2R4       19000   128     4     4             728.53              6.277e+00

For three runs, this gives us an average performance rating of 6.28 GFLOPS. After our network tuning, we get these numbers:

T/V                N    NB     P     Q               Time                 Gflops
WR11C2R4       19000   128     4     4             733.52              6.235e+00
WR11C2R4       19000   128     4     4             728.69              6.276e+00
WR11C2R4       19000   128     4     4             731.33              6.253e+00

That gives us an average of 6.25 GFLOPS. It seems we've made things a bit worse, although the difference is only around 0.5% and may be small enough to be just random noise. (There's a lesson here about just picking numbers that you think might be good.)
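
Averaging by hand gets tedious once there are more than a few runs. As a small sketch (the function name hpl_summary and the output file names are hypothetical), the result lines can be piped through awk, which reads the last column as the Gflops figure and reports the mean along with the lowest and highest run:

```shell
#!/bin/sh
# Summarise HPL result lines read from stdin.
# Only lines starting with "WR" (the result rows) are counted;
# the last field on each of those lines is taken as Gflops.
hpl_summary() {
    awk '/^WR/ {
             v = $NF + 0                      # force numeric, handles 6.306e+00
             sum += v
             if (n == 0 || v < min) min = v
             if (n == 0 || v > max) max = v
             n++
         }
         END {
             if (n > 0)
                 printf "mean=%.2f min=%.3f max=%.3f\n", sum / n, min, max
         }'
}

# Example (hypothetical file names):
#   hpl_summary < baseline.out
#   hpl_summary < tuned.out
```

Looking at the spread as well as the mean helps with the noise question: the untuned runs alone range from 6.265 to 6.306 GFLOPS, which is a wider gap than the 0.03 GFLOPS between the two averages.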

What else may be causing us problems? To begin with, the tuning numbers above are based on a "real" cluster (scaled down a bit for the Pi), with gigabit Ethernet and nodes with lots of memory. To maximize your HPL performance, you want to make your arrays as large as possible without running out of real memory and dipping into swap space (which we've done; you can see the HPL parameters I used in the Benchmarking section). For any HPC code, performance degrades significantly as soon as the machine starts to swap, so we may simply have used up too much memory when we increased the buffers, and dipped into swap.
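
One way to test the swap theory is to compare swap usage on a compute node before and after a run. This is only a sketch: the function name swap_used_kb is made up for illustration, and it assumes the usual SwapTotal and SwapFree lines in /proc/meminfo.

```shell
#!/bin/sh
# Print the amount of swap in use, in kB.
# Reads a meminfo-style file; defaults to /proc/meminfo.
swap_used_kb() {
    awk '/^SwapTotal:/ { total = $2 }
         /^SwapFree:/  { free  = $2 }
         END           { print total - free }' "${1:-/proc/meminfo}"
}

# Example: note the value, run HPL, then check again.
#   swap_used_kb
```

If the number grows during a benchmark run, the node is swapping, and either the HPL problem size or the network buffers probably need to come down. Watching the si and so columns of "vmstat 5" while the job runs gives a similar picture in real time.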

In the case of the Pi, we're only running 100Mb/s Ethernet, so the TCP stack may not need, or even be able to use, the increased memory we gave it. While I was looking at our network configuration, I also noticed that we're not running jumbo frames. By default, Ethernet uses an MTU (Maximum Transmission Unit) of 1500 bytes. High-performance sites that move a lot of data around typically set their MTU to 9000 bytes instead. Rather than take you through the exercise, let me cut to the punchline: even though the Pi will accept an MTU of 9000, it doesn't behave well with jumbo frames.
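
If you want to repeat the jumbo frame experiment yourself, a rough sketch follows. The interface name eth0, the node name in the example, and both function names are assumptions, and the -M do option requires the iputils version of ping. The idea is to send pings that exactly fill one frame with fragmentation prohibited, so an MTU the path can't really carry shows up as errors or lost packets.

```shell
#!/bin/sh
# Largest ICMP payload that fits in a single frame for a given MTU:
# the MTU minus the 20-byte IPv4 header and the 8-byte ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Send three full-size pings with fragmentation prohibited.
# Usage: check_mtu_path <host> <mtu>
check_mtu_path() {
    ping -c 3 -M do -s "$(icmp_payload_for_mtu "$2")" "$1"
}

# Example (run as root; eth0 and node01 are placeholders):
#   ip link set dev eth0 mtu 9000    # on both ends, and the switch must allow it
#   check_mtu_path node01 9000
#   ip link set dev eth0 mtu 1500    # back to the default
```

Every device on the path, including the switch, has to agree on the larger MTU, so it's worth changing one pair of nodes first rather than the whole cluster.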

There's another lesson to be learned from our testing. Each run of HPL takes about 15 minutes on our four-node Pi cluster, so the six runs in the example above took 90 minutes just to show that our changes didn't improve performance. That's a lot of wasted time. We need a better way to isolate individual parameters and test them quickly to see if we made things better or worse. Fortunately, several years ago, a few people got together and wrote just such a tool: HPCBench, available on SourceForge. It is aimed at testing network performance, which is key to MPI applications, and can be used to isolate single network parameters and test them individually. There are other tools available, of course, but HPCBench is a good starting point.

Disable unneeded services

This isn't as much of a problem for us because we started with a JeOS image, but it's good practice to occasionally log in to a compute node and run "lsof -i | grep LISTEN" to see what network services your compute nodes are running. You should also occasionally just run "ps -ef" and look through the process list. Anything that isn't critical to running jobs should be disabled. Some of the more common services that seem to creep in are the printer service (CUPS), some kind of FTP service, or the mail server (which is only required if the node is accepting email, not sending it).
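
The lsof check is easy to turn into something you can run across all the nodes. The sketch below is one possible approach, not a standard tool: the function name unexpected_listeners and the allow-list in the example are hypothetical, so adjust them for whatever your cluster actually needs. It reads lsof output and prints any listening command that isn't on the list.

```shell
#!/bin/sh
# Read "lsof -i" output on stdin and print the name of every command
# that is listening on a socket and is NOT named in the arguments.
# Note that lsof shortens long command names in its first column.
unexpected_listeners() {
    allowed=" $* "
    awk '/\(LISTEN\)/ { print $1 }' | sort -u | while read -r cmd; do
        case "$allowed" in
            *" $cmd "*) ;;                   # on the allow-list, ignore
            *) printf '%s\n' "$cmd" ;;       # not expected, report it
        esac
    done
}

# Example (the allow-list here is only an illustration):
#   lsof -i -P -n | unexpected_listeners sshd
```

An empty result means every listening service is one you expected. Anything it prints is a candidate for disabling.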

Software updates

Part of maintaining any cluster is keeping up with patches. Your software vendors will be very appreciative if you set up a local software repository and distribute patches to your compute nodes from it, instead of having every compute node hammer the standard repositories. This also lets you choose which patches reach your cluster: only add those that have been tested and approved to your local repository.

Setting this up varies depending on your distribution, and takes a fair amount of tweaking, especially if you're running a licensed version of some distribution. Search the Internet for some hints, and then set up dev, stage/test, and prod environments for testing the patches before they go into production.