Managing User Environments

Managing the environment for your users is closely related to reproducibility. It is a critical part of managing your cluster, because a controlled environment is what lets your users produce consistent results.

Ensuring that a job is run with the same code and libraries consistently is more difficult than it appears at first glance. As a general rule, software is always being updated. Sometimes the updates are to fix errors in the code; sometimes the updates are to add functionality or improve performance. In either case, whenever you update a piece of code that your users may have used to generate some kind of scientific results, you risk generating different results.

There are a number of approaches to ensuring that your code is always run with the same environment so you can always produce the same results. The easiest is simply to back up everything on your system, and restore it when you need it again. This quickly becomes cumbersome when you start thinking of clusters with hundreds or thousands of nodes. In addition, your cluster is likely to be upgraded at some point with new hardware, and your old full-system backup may need to be modified to support newer hardware. In the real world, this is not a practical solution.

Another option is to build a "tool tree" that has all your software in it, and put that tree at the beginning of your path. This quickly becomes cumbersome as well. You may need to update one library out of 20 and leave the rest alone. If you reproduce the entire tool tree to provide a new "root," you're using a lot of space duplicating what you already have. You can use symbolic links or hard links to avoid that problem, but that comes with its own set of problems. What if you need everything in the tool tree, but need some extra libraries? You can add another component to the beginning of your path. What if you need a newer version of a library or utility that hasn't been updated in the standard tool tree? You can add another component to your path at the beginning that will override the default. What if you also need some libraries that a colleague built? Now you have four additional components added to your path, and they need to be in the correct order. Additionally, just setting your $PATH isn't enough. You need to update your $INCLUDE path, your $LD_LIBRARY_PATH, and often set other environment variables before everything works correctly. It's easy to see how this quickly becomes unwieldy.
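The ordering problem described above can be illustrated with a small model of how a search path is resolved. This is only a sketch: the directory names and their contents are hypothetical, and the lookup is simulated with a dictionary rather than a real filesystem. The point is that the first directory providing a name wins, so the same set of components gives different results in a different order.

```python
def resolve(name, search_dirs, contents):
    """Return the first directory in search_dirs that provides name, or None."""
    for directory in search_dirs:
        if name in contents.get(directory, set()):
            return directory
    return None


# Hypothetical directories and the programs each one provides.
CONTENTS = {
    "/tools/bin": {"mpicc", "gcc", "python3"},      # the standard tool tree
    "/tools-extra/bin": {"h5dump"},                 # extra libraries and tools
    "/tools-newer/bin": {"mpicc"},                  # newer build of one tool
    "/home/colleague/bin": {"h5dump", "analyze"},   # a colleague's builds
}

# Same four components, two different orders.
CORRECT_ORDER = ["/tools-newer/bin", "/tools-extra/bin",
                 "/home/colleague/bin", "/tools/bin"]
WRONG_ORDER = ["/tools/bin", "/tools-extra/bin",
               "/home/colleague/bin", "/tools-newer/bin"]

if __name__ == "__main__":
    # The newer mpicc is found only when its directory comes first.
    print(resolve("mpicc", CORRECT_ORDER, CONTENTS))
    print(resolve("mpicc", WRONG_ORDER, CONTENTS))
```

The same first-match rule applies to library and include search paths, which is why every one of those variables has to be kept in a consistent order.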

There are other options for encapsulating your environment using virtual machines or even containers, but these solutions require a lot of work to build the original image, and then you're stuck with a single-use solution. What if you want to test your code with a variety of MPI libraries to see which gives the best performance? You would need to duplicate your single-use solution, and then replace the libraries in each of them. This isn't really a practical solution in terms of time, space, or expertise.

Most centers have settled on the Environment Modules package (commonly referred to simply as Modules) to handle these issues. Modules manages a user's environment at the component level, adding and removing individual entries and setting environment variables as needed. Modules is capable of setting arbitrary environment variables, and of checking for conflicts and dependencies. As a simple (and common) example, if you look at the modulefile we built for OpenMPI on the Raspberry Pi cluster, you'll see that we set $PATH, $INCLUDE, $LD_LIBRARY_PATH, and $MANPATH, as well as the environment variables $CC, $CXX, $MPI_ROOT, and $MPI_HOME. Obviously, setting all of these by hand whenever you want to build or run with a different library is unreasonable. Modules takes care of this for you, while keeping everything consistent.
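To make the idea of component-level management concrete, here is a minimal sketch of what loading and unloading a module amounts to. It is not how the Modules package is implemented (real modulefiles are written in its own modulefile language), and the names ModuleSpec, load, and unload, the /opt/openmpi and /opt/mpich prefixes, and the variable values are all assumptions made for illustration. Only the list of variables comes from the OpenMPI example above.

```python
from dataclasses import dataclass, field

SEP = ":"  # separator used in PATH-style variables on Unix-like systems


@dataclass
class ModuleSpec:
    """Hypothetical description of one module's effect on the environment."""
    name: str
    prepend_path: dict = field(default_factory=dict)  # variable -> directory
    setenv: dict = field(default_factory=dict)        # variable -> value
    conflicts: tuple = ()                             # names of clashing modules


def load(env, loaded, spec):
    """Apply spec to env, refusing if a conflicting module is already loaded."""
    for other in spec.conflicts:
        if other in loaded:
            raise RuntimeError(f"{spec.name} conflicts with loaded module {other}")
    for var, directory in spec.prepend_path.items():
        parts = [p for p in env.get(var, "").split(SEP) if p]
        env[var] = SEP.join([directory] + parts)
    env.update(spec.setenv)
    loaded.append(spec.name)


def unload(env, loaded, spec):
    """Undo load(). Simplification: earlier values of setenv variables are lost."""
    for var, directory in spec.prepend_path.items():
        parts = [p for p in env.get(var, "").split(SEP) if p and p != directory]
        if parts:
            env[var] = SEP.join(parts)
        else:
            env.pop(var, None)
    for var in spec.setenv:
        env.pop(var, None)
    loaded.remove(spec.name)


# Hypothetical install prefixes; the values are illustrative only.
OPENMPI = ModuleSpec(
    name="openmpi",
    prepend_path={
        "PATH": "/opt/openmpi/bin",
        "INCLUDE": "/opt/openmpi/include",
        "LD_LIBRARY_PATH": "/opt/openmpi/lib",
        "MANPATH": "/opt/openmpi/share/man",
    },
    setenv={"CC": "mpicc", "CXX": "mpicxx",
            "MPI_ROOT": "/opt/openmpi", "MPI_HOME": "/opt/openmpi"},
    conflicts=("mpich",),
)
MPICH = ModuleSpec(name="mpich", prepend_path={"PATH": "/opt/mpich/bin"},
                   conflicts=("openmpi",))
```

Starting from an environment such as {"PATH": "/usr/bin"}, calling load with OPENMPI prepends each directory and sets the compiler variables in one step, and unload removes exactly those entries again. Attempting to load MPICH while OPENMPI is loaded raises an error, which mirrors the kind of conflict checking described above.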