Reproducibility

Reproducibility is a critical topic in science (and hence in HPC), and there are numerous examples of publications being withdrawn because peers couldn't reproduce the results. See RetractionWatch for examples. Part of the problem is simply managing user environments, but there's a larger problem as technology gets updated and hardware changes.

Ensuring that a job is run with the same code and libraries consistently is more difficult than it appears at first glance. As a general rule, software is always being updated. Sometimes the updates are to fix errors in the code; sometimes the updates are to add functionality or improve performance. In either case, whenever you update a piece of code that you used to generate some kind of scientific results, you risk generating different results.
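One modest way to notice when the code and libraries behind a result have changed is to record a snapshot of the environment next to each set of results. The following is a minimal sketch in Python, not a complete solution: the function names snapshot_environment and fingerprint are hypothetical, and it only sees Python packages and basic platform details, not compiled system libraries or drivers.

```python
import hashlib
import json
import platform
import sys
from importlib import metadata


def snapshot_environment():
    """Collect interpreter, platform, and installed-package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with missing metadata
            packages[name] = dist.version
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": dict(sorted(packages.items())),
    }


def fingerprint(snapshot):
    """Return a SHA-256 digest of the snapshot, independent of key order."""
    encoded = json.dumps(snapshot, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


if __name__ == "__main__":
    snap = snapshot_environment()
    print(fingerprint(snap))
```

Storing the digest (and ideally the full snapshot) with the job output makes it possible to tell later whether two runs used the same recorded versions. It does not restore the old environment; it only tells you that something changed.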

There are a number of ways that you can preserve the software environment so it can be restored to reproduce the same results. The easiest is simply to back up everything on your system, and restore it when you need it again. This quickly becomes cumbersome when you start thinking of clusters with hundreds or thousands of nodes. In addition, your cluster is likely to be upgraded at some point with new hardware, and your old full-system backup may not have drivers for the new hardware.

Another option is to build a "tool tree" that has all your software in it, and put that tree at the beginning of your path. This quickly becomes cumbersome as well. You may need to update one library out of 20 and leave the rest alone. If you reproduce the entire tool tree to provide a new "root," you're using a lot of space duplicating what you already have. You can use symbolic links to avoid that problem, but that comes with its own set of problems. What if the original tool tree had software with errors and was subsequently updated? You'll never want to run from that original tool tree again, since it has known errors, but you can't remove it because it still holds a lot of code that you need. Hard links are another option, but they bring their own problems and quickly become unwieldy.
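The symbolic-link variant of the tool tree can be sketched as follows. This is an illustration under stated assumptions, not a recommended layout: it assumes each package lives in its own top-level directory of the tree, and the function name derive_tool_tree and the overrides mapping are hypothetical.

```python
import os
from pathlib import Path


def derive_tool_tree(old_root, new_root, overrides):
    """Create new_root whose entries link into old_root, except for overrides.

    overrides maps a package directory name to the path of its replacement.
    """
    old_root, new_root = Path(old_root), Path(new_root)
    new_root.mkdir(parents=True, exist_ok=False)

    # Start from every entry in the old tree, then swap in the replacements.
    targets = {entry.name: entry for entry in old_root.iterdir()}
    targets.update({name: Path(path) for name, path in overrides.items()})

    for name, target in sorted(targets.items()):
        os.symlink(
            target.resolve(),
            new_root / name,
            target_is_directory=target.is_dir(),
        )
    return new_root
```

Only the updated package takes new space, but every untouched entry in the new tree still points into the old one. If the old tree is later modified or removed, the new tree changes or breaks with it, which is exactly the kind of problem described above.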

Another option is just to make a virtual machine that has everything you need, and run that (perhaps across thousands of nodes) when you need to reproduce your results. This solution also wastes a lot of space with the image, and you lose some performance by running in a virtual environment. You will still run into problems with the underlying system, in terms of hardware and drivers as well as your hypervisor. The format of a VM may change over time, and even though there will be a period of backward compatibility, eventually the format may be unsupported. You can convert your VM at every upgrade, but then you need to monitor the hypervisor upgrade path and remember to update your VM whenever necessary.

Recently, containers have appeared and seem to be the answer to all our problems. Containers offer a lot of the benefits of a virtual machine image, without the overhead of retaining files we don't need (games, calculators, music players, etc.), and without the overhead of running in a virtual environment. Honestly, they are not a bad option. However, they are also not the answer to all our problems. Containers rely on the underlying kernel and associated drivers of the host system, which will change over time. Most science codes don't depend on the kernel, but over time, upgrades to processors and system libraries may still change numerical results. Containers are still a fairly new concept, and the technology is still evolving.
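One reason numerical results can shift even when the source code is unchanged is that floating-point addition is not associative, so a library or processor that performs the same operations in a different order can produce a slightly different answer. The short Python sketch below illustrates the effect and one possible way to compare results within a tolerance; the helper name results_match and the tolerance value are assumptions chosen for illustration, and the right tolerance depends on the application.

```python
import math


def results_match(reference, candidate, rel_tol=1e-12):
    """Compare two results within a relative tolerance instead of bit for bit."""
    return math.isclose(reference, candidate, rel_tol=rel_tol)


# The same three numbers, added in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)               # False
print(results_match(left_to_right, right_to_left))  # True
```

Deciding up front whether "the same result" means bit-for-bit identical or equal within a stated tolerance makes it clearer what a preserved environment actually needs to guarantee.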