Building the Compute Nodes

Although the Raspberry Pi is an admirable little piece of hardware, it's lacking in a few areas. One is that it has no ability to boot from the network. Usually we configure our compute nodes to boot with PXE (Preboot eXecution Environment) to configure them automatically. With the Pi, we have to configure each one individually. Even with only four compute nodes, this is a cumbersome process. Once we have them on the network we can finish the configuration remotely, but to start, we need to write an SD card for each node, and then plug in a keyboard and monitor to configure the network. The process is the same as writing the original head node image, so I'll skip the details here. Just work your way through bringing each compute node up and getting the IP address set. We'll do everything else remotely.

Once you have the compute nodes on the network, we have one more task before we can use pdsh to configure things remotely. We want all the compute nodes to trust the head node so we can log in without a password. We accomplish this by setting up ssh key access between all our nodes. This is also required for the compute nodes to communicate using MPI. Once we get this working, we'll take another look at the security implications of this and tighten things down a bit.

There are a few things we have to set up to get this working. First, we need to configure the ssh program and the sshd service to accept Host Based Authentication (HBA). This just means that if one host has authenticated a user, the other hosts will trust that authentication and not make the user enter a password again. Once that is done, we need to collect all the host keys into /etc/ssh/ssh_known_hosts and list all the trusted machines in /etc/ssh/shosts.equiv. Let's start by setting up ssh and sshd so they both accept HBA.

Edit /etc/ssh/ssh_config and search for the line that says HostbasedAuthentication. It should be commented out with the default value of no listed. I usually make a copy of the line, uncomment it, and then change it to the value I want. Here's the section of ssh_config where I made my changes:

#   PasswordAuthentication yes
#   HostbasedAuthentication no
HostbasedAuthentication yes
#   GSSAPIAuthentication no
#   GSSAPIDelegateCredentials no

Do the same with /etc/ssh/sshd_config. After changing the sshd_config file, we need to restart the service with systemctl restart sshd for the changes to take effect. Here's the section of sshd_config:

# For this to work you will also need host keys in /etc/ssh/ssh_known_hosts
#HostbasedAuthentication no
HostbasedAuthentication yes
# Change to yes if you don't trust ~/.ssh/known_hosts for
# HostbasedAuthentication

The configuration line that's commented out has the default value in it. At least it's supposed to have the default value. It's good practice to explicitly set the value you want, even if it's the default. Default configurations change. If you explicitly set the value you want, you won't be surprised.
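
If you want to double-check which value a config file will actually hand to ssh or sshd, a small helper can pull out the first uncommented HostbasedAuthentication line. This is only a sketch: the function name hostbased_setting is made up for this example, and it assumes the simple "Keyword value" layout shown above (it does not follow Include or Match directives, and it ignores the "Keyword=value" form).

```shell
# Print the first uncommented HostbasedAuthentication value found in a
# config file, or "unset" if there is none.
# hostbased_setting is a hypothetical helper name, not an OpenSSH tool.
hostbased_setting() {
    awk '
        # Commented lines start with "#", so their first field never matches.
        tolower($1) == "hostbasedauthentication" {
            print tolower($2)
            found = 1
            exit
        }
        END { if (!found) print "unset" }
    ' "$1"
}
```

After the edits above, hostbased_setting /etc/ssh/sshd_config should print yes. On reasonably recent OpenSSH versions, sshd -T (run as root) and ssh -G compute01 also print the effective settings, which is a more thorough check than reading the file.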

Now we just need to modify shosts.equiv and ssh_known_hosts. Let's start with shosts.equiv, since it just lists the names of hosts we trust. Because we're still using /etc/hosts for name resolution, let's add the IP addresses to the file, as well. The file should look like this:

baker
compute01
compute02
compute03
compute04
192.168.0.200
192.168.0.201
192.168.0.202
192.168.0.203
192.168.0.204

Normally, we would list the FQDN (Fully Qualified Domain Name) as well in this file, but since we haven't set up a domain, we'll just list the short hostnames. Save this file as /etc/ssh/shosts.equiv and let's move on to ssh_known_hosts.
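
Typing the same names into several files invites typos, so here is a minimal sketch that derives the list from a hosts file instead. The function name make_shosts_equiv and the idea of filtering on an address prefix are assumptions for illustration; it treats the first name after each address as the host name.

```shell
# Print an shosts.equiv-style list: host names first, then their addresses.
# $1 = hosts file in /etc/hosts format, $2 = address prefix such as "192.168.0."
# make_shosts_equiv is a hypothetical helper name.
make_shosts_equiv() {
    awk -v prefix="$2" '
        /^[[:space:]]*#/ || NF < 2 { next }   # skip comments and blank lines
        index($1, prefix) == 1 {              # address starts with the prefix
            names[++n] = $2                   # first name after the address
            addrs[n]   = $1
        }
        END {
            for (i = 1; i <= n; i++) print names[i]
            for (i = 1; i <= n; i++) print addrs[i]
        }
    ' "$1"
}
```

Assuming /etc/hosts lists exactly the cluster nodes on that network, make_shosts_equiv /etc/hosts 192.168.0. prints the names followed by the addresses, in the same layout as the file above.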

The ssh_known_hosts file lists each host name, along with its host key. There's a bit of a quirk in the way the SSH service works. The host name that gets sent to the remote system is not necessarily the same every time we create a connection. When we list the host name, we have to list every possible variant. That means the host name, the FQDN, and even the IP address. There are also multiple ways of collecting the host keys. For our purposes, we'll use a utility called ssh-keyscan. This will contact the host and ask it what its host keys are. You can run it as a test against one of the compute nodes like this:

admin@baker:~> ssh-keyscan compute01
# compute01:22 SSH-2.0-OpenSSH_7.6
compute01 ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBFebkSf1smb0suMHsLOl8c8Flocc2s1AY8mUlEJyGjuhlPjWSydG+SXWWJR3lXo4Cb4g6Cprk8s3WMJ4DHJeB2Q=
# compute01:22 SSH-2.0-OpenSSH_7.6
compute01 ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCymUYT07H2TaJrxSJj/J3BcffM3yqbN5APGRAa+f6UL6JEGKIMeAlFysbJtanUfY4AxmUCGXT9gTMbCSc9ITBUB0fRDtOSp6IhWJDqiVzNYxogTo86J4fd6UOeX3jKUsrs4DeHALe/lfB7w65jKWP9nfRJCb6i40ZTOhXwgH50+Ye4VvRAHxmI4XjjZrI8dRZMmNJFyeo41mkhSJcGvmFVPEFUJRv84vzi26Xy6Nip7xfnGGBVfpuERdJMhFIM4LORyoOEVsZDS1/OKW8Pjomr6w5BQ8llxKjVDIBH2qCfd2NvINc+e1zQiQoOkpBa1nhIOgighxjvT2GXferSg62J
# compute01:22 SSH-2.0-OpenSSH_7.6
compute01 ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIK7iZwjR02vfLEiZHa8izypkDYsL+c9hjm2sCFvpsn2V
admin@baker:~>

We can use this to collect host keys from all the nodes, massage the host names, and use that as our ssh_known_hosts file. Here's how I did the initial host key collection:

admin@baker:~> for node in baker compute{01..04}; do ssh-keyscan $node; done | tee ssh_known_hosts
# baker:22 SSH-2.0-OpenSSH_7.6
# baker:22 SSH-2.0-OpenSSH_7.6
# baker:22 SSH-2.0-OpenSSH_7.6
baker ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCvqFaUdZVuxwUhtqp22ZtUP3Pl+BVAYzE8TPLFRySbMZUM43RxUI8u2X8u+GHJGSQiVC9GTvfd1ogGjZzx1TVoxE1MhGO0kBFPDzhAzNbOhA07cZwGzFClWS62sOSCNuA8ArAiCpSKuooiWeMMWJZUJX4m5LQzcQSqSpgtpcQNl2Wm6FHyjE1vZW3upYbP2xjUTC9s25a0/dHD7e5mZWoNsa8RsGd1Qno+OIXsoyGLpgwmD5cTAdtkCWMrMHAzqHAIpUJwdQMLfx+IZIEL8ZHYA/i6uGvp5vip0EZM0FBoRupqleoxn1AI5BV6y/Pyk+vHGa2i2HaE2rZNb0xT/Rm5
baker ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBPZooIyFN1XHc2a8R8SWFc0b/dhSIzRVKFCWzXR1G3milb94AN3jzWl37YgrWlKxsoZtETb2cY6Djx4VWHvbYTg=
baker ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIMa71SHQT4bfrHbQ/DoFfCCjT9UOVubHkluoWgApRSWh
# compute01:22 SSH-2.0-OpenSSH_7.6
# compute01:22 SSH-2.0-OpenSSH_7.6
# compute01:22 SSH-2.0-OpenSSH_7.6
compute01 ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBFebkSf1smb0suMHsLOl8c8Flocc2s1AY8mUlEJyGjuhlPjWSydG+SXWWJR3lXo4Cb4g6Cprk8s3WMJ4DHJeB2Q=
compute01 ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCymUYT07H2TaJrxSJj/J3BcffM3yqbN5APGRAa+f6UL6JEGKIMeAlFysbJtanUfY4AxmUCGXT9gTMbCSc9ITBUB0fRDtOSp6IhWJDqiVzNYxogTo86J4fd6UOeX3jKUsrs4DeHALe/lfB7w65jKWP9nfRJCb6i40ZTOhXwgH50+Ye4VvRAHxmI4XjjZrI8dRZMmNJFyeo41mkhSJcGvmFVPEFUJRv84vzi26Xy6Nip7xfnGGBVfpuERdJMhFIM4LORyoOEVsZDS1/OKW8Pjomr6w5BQ8llxKjVDIBH2qCfd2NvINc+e1zQiQoOkpBa1nhIOgighxjvT2GXferSg62J
compute01 ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIK7iZwjR02vfLEiZHa8izypkDYsL+c9hjm2sCFvpsn2V
# compute02:22 SSH-2.0-OpenSSH_7.6
# compute02:22 SSH-2.0-OpenSSH_7.6
# compute02:22 SSH-2.0-OpenSSH_7.6
compute02 ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDLp9jHgpFb0mUY1IdakAvS3Nn4hInSTpgwhGz5eDAg4ObTv8o7HxTrAdoVrIXQruHRifoCkTE4yu5GDB2HGI8kfrt4JwxQsSz6/nSgp9yZfQ0D9MjkGsx2nclpJ320E8hSjc66bLpVeEiDaZIdbX790gldQ4pUbcUMd6BdZDRHUALKf4j1494OzUiNx9nRxl7YC0NAAPdfB7pLiEWIXwLb63NAva1pLbCfCqeE59ho9zO7BoT60iOT2nBWOxi5HJt6itieFwya6pVQNPoY0mXoWdG9ih42Br9W1UCdYOPe4KeKY3CAbgYQtkA1RxgaX0i6N6uziLdJM9Ul+SgP7EOB
compute02 ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBP3ju3RafTIEVKpffboA3oKHnYtBHTJqYaYi/duax8ciIDg4OE5RPtTfWdfu5ZgDc4J3TqV9v1bgIkEmjWJHAcU=
compute02 ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPdrot7DSxcqFsgo92lIY3dTZDKphaqesie2lir2RrNj
# compute03:22 SSH-2.0-OpenSSH_7.6
# compute03:22 SSH-2.0-OpenSSH_7.6
# compute03:22 SSH-2.0-OpenSSH_7.6
compute03 ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBCxxqyqzRiBaBnr8BJwoWe99I5UJEDMUaXMg9yjiszDBBEFCOFAyGa5E9egoDqcualNsZT9uL6cCBYg+QKskdt0=
compute03 ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDTY3VZwj+yP059IrDsX3JZIi/3ThOMNGha7/GS7fgzVhBtpqx6W1CON3HUICqJH2Uzr9iJTVJt6z7ugiQs5sQsYlFCGEmrc1ru/atxIVQqEgJi//SGVZJBX99VsQRvnQYWtxCh2+riaU9cQvT2OCt4m9bOdh9QYCI3i42Oud0tQ/gCIjo02m7RSx0Wgvv98Noi+slatUTcwO10PD67Nf+g7fPP3y5arFvWzyJ/nhm0VdbjNovWr7Kwcq/8hvuO3sbfMiVQT3tNOfEWesVEnkYvUcbHrD+2/gj2PXhZdeOEnB8qnOjWTRnqG6MW5EKOdrBwq08d7JN+iIaOUfUAncFp
compute03 ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIBfD4Nf5ymxEu6emoAQlSuO3RNTDm5sM4fEIubxDwh1F
# compute04:22 SSH-2.0-OpenSSH_7.6
# compute04:22 SSH-2.0-OpenSSH_7.6
# compute04:22 SSH-2.0-OpenSSH_7.6
compute04 ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBPuKNyQ04U57cd1nWDrVmBuiYCFREvw/qYWdxNOgEK7LeH/KKU/pRvwBrfN2aUCtMmHtiSW8fIef+I227mCXOTI=
compute04 ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDD9p8Accgdug8+QeWuqMJNUsD8/SQOqgvIs9zyc0FMitzKP7bJKZnsGjK6T/3LjllFbPHrqEKqc0NI3uujKP1EXEHzJn3FYBfVOfC5RjmIshxC3KAIGtKyisKtaq0XwaiGJmoD9w1GVcI1icL9+alVR3158+to2mbhBwSbbQmmQdupPsd2QJvFCWDLv/8Y81HEDGNvLA1NWIQN+mvse1PC9ZTY3d8ccEaRVQ4Wx2NWxrBx6T2SFoHwc/VuVWM49qqA0hzQOTck/MbFhoBSy5NpoXmkMgLhtZPnoXKpYAee3+/0B84oWqdghTQ798zcy71EMJ5T6G0WNSmL7+5tgspp
compute04 ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIIdmA05TDS1zl38kZah+fxeVXBCXPYfxlE52nYTNR2Gq
admin@baker:~>

Now you can edit the ssh_known_hosts file, add the IP addresses for each node, and save it as /etc/ssh/ssh_known_hosts. Here's how my final ssh_known_hosts file looks:

baker,192.168.0.200 ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCvqFaUdZVuxwUhtqp22ZtUP3Pl+BVAYzE8TPLFRySbMZUM43RxUI8u2X8u+GHJGSQiVC9GTvfd1ogGjZzx1TVoxE1MhGO0kBFPDzhAzNbOhA07cZwGzFClWS62sOSCNuA8ArAiCpSKuooiWeMMWJZUJX4m5LQzcQSqSpgtpcQNl2Wm6FHyjE1vZW3upYbP2xjUTC9s25a0/dHD7e5mZWoNsa8RsGd1Qno+OIXsoyGLpgwmD5cTAdtkCWMrMHAzqHAIpUJwdQMLfx+IZIEL8ZHYA/i6uGvp5vip0EZM0FBoRupqleoxn1AI5BV6y/Pyk+vHGa2i2HaE2rZNb0xT/Rm5
baker,192.168.0.200 ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBPZooIyFN1XHc2a8R8SWFc0b/dhSIzRVKFCWzXR1G3milb94AN3jzWl37YgrWlKxsoZtETb2cY6Djx4VWHvbYTg=
baker,192.168.0.200 ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIMa71SHQT4bfrHbQ/DoFfCCjT9UOVubHkluoWgApRSWh
compute01,192.168.0.201 ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBFebkSf1smb0suMHsLOl8c8Flocc2s1AY8mUlEJyGjuhlPjWSydG+SXWWJR3lXo4Cb4g6Cprk8s3WMJ4DHJeB2Q=
compute01,192.168.0.201 ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCymUYT07H2TaJrxSJj/J3BcffM3yqbN5APGRAa+f6UL6JEGKIMeAlFysbJtanUfY4AxmUCGXT9gTMbCSc9ITBUB0fRDtOSp6IhWJDqiVzNYxogTo86J4fd6UOeX3jKUsrs4DeHALe/lfB7w65jKWP9nfRJCb6i40ZTOhXwgH50+Ye4VvRAHxmI4XjjZrI8dRZMmNJFyeo41mkhSJcGvmFVPEFUJRv84vzi26Xy6Nip7xfnGGBVfpuERdJMhFIM4LORyoOEVsZDS1/OKW8Pjomr6w5BQ8llxKjVDIBH2qCfd2NvINc+e1zQiQoOkpBa1nhIOgighxjvT2GXferSg62J
compute01,192.168.0.201 ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIK7iZwjR02vfLEiZHa8izypkDYsL+c9hjm2sCFvpsn2V
compute02,192.168.0.202 ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDLp9jHgpFb0mUY1IdakAvS3Nn4hInSTpgwhGz5eDAg4ObTv8o7HxTrAdoVrIXQruHRifoCkTE4yu5GDB2HGI8kfrt4JwxQsSz6/nSgp9yZfQ0D9MjkGsx2nclpJ320E8hSjc66bLpVeEiDaZIdbX790gldQ4pUbcUMd6BdZDRHUALKf4j1494OzUiNx9nRxl7YC0NAAPdfB7pLiEWIXwLb63NAva1pLbCfCqeE59ho9zO7BoT60iOT2nBWOxi5HJt6itieFwya6pVQNPoY0mXoWdG9ih42Br9W1UCdYOPe4KeKY3CAbgYQtkA1RxgaX0i6N6uziLdJM9Ul+SgP7EOB
compute02,192.168.0.202 ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBP3ju3RafTIEVKpffboA3oKHnYtBHTJqYaYi/duax8ciIDg4OE5RPtTfWdfu5ZgDc4J3TqV9v1bgIkEmjWJHAcU=
compute02,192.168.0.202 ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPdrot7DSxcqFsgo92lIY3dTZDKphaqesie2lir2RrNj
compute03,192.168.0.203 ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBCxxqyqzRiBaBnr8BJwoWe99I5UJEDMUaXMg9yjiszDBBEFCOFAyGa5E9egoDqcualNsZT9uL6cCBYg+QKskdt0=
compute03,192.168.0.203 ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDTY3VZwj+yP059IrDsX3JZIi/3ThOMNGha7/GS7fgzVhBtpqx6W1CON3HUICqJH2Uzr9iJTVJt6z7ugiQs5sQsYlFCGEmrc1ru/atxIVQqEgJi//SGVZJBX99VsQRvnQYWtxCh2+riaU9cQvT2OCt4m9bOdh9QYCI3i42Oud0tQ/gCIjo02m7RSx0Wgvv98Noi+slatUTcwO10PD67Nf+g7fPP3y5arFvWzyJ/nhm0VdbjNovWr7Kwcq/8hvuO3sbfMiVQT3tNOfEWesVEnkYvUcbHrD+2/gj2PXhZdeOEnB8qnOjWTRnqG6MW5EKOdrBwq08d7JN+iIaOUfUAncFp
compute03,192.168.0.203 ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIBfD4Nf5ymxEu6emoAQlSuO3RNTDm5sM4fEIubxDwh1F
compute04,192.168.0.204 ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBPuKNyQ04U57cd1nWDrVmBuiYCFREvw/qYWdxNOgEK7LeH/KKU/pRvwBrfN2aUCtMmHtiSW8fIef+I227mCXOTI=
compute04,192.168.0.204 ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDD9p8Accgdug8+QeWuqMJNUsD8/SQOqgvIs9zyc0FMitzKP7bJKZnsGjK6T/3LjllFbPHrqEKqc0NI3uujKP1EXEHzJn3FYBfVOfC5RjmIshxC3KAIGtKyisKtaq0XwaiGJmoD9w1GVcI1icL9+alVR3158+to2mbhBwSbbQmmQdupPsd2QJvFCWDLv/8Y81HEDGNvLA1NWIQN+mvse1PC9ZTY3d8ccEaRVQ4Wx2NWxrBx6T2SFoHwc/VuVWM49qqA0hzQOTck/MbFhoBSy5NpoXmkMgLhtZPnoXKpYAee3+/0B84oWqdghTQ798zcy71EMJ5T6G0WNSmL7+5tgspp
compute04,192.168.0.204 ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIIdmA05TDS1zl38kZah+fxeVXBCXPYfxlE52nYTNR2Gq
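
Adding the addresses by hand works for five nodes, but the same massaging can be scripted. The sketch below assumes a hosts file that maps each node name to its address; add_ips_to_keys is a hypothetical helper name, not part of OpenSSH.

```shell
# Rewrite "name keytype key" lines from stdin as "name,ip keytype key",
# looking each name up in a hosts file ($1, /etc/hosts format).
# add_ips_to_keys is a made-up name for this example.
add_ips_to_keys() {
    awk '
        NR == FNR {                        # first input: the hosts table
            if ($0 !~ /^[[:space:]]*#/ && NF >= 2)
                for (i = 2; i <= NF; i++) ip[$i] = $1
            next
        }
        /^#/ || NF < 3 { next }            # drop ssh-keyscan banner comments
        {
            if ($1 in ip) $1 = $1 "," ip[$1]   # names without an entry pass through
            print
        }
    ' "$1" -
}
```

With this in place, the collection loop shown earlier could be piped through add_ips_to_keys /etc/hosts before saving the result. It is still worth reading the output before copying it to /etc/ssh/ssh_known_hosts, since every line in that file is a host we are choosing to trust.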

There's another small configuration item we need to fix. The key signing program in /usr/lib/ssh/ssh-keysign has to be setuid to root, or normal users won't be able to use HBA. You can fix this with the following command:

baker:~ # chmod +s /usr/lib/ssh/ssh-keysign
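
A quick way to confirm the change took is to test for the setuid bit. This sketch uses the standard -u file test; has_setuid is just an illustrative name. Note that chmod +s sets the setgid bit as well, while chmod u+s would set only the setuid bit, which is all this check looks at.

```shell
# Report whether a file has the setuid bit set; returns non-zero if it does not.
# has_setuid is a hypothetical helper name.
has_setuid() {
    if [ -u "$1" ]; then
        echo "$1: setuid bit is set"
    else
        echo "$1: setuid bit is NOT set"
        return 1
    fi
}
```

Running has_setuid /usr/lib/ssh/ssh-keysign on each node should report that the bit is set. The path is the one used in this article and may differ on other distributions.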

This should be enough to let normal users use ssh to connect to the compute nodes without a password. The problem we will encounter if we keep going is that the compute nodes don't yet have accounts for anyone other than root. Let's make sure our ssh setup is correct, and then we can worry about accounts on all the machines.

User accounts are defined through three primary files: /etc/passwd, /etc/shadow, and (to a lesser extent) /etc/group. To make our test work, let's just copy these from the head node to compute01 and run a quick test to make sure the admin user can use ssh to connect without a password.

baker:~ # cd /etc
baker:/etc # scp passwd compute01:/etc/passwd
Password: 
Password: 
baker:/etc # scp passwd compute01:/etc/passwd
Password: 
passwd                                                100% 1142   622.1KB/s   00:00    
baker:/etc # scp shadow compute01:/etc/shadow
Password: 
shadow                                                100%  662   387.7KB/s   00:00    
baker:/etc # scp group compute01:/etc/group
Password: 
group                                                 100%  513   327.5KB/s   00:00    
baker:/etc # su - admin
admin@baker:~> ssh compute01 date
Could not chdir to home directory /home/admin: No such file or directory
Mon Apr 30 04:52:13 UTC 2018
admin@baker:~>

Okay, things are looking good, but there are a few things to note from the output above:

  • The root user still has to enter a password
  • The admin user has no home directory
  • The time and date on the compute node are a long way from being accurate

The reason root still has to use a password is a bit convoluted, and is related to the way the ssh program has developed over time. The details are too involved to go into here, but there are a few ways to get around this. For a quick fix, let's log in to node compute01 and change this setting in /etc/ssh/sshd_config:

IgnoreRhosts no

Now create the file /root/.shosts and add our head node to it. It should look like this:

baker
192.168.0.200

Now if we try to ssh as root again, we get this:

baker:/etc # ssh compute01 date
Mon Apr 30 05:02:44 UTC 2018
baker:/etc #

That's one problem solved. The second problem is that we don't have a shared file system for user directories. It's a little unfortunate that we didn't plan ahead for this, but since we're building a "Pure Pi" cluster, let's just export the /home file system from the management node and mount it on the compute nodes. Edit the /etc/exports file on the head node so it looks like this:

/apps 192.168.0.0/24(ro,root_squash,no_subtree_check)
/home 192.168.0.0/24(rw,root_squash,no_subtree_check)

Now run exportfs -a to re-export the file systems, and they should be available to our compute nodes. The next immediate problem we will run into is that our compute nodes are still running the JeOS image, which doesn't include the NFS client. This is simple enough to fix on compute01. We just have to install the NFS client programs, and we should be able to mount the file system:

compute01:~ # zypper install nfs-client
Retrieving repository 'openSUSE-Ports-Leap-15.0-Update' metadata ......................[done]
Building repository 'openSUSE-Ports-Leap-15.0-Update' cache ...........................[done]
Retrieving repository 'openSUSE-Ports-Leap-15.0-repo-oss' metadata ....................[done]
Building repository 'openSUSE-Ports-Leap-15.0-repo-oss' cache .........................[done]
Loading repository data...
Reading installed packages...
Resolving package dependencies...

The following 5 NEW packages are going to be installed:
  keyutils nfs-client nfsidmap rpcbind system-user-nobody

5 new packages to install.
Overall download size: 465.0 KiB. Already cached: 0 B. After the operation, additional 1.8
MiB will be used.
Continue? [y/n/...? shows all options] (y): y
Retrieving package nfsidmap-0.26-lp150.2.3.1.aarch64    (1/5),  43.7 KiB (276.0 KiB unpacked)
Retrieving: nfsidmap-0.26-lp150.2.3.1.aarch64.rpm .........................[done (1.0 KiB/s)]
Retrieving package keyutils-1.5.10-lp150.3.3.aarch64    (2/5),  83.5 KiB (256.8 KiB unpacked)
Retrieving: keyutils-1.5.10-lp150.3.3.aarch64.rpm .......................[done (183.6 KiB/s)]
Retrieving package system-user-nobody-20170617-lp150.3.4.noarch
                                                        (3/5),  10.4 KiB (  116   B unpacked)
Retrieving: system-user-nobody-20170617-lp150.3.4.noarch.rpm ..........................[done]
Retrieving package rpcbind-0.2.3-lp150.2.2.aarch64      (4/5),  66.9 KiB (216.6 KiB unpacked)
Retrieving: rpcbind-0.2.3-lp150.2.2.aarch64.rpm .......................................[done]
Retrieving package nfs-client-2.1.1-lp150.4.3.1.aarch64 (5/5), 260.5 KiB (  1.1 MiB unpacked)
Retrieving: nfs-client-2.1.1-lp150.4.3.1.aarch64.rpm ..................................[done]
Checking for file conflicts: ..........................................................[done]
(1/5) Installing: nfsidmap-0.26-lp150.2.3.1.aarch64 ...................................[done]
(2/5) Installing: keyutils-1.5.10-lp150.3.3.aarch64 ...................................[done]
(3/5) Installing: system-user-nobody-20170617-lp150.3.4.noarch ........................[done]
Additional rpm output:
groupadd -r -g 65533 nogroup
groupadd -r -g 65534 nobody
useradd -r -s /sbin/nologin -c "nobody" -g nobody -d /var/lib/nobody -u 65534 nobody
usermod -a -G nogroup nobody
usermod: no changes

(4/5) Installing: rpcbind-0.2.3-lp150.2.2.aarch64 .....................................[done]
Additional rpm output:
Updating /etc/sysconfig/rpcbind ...

(5/5) Installing: nfs-client-2.1.1-lp150.4.3.1.aarch64 ................................[done]
Additional rpm output:
Updating /etc/sysconfig/nfs ...
setting /sbin/mount.nfs to root:root 4755. (wrong permissions 0755)

compute01:~ # mount /home
compute01:~ # ls -l /home
total 8
drwxr-xr-x 9 admin users 4096 Nov 18 03:11 admin
drwxr-xr-x 6 slurm users 4096 Nov 10 00:35 slurm
compute01:~ #

One more problem down. Now the compute node has the /home file system mounted from the management node, and it knows who the admin and slurm users are.
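
The mount /home command above only works if /etc/fstab on the compute node already has an entry for that file system. A minimal sketch of adding one without creating duplicates follows; add_fstab_entry is a hypothetical helper, and the mount options (nfsvers=3) are the ones used later in this article for the other nodes.

```shell
# Append a line to an fstab-style file only if that exact line is not
# already present, so the function is safe to run more than once.
# $1 = fstab file, $2 = complete fstab line.
# add_fstab_entry is a made-up name for this example.
add_fstab_entry() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}
```

On compute01 this would be called as add_fstab_entry /etc/fstab 'baker:/home /home nfs nfsvers=3 0 0' before running mount /home.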

The last problem is the date and time. Having the time synchronized across the cluster is critical for things to work correctly. Actually, it's not critical that the time be correct, just that all members of the cluster agree on what the time is. Nevertheless, there's no reason not to have the time correct, and it makes everything else easier. Naturally, there's a service to synchronize time with a central server, and it's called NTP for Network Time Protocol. We'll tell all our compute nodes to get their time from the head node, so even if the time on the head node gets messed up, our cluster as a whole will still be in sync.

The NTP service is controlled through a configuration file in /etc/ntp.conf. On the head node, edit /etc/ntp.conf and add the local network as authorized clients. The section in ntp.conf dealing with clients should look like this:

# Local users may interrogate the ntp server more closely.
restrict 127.0.0.1
restrict ::1
restrict 192.168.0.0/24

# Clients from this (example!) subnet have unlimited access, but only if
# cryptographically authenticated.

That's the server side. Now, on node compute01, change the /etc/ntp.conf file to point to the head node for time information. Comment out the default SuSE hosts and add only the head node as an authoritative source. The relevant part of your file should look like this:

server 192.168.0.200
# server 0.opensuse.pool.ntp.org iburst
# server 1.opensuse.pool.ntp.org iburst
# server 2.opensuse.pool.ntp.org iburst
# server 3.opensuse.pool.ntp.org iburst

Your compute node is probably still set to UTC. This is helpful for global deployments, but since we're only running a test cluster locally, it helps to have the time expressed in the local timezone. You can set this through YaST, or you can just copy the /etc/localtime file from the head node to the compute nodes (assuming you have the timezone you want on the head node).

Once you have the correct timezone set, you need to update the time by hand. If the time is too different from the authoritative source, the compute nodes won't correct their time automatically. To force an update, you need to stop the running ntpd process, run the ntpdate command to set the correct time and date, and then restart the ntpd process so the time stays in sync with the head node.

compute01:~ # systemctl stop ntpd
compute01:~ # ntpdate 192.168.0.200
17 Nov 22:58:14 ntpdate[4967]: adjust time server 192.168.0.200 offset 0.012193 sec
compute01:~ # systemctl start ntpd
compute01:~ #

That brings us back into sync with our head node. Now we just have to repeat the same steps for the remaining compute nodes, and we should be okay. For configuration files, we can just copy the file from compute01 to the head node, and then copy it out to the rest of the compute nodes. You kept a list of what we changed, right?
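
If you didn't keep a list, now is a good time to start one. One possible approach, sketched here with made-up names (print_copy_plan and a two-column manifest file), is to record each changed file once and print the scp commands for review before running anything.

```shell
# Print one scp command per manifest entry per node, without running them.
# $1 = manifest file with "local_path remote_path" on each line
#      (blank lines and lines starting with "#" are ignored)
# remaining arguments = node names
# print_copy_plan and the manifest format are assumptions for illustration.
print_copy_plan() {
    manifest=$1
    shift
    for node in "$@"; do
        while read -r src dst; do
            case $src in
                ''|'#'*) continue ;;
            esac
            printf 'scp %s %s:%s\n' "$src" "$node" "$dst"
        done < "$manifest"
    done
}
```

With a manifest containing lines such as "ntp.conf /etc/ntp.conf", print_copy_plan manifest.txt compute02 compute03 compute04 lists every copy that would be made, which makes it easy to spot a missing file before touching the nodes.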

Here's what I did:

baker:~ # mkdir distrib
baker:~ # cd distrib
baker:~/distrib # scp compute01:/etc/ssh/ssh_config .
ssh_config                                            100% 2586     1.2MB/s   00:00    
baker:~/distrib # scp compute01:/etc/ssh/sshd_config .
sshd_config                                           100% 3768     1.5MB/s   00:00    
baker:~/distrib # scp compute01:/etc/ssh/ssh_known_hosts .
ssh_known_hosts                                       100%  683   399.9KB/s   00:00    
baker:~/distrib # scp compute01:/etc/ssh/shosts.equiv .
shosts.equiv                                          100%  116    77.2KB/s   00:00    
baker:~/distrib # scp compute01:/etc/localtime .
localtime                                             100% 2294     1.0MB/s   00:00    
baker:~/distrib # scp compute01:/etc/ntp.conf .
ntp.conf                                              100% 3156     1.3MB/s   00:00    
baker:~/distrib # scp compute01:/root/.shosts .
.shosts                                               100%   20    12.9KB/s   00:00    
baker:~/distrib # for node in compute02 compute03 compute04; do
> ssh $node zypper install -y nfs-client;
> scp /etc/hosts $node:/etc/hosts;
> scp ssh_config $node:/etc/ssh/ssh_config
> scp sshd_config $node:/etc/ssh/sshd_config
> scp shosts.equiv $node:/etc/ssh/shosts.equiv
> scp ssh_known_hosts $node:/etc/ssh/ssh_known_hosts
> scp .shosts $node:/root/.shosts
> scp /etc/passwd $node:/etc/passwd
> scp /etc/shadow $node:/etc/shadow
> scp /etc/group $node:/etc/group
> ssh $node systemctl restart sshd
> scp ntp.conf $node:/etc/ntp.conf
> ssh $node "systemctl stop ntpd; ntpdate 192.168.0.200; systemctl start ntpd"
> ssh $node "echo 'baker:/apps /apps nfs nfsvers=3 0 0' >> /etc/fstab; echo 'baker:/home /home nfs nfsvers=3 0 0' >> /etc/fstab; mount /home; mkdir /apps; mount /apps"
> done
Password: 
Retrieving repository 'openSUSE-Ports-Leap-15.0-Update' metadata [..................done]
Building repository 'openSUSE-Ports-Leap-15.0-Update' cache [....done]
Retrieving repository 'openSUSE-Ports-Leap-15.0-repo-oss' metadata [.....................done]
Building repository 'openSUSE-Ports-Leap-15.0-repo-oss' cache [....done]
Loading repository data...
Reading installed packages...
Resolving package dependencies...

The following 5 NEW packages are going to be installed:
  keyutils nfs-client nfsidmap rpcbind system-user-nobody

5 new packages to install.
Overall download size: 465.0 KiB. Already cached: 0 B. After the operation, additional 1.8 MiB will be used.
Continue? [y/n/...? shows all options] (y): y
Retrieving package nfsidmap-0.26-lp150.2.3.1.aarch64 (1/5),  43.7 KiB (276.0 KiB unpacked)
Retrieving: nfsidmap-0.26-lp150.2.3.1.aarch64.rpm [.done (1.0 KiB/s)]
Retrieving package keyutils-1.5.10-lp150.3.3.aarch64 (2/5),  83.5 KiB (256.8 KiB unpacked)
Retrieving: keyutils-1.5.10-lp150.3.3.aarch64.rpm [.done (1.0 KiB/s)]
Retrieving package system-user-nobody-20170617-lp150.3.4.noarch (3/5),  10.4 KiB (  116   B unpacked)
Retrieving: system-user-nobody-20170617-lp150.3.4.noarch.rpm [.done]
Retrieving package rpcbind-0.2.3-lp150.2.2.aarch64 (4/5),  66.9 KiB (216.6 KiB unpacked)
Retrieving: rpcbind-0.2.3-lp150.2.2.aarch64.rpm [done]
Retrieving package nfs-client-2.1.1-lp150.4.3.1.aarch64 (5/5), 260.5 KiB (  1.1 MiB unpacked)
Retrieving: nfs-client-2.1.1-lp150.4.3.1.aarch64.rpm [.done]
Checking for file conflicts: [........done]
(1/5) Installing: nfsidmap-0.26-lp150.2.3.1.aarch64 [...........done]
(2/5) Installing: keyutils-1.5.10-lp150.3.3.aarch64 [...........done]
(3/5) Installing: system-user-nobody-20170617-lp150.3.4.noarch [.....done]
Additional rpm output:
groupadd -r -g 65533 nogroup
groupadd -r -g 65534 nobody
useradd -r -s /sbin/nologin -c "nobody" -g nobody -d /var/lib/nobody -u 65534 nobody
usermod -a -G nogroup nobody

(4/5) Installing: rpcbind-0.2.3-lp150.2.2.aarch64 [.........done]
Additional rpm output:
Updating /etc/sysconfig/rpcbind ...

(5/5) Installing: nfs-client-2.1.1-lp150.4.3.1.aarch64 [............done]
Additional rpm output:
Updating /etc/sysconfig/nfs ...
setting /sbin/mount.nfs to root:root 4755. (wrong permissions 0755)

Password: 
hosts                                                 100%  758   446.0KB/s   00:00    
Password: 
ssh_config                                            100% 2586   392.3KB/s   00:00    
Password: 
sshd_config                                           100% 3768     1.5MB/s   00:00    
Password: 
shosts.equiv                                          100%  116    79.8KB/s   00:00    
Password: 
ssh_known_hosts                                       100%  683   404.9KB/s   00:00    
Password: 
.shosts                                               100%   20    13.8KB/s   00:00    
Password: 
ntp.conf                                              100% 3156     1.4MB/s   00:00    
18 Nov 05:41:35 ntpdate[4260]: adjust time server 192.168.0.200 offset -0.001041 sec
Password: 
Retrieving repository 'openSUSE-Ports-Leap-15.0-Update' metadata [..................done]
Building repository 'openSUSE-Ports-Leap-15.0-Update' cache [....done]
Retrieving repository 'openSUSE-Ports-Leap-15.0-repo-oss' metadata [.....................done]
Building repository 'openSUSE-Ports-Leap-15.0-repo-oss' cache [....done]
Loading repository data...
Reading installed packages...
Resolving package dependencies...

The following 5 NEW packages are going to be installed:
  keyutils nfs-client nfsidmap rpcbind system-user-nobody

5 new packages to install.
Overall download size: 465.0 KiB. Already cached: 0 B. After the operation, additional 1.8 MiB will be used.
Continue? [y/n/...? shows all options] (y): y
Retrieving package nfsidmap-0.26-lp150.2.3.1.aarch64 (1/5),  43.7 KiB (276.0 KiB unpacked)
Retrieving: nfsidmap-0.26-lp150.2.3.1.aarch64.rpm [.done (1.0 KiB/s)]
Retrieving package keyutils-1.5.10-lp150.3.3.aarch64 (2/5),  83.5 KiB (256.8 KiB unpacked)
Retrieving: keyutils-1.5.10-lp150.3.3.aarch64.rpm [.done (1.0 KiB/s)]
Retrieving package system-user-nobody-20170617-lp150.3.4.noarch (3/5),  10.4 KiB (  116   B unpacked)
Retrieving: system-user-nobody-20170617-lp150.3.4.noarch.rpm [.done]
Retrieving package rpcbind-0.2.3-lp150.2.2.aarch64 (4/5),  66.9 KiB (216.6 KiB unpacked)
Retrieving: rpcbind-0.2.3-lp150.2.2.aarch64.rpm [done]
Retrieving package nfs-client-2.1.1-lp150.4.3.1.aarch64 (5/5), 260.5 KiB (  1.1 MiB unpacked)
Retrieving: nfs-client-2.1.1-lp150.4.3.1.aarch64.rpm [.done]
Checking for file conflicts: [........done]
(1/5) Installing: nfsidmap-0.26-lp150.2.3.1.aarch64 [...........done]
(2/5) Installing: keyutils-1.5.10-lp150.3.3.aarch64 [...........done]
(3/5) Installing: system-user-nobody-20170617-lp150.3.4.noarch [.....done]
Additional rpm output:
groupadd -r -g 65533 nogroup
groupadd -r -g 65534 nobody
useradd -r -s /sbin/nologin -c "nobody" -g nobody -d /var/lib/nobody -u 65534 nobody
usermod -a -G nogroup nobody

(4/5) Installing: rpcbind-0.2.3-lp150.2.2.aarch64 [.........done]
Additional rpm output:
Updating /etc/sysconfig/rpcbind ...

(5/5) Installing: nfs-client-2.1.1-lp150.4.3.1.aarch64 [............done]
Additional rpm output:
Updating /etc/sysconfig/nfs ...
setting /sbin/mount.nfs to root:root 4755. (wrong permissions 0755)

Password: 
hosts                                                 100%  758   446.0KB/s   00:00    
Password: 
ssh_config                                            100% 2586   392.3KB/s   00:00    
Password: 
sshd_config                                           100% 3768     1.5MB/s   00:00    
Password: 
shosts.equiv                                          100%  116    79.8KB/s   00:00    
Password: 
ssh_known_hosts                                       100%  683   404.9KB/s   00:00    
Password: 
.shosts                                               100%   20    13.8KB/s   00:00    
Password: 
ntp.conf                                              100% 3156     1.4MB/s   00:00    
18 Nov 05:41:35 ntpdate[4260]: adjust time server 192.168.0.200 offset -0.002162 sec
Password: 
Retrieving repository 'openSUSE-Ports-Leap-15.0-Update' metadata [..................done]
Building repository 'openSUSE-Ports-Leap-15.0-Update' cache [....done]
Retrieving repository 'openSUSE-Ports-Leap-15.0-repo-oss' metadata [.....................done]
Building repository 'openSUSE-Ports-Leap-15.0-repo-oss' cache [....done]
Loading repository data...
Reading installed packages...
Resolving package dependencies...

The following 5 NEW packages are going to be installed:
  keyutils nfs-client nfsidmap rpcbind system-user-nobody

5 new packages to install.
Overall download size: 465.0 KiB. Already cached: 0 B. After the operation, additional 1.8 MiB will be used.
Continue? [y/n/...? shows all options] (y): y
Retrieving package nfsidmap-0.26-lp150.2.3.1.aarch64 (1/5),  43.7 KiB (276.0 KiB unpacked)
Retrieving: nfsidmap-0.26-lp150.2.3.1.aarch64.rpm [.done (1.0 KiB/s)]
Retrieving package keyutils-1.5.10-lp150.3.3.aarch64 (2/5),  83.5 KiB (256.8 KiB unpacked)
Retrieving: keyutils-1.5.10-lp150.3.3.aarch64.rpm [.done (1.0 KiB/s)]
Retrieving package system-user-nobody-20170617-lp150.3.4.noarch (3/5),  10.4 KiB (  116   B unpacked)
Retrieving: system-user-nobody-20170617-lp150.3.4.noarch.rpm [.done]
Retrieving package rpcbind-0.2.3-lp150.2.2.aarch64 (4/5),  66.9 KiB (216.6 KiB unpacked)
Retrieving: rpcbind-0.2.3-lp150.2.2.aarch64.rpm [done]
Retrieving package nfs-client-2.1.1-lp150.4.3.1.aarch64 (5/5), 260.5 KiB (  1.1 MiB unpacked)
Retrieving: nfs-client-2.1.1-lp150.4.3.1.aarch64.rpm [.done]
Checking for file conflicts: [........done]
(1/5) Installing: nfsidmap-0.26-lp150.2.3.1.aarch64 [...........done]
(2/5) Installing: keyutils-1.5.10-lp150.3.3.aarch64 [...........done]
(3/5) Installing: system-user-nobody-20170617-lp150.3.4.noarch [.....done]
Additional rpm output:
groupadd -r -g 65533 nogroup
groupadd -r -g 65534 nobody
useradd -r -s /sbin/nologin -c "nobody" -g nobody -d /var/lib/nobody -u 65534 nobody
usermod -a -G nogroup nobody

(4/5) Installing: rpcbind-0.2.3-lp150.2.2.aarch64 [.........done]
Additional rpm output:
Updating /etc/sysconfig/rpcbind ...

(5/5) Installing: nfs-client-2.1.1-lp150.4.3.1.aarch64 [............done]
Additional rpm output:
Updating /etc/sysconfig/nfs ...
setting /sbin/mount.nfs to root:root 4755. (wrong permissions 0755)

Password: 
hosts                                                 100%  758   446.0KB/s   00:00    
Password: 
ssh_config                                            100% 2586   392.3KB/s   00:00    
Password: 
sshd_config                                           100% 3768     1.5MB/s   00:00    
Password: 
shosts.equiv                                          100%  116    79.8KB/s   00:00    
Password: 
ssh_known_hosts                                       100%  683   404.9KB/s   00:00    
Password: 
.shosts                                               100%   20    13.8KB/s   00:00    
Password: 
ntp.conf                                              100% 3156     1.4MB/s   00:00    
18 Nov 05:41:35 ntpdate[4260]: adjust time server 192.168.0.200 offset -0.010073 sec

What you can note from the output above is that we were asked for a password until the ssh configuration was updated, and then everything worked without a password. That suggests that we got the ssh configuration correct. Let's do a few checks to make sure everything else worked correctly.

baker:~/distrib # module add pdsh
baker:~/distrib # pdsh -a date
compute04: Sat Nov 17 23:50:05 CST 2018
compute02: Sat Nov 17 23:50:05 CST 2018
compute01: Sat Nov 17 23:50:05 CST 2018
compute03: Sat Nov 17 23:50:05 CST 2018
baker:~/distrib # pdsh -a ls -ld /home/admin
compute04: drwxr-xr-x 9 admin users 4096 Nov 17 21:11 /home/admin
compute03: drwxr-xr-x 9 admin users 4096 Nov 17 21:11 /home/admin
compute02: drwxr-xr-x 9 admin users 4096 Nov 17 21:11 /home/admin
compute01: drwxr-xr-x 9 admin users 4096 Nov 17 21:11 /home/admin
baker:~/distrib #
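
Comparing timestamps by eye works for four nodes. For a numeric check, the sketch below computes the spread between the earliest and latest clock from pdsh-style "node: value" lines. The clock_spread name is hypothetical, and it assumes each node reports its time as epoch seconds, for example from pdsh -a date +%s.

```shell
# Read lines like "compute01: 1542520205" on stdin and print the difference
# in seconds between the largest and smallest value.
# clock_spread is a made-up helper name.
clock_spread() {
    awk '
        NF >= 2 {
            t = $2 + 0
            if (!seen || t < min) min = t
            if (!seen || t > max) max = t
            seen = 1
        }
        END { if (seen) print max - min }
    '
}
```

Running pdsh -a date +%s | clock_spread should print a very small number. The remote commands do not start at exactly the same instant, so a spread of a second or so is not by itself a sign of trouble.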

Everything looks good. The time and date are synchronized across the cluster, and every node can resolve users. We at least have the beginnings of our cluster. Actually, most of the hard work is done. Now we just need to fire up the job manager on our clients and make sure we can run jobs. Sadly, we're not done after we get to that point. We've taken a lot of shortcuts in bringing up the cluster, and we need to go back and fix a lot of things. Nevertheless, if we can run real jobs, that's a great accomplishment.