Upgrading Lustre

It’s been close to a year since I updated our cluster; I was going to do it over Christmas, but never quite got around to it. The period of social distancing (and procrastinating on my research) is a great time, right? The cluster is running Centos 7. The biggest issue with upgrading it is the Lustre file system. These are all my notes on the upgrade process. I’m hoping by writing them down here, my life will be somewhat easier the next time I need to do this. Learning how Lustre works all over again every time I do an update is an involved process!

Lustre is very picky about the version of the Linux kernel. This means we can’t just do a blanket “sudo yum update” on the system. We need to upgrade to the specific kernel version that is required by the new version of Lustre we will be installing.

On wyeast, the Lustre server is installed across three different nodes: wyeast-lustre01, wyeast-lustre02, and wyeast-lustre03. The metadata server is on the first node, and the object storage targets are stored on lustre02 and lustre03.

First, update the list of updates that yum knows about:

sudo yum makecache

Next, look at the lustre-server repo and find the current version of the Lustre server and the Linux kernel it uses.

sudo yum repo-pkgs lustre-server list

From this, I found that the current Lustre server version is 2.12.4. I checked the changelog on lustre.org to determine the kernel version needed:

http://wiki.lustre.org/Lustre_2.12.4_Changelog

The Linux kernel needed is actually available in the Lustre-server repo:

kernel-3.10.0-1062.9.1.el7_lustre

So I needed to make sure to install that particular version and not the most up-to-date kernel.

sudo yum repo-pkgs lustre-server update kernel-3.10.0-1062.9.1.el7_lustre kernel-devel-3.10.0-1062.9.1.el7_lustre kernel-headers-3.10.0-1062.9.1.el7_lustre

After that, I checked the current list of other updates available in the Lustre server repository.

sudo yum repo-pkgs lustre-server list

Next, I updated all the Lustre packages that were already installed:

sudo yum repo-pkgs lustre-server update kmod-lustre.x86_64 kmod-lustre-osd-ldiskfs.x86_64 libnvpair1.x86_64 libuutil1.x86_64 libzfs2.x86_64 libzpool2.x86_64 lustre.x86_64 lustre-osd-ldiskfs-mount.x86_64 lustre-osd-zfs-mount.x86_64 lustre-resource-agents.x86_64 lustre-zfs-dkms.noarch spl.x86_64 spl-dkms.noarch zfs.x86_64 zfs-dkms.noarch

Finally, I’ll update all the other system software, carefully excluding the Linux kernel packages:

sudo yum -x kernel,kernel-headers,kernel-debug-devel,kernel-tools,kernel-tools-libs,kmod-lustre.x86_64,kmod-lustre-osd-ldiskfs.x86_64,libnvpair1.x86_64,libuutil1.x86_64,libzfs2.x86_64,libzpool2.x86_64,lustre.x86_64,lustre-osd-ldiskfs-mount.x86_64,lustre-osd-zfs-mount.x86_64,lustre-resource-agents.x86_64,lustre-zfs-dkms.noarch,spl.x86_64,spl-dkms.noarch,spl-dkms.noarch,zfs.x86_64,zfs-dkms.noarch,kernel-devel update

That completes all the software upgrades. The same process needs to be done on wyeast-lustre02 and wyeast-lustre03. I probably should have umounted Lustre mounts before this process, but I didn’t. So after the reboot, Lustre wasn’t quite working. I had to fix it.

First, I had to fix the firewall again on the Lustre machines:

sudo iptables -F

Next, zfs (the file system used by Lustre) was messed up on wyeast-lustre01 and wyeast-lustre02.

The command:

zfs list

wasn’t working. It showed that zfs wasn’t loaded. So the first step is to do:

modprobe zfs

This loaded zfs. However, our zfs pools are missing. This command fixed that:

zpool import

This finds the zpools and allows them to be imported:

zpool import lustre-ost0/ost0

zpool import lustre-ost0/ost0

This loads the zfs pools, but I still need to remount the Lustre file system. This needs to be done on the object storage targets first (lustre02 and lustre03) before it is done on the metadata server (lustre01).

sudo mount -t lustre lustre-ost0/ost0 /lustre-ost0/ost0

sudo mount -t lustre lustre-ost1/ost1 /lustre-ost1/ost1

Lustre actually automounted correctly on Lustre03, so I didn’t have to fix anything. With the targets working, it was time to fix Lustre01:

mount -t lustre lustre-mgsmdt/mgsmdt /lustre-mgsmdt/mgsmdt

Mounting the Lustre file system starts the Lustre service and we are off to the races.

Back on the compute nodes, it wasn’t finding the Lustre mount on the head node. So I had to unmount and then remount Lustre.

First, when I tried to unmount Lustre, the file system was reported as busy. So I ran the following command the find the guilty processes:

sudo lsof +f -- /lustre

This gives me a list of processes that I was then able to kill off. After that:

sudo umount /lustre

Followed by:

sudo mount -t lustre 192.168.1.11@tcp:/lustre /lustre

Which worked! Although I hadn’t yet updated the Lustre client, it was still able to handle the updated Lustre server. The other nodes that didn’t have active shells attached to them didn’t have any trouble with the change; I didn’t even have to remount them; the file system just showed up without any trouble.

Next step is to update the software on the compute nodes. Similar process except somewhat easier since we don’t have to deal with zfs. I still want to limit the install to the particular Linux kernel and the “Lustre-client” repo. In this case, I had to download the rpms from rpmfind:

https://rpmfind.net/linux/rpm2html/search.php?query=kernel%28×86-64%29&submit=Search+…&system=&arch=

I downloaded RPMs for kernel, kernel-debug-devel, kernel-headers, kernel-tools, and kernel-tools-libs. This time, I remembered to unmount /lustre first. Then I installed the new kernel modules:

Then, to install them:

sudo yum localinstall kernel-3.10.0-1062.9.1.el7.x86_64.rpm kernel-debug-devel-3.10.0-1062.9.1.el7.x86_64.rpm kernel-headers-3.10.0-1062.9.1.el7.x86_64.rpm kernel-tools-3.10.0-1062.9.1.el7.x86_64.rpm kernel-tools-libs-3.10.0-1062.9.1.el7.x86_64.rpm

Next, update the Lustre client:

sudo yum repo-pkgs lustre-client update kmod-lustre-client.x86_64 lustre-client.x86_64

Then update everything else, excluding the kernel stuff:

sudo yum update -x kernel,kernel-debug-devel,kernel-headers,kernel-tools,kernel-tools-libs

Finally, reboot and then remount Lustre:

sudo mount -t lustre 192.168.1.11@tcp:/lustre /lustre

Unlike with the Lustre server, I didn’t encounter any trouble with the reboot. The Lustre partition survived the update just fine, and I was able to successfully update all the rest of the installed software on the system.

0 thoughts on “Upgrading Lustre

Leave a Reply

Your email address will not be published. Required fields are marked *