I’m currently maintaining a small cluster with a head node and 17 other nodes, all connected together through a switch. It’s running CentOS 7 and has a Lustre file system that lives on three additional machines: one of them runs the MGS/MDS servers, and the other two each run two storage targets. Lustre is normally used on large systems with lots of users and lots of storage targets, and this system is tiny in comparison. But I’m doing research that involves a lot of I/O, and my paper is going to be more useful to other computer scientists if it uses the same type of file system that most, far larger HPC systems use. So that’s why I’m going to all the trouble of maintaining Lustre, which is far more complicated than a file system such as NFS.
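For reference, each client node mounts the whole thing with a single mount command, something along these lines (the file system name “lustrefs” is just a stand-in, and wyeast-lustre01 here stands for the MGS/MDS machine):

# each compute node mounts Lustre from the MGS, roughly like this
mount -t lustre wyeast-lustre01@tcp:/lustrefs /lustre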
That said, everything was going great with Lustre until I ran into a weird issue with one of my bash scripts. The script itself is pretty trivial; just a couple of lines and nothing fancy. But when I tried to edit it with vim, my shell just hung. And hung. And hung. It was pretty obvious after a half hour or so that it wasn’t coming back. When I tried to ls into that directory to check for hidden lock files or something else that might be causing an issue, the ls process also hung. Same with cd. I even tried to copy the file. And that hung. It became the directory where all processes go to die. Except I couldn’t even do a kill -9 on these processes; they were persistent and wouldn’t stop.
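In hindsight, that’s the classic signature of processes stuck in uninterruptible I/O sleep: kill -9 can’t touch a process while it’s sitting in the D state waiting on the file system. A quick way to spot them is something like this:

# list processes stuck in uninterruptible sleep (state D), which kill -9 cannot stop
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'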
I unmounted Lustre (umount -l /lustre) and then remounted it. I could access the directory again, but the file was still “stuck.” It was late and I was cranky. Even though I knew it was probably not a good thing to do, I rebooted the MGS server, hoping to reset everything. Looking back, the correct thing to do would be to unmount Lustre from the cluster, turn off the target servers, turn off the MGS, and then turn everything back on and remount to see if that would fix the problem. Instead, I took a hammer to it and rebooted. This exposed a different problem… one that would have needed to be fixed eventually, but by forcing the issue, I exchanged a stuck file for a broken Lustre.
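For the record, that “correct” sequence is roughly the following (same stand-in names as above; the target mount points are guesses, since the exact paths depend on how the targets were formatted):

# on every client node: unmount the Lustre file system
umount /lustre
# on each of the two OSS machines: unmount the object storage targets, then shut down
umount /mnt/ost0 /mnt/ost1
shutdown -h now
# on the MGS/MDS machine: unmount the MGT/MDT, then shut down
umount /mnt/mdt
shutdown -h now
# power everything back on in the reverse order (MGS/MDS first, then the OSS machines),
# and finally remount on the clients:
mount -t lustre wyeast-lustre01@tcp:/lustrefs /lustre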
Namely, when I tried to remount Lustre on my cluster nodes, it gave me a not-very-helpful unspecified I/O error. When I did an lctl ping <my-lustre-server>, I got a similar error: failed to ping 198.168.11.1@tcp: Input/output error. Awesome.
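When a client throws that kind of I/O error, the usual first checks are at the LNet layer itself; something like this (the NID here is just an example address):

lctl list_nids                # which LNet NIDs this node has configured
lctl ping 192.168.1.11@tcp    # can LNet actually reach the server's NID?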
And then I discovered that I could still access Lustre from a single one of my nodes. Not the head node, but one (and only one) of my other nodes. Everything was working great on that node, including the previously stuck script file. At this point, it was 3am and I was tired. So I went to bed before I did something else I’d later regret.
The next day, I spent a bunch of time on that one working node and made sure to back up all of my precious research data. I was pretty sure that the issue wasn’t actually with the data files (which clearly were working fine), but with some sort of configuration problem related to the network. But, just in case, it’s always a good idea to back up your data, especially when using a file system that isn’t working 100%. After that, I did the thing that I should have done before: I took Lustre down carefully, in the correct order, and rebooted the Lustre machines. They probably didn’t need a reboot, but the error had something to do with the network, and all that stuff is configured automatically, so I hoped the reboot would unstick something. I was wrong.
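The backup itself was nothing clever; from the one node that could still see Lustre, something like this is enough (the destination path is just wherever you happen to have non-Lustre disk space):

# copy the research data off Lustre onto local, non-Lustre disk while it is still reachable
rsync -av /lustre/ /data/lustre-backup/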
The reboot ensured that my one working node no longer worked, and that now none of my nodes could connect to the Lustre servers. On the Lustre side of things, though, everything seemed to be working: the ZFS pools were showing up, the MGT/MDT were mounting properly, and I could even mount the object storage targets. But the nodes couldn’t connect, and the Lustre machines couldn’t even ping each other with lctl (normal pings were working fine, though).
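The server-side checks that made everything look healthy were along these lines (the peer address in the lctl ping is illustrative):

zpool status -x               # are the ZFS pools behind the targets healthy?
mount -t lustre               # which Lustre targets are currently mounted on this server?
lctl ping 192.168.1.12@tcp    # LNet ping between the Lustre machines -- this failed
ping -c 3 192.168.1.12        # plain ICMP ping -- this worked fine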
After Googling like crazy and trying so many different things (none of which worked at all), I eventually found an offhand reply to someone else’s lctl network issues: “You might want to make sure that you can connect to the Lustre server on port 988. That’s what it uses.” Hmmm.
# telnet wyeast-lustre01 988
Trying 192.168.1.11…
telnet: connect to address 192.168.1.11: No route to host
Okay. This is progress. I have finally identified the problem. Lustre had been working fine before, so port 988 must have been reachable previously, but after the reboot the iptables rules changed and closed it off. I have no idea why that would have happened, because I hadn’t updated that machine or added any new software. Total mystery.
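If I had known to look, the firewall state on the server would have shown it directly; something like:

# is there (or isn't there) a rule allowing the Lustre/LNet port?
iptables -L INPUT -n --line-numbers | grep 988
# is anything listening on 988 on the server side?
ss -tln | grep :988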
Anyway, my cluster nodes are pretty well walled off. To get to them, you need to get into the PSU computer system and then into the cluster’s head node. Only then are the cluster nodes visible. So I’m probably safe opening up the firewall, even though it wouldn’t be secure to do that on most other systems. On each of the Lustre systems, I did an “iptables -F”, which flushed all the firewall rules and opened everything up. Voila! The Lustre machines could talk to each other. My nodes could talk to the Lustre machines. And I could mount the Lustre file system onto all of my nodes. System restored!
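If you don’t want to drop the firewall entirely, the narrower fix is to allow just the LNet port and leave the rest of the rules alone; something like this (persisting it this way assumes the iptables-services setup rather than firewalld):

# allow incoming LNet/Lustre traffic on TCP port 988
iptables -I INPUT -p tcp --dport 988 -j ACCEPT
# persist across reboots (only if iptables-services is what manages the rules)
service iptables save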
I still don’t know why that little bash script file got stuck, though….