Many of you are aware that the Premise login node has experienced a number of crashes in the last few weeks.
The lockups have not clearly indicated any specific problem. One crash appeared to point to the InfiniBand card, but the login node crashed again soon after we swapped cards with another node. One crash seemed to indicate RAM, and another implicated the BeeGFS storage system.
Having the login node crash certainly makes it difficult to check on jobs or start new ones. Luckily, it should not cause any jobs running on compute nodes to fail, so we are not currently considering any drastic measures to solve this annoyance. Our solutions are likely to have a larger impact on customers than the problem we are attempting to solve.
The Premise login node has been up for the last 8 days, but we do not believe the problem has been resolved.
While we wait for another data point, we can consider a few options:
If the next crash points to RAM, we will move the suspect RAM and confirm that the problem follows it. Depending on the timing of the event, we will either swap the RAM while the login node is down from the crash or schedule a short downtime during working hours for just the login node.
There are newer versions of BeeGFS available that could be installed on Premise, but they require the entire cluster to be down for a half day. We like to provide you two weeks' notice for any full shutdown.
It would be good to plan time for the BeeGFS software upgrade, but it's not clear that BeeGFS is actually the cause of these crashes.
We're asking the Premise community for feedback. Please let us know if you have a strong preference for a half day of scheduled Premise downtime.
Requesting feedback for these options (for a half day of Premise downtime):
A: The login node crashes are so important that you would prefer the system be down for 1/2 day as soon as we can schedule it, to rule out one potential cause. Possibly Wednesday 3/30, all morning.
B: Just schedule downtime a few weeks out and get it over with. Likely a day like Wednesday 4/13.
C: Wait until we know more, or until the end of classes but before summer. Example: Wednesday 5/18.
D: Just schedule time for a full system upgrade for a full day or two of downtime. This could be summer or other convenient times suggested by users.
E: Shut down anytime with 2 weeks' notice, just NOT between X and Y.
If you have no strong opinions, you need not reply. You can assume we'll give 2 weeks' notice and schedule at least the half day of downtime once we have more data suggesting BeeGFS is the cause of the login node crashes.
Thanks for your patience and input.