[premise-users] Options for Premise Login node crashes.

Tue Mar 22 15:59:02 EDT 2022

Many of you are aware that the Premise login node has experienced a
number of crashes in the last few weeks.

The lockups have not clearly indicated any specific problem.  One crash
appeared to be the InfiniBand card, but the login node crashed again
soon after we swapped cards with another node.  One crash seemed to
indicate RAM, and another implicated the BeeGFS storage system.

Having the login node crash certainly makes it difficult to check on
running jobs or start new ones.  Luckily it should not cause any job(s)
running on compute nodes to fail.  So we are not currently considering
any drastic measures to solve this annoyance.  Our solutions are likely
to have a larger impact on customers than the problem we are attempting
to solve.

The Premise login node has been up for the last 8 days, but we do not
believe the problem has been resolved.

While we wait for another data point, we can consider a few options:

If the next crash points to RAM, we will move the suspicious RAM and
confirm the problem follows that RAM chip.  Depending on the timing of
the event, we will either swap the RAM while the login node is down
from the crash OR schedule a short downtime during working hours for
just the login node.

There are newer versions of BeeGFS available that could be installed on
Premise, but they require the entire cluster to be down for a half day.
We like to provide you two weeks' notice for any full shutdown.

It would be good to plan time to do the BeeGFS software upgrade, but
it's not clear that BeeGFS is actually the cause of these crashes.

We're asking the Premise community for feedback.  Please let us know if
you have a strong preference for a half day of scheduled Premise
downtime.

Requesting feedback for these options (for a half day of Premise
downtime):

A:  The login node crashes are so important you would prefer the system
be down for 1/2 day as soon as we can schedule it, to rule out one
potential cause.  Possibly Wednesday 3/30 all morning.

B:  Just schedule downtime a few weeks out and get it over with.
 Likely a day like Wednesday 4/13.

C: Wait until we know more, or until the end of classes but before summer.
 Example Wednesday 5/18.

D: Just schedule time for a full system upgrade for a full day or two
of downtime.  This could be summer or other convenient times suggested
by users.

E: Shut down anytime with 2 weeks' notice, just NOT between X and Y.

If you have no strong opinions you need not reply.  You can assume
we'll give 2 weeks' notice and schedule at least the half day of
downtime when we have more data that suggests BeeGFS is the cause of
the login node crashes.

Thanks for your patience and input.

-- 
Robert Anderson <rea at sr.unh.edu>
UNH RCC