[premise-users] Premise Head node CPU1 error on 9/5/19 @ 12:30

Thu Sep 5 15:21:49 EDT 2019

The Premise HPC cluster went down during lunchtime today.

It appears that CPU1 on the head node detected a machine-check-level error.

Most jobs appear to have survived the head node reboot.  You should
confirm that your running jobs are still OK.

If this occurs again, we will run diagnostic tests while the head node
is down, in an attempt to confirm the error well enough to order
replacement hardware.

On a potentially related note, our work on migrating from Lustre to
BeeGFS has progressed significantly in the last few weeks.  If the
impact from this hardware issue becomes significant, RCC may decide to
"rush" the migration and cut over to an alternate configuration.  This
would result in some downtime, and some rough transitions for groups
with the largest storage areas.  We had hoped to work out a way to
transition slowly, with lower impact, but this hardware issue may make
a quick transition the lowest-impact option.

We will try to keep you updated.  With any luck this was a random CPU
error that will not repeat.

Thanks,

Robert E. Anderson
Associate Director
UNH Research Computing Center