The Premise HPC cluster went down during lunchtime today.

It appears that CPU1 detected a machine check level error.

Most jobs appear to have survived the head node reboot. You should confirm that your running jobs are still OK.

If this occurs again, we will run diagnostic tests while the system is down, in an attempt to confirm the error well enough to order a hardware replacement.



On a potentially related note, our work on migrating from Lustre to BeeGFS has progressed significantly in the last few weeks. If the impact of this hardware issue becomes significant, RCC may decide to "rush" the migration and cut over to an alternate configuration. This would result in some downtime, and some rough transitions for groups with the largest storage areas. We had hoped to work out a way to transition slowly, with lower impact, but this hardware issue may make a quick transition the lowest-impact option.

We will try to keep you updated. With any luck, this was a random CPU error that will not repeat.

Thanks,

Robert E. Anderson
Associate Director
UNH Research Computing Center