[premise-users] Premise is back online.

Robert Anderson Robert.E.Anderson at unh.edu
Mon Jul 18 22:13:08 EDT 2022


The Lenharth Data Center had UPS work done today.  The combination of bypass sequences looked like a power outage followed by depletion of the UPS batteries.  As a result, the SheepDog monitoring system shut down the Premise HPC cluster around 12:45 this afternoon.

The old Lustre system came back online but was in an odd failover state where only one server in each pair took control of that pair's shared disks.  This failover condition persisted through four full power-cycle boot attempts.

We now know that this failover condition requires manual failback commands to return to load-balanced pairs of servers.  After we issued manual failback commands on both storage pairs and the metadata storage, everything was back to its normal state.  We performed a few tests and brought the system back online around 9 pm.


Robert E. Anderson

University of NH / Research Computing Center / Data Center Operations

