[premise-users] Premise Head node lost InfiniBand/NFS connection on 9/25/19 @ 10:00

Robert Anderson rea at sr.unh.edu
Wed Sep 25 11:14:15 EDT 2019


The Premise HPC system's head node slowly became unresponsive around
10am this morning.  It appears to have lost NFS contact over
InfiniBand, followed by all of the other IB connections stopping.

There is NO indication of this being related to the 9/5 CPU1 error.

The Premise head node was rebooted around 10:30am and seems to be fully
functional.  Jobs running on the nodes do not seem to have been
affected.  Please check on your jobs and let us know if you find any
that require cleaning up.


Luckily none of the recent problems have involved Lustre storage, and
our Lustre backups are very close to complete.  At some point we will
require some downtime to make the switch from Lustre to the new BeeGFS
storage.


-- 
Robert Anderson <rea at sr.unh.edu>
UNH RCC