The Premise HPC system's head node slowly became unresponsive around 10am this morning. It appears to have lost NFS contact over InfiniBand, followed by all of the other IB connections stopping.

There is NO indication of this being related to the 9/5 CPU1 error.

The Premise head node was rebooted around 10:30am and seems to be fully functional. Jobs running on the compute nodes do not seem to have been affected. Please check on your jobs and let us know if you find any that require cleaning up.

Luckily, none of the recent problems have involved Lustre storage, and our Lustre backups are very close to complete. At some point we will require some downtime to make the switch from Lustre to the new BeeGFS storage.

--

Robert Anderson <rea@sr.unh.edu>
UNH RCC