[biomade-users] BioMade HPC Core RAM error on storage server on Sat/Sun 9/17-18
Robert Anderson
Robert.E.Anderson at unh.edu
Sun Sep 18 15:33:06 EDT 2022
Sometime between Saturday 7:30 am and Sunday 8:00 am, a RAM chip error
on BioMade's Storage01 server caused the storage system to pause.
The server was shut down early Sunday afternoon and RAM chip B3 was
re-seated in order to bring the server back online.
It appears that all of the storage/jobs continued running after
unpausing. This pause was far longer than the 15 minutes that we thought
a pause like this could last without interrupting jobs. If you had jobs
running on BioMade this weekend, you should check the status of your
jobs. If they are still running, also check their output to confirm
they are still working as expected.
The few that I looked at do appear to still be running, but I was not
sure where to look to confirm that their output was not affected by the
storage pause.
--
Robert E. Anderson / UNH Research Computing Center / Data Center Operations