The good news is that we have been able to get the Premise Lustre functional again.

Two bad drives have been automatically replaced with the only two hot spare drives in the enclosure.
We have no cold spare drive(s) to add to the enclosure, so a few should be purchased to best protect data integrity.
The "bad" drives have not been removed, and they are causing delays while booting Lustre.

The other bad news is that while checking those disks, a cable management part of the storage enclosure failed in such a way that we can no longer get one of the four storage drawers to fully close.  The system appears to function with the drawer 3 inches out from its fully closed position, but we know the cabling is being pinched, and we will need to contact Seagate support for a possible replacement.

We plan to discuss our options in the morning and determine a plan to move forward.  Very likely it will involve: removing the "dead" disks, ordering replacement drives for hot & cold spares, and contacting Seagate for a quote on fixing the internal cable management (and one fan).  Depending on how long it takes to replace the cable management part(s), we will have to decide what portion of Premise to bring online.

If we bring everything online, we will need future downtime to replace the Lustre cable management.
If we leave Lustre offline until fully repaired, half of the cluster's users will not have home areas and the majority of Anaconda software will be unavailable.  There may be additional problems discovered in running without all of the normal Premise storage systems online.

That's the latest news.  It's very frustrating to have fought through all of the software issues only to have a hardware cabling snag create a roadblock.


-- 
Robert Anderson <rea@sr.unh.edu>
UNH RCC