The Premise HPC Lustre storage resync command that started on Sunday completed around noontime today.
We spent the afternoon trying to resolve a MySQL database error that is occurring on the Lustre Management node. A single uncommitted transaction keeps the database from coming up. All attempts to roll back that transaction have failed. We have backed up the MySQL file system storage, and also dumped the data for all tables from within MySQL. We're currently reloading all of the MySQL data into a clean database, which could take a few hours. Once the data is restored, we expect MySQL will start normally and contain the prior data necessary to continue the Lustre boot process.
It's possible that we will run into a new issue after this restore is complete. But our current roadblock is MySQL, and its data appears to be intact, so we are still optimistic. If all goes well, there is a chance that the server will be up later tonight. If we do run into a new problem later this evening, we will hold off and take it up when we are fresh on Tuesday morning.
At some future point we may have to consider bringing Premise back up without the old Lustre storage. All users whose home directories are on BeeGFS and who are not using Anaconda would still be able to use the system without the old Lustre. We are not yet at that point, and will continue working toward 100% functionality.
Thanks for your patience.