[premise-users] Premise is now back online.

Robert E. Anderson rea at sr.unh.edu
Wed Oct 14 15:53:40 EDT 2020


Just before 3pm I reviewed the output, looking for a better indication
of when the still-running rsync might finish, and found it was not
creating the expected hardlinks.
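
For reference, preserving hard links with rsync requires asking for it
explicitly; a plain archive copy silently turns each linked entry into
an independent file.  A minimal sketch of the kind of invocation
involved (the paths and exact options here are assumptions, not the
actual command that was running):

    # -a = archive mode, -H = recreate hard links on the destination;
    # without -H every hard-linked entry lands as a separate full copy
    rsync -aH --numeric-ids /path/to/meta/source/ /path/to/backup/target/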

I then stopped the rsync and started the Premise bootup sequence.

The main BeeGFS mirror issue has been solved, and the "old Lustre"
metadata drive has been replaced.  So we should now be fully functional,
just without any of the extra backups and improvements we had hoped to
make during this unfortunate shutdown event.

On Wed, 14 Oct 2020 12:24:15 -0400
"Robert E. Anderson" <rea at sr.unh.edu> wrote:

> Things are better this morning.
> 
> For clarification, the problem we've been working on is only on the
> secondary metadata mirror for the BeeGFS volume mounted on Premise as
> "/mnt/home".  There is also a failed SSD drive in the "old Lustre"
> storage array mounted on Premise as "/mnt/lustre".  A replacement
> Lustre SSD drive was previously purchased and was on campus late
> Monday; we should have the new drive in Morse today.
> 
> 
> 
> Last night's 2nd fix completely solved the BeeGFS metadata resync
> error. It took about 6 hours to complete, but we now have a good
> BeeGFS metadata mirror.
> 
> The original metadata backup process ran out of memory (more than a
> day in).  The rsync backup of the same data is still running, just
> taking longer to complete than predicted based on the size of the
> data already on the destination.
> 
> The current plan is to:
> 
> 1. Wait for the rsync backup to complete (best guess is 2 more hours)
> 
> 2. Bring Premise back online for users.
> 
> 3. Start a beegfs-fsck (BeeGFS File System ChecK) on the live BeeGFS
> storage (which will take longer than the offline beegfs-fsck,
> possibly days to complete; see the sketch after this list).
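> 
> As a sketch of step 3: the check is driven by the beegfs-fsck utility.
> The exact options vary by BeeGFS release, so treat the flags below as
> assumptions and confirm against beegfs-fsck --help on the management
> node:
> 
>     # full consistency check of the (running) file system
>     beegfs-fsck --checkfs
>     # report-only variant that does not attempt any repairs
>     beegfs-fsck --checkfs --readOnly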
> 
> Ideally we can also replace the failed SSD drive in the "old Lustre"
> metadata storage while the rsync completes.
> 
> We are holding off on migrating the BeeGFS secondary metadata server
> to ZFS until after a clean beegfs-fsck.  Knowing that a resync takes
> about 6 hours, doing that work first would only extend this
> unscheduled downtime further.  The rsync ZFS destination is also
> showing us how much space our metadata consumes on a ZFS storage
> target, so we are gaining insights on this option.
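> 
> For the curious, the space a dataset consumes on a ZFS target can be
> read straight from its properties; a small sketch, where the dataset
> name tank/meta-backup is made up for illustration:
> 
>     # on-disk usage and compression ratio for the backup dataset
>     zfs list -o name,used,referenced,compressratio tank/meta-backup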
> 
> I'll provide either a message that the system is back up, or yet
> another status update, on or before 3pm.
> 
> 
> On Tue, 13 Oct 2020 09:39:13 -0400
> Robert Anderson <rea at sr.unh.edu> wrote:
> 
> > The metadata storage problem is proving more difficult to resolve. I
> > hate keeping Premise down longer, but it's a lot better than having
> > to say we lost all your stored data.
> > 
> > The issue is that the primary metadata server is failing to mirror
> > its data to the secondary. We believe a full parallel file system
> > check is needed, but that fails when the secondary metadata mirror
> > is out of sync: a catch-22 with no simple solution.
> > 
> > The underlying low-level file storage currently utilizes a hardware
> > mirrored pair of SSD drives. This lowest-level storage passes
> > individual file system checks.
> > 
> > To ensure the safety of the stored data we want to get at least one
> > good offline copy of the current metadata. That process is going
> > much slower than anticipated. Since we have no successful backup of
> > this metadata, we have nothing to base a completion estimate on.
> > 
> > Once the backup safety net is in place we need to: reconfigure
> > things to remove the secondary metadata mirror, perform the
> > parallel file system check, and then add back the secondary
> > metadata server. Getting a good backup requires no users. The
> > parallel file system check and re-mirror would be faster without
> > users, but can also be done with normal user workloads.
> > 
> > While we are down we will also try to reconfigure the storage on the
> > idle server to avoid this metadata backup issue in the future.
> > During active use the metadata is always changing, and backups take
> > many hours to complete, so we are never able to capture a stable
> > backup with active users on Premise.  Changing the low-level storage
> > to a file system that supports snapshots and replication will
> > provide us with a few viable backup options moving forward.
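> > 
> > To illustrate that option: with a snapshot-capable file system such
> > as ZFS, a consistent point-in-time copy of the metadata can be taken
> > almost instantly and then replicated while the system stays in use.
> > A rough sketch (dataset and host names are made up for illustration):
> > 
> >     # freeze a consistent point-in-time view of the metadata dataset
> >     zfs snapshot tank/beegfs-meta@backup-20201013
> >     # replicate that snapshot to a second machine over ssh
> >     zfs send tank/beegfs-meta@backup-20201013 | ssh backuphost zfs recv backup/beegfs-meta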
> > 
> > I will send another status update email this evening.
> > 
> > On October 12, 2020 17:17:45 Robert Anderson <rea at sr.unh.edu>
> > wrote:  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
> >   
> 
> 
> 



-- 
Robert E. Anderson
Associate Director
UNH Research Computing Center

