[premise-users] Slow progress on backup & metadata mirroring.

Robert E. Anderson rea at sr.unh.edu
Wed Oct 14 01:48:41 EDT 2020


The original metadata backup is still running.  I've also started an
rsync backup of the same metadata, which looks like it will finish
mid-morning.

The metadata mirror resync runs for about 2.5 hours before stopping with
the same file error each time.  One "manual fix" changed the reported
error, but only to a different error on the same file.  A second manual
fix has been applied and the mirror resync is now running.

If the most recent fix works, we will then convert one of the two
low-level metadata storage mounts to ZFS and resync.  If things go
smoothly, Premise could be back online as early as noontime.  It is
possible, however, that the fix resolves only the file previously
reported and that more than one file is in this state.  The sync process
stops on the first error after running a few hours, so each new file
error could take a few hours to resolve before it uncovers the next one.

I'll send another update mid-morning once I have new information from
tonight's running processes.

Once again, thanks for your patience.

On Tue, 13 Oct 2020 09:39:13 -0400
Robert Anderson <rea at sr.unh.edu> wrote:

> The metadata storage problem is proving more difficult to resolve. I
> hate keeping Premise down longer, but it's a lot better than having
> to say we lost all your stored data.
> 
> The issue is that the primary metadata server is failing to mirror
> its data to the secondary. We believe a full parallel file system
> check is needed, but that fails when the secondary metadata mirror is
> out of sync. It is a catch-22 with no simple solution.
> 
> The underlying low-level file storage currently uses a
> hardware-mirrored pair of SSDs. This lowest-level storage passes
> individual file system checks.
> 
> To ensure the safety of the stored data we want to get at least one
> good offline copy of the current metadata. That process is going much
> slower than anticipated. Since we have no successful backup of this
> metadata, we have nothing to base a completion estimate on.
> 
> Once the backup safety net is in place we need to: reconfigure things
> to remove the secondary metadata mirror, perform the parallel file
> system check, and then add back the secondary metadata server.
> Getting a good backup requires that no users be active. The parallel
> file system check and re-mirror would be faster without users, but can
> also be done with normal user workloads.
> 
> While we are down we will also try to reconfigure the storage on the
> idle server to avoid this metadata backup issue in the future.
> During active use the metadata is always changing, and backups take
> many hours to complete, so we can never capture a stable backup
> with active users on Premise.  Changing the low-level storage to a
> file system that supports snapshots and replication will provide us
> with a few viable backup options moving forward.
> 
> I will send another status update email this evening.
> 
> On October 12, 2020 17:17:45 Robert Anderson <rea at sr.unh.edu> wrote:
> > There was a filesystem error on Premise /mnt/home.
> >
> > A filesystem check on the main /mnt/home space needs to complete
> > without active users.
> >
> > Hopefully the system will be back online later this evening or in
> > the morning if the check takes all night.
> >
> >
> > On Mon, 2020-10-12 at 08:33 -0400, Thomas J. Baker wrote:
> >> We lost chilled water this morning and the Premise cluster has
> >> been shut down. The backup chiller appears to be misbehaving as
> >> well, so no ETA yet on Premise's return to service.
> >>
> >> Thanks,
> >>
> >> tjb
> >> --
> >> =======================================================================
> >> Thomas Baker                  | email:  tjb at unh.edu
> >> Systems Programmer            |
> >> Research Computing Center     | office: Morse 206
> >> University of New Hampshire   | voice:  (603) 862-4490
> >> 213 Morse Hall                | fax:    (603) 862-1761
> >> Durham, NH 03824 USA          |
> >> =======================================================================
> >>
> >>
> >> _______________________________________________
> >> premise-users mailing list
> >> premise-users at lists.sr.unh.edu
> >>
> >> https://lists.sr.unh.edu/mailman/listinfo/premise-users
> 



-- 
Robert E. Anderson
Associate Director
UNH Research Computing Center

