[premise-users] Premise downtime scheduled for Monday 12/16 through Wednesday 12/18

Robert Anderson rea at sr.unh.edu
Thu Jan 2 15:52:23 EST 2020


Here is the long overdue Premise update.  And yes, there should have
been more of these, but it took a long time to understand the real
problem.  Most of the prior updates would simply have read: "we can't
seem to get a good copy; trying another way to move and/or test the
problem data."
Throughout the long break we continued our attempts to resolve
discrepancies using various copy, backup, and restore efforts.
Eventually we discovered that the BeeGFS destination target does not
support hard links when they point to anything outside of a single
directory.
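For anyone curious, the failure is easy to demonstrate.  Here is a
minimal Python sketch (directory and file names are just examples; on
most filesystems the second link call succeeds, while on the affected
BeeGFS target it reportedly fails):

    import os

    # Two sibling directories on the same filesystem.
    os.makedirs("dir_a", exist_ok=True)
    os.makedirs("dir_b", exist_ok=True)
    with open("dir_a/original.txt", "w") as f:
        f.write("hello\n")

    # A hard link inside the same directory works fine.
    os.link("dir_a/original.txt", "dir_a/same_dir_link.txt")

    # A hard link that crosses into another directory is the case the
    # BeeGFS target rejects; expect an OSError there.
    try:
        os.link("dir_a/original.txt", "dir_b/cross_dir_link.txt")
    except OSError as e:
        print("cross-directory hard link failed:", e)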
There are a few things affected by this:
#1 Anaconda.  On Lustre, Anaconda created a large number of hard
links.  A re-installation of Anaconda on BeeGFS resolves the issue,
but there are a huge number of Anaconda packages in the bioinformatics
space that would be impractical to quickly re-install.  We could
continue to use the old Anaconda distribution on Lustre, but we have
successfully tested an alternative solution of simply duplicating all
of the hard links.  Either of these solutions allows the old
"anaconda/colsa" software to remain as it was.  Linking to the old
Lustre would only be a temporary solution, since that hardware is old
and off maintenance.  Our goal has been to cleanly complete the
migration and not be reliant on any of the outdated hardware.  Testing
so far indicates that duplicating all of the links solves the
anaconda/colsa distribution problems.
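To make the "duplicate the hard links" idea concrete: any regular file
with a link count above one gets rewritten as an independent copy,
which breaks the sharing but preserves the content at every path.  A
minimal sketch of that approach (the tree path is hypothetical, and a
production run would want logging and error handling):

    import os
    import shutil
    import stat

    def break_hard_links(tree):
        """Replace each hard-linked file under 'tree' with its own copy."""
        for root, dirs, files in os.walk(tree):
            for name in files:
                path = os.path.join(root, name)
                st = os.lstat(path)
                # Only regular files with more than one link need work.
                if not stat.S_ISREG(st.st_mode) or st.st_nlink <= 1:
                    continue
                tmp = path + ".unlink_tmp"
                shutil.copy2(path, tmp)  # independent copy, data and metadata
                os.replace(tmp, path)    # swap it in, dropping this link

    break_hard_links("/path/to/anaconda/colsa")  # hypothetical path

Once every path but the last has been rewritten, the remaining original
drops to a link count of one and is skipped, so each former link ends
up as its own independent file.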
#2 Anaconda files in the local user environments.  Most of these are
located in a user's "~/.conda" folder, but some users have separate
Miniconda or Anaconda installations in their home areas.  There are a
number of these environments scattered across Premise.  The best
solution remains to re-install them, but we do not fully understand
how big of an issue this is for the users involved.  The best path
forward may not be the same for all users, so we welcome any insights
you might have on this issue.  There is a way to export Anaconda
environments and then re-install them elsewhere, but Toni reports this
is often unsuccessful.  The best option is to remove all the old
installs and start fresh, if that is a viable option for you.
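For reference, the export-and-recreate route that has been hit or miss
looks roughly like this (the environment name "myenv" is just an
example):

    # On the old install: write the environment spec to a YAML file.
    conda env export -n myenv > myenv.yml

    # On the fresh install: recreate the environment from that file.
    conda env create -f myenv.yml

Recreating often fails when the exported file pins exact build strings
or channels that are no longer available, which matches Toni's
experience.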
#3 There are other hard links used inside user home
directories.  Some of those include: conda, miniconda, ".julia/", git
and hg repositories, src build areas, and some "blast datastores".
I have details on what is affected for every group.  Delivering that
data to each group is harder.  It might be best to assign a single
point of contact for your group, and we can review your group's data
with you.
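If you would like a quick look yourself before we talk, a short Python
sketch like this one (starting at your home directory; the output
format is just illustrative) lists regular files whose link count is
greater than one:

    import os
    import stat

    home = os.path.expanduser("~")
    for root, dirs, files in os.walk(home):
        for name in files:
            path = os.path.join(root, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # unreadable or vanished file; skip it
            # A regular file with st_nlink > 1 is hard-linked from at
            # least one other path and may be affected.
            if stat.S_ISREG(st.st_mode) and st.st_nlink > 1:
                print(st.st_ino, st.st_nlink, path)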
So when will it be online?  We are placing it back online now.
Please log in and check your areas.  Run some code and verify things
are working.  There is a high likelihood that we will need to tune
some of the storage parameters over the next few days.  Some have been
updated during our work, but the normal Premise load is likely to be
very different from the file copies and checks it has been running for
the last few weeks.  Please ramp up your jobs gradually and confirm
things are working first; we'll monitor and adjust over the next few
days.  Hold off on large-scale work if possible until we know it's all
good.
The old Lustre storage is still functional and available if
needed.  We also have a full backup of all data from Lustre, in
addition to the copy now moved to BeeGFS storage.  I believe your data
is all safe, even if we lost the Lustre storage tomorrow.  Let us know
if you find any other issues or need help resolving any of the
problems noted above.
Thanks for your patience during this extremely long downtime.
Robert Anderson and the RCC HPC team.
On Thu, 2019-12-19 at 09:41 -0500, Robert Anderson wrote:
> The Premise HPC reconfiguration is not yet complete. 
> The largest remaining task is to validate that all of the user data
> was successfully moved. All user data has been migrated from Lustre
> to BeeGFS storage. We are confirming it is an exact copy. This
> process started yesterday and many group areas have been scanned and
> confirmed to be exact copies. 
> The last remaining system task is to reimage the old Premise head/login
> node to be just a login node.  This was started this morning and is
> not expected to take long to complete. 
> RCC believes that user data is the most important part of Premise and
> we want to ensure an accurate copy has been made. Given the size of
> the data being checked this process could drag out for some groups.
> At this point we are going to keep Premise offline until the number
> of groups with unconfirmed data is smaller. 
> In the meantime we will try to determine a way to safely allow the
> checked groups access to Premise.  Worst case could be Monday
> morning, but another email update will be sent this evening. 
> Sorry for the extra downtime. 
> 
> On November 12, 2019 16:42:27 Robert Anderson <rea at sr.unh.edu> wrote:
> > The Premise HPC cluster is in need of a scheduled downtime for
> > reconfiguration, migration to the new storage, and general system
> > upgrades.  We hope that by scheduling a month out people can work
> > around these dates, and that Premise will be ready for the many
> > jobs expected during the long holiday break. 
> > Our plan is to shut down first thing Monday morning 12/16.  The
> > main upgrades will occur on Monday.  We will then work to migrate
> > as much data as possible from Lustre to the new BeeGFS storage.  We
> > have moved the data over multiple times, but it immediately goes
> > stale with the output of every job you run.
> > Given the size of the data storage on Premise there is little
> > chance we can complete all the migrations within a three-day
> > window.  We will start with the smallest groups and provide a
> > detailed status update for the larger groups on the 18th.
> > You can help to complete the storage migration by:
> > 1. Cleaning up anything currently stored on Premise that you do not
> > need.  This would be a great chance to ensure the 2nd copy of your
> > data is complete.
> > 2. "STATIC" If you store a lot of data on Premise, please let us
> > know the areas that we can copy now that will NOT have to be
> > updated later.  If you have any large datasets that need to be
> > moved but will not change, please provide the path to them so we
> > can copy them in the weeks before 12/14 and NOT attempt to update
> > them during this short window.  If in doubt, provide the path: we
> > are not reformatting the old storage immediately, so it will be
> > available to us for a while AFTER this planned storage migration
> > window.
> > 3. "CRITICAL" On the other hand, if you have critical areas that
> > you really need, please provide the path(s) so that we can ensure
> > they are moved during the scheduled 3-day window.  This only makes
> > sense if your group has multiple TB of storage on Premise; smaller
> > groups (<5TB) do not need to specify their critical areas, since we
> > will have time to easily move your entire group area.
> > Please email questions or responses to #2 "STATIC" & #3 "CRITICAL"
> > above to:  RCCOPS at sr.unh.edu
> > Thanks for your cooperation.
> > -- 
> > Robert Anderson <rea at sr.unh.edu>
> > UNH RCC