[premise-users] Premise update 1/10/20

Robert Anderson rea at sr.unh.edu
Fri Jan 10 20:02:04 EST 2020


During the past week a number of issues have been resolved and
additional BeeGFS tuning has been done.  However, one significant
performance issue related to Linux file system caching remains.
BeeGFS does not use the Linux file system cache, so job workloads that
repeatedly re-read the exact same file(s) experience significantly
slower access.  Some runtimes have been estimated to be more than an
order of magnitude longer.  COLSA users appear to be affected the
most, since this access pattern is common in bioinformatic sequence
assembly workloads.
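
If your workload re-reads the same reference files many times, one
possible workaround is to stage those files onto node-local scratch at
the start of the job and read them from there.  The sketch below is
illustrative only: it assumes a Slurm batch job and node-local scratch
exposed as $TMPDIR (neither is confirmed for Premise), and the
reference path and assembler command are made-up placeholders.

  #!/bin/bash
  #SBATCH --job-name=stage-example
  # Hypothetical input; substitute your own frequently re-read file(s).
  REF=$HOME/refs/genome.fasta
  # Assumes node-local scratch is available via $TMPDIR (falls back to /tmp).
  SCRATCH=${TMPDIR:-/tmp}/$SLURM_JOB_ID
  mkdir -p "$SCRATCH"
  cp "$REF" "$SCRATCH"/          # read once from BeeGFS instead of many times
  my_assembler --reference "$SCRATCH/genome.fasta"   # hypothetical tool
  rm -rf "$SCRATCH"              # clean up node-local scratch
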
A number of groups have decided to continue using the old Lustre
storage until we are able to close the above-mentioned runtime
gap.  The PI of any group may choose to return to using Lustre home
areas if this caching problem poses a hardship to their research
objectives.  This choice does come with risk, since the Lustre hardware
is aging and no longer under maintenance.  Unless your group is greatly
impacted, RCC would suggest you remain on the supported BeeGFS storage.
Updates have been completed this week to address:
  - Forced password change errors related to expired passwords.
  - A brief network drop between the login node and the Slurm server
    that occurred every 18 minutes.
  - Write access for users to the old Lustre storage.
  - The move of some high-traffic software and user home areas back to
    Lustre storage.
  - Anaconda package updates and changes to reflect environment
    changes.
  - (Probably additional changes that we've managed to forget...)
If you are experiencing any problems, please let us know.  Missing
licenses and other minor configuration issues often remain
undiscovered for weeks after changes like these.  Please check early,
before you really need it.
If you have been waiting for the dust to settle on this storage
migration, we hope that it finally has.  You should consider the
Premise system to be fully back online, and start running your jobs
again.
Thanks for your patience, and have a good weekend.
-- 
Robert Anderson <rea at sr.unh.edu>
UNH RCC

On Thu, 2020-01-02 at 15:52 -0500, Robert Anderson wrote:
> Here is the long overdue Premise update.  And yes, there should have
> been more of these, but it took a long time to understand the real
> problem.  Most of the prior updates would have read "we can't seem
> to get a good copy, trying another way to move and/or test the
> problem data".
> Throughout the long break we continued our attempts to resolve
> discrepancies using various copy, backup, and restore
> efforts.  Eventually we discovered that the BeeGFS destination does
> not support hard links when they point to anything outside of a
> single directory.
> There are a few things affected by this:
> #1 Anaconda.  On Lustre, Anaconda created a large number of hard
> links.  A re-installation of Anaconda on BeeGFS resolves the issue,
> but there are a huge number of Anaconda packages in the
> bioinformatics space that would be impractical to quickly
> re-install.  We could continue to use the old Anaconda distribution
> on Lustre, but we have successfully tested an alternative solution of
> simply duplicating all of the hard links.  Either of these solutions
> allows the old "anaconda/colsa" software to remain as it was.  Linking
> to the old Lustre would only be a temporary solution, since that
> hardware is old and off maintenance.  Our goal has been to cleanly
> complete the migration and not be reliant on any of the outdated
> hardware.  Testing so far indicates the duplicate copy of all the
> links solves the anaconda/colsa distribution problems.
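> As an illustration of the hard-link duplication approach (a sketch
> only, not necessarily the exact procedure we used): rsync copies
> hard-linked files as independent files unless --hard-links is given,
> so a plain archive copy turns every link into its own full copy on
> the destination.  The paths below are hypothetical examples.
>
>   # Copy a tree so hard-linked files become independent copies on the
>   # destination (deliberately omit -H/--hard-links).
>   rsync -a /old/lustre/anaconda/colsa/ /new/beegfs/anaconda/colsa/
>   # Adding -H would instead try to preserve the hard links, which the
>   # BeeGFS destination cannot represent across directories.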
> #2 Anaconda files in the local user environments.  Most of these are
> located in a user's "~/.conda" folder, but some users have separate
> Miniconda or Anaconda installations in their home areas.  There are a
> number of these environments scattered across Premise.  The best
> solution remains to re-install them, but we do not fully understand
> how big of an issue this is for the users involved.  The best path
> forward may not be the same for all users, so we welcome any insights
> you might have on this issue.  There is a way to export Anaconda
> environments and then re-install them elsewhere, but Toni reports
> this is often unsuccessful.  The best option is to remove all the old
> installs and start fresh, if that is viable for you.
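> For reference, a minimal sketch of that export/re-create workflow
> (the environment name is a placeholder, and as noted above this does
> not always work cleanly):
>
>   # Export an existing environment to a spec file.
>   conda env export -n myenv > myenv.yml
>   # A more portable spec (package versions only, no build strings):
>   conda env export -n myenv --no-builds > myenv-portable.yml
>   # Later, after removing the old install, re-create it from the spec.
>   conda env create -f myenv.yml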
> #3 There are other hard links used inside of user home
> directories.  Some of those include: conda, miniconda, ".julia/", git
> and hg repositories, src build areas, and some "blast datastores".
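> If you want to check your own home area for files affected by hard
> links, something like the following (GNU find) should list regular
> files with more than one link, grouped by inode:
>
>   # %n = link count, %i = inode, %p = path; sort by inode so that
>   # names sharing an inode (i.e., hard links) appear together.
>   find "$HOME" -type f -links +1 -printf '%n %i %p\n' | sort -k2 -n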
> I have details on what is affected for every group.  How to provide
> that data to each group is harder.  It might be best to assign a
> single point of contact for your group, and we can review the data
> with them.
> So when will it be online?  We are placing it back online now.
> Please log in and check your areas.  Run some code and verify things
> are working.  There is a high likelihood that we will need to tune
> some of the storage parameters over the next few days.  Some have
> been updated during our work, but the normal Premise load is likely
> to be very different from the file copies and checks it has been
> running for the last few weeks.  Please ramp up your jobs and confirm
> things are working first; we'll monitor and adjust over the next few
> days.  Hold off on large-scale work if possible until we know it's
> all good.
> The old Lustre storage is still functional and available if
> needed.  We also have a full backup of all data from Lustre, in
> addition to the copy now moved to BeeGFS storage.  I believe your
> data is all safe, even if we lost the Lustre storage tomorrow.  Let
> us know if you find any other issues or need help resolving any of
> the problems noted above.
> Thanks for your patience during this extremely long downtime.
> Robert Anderson and the RCC HPC team.
> On Thu, 2019-12-19 at 09:41 -0500, Robert Anderson wrote:
> > The Premise HPC reconfiguration is not yet complete. 
> > The largest remaining task is to validate that all of the user data
> > was successfully moved. All user data has been migrated from Lustre
> > to BeeGFS storage. We are confirming it is an exact copy. This
> > process started yesterday and many group areas have been scanned
> > and confirmed to be exact copies. 
> > The last remaining system task is to reimage the old Premise
> > head/login node to be just a login node.  This was started this
> > morning and is not expected to take long to complete.
> > RCC believes that user data is the most important part of Premise
> > and we want to ensure an accurate copy has been made. Given the
> > size of the data being checked this process could drag out for some
> > groups.
> > At this point we are going to keep Premise offline until the number
> > of groups with unconfirmed data is smaller.
> > In the meantime we will try to determine a way to safely allow the
> > checked groups access to Premise.  Worst case could be Monday
> > morning, but another email update will be sent this evening. 
> > Sorry for the extra downtime. 
> >
> > On November 12, 2019 16:42:27 Robert Anderson <rea at sr.unh.edu>
> > wrote:
> > > The Premise HPC cluster is in need of a scheduled downtime for
> > > reconfiguration, migration to the new storage, and general system
> > > upgrades.  We hope that by scheduling a month out people can work
> > > around these dates, and that Premise will be ready for the many
> > > jobs expected during the long holiday break. 
> > > Our plan is to shut down first thing Monday morning 12/16.  The
> > > main upgrades will occur on Monday.  We will then work to migrate
> > > as much data as possible from Lustre to the new BeeGFS
> > > storage.  We have moved the data over multiple times, but the copy
> > > immediately goes stale with every job you run.
> > > Given the size of the data storage on Premise there is little
> > > chance we can complete all the migrations within a three-day
> > > window.  We will start with the smallest groups and provide a
> > > detailed status update for the larger groups on the 18th.
> > > You can help to complete the storage migration by:
> > > 1. Cleaning up anything currently stored on Premise that you do
> > > not need.  This would be a great chance to ensure the 2nd copy of
> > > your data is complete.
> > > 2. "STATIC" If you store a lot of data on Premise please let us
> > > know the areas that we can copy now that will NOT have to be
> > > updated later.  If you have any large  datasets that need to be
> > > moved but will not change please provide the  path to them so we
> > > can copy them  in the weeks before 12/14 and NOT attempt to
> > > update them during this short window.  If in doubt provide the
> > > path, as we are not reformatting the old storage immediately, so
> > > it will be available to us for awhile AFTER this planned storage
> > > migration window.
> > > 3. "CRITICAL" On the other hand if you have critical area that
> > > you really need, please provide the path(s) so that we can ensure
> > > it is moved during the scheduled 3 day window.  This only makes
> > > sense if your group has multiple TB of storage on Premise,
> > > smaller groups (<5TB) do not need to specify their critical
> > > areas, since we will have time to easily move your entire group
> > > area.
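> > > For items 2 and 3 above, a quick way to see where your largest
> > > data lives is a depth-limited disk usage summary (a sketch; the
> > > path is a placeholder for your own group or home area):
> > >
> > >   # Summarize top-level directory sizes and list the largest first.
> > >   du -h --max-depth=1 /path/to/your/group/area | sort -hr | head -20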
> > > Please email questions or responses to #2 "STATIC" & #3 "CRITICAL"
> > > above to:  RCCOPS at sr.unh.edu
> > > Thanks for your cooperation.
> > > -- 
> > > Robert Anderson <rea at sr.unh.edu>
> > > UNH RCC