During the past week a number of issues have been resolved.

Additional BeeGFS tuning has been done. However, one significant performance issue related to linux file system caching remains. BeeGFS does not use the linux file system cache, so job workloads that repeatedly re-read the exact same file(s) experience significantly slower access. Some runtimes have been estimated to be more than one order of magnitude longer. COLSA users appear to be affected the most since this access model is common in bioinformatic sequence assembly workloads.

A number of groups have decided to continue using the old Lustre storage until we are able to close the above mentioned runtime gap. The PI of any group may choose to go return to using Lustre home areas if this caching problem provides a hardship to their research objectives. This choice does come with risk, since the Lustre hardware is aging and no longer under maintenance. Unless your group is greatly impacted RCC would suggest you remain on the supported BeeGFS storage.

Updates have been complete this week to address:

Forced password change errors related to expired passwords

A brief network drop between the login node and the slurm server that occurred every 18 minutes.

Allow users write access to the old lustre storage.

Changed some high traffic software and user home areas back to lustre storage.

Anaconda package updates and changes to reflect environment changes.

(Probably additional changes that we've managed to forget...)

If you are experiencing any problems please let us know. Missing licenses and other minor configuration issues often remain undiscovered for weeks after changes like these. Please check early, before you really need it.

If you have been waiting for the dust to settle on this storage migration, we hope that it finally has. You should consider the Premise system to be fully back online, and start running your jobs again.

Thanks for you patience, and have a good weekend.

--

Robert Anderson <rea@sr.unh.edu>
UNH RCC

On Thu, 2020-01-02 at 15:52 -0500, Robert Anderson wrote:

Here is the long overdue Premise update. And yes there should have been a more of these, but it took a long time to understand the real problem. So most of the prior updates would have been "we can't seem to get a good copy, trying another way to move and or test the problem data".

Throughout the long break we continued our attempts to resolve discrepancies using various copy, backup and restore efforts. Eventually we discovered that the BeeGFS destination target does not support hard links when they link to anything outside of a single directory.

There are a few things affected by this:

#1 Anaconda. On Lustre Anaconda created a large number of hard links. A re-installation of Anaconda on BeeGFS resolves this issues, but there are a huge number of Anaconda packages in the bioinformatics space that would be impractical to quickly re-install. We could continue to use the old Anaconda distribution on Lustre, but we have successfully tested an alternative solution of just duplicating all of the hardlinks. Either of these solutions allow the old "anaconda/colsa" software to remain as it was. Linking to the old Lustre would only be a temporary solution, since that hardware is old and off maintenance. Our goal has been to cleanly complete the migration and not be reliant on any of the outdated hardware. Testing so far indicates the duplicate copy of all the links solves the anaconda/colsa distribution problems.

#2 Anaconda files in the local user environments. Most of these are located in a users "~/.conda" folder, but some users have separate minconda or anaconda versions in their home areas. There are a number of these environments scattered across Premise. The best solution remains to re-install them, but we do not fully understand how big of an issue this is for the users involved. The best path forward may not be the same for all users, we welcome any insights you might have on this issue. There is a way to export anaconda environments and then re-install them elsewhere, but Toni reports this is often unsuccessful. The best option is to remove all the old installs and start fresh, if that is a viable option for you.

#3 There are other hard links used inside of user home directories. Some of those include: conda, miniconda, ".julia/", git and hg repositories, src build areas, and some "blast datastores".

I have details on what is affected for every group. How to provide that data to each group is harder. It might be best to assign a single point of contact for your group and we can review the data for your group with you.

So when will it be online? We are placing it back online now.

Please login and check your areas. Run some code and verify things are working. There is a high likelyhood that we will need to tune some of the storage parameters over the next few days. Some have been updated during our work, but the normal Premise load is likely to be very different that the file copies and checks it has been running for the last few weeks. Please ramp up your jobs and confirm things are working first, we'll monitor things and adjust over the next few days. Hold off on large scale work if possible until we know it's all good.

The old Lustre storage is still functional and available if needed. We also have a full backup of all data from lustre, in addition to the copy now moved to BeeGFS storage. I believe your data is all safe, even if we lost the Lustre storage tomorrow. Let us know if you find any other issues or need help it resolving any of the problems noted above.

Thanks for you patience in this extremely long downtime.

Robert Anderson and the RCC HPC team.

On Thu, 2019-12-19 at 09:41 -0500, Robert Anderson wrote:

The Premise HPC reconfiguration is not yet complete.

The largest remaining task is to validate that all of the user data was successfully moved. All user data has been migrated from Lustre to BeeGFS storage. We are confirming it is an exact copy. This process started yesterday and many group areas have been scanned and confirmed to be exact copies.

The last remaing system task is to reimage the old Premise head/login node to be just a login node. This was started this morning and is not expected to take long to complete.

RCC believes that user data is the most important part of Premise and we want to ensure an accurate copy has been made. Given the size of the data being checked this process could drag out for some groups.

At this point we are going to keep Premise offline until the number of unconfirmed group data is smaller.

In the meantime we will try to determine a way to safely allow the checked groups access to Premise. Worst case could be Monday morning, but another email update will be sent this evening.

Sorry for the extra downtime.

On November 12, 2019 16:42:27 Robert Anderson <rea@sr.unh.edu> wrote:

The Premise HPC cluster is in need of a scheduled downtime for reconfiguration, migration to the new storage, and general system upgrades. We hope that by scheduling a month out people can work around these dates, and that Premise will be ready for the many jobs expected during the long holiday break.

Our plan is to shutdown first thing Monday morning 12/16. The main upgrades will occur on Monday. We will then work to migrate as much data as possible from Lustre to the new BeeGFS storage. We have moved over the data multiple times, but it gets immediately stale with every job output you run.

Given the size of the data storage on Premise there is little chance we can complete all the migrations within a three day window. We will start with the smallest groups and provide detailed status update for the larger groups on the 18th.

You can help to complete the storage migration by:

1. Cleaning up anything currently stored on Premise that you do not need. This would be a great chance to ensure the 2nd copy of your data is complete.

2. "STATIC" If you store a lot of data on Premise please let us know the areas that we can copy now that will NOT have to be updated later. If you have any large datasets that need to be moved but will not change please provide the path to them so we can copy them in the weeks before 12/14 and NOT attempt to update them during this short window. If in doubt provide the path, as we are not reformatting the old storage immediately, so it will be available to us for awhile AFTER this planned storage migration window.

3. "CRITICAL" On the other hand if you have critical area that you really need, please provide the path(s) so that we can ensure it is moved during the scheduled 3 day window. This only makes sense if your group has multiple TB of storage on Premise, smaller groups (<5TB) do not need to specify their critical areas, since we will have time to easily move your entire group area.

Please email question or responses to #2 "STATIC" & #3 "CRITICAL" above to: RCCOPS@sr.unh.edu

Thanks for your cooperation.
--

Robert Anderson <rea@sr.unh.edu>
UNH RCC