[Trillian-users] Trillian PBS

Yu, Jiexiang Jiexiang.Yu at unh.edu
Fri May 19 08:33:53 EDT 2017


Dear Jimmy and Liang,
I apologize for my emotional response. I was upset only because, had I been informed of the reboot, the progress of a two-week-long job could have been saved. I truly appreciate everything everyone does to keep this cluster running so efficiently and conveniently.
Best regards,
Jie-xiang

On May 19, 2017, at 00:31, Liang Wang <frank0734 at gmail.com> wrote:
Hi Jiexiang,

I fully understand your frustration. But as a user just like you (I do not maintain Trillian), I hope you can also appreciate how difficult it is to maintain a computing resource like Trillian at a university, and perhaps show more appreciation for the people who invest their time doing so (within or beyond their job duties). The past month has been particularly tough, with many new issues, and the cluster has not been working consistently for us. As you said, this recent reboot was unexpected for some of us (including me), but the important thing is to solve the problem, which, as Jimmy said, "requires some compromise". I am sorry your job was restarted without a heads-up, but I also have to say that complaining alone does not help anyone; hopefully the new job-scheduling policy will improve the situation.

-- Liang Wang

On Thu, May 18, 2017 at 6:22 PM, Yu, Jiexiang <Jiexiang.Yu at unh.edu> wrote:

BUT the fact is the PBS system HAS BEEN rebooted!

The worst thing is that all jobs were RESTARTED.

GOOD JOB!


________________________________
From: Trillian-users <trillian-users-bounces at lists.sr.unh.edu> on behalf of Jimmy Raeder <J.Raeder at unh.edu>
Sent: Thursday, May 18, 2017 3:17 PM
To: trillian-users at lists.sr.unh.edu
Subject: [Trillian-users] Trillian PBS


As many of you have noticed, the Trillian PBS batch system has had issues lately: jobs have been getting stuck in the input queue even though enough nodes were available.

Unfortunately, we have not been able to pinpoint the cause. In addition, we no longer subscribe to Cray support, so any help they give us will come at a cost. We have an open ticket with Cray now, and hopefully the issue will be resolved soon.

In the meantime, the only remedy for freeing the stuck nodes is a reboot, which kills any jobs that are still running. This requires a compromise, since waiting for active jobs to finish can leave large portions of the machine unusable for a long time.

For the time being, we will adopt the policy of not waiting for jobs to finish if they have already run for more than 3 days. In other words, jobs that run longer than 3 days are at risk of being terminated. It is therefore advisable to use more nodes and shorter runtimes. Please note that most computing centers limit wall-clock time to 1-2 days, because long-running jobs make it much harder to manage the machine properly.
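
One common way to live with such a limit is to break a long run into a chain of shorter jobs that checkpoint and resubmit themselves. Below is a minimal sketch of such a PBS script; the job name, node and core counts, executable, and file names are placeholders, it assumes your code can restart from its own checkpoint files and creates a marker file (here run.complete) when it finishes, and aprun is the usual Cray launcher (substitute your own launch command if different):

    #!/bin/bash
    #PBS -N myrun
    #PBS -l nodes=8:ppn=32      # placeholder: prefer more nodes per job
    #PBS -l walltime=48:00:00   # stay safely under the 3-day limit
    #PBS -j oe

    cd $PBS_O_WORKDIR

    # Run the application; it is assumed to pick up the most recent
    # checkpoint file on startup and write new ones as it goes.
    aprun -n 256 ./my_code

    # If the run has not yet written its completion marker, resubmit
    # this script (saved as job.pbs in the submission directory) so
    # the next job continues from the last checkpoint.
    if [ ! -f run.complete ]; then
        qsub job.pbs
    fi

With this pattern, a two-week computation becomes a chain of seven 48-hour jobs, and a reboot costs at most two days of progress rather than the whole run.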

—  Jimmy Raeder

--------------------------------------------------------------------------------------------------
Joachim (Jimmy) Raeder
Professor of Physics, Department of Physics & Space Science Center
University of New Hampshire
245G Morse Hall, 8 College Rd, Durham, NH 03824-3525
voice: 603-862-3412  mobile: 603-502-9505  assistant: 603-862-1431
e-mail: J.Raeder at unh.edu
WWW: http://mhd.sr.unh.edu/~jraeder/tmp.homepage
--------------------------------------------------------------------------------------------------






