[Trillian-users] Trillian PBS

Liang Wang frank0734 at gmail.com
Fri May 19 00:31:15 EDT 2017


Hi Jiexiang,

I fully understand your frustration. But as a user just like you (I do not
maintain Trillian), I also hope you can appreciate how difficult it is to
maintain computing resources like Trillian at a university, and perhaps show
more appreciation to the people who invest their time doing so (within or
beyond their job duties). The past month has been particularly tough, with
many new issues, and the cluster has not been running consistently for us.
As you said, this recent reboot was unexpected for some of us (including
me), but the important thing now is to solve the problem, which, as Jimmy
said, "requires a compromise". I am sorry your job was restarted without a
heads-up, but I also have to say that complaining alone does not help
anyone; hopefully, the new job scheduling policy will improve the situation.

-- Liang Wang

On Thu, May 18, 2017 at 6:22 PM, Yu, Jiexiang <Jiexiang.Yu at unh.edu> wrote:

> BUT the fact is the PBS system HAS BEEN rebooted!
>
> The worse thing is that all jobs were RESTARTED.
>
> GOOD JOBS!
>
>
> ------------------------------
> From: Trillian-users <trillian-users-bounces at lists.sr.unh.edu> on
> behalf of Jimmy Raeder <J.Raeder at unh.edu>
> Sent: Thursday, May 18, 2017 3:17 PM
> To: trillian-users at lists.sr.unh.edu
> Subject: [Trillian-users] Trillian PBS
>
>
> As many of you have noticed, the Trillian PBS batch system has had issues
> lately, such that jobs got stuck in the input queue despite there being
> enough nodes available.
>
> Unfortunately, we have not been able to pinpoint the cause.  In addition,
> we no longer subscribe to Cray support, so any help they give us will come
> at a cost.  We have an open ticket with Cray now, and hopefully the issue
> will be resolved soon.
>
> In the meantime, the only remedy to free the stuck nodes is a reboot,
> which will kill any jobs that are still running.  This requires a
> compromise, since waiting for active jobs to finish can leave large
> portions of the machine unusable for a long time.
>
> For the time being, we will employ the policy of not waiting for jobs to
> finish if they have already run for more than 3 days.  In other words, if
> you run jobs longer than 3 days, they are at risk of being terminated.  It
> is therefore advisable to use more nodes and shorter runtimes.  Please note
> that most computing centers typically limit wall clock time to 1-2 days,
> because long-running jobs make it much harder to manage the machine
> properly.
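>
> To illustrate, a job script along these general lines requests more nodes
> with a shorter walltime (the node count, cores per node, walltime, and
> launcher below are placeholders for illustration, not Trillian-specific
> settings; adjust them to your application and the queue limits):
>
> #!/bin/bash
> # Example only: resource requests are placeholders, not recommended values.
> #PBS -N my_run
> #PBS -l nodes=8:ppn=32          # placeholder: request more nodes ...
> #PBS -l walltime=48:00:00       # ... and keep the walltime well under 3 days
> #PBS -j oe                      # join stdout and stderr into one file
>
> cd $PBS_O_WORKDIR
> mpirun ./my_application         # launcher is an assumption; Cray systems often use aprun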
>
> —  Jimmy Raeder
>
> --------------------------------------------------------------------------------------------------
> Joachim (Jimmy) Raeder
> Professor of Physics, Department of Physics & Space Science Center
> University of New Hampshire
> 245G Morse Hall, 8 College Rd, Durham, NH 03824-3525
> voice: 603-862-3412  mobile: 603-502-9505  assistant: 603-862-1431
> e-mail: J.Raeder at unh.edu
> WWW: http://mhd.sr.unh.edu/~jraeder/tmp.homepage
> --------------------------------------------------------------------------------------------------
>
> _______________________________________________
> Trillian-users mailing list
> Trillian-users at lists.sr.unh.edu
> http://lists.sr.unh.edu/mailman/listinfo/trillian-users
>
>

