[Trillian-users] Trillian PBS

Kai Germaschewski kai.germaschewski at unh.edu
Fri May 19 16:40:49 EDT 2017


If you don't want jobs to restart, there's a "-r n" option to qsub, which
should prevent jobs from being rerun after the system goes down -- though I
haven't tested whether it actually works on the Cray.

Otherwise, of course, you can build a check into your batch script to see
whether there's already output, in which case you could choose not to
start the code again and overwrite things.
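
A minimal sketch of such a check, written in Python and called from the
batch script before the code is launched. The directory name "run_output"
and the "*.out" pattern are hypothetical placeholders, not anything specific
to Trillian; substitute whatever your code actually writes.

```python
import sys
from pathlib import Path


def has_existing_output(output_dir, pattern="*.out"):
    """Return True if output_dir already holds files matching pattern."""
    directory = Path(output_dir)
    if not directory.is_dir():
        # No directory yet means nothing has been written.
        return False
    return any(p.is_file() for p in directory.glob(pattern))


def main(argv):
    # "run_output" is an assumed default; pass the real directory as argv[1].
    output_dir = argv[1] if len(argv) > 1 else "run_output"
    if has_existing_output(output_dir):
        print(f"Output already present in {output_dir}; not starting again.")
        return 1
    print(f"No output found in {output_dir}; safe to start.")
    return 0
```

If the batch script passes the return value of main(sys.argv) to sys.exit,
it can launch the code only when the check exits with 0, so a job that is
rerun after a reboot leaves the earlier output untouched.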

--Kai


On Thu, May 18, 2017 at 6:28 PM Yu, Jiexiang <Jiexiang.Yu at unh.edu> wrote:

> BUT the fact is that the PBS system HAS BEEN rebooted!
>
> The worst thing is that all jobs were RESTARTED.
>
> GOOD JOBS!
>
>
> ------------------------------
> *From:* Trillian-users <trillian-users-bounces at lists.sr.unh.edu> on
> behalf of Jimmy Raeder <J.Raeder at unh.edu>
> *Sent:* Thursday, May 18, 2017 3:17 PM
> *To:* trillian-users at lists.sr.unh.edu
> *Subject:* [Trillian-users] Trillian PBS
>
>
> As many of you have noticed, the Trillian PBS batch system has had issues
> lately, such that jobs got stuck in the input queue despite there being
> enough nodes available.
>
> Unfortunately, we have not been able to pinpoint the cause.  In addition,
> we no longer subscribe to Cray support, so any help they give us will come
> with a cost.  We have an open ticket with Cray now, and hopefully the issue
> will be resolved soon.
>
> In the meantime, the only remedy to free the stuck nodes is a reboot,
> which will kill any jobs that are still running.  This requires a
> compromise, since waiting for active jobs to finish can leave large
> portions of the machine unusable for a long time.
>
> For the time being, our policy is that we will not wait for jobs to
> finish if they have already been running for more than 3 days.  In other
> words, jobs that run longer than 3 days are at risk of being terminated.
> We therefore advise using more nodes and shorter runtimes.  Please note
> that most computing centers typically limit wall clock time to 1-2 days,
> because long-running jobs make it much harder to manage the machine
> properly.
>
> —  Jimmy Raeder
>
>
> --------------------------------------------------------------------------------------------------
> Joachim (Jimmy) Raeder
> Professor of Physics, Department of Physics & Space Science Center
> University of New Hampshire
> 245G Morse Hall, 8 College Rd, Durham, NH 03824-3525
> voice: 603-862-3412  mobile: 603-502-9505  assistant: 603-862-1431
> e-mail: J.Raeder at unh.edu
> WWW: http://mhd.sr.unh.edu/~jraeder/tmp.homepage
>
> --------------------------------------------------------------------------------------------------