If you put the following into the bash script that you submit to qsub, it will stop your job if the output file already exists:

if [ -f MY_OUTPUT_FILE ]; then
    echo "OUTPUT FILE EXISTS"
    exit 0
fi

Of course, you need to define MY_OUTPUT_FILE as appropriate, and this assumes you are submitting a bash script. I prefer this over any queue option because it keeps my scripts a bit more portable. A sketch of how it might fit into a full submission script is below.
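Here is a minimal sketch of what that could look like in a complete script; the job name, walltime, output filename, and launch line are all placeholders, so adjust them to your own setup:

#!/bin/bash
#PBS -N my_run                    # job name (placeholder)
#PBS -l walltime=24:00:00         # requested wall clock time (placeholder)

cd "$PBS_O_WORKDIR"               # run from the directory the job was submitted from

# Placeholder: point this at whatever file your code actually writes.
MY_OUTPUT_FILE="output.dat"

if [ -f "$MY_OUTPUT_FILE" ]; then
    echo "OUTPUT FILE EXISTS"
    exit 0
fi

# Launch the actual code here, e.g. on the Cray:
# aprun -n 64 ./my_code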


Jamie

On Fri, May 19, 2017 at 4:40 PM, Germaschewski, Kai <Kai.Germaschewski@unh.edu> wrote:
If you don't want jobs to restart, there's a "-r n" option to qsub that should prevent jobs from being rerun after the system goes down -- though I haven't tried whether it actually works on the Cray.
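For reference, it can be given either on the command line or as a directive inside the batch script (the script name is just a placeholder):

qsub -r n my_job.sh

or, inside the script:

#PBS -r n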

Otherwise, of course you can build something into your batch script to see whether there's already output, in which case you could choose to not actually start the code again and overwrite things.

--Kai


On Thu, May 18, 2017 at 6:28 PM Yu, Jiexiang <Jiexiang.Yu@unh.edu> wrote:

BUT the fact is the PBS system HAS BEEN rebooted!

The worst thing is that all jobs were RESTARTED.

GOOD JOBS!



From: Trillian-users <trillian-users-bounces@lists.sr.unh.edu> on behalf of Jimmy Raeder <J.Raeder@unh.edu>
Sent: Thursday, May 18, 2017 3:17 PM
To: trillian-users@lists.sr.unh.edu
Subject: [Trillian-users] Trillian PBS
 

As many of you have noticed, the Trillian PBS batch system has had issues lately, such that jobs got stuck in the input queue despite there being enough nodes available.

Unfortunately, we have not been able to pinpoint the cause.  In addition, we no longer subscribe to Cray support, so any help they give us will come at a cost.  We have an open ticket w/Cray now, and hopefully the issue will be resolved soon.

In the meantime, the only remedy to free the stuck nodes is a reboot, which will kill any jobs that are still running.  This requires a compromise, since waiting for active jobs to finish can leave large portions of the machine unusable for a long time.

For the time being, we will employ the policy that we will not wait for jobs to finish if they have already been running for more than 3 days.  In other words, if you run jobs longer than 3 days, they are at risk of being terminated.  It is thus advised to use more nodes and shorter runtimes instead.  Please note that most computing centers typically limit wall clock time to 1-2 days, because long-running jobs make it much harder to manage the machine properly.
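As a rough illustration of the shorter-runtime approach, a resource request along these lines stays under the 3-day limit; the exact node/PE syntax differs between PBS flavors (the mppwidth form is the older Cray/ALPS style), so treat the values as placeholders:

#PBS -l walltime=48:00:00    # well under the 3-day limit
#PBS -l mppwidth=256         # ask for more PEs to keep the runtime short (placeholder value)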

—  Jimmy Raeder

--------------------------------------------------------------------------------------------------
Joachim (Jimmy) Raeder
Professor of Physics, Department of Physics & Space Science Center
University of New Hampshire
245G Morse Hall, 8 College Rd, Durham, NH 03824-3525
voice: 603-862-3412  mobile: 603-502-9505  assistant: 603-862-1431
e-mail: J.Raeder@unh.edu
WWW: http://mhd.sr.unh.edu/~jraeder/tmp.homepage
--------------------------------------------------------------------------------------------------



_______________________________________________
Trillian-users mailing list
Trillian-users@lists.sr.unh.edu
http://lists.sr.unh.edu/mailman/listinfo/trillian-users