[Trillian-users] Trillian PBS

Fri May 19 16:57:26 EDT 2017

If you put the following into the bash script that you submit to qsub, it
will exit before running your code if the output file already exists:

{
# Skip the run if a previous run already produced output.
if [ -f "$MY_OUTPUT_FILE" ]; then
    echo "OUTPUT FILE EXISTS"
    exit 0
fi
}

Of course, you need to define MY_OUTPUT_FILE as appropriate, and this
assumes you are submitting a bash script. I prefer this to any queue option
because it keeps my scripts a bit more portable.
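
As a rough sketch of how the check might sit in a complete job script, it
can be wrapped in a small function and combined with the "-r n" directive
that Kai mentions below (which, per his note, is untested on the Cray). The
file name output.dat, the function names, and ./my_code are placeholders for
illustration only, not anything Trillian-specific.

```shell
#!/bin/bash
# Kai's suggestion: mark the job as not rerunnable (untested on the Cray).
#PBS -r n

# Hypothetical output path; set MY_OUTPUT_FILE to whatever your code writes.
MY_OUTPUT_FILE="${MY_OUTPUT_FILE:-output.dat}"

# Succeeds (status 0) when the given file already exists.
output_exists() {
    [ -f "$1" ]
}

# Runs the given command only if no earlier output is present.
# Usage: run_unless_output_exists FILE COMMAND [ARGS...]
run_unless_output_exists() {
    local outfile="$1"
    shift
    if output_exists "$outfile"; then
        echo "OUTPUT FILE EXISTS: $outfile"
        return 0
    fi
    "$@"
}

# Example call (hypothetical executable name):
# run_unless_output_exists "$MY_OUTPUT_FILE" ./my_code
```

The function returns rather than calling exit, so it can be tried out in an
interactive shell without closing it; in a job script the call would simply
be the last line.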

Jamie

On Fri, May 19, 2017 at 4:40 PM, Germaschewski, Kai <
Kai.Germaschewski at unh.edu> wrote:

> If you don't want jobs to restart, there's a "-r n" option to qsub that
> should prevent jobs from being rerun after the system goes down -- though I
> haven't tried whether it actually works on the Cray.
>
> Otherwise, of course you can build something into your batch script to see
> whether there's already output, in which case you could choose to not
> actually start the code again and overwrite things.
>
> --Kai
>
>
> On Thu, May 18, 2017 at 6:28 PM Yu, Jiexiang <Jiexiang.Yu at unh.edu> wrote:
>
>> BUT the fact is PBS system HAS BEEN rebooted!
>>
>> What is worse, all jobs have been RESTARTED.
>>
>> GOOD JOBS!
>>
>>
>> ------------------------------
>> *From:* Trillian-users <trillian-users-bounces at lists.sr.unh.edu> on
>> behalf of Jimmy Raeder <J.Raeder at unh.edu>
>> *Sent:* Thursday, May 18, 2017 3:17 PM
>> *To:* trillian-users at lists.sr.unh.edu
>> *Subject:* [Trillian-users] Trillian PBS
>>
>>
>> As many of you have noticed, the Trillian PBS batch system has had issues
>> lately, such that jobs got stuck in the input queue despite there being
>> enough nodes available.
>>
>> Unfortunately, we have not been able to pinpoint the cause.  In addition,
>> we no longer subscribe to Cray support, so any help they give us will come
>> with a cost.  We have an open ticket w/Cray now, and hopefully the issue
>> will be resolved soon.
>>
>> In the meantime, the only remedy to free the stuck nodes is a reboot,
>> which will kill any jobs that are still running.  This requires a
>> compromise, since waiting for active jobs to finish can leave large
>> portions of the machine unusable for a long time.
>>
>> For the time being, our policy is that we will not wait for jobs to
>> finish if they have already been running for more than 3 days.  In other
>> words, jobs that run longer than 3 days are at risk of being terminated.
>> We therefore advise using more nodes and shorter runtimes.  Please note
>> that most computing centers typically limit wall-clock time to 1-2 days,
>> because long-running jobs make it much harder to manage the machine
>> properly.
>>
>> —  Jimmy Raeder
>>
>> ------------------------------------------------------------
>> --------------------------------------
>> Joachim (Jimmy) Raeder
>> Professor of Physics, Department of Physics & Space Science Center
>> University of New Hampshire
>> 245G Morse Hall, 8 College Rd, Durham, NH 03824-3525
>> voice: 603-862-3412  mobile: 603-502-9505  assistant: 603-862-1431
>> e-mail: J.Raeder at unh.edu
>> WWW: http://mhd.sr.unh.edu/~jraeder/tmp.homepage
>> ------------------------------------------------------------
>> --------------------------------------
>>
>>
>>
>> _______________________________________________
>> Trillian-users mailing list
>> Trillian-users at lists.sr.unh.edu
>> http://lists.sr.unh.edu/mailman/listinfo/trillian-users
>>
>