[Trillian-users] Trillian PBS
James Pringle
jpringle at unh.edu
Fri May 19 16:57:26 EDT 2017
If you put the following into the bash file that you submit to qsub, it
will exit before your code starts if the output file already exists
{
    # Skip the run if the output file is already there
    if [ -f MY_OUTPUT_FILE ]; then
        echo "OUTPUT FILE EXISTS"
        exit 0
    fi
}
Of course, you need to define MY_OUTPUT_FILE as appropriate, and this
assumes you are submitting a bash file. I like this over any queue option
because then my scripts are a bit more portable.
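
A slightly fuller sketch of the same check, assuming a bash job script. The
file name my_run_output.dat and the helper output_exists are made up for
illustration; substitute whatever your code actually writes.

```shell
#!/bin/bash
# The "-r n" option mentioned elsewhere in this thread could also be given
# as a "#PBS -r n" line here; whether it works on the Cray is untested.

# Hypothetical output file name; override it from the environment if needed.
MY_OUTPUT_FILE="${MY_OUTPUT_FILE:-my_run_output.dat}"

# Succeeds (returns 0) only if the given path is an existing regular file.
output_exists() {
    [ -f "$1" ]
}

if output_exists "$MY_OUTPUT_FILE"; then
    echo "OUTPUT FILE EXISTS: $MY_OUTPUT_FILE"
    exit 0   # leave quietly so a restarted job does not overwrite results
fi

# No earlier output found, so start the real run below this line.
echo "No existing output, starting run"
```

Quoting "$MY_OUTPUT_FILE" keeps the test working if the path contains spaces.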
Jamie
On Fri, May 19, 2017 at 4:40 PM, Germaschewski, Kai <
Kai.Germaschewski at unh.edu> wrote:
> If you don't want jobs to restart, there's a "-r n" option to qsub that
> should prevent jobs from being rerun after the system goes down -- though I
> haven't tested whether it actually works on the Cray.
>
> Otherwise, of course, you can build something into your batch script to
> check whether there's already output, in which case you could choose not to
> start the code again, so that nothing gets overwritten.
>
> --Kai
>
>
> On Thu, May 18, 2017 at 6:28 PM Yu, Jiexiang <Jiexiang.Yu at unh.edu> wrote:
>
>> BUT the fact is PBS system HAS BEEN rebooted!
>>
>> The worst thing is all jobs are RESTARTED.
>>
>> GOOD JOBS!
>>
>>
>> ------------------------------
>> *From:* Trillian-users <trillian-users-bounces at lists.sr.unh.edu> on
>> behalf of Jimmy Raeder <J.Raeder at unh.edu>
>> *Sent:* Thursday, May 18, 2017 3:17 PM
>> *To:* trillian-users at lists.sr.unh.edu
>> *Subject:* [Trillian-users] Trillian PBS
>>
>>
>> As many of you have noticed, the Trillian PBS batch system has had issues
>> lately, such that jobs got stuck in the input queue despite there being
>> enough nodes available.
>>
>> Unfortunately, we have not been able to pinpoint the cause. In addition,
>> we no longer subscribe to Cray support, so any help they give us will come
>> with a cost. We have an open ticket w/Cray now, and hopefully the issue
>> will be resolved soon.
>>
>> In the meantime, the only remedy to free the stuck nodes is a reboot,
>> which will kill any jobs that are still running. This requires a
>> compromise, since waiting for active jobs to finish can leave large
>> portions of the machine unusable for a long time.
>>
>> For the time being, our policy is that we will not wait for jobs to
>> finish if they have already been running for more than 3 days. In other
>> words, jobs that run longer than 3 days are at risk of being terminated.
>> We therefore advise using more nodes and shorter runtimes. Please note
>> that most computing centers typically limit wall clock time to 1-2 days,
>> because long-running jobs make it much harder to manage the machine
>> properly.
>>
>> — Jimmy Raeder
>>
>> ------------------------------------------------------------
>> Joachim (Jimmy) Raeder
>> Professor of Physics, Department of Physics & Space Science Center
>> University of New Hampshire
>> 245G Morse Hall, 8 College Rd, Durham, NH 03824-3525
>> voice: 603-862-3412 mobile: 603-502-9505
>> assistant: 603-862-1431
>> e-mail: J.Raeder at unh.edu
>> WWW: http://mhd.sr.unh.edu/~jraeder/tmp.homepage
>> ------------------------------------------------------------
>>
>>
>>
>> _______________________________________________
>> Trillian-users mailing list
>> Trillian-users at lists.sr.unh.edu
>> http://lists.sr.unh.edu/mailman/listinfo/trillian-users
>>
>