[Zaphod-Users] batch problem - server kills the run
tom fogal
tfogal at apollo.sr.unh.edu
Fri Oct 27 21:09:56 EDT 2006
<1161980024.5605.16.camel at localhost>Alexander Vapirev writes:
>Hi,
Hi... please send these to zaphod-users. I happen to have responded to
you to lend a hand, but I don't want people to start thinking I'm their
personal zaphod support guy.
>is there any "manual" how to create and run jobs on zaphod?
I don't understand what you mean here.
>This is what I do:
>
>compile with f77 from command line:
>
>f77 -O1 -w -o ram.out ramvst_AV_new.f fct-source.f msis86.f iri87.f
>IntRoutineVp.f Geopack_2003.for T04_s.f
Just to make sure, if you just run `ram.out' at this point -- without
trying to submit it as a job on zaphod -- it works fine, correct?
>run the following batch:
>
>-----
>#!/bin/csh
>#PBS -q long -l nodes=1:ppn=1:myri
>#PBS -l walltime=72:00:00
>cd /home/space/vapirev/2001storm-ram-new/
>./ram.out
>date
>-----
>
>./ramque.batch
Run it? Is the above csh script `ramque.batch'? You don't want to run
this manually. You want to submit it to the scheduler using qsub.
Please read the zaphod user wiki, in particular this article:
http://www.zaphod.sr.unh.edu/wiki/index.php?title=Run_an_MPI_job
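As a rough sketch of what submitting looks like in practice, assuming the
csh script quoted above is saved as `ramque.batch' in the current directory
(the variable names and the guard around `qsub' are only illustrative):

```shell
batch=ramque.batch   # assumed file name for the csh script quoted above

if ! command -v qsub >/dev/null 2>&1; then
    # Not on a machine with the scheduler tools installed.
    submitted="no-qsub"
    echo "qsub not found; log in to a host where the scheduler is installed"
elif qsub "$batch"; then
    submitted="yes"      # qsub prints the job identifier on success
    showq                # the job stays listed here until it completes
else
    submitted="failed"
    echo "qsub rejected $batch"
fi
```

The point is only that the script is handed to `qsub' rather than executed
with `./ramque.batch'; the `#PBS' lines are read by the scheduler, not by
your shell.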
>and server the output after about 40 mins:
I happen to have a fairly good guess at what's going on here. In
general though you should provide more information than this. For
example, what program generated this message? Does `server' mean the
host you are ssh'd in to? Is this repeatable 100% of the time?
>Killed
>Fri Oct 27 16:05:46 EDT 2006
The `Fri Oct [. . .]' is coming from when you run the `date' program in
your script. The `Killed' is coming from your shell, informing you
that your job has been killed. This is almost assuredly because you
used up all of the memory on the machine.
Use `dmesg' after you get strange messages you don't understand. If
you do it soon after you get this message, it's likely you'll see a
more descriptive message about how the kernel killed it because of an
OOM (out of memory) condition, along with the program name.
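One possible way to pick the relevant lines out of the kernel log is
sketched below. The helper name `oom_lines' is made up for this example,
and the search patterns are an assumption: the exact wording of the
kernel's message differs between kernel versions.

```shell
# Keep only kernel log lines that look like they come from the OOM killer.
# The patterns are a guess at common wordings, not an exhaustive list.
oom_lines() {
    grep -i -E 'out of memory|oom-killer|killed process'
}

# Show the most recent matches; stays quiet if dmesg is not readable
# or nothing matched.
recent=$(dmesg 2>/dev/null | oom_lines | tail -n 20) || true
if [ -n "$recent" ]; then
    echo "$recent"
else
    echo "no OOM-related kernel messages found"
fi
```

Run it on the same machine where the program was killed, and soon
afterwards, since older kernel messages get pushed out of the log.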
Since you're running out of memory, I'll pre-empt your next problem
and tell you to submit it to the `storage' queue instead of the `long'
queue, once you start submitting jobs properly. Just change `-q long'
to `-q storage' in your batch script.
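If you would rather not edit the file by hand, the change can be sketched
with `sed'. The file names `ramque_demo.batch' and `ramque_storage.batch'
are made up for this example so that nothing of yours gets overwritten;
the script body is the one quoted earlier in this message.

```shell
# Recreate the quoted batch script under a demo name (hypothetical file name).
cat > ramque_demo.batch <<'EOF'
#!/bin/csh
#PBS -q long -l nodes=1:ppn=1:myri
#PBS -l walltime=72:00:00
cd /home/space/vapirev/2001storm-ram-new/
./ram.out
date
EOF

# Swap the queue name, touching only the #PBS directive lines.
sed '/^#PBS/ s/-q long/-q storage/' ramque_demo.batch > ramque_storage.batch

# Check the result before submitting it.
grep '^#PBS' ramque_storage.batch
```

Only the queue name changes here. Whether the `storage' queue accepts the
same node and walltime requests as `long' is not something this sketch
checks.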
>On Fri, 2006-10-27 at 15:09 -0400, tom fogal wrote:
>> <1161967663.5605.1.camel at localhost>Alexander Vapirev writes:
>> >honestly, I have no idea what are those files and where to find them. I
>> >am complete ignorant when comes to running a job on a server.
>>
>> They will appear in the directory you ran the `qsub' command from,
>> after the job has completed (doesn't appear in `showq' anymore).
>>
Again:
#### EMPHASIS #####
>> This is a private reply to you, but make sure you mail the list when
>> you copy those output files.
#### EMPHASIS #####
Thanks,
-tom