[Zaphod-Users] batch problem - server kills the run

tom fogal tfogal at apollo.sr.unh.edu
Fri Oct 27 21:09:56 EDT 2006


 <1161980024.5605.16.camel at localhost>Alexander Vapirev writes:
>Hi,

Hi... please send these to zaphod-users.  I happen to have responded to
you to lend a hand, but I don't want people to start thinking I'm their
personal zaphod support guy.

>is there any "manual" how to create and run jobs on zaphod?

I don't understand what you mean here..

>This is what I do:
>
>compile with f77 from command line:
>
>f77 -O1 -w -o ram.out ramvst_AV_new.f fct-source.f msis86.f iri87.f
>IntRoutineVp.f Geopack_2003.for T04_s.f

Just to make sure, if you just run `ram.out' at this point -- without
trying to submit it as a job on zaphod -- it works fine, correct?

>run the following batch:
>
>-----
>#!/bin/csh
>#PBS -q long -l nodes=1:ppn=1:myri
>#PBS -l walltime=72:00:00
>cd /home/space/vapirev/2001storm-ram-new/
>./ram.out
>date
>-----
>
>./ramque.batch

Run it?  Is the above csh script `ramque.batch'?  You don't want to run
this manually.  You want to submit it to the scheduler using qsub.

Please read the zaphod user wiki, in particular this article:

   http://www.zaphod.sr.unh.edu/wiki/index.php?title=Run_an_MPI_job
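
As a rough sketch of what submitting looks like (not a substitute for the
wiki article): the file name `ramque.batch' comes from the message above,
and `showq' is the queue listing command mentioned later in this thread.
The `submitted' variable and the guard checks are only illustrative, and
the sketch is written for a Bourne-style shell rather than csh.

```shell
#!/bin/sh
# Sketch only: hand the batch script to the scheduler instead of executing it.
script=ramque.batch   # file name taken from the quoted message

submitted=no
if command -v qsub >/dev/null 2>&1 && [ -f "$script" ]; then
    # qsub prints a job identifier on success
    if qsub "$script"; then
        submitted=yes
    fi
else
    echo "qsub or $script not found; run this from the cluster login node" >&2
fi

# showq lists queued and running jobs; guarded because it may not be installed
if command -v showq >/dev/null 2>&1; then
    showq
fi
echo "submitted: $submitted"
```

The point is that `./ramque.batch' runs the program on whatever host you
are logged in to, while `qsub ramque.batch' lets the scheduler pick a node
and honor the `#PBS' lines.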

>and server the output after about 40 mins:

I happen to have a fairly good guess at what's going on here.  In
general though you should provide more information than this.  For
example, what program generated this message?  Does `server' mean the
host you are ssh'd in to?  Is this repeatable 100% of the time?

>Killed
>Fri Oct 27 16:05:46 EDT 2006

The `Fri Oct [. . .]' is coming from when you run the `date' program in
your script.  The `Killed' is coming from your shell, informing you
that your job has been killed.  This is almost assuredly because you
used up all of the memory on the machine.

Use `dmesg' after you get strange messages you don't understand.  If
you do it soon after you get this message, it's likely you'll see a
more descriptive message about how the kernel killed it because of an
OOM (out-of-memory) condition, along with the program name.
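
For instance, something along these lines narrows the kernel log down to
the interesting part.  This is only a sketch: the exact wording of the
kernel's message differs between kernel versions, so the search patterns
are a guess, and `oom_lines' is a helper name made up for this example.
On some systems reading the kernel log needs extra privileges.

```shell
#!/bin/sh
# Keep only the lines that look like out-of-memory kills.
# The patterns are assumptions; kernel message wording varies by version.
oom_lines() {
    grep -i -E 'out of memory|oom|killed process'
}

# Show the most recent matches from the kernel log, if it is readable.
dmesg 2>/dev/null | oom_lines | tail -n 5 || true
```

If `ram.out' shows up in one of those lines shortly after the `Killed'
message, that confirms the kernel ended it for lack of memory.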

Since you're running out of memory, I'll pre-empt your next problem
and tell you to submit it to the `storage' queue instead of the `long'
queue, once you start submitting jobs properly.  Just change `-q long'
to `-q storage' in your batch script.
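
If you would rather not open an editor, a one-line `sed' can make that
change.  A sketch, using the `#PBS' line from the quoted script as input;
`to_storage_queue' is a name invented for this example, and whether the
rest of the resource request (`nodes=1:ppn=1:myri') is valid on the
`storage' queue is not something this thread establishes.

```shell
#!/bin/sh
# Rewrite "-q long" to "-q storage", but only on #PBS directive lines.
to_storage_queue() {
    sed 's/^\(#PBS.*-q \)long/\1storage/'
}

# Demonstration on the directive from the quoted batch script.
printf '%s\n' '#PBS -q long -l nodes=1:ppn=1:myri' | to_storage_queue
```

To apply it to the real script, something like
`to_storage_queue < ramque.batch > ramque-storage.batch' would work, where
the output file name is just a placeholder.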

>On Fri, 2006-10-27 at 15:09 -0400, tom fogal wrote:
>>  <1161967663.5605.1.camel at localhost>Alexander Vapirev writes:
>> >honestly, I have no idea what are those files and where to find them. I
>> >am complete ignorant when comes to running a job on a server.
>> 
>> They will appear in the directory you ran the `qsub' command from,
>> after the job has completed (doesn't appear in `showq' anymore).
>> 

Again:
   #### EMPHASIS #####
>> This is a private reply to you, but make sure you mail the list when
>> you copy those output files.
   #### EMPHASIS #####

Thanks,

-tom

