[Zaphod-Users] batch problem - server kills the run

Alexander Vapirev avapirev at cisunix.unh.edu
Fri Oct 27 21:52:15 EDT 2006


Quoting tom fogal <tfogal at apollo.sr.unh.edu>:

> <1161980024.5605.16.camel at localhost>Alexander Vapirev writes:
>> Hi,
>
> Hi... please send these to zaphod-users.  I happen to have responded to
> you to lend a hand, but I don't want people to start thinking I'm their
> personal zaphod support guy.
>
>> is there any "manual" how to create and run jobs on zaphod?
>
> I don't understand what you mean here.
>
>> This is what I do:
>>
>> compile with f77 from command line:
>>
>> f77 -O1 -w -o ram.out ramvst_AV_new.f fct-source.f msis86.f iri87.f
>> IntRoutineVp.f Geopack_2003.for T04_s.f
>
> Just to make sure, if you just run `ram.out' at this point -- without
> trying to submit it as a job on zaphod -- it works fine, correct?
>
>> run the following batch:
>>
>> -----
>> #!/bin/csh
>> #PBS -q long -l nodes=1:ppn=1:myri
>> #PBS -l walltime=72:00:00
>> cd /home/space/vapirev/2001storm-ram-new/
>> ./ram.out
>> date
>> -----
>>
>> ./ramque.batch
>
> Run it?  Is the above csh script `ramque.batch'?  You don't want to run
> this manually.  You want to submit it to the scheduler using qsub.
>
> Please read the zaphod user wiki, in particular this article:
>
>   http://www.zaphod.sr.unh.edu/wiki/index.php?title=Run_an_MPI_job
>
>> and the server gives this output after about 40 mins:
>
> I happen to have a fairly good guess at what's going on here.  In
> general though you should provide more information than this.  For
> example, what program generated this message?  Does `server' mean the
> host you are ssh'd in to?  Is this repeatable 100% of the time?
>
>> Killed
>> Fri Oct 27 16:05:46 EDT 2006
>
> The `Fri Oct [. . .]' is coming from when you run the `date' program in
> your script.  The `Killed' is coming from your shell, informing you
> that your job has been killed.  This is almost assuredly because you
> used up all of the memory on the machine.
> Use `dmesg' after you get strange messages you don't understand.  If
> you do it soon after you get this message, it's likely you'll see a
> more descriptive message about how the kernel killed it because of an
> OOM (out of memory) condition, along with the program name.
>
> Since you're running out of memory, I'll pre-empt your next problem
> and tell you to submit it to the `storage' queue instead of the `long'
> queue, once you start submitting jobs properly.  Just change `-q long'
> to `-q storage' in your batch script.
>
>> On Fri, 2006-10-27 at 15:09 -0400, tom fogal wrote:
>>>  <1161967663.5605.1.camel at localhost>Alexander Vapirev writes:
>>> >honestly, I have no idea what those files are or where to find them. I
>>> >am completely ignorant when it comes to running a job on a server.
>>>
>>> They will appear in the directory you ran the `qsub' command from,
>>> after the job has completed (doesn't appear in `showq' anymore).
>>>
>
> Again:
>   #### EMPHASIS #####
>>> This is a private reply to you, but make sure you mail the list when
>>> you copy those output files.
>   #### EMPHASIS #####
>
> Thanks,
>
> -tom
> _______________________________________________
> Zaphod-Users mailing list
> Zaphod-Users at lists.sr.unh.edu
> http://lists.sr.unh.edu/mailman/listinfo/zaphod-users
>



Thanks a lot. I think this solves my problem. I didn't know how to use
qsub, and I suspected I was running out of memory.
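
For the archive, here is a rough sketch of what I understand the
corrected workflow to be. It recreates the batch script quoted above,
switches `-q long' to `-q storage', and submits the result with qsub
instead of executing it by hand. I am assuming that the other #PBS
options (the myri node property and the 72-hour walltime) are also
accepted by the storage queue, which I have not checked. The
ramque.batch.new temporary file name is just something I made up.

```shell
#!/bin/sh
# Sketch only: fix the queue in the batch script, then submit it with qsub.

# Recreate the batch script from this thread (queue still set to `long').
cat > ramque.batch <<'EOF'
#!/bin/csh
#PBS -q long -l nodes=1:ppn=1:myri
#PBS -l walltime=72:00:00
cd /home/space/vapirev/2001storm-ram-new/
./ram.out
date
EOF

# Switch from the `long' queue to the `storage' queue.
# Assumption: the remaining #PBS options are valid for that queue too.
sed 's/-q long/-q storage/' ramque.batch > ramque.batch.new
mv ramque.batch.new ramque.batch

# Submit to the scheduler rather than running ./ramque.batch directly.
if command -v qsub >/dev/null 2>&1; then
    qsub ramque.batch
else
    echo "qsub not found on this host; nothing submitted" >&2
fi
```

As far as I understand it, the job's output files should then show up
in the directory I ran qsub from, once the job no longer appears in
showq.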

Alexander.

