[Zaphod-Users] Zaphod job submission tips

Douglas Larson Douglas.Larson at unh.edu
Sat Sep 15 11:40:36 EDT 2007


Hello Zaphod users.

Many of you have experienced jobs being stuck in the queue. There are a
number of possible reasons:

1) There is a hardware error with a myrinet or ethernet card.
2) The pbs/maui queuing system has assigned a broken node to your job.
3) The file system that you want to write your data to is full.

Case 1:
Right after you submit your job, you can run checkjob on it to get
feedback about its status. Sometimes you might see:

"job is deferred.  Reason:  NoResources  (cannot create reservation for 
job 'NNNNN' (intital reservation attempt)

Holds:    Defer  (hold reason:  NoResources)"

In that case I would qdel the job, then look over the submission script
to see whether it asks for more wallclock time or more processors than the
job really requires, or whether it can switch from myri to eth. Any of these
changes can improve your chances of being scheduled.

If checkjob shows something *similar* to:
Allocated Nodes:
[m193:2][m192:2][m191:2][m190:2]
[m189:2][m188:2][m187:2][m186:2]
[m185:2][m184:2][m183:2][m182:2]
[m181:2][m180:2][m179:2][m178:2]
[m177:2][m176:2][m175:2][m174:2]
[m173:2][m172:2][m171:2][m170:2]
[m169:2][m168:2][m167:2][m166:2]
[m165:2][m164:2][m163:2][m162:2]
[m161:2][m159:2][m158:2][m157:2]
[m156:2][m155:2][m154:2][m153:2]
[m151:2][m150:2][m149:2]

I would save that list in a temp file. If the job later dies, the list can
help determine whether a compute node has a hardware error.
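To make the saved list easier to compare later, you could reduce it to one
node name per line. This is only a sketch: extract_nodes is a hypothetical
helper, and it assumes the node list appears after an "Allocated Nodes:"
line in the [name:count] form shown above.

```shell
# Read checkjob output on stdin and print one allocated node name per line.
# Assumption: entries look like [m193:2] and follow "Allocated Nodes:".
extract_nodes() {
    sed -n '/Allocated Nodes:/,$p' |
        grep -oE '\[[A-Za-z0-9_.-]+:[0-9]+\]' |
        sed -E 's/^\[([^:]+):[0-9]+\]$/\1/'
}
```

For example: checkjob NNNNN | extract_nodes > $HOME/job_NNNNN.nodes
(the file name is just an example).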

Case 2:
The command: /usr/local/bin/pbsnodes -l
will show you which nodes the pbs system considers unavailable. If checkjob
showed that your job was assigned a node that is now marked as down, you
should delete the job, because it will never complete. Report your checkjob
output and the suspected overlap with the pbsnodes output to this list.
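The comparison with pbsnodes can be sketched as below. The function name
suspect_nodes is hypothetical. It assumes you have a file with one of your
job's node names per line, and a second file holding saved "pbsnodes -l"
output with the node name in the first column of each line; check both
assumptions against what you actually see on Zaphod.

```shell
# Print the job's nodes that also appear in saved "pbsnodes -l" output.
#   $1: file with one allocated node name per line
#   $2: file with pbsnodes -l output (assumed: node name in first column)
suspect_nodes() {
    awk 'NR == FNR { down[$1] = 1; next } ($1 in down)' "$2" "$1"
}
```

For example: /usr/local/bin/pbsnodes -l > down.txt
then: suspect_nodes job_nodes.txt down.txt
Any names printed are the ones worth reporting.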

Case 3:
If the file system is full, your job will not be able to complete. Use the
df command to check how full the file system you want to use is.

testm@h101:~> df -h /mnt/data05
Filesystem            Size  Used Avail Use% Mounted on
s103:/data            2.7T  171G  2.5T   7% /mnt/data05
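If you want to do this check from a script, one possible sketch is to pull
the Use% number out of the df output. The helper name use_percent is made
up, and it assumes the two-line layout shown above (a header line, then one
data line with Use% in the fifth column). Running df with -P keeps each
file system on a single line.

```shell
# Read df output on stdin and print the Use% value as a bare number.
# Assumption: line 2 is the data line and column 5 is the Use% field.
use_percent() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

For example: df -P /mnt/data05 | use_percent
A job script could refuse to start if the number is above a limit you
choose.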

Hint: The "walltime" directive in your pbs script should be based on your
experience running your code. Try to keep track of the actual time that
your code requires and set walltime based on that value. This will
allow pbs to schedule your job better.
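
As a rough illustration of turning a measured run time into a walltime
value, here is a small sketch. The function name walltime_with_margin and
the idea of adding a percentage safety margin are my own assumptions, not
a Zaphod policy; pick a margin that matches how much your run times vary.

```shell
# Print an HH:MM:SS walltime from a measured run time plus a margin.
#   $1: measured run time in seconds
#   $2: safety margin in percent (an assumed convention, e.g. 20)
walltime_with_margin() {
    secs=$1
    pct=$2
    total=$(( secs + secs * pct / 100 ))
    printf '%02d:%02d:%02d\n' \
        $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}
```

For example, a run that took 5400 seconds with a 20 percent margin gives
01:48:00, which would go in your script as: #PBS -l walltime=01:48:00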




