[Zaphod-Users] Solving a 'node down' problem
Douglas Larson
djl at artemis.sr.unh.edu
Mon Jan 28 11:31:11 EST 2008
Problem: SomeUser has noticed that their job is using a node that is down.
SomeUser at h101:~> showstate
cluster state summary for Mon Jan 28 10:49:36
JobName S User Group Procs Remaining StartTime
------------------ - --------- -------- ----- ----------- -------------------
<A List of all the jobs, current and pending>
<A usage Summary of all of the nodes>
node m109 is down
node m113 is down
node m152 is down
node m172 is down
node m216 is down
job 3666 requires node m123 (node is down)
Cause: It could be a real problem with the cluster, so they try to log in
to the node.
SomeUser at h101:~> ssh m123
Last login: Mon Jan 28 10:27:08 2008 from h101.cl.unh.edu
Have a lot of fun...
Outcome: Successful login!
Next course of action: Better see if the job is running on the node.
SomeUser at m123:~> top
top - 11:00:36 up 10 days, 21:05, 1 user, load average: 1.92, 1.91, 1.96
Tasks: 108 total, 4 running, 104 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0% us, 0.2% sy, 81.2% ni, 18.6% id, 0.0% wa, 0.0% hi, 0.0% si
Mem: 3994064k total, 3979920k used, 14144k free, 131072k buffers
Swap: 0k total, 0k used, 0k free, 3761676k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14157 SomeUser 39 15 6364 788 472 R 98.9 0.0 1303:38 coolcode.exe
14182 SomeUser 39 15 6364 788 472 R 98.9 0.0 1303:14 coolcode.exe
1 root 15 0 720 264 216 S 0.0 0.0 0:04.66 init
2 root RT 0 0 0 0 S 0.0 0.0 0:00.54 migration/0
etc.
Side note: Curious what top is all about? Try 'man top'.
Outcome: The code appears to be running fine on the compute node.
Next course of action: Check the filesystems.
SomeUser at m123:~> df
Filesystem 1K-blocks Used Available Use% Mounted on
none 5120 2576 2544 51% /ram
none 1024 24 1000 3% /dev
none 5120 8 5112 1% /tmp
none 2048 2048 0 100% /var/spool/torque
tmpfs 1997032 8 1997024 1% /dev/shm
Note that "/var/spool/torque" is at 100% usage. This means that the filesystem
is full. That is not good. The Zaphod cluster now uses a diskless boot system,
which means the operating system and all associated files live in the physical
memory (RAM) on the compute node motherboard. The Zaphod cluster nodes have
4GB of RAM, but the /var/spool/torque filesystem is assigned only 2MB of it.
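A check like the one above can be automated. The sketch below reads `df -P`
style output and flags any filesystem at or above a usage threshold;
"check_full" and the 95% threshold are illustrative names chosen here, not
part of the cluster tooling.

```shell
#!/bin/sh
# Hypothetical helper: flag filesystems at or above a usage threshold,
# given `df -P`-style output on stdin.
check_full() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)            # Use% column: "100%" -> "100"
        if (use + 0 >= limit)
            printf "%s%% full: %s\n", use, $6
    }'
}

# On a live node you would run:  df -P | check_full 95
# Demo with lines taken from the df output in this message:
df_sample='Filesystem 1K-blocks Used Available Use% Mounted on
none 5120 2576 2544 51% /ram
none 2048 2048 0 100% /var/spool/torque
tmpfs 1997032 8 1997024 1% /dev/shm'
printf '%s\n' "$df_sample" | check_full 95
# prints: 100% full: /var/spool/torque
```

The `-P` flag keeps each filesystem on one line, so the awk column numbers
stay stable.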
SomeUser at m123:~> ls -l /var/spool/torque/spool/
total 16
-rw------- 1 SomeUser users 108 2008-01-27 13:15 3666.h101.c.ER
-rw------- 1 SomeUser users 839680 2008-01-27 13:15 3666.h101.c.OU
-rw------- 1 SomeUser users 108 2008-01-27 13:15 3666.h101.c.ER
-rw------- 1 SomeUser users 839680 2008-01-27 13:15 3666.h101.c.OU
We see at least 2 * 839680 bytes / 1024 = 1640K of job output files. The rest
of /var/spool/torque:
8K /var/spool/torque/aux
460K /var/spool/torque/mom_logs
40K /var/spool/torque/mom_priv
4K /var/spool/torque/pbs_environment
4K /var/spool/torque/server_name
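Hunting down the offending files can be done with find. This sketch is
self-contained: it builds a throwaway directory standing in for
/var/spool/torque (with a file sized like the .OU file above), and the 500k
cutoff is an arbitrary example; on a node you would point it at the real path.

```shell
#!/bin/sh
# Self-contained sketch: find the files filling a small spool filesystem.
# A temp directory stands in for /var/spool/torque.
spool=$(mktemp -d)
mkdir "$spool/spool"
# 820 * 1024 = 839680 bytes, the size of the .OU file in this message:
dd if=/dev/zero of="$spool/spool/3666.h101.c.OU" bs=1024 count=820 2>/dev/null
printf 'm123\n' > "$spool/server_name"

# Files larger than 500 KB are the likely culprits:
find "$spool" -type f -size +500k
# prints the path of the large .OU file; clean up with: rm -r "$spool"
```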
Outcome: 1640K of spool files plus roughly 516K of Torque's own files is more
than the 2048K filesystem can hold. SomeUser has a process that is filling the
spool area.
Solution: Redirect both stdout and stderr to file "filename.log"
When SomeUser submits their job, they should do something similar to:
mpiexec coolcode &>filename.log
An extremely detailed discussion of I/O redirection can be found in the
"Advanced Bash-Scripting Guide":
http://tldp.org/LDP/abs/html/io-redirection.html
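For reference, "&>file" is a bash shorthand; the portable Bourne-shell
spelling is ">file 2>&1", which also works inside a Torque batch script. A
minimal self-contained sketch (coolcode here is a stand-in function faked
with echo, not the real program):

```shell
#!/bin/sh
# Stand-in for the user's program, so the example runs anywhere:
coolcode() {
    echo "normal output"            # goes to stdout
    echo "an error message" >&2     # goes to stderr
}

# bash shorthand from the message (bash only):
#   mpiexec coolcode &>filename.log
# Portable POSIX equivalent:
coolcode >filename.log 2>&1

cat filename.log
# prints both the stdout and stderr lines, now captured in the file
```

The order matters: ">filename.log 2>&1" sends stderr to the same file, while
"2>&1 >filename.log" would leave stderr on the terminal.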
--------
Doug