[Zaphod-Users] Solving a 'node down' problem

Douglas Larson djl at artemis.sr.unh.edu
Mon Jan 28 11:31:11 EST 2008


Problem: SomeUser has noticed that showstate reports their job as using a node
that is down.

SomeUser at h101:~> showstate
cluster state summary for Mon Jan 28 10:49:36
    JobName            S      User    Group Procs   Remaining            StartTime
    ------------------ - --------- -------- ----- -----------  -------------------
<A list of all the jobs, current and pending>
<A usage summary of all of the nodes>

node m109 is down
node m113 is down
node m152 is down
node m172 is down
node m216 is down
job 3666 requires node m123 (node is down)
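
Side note: If the Torque/Moab client tools are in your path, something like
'pbsnodes m123' or 'checknode m123' will usually say why the scheduler thinks
the node is down (exactly which commands are available depends on the local
install).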

Cause: It could be a real problem with the cluster, so they try to log in
       to the node.

SomeUser at h101:~> ssh m123
Last login: Mon Jan 28 10:27:08 2008 from h101.cl.unh.edu
Have a lot of fun...

Outcome: Successful login!
Next course of action: Better see if the job is running on the node.

SomeUser at m123:~> top
top - 11:00:36 up 10 days, 21:05,  1 user,  load average: 1.92, 1.91, 1.96
Tasks: 108 total,   4 running, 104 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0% us,  0.2% sy, 81.2% ni, 18.6% id,  0.0% wa,  0.0% hi,  0.0% si
Mem:   3994064k total,  3979920k used,    14144k free,   131072k buffers
Swap:        0k total,        0k used,        0k free,  3761676k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
14157 SomeUser  39  15  6364  788  472 R 98.9  0.0   1303:38 coolcode.exe
14182 SomeUser  39  15  6364  788  472 R 98.9  0.0   1303:14 coolcode.exe
    1 root      15   0   720  264  216 S  0.0  0.0   0:04.66 init
    2 root      RT   0     0    0    0 S  0.0  0.0   0:00.54 migration/0
etc.

Side note: Curious what top is all about? Try 'man top'.
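
Another side note: For a quick non-interactive check (handy from a script),
something along these lines should also work; the '-C coolcode.exe' part is
of course specific to this example:
ps -C coolcode.exe -o pid,pcpu,etime,cmd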

Outcome: The code appears to be running fine on the compute node.
Next course of action: Check the filesystems.

SomeUser at m123:~> df
Filesystem           1K-blocks      Used Available Use% Mounted on
none                      5120      2576      2544  51% /ram
none                      1024        24      1000   3% /dev
none                      5120         8      5112   1% /tmp
none                      2048      2048         0 100% /var/spool/torque
tmpfs                  1997032         8   1997024   1% /dev/shm

Note that "/var/spool/torque" is at 100%, which means that the filesystem is
full. That is not good. The Zaphod cluster now uses a diskless boot system:
the operating system and all associated files live in the physical memory
(RAM) on the compute node motherboard. The Zaphod cluster nodes have 4GB of
RAM, but the /var/spool/torque filesystem is assigned only 2MB of it.
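
A quick way to see what is actually eating the space (the path below matches
the layout above; adjust as needed) is something like:
du -sk /var/spool/torque/* | sort -n
In this case a look inside the spool directory tells the story: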

SomeUser at m123:~> ls -l /var/spool/torque/spool/
total 16
-rw-------  1 SomeUser users     108 2008-01-27 13:15 3666.h101.c.ER
-rw-------  1 SomeUser users  839680 2008-01-27 13:15 3666.h101.c.OU
-rw-------  1 SomeUser users     108 2008-01-27 13:15 3666.h101.c.ER
-rw-------  1 SomeUser users  839680 2008-01-27 13:15 3666.h101.c.OU

The two .OU files alone account for at least 2 * 839680 bytes / 1024 = 1640K,
on top of everything else under /var/spool/torque:
8K      /var/spool/torque/aux
460K    /var/spool/torque/mom_logs
40K     /var/spool/torque/mom_priv
4K      /var/spool/torque/pbs_environment
4K      /var/spool/torque/server_name

Outcome: SomeUser has a job whose stdout is filling the spool area. The 1640K
of job output plus the roughly 516K of logs and other Torque files listed
above already exceeds the 2048K the filesystem has, which is why df reports
100%.
Solution: Redirect both stdout and stderr to a file such as "filename.log".

When SomeUser submits their job, they should do something similar to:
mpiexec coolcode.exe &>filename.log
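
For example, a minimal Torque submission script might look like the following
(the job name, resource request, and program name here are only placeholders
for whatever SomeUser actually runs):

#!/bin/bash
#PBS -N coolcode
#PBS -l nodes=1:ppn=2
cd $PBS_O_WORKDIR
# Send stdout and stderr to a file in the working directory instead of
# letting them pile up in the node's tiny /var/spool/torque spool area.
mpiexec coolcode.exe &>filename.log

Note that '&>file' is a bash shortcut; in a plain /bin/sh script use the
portable form '>filename.log 2>&1'.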

An extremely detailed discussion of I/O redirection can be found in the
"Advanced Bash-Scripting Guide":
http://tldp.org/LDP/abs/html/io-redirection.html
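
Once the output is going to a regular file on a filesystem that is visible
from the head node (the home directory, for example), progress can be checked
at any time with something like 'tail -f filename.log'.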

--------
Doug



