[Zaphod-Users] nodes m187 and m182 are in state down

Wed May 23 14:51:43 EDT 2007

    Hi Kai,
  Thanks for your reply. I do not see anything unusual in my jobs. Most of them are running without any problem. Stranger still, the jobs that keep running are among my heavier calculations, which need more memory. So the crashes may not be due to the nodes running out of memory.
  I noticed some time ago that even the master node, i.e., h101, was also not available:
  qstat -a
  Connection refused
  qstat: cannot connect to server h101 (errno=111)

  But it is okay now.

  Could it be that someone is experimenting with the cluster to test something?

  One can see that none of the nodes (even those that were not running jobs) stay up continuously.
  Yours,
  S. Jalali.

Kai Germaschewski <kai.germaschewski at unh.edu> wrote:  
On Wed, 23 May 2007, Saeid Jalali wrote:

> You can see from the following status of job number 4123 that the nodes m187 and m182 are in state down.
>   I wonder why, over the course of a day, almost all the nodes are shut down one by one!

Well, you are right that this is definitely an undesirable situation.

Now while it happens occasionally that a node crashes, this so far has 
been a rare event, while it seems to occur rather frequently with your 
jobs. Do you have any idea whether your jobs do something unusual? 
One possibility I can think of would be that they are running out of 
available memory and the node may swap itself to death.

--Kai
