[Zaphod-Users] nodes m187 and m182 are in state down
Saeid Jalali
s_jalali_a at yahoo.com
Wed May 23 14:51:43 EDT 2007
Hi Kai,
Thanks for your reply. I do not see anything unusual in my jobs. Most of them are running without any problem. Stranger still, the jobs that keep running are among my heavier calculations, which need more memory. So the node failures may not be caused by the nodes running out of memory.
I also noticed some time ago that even the master node, h101, was not available:
qstat -a
Connection refused
qstat: cannot connect to server h101 (errno=111)
But it is okay now.
Could it be that someone is experimenting with the cluster to test something?
One can see that none of the nodes (even those that were not running jobs) stay up continuously.
Yours,
S. Jalali.
Kai Germaschewski <kai.germaschewski at unh.edu> wrote:
On Wed, 23 May 2007, Saeid Jalali wrote:
> You can see from the following status of job number 4123 that the nodes m187 and m182 are in state down.
> I wonder why, within a day, almost all the nodes go down one by one!
Well, you are right that this is definitely an undesirable situation.
Now while it happens occasionally that a node crashes, this so far has
been a rare event, while it seems to occur rather frequently with your
jobs. Do you have any idea whether your jobs do something unusual?
One possibility I can think of would be that they are running out of
available memory and the node may swap itself to death.
--Kai