<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

Whether or not your code is the issue, zaphod is a shared resource, so
it would make sense for all software to be validated and checked for
resource leaks <i>before</i> it actually runs on zaphod.<br>

<br>

I'm not sure how many of us are doing this. If you haven't, then maybe
you should give it a go off zaphod and eliminate that possibility.<br>

<br>

Such validation can be done in part by running the software through
valgrind to detect memory leaks. You also want to make sure the software
is not leaking other resources, such as open files. If your software has
resource leaks, you may not notice anything unusual except the crashes:
a leak, if present, is there on every run, but it only becomes visible
when resources run low.<br>

<br>

just a thought<span class="moz-smiley-s1"><span> :-) </span></span><br>

<br>

<br>

<br>

<br>

Saeid Jalali wrote:

<blockquote cite="mid113502.22142.qm@web54308.mail.yahoo.com"

 type="cite">

  <div class="MsoNormal">Hi Kai,<br>

Thanks for your reply. I do not see anything unusual in my jobs. Most
of them are running without any problem. Stranger still, the jobs that
keep running are among my heavier calculations, which need more memory.
Thus the crashes may not be due to a memory problem on the nodes.<br>

I noticed some time ago that even the master node, h101, was
not available:<br>

qstat -a<br>

Connection refused<br>

qstat: cannot connect to server h101 (errno=111)<br>

  <br>

But it is okay now.<br>

  <br>

Could it be that someone is experimenting with the cluster to test
something?<br>

  <br>

One can see that none of the nodes (even those that were not running
jobs) stay up continuously.<br>
Yours,<br>
S. Jalali.<br>
</div>

  <br>

  <b><i>Kai Germaschewski <a class="moz-txt-link-rfc2396E" href="mailto:kai.germaschewski@unh.edu">&lt;kai.germaschewski@unh.edu&gt;</a></i></b>

wrote:

  <blockquote class="replbq"

 style="border-left: 2px solid rgb(16, 16, 255); margin-left: 5px; padding-left: 5px;">

    <br>

On Wed, 23 May 2007, Saeid Jalali wrote:<br>

    <br>

&gt; You can see from the following status of the job number 4123 that

the nodes m187 and m182 are in state down.<br>

&gt; I wonder why, within a day, almost all the nodes are shut down
one by one!<br>

    <br>

Well, you are right that this is definitely an undesirable situation.<br>

    <br>

Now while it happens occasionally that a node crashes, this so far has <br>

been a rare event, while it seems to occur rather frequently with your <br>

jobs. Do you have any idea whether your jobs do something unusual? <br>

One possibility I can think of would be that they are running out of <br>

available memory and the node may swap itself to death.<br>

    <br>

--Kai<br>

    <br>

  </blockquote>

  <br>

  <p> </p>


  <pre wrap="">

<hr size="4" width="90%">

_______________________________________________

Zaphod-Users mailing list

<a class="moz-txt-link-abbreviated" href="mailto:Zaphod-Users@lists.sr.unh.edu">Zaphod-Users@lists.sr.unh.edu</a>

<a class="moz-txt-link-freetext" href="http://lists.sr.unh.edu/mailman/listinfo/zaphod-users">http://lists.sr.unh.edu/mailman/listinfo/zaphod-users</a>

  </pre>

</blockquote>

<br>

<div class="moz-signature">-- <br>

<img src="cid:part1.06060408.08080500@apollo.sr.unh.edu" border="0"></div>

</body>

</html>