[Zaphod-Users] nodes m187 and m182 are in state down
Tod Hagan
tod at gust.sr.unh.edu
Wed May 23 14:41:53 EDT 2007
On Wed, 2007-05-23 at 14:05 -0400, Kai Germaschewski wrote:
> ...occasionally that a node crashes, this so far has
> been a rare event, while it seems to occur rather frequently with your
> jobs. Do you have any idea whether your jobs do something unusual?
> One possibility I can think of would be that they are running out of
> available memory and the node may swap itself to death.
I too have been wondering about jobs killing nodes.
The failure rate of the Myrinet nodes has jumped sharply in the last
few days. Six have failed today, and ten since 00:00 Tuesday:
Sun May 20 00:00:01 EDT 2007
           JobEx     Job    Free  Off/Dn     Unk    Miss     J+F   O+U+M
Ethernet       1       0      32       2       3       0      33       5
            2.6%    0.0%   84.2%    5.3%    7.9%    0.0%   86.8%   13.2%
Myrinet       97       0       6       0      19       0     103      19
           79.5%    0.0%    4.9%    0.0%   15.6%    0.0%   84.4%   15.6%

Mon May 21 00:00:02 EDT 2007
           JobEx     Job    Free  Off/Dn     Unk    Miss     J+F   O+U+M
Ethernet       1       0      32       2       3       0      33       5
            2.6%    0.0%   84.2%    5.3%    7.9%    0.0%   86.8%   13.2%
Myrinet       68       1      32       2      19       0     101      21
           55.7%    0.8%   26.2%    1.6%   15.6%    0.0%   82.8%   17.2%

Tue May 22 00:00:03 EDT 2007
           JobEx     Job    Free  Off/Dn     Unk    Miss     J+F   O+U+M
Ethernet       2       0      31       2       3       0      33       5
            5.3%    0.0%   81.6%    5.3%    7.9%    0.0%   86.8%   13.2%
Myrinet       60       1      39       3      19       0     100      22
           49.2%    0.8%   32.0%    2.5%   15.6%    0.0%   82.0%   18.0%

Wed May 23 00:00:00 EDT 2007
           JobEx     Job    Free  Off/Dn     Unk    Miss     J+F   O+U+M
Ethernet       2       0      31       2       3       0      33       5
            5.3%    0.0%   81.6%    5.3%    7.9%    0.0%   86.8%   13.2%
Myrinet       73       3      20       2      24       0      96      26
           59.8%    2.5%   16.4%    1.6%   19.7%    0.0%   78.7%   21.3%

Wed May 23 14:27:30 EDT 2007
           JobEx     Job    Free  Off/Dn     Unk    Miss     J+F   O+U+M
Ethernet      12       0      21       2       3       0      33       5
           31.6%    0.0%   55.3%    5.3%    7.9%    0.0%   86.8%   13.2%
Myrinet       70       1      19       0      32       0      90      32
           57.4%    0.8%   15.6%    0.0%   26.2%    0.0%   73.8%   26.2%
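
For anyone who wants to check the "six today, ten since Tuesday" figures, here is a rough Python sketch of the arithmetic. It assumes that the O+U+M column is the count of unavailable Myrinet nodes, and that the net change in that column is a fair stand-in for "failed" (a node that was repaired in between would hide a failure). The function name and the snapshot labels are made up for illustration.

```python
# Myrinet O+U+M counts copied from the snapshots above.
# Assumption: O+U+M is the number of unavailable nodes.
snapshots = [
    ("Sun May 20 00:00", 19),
    ("Mon May 21 00:00", 21),
    ("Tue May 22 00:00", 22),
    ("Wed May 23 00:00", 26),
    ("Wed May 23 14:27", 32),
]


def newly_unavailable(snaps, since_label):
    """Net increase in unavailable nodes from the named snapshot to the latest."""
    counts = dict(snaps)
    latest = snaps[-1][1]
    return latest - counts[since_label]


failed_today = newly_unavailable(snapshots, "Wed May 23 00:00")
failed_since_tuesday = newly_unavailable(snapshots, "Tue May 22 00:00")
print(failed_today, failed_since_tuesday)  # prints: 6 10
```

The Ethernet O+U+M count stays at 5 in every snapshot, so the increase is confined to the Myrinet nodes.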
It seems that most of the unavailable nodes really are down:
h101> node_pbs_test -a m
30 Unpingable: m104 m108 m109 m111 m113 m117 m123 m142 m150 m152 m155 m166 m172 m174 m182 m185 m187 m191 m202 m203 m205 m208 m209 m210 m215 m216 m217 m218 m221 m222
1 ssh timed out or refused: m110
91 Pingable, /home mounted: m101 m102 m103 m105 m106 m107 m112 m114 m115 m116 m118 m119 m120 m121 m122 m124 m125 m126 m127 m128 m129 m130 m131 m132 m133 m134 m135 m136 m137 m138 m139 m140 m141 m143 m144 m145 m146 m147 m148 m149 m151 m153 m154 m156 m157 m158 m159 m160 m161 m162 m163 m164 m165 m167 m168 m169 m170 m171 m173 m175 m176 m177 m178 m179 m180 m181 m183 m184 m186 m188 m189 m190 m192 m193 m194 m195 m196 m197 m198 m199 m200 m201 m204 m206 m207 m211 m212 m213 m214 m219 m220
h101>
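
If it helps to work with that output in a script, the following is a minimal Python sketch that splits each summary line into a count and a node list and checks that the two agree. It assumes only the "COUNT DESCRIPTION: NODE NODE ..." line shape seen above; parse_node_report is a hypothetical helper, not part of node_pbs_test, and the sample is the first part of the session above.

```python
def parse_node_report(text):
    """Map each category label to (reported count, list of node names)."""
    groups = {}
    for line in text.splitlines():
        head, sep, nodes = line.partition(":")
        if not sep:
            continue  # prompt lines such as "h101>" contain no colon
        count_str, _, label = head.strip().partition(" ")
        if not count_str.isdigit():
            continue  # not a "COUNT DESCRIPTION: ..." summary line
        groups[label] = (int(count_str), nodes.split())
    return groups


# Copied from the session above (the long "Pingable" line is left out here).
sample = """\
h101> node_pbs_test -a m
30 Unpingable: m104 m108 m109 m111 m113 m117 m123 m142 m150 m152 m155 m166 m172 m174 m182 m185 m187 m191 m202 m203 m205 m208 m209 m210 m215 m216 m217 m218 m221 m222
1 ssh timed out or refused: m110
h101>
"""

groups = parse_node_report(sample)
for label, (count, names) in groups.items():
    status = "ok" if count == len(names) else "MISMATCH"
    print(f"{label}: reported {count}, listed {len(names)} ({status})")

# The two nodes named in the subject line are both among the unpingable ones.
print({"m182", "m187"} <= set(groups["Unpingable"][1]))
```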
Tod
--
Tod Hagan
Information Technologist
AIRMAP/Climate Change Research Center
Institute for the Study of Earth, Oceans, and Space
University of New Hampshire
Durham, NH 03824
Phone: 603-862-3116