[Trillian-users] trillian

Maciolek, Mark Mark.Maciolek at unh.edu
Wed May 3 08:51:14 EDT 2017


Hi,

I have opened a case with Cray, but I am not sure they will respond, since our support contract expired last year.

Mark

--Mark Maciolek
Network Administrator
Morse Hall Rm 338
http://www.unh.edu/research/support-units/research-computing-center

From: W Douglas Cramer [mailto:D.Cramer at unh.edu]
Sent: Wednesday, May 3, 2017 8:33 AM
To: Maciolek, Mark <Mark.Maciolek at unh.edu>
Cc: Raeder, Joachim <j.raeder at unh.edu>
Subject: Re: [Trillian-users] trillian

Mark,
It looks like the problems with the scheduler have reappeared.
Doug

On Tue, Apr 25, 2017 at 10:36 AM, Maciolek, Mark <Mark.Maciolek at unh.edu> wrote:
Matt,

This is what the qstat command shows for a few jobs that are in hold status:

comment = job held, too many failed attempts to run

I restarted PBS on trillian, and so far only 5 jobs have restarted.
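
For checking which jobs are still held and why, here is a minimal Python sketch that pulls the job_state and comment attributes out of "qstat -f" style text. It assumes the usual PBS layout of a "Job Id:" line followed by indented "name = value" attributes; the sample text, job IDs, and the helper name held_jobs are made up for illustration and were not captured from trillian. Wrapped continuation lines are ignored, so a long comment may come back truncated.

```python
def held_jobs(qstat_text):
    """Return {job_id: comment} for jobs whose job_state is 'H'.

    Assumes 'qstat -f' style text: a 'Job Id:' line, then indented
    'name = value' attribute lines. Continuation lines are skipped.
    """
    jobs = {}
    current = None
    for line in qstat_text.splitlines():
        if line.startswith("Job Id:"):
            current = line.split(":", 1)[1].strip()
            jobs[current] = {}
        elif current is not None and " = " in line:
            key, value = line.strip().split(" = ", 1)
            jobs[current][key] = value
    return {
        job_id: attrs.get("comment", "")
        for job_id, attrs in jobs.items()
        if attrs.get("job_state") == "H"
    }


# Hypothetical sample, not real output from the system.
SAMPLE = """Job Id: 100.sdb
    Job_Name = example_a
    job_state = H
    comment = job held, too many failed attempts to run

Job Id: 101.sdb
    Job_Name = example_b
    job_state = R
"""

if __name__ == "__main__":
    for job_id, comment in held_jobs(SAMPLE).items():
        print(job_id, "->", comment)
```

In practice the text would come from running "qstat -f" and capturing its output, rather than from the sample string.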

mark

--Mark Maciolek
Network Administrator
Morse Hall Rm 338
http://www.unh.edu/research/support-units/research-computing-center

-----Original Message-----
From: Gorby, Matthew
Sent: Tuesday, April 25, 2017 10:03 AM
To: Ethan Stewart <es2025 at sr.unh.edu>; Maciolek, Mark <Mark.Maciolek at unh.edu>; Germaschewski, Kai <Kai.Germaschewski at unh.edu>; opss <ops at sr.unh.edu>
Subject: Re: trillian

Hello,

The problems on Trillian persist.  My existing job is still in 'H' status.  When I tried to start a new job it went from 'R' to 'Q' to 'H' rapidly, and when I tried the interactive job it did the same as last time: said it was ready and then said it was completed immediately.

Thanks again,

-Matt
________________________________________
From: Ethan Stewart <es2025 at sr.unh.edu>
Sent: Monday, April 24, 2017 4:44 PM
To: Gorby, Matthew; Maciolek, Mark; Germaschewski, Kai; opss
Subject: Re: trillian

It looks like the restart of alps didn't fix everything. The logs still
show too many files open:
2017-04-24 16:26:26: [1388] processControlMsg:171: Agent received
placement message on fd 212
2017-04-24 16:26:26: [1388] doAuth:1814: Agent popen failure: Too many
open files
2017-04-24 16:26:26: [1388] setupConn:1839: Agent authentication failure
with host nid00235 port 607
2017-04-24 16:26:26: [21547] get_apsched_info:853: Unable to open
/ufs/alps_shared/apschedNid: Too many open files
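
A rough way to see which processes are hitting the limit is to tally the "Too many open files" records per pid. The Python sketch below assumes one record per line in the form "date time: [pid] function:line: message", which is inferred from the excerpt above (the email has wrapped the lines; the sample rejoins them). The regular expression and the helper name emfile_by_pid are illustrative, not part of any Cray tool.

```python
import re
from collections import Counter

# Assumed record layout: "<date> <time>: [<pid>] <func>:<line>: <message>"
LINE_RE = re.compile(r"^\S+ \S+: \[(\d+)\] (\w+):\d+: (.*)$")


def emfile_by_pid(lines):
    """Count 'Too many open files' records per pid (pid kept as a string)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match and "Too many open files" in match.group(3):
            counts[match.group(1)] += 1
    return dict(counts)


# The four records quoted above, with the email line wrapping undone.
SAMPLE_LINES = [
    "2017-04-24 16:26:26: [1388] processControlMsg:171: Agent received placement message on fd 212",
    "2017-04-24 16:26:26: [1388] doAuth:1814: Agent popen failure: Too many open files",
    "2017-04-24 16:26:26: [1388] setupConn:1839: Agent authentication failure with host nid00235 port 607",
    "2017-04-24 16:26:26: [21547] get_apsched_info:853: Unable to open /ufs/alps_shared/apschedNid: Too many open files",
]

if __name__ == "__main__":
    print(emfile_by_pid(SAMPLE_LINES))
```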

And sure enough, it has a lot of open files, mostly pointing to
variations of:
/var/spool/cray/llm/apsys.21547-20170424t081526.436974
(10.131.255.254:/snv/245/var)

I've restarted alps again; you should have enough time to get your job
started before it starts causing errors again.
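
To watch for the leak coming back, the open-file check can be scripted. This is a minimal Python sketch, assuming a Linux /proc filesystem on the node where the daemon runs and enough privilege to read another process's fd directory. The function names are hypothetical, and grouping by directory is just one way to spot a pile-up such as the /var/spool/cray/llm files mentioned above.

```python
import os
from collections import Counter


def fd_targets(pid, proc_root="/proc"):
    """Return the symlink targets of every open descriptor of pid."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    targets = []
    for name in os.listdir(fd_dir):
        try:
            targets.append(os.readlink(os.path.join(fd_dir, name)))
        except OSError:
            continue  # descriptor was closed between listdir and readlink
    return targets


def soft_nofile_limit(limits_text):
    """Parse the soft 'Max open files' value from /proc/<pid>/limits text.

    Returns None if the line is missing or the limit is not numeric.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            fields = line.split()
            if len(fields) > 3 and fields[3].isdigit():
                return int(fields[3])
    return None


def top_dirs(targets, n=5):
    """Most common parent directories among file targets.

    Sockets, pipes, and other non-path targets are skipped.
    """
    dirs = Counter(os.path.dirname(t) for t in targets if t.startswith("/"))
    return dirs.most_common(n)


if __name__ == "__main__":
    pid = os.getpid()  # replace with the pid of the daemon being checked
    targets = fd_targets(pid)
    with open(os.path.join("/proc", str(pid), "limits")) as handle:
        limit = soft_nofile_limit(handle.read())
    print(len(targets), "open descriptors, soft limit", limit)
    print(top_dirs(targets))
```

Comparing the descriptor count with the soft limit over time should show whether the count is creeping up again after a restart.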

-- Ethan

On 04/24/2017 04:22 PM, Gorby, Matthew wrote:
> Hello,
>
>
> When I submit a job to the queue it is again setting it to the held
> ('H') status.  Also, when I started an interactive job this time it
> didn't hang but instead immediately did the following:
>
>
> mgorby at trillian:~/runs/bastille.29$ qsub -q workq -I -l nodes=3:ppn=32 -N interactive
> qsub: waiting for job 35242.sdb to start
> qsub: job 35242.sdb ready
>
>
> qsub: job 35242.sdb completed
>
> It completed without letting me interact with the session at all.
>
> Thanks,
>
> -Matt
>
>
> ------------------------------------------------------------------------
> *From:* Mark Maciolek <Mark.Maciolek at unh.edu>
> *Sent:* Monday, April 24, 2017 3:31 PM
> *To:* Germaschewski, Kai; opss
> *Cc:* Gorby, Matthew
> *Subject:* RE: trillian
>
>
> Hi,
>
>
>
> Found this in the logs:
> fopen(/var/log/alps/apsched20170424) failed (Too many open files)
>
> 20170424t152733.506065: msgWriter: Error: unable to open tmp log file:
> /var/spool/cray/llm/unknown.8010-20170424t152733.506050
>
> 20170424t152733.506153: msgWriter: msgWriter: ERROR: discarding log
> message: write to file failed.
>
> 2017-04-24 15:27:33: Switching pid 8010 to /var/log/alps/apsched20170424
>
> 2017-04-24 15:27:33: fopen(/var/log/alps/apsched20170424) failed (Too
> many open files)
>
> 20170424t152733.506465: msgWriter: Error: unable to open tmp log file:
> /var/spool/cray/llm/unknown.8010-20170424t152733.506449
>
> 20170424t152733.506532: msgWriter: msgWriter: ERROR: discarding log
> message: write to file failed.
>
>
>
> Restarted alps on the system and it seems happier now. The 4 nodes that
> were down for repair do not show as down in the apstat output.
>
>
>
> mark
>
>
>
> *From:* Kai Germaschewski [mailto:kai.germaschewski at unh.edu]
> *Sent:* Monday, April 24, 2017 1:52 PM
> *To:* RCC Ops <ops at sr.unh.edu>
> *Subject:* Fwd: trillian
>
>
>
> I figured I might want to forward this to rcc-ops, in case Mark isn't in
> / busy.
>
>
>
> --Kai
>
>
>
> ---------- Forwarded message ---------
> From: Kai Germaschewski <kai.germaschewski at unh.edu>
> Date: Mon, Apr 24, 2017 at 12:27 PM
> Subject: trillian
> To: Mark Maciolek <mlm at sr.unh.edu>
>
>
>
> Hi Mark,
>
>
>
> it seems that trillian is having issues again -- I have an interactive
> job running, but "aprun" hangs when I try to actually run something, and
> if I submit a new run, it says it's starting but never actually gets going.
>
>
>
> --Kai
>
>
>

_______________________________________________
Trillian-users mailing list
Trillian-users at lists.sr.unh.edu
http://lists.sr.unh.edu/mailman/listinfo/trillian-users



--
W. Douglas Cramer
Space Science Center
University of New Hampshire
245B Morse Hall
8 College Rd
Durham, NH 03824
Phone: 603-862-1293
Email: D.Cramer at unh.edu

"What can be asserted without evidence can be dismissed without evidence." -- Christopher Hitchens, 2007