Thanks, Mark.  I hope this gets resolved soon.

As an additional comment, could you please send an all-users e-mail when you do things like this?  Getting this kind of information shouldn’t require chance attempts to log on just to see what the status is.  Thanks.

Phil Isenberg

On May 4, 2017, at 8:48 AM, Maciolek, Mark <Mark.Maciolek@unh.edu> wrote:

Hi,
 
I have not heard back from Cray, so I went ahead and rebooted trillian last night. I will be keeping an eye on the scheduler log file throughout the day.
 
 
mark
 
--Mark Maciolek
Network Administrator
Morse Hall Rm 338
 
From: Jimmy Raeder [mailto:J.Raeder@unh.edu] 
Sent: Wednesday, May 3, 2017 10:04 AM
To: Maciolek, Mark <Mark.Maciolek@unh.edu>
Cc: Raeder, Joachim <j.raeder@unh.edu>; Cramer, D <dcramer@guero.sr.unh.edu>; trillian-users@lists.sr.unh.edu
Subject: Re: [Trillian-users] trillian
 
Here is what I’ve been seeing for a while now:
 
trillian>  qstat -a
 
sdb:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
35276.sdb       pai      workq    st.pbs       8334  40 128    --  200:0 R 185:3
35280.sdb       jxyu     workq    y.vdWFM     13424   4 128    --  395:5 R 185:1
35286.sdb       liang    workq    STDIN       18169  16 512    --  240:0 R 185:0
35318.sdb       kvonkrus workq    PBS_job_sc    --    4 128    --  48:10 H   --
35344.sdb       kai      workq    STDIN       16240  16 512    --    --  R 118:1
35347.sdb       dcfy     workq    gkeyll-M      --    1  32    --  240:0 H   --
35365.sdb       dcramer  workq    cirt1_2013    --   15 480    --  120:0 H   --
35382.sdb       liang    workq    STDIN       10960   4 128    --  240:0 R 55:43
35387.sdb       dcramer  workq    cirt1_2013    --   15 480    --  120:0 H   --
35394.sdb       jxyu     workq    y.MWT         --    4 128    --  295:5 H   --
35395.sdb       jxyu     workq    y.MWT         --    4 128    --  295:5 H   --
trillian>
 
There are 80 nodes running and 43 nodes in total on hold, which makes no sense.
This is an issue with the scheduler, apparently.
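
The node totals quoted above can be re-derived from the listing with a short script. This is only a sketch: the helper name `nodes_by_state` is made up for illustration, and it assumes the `qstat -a` column layout shown above (11 whitespace-separated columns, with NDS sixth from the end and the state second from the end).

```python
from collections import defaultdict

# Job lines copied from the qstat -a listing above.
QSTAT_LINES = """\
35276.sdb       pai      workq    st.pbs       8334  40 128    --  200:0 R 185:3
35280.sdb       jxyu     workq    y.vdWFM     13424   4 128    --  395:5 R 185:1
35286.sdb       liang    workq    STDIN       18169  16 512    --  240:0 R 185:0
35318.sdb       kvonkrus workq    PBS_job_sc    --    4 128    --  48:10 H   --
35344.sdb       kai      workq    STDIN       16240  16 512    --    --  R 118:1
35347.sdb       dcfy     workq    gkeyll-M      --    1  32    --  240:0 H   --
35365.sdb       dcramer  workq    cirt1_2013    --   15 480    --  120:0 H   --
35382.sdb       liang    workq    STDIN       10960   4 128    --  240:0 R 55:43
35387.sdb       dcramer  workq    cirt1_2013    --   15 480    --  120:0 H   --
35394.sdb       jxyu     workq    y.MWT         --    4 128    --  295:5 H   --
35395.sdb       jxyu     workq    y.MWT         --    4 128    --  295:5 H   --
"""


def nodes_by_state(text):
    """Sum the NDS column per job state (R, H, ...) -- hypothetical helper."""
    totals = defaultdict(int)
    for line in text.splitlines():
        fields = line.split()
        # Keep only job lines: 11 columns and a numeric job ID such as 35276.sdb.
        # This skips the header rows and the row of dashes.
        if len(fields) != 11 or not fields[0].split(".")[0].isdigit():
            continue
        state = fields[-2]       # S column
        nodes = int(fields[-6])  # NDS column
        totals[state] += nodes
    return dict(totals)


if __name__ == "__main__":
    print(nodes_by_state(QSTAT_LINES))  # {'R': 80, 'H': 43}
```

Counting from the end of each line rather than the start is a deliberate choice here, since the SessID and time columns may hold `--` but never disappear.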
 
—  Jimmy 
 
 

--------------------------------------------------------------------------------------------------
Joachim (Jimmy) Raeder
Professor of Physics, Department of Physics & Space Science Center
University of New Hampshire
245G Morse Hall, 8 College Rd, Durham, NH 03824-3525
voice: 603-862-3412  mobile: 603-502-9505  assistant: 603-862-1431
e-mail: J.Raeder@unh.edu
WWW: http://mhd.sr.unh.edu/~jraeder/tmp.homepage
--------------------------------------------------------------------------------------------------
 
 
 
On May 3, 2017, at 2:51 PM, Maciolek, Mark <Mark.Maciolek@unh.edu> wrote:
 
Hi,
 
I have opened a case with Cray, but I am not sure they will respond, since our support contract expired last year.
 
Mark
 
--Mark Maciolek
Network Administrator
Morse Hall Rm 338
 
From: W Douglas Cramer [mailto:D.Cramer@unh.edu] 
Sent: Wednesday, May 3, 2017 8:33 AM
To: Maciolek, Mark <Mark.Maciolek@unh.edu>
Cc: Raeder, Joachim <j.raeder@unh.edu>
Subject: Re: [Trillian-users] trillian
 

Mark,

It looks like the problems with the scheduler have reappeared.

Doug
 
On Tue, Apr 25, 2017 at 10:36 AM, Maciolek, Mark <Mark.Maciolek@unh.edu> wrote:
Matt,

This is what the qstat command shows for a few jobs that are in hold status:

comment = job held, too many failed attempts to run

I restarted pbs on trillian, and so far only 5 jobs have restarted.

mark

--Mark Maciolek
Network Administrator
Morse Hall Rm 338
http://www.unh.edu/research/support-units/research-computing-center

-----Original Message-----
From: Gorby, Matthew
Sent: Tuesday, April 25, 2017 10:03 AM
To: Ethan Stewart <es2025@sr.unh.edu>; Maciolek, Mark <Mark.Maciolek@unh.edu>; Germaschewski, Kai <Kai.Germaschewski@unh.edu>; opss <ops@sr.unh.edu>
Subject: Re: trillian

Hello,

The problems on Trillian persist.  My existing job is still in 'H' status.  When I tried to start a new job it went from 'R' to 'Q' to 'H' rapidly, and when I tried the interactive job it did the same as last time: said it was ready and then said it was completed immediately.

Thanks again,

-Matt
________________________________________
From: Ethan Stewart <es2025@sr.unh.edu>
Sent: Monday, April 24, 2017 4:44 PM
To: Gorby, Matthew; Maciolek, Mark; Germaschewski, Kai; opss
Subject: Re: trillian

It looks like the restart of alps didn't fix everything. The logs still
show too many files open:
2017-04-24 16:26:26: [1388] processControlMsg:171: Agent received placement message on fd 212
2017-04-24 16:26:26: [1388] doAuth:1814: Agent popen failure: Too many open files
2017-04-24 16:26:26: [1388] setupConn:1839: Agent authentication failure with host nid00235 port 607
2017-04-24 16:26:26: [21547] get_apsched_info:853: Unable to open /ufs/alps_shared/apschedNid: Too many open files

And sure enough, it has a lot of open files, mostly pointing to
variations of:
/var/spool/cray/llm/apsys.21547-20170424t081526.436974
(10.131.255.254:/snv/245/var)
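
A check like this can be scripted on a Linux node by comparing the number of entries under /proc/<pid>/fd with the soft "Max open files" limit in /proc/<pid>/limits. The following is a rough sketch, not the procedure actually used here: the function names are hypothetical, the `proc_root` parameter exists only so the code can be pointed at a test directory, and reading another user's /proc/<pid>/fd normally requires root.

```python
import os


def count_open_fds(pid, proc_root="/proc"):
    """Count entries in <proc_root>/<pid>/fd (one per open descriptor)."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def soft_nofile_limit(pid, proc_root="/proc"):
    """Return the soft 'Max open files' limit from <proc_root>/<pid>/limits.

    Returns None if the limit is 'unlimited' or the line is missing.
    """
    with open(os.path.join(proc_root, str(pid), "limits")) as fh:
        for line in fh:
            if line.startswith("Max open files"):
                # Fields: Max, open, files, <soft>, <hard>, <units>
                soft = line.split()[3]
                return None if soft == "unlimited" else int(soft)
    return None


def fd_usage(pid, proc_root="/proc"):
    """Return (descriptors in use, soft limit) for one process."""
    return count_open_fds(pid, proc_root), soft_nofile_limit(pid, proc_root)
```

Calling `fd_usage(pid)` for the daemon's pid returns the two numbers side by side; a count close to the limit would be consistent with the "Too many open files" errors in the log.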

I've restarted alps again; you should have enough time to get your job
started before it starts causing errors again.

-- Ethan

On 04/24/2017 04:22 PM, Gorby, Matthew wrote:
> Hello,
>
>
> When I submit a job to the queue it is again setting it to the held
> ('H') status.  Also, when I started an interactive job this time it
> didn't hang but instead immediately did the following:
>
>
> mgorby@trillian:~/runs/bastille.29$ qsub -q workq -I -l nodes=3:ppn=32
> -N interactive
> qsub: waiting for job 35242.sdb to start
> qsub: job 35242.sdb ready
>
>
> qsub: job 35242.sdb completed
>
> It completed without letting me interact with the session at all.
>
> Thanks,
>
> -Matt
>
>
> ------------------------------------------------------------------------
> *From:* Mark Maciolek <Mark.Maciolek@unh.edu>
> *Sent:* Monday, April 24, 2017 3:31 PM
> *To:* Germaschewski, Kai; opss
> *Cc:* Gorby, Matthew
> *Subject:* RE: trillian
>
>
> Hi,
>
>
>
> Found this in the logs:
> fopen(/var/log/alps/apsched20170424) failed (Too many open files)
>
> 20170424t152733.506065: msgWriter: Error: unable to open tmp log file:
> /var/spool/cray/llm/unknown.8010-20170424t152733.506050
>
> 20170424t152733.506153: msgWriter: msgWriter: ERROR: discarding log
> message: write to file failed.
>
> 2017-04-24 15:27:33: Switching pid 8010 to /var/log/alps/apsched20170424
>
> 2017-04-24 15:27:33: fopen(/var/log/alps/apsched20170424) failed (Too
> many open files)
>
> 20170424t152733.506465: msgWriter: Error: unable to open tmp log file:
> /var/spool/cray/llm/unknown.8010-20170424t152733.506449
>
> 20170424t152733.506532: msgWriter: msgWriter: ERROR: discarding log
> message: write to file failed.
>
>
>
> Restarted alps on the system and it seems happier now; the 4 nodes that
> were down for repair do not show as down in the apstat output.
>
>
>
> mark
>
>
>
> *From:*Kai Germaschewski [mailto:kai.germaschewski@unh.edu]
> *Sent:* Monday, April 24, 2017 1:52 PM
> *To:* RCC Ops <ops@sr.unh.edu>
> *Subject:* Fwd: trillian
>
>
>
> I figured I might want to forward this to rcc-ops, in case Mark isn't in
> / busy.
>
>
>
> --Kai
>
>
>
> ---------- Forwarded message ---------
> From: Kai Germaschewski <kai.germaschewski@unh.edu>
> Date: Mon, Apr 24, 2017 at 12:27 PM
> Subject: trillian
> To: Mark Maciolek <mlm@sr.unh.edu>
>
>
>
> Hi Mark,
>
>
>
> it seems that trillian is having issues again -- I have an interactive
> job running, but "aprun" hangs when I try to actually run something, and
> if I submit a new run, it says it's starting but never actually gets going.
>
>
>
> --Kai
>
>
>

_______________________________________________
Trillian-users mailing list
Trillian-users@lists.sr.unh.edu
http://lists.sr.unh.edu/mailman/listinfo/trillian-users



-- 
W. Douglas Cramer
Space Science Center
University of New Hampshire
245B Morse Hall
8 College Rd
Durham, NH 03824
Phone: 603-862-1293
Email: D.Cramer@unh.edu

"What can be asserted without evidence can be dismissed without evidence." -- Christopher Hitchens, 2007
 