[Trillian-users] trillian

Philip Isenberg phil.isenberg at unh.edu
Thu May 4 10:48:42 EDT 2017


Thanks, Mark.  I hope this gets resolved soon.

As an additional comment, could you please send an all-users e-mail when you do things like this?  Getting this kind of information shouldn’t require chance attempts to log on just to see what the status is.  Thanks.

Phil Isenberg

> On May 4, 2017, at 8:48 AM, Maciolek, Mark <Mark.Maciolek at unh.edu> wrote:
> 
> Hi,
>  
> Have not heard back from Cray so I went ahead and rebooted trillian last night. Will be keeping an eye on the scheduler log file throughout the day.
>  
>  
> mark
>  
> --Mark Maciolek
> Network Administrator
> Morse Hall Rm 338
> http://www.unh.edu/research/support-units/research-computing-center
>  
> From: Jimmy Raeder [mailto:J.Raeder at unh.edu]
> Sent: Wednesday, May 3, 2017 10:04 AM
> To: Maciolek, Mark <Mark.Maciolek at unh.edu>
> Cc: Raeder, Joachim <j.raeder at unh.edu>; Cramer, D <dcramer at guero.sr.unh.edu>; trillian-users at lists.sr.unh.edu
> Subject: Re: [Trillian-users] trillian
>  
> Here is what I’m seeing for a while now:
>  
> trillian>  qstat -a
>  
> sdb:
>                                                             Req'd  Req'd   Elap
> Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
> --------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
> 35276.sdb       pai      workq    st.pbs       8334  40 128    --  200:0 R 185:3
> 35280.sdb       jxyu     workq    y.vdWFM     13424   4 128    --  395:5 R 185:1
> 35286.sdb       liang    workq    STDIN       18169  16 512    --  240:0 R 185:0
> 35318.sdb       kvonkrus workq    PBS_job_sc    --    4 128    --  48:10 H   --
> 35344.sdb       kai      workq    STDIN       16240  16 512    --    --  R 118:1
> 35347.sdb       dcfy     workq    gkeyll-M      --    1  32    --  240:0 H   --
> 35365.sdb       dcramer  workq    cirt1_2013    --   15 480    --  120:0 H   --
> 35382.sdb       liang    workq    STDIN       10960   4 128    --  240:0 R 55:43
> 35387.sdb       dcramer  workq    cirt1_2013    --   15 480    --  120:0 H   --
> 35394.sdb       jxyu     workq    y.MWT         --    4 128    --  295:5 H   --
> 35395.sdb       jxyu     workq    y.MWT         --    4 128    --  295:5 H   --
> trillian>
>  
> There are 80 nodes running and 43 nodes in total on hold, which makes no sense.
> This is an issue with the scheduler, apparently.
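The node totals quoted above can be re-derived from the qstat listing. The following is a minimal sketch, assuming the column layout shown in that listing (state in the second-to-last column, NDS in the sixth from last); the helper name `nodes_by_state` is hypothetical and not part of PBS.

```python
from collections import defaultdict

# Job rows copied from the "qstat -a" listing in this thread.
QSTAT_ROWS = """\
35276.sdb       pai      workq    st.pbs       8334  40 128    --  200:0 R 185:3
35280.sdb       jxyu     workq    y.vdWFM     13424   4 128    --  395:5 R 185:1
35286.sdb       liang    workq    STDIN       18169  16 512    --  240:0 R 185:0
35318.sdb       kvonkrus workq    PBS_job_sc    --    4 128    --  48:10 H   --
35344.sdb       kai      workq    STDIN       16240  16 512    --    --  R 118:1
35347.sdb       dcfy     workq    gkeyll-M      --    1  32    --  240:0 H   --
35365.sdb       dcramer  workq    cirt1_2013    --   15 480    --  120:0 H   --
35382.sdb       liang    workq    STDIN       10960   4 128    --  240:0 R 55:43
35387.sdb       dcramer  workq    cirt1_2013    --   15 480    --  120:0 H   --
35394.sdb       jxyu     workq    y.MWT         --    4 128    --  295:5 H   --
35395.sdb       jxyu     workq    y.MWT         --    4 128    --  295:5 H   --
"""


def nodes_by_state(text):
    """Sum the NDS column per job state (hypothetical helper, not a PBS API)."""
    totals = defaultdict(int)
    for line in text.splitlines():
        fields = line.split()
        # Skip headers, separators and prompts: job rows have 11 columns
        # and a job ID ending in ".sdb" on this system.
        if len(fields) < 11 or not fields[0].endswith(".sdb"):
            continue
        state = fields[-2]       # "S" column
        nodes = int(fields[-6])  # "NDS" column
        totals[state] += nodes
    return dict(totals)


if __name__ == "__main__":
    for state, nodes in sorted(nodes_by_state(QSTAT_ROWS).items()):
        print(f"{state}: {nodes} nodes")
```

Indexing from the end of each row keeps the sketch independent of the width of the Jobname column, which qstat truncates.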
>  
> —  Jimmy 
>  
>  
> 
> --------------------------------------------------------------------------------------------------
> Joachim (Jimmy) Raeder
> Professor of Physics, Department of Physics & Space Science Center
> University of New Hampshire
> 245G Morse Hall, 8 College Rd, Durham, NH 03824-3525
> voice: 603-862-3412  mobile: 603-502-9505  assistant: 603-862-1431
> e-mail: J.Raeder at unh.edu
> WWW: http://mhd.sr.unh.edu/~jraeder/tmp.homepage
> --------------------------------------------------------------------------------------------------
>  
>  
>  
> On May 3, 2017, at 2:51 PM, Maciolek, Mark <Mark.Maciolek at unh.edu> wrote:
>  
> Hi,
>  
> I have opened a case with Cray, but I am not sure they will respond since our support contract expired last year.
>  
> Mark
>  
> --Mark Maciolek
> Network Administrator
> Morse Hall Rm 338
> http://www.unh.edu/research/support-units/research-computing-center
>  
> From: W Douglas Cramer [mailto:D.Cramer at unh.edu]
> Sent: Wednesday, May 3, 2017 8:33 AM
> To: Maciolek, Mark <Mark.Maciolek at unh.edu>
> Cc: Raeder, Joachim <j.raeder at unh.edu>
> Subject: Re: [Trillian-users] trillian
>  
> Mark,
> 
> It looks like the problems with the scheduler have reappeared.
> 
> Doug
>  
> On Tue, Apr 25, 2017 at 10:36 AM, Maciolek, Mark <Mark.Maciolek at unh.edu> wrote:
> Matt,
> 
> This is what the qstat command shows for a few jobs that are in hold status:
> 
> comment = job held, too many failed attempts to run
> 
> I restarted pbs on trillian and so far only 5 jobs have restarted
> 
> mark
> 
> --Mark Maciolek
> Network Administrator
> Morse Hall Rm 338
> http://www.unh.edu/research/support-units/research-computing-center
> 
> -----Original Message-----
> From: Gorby, Matthew
> Sent: Tuesday, April 25, 2017 10:03 AM
> To: Ethan Stewart <es2025 at sr.unh.edu>; Maciolek, Mark <Mark.Maciolek at unh.edu>; Germaschewski, Kai <Kai.Germaschewski at unh.edu>; opss <ops at sr.unh.edu>
> Subject: Re: trillian
> 
> Hello,
> 
> The problems on Trillian persist.  My existing job is still in 'H' status.  When I tried to start a new job it went from 'R' to 'Q' to 'H' rapidly, and when I tried the interactive job it did the same as last time: said it was ready and then said it was completed immediately.
> 
> Thanks again,
> 
> -Matt
> ________________________________________
> From: Ethan Stewart <es2025 at sr.unh.edu>
> Sent: Monday, April 24, 2017 4:44 PM
> To: Gorby, Matthew; Maciolek, Mark; Germaschewski, Kai; opss
> Subject: Re: trillian
> 
> It looks like the restart of alps didn't fix everything. The logs still
> show too many files open:
> 2017-04-24 16:26:26: [1388] processControlMsg:171: Agent received
> placement message on fd 212
> 2017-04-24 16:26:26: [1388] doAuth:1814: Agent popen failure: Too many
> open files
> 2017-04-24 16:26:26: [1388] setupConn:1839: Agent authentication failure
> with host nid00235 port 607
> 2017-04-24 16:26:26: [21547] get_apsched_info:853: Unable to open
> /ufs/alps_shared/apschedNid: Too many open files
> 
> And sure enough, it has a lot of open files, mostly pointing to
> variations of:
> /var/spool/cray/llm/apsys.21547-20170424t081526.436974
> (10.131.255.254:/snv/245/var)
> 
> I've restarted alps again; you should have enough time to get your job
> started before it starts causing errors again.
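To watch for the "Too many open files" condition before it starts causing errors, one option is to compare a process's open descriptor count against its soft limit. The sketch below is Linux-only and relies on the standard /proc/<pid>/fd and /proc/<pid>/limits files; the function names are hypothetical, and inspecting another user's process (such as the alps daemon) normally requires root.

```python
import os


def parse_soft_nofile(limits_text):
    """Return the soft 'Max open files' limit from /proc/<pid>/limits text.

    Returns None if the line is missing or the limit is 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            soft = line.split()[3]  # fields: Max open files <soft> <hard> files
            return None if soft == "unlimited" else int(soft)
    return None


def open_fd_count(pid):
    """Count the file descriptors currently open in process `pid`."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_report(pid):
    """Return (open descriptors, soft limit) for process `pid`."""
    with open(f"/proc/{pid}/limits") as fh:
        soft = parse_soft_nofile(fh.read())
    return open_fd_count(pid), soft


if __name__ == "__main__" and os.path.isdir("/proc/self/fd"):
    # Demonstrated on this script's own PID; substitute the daemon's PID
    # (an assumption about your system) when running as root.
    used, soft = fd_report(os.getpid())
    print(f"open descriptors: {used}, soft limit: {soft}")
```

A count that keeps climbing toward the soft limit between restarts would be consistent with the descriptor leak suggested by the log lines above.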
> 
> -- Ethan
> 
> On 04/24/2017 04:22 PM, Gorby, Matthew wrote:
> > Hello,
> >
> >
> > When I submit a job to the queue it is again setting it to the held
> > ('H') status.  Also, when I started an interactive job this time it
> > didn't hang but instead immediately did the following:
> >
> >
> > mgorby@trillian:~/runs/bastille.29$ qsub -q workq -I -l nodes=3:ppn=32 -N interactive
> > qsub: waiting for job 35242.sdb to start
> > qsub: job 35242.sdb ready
> >
> >
> > qsub: job 35242.sdb completed
> >
> > It completed without letting me interact with the session at all.
> >
> > Thanks,
> >
> > -Matt
> >
> >
> > ------------------------------------------------------------------------
> > *From:* Mark Maciolek <Mark.Maciolek at unh.edu>
> > *Sent:* Monday, April 24, 2017 3:31 PM
> > *To:* Germaschewski, Kai; opss
> > *Cc:* Gorby, Matthew
> > *Subject:* RE: trillian
> >
> >
> > Hi,
> >
> >
> >
> > Found this in the logs:
> > fopen(/var/log/alps/apsched20170424) failed (Too many open files)
> >
> > 20170424t152733.506065: msgWriter: Error: unable to open tmp log file:
> > /var/spool/cray/llm/unknown.8010-20170424t152733.506050
> >
> > 20170424t152733.506153: msgWriter: msgWriter: ERROR: discarding log
> > message: write to file failed.
> >
> > 2017-04-24 15:27:33: Switching pid 8010 to /var/log/alps/apsched20170424
> >
> > 2017-04-24 15:27:33: fopen(/var/log/alps/apsched20170424) failed (Too
> > many open files)
> >
> > 20170424t152733.506465: msgWriter: Error: unable to open tmp log file:
> > /var/spool/cray/llm/unknown.8010-20170424t152733.506449
> >
> > 20170424t152733.506532: msgWriter: msgWriter: ERROR: discarding log
> > message: write to file failed.
> >
> >
> >
> > Restarted alps on the system and it seems happier now. The 4 nodes that
> > were down for repair do not show as down in the apstat command.
> >
> >
> >
> > mark
> >
> >
> >
> > *From:* Kai Germaschewski [mailto:kai.germaschewski at unh.edu]
> > *Sent:* Monday, April 24, 2017 1:52 PM
> > *To:* RCC Ops <ops at sr.unh.edu>
> > *Subject:* Fwd: trillian
> >
> >
> >
> > I figured I might want to forward this to rcc-ops, in case Mark isn't in
> > or is busy.
> >
> >
> >
> > --Kai
> >
> >
> >
> > ---------- Forwarded message ---------
> > From: Kai Germaschewski <kai.germaschewski at unh.edu>
> > Date: Mon, Apr 24, 2017 at 12:27 PM
> > Subject: trillian
> > To: Mark Maciolek <mlm at sr.unh.edu>
> >
> >
> >
> > Hi Mark,
> >
> >
> >
> > it seems that trillian is having issues again -- I have an interactive
> > job running, but "aprun" hangs when I try to actually run something, and
> > if I submit a new run, it says it's starting but never actually gets going.
> >
> >
> >
> > --Kai
> >
> >
> >
> 
> _______________________________________________
> Trillian-users mailing list
> Trillian-users at lists.sr.unh.edu
> http://lists.sr.unh.edu/mailman/listinfo/trillian-users
> 
> 
> 
> -- 
> W. Douglas Cramer
> Space Science Center
> University of New Hampshire
> 245B Morse Hall
> 8 College Rd
> Durham, NH 03824
> Phone: 603-862-1293
> Email: D.Cramer at unh.edu
> 
> "What can be asserted without evidence can be dismissed without evidence." -- Christopher Hitchens, 2007
>  