[Trillian-users] trillian

Tue Apr 25 11:15:02 EDT 2017

Mine went from Q to R to E to H.
So instead of exiting, it ended up in H.

Furthermore, I deleted my run using "qdel", as it seemed to be going
nowhere. Most of the held runs in the listing have similar comments.

I was hoping to see an error file, but while the job is held it does not
generate one, so figuring out what is wrong has been challenging.

-Sam-

On Tue, Apr 25, 2017 at 11:04 AM, Gorby, Matthew <Matthew.Gorby at unh.edu>
wrote:

> I qdel'd my job and submitted a new one to the queue.  It did the same
> thing: 'R' to 'Q' to 'R' to 'Q' and then to 'H'.  I watched it as I spammed
> qstat.
>
> -Matt
>
> ________________________________________
> From: Gorby, Matthew
> Sent: Tuesday, April 25, 2017 11:01 AM
> To: Maciolek, Mark; Ethan Stewart; Germaschewski, Kai; opss
> Cc: trillian-users at lists.sr.unh.edu
> Subject: Re: trillian
>
> Below is the printout from qstat -a and xtnodestat.  There seems to be
> something deeper wrong.  Half the machine is empty but look at the queue.
> Also, what is going on with the interactive run going from ready to
> complete with nothing happening in between?
>
> Thanks,
>
> -Matt
>
> mgorby at trillian:~$ qstat -a
>
> sdb:
>                                                             Req'd  Req'd   Elap
> Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
> --------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
> 33956.sdb       fs1036   workq    3Dtest        --   15 120  240gb 8888: H   --
> 34043.sdb       fs1036   workq    3Dtest        --   15 120  150gb 8888: H   --
> 34048.sdb       fs1036   workq    jw2D         6833   2  32    --  888:0 R 00:29
> 34056.sdb       fs1036   workq    jw2D          --    2  32   16gb 888:0 H   --
> 34128.sdb       yilongya workq    trillian-j    --   40 160    --  48:00 Q   --
> 34947.sdb       jxyu     workq    y.vdWFM      7096   4 128    --  395:5 R 00:28
> 35155.sdb       jxyu     workq    y.MT         6906   8 256    --  367:5 R 00:29
> 35172.sdb       kgklein  workq    max_D_alt.   6912   2  64    --  500:0 R 00:29
> 35176.sdb       pai      workq    st.pbs      23000  40 128    --  200:0 H   --
> 35189.sdb       jxyu     workq    y.MWT        1006   4 128    --  295:5 H   --
> 35196.sdb       jxyu     workq    y.MC.MB02   10394   1  32    --  295:5 R 00:13
> 35197.sdb       jxyu     workq    y.MC.MB02    6583   1  32    --  295:5 R 00:29
> 35233.sdb       jxyu     workq    y.MC.T01      --    1  32    --  295:5 H   --
> 35234.sdb       jxyu     workq    y.MC.T01      --    1  32    --  295:5 H   --
> 35235.sdb       jxyu     workq    y.MC.T01      --    1  32    --  295:5 H   --
> 35245.sdb       mgorby   workq    bastilleDa    --    3  96    --  168:0 H   --
> 35255.sdb       liang    workq    e68           --   16 512    --  24:00 H   --
> 35262.sdb       sambid   workq    run_geantM    --    1  16    --  01:00 E   --
> 35263.sdb       pai      workq    st.pbs        --   40 128    --  200:0 H   --
>
> mgorby at trillian:~$ xtnodestat
> Current Allocation Status at Tue Apr 25 10:59:09 2017
>
>      C0-0     C1-0     C2-0
>   n3 c---     cf--     AAS-
>   n2 cb--     cf--     AAS-
>   n1 ----     ----     --S-
> c2n0 ----     ----     --S-
>   n3     --A-     -SAA     --AA
>   n2     --AA     -SAA     --AA
>   n1     AA-A     -S-A     ---A
> c1n0     Aa-A     -S-A     -A--
>   n3 SA--     AA--     AA--
>   n2 SA--     AA--     AA--
>   n1 SA--     eA--     dd--
> c0n0 SA--     eA--     dd--
>     s01234567 01234567 01234567
>
> Legend:
>    nonexistent node                  S  service node
> ;  free interactive compute node     -  free batch compute node
> A  allocated interactive or ccm node ?  suspect compute node
> W  waiting or non-running job        X  down compute node
> Y  down or admindown service node    Z  admindown compute node
>
> Available compute nodes:          0 interactive,         80 batch
>
>
> Job ID      User     Size  Age      State      command line
> --- ------- -------- ----- -------- ------- ------------------------------
> a   124626  jxyu     1     0h30m    run      spinmc_mpi.x
> b   124643  jxyu     1     0h15m    run      spinmc_mpi.x
> c   124632  jxyu     4     0h30m    run      vasp.541.ncl
> d   124633  jxyu     4     0h30m    run      vasp.541.ncl
> e   124631  fs1036   2     0h30m    run      main
> f   124634  kgklein  2     0h30m    run      ALPS.e
> ________________________________________
> From: Maciolek, Mark
> Sent: Tuesday, April 25, 2017 10:36 AM
> To: Gorby, Matthew; Ethan Stewart; Germaschewski, Kai; opss
> Cc: trillian-users at lists.sr.unh.edu
> Subject: RE: trillian
>
> Matt,
>
> This is what the qstat command shows for a few jobs that are in hold
> status:
>
> comment = job held, too many failed attempts to run
>
> I restarted PBS on trillian, and so far only 5 jobs have restarted.
>
> mark
>
> --Mark Maciolek
> Network Administrator
> Morse Hall Rm 338
> http://www.unh.edu/research/support-units/research-computing-center
>
> -----Original Message-----
> From: Gorby, Matthew
> Sent: Tuesday, April 25, 2017 10:03 AM
> To: Ethan Stewart <es2025 at sr.unh.edu>; Maciolek, Mark <
> Mark.Maciolek at unh.edu>; Germaschewski, Kai <Kai.Germaschewski at unh.edu>;
> opss <ops at sr.unh.edu>
> Subject: Re: trillian
>
> Hello,
>
> The problems on Trillian persist.  My existing job is still in 'H'
> status.  When I tried to start a new job it went from 'R' to 'Q' to 'H'
> rapidly, and when I tried the interactive job it did the same as last time:
> said it was ready and then said it was completed immediately.
>
> Thanks again,
>
> -Matt
> ________________________________________
> From: Ethan Stewart <es2025 at sr.unh.edu>
> Sent: Monday, April 24, 2017 4:44 PM
> To: Gorby, Matthew; Maciolek, Mark; Germaschewski, Kai; opss
> Subject: Re: trillian
>
> It looks like the restart of alps didn't fix everything. The logs still
> show too many files open:
> 2017-04-24 16:26:26: [1388] processControlMsg:171: Agent received placement message on fd 212
> 2017-04-24 16:26:26: [1388] doAuth:1814: Agent popen failure: Too many open files
> 2017-04-24 16:26:26: [1388] setupConn:1839: Agent authentication failure with host nid00235 port 607
> 2017-04-24 16:26:26: [21547] get_apsched_info:853: Unable to open /ufs/alps_shared/apschedNid: Too many open files
>
> And sure enough, it has a lot of open files, mostly pointing to
> variations of:
> /var/spool/cray/llm/apsys.21547-20170424t081526.436974 (10.131.255.254:/snv/245/var)
>
> I've restarted alps again; you should have enough time to get your job
> started before it starts causing errors again.
>
> -- Ethan
>
> On 04/24/2017 04:22 PM, Gorby, Matthew wrote:
> > Hello,
> >
> >
> > When I submit a job to the queue it is again setting it to the held
> > ('H') status.  Also, when I started an interactive job this time it
> > didn't hang but instead immediately did the following:
> >
> >
> > mgorby at trillian:~/runs/bastille.29$ qsub -q workq -I -l nodes=3:ppn=32 -N interactive
> > qsub: waiting for job 35242.sdb to start
> > qsub: job 35242.sdb ready
> >
> >
> > qsub: job 35242.sdb completed
> >
> > It completed without letting me interact with the session at all.
> >
> > Thanks,
> >
> > -Matt
> >
> >
> > ------------------------------------------------------------------------
> > *From:* Mark Maciolek <Mark.Maciolek at unh.edu>
> > *Sent:* Monday, April 24, 2017 3:31 PM
> > *To:* Germaschewski, Kai; opss
> > *Cc:* Gorby, Matthew
> > *Subject:* RE: trillian
> >
> >
> > Hi,
> >
> >
> >
> > Found this in the logs:
> > fopen(/var/log/alps/apsched20170424) failed (Too many open files)
> >
> > 20170424t152733.506065: msgWriter: Error: unable to open tmp log file: /var/spool/cray/llm/unknown.8010-20170424t152733.506050
> >
> > 20170424t152733.506153: msgWriter: msgWriter: ERROR: discarding log message: write to file failed.
> >
> > 2017-04-24 15:27:33: Switching pid 8010 to /var/log/alps/apsched20170424
> >
> > 2017-04-24 15:27:33: fopen(/var/log/alps/apsched20170424) failed (Too many open files)
> >
> > 20170424t152733.506465: msgWriter: Error: unable to open tmp log file: /var/spool/cray/llm/unknown.8010-20170424t152733.506449
> >
> > 20170424t152733.506532: msgWriter: msgWriter: ERROR: discarding log message: write to file failed.
> >
> >
> >
> > Restarted alps on the system; it seems happier now. The 4 nodes that were
> > down for repair do not show as down in the apstat command.
> >
> >
> >
> > mark
> >
> >
> >
> > *From:*Kai Germaschewski [mailto:kai.germaschewski at unh.edu]
> > *Sent:* Monday, April 24, 2017 1:52 PM
> > *To:* RCC Ops <ops at sr.unh.edu>
> > *Subject:* Fwd: trillian
> >
> >
> >
> > I figured I might want to forward this to rcc-ops, in case Mark isn't in
> > / busy.
> >
> >
> >
> > --Kai
> >
> >
> >
> > ---------- Forwarded message ---------
> > From: Kai Germaschewski <kai.germaschewski at unh.edu>
> > Date: Mon, Apr 24, 2017 at 12:27 PM
> > Subject: trillian
> > To: Mark Maciolek <mlm at sr.unh.edu>
> >
> >
> >
> > Hi Mark,
> >
> >
> >
> > it seems that trillian is having issues again -- I have an interactive
> > job running, but "aprun" hangs when I try to actually run something, and
> > if I submit a new run, it says it's starting but never actually gets
> > going.
> >
> >
> >
> > --Kai
> >
> >
> >
>
> _______________________________________________
> Trillian-users mailing list
> Trillian-users at lists.sr.unh.edu
> http://lists.sr.unh.edu/mailman/listinfo/trillian-users
>