[Trillian-users] trillian

Tue Apr 25 11:04:36 EDT 2017

I qdel'd my job and submitted a new one to the queue.  It did the same thing: 'R' to 'Q' to 'R' to 'Q' and then to 'H'.  I watched it as I spammed qstat.

-Matt

________________________________________
From: Gorby, Matthew
Sent: Tuesday, April 25, 2017 11:01 AM
To: Maciolek, Mark; Ethan Stewart; Germaschewski, Kai; opss
Cc: trillian-users at lists.sr.unh.edu
Subject: Re: trillian

Below is the printout from qstat -a and xtnodestat.  Something deeper seems to be wrong: half the machine is empty, but look at the queue.  Also, what is going on with the interactive run going from ready to complete with nothing happening in between?

Thanks,

-Matt

mgorby at trillian:~$ qstat -a

sdb:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
33956.sdb       fs1036   workq    3Dtest        --   15 120  240gb 8888: H   --
34043.sdb       fs1036   workq    3Dtest        --   15 120  150gb 8888: H   --
34048.sdb       fs1036   workq    jw2D         6833   2  32    --  888:0 R 00:29
34056.sdb       fs1036   workq    jw2D          --    2  32   16gb 888:0 H   --
34128.sdb       yilongya workq    trillian-j    --   40 160    --  48:00 Q   --
34947.sdb       jxyu     workq    y.vdWFM      7096   4 128    --  395:5 R 00:28
35155.sdb       jxyu     workq    y.MT         6906   8 256    --  367:5 R 00:29
35172.sdb       kgklein  workq    max_D_alt.   6912   2  64    --  500:0 R 00:29
35176.sdb       pai      workq    st.pbs      23000  40 128    --  200:0 H   --
35189.sdb       jxyu     workq    y.MWT        1006   4 128    --  295:5 H   --
35196.sdb       jxyu     workq    y.MC.MB02   10394   1  32    --  295:5 R 00:13
35197.sdb       jxyu     workq    y.MC.MB02    6583   1  32    --  295:5 R 00:29
35233.sdb       jxyu     workq    y.MC.T01      --    1  32    --  295:5 H   --
35234.sdb       jxyu     workq    y.MC.T01      --    1  32    --  295:5 H   --
35235.sdb       jxyu     workq    y.MC.T01      --    1  32    --  295:5 H   --
35245.sdb       mgorby   workq    bastilleDa    --    3  96    --  168:0 H   --
35255.sdb       liang    workq    e68           --   16 512    --  24:00 H   --
35262.sdb       sambid   workq    run_geantM    --    1  16    --  01:00 E   --
35263.sdb       pai      workq    st.pbs        --   40 128    --  200:0 H   --

mgorby at trillian:~$ xtnodestat
Current Allocation Status at Tue Apr 25 10:59:09 2017

     C0-0     C1-0     C2-0
  n3 c---     cf--     AAS-
  n2 cb--     cf--     AAS-
  n1 ----     ----     --S-
c2n0 ----     ----     --S-
  n3     --A-     -SAA     --AA
  n2     --AA     -SAA     --AA
  n1     AA-A     -S-A     ---A
c1n0     Aa-A     -S-A     -A--
  n3 SA--     AA--     AA--
  n2 SA--     AA--     AA--
  n1 SA--     eA--     dd--
c0n0 SA--     eA--     dd--
    s01234567 01234567 01234567

Legend:
   nonexistent node                  S  service node
;  free interactive compute node     -  free batch compute node
A  allocated interactive or ccm node ?  suspect compute node
W  waiting or non-running job        X  down compute node
Y  down or admindown service node    Z  admindown compute node

Available compute nodes:          0 interactive,         80 batch

Job ID      User     Size  Age      State      command line
--- ------- -------- ----- -------- ------- ------------------------------
a   124626  jxyu     1     0h30m    run      spinmc_mpi.x
b   124643  jxyu     1     0h15m    run      spinmc_mpi.x
c   124632  jxyu     4     0h30m    run      vasp.541.ncl
d   124633  jxyu     4     0h30m    run      vasp.541.ncl
e   124631  fs1036   2     0h30m    run      main
f   124634  kgklein  2     0h30m    run      ALPS.e
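
For anyone who wants to put a number on the mismatch between the queue and the free nodes, here is a rough Python sketch that tallies the state column of a `qstat -a` listing. It assumes the column layout shown in the listing above (state letter in the second-to-last column, job IDs of the form `<number>.sdb`); the function name `tally_job_states` is made up for illustration and is not part of PBS.

```python
from collections import Counter

def tally_job_states(qstat_text):
    """Count jobs per state letter in `qstat -a` style output.

    Assumes the layout shown in this thread: 11 whitespace-separated
    columns per job row, with the state letter second to last.
    """
    counts = Counter()
    for line in qstat_text.splitlines():
        fields = line.split()
        # Job rows start with a numeric ID such as "35245.sdb";
        # header and separator lines are skipped by this check.
        if len(fields) >= 11 and fields[0].split(".")[0].isdigit():
            counts[fields[-2]] += 1
    return counts

# A few rows copied from the listing above.
SAMPLE = """\
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
35245.sdb       mgorby   workq    bastilleDa    --    3  96    --  168:0 H   --
35255.sdb       liang    workq    e68           --   16 512    --  24:00 H   --
35196.sdb       jxyu     workq    y.MC.MB02   10394   1  32    --  295:5 R 00:13
35262.sdb       sambid   workq    run_geantM    --    1  16    --  01:00 E   --
"""

if __name__ == "__main__":
    for state, n in sorted(tally_job_states(SAMPLE).items()):
        print(state, n)
```

Counting the full listing above by hand gives 11 jobs in 'H', 6 in 'R', 1 in 'Q' and 1 in 'E', while xtnodestat reports 80 free batch nodes. A job name containing spaces would break the simple column split, so treat this as a quick check only.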
________________________________________
From: Maciolek, Mark
Sent: Tuesday, April 25, 2017 10:36 AM
To: Gorby, Matthew; Ethan Stewart; Germaschewski, Kai; opss
Cc: trillian-users at lists.sr.unh.edu
Subject: RE: trillian

Matt,

This is what the qstat command shows for a few jobs that are in hold status:

comment = job held, too many failed attempts to run

I restarted PBS on trillian, and so far only 5 jobs have restarted.

mark

--Mark Maciolek
Network Administrator
Morse Hall Rm 338
http://www.unh.edu/research/support-units/research-computing-center

-----Original Message-----
From: Gorby, Matthew
Sent: Tuesday, April 25, 2017 10:03 AM
To: Ethan Stewart <es2025 at sr.unh.edu>; Maciolek, Mark <Mark.Maciolek at unh.edu>; Germaschewski, Kai <Kai.Germaschewski at unh.edu>; opss <ops at sr.unh.edu>
Subject: Re: trillian

Hello,

The problems on Trillian persist.  My existing job is still in 'H' status.  When I tried to start a new job, it went rapidly from 'R' to 'Q' to 'H', and when I tried an interactive job it did the same as last time: it reported ready and then immediately reported completed.

Thanks again,

-Matt
________________________________________
From: Ethan Stewart <es2025 at sr.unh.edu>
Sent: Monday, April 24, 2017 4:44 PM
To: Gorby, Matthew; Maciolek, Mark; Germaschewski, Kai; opss
Subject: Re: trillian

It looks like the restart of alps didn't fix everything. The logs still
show too many files open:
2017-04-24 16:26:26: [1388] processControlMsg:171: Agent received placement message on fd 212
2017-04-24 16:26:26: [1388] doAuth:1814: Agent popen failure: Too many open files
2017-04-24 16:26:26: [1388] setupConn:1839: Agent authentication failure with host nid00235 port 607
2017-04-24 16:26:26: [21547] get_apsched_info:853: Unable to open /ufs/alps_shared/apschedNid: Too many open files

And sure enough, it has a lot of open files, mostly pointing to
variations of:
/var/spool/cray/llm/apsys.21547-20170424t081526.436974
(10.131.255.254:/snv/245/var)

I've restarted alps again; you should have enough time to get your job
started before it starts causing errors again.
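
For reference, one way to watch for this kind of descriptor leak on a Linux node is to read /proc directly. The following is a minimal Python sketch, assuming /proc is mounted and that you have permission to read the target process's entries (root, for a daemon owned by another user). The function names `open_fd_targets` and `soft_fd_limit` are made up for illustration and are not part of ALPS.

```python
import os
from collections import Counter

def open_fd_targets(pid):
    """Return a Counter of what each open descriptor of `pid` points to."""
    fd_dir = f"/proc/{pid}/fd"
    targets = Counter()
    for name in os.listdir(fd_dir):
        try:
            # Each entry is a symlink to the open file, socket or pipe.
            targets[os.readlink(os.path.join(fd_dir, name))] += 1
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return targets

def soft_fd_limit(pid):
    """Return the soft 'Max open files' limit of `pid` as a string.

    The value may be a number or 'unlimited'; None if the line is missing.
    """
    with open(f"/proc/{pid}/limits") as fh:
        for line in fh:
            if line.startswith("Max open files"):
                return line.split()[3]
    return None

if __name__ == "__main__":
    pid = os.getpid()  # replace with the PID of the daemon to inspect
    targets = open_fd_targets(pid)
    print("open descriptors:", sum(targets.values()))
    print("soft limit:", soft_fd_limit(pid))
    for target, count in targets.most_common(5):
        print(count, target)
```

Run periodically against the suspect daemon (for example PID 21547 from the log above, while it was still running), this would show whether the descriptor count is creeping toward the limit and which paths account for most of it.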

-- Ethan

On 04/24/2017 04:22 PM, Gorby, Matthew wrote:
> Hello,
>
>
> When I submit a job to the queue it is again setting it to the held
> ('H') status.  Also, when I started an interactive job this time it
> didn't hang but instead immediately did the following:
>
>
> mgorby at trillian:~/runs/bastille.29$ qsub -q workq -I -l nodes=3:ppn=32 -N interactive
> qsub: waiting for job 35242.sdb to start
> qsub: job 35242.sdb ready
>
>
> qsub: job 35242.sdb completed
>
> It completed without letting me interact with the session at all.
>
> Thanks,
>
> -Matt
>
>
> ------------------------------------------------------------------------
> *From:* Mark Maciolek <Mark.Maciolek at unh.edu>
> *Sent:* Monday, April 24, 2017 3:31 PM
> *To:* Germaschewski, Kai; opss
> *Cc:* Gorby, Matthew
> *Subject:* RE: trillian
>
>
> Hi,
>
>
>
> Found this in the logs:
> fopen(/var/log/alps/apsched20170424) failed (Too many open files)
>
> 20170424t152733.506065: msgWriter: Error: unable to open tmp log file: /var/spool/cray/llm/unknown.8010-20170424t152733.506050
>
> 20170424t152733.506153: msgWriter: msgWriter: ERROR: discarding log message: write to file failed.
>
> 2017-04-24 15:27:33: Switching pid 8010 to /var/log/alps/apsched20170424
>
> 2017-04-24 15:27:33: fopen(/var/log/alps/apsched20170424) failed (Too many open files)
>
> 20170424t152733.506465: msgWriter: Error: unable to open tmp log file: /var/spool/cray/llm/unknown.8010-20170424t152733.506449
>
> 20170424t152733.506532: msgWriter: msgWriter: ERROR: discarding log message: write to file failed.
>
>
>
> Restarted alps on the system; it seems happier now. The 4 nodes that were
> down for repair do not show as down in the apstat output.
>
>
>
> mark
>
>
>
> *From:*Kai Germaschewski [mailto:kai.germaschewski at unh.edu]
> *Sent:* Monday, April 24, 2017 1:52 PM
> *To:* RCC Ops <ops at sr.unh.edu>
> *Subject:* Fwd: trillian
>
>
>
> I figured I might want to forward this to rcc-ops, in case Mark isn't in
> / busy.
>
>
>
> --Kai
>
>
>
> ---------- Forwarded message ---------
> From: Kai Germaschewski <kai.germaschewski at unh.edu>
> Date: Mon, Apr 24, 2017 at 12:27 PM
> Subject: trillian
> To: Mark Maciolek <mlm at sr.unh.edu>
>
>
>
> Hi Mark,
>
>
>
> it seems that trillian is having issues again -- I have an interactive
> job running, but "aprun" hangs when I try to actually run something, and
> if I submit a new run, it says it's starting but never actually gets going.
>
>
>
> --Kai
>
>
>