Hi Everyone –

 

In the past, a few people have noted that there is no way to view the specific processes and resources in use on individual Premise compute nodes while a job is running, akin to what the “top” Linux utility provides.  Slurm utilities like “sstat” can give average and max values for an entire job, but not real-time, process-level details.  Likewise, “top” run on the head node can only monitor the head node, not the compute nodes.

 

To address this need, we’ve developed a program named “slurm-monitor”, now available on Premise in the linuxbrew/colsa module.  Usage is as follows:

 

“slurm-monitor <Job ID>”

 

where “<Job ID>” is the ID of the Slurm job (reported during job submission, or at any time by running the “squeue” command).
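
For anyone who would like to script this, here is a minimal sketch of capturing the job ID at submission time and handing it to slurm-monitor. It assumes the module is loaded with the usual “module load” command and that sbatch prints its normal “Submitted batch job <ID>” confirmation line; the helper name “job_id_from_sbatch” and the script name “my_job.slurm” are placeholders for illustration, not part of slurm-monitor.

```shell
# Hypothetical helper (not part of slurm-monitor): read sbatch's confirmation
# line, normally "Submitted batch job <ID>", and print only the job ID.
job_id_from_sbatch() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Sketch of a session on Premise. "my_job.slurm" stands in for your own
# batch script, and "module load" is assumed to be how the module is enabled.
#
#   module load linuxbrew/colsa
#   job_id=$(sbatch my_job.slurm | job_id_from_sbatch)
#   squeue -u "$USER"            # list your jobs and confirm the ID
#   slurm-monitor "$job_id"
```

If the job was submitted earlier, the ID in the first column of the “squeue” output can be passed to slurm-monitor directly.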

 

slurm-monitor is a wrapper for “top”: all information “top” normally provides is shown for a node running the specified job, with additional Slurm job details at the bottom of the screen.  If the job is running across multiple nodes, the currently monitored node can be cycled using the “[” and “]” keys.  A screenshot of slurm-monitor is attached.

 

We will also be incorporating additional usage information (GPU, disk, memory, etc.) into this program in the future.

 

Thanks –

 

Toni Westbrook

Computational Scientist

Research Computing Center, College of Life Sciences and Agriculture

University of New Hampshire

Office: 436 Gregg Hall