Using qstat

Getting More Information out of Qstat

You can get a lot more information from qstat by using its command line options. On the cluster, type man qstat to read the manual page for full details of the qstat command.

However, the man pages are quite detailed and serve more as a reference for the qstat command, so below are some examples of common usage with the most useful command line options.

These examples use qstat's “Default Format” and “Alternate Format”. The man pages for qstat refer to these two formats often. This is what those formats look like:

The default format has these column headings:

Job ID     Name       User       Time Use  S  Queue
--------   --------   --------   --------  -  -----

The alternate format has these column headings:

                                                          Req'd   Req'd     Elap
Job ID     Username   Queue   Jobname   SessID  NDS  TSK  Memory  Time   S  Time
--------   --------   -----   -------   ------  ---  ---  ------  -----  -  ----

In the examples below some command line options will display in default format and some in alternate format.

Qstat Examples

Example of just running qstat with no command line options:

$ qstat

Job ID          Name               User       Time Use  S  Queue
-------------   ----------------   --------   --------  -  -----
6263.hpcnode0   fingerprint_li.sh  u999999     8805:35  R  workq           
6264.hpcnode0   fingerprint_lm.sh  u999999           0  Q  workq           
6266.hpcnode0   fingerprint_rt.sh  u999999    386:38:3  R  smallq          
6267.hpcnode0   fingerprint_rm.sh  u999999    385:46:4  R  smallq
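
When the listing is long, a tally of jobs per state can be easier to read than the raw output. The sketch below pipes default-format output through awk and counts the S column. The helper name count_states is hypothetical, and the sketch assumes job names contain no spaces, so that the state is always the fifth field. The sample rows are copied from the listing above so the snippet runs without a cluster.

```shell
# Hypothetical helper: tally jobs by state from default-format qstat output.
# Assumes job names contain no spaces, so the state (S) is field 5.
count_states() {
    awk '$1 ~ /^[0-9]+\./ { n[$5]++ }          # only rows that start with a job ID
         END { for (s in n) print s, n[s] }' | sort
}

# On the cluster you would run:  qstat | count_states
# Here the sample rows from above stand in for real output.
count_states <<'EOF'
Job ID          Name               User       Time Use  S  Queue
-------------   ----------------   --------   --------  -  -----
6263.hpcnode0   fingerprint_li.sh  u999999     8805:35  R  workq
6264.hpcnode0   fingerprint_lm.sh  u999999           0  Q  workq
6266.hpcnode0   fingerprint_rt.sh  u999999    386:38:3  R  smallq
6267.hpcnode0   fingerprint_rm.sh  u999999    385:46:4  R  smallq
EOF
```

For the sample rows this prints "Q 1" and "R 3", one per line.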

If you just wish to see your own jobs, which is the most likely case, use the -u option followed by your username:

$ qstat -u u999777
                                       Req'd  State  Elap
Job ID          Username   Jobname     Time          Time
--------------- --------   ----------  ------ -----  -----
1256313.hpcnod* u999777    test_5      200:0  R      00:52
1256314.hpcnod* u999777    test_2      200:0  Q      --

If you use the -p option, the “Time Use” column is replaced with the percentage completed for each job:

$ qstat -p

Job ID          Name               User      % done   S  Queue
--------------  ----------------   --------  ------   -  -----
6263.hpcnode0   fingerprint_li.sh  u999999      90%   R  workq           
6264.hpcnode0   fingerprint_lm.sh  u999999       0    Q  workq
6266.hpcnode0   fingerprint_rt.sh  u999999      40%   R  smallq          
6267.hpcnode0   fingerprint_rm.sh  u999999      35%   R  smallq

For a normal job, this is the percentage of the allocated CPU time that has been used. For a job array, it is the percentage of subjobs completed.

To list the percentage completed for just the jobs in the “smallq” queue, append the name of the queue:

$ qstat -p smallq

Job ID          Name               User      % done   S  Queue
--------------  ----------------   --------  ------   -  -----
6266.hpcnode0   fingerprint_rt.sh  999999       40%   R  smallq
6267.hpcnode0   fingerprint_rm.sh  999999       35%   R  smallq

Example using the -a option to show all queued and running jobs in the alternate format, combined with -n to show the nodes each job is executing on and -1 to keep each job on a single line:

$ qstat -an1
                                                             Req'd   Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK  Memory  Time  S Time
--------------  -------- -----    ---------  ------ --- ---  ------  ----- - -----
69580.hpcnode0  999999   workq    hpc-hill2   22234   1   8     5gb  120:0 R 23:47 hpcnode6/2*8
69581.hpcnode0  999999   workq    hpc-hill2   22698   1   8     5gb  120:0 R 23:47 hpcnode6/3*8
.....
65566.hpcnode0  999999   smallq   SRR907711   18581   1  16    10gb  100:0 R 07:47 hpcnode07/0*16
65574.hpcnode0  999999   smallq   SRR907711  547551   1  16    50gb  100:0 R 05:18 hpcnode14/0*16
65583.hpcnode0  999999   smallq   SRR220853  264654   1  16    20gb  100:0 Q   --   --

Appending a destination queue limits the output to just that queue:

$ qstat -an1 workq
                                                             Req'd   Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK  Memory  Time  S Time
--------------  -------- -----    ---------  ------ --- ---  ------  ----- - -----
69580.hpcnode0  999999   workq    hpc-hill2   22234   1   8     5gb  120:0 R 23:47 hpcnode6/2*8
69581.hpcnode0  999999   workq    hpc-hill2   22698   1   8     5gb  120:0 R 23:47 hpcnode6/3*8

Example of obtaining complete information on a specific job by using the -f option:

$ qstat -f 1094117.hpcnode0

Job Id: 1094117.hpcnode0
Job_Name = fingerprint_li.sh
....
comment = Job run at Mon Feb 24 at 14:46 on (hpcnode3:mem=44040192kb:ncpus=48)
etime = Mon Feb 24 14:46:13 2014
Submit_arguments = fingerprint_li.sh
$

Note: If the job has already finished, you will need to add the -x option to include finished jobs, e.g. qstat -fx 1094117.hpcnode0

Also see the qstat examples in the section on Running Array Jobs.

Using grep with qstat

Here are some examples of using grep to get just the information you need. The name grep comes from the ed editor command g/re/p (globally search for a regular expression and print the matching lines). Here we use it only to search for a simple string in the output of qstat.

$ qstat -f 1126584.hpcnode0 | grep cpu
resources_used.cpupercent = 2396    
resources_used.cput = 36:47:29
resources_used.ncpus = 24           <== Used 24 cpus.                
Resource_List.ncpus = 24            <== Asked for 24 cpus.

In the output above, the resources_used.cpupercent value would be 100 times the number of CPUs used (here 24 x 100 = 2400) if all of those CPUs were busy 100% of the time. The value of 2396 shows that the CPUs were busy almost all of the time, which is very good :-)

Here is another example:

$ qstat -fx 1126585.hpcnode0 | grep cpu
resources_used.cpupercent = 109
resources_used.cput = 01:12:53
resources_used.ncpus = 5
Resource_List.ncpus = 5
Resource_List.select = 1:mem=80gb:ncpus=5

This job asked for 5 CPUs and was given 5. If all 5 had been fully busy for the whole run, cpupercent would be 5 x 100 = 500, but it is only 109. That shows the CPUs were busy only about 1/5 of the time. It is likely that the application used just one CPU at any one time and simply swapped from CPU to CPU. That is not a very efficient use of an HPC :-(
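
The arithmetic in the two examples above can be sketched as a small filter: divide cpupercent by the number of CPUs requested to get the average utilisation per CPU. The helper name cpu_efficiency is hypothetical, and the sketch assumes the attribute lines look like those shown above. The sample lines are fed in with printf so the snippet runs without a cluster.

```shell
# Hypothetical helper: average CPU utilisation per requested CPU, in percent.
# Reads "qstat -f" style attribute lines on stdin.
cpu_efficiency() {
    awk -F' = ' '
        /resources_used\.cpupercent = / { pct = $2 + 0 }
        /Resource_List\.ncpus = /       { ncpus = $2 + 0 }
        END { if (ncpus > 0) printf "%.1f\n", pct / ncpus }
    '
}

# On the cluster you would run:  qstat -fx 1126585.hpcnode0 | cpu_efficiency
printf '%s\n' 'resources_used.cpupercent = 2396' 'Resource_List.ncpus = 24' | cpu_efficiency
printf '%s\n' 'resources_used.cpupercent = 109'  'Resource_List.ncpus = 5'  | cpu_efficiency
```

For the two jobs above this prints 99.8 and 21.8, which matches the "very good" and "about 1/5" readings.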

Examples of looking at the memory usage and the walltime:

$ qstat -f 1126584.hpcnode0 | grep mem
resources_used.mem = 2655840kb
resources_used.vmem = 13701636kb
Resource_List.mem = 256000mb
Resource_List.select = 1:ncpus=24:mem=256000mb
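
The used and requested memory are reported in different units (kb and mb), which makes them hard to compare by eye. The sketch below converts both to kb and prints the used memory as a percentage of the request. The helper name mem_efficiency is hypothetical, and the sketch assumes sizes end in kb, mb or gb as in the output above.

```shell
# Hypothetical helper: memory used as a percentage of memory requested.
# Assumes sizes end in kb, mb or gb; anything else is treated as kb.
mem_efficiency() {
    awk -F' = ' '
        function to_kb(v) {
            v = tolower(v)
            if (v ~ /gb/) return v * 1024 * 1024
            if (v ~ /mb/) return v * 1024
            return v + 0
        }
        /resources_used\.mem = / { used = to_kb($2) }
        /Resource_List\.mem = /  { req  = to_kb($2) }
        END { if (req > 0) printf "%.1f\n", 100 * used / req }
    '
}

# On the cluster you would run:  qstat -f 1126584.hpcnode0 | mem_efficiency
printf '%s\n' 'resources_used.mem = 2655840kb' 'Resource_List.mem = 256000mb' | mem_efficiency
```

For the job above this prints 1.0, meaning the job used roughly 1% of the memory it requested.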

$ qstat -f 1126584.hpcnode0 | grep time
resources_used.walltime = 01:34:55          <== Used just 1.5 hours of time.
Resource_List.walltime = 24:00:00           <== Asked for 24 hours wall time.
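
The same comparison can be made for walltime. This sketch converts both HH:MM:SS values to seconds and prints the used walltime as a percentage of the request. The helper name walltime_used_pct is hypothetical, and the sketch assumes both values are in HH:MM:SS form as shown above.

```shell
# Hypothetical helper: walltime used as a percentage of walltime requested.
# Assumes both values are in HH:MM:SS form.
walltime_used_pct() {
    awk -F' = ' '
        function secs(t,   p) { split(t, p, ":"); return p[1] * 3600 + p[2] * 60 + p[3] }
        /resources_used\.walltime = / { used = secs($2) }
        /Resource_List\.walltime = /  { req  = secs($2) }
        END { if (req > 0) printf "%.1f\n", 100 * used / req }
    '
}

# On the cluster you would run:  qstat -f 1126584.hpcnode0 | walltime_used_pct
printf '%s\n' 'resources_used.walltime = 01:34:55' 'Resource_List.walltime = 24:00:00' | walltime_used_pct
```

For the job above this prints 6.6, so a much shorter walltime request would have been enough.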

And finally, a simple example that shows how to find out which node the code executed on:

$ qstat -fx 1126585.hpcnode0 | grep exec
exec_host = hpcnode11/2*5
exec_vnode = (hpcnode11:mem=83886080kb:ncpus=5)
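
If you want the node names in a script, the exec_host value can be split into a node name and a CPU count. The sketch below assumes the node/index*cpus form shown above, and also assumes that multi-node jobs join several such entries with "+", which is not shown in this document. The helper name exec_nodes is hypothetical.

```shell
# Hypothetical helper: print "node cpus" for each entry in exec_host.
# Assumes entries look like node/index*cpus and are joined with "+" for
# multi-node jobs (the "+" form is an assumption, not shown above).
exec_nodes() {
    awk -F' = ' '/exec_host = / {
        n = split($2, chunks, "[+]")
        for (i = 1; i <= n; i++) {
            node = chunks[i]; sub(/\/.*/, "", node)   # keep the text before the slash
            cpus = 1                                  # no "*N" suffix means one CPU
            if (chunks[i] ~ /\*/) { cpus = chunks[i]; sub(/.*\*/, "", cpus) }
            print node, cpus
        }
    }'
}

# On the cluster you would run:  qstat -fx 1126585.hpcnode0 | exec_nodes
printf '%s\n' 'exec_host = hpcnode11/2*5' | exec_nodes
```

For the job above this prints "hpcnode11 5".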