Using qstat
Getting More Information out of qstat
You can get a lot more information from qstat if you use its command line options. On the cluster, type man qstat to read the manual page for further details on the qstat command.
However, the man page is very detailed and is mainly a reference for the command. So below are some examples of common usage with the most useful command line options.
These examples use qstat's “Default Format” and “Alternate Format”. You will find that the man page for qstat refers to these two formats a lot. This is what those formats look like:
The default format has these column headings:
Job id Name User Time Use S Queue
-------- -------- -------- -------- - -----
The alternate format has these column headings:
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
-------- -------- ----- ------- ------ --- --- ------ ----- - ----
In the examples below some command line options will display in default format and some in alternate format.
Qstat Examples
Example of just running qstat with no command line options:
$ qstat
Job id Name User Time Use S Queue
------------- ---------------- -------- -------- - -----
6263.hpcnode0 fingerprint_li.sh 999999 8805:35 R workq
6264.hpcnode0 fingerprint_lm.sh 999999 0 Q workq
6266.hpcnode0 fingerprint_rt.sh 999999 386:38:3 R smallq
6267.hpcnode0 fingerprint_rm.sh 999999 385:46:4 R smallq
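When the listing is long it can help to summarise it. The following is a minimal sketch, not part of qstat itself, that counts jobs per queue and state from default-format output. The function name count_states is hypothetical, the sample text is copied from the listing above, and the sketch assumes every job line has exactly six whitespace-separated fields.

```python
from collections import Counter

# Sample text copied from the default-format listing above.
SAMPLE = """\
Job id         Name              User     Time Use S Queue
-------------  ----------------  -------- -------- - -----
6263.hpcnode0  fingerprint_li.sh 999999   8805:35  R workq
6264.hpcnode0  fingerprint_lm.sh 999999   0        Q workq
6266.hpcnode0  fingerprint_rt.sh 999999   386:38:3 R smallq
6267.hpcnode0  fingerprint_rm.sh 999999   385:46:4 R smallq
"""


def count_states(qstat_text):
    """Count jobs per (queue, state) in default-format qstat output."""
    counts = Counter()
    for line in qstat_text.splitlines():
        fields = line.split()
        # Assumption: a job line has six fields and its job id starts
        # with a digit. This skips the header and the dashed separator.
        if len(fields) == 6 and fields[0][0].isdigit():
            state, queue = fields[4], fields[5]
            counts[(queue, state)] += 1
    return counts


if __name__ == "__main__":
    for (queue, state), n in sorted(count_states(SAMPLE).items()):
        print(queue, state, n)
```

On the cluster you could feed it real output, for example by saving the output of qstat to a file and reading that file in place of SAMPLE.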
If you use the -p option then the “Time Use” column is replaced with the percentage completed for the job.
$ qstat -p
Job id Name User % done S Queue
-------------- ---------------- -------- ------ - -----
6263.hpcnode0 fingerprint_li.sh 999999 90% R workq
6264.hpcnode0 fingerprint_lm.sh 999999 0 Q workq
6266.hpcnode0 fingerprint_rt.sh 999999 40% R smallq
6267.hpcnode0 fingerprint_rm.sh 999999 35% R smallq
For a normal job, it is the percentage of allocated CPU time used. For a job array this is the percentage of subjobs completed.
Example of listing the percentage completed for just the jobs in “smallq”, by appending the name of the queue:
$ qstat -p smallq
Job id Name User % done S Queue
---------------- ---------------- ------------- -------- - -----
6266.hpcnode0 fingerprint_rt.sh 999999 40% R smallq
6267.hpcnode0 fingerprint_rm.sh 999999 35% R smallq
Example using the -a option to show queued and running jobs in the alternate format, and the -n1 option (-n combined with -1) to show the node that each job is executing on at the end of the same line:
$ qstat -an1
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
-------------- -------- ----- --------- ------ --- --- ------ ----- - -----
69580.hpcnode0 999999 workq hpc-hill2 22234 1 8 5gb 120:0 R 23:47 hpcnode6/2*8
69581.hpcnode0 999999 workq hpc-hill2 22698 1 8 5gb 120:0 R 23:47 hpcnode6/3*8
.....
65566.hpcnode0 999999 smallq SRR907711 18581 1 16 10gb 100:0 R 07:47 hpcnode07/0*16
65574.hpcnode0 999999 smallq SRR907711 547551 1 16 50gb 100:0 R 05:18 hpcnode14/0*16
65583.hpcnode0 999999 smallq SRR220853 264654 1 16 20gb 100:0 Q -- --
Example where appending a destination queue will limit the output to just that queue.
$ qstat -an1 workq
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
-------------- -------- ----- --------- ------ --- --- ------ ----- - -----
69580.hpcnode0 999999 workq hpc-hill2 22234 1 8 5gb 120:0 R 23:47 hpcnode6/2*8
69581.hpcnode0 999999 workq hpc-hill2 22698 1 8 5gb 120:0 R 23:47 hpcnode6/3*8
Example of obtaining complete information on a specific job by using the -f option:
$ qstat -f 1094117.hpcnode0
Job Id: 1094117.hpcnode0
Job_Name = fingerprint_li.sh
....
comment = Job run at Mon Feb 24 at 14:46 on (hpcnode3:mem=44040192kb:ncpus=48)
etime = Mon Feb 24 14:46:13 2014
Submit_arguments = fingerprint_li.sh
$
Note: If the job has already finished you will need to add -x to show finished jobs, e.g. qstat -fx 1094117.hpcnode0
Also see the qstat examples in the section on Running Array Jobs.
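If you want to work with the -f output in a script, the attribute lines can be read into a dictionary. This is a minimal sketch under stated assumptions: the function name parse_qstat_full is hypothetical, attributes are assumed to appear as "name = value" lines, and a line starting with a tab is assumed to continue the previous value. The sample text is taken from the example above.

```python
# Sample text based on the "qstat -f" example above.
SAMPLE = (
    "Job Id: 1094117.hpcnode0\n"
    "    Job_Name = fingerprint_li.sh\n"
    "    etime = Mon Feb 24 14:46:13 2014\n"
    "    Submit_arguments = fingerprint_li.sh\n"
)


def parse_qstat_full(text):
    """Parse 'qstat -f' style text into a dict of attribute -> value."""
    attrs = {}
    last_key = None
    for line in text.splitlines():
        if line.startswith("Job Id:"):
            attrs["Job Id"] = line.split(":", 1)[1].strip()
            last_key = None
        elif " = " in line and not line.startswith("\t"):
            # Split on the first " = " only, so values that contain
            # "=" themselves (e.g. mem=...:ncpus=...) stay intact.
            key, value = line.split(" = ", 1)
            last_key = key.strip()
            attrs[last_key] = value.strip()
        elif line.startswith("\t") and last_key is not None:
            # Assumption: a tab-indented line continues the previous value.
            attrs[last_key] += line.strip()
    return attrs


if __name__ == "__main__":
    job = parse_qstat_full(SAMPLE)
    print(job["Job Id"], job["Job_Name"])
```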
Using grep with qstat
Here are some examples of using grep to get just the information you need. The name grep comes from "global regular expression print", but here we will use it just to search for a simple string in the output of qstat.
$ qstat -f 1126584.hpcnode0 | grep cpu
resources_used.cpupercent = 2396
resources_used.cput = 36:47:29
resources_used.ncpus = 24 <== Used 24 cpus.
Resource_List.ncpus = 24 <== Asked for 24 cpus.
In the above, the resources_used.cpupercent value would be 100 times the number of cpus used if all of those cpus were busy 100% of the time. Here that would be 24 x 100 = 2400, and the value is 2396, so those cpus were used almost all of the time. That’s very good :-)
Here is another example:
$ qstat -fx 1126585.hpcnode0 | grep cpu
resources_used.cpupercent = 109
resources_used.cput = 01:12:53
resources_used.ncpus = 5
Resource_List.ncpus = 5
Resource_List.select = 1:mem=80gb:ncpus=5
They asked for 5 cpus, and 5 cpus were used. If all 5 had been fully busy for the whole run, cpupercent would be 5 x 100 = 500. But it’s only 109, which shows the cpus were only being used about 1/5 of the time. It’s likely that the application used just one CPU at any one time and simply swapped from CPU to CPU. That’s not a very efficient use of an HPC :-(
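The arithmetic in these two examples can be sketched in a few lines. This is an illustrative sketch rather than a site tool: the function name cpu_efficiency is hypothetical, and it simply divides resources_used.cpupercent by 100 times resources_used.ncpus, using the values shown in the two listings above.

```python
import re

# Values copied from the two "grep cpu" listings above.
FIRST = """\
resources_used.cpupercent = 2396
resources_used.cput = 36:47:29
resources_used.ncpus = 24
Resource_List.ncpus = 24
"""

SECOND = """\
resources_used.cpupercent = 109
resources_used.cput = 01:12:53
resources_used.ncpus = 5
Resource_List.ncpus = 5
"""


def cpu_efficiency(qstat_text):
    """Return cpupercent / (100 * ncpus) as a fraction between 0 and 1."""

    def field(name):
        # Find "name = <integer>" at the start of a line.
        pattern = r"^\s*" + re.escape(name) + r"\s*=\s*(\d+)"
        match = re.search(pattern, qstat_text, re.MULTILINE)
        if match is None:
            raise ValueError(name + " not found")
        return int(match.group(1))

    cpupercent = field("resources_used.cpupercent")
    ncpus = field("resources_used.ncpus")
    return cpupercent / (100 * ncpus)


if __name__ == "__main__":
    print(f"{cpu_efficiency(FIRST):.1%}")   # prints 99.8%
    print(f"{cpu_efficiency(SECOND):.1%}")  # prints 21.8%
```

A result near 100% suggests the requested cpus were kept busy. A result near 1/ncpus suggests the program was effectively running on one CPU.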
Example of looking at the memory usage:
$ qstat -f 1126584.hpcnode0 | grep mem
resources_used.mem = 2655840kb
resources_used.vmem = 13701636kb
Resource_List.mem = 256000mb
Resource_List.select = 1:ncpus=24:mem=256000mb
Example of looking at the walltime:
$ qstat -f 1126584.hpcnode0 | grep time
resources_used.walltime = 01:34:55 <== Used just 1.5 hours of time.
Resource_List.walltime = 24:00:00 <== Asked for 24 hours wall time.
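To see how much of the requested memory and walltime this job actually used, the values above can be compared directly. The sketch below uses hypothetical helper names (size_to_kb, hms_to_seconds) and assumes sizes always end in kb, mb, gb or tb with 1024 kb per mb, which matches the 80gb = 83886080kb conversion that qstat itself reports for exec_vnode.

```python
def size_to_kb(size):
    """Convert a size such as '256000mb' to kilobytes (assumes a 2-letter unit)."""
    units = {"kb": 1, "mb": 1024, "gb": 1024 ** 2, "tb": 1024 ** 3}
    number, unit = size[:-2], size[-2:].lower()
    return int(number) * units[unit]


def hms_to_seconds(hms):
    """Convert 'HH:MM:SS' to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


if __name__ == "__main__":
    # Values copied from the listings above.
    mem_used = size_to_kb("2655840kb") / size_to_kb("256000mb")
    time_used = hms_to_seconds("01:34:55") / hms_to_seconds("24:00:00")
    print(f"memory used:   {mem_used:.1%} of request")   # 1.0%
    print(f"walltime used: {time_used:.1%} of request")  # 6.6%
```

Here the job used about 1% of the memory and under 7% of the walltime it asked for, so both requests could be reduced considerably.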
And finally, a simple example that shows how to find out which node the code executed on:
$ qstat -fx 1126585.hpcnode0 | grep exec
exec_host = hpcnode11/2*5
exec_vnode = (hpcnode11:mem=83886080kb:ncpus=5)
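If you need the node name in a script, the exec_host string can be split apart. This is a sketch under assumptions: the function name parse_exec_host is hypothetical, the string is assumed to have the form host/index*ncpus with multiple chunks joined by "+", and a missing "*ncpus" part is assumed to mean one cpu.

```python
def parse_exec_host(exec_host):
    """Split an exec_host string into (host, index, ncpus) tuples."""
    chunks = []
    for chunk in exec_host.split("+"):
        host, _, rest = chunk.partition("/")
        index, _, ncpus = rest.partition("*")
        # Assumption: no "*ncpus" suffix means a single cpu.
        chunks.append((host, int(index), int(ncpus) if ncpus else 1))
    return chunks


if __name__ == "__main__":
    for host, index, ncpus in parse_exec_host("hpcnode11/2*5"):
        print(host, ncpus)  # prints: hpcnode11 5
```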