Get Info on the Queues

You can get an up-to-date summary of the nodes, queues and jobs by visiting the HPC Status page. For more detailed information, for example to use in your job scripts, use the qstat command.

Obtaining Information on the Queues Available

To obtain a detailed list of the queues and their limits use qstat.

The HPC has several job queues, such as smallq and workq, each with different resource limits. To obtain a list of all the queues run the command below. In this example there are 28 jobs running in smallq, 5 jobs running in workq and 3 jobs queued in workq.

$ qstat -Q 

Queue        Max   Tot Ena Str   Que   Run   Hld   Wat   Trn   Ext Type
---------- ----- ----- --- --- ----- ----- ----- ----- ----- ----- ----
smallq         0    28 yes yes     0    28     0     0     0     0 Exec
expressq       0     0 yes yes     0     0     0     0     0     0 Exec
workq          0     8 yes yes     3     5     0     0     0     0 Exec
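If you want to use this listing in a script, a small filter can pick out the queues that have jobs waiting. The following is only a sketch: the function name queues_with_waiting_jobs is made up for illustration, and it assumes the column layout shown above, with the queue name in column 1 and the Que count in column 6.

```shell
# Print "queue count" for every queue that has jobs waiting.
# Reads `qstat -Q` output on stdin.
# Assumption: column 1 is the queue name and column 6 ("Que") is the
# number of queued jobs, as in the listing above.
queues_with_waiting_jobs() {
    # Header and separator lines are skipped because their 6th field
    # is not a plain number.
    awk '$6 ~ /^[0-9]+$/ && $6 > 0 { print $1, $6 }'
}
```

For the listing above, qstat -Q | queues_with_waiting_jobs would print workq 3.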

To obtain full information on a single queue, for example smallq, use the command below. This is the best way to obtain up-to-date information on the queues, as we may modify the queue maximum limits to manage the resources.

$ qstat -Qf smallq

Queue: smallq
queue_type = Execution
total_jobs = 28
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:28 Exiting:0 Begun:0 
resources_max.mem = 32gb ⇐ The most memory you can request 
resources_max.ncpus = 2 ⇐ The most CPUs you can request 
resources_max.walltime = 200:00:00
resources_default.walltime = 12:00:00
resources_assigned.mem = 101711872kb
resources_assigned.ncpus = 56
resources_assigned.nodect = 28
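Because the limits can change, a script may prefer to read them at run time rather than hard-code them. A minimal sketch, assuming each attribute appears on its own line in the form name = value as in the output above; the function name queue_attr is hypothetical, and values that wrap over several lines are not handled.

```shell
# Print the value of one attribute from `qstat -Qf <queue>` output on stdin.
# Usage: qstat -Qf smallq | queue_attr resources_max.ncpus
# Assumption: lines look like "name = value", possibly indented.
queue_attr() {
    awk -v key="$1" -F ' = ' '
        {
            name = $1
            gsub(/^[ \t]+|[ \t]+$/, "", name)    # trim indentation
        }
        !found && name == key {
            value = $2
            gsub(/^[ \t]+|[ \t]+$/, "", value)   # trim stray spaces
            print value
            found = 1                            # first match only
        }
    '
}
```

With the smallq output above, qstat -Qf smallq | queue_attr resources_max.walltime would print 200:00:00.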

To obtain full information on all the queues, including their maximum CPUs, memory and walltimes, run the command below.

$ qstat -Qf

Normal Queues

You can see that queues such as the smallq, medq and workq are for jobs that range in size from small to large. Jobs in the smallq can be scheduled to run in the free resources still available on the nodes even when larger jobs have been fitted in. Small jobs will also be prioritised if they have been waiting for a while.

Special Queues

In addition to these normal queues there are some special queues. Some of these queues have restricted access, either all of the time or some of the time.

  • expressq – This is a high priority queue that a user can be given access to if they have a demonstrated need to have important jobs run sooner.
  • gpuq – This provides access to the GPU nodes. We usually limit access to these nodes to users that actually use the GPU rather than normal jobs that just use the CPUs.
  • c3b (C3 bioinformatics queue) – This queue provides access to the private node(s) owned by the Climate Change Cluster group. It is restricted to C3 users.
  • i3q (ithree institute) – This queue provides access to the private node(s) owned by the ithree Institute. It is restricted to i3 users.
  • riskyq – This queue allows users access to the private nodes even if they are not users of those groups. There are important caveats though. See details below.

The Risky Queue

Occasionally the private nodes owned by other groups might not be fully utilised. UTS still has to pay the costs of running those nodes in the commercial data centre, whether they are being used or not. If all of the other nodes are fully utilised running jobs, then rather than have jobs wait, those jobs could be running on a private node.

The risky queue (riskyq) has been set up to facilitate this. If you submit your job to the risky queue it will be queued to run on one of the private nodes, and it will start as soon as there are the resources to run it. But when a job in the c3b or i3q queue needs to run on its private node, your job will be “preempted”: its run on that node will be ended. That’s why it is called the “risky” queue.

When your job is preempted it goes back to the queued state. There it will remain until resources are again available on one of the private nodes, at which point it will start to run again. Hence you can leave it there until it has managed to run to completion.

However, PLEASE monitor your job(s) in the risky queue. If they are not likely to run, please use qdel job_no to remove them, as otherwise they will stay queued indefinitely. Also, once they are in the risky queue they will not be scheduled to run on any of the non-private nodes.
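One way to keep an eye on this is to list your risky queue jobs that are still waiting. The sketch below uses the PBS qselect command, which prints the IDs of jobs matching the given filters. The function name is made up for illustration.

```shell
# Print the IDs of a user's jobs that are waiting (state Q) in riskyq.
# Usage: waiting_risky_jobs [username]   (defaults to the current user)
waiting_risky_jobs() {
    qselect -q riskyq -s Q -u "${1:-$USER}"
}
```

Review the list it prints, and remove any job that is unlikely to run with qdel job_no.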

To use the risky queue, just include #PBS -q riskyq in your job script.
You can also add, for instance, #PBS -l host=i3node01 to request a specific private host.
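Put together, a minimal job script for the risky queue might look like the sketch below. The job name and the echo line are illustrative assumptions, and the resource request simply mirrors the values used in the command line test that follows.

```shell
#!/bin/bash
# Hypothetical job script for the risky queue; the job name is made up.
#PBS -N risky_test
#PBS -q riskyq
#PBS -l select=1:ncpus=2:mem=1G
#PBS -l walltime=00:05:00

# To ask for one particular private host, the select line can name it,
# for example: select=1:ncpus=2:mem=1G:host=i3node01

# Record where and when the job started. Because a preempted job is
# queued and run again, this line may appear more than once in the output.
echo "Job ${PBS_JOBID:-unknown} started on $(hostname) at $(date)"

# Your real workload goes here.
```

Since a job in this queue can be ended and rerun, it helps if the workload can safely be started again from the beginning.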

Here we will just submit a short test from the command line: use the riskyq, ask for 2 CPUs, 1 GB of memory and 5 minutes of walltime, and run the sleep command for 60 seconds.

$ qsub -q riskyq -l select=1:ncpus=2:mem=1G -l walltime=00:05:00 -- /bin/sleep 60

Check where this job is running:

$ qstat -u mlake -an1
                                                     Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- --- --- ------ ----- - -----
110768.hpcnode0 mlake    riskyq   STDIN        1   2    1gb 00:05 R 00:00 c3node03/1*2

You can see it is running on the private node c3node03. That node was chosen by PBS.
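If you need the node name in a script, for example to log where each attempt ran, it can be read from the last column of this listing. A sketch, assuming the -n1 layout shown above, where a running job's line ends with the node and CPU details; the function name job_hosts is hypothetical.

```shell
# Print "jobid node" for each running job in `qstat -an1` output on stdin.
# Only lines that start with a job ID and end in a node/slot field are used,
# so headers and jobs that are not running are skipped.
# For a multi-node job only the first node is printed.
job_hosts() {
    awk '$1 ~ /^[0-9]+\./ && $NF ~ /\// {
        host = $NF
        sub(/\/.*/, "", host)    # drop everything from the first slash on
        print $1, host
    }'
}
```

For the job above, qstat -an1 -u mlake | job_hosts would print 110768.hpcnode0 c3node03.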

In this test we ask for the job to be run on the private node i3node01:

$ qsub -q riskyq -l select=1:ncpus=2:mem=1G:host=i3node01 -l walltime=00:05:00 -- /bin/sleep 60

And we can see it’s running on that node:

$ qstat -an1 -u mlake
                                                     Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- --- --- ------ ----- - -----
110773.hpcnode0 mlake    riskyq   STDIN        1   2    1gb 00:05 R 00:00 i3node01/0*2