You can get an up-to-date summary of the nodes, queues and jobs by visiting the
HPC Status page.
You may wish to obtain more detailed information though, for use in your job scripts.
You can do this by using the qstat command; the examples below show how to list the
queues and their limits.
There are a few different job queues on the HPC; smallq and workq are two examples, and they have different resource limits. To obtain a list of all the queues run the command below. In this example you can see there are 28 jobs running in the smallq queue, 5 jobs running in the workq, and 3 jobs queued in the workq.
$ qstat -Q
Queue      Max   Tot   Ena Str Que   Run   Hld   Wat   Trn   Ext   Type
---------- ----- ----- --- --- ----- ----- ----- ----- ----- ----- ----
smallq     0     28    yes yes 0     28    0     0     0     0     Exec
expressq   0     0     yes yes 0     0     0     0     0     0     Exec
workq      0     8     yes yes 3     5     0     0     0     0     Exec
$
To obtain full information on the smallq, for example, use the command below.
This is the best way to obtain up-to-date information on the available queues,
as we may modify the queues' maximum limits to manage the resources.
$ qstat -Qf smallq
Queue: smallq
    queue_type = Execution
    total_jobs = 28
    state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:28 Exiting:0 Begun:0
    resources_max.mem = 32gb               ⇐ The most memory you can request
    resources_max.ncpus = 2                ⇐ The most CPUs you can request
    resources_max.walltime = 200:00:00
    resources_default.walltime = 12:00:00
    resources_assigned.mem = 101711872kb
    resources_assigned.ncpus = 56
    resources_assigned.nodect = 28
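If you are only interested in the limits, one simple option is to filter that output, for example:

$ qstat -Qf smallq | grep resources_max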
To obtain full information on all the queues, including their maximum CPUs, memory and wall times, run the command below.
$ qstat -Qf
You can see that queues such as the smallq, medq and workq are for jobs that range in size from small to large. Jobs in the smallq can be scheduled to run in the free resources still available on the nodes even when larger jobs have been fitted in. Small jobs will also be prioritised if they have been waiting for a while.
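For instance, a request that fits within the smallq limits shown above (at most 2 CPUs and 32 GB of memory) might look something like the following; the script name here is just a placeholder, and depending on the current queue configuration you may not need to name the queue explicitly:

$ qsub -q smallq -l select=1:ncpus=2:mem=16gb -l walltime=10:00:00 my_job.sh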
In addition to these normal queues there are some special queues. Some of these queues have restricted access, either all of the time or some of the time.
Occasionally the private nodes owned by other groups might not be fully utilised. The UTS still has to pay the costs of running those nodes in the commercial data centre, whether they are being used or not. If all of the other nodes are fully utilised running jobs, then rather than have jobs wait, those jobs could be running on a private node.
The risky queue (riskyq) has been set up to facilitate this. If you submit
your job to the risky queue then it will be queued to run on one of the private nodes.
As soon as there are the resources to run your job it will run on a private node.
But when a job in the c3b queue or i3q queue needs to run on their private node,
your job will be “preempted”: its run on that node will be ended.
That’s why it is called the “risky” queue.
When your job is preempted though, it will go back to the queued state. There it will remain until there are again resources available on one of the private nodes, and then it will start to run again. Hence you can leave it queued until it has managed to run to completion.
However, PLEASE monitor your job(s) in the risky queue. If they are not likely
to run, please use qdel job_no to remove them, as otherwise they will stay queued
indefinitely. Also, once they are in the risky queue they will not be scheduled to
run on any of the non-private nodes.
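For example, to list your jobs and then remove one that is unlikely to run (the username and job number here are only illustrative):

$ qstat -u mlake    # list your jobs; the S column shows Q for queued, R for running
$ qdel 110768       # remove a queued job by its job number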
To use the risky queue just include #PBS -q riskyq in your job script.
You can also add, for instance, #PBS -l host=i3node01 to specify a specific node.
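A minimal job script for the risky queue might look like the sketch below; the resource requests are only examples and my_program is just a placeholder for your own program:

#!/bin/bash
#PBS -q riskyq
#PBS -l select=1:ncpus=2:mem=1G
#PBS -l walltime=00:05:00

# Change to the directory the job was submitted from.
cd $PBS_O_WORKDIR

# Replace this with your own program.
./my_program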
Here we will just submit a short test from the command line: use the riskyq, ask for 2 CPUs, 1 GB of memory and 5 minutes of walltime, and run the sleep command for 60 seconds.
$ qsub -q riskyq -l select=1:ncpus=2:mem=1G -l walltime=00:05:00 -- /bin/sleep 60
110768.hpcnode0
Check where this job is running:
$ qstat -u mlake -an1
                                                     Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- --- --- ------ ----- - -----
110768.hpcnode0 mlake    riskyq   STDIN        1   2    1gb 00:05 R 00:00 c3node03/1*2
You can see it is running on the private node c3node03. That node was chosen by PBS.
In this test we ask for the job to be run on the private node i3node01:
$ qsub -q riskyq -l select=1:ncpus=2:mem=1G:host=i3node01 -l walltime=00:05:00 -- /bin/sleep 60
And we can see it’s running on that node:
$ qstat -an1 -u mlake
                                                     Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- --- --- ------ ----- - -----
110773.hpcnode0 mlake    riskyq   STDIN        1   2    1gb 00:05 R 00:00 i3node01/0*2