The Risky Queue
This queue gives users access to the private nodes even if they are not members of the groups that own those nodes. There are important caveats though.
Occasionally the private nodes owned by other groups might not be fully utilised. The UTS still has to pay the costs of running those nodes in the commercial data centre, whether they are being used or not. If all of the other nodes are fully utilised running jobs, then rather than have jobs wait, those jobs could be running on an idle private node.
The risky queue (riskyq) has been set up to facilitate this. If you submit your job
to the risky queue it will be queued to run on one of the private nodes, and as soon
as there are resources available it will start running there. But when a job in the
c3b queue or i3q queue needs to run on its private node your job will be “preempted”:
its run on that node will be ended. That’s why it is called the “risky” queue.
When your job is preempted it goes back to the queued state. It will remain there until
resources become available again on one of the private nodes, at which point it will start
to run again. Hence you can leave it queued until it has managed to run to completion.
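If you want to check whether a preempted job has been requeued, you can query its state with qstat. This is just a sketch: replace job_no with your own job's number. In PBS the job_state attribute shows Q while the job is waiting and R while it is running.
$ qstat -f job_no | grep job_state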
However PLEASE monitor your job(s) in the risky queue. If they are not likely
to run please use qdel job_no to remove them, as otherwise they will stay queued
indefinitely. Also, once they are in the risky queue they will not be scheduled to
run on any of the non-private nodes.
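For example, to list your own jobs and then remove one that is unlikely to run (mlake is the example username used later on this page; job_no is the job number that qstat reports):
$ qstat -u mlake
$ qdel job_no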
To use the risky queue just include #PBS -q riskyq in your job script.
You can also add, for instance, #PBS -l host=i3node01 to nominate a specific private host.
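Putting that together, a minimal job script for the risky queue might look like the sketch below. The resource requests mirror the command-line test that follows, the optional host directive is left disabled with a double #, and my_program.sh is just a placeholder for whatever you actually run.
#!/bin/bash
#PBS -q riskyq
#PBS -l select=1:ncpus=2:mem=1G
#PBS -l walltime=00:05:00
## Remove one leading # to pin the job to a specific private host:
##PBS -l host=i3node01

# Run from the directory the job was submitted from.
cd $PBS_O_WORKDIR
./my_program.sh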
Here we will just submit a short test from the command line: use the riskyq, ask for 2 CPUs, 1 GB of memory and 5 minutes of walltime, and run the sleep command for 60 seconds.
$ qsub -q riskyq -l select=1:ncpus=2:mem=1G -l walltime=00:05:00 -- /bin/sleep 60
110768.hpcnode0
Check where this job is running:
$ qstat -u mlake -an1
                                                     Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- --- --- ------ ----- - -----
110768.hpcnode0 mlake    riskyq   STDIN        1   2    1gb 00:05 R 00:00 c3node03/1*2
You can see it is running on the private node c3node03. That node was chosen by PBS.
In this test we ask for the job to be run on the private node i3node01:
$ qsub -q riskyq -l select=1:ncpus=2:mem=1G:host=i3node01 -l walltime=00:05:00 -- /bin/sleep 60
And we can see it’s running on that node:
$ qstat -an1 -u mlake
                                                     Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- --- --- ------ ----- - -----
110773.hpcnode0 mlake    riskyq   STDIN        1   2    1gb 00:05 R 00:00 i3node01/0*2