Running a Simple HPC Job¶
In this section we will copy some example scripts into our own directory and
submit a simple job to the PBS scheduler. The end result is that we will have
calculated some prime numbers. After reading this section you should read the
PBS sections on how to write PBS jobs which use the /scratch filesystem.
We use the batch queueing system PBS Pro (PBS Professional). PBS Pro is a standard across major clusters in the HPC community. The NCI National Facility also uses PBS Pro, so if you progress to using the large raijin cluster at NCI you will find its interface familiar.
Note
Important things to remember:
1. Do not run large computations on the login node; use PBS.
2. Use the /scratch/ directory for reading and writing large files.
The login node is there so you can log in, edit your code, compile it, and perhaps run tests using a small test data set. Your real computational work needs to be run under a PBS submission script so that it can be distributed to one of the dedicated compute nodes. This page explains how you can do this.
Summary of running a job¶
- Determine the resources required for your job.
- Create a job script. This wraps your job in a shell script and tells PBS your requirements.
- Submit the job using qsub.
- Monitor the job using qstat.
- Delete the job, if required, using qdel.
Never run large programs or large data sets on the cluster’s head node directly; use PBS to schedule your job into the queue. This ensures efficient allocation of resources for everyone. If you need to test a script, run it on a smaller set of test data instead, preferably via PBS.
Copy example scripts to your own directory¶
There are example scripts that you can use to practice submitting some short test jobs in /shared/eresearch/pbs_job_examples/.
Copy these into your own directory using the following commands:
$ cd <-- This will take you to the top of your home directory.
$ mkdir jobs <-- This creates the directory "jobs".
$ cd jobs <-- This changes into the directory "jobs".
Now you should be in the new directory jobs and we can copy the example primes programs into it.
In the command below don’t forget the dot at the end.
$ cp -r /shared/eresearch/pbs_job_examples/primes .
Now change directory into the new “primes” directory.
$ cd primes
You will now be in the directory “primes” and you can have a look at the scripts there.
To view a file like primes.py use the less command: less primes.py. Pressing the space bar moves down a page and pressing the q key quits the viewer.
To edit the files use the vi or nano editors: vi primes.py or nano primes.py.
For help with using the vi editor (which is far more useful than nano) see the
Training section.
Determine the resource requirements¶
To make effective use of the PBS queueing system, you will need to know what resources your job will use. When your job starts, PBS will make sure that appropriate resources are available for your job to run, up to the maximum you have specified. Try not to request considerably more than you require.
The resources can be specified by:
- CPU cores (ncpus) - If your application is multi-threaded and can make use of more than one core, you will need to specify the number of cores your script will use.
- Memory (mem) - This is how much memory your application will use. On a new piece of software or dataset, you might not know how much will be consumed. In such a case, start with a generous number and tune downwards. The more accurate you get, the more likely your job is to be scheduled during busy periods. Once you know how much memory your application requires, request just a small amount more than you need.
- Wall time (walltime) - This is the maximum amount of time you want your program to run for. After this time the PBS scheduler will kill your job. Start by estimating a generous wall time based on test runs and tune downwards. The smaller the wall time, the more likely the job is to be scheduled during busy periods. Once you know how much time your application requires, request just a small amount more than you need.
For the example primes.py program, which calculates primes from 100,000 to 200,000, we will use 1 CPU, 5 GB RAM and set a wall time of 5 minutes.
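As a rough illustration of "request a small amount more than you need", the sketch below turns a measured run time into a walltime string with a safety margin. The function name walltime_with_margin and the 20% margin are assumptions made for this example, not site policy. The 282 seconds corresponds to a test run of 4 minutes 42 seconds.

```shell
#!/bin/bash
# Convert a measured run time (in seconds) into a PBS walltime string,
# adding a safety margin. The function name and the 20% default margin
# are illustrative assumptions.
walltime_with_margin() {
    local seconds=$1
    local margin_percent=${2:-20}
    # Integer arithmetic: scale up by the margin, rounding down.
    local total=$(( seconds * (100 + margin_percent) / 100 ))
    printf '%02d:%02d:%02d\n' \
        $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}

# A test run that took 282 seconds, padded by 20%.
walltime_with_margin 282 20   # prints 00:05:38
```

The result can be used directly in a line such as "#PBS -l walltime=00:05:38".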
Create a job script¶
Your job script tells PBS which HPC resources to reserve for your job. It should contain the following:
- Your resource requirements for PBS to schedule your job - this needs to be at the top of your script for PBS to read it, before the first executable line in the script.
- Any copying of data to the local scratch directory, and other pre-job administration that needs to take place.
- The job itself.
- Copying back your results from the scratch directory to your home directory, cleaning up temporary data, and other post-job administration.
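The four parts listed above could be laid out roughly as in the sketch below. This is not the site's recommended script: the function name run_in_scratch and the directory naming under /scratch are assumptions for illustration, and the full examples in the primes directory remain the reference for how to use scratch.

```shell
#!/bin/bash
# 1. Resource requirements, before the first executable line.
#PBS -l ncpus=1
#PBS -l mem=5GB
#PBS -l walltime=00:05:00

# Hypothetical helper: run a command inside a private scratch directory.
run_in_scratch() {
    local submit_dir=$1     # where the input files live
    local scratch_base=$2   # for example /scratch
    shift 2                 # the remaining arguments are the command to run

    # 2. Pre-job: create a private working directory and copy the inputs in.
    local work_dir
    work_dir=$(mktemp -d "${scratch_base}/job_XXXXXX") || return 1
    cp -r "${submit_dir}/." "${work_dir}/" || return 1

    # 3. The job itself, run from inside the scratch directory.
    local status=0
    ( cd "${work_dir}" && "$@" ) || status=$?

    # 4. Post-job: copy everything back and remove the temporary data.
    cp -r "${work_dir}/." "${submit_dir}/"
    rm -rf "${work_dir}"
    return "${status}"
}

# Inside a real PBS job, PBS_O_WORKDIR is the directory qsub was run from.
if [ -n "${PBS_O_WORKDIR:-}" ]; then
    run_in_scratch "${PBS_O_WORKDIR}" /scratch ./primes.py
fi
```

Copying the whole directory in both directions keeps the sketch short. With large input files you would normally copy back only the results.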
An example of a PBS submission script called submit_simple.sh is in the primes example directory.
This example job requires just 1 core and 5 GB of RAM, and we expect it to take only a few minutes to complete, so we have specified a wall time of 5 minutes to ensure it will finish within the wall time.
A shorter version of a job script is shown below. Please read the full examples in /shared/eresearch/pbs_job_examples/primes/, as they cover how to use the scratch directory.
Reasons to use scratch are also covered in the Hardware Layout.
#!/bin/bash
# Run this as "qsub submit_simple.sh"
# Set a name for this run and the resource requirements,
# 1 CPU, 5 GB memory and 5 minutes wall time.
#PBS -N test
#PBS -l ncpus=1
#PBS -l mem=5GB
#PBS -l walltime=00:05:00
# Send an email when this job aborts, begins or ends.
#PBS -m abe
#PBS -M your.email@uts.edu.au
# Change to the directory the job was submitted from, then run your program.
cd "${PBS_O_WORKDIR}" || exit 1
./primes.py
Submit your job¶
If you were to just run the primes program as below it would run and output a file of primes. However, this would be running the program directly on the login node and not via the PBS scheduler. As it uses the login node’s CPUs and memory, it could slow down other users’ work on the login node.
$ ./primes.py <-- DON'T RUN IT LIKE THIS.
Hence we don’t run it as above but like this:
$ qsub submit_simple.sh
11153.hpcnode0
This submits our job to the queue and will return the assigned job ID.
Type man qsub for the online manual pages.
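Because qsub prints the job ID, a script can capture it for later use with qstat or qdel. The sketch below is one way to do that. The helper name job_number is an assumption for this example, and the submission only runs where qsub and submit_simple.sh are actually present.

```shell
#!/bin/bash
# Extract the numeric part of a PBS job ID such as "11153.hpcnode0".
job_number() {
    local job_id=$1
    # Remove everything from the first dot onwards.
    echo "${job_id%%.*}"
}

# Only attempt a submission where qsub and the script are available.
if command -v qsub >/dev/null 2>&1 && [ -f submit_simple.sh ]; then
    job_id=$(qsub submit_simple.sh)
    echo "Submitted ${job_id} (job number $(job_number "${job_id}"))"
fi
```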
You will get an email when the job starts with the following information:
PBS Job Id: 11153.hpcnode0
Job Name: submit_simple.sh
Begun execution
Monitor your job status and get detailed job status¶
Use the qstat command to check the status of your job. Below is an example of the output you will see.
Type man qstat for the online manual pages.
$ qstat
Job id Name User Time Use S Queue
---------------- ----------- -------- -------- - -----
11153.hpcnode0 submit.sh 999777 00:00:39 0 R workq
“Job id” is the identifier for this job.
“Name” is the name of the job. By default this is the name of your submitted script; a #PBS -N line in the script sets a different name.
“User” is your UTS staff/student user number.
“Time Use” is the CPU time used in HH:MM:SS.
The “S” column indicates the job’s state: “Q” means the job is queued, “R” that it is running and “F” that it has finished.
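If you want to read the state from a script rather than by eye, something like the following sketch could work. It assumes the default qstat layout shown above, with the job ID in full in the first column and the state in the second-to-last column. The function names state_of and describe_state are made up for this example.

```shell
#!/bin/bash
# Print the state letter for one job, reading default qstat output on stdin.
# Assumes the state is the second-to-last column, as in the listing above.
state_of() {
    awk -v id="$1" '$1 == id { print $(NF - 1) }'
}

# Translate the state letters described above into words.
describe_state() {
    case "$1" in
        Q) echo "queued" ;;
        R) echo "running" ;;
        F) echo "finished" ;;
        *) echo "other state: $1" ;;
    esac
}

# Sample text standing in for real qstat output.
sample='Job id            Name         User      Time Use S Queue
----------------  -----------  --------  -------- - -----
11153.hpcnode0    submit.sh    999777    00:00:39 R workq'

describe_state "$(printf '%s\n' "$sample" | state_of 11153.hpcnode0)"   # prints running
```

On the cluster you would pipe the real command into the function instead, for example: qstat | state_of 11153.hpcnode0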
Whilst the job is running you can get detailed information on the job by running qstat -f job_id. Once the job has finished, add the -x flag, as in the example below.
$ qstat -fx 11153.hpcnode0
Job Id: 11153.hpcnode0
Job_Name = submit_simple.sh
resources_used.cput = 00:04:31
resources_used.mem = 17072kb
resources_used.ncpus = 1
resources_used.walltime = 00:04:42
job_state = F
queue = smallq
......
exec_host = hpcnode07/2
Output_Path = /shared/homes/999777/jobs/primes/submit_simple.sh.o11153
......
Resource_List.mem = 5gb
Resource_List.ncpus = 1
Resource_List.nodect = 1
Resource_List.select = 1:mem=5gb:ncpus=1
Resource_List.walltime = 00:05:00
schedselect = 1:mem=5gb:ncpus=1
comment = Job run at Fri Nov 03 at 14:27 on (hpcnode07:mem=5242880kb:ncpus=1)
and finished
Exit_status = 0
Submit_arguments = submit_simple.sh
There is a lot more information than the above; I have just included some of the important lines.
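The "name = value" lines in this output are convenient for comparing what a job used against what it requested, which helps when tuning mem and walltime. The sketch below is one possible approach. The helper names qstat_field and to_mb are assumptions for this example, and the sample text is taken from the listing above.

```shell
#!/bin/bash
# Print the value of one "name = value" field from qstat -f output on stdin.
qstat_field() {
    awk -F' = ' -v name="$1" '
        { key = $1; gsub(/^[[:space:]]+/, "", key) }
        key == name { print $2; exit }'
}

# Convert a size such as "17072kb" or "5gb" to whole megabytes (rounded).
to_mb() {
    local value=${1%[kmgKMG][bB]}   # numeric part
    local unit=${1#"$value"}        # unit suffix
    case "${unit}" in
        kb|KB) echo $(( (value + 512) / 1024 )) ;;
        mb|MB) echo "${value}" ;;
        gb|GB) echo $(( value * 1024 )) ;;
        *) echo "unrecognised size: $1" >&2; return 1 ;;
    esac
}

# Sample lines standing in for the output of: qstat -fx 11153.hpcnode0
sample='    resources_used.mem = 17072kb
    Resource_List.mem = 5gb'

used=$(printf '%s\n' "$sample" | qstat_field resources_used.mem)
requested=$(printf '%s\n' "$sample" | qstat_field Resource_List.mem)
echo "used $(to_mb "$used") MB of $(to_mb "$requested") MB requested"
# prints: used 17 MB of 5120 MB requested
```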
If you get this message when running qstat -f job_id:
$ qstat -f 11153.hpcnode0
qstat: 11153.hpcnode0 Job has finished, use -x or -H to obtain historical job information
$
then it means just that: the job has finished, so use the -x flag like this:
$ qstat -fx 11153.hpcnode0
After the job has finished¶
If the job has finished when you run qstat, it will not show anything:
$ qstat
$
This is because by default it only shows queued or running jobs. To list your finished jobs use -x (for expired). So for instance:
$ qstat -x
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
1152.hpcnode0   999777   smallq   primes      56678   1   1    5gb    -- F 00:09
You will also get an email when the job has finished with the following information:
PBS Job Id: 11153.hpcnode0
Job Name: submit_simple.sh
Execution terminated
Exit_status=0 <== The exit status should be zero.
resources_used.cpupercent=99
resources_used.cput=00:04:31 <== The actual CPU time used.
resources_used.mem=17072kb <== The memory used, here it's about 17 MB
resources_used.ncpus=1 <== The number of CPU cores used.
resources_used.vmem=339676kb
resources_used.walltime=00:04:42 <== The real clock time used.
The emailed information is only a summary. To obtain more detailed information, run qstat as mentioned above, with the -f flag for full job information plus the -x flag for an eXpired job:
$ qstat -fx 11153.hpcnode0
The PBS output files¶
A copy of your program’s standard output (stdout) and standard error (stderr) streams is created in the directory you called PBS from, as files whose names end in .o and .e with the job number appended.
Note
The “standard output stream” is what your program would normally print to the screen.
The “standard error stream” is what your program would print if it encountered
an error like a “file not found” error.
An example of the files produced by running the primes program is:
$ ls -l
-rwxr--r--. 1311 Nov 3 14:27 submit_simple.sh
-rw-------. 0 Nov 3 16:28 submit_simple.sh.e532527 <-- Notice zero size file.
-rw-------. 96 Nov 3 16:29 submit_simple.sh.o532527
-rwxr--r--. 894 Nov 3 16:28 primes.py
-rw-rw-r--. 12006 Nov 3 16:29 primes.txt
- submit_simple.sh.e532527 - this should normally be zero sized, i.e. empty, as it only contains any errors your program may have produced.
- submit_simple.sh.o532527 - this will contain any screen output that your program would have produced.
- primes.txt - this is your data.
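A quick way to confirm that a job produced no errors is to test whether its error file is empty. The sketch below shows one way to do this. The function name check_error_file is an assumption for this example, and the loop simply looks for any files ending in .e followed by digits in the current directory.

```shell
#!/bin/bash
# Report whether a job's error file is missing, empty, or contains output.
check_error_file() {
    local err_file=$1
    if [ ! -e "${err_file}" ]; then
        echo "missing"
    elif [ -s "${err_file}" ]; then
        # -s is true when the file exists and its size is greater than zero.
        echo "has errors"
    else
        echo "empty"
    fi
}

# Check every PBS error file in the current directory, if there are any.
for f in ./*.e[0-9]*; do
    [ -e "$f" ] || continue
    echo "$f: $(check_error_file "$f")"
done
```

If a file is reported as "has errors", view it with less to see what went wrong.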
To delete or cancel a job¶
Sometimes you might want to delete a job that has been submitted to the PBS queue. To delete your job from the queue, use the qdel command e.g. “qdel 69580.hpcnode0”:
$ qdel job_id
Type man qdel for the online manual pages.
Once you understand this introduction and have finished running some example jobs you should read the PBS sections on how to write a more complex and more useful job submission script that will suit your research needs.