# HPC User Training
This is our scratch sheet for developing a training course for UTS HPC users. This training will be developed and delivered by Jianzhou Zhao and Mike Lake sometime in April 2025. This will be a day-long course.
Date to email participants of this course: TODO
Date of course: TODO
Below I have indicated talks and user activities in bold, prefixed by the approximate start time. Example: “9:05 am Talk ML” and “9:30 am Participant Activity”. I have tried to intersperse those so participants don’t get bored. You can get a summary of those times with:
$ cat training_users.md | grep '\*\*'
Temporary Note: The Carpentries “Introduction to High-Performance Computing” course at https://carpentries-incubator.github.io/hpc-intro/ has suggested times for each topic. We should add some times into our doc here.
For the course notes on each topic we cover, we will direct participants to the corresponding topic in the HPC documentation pages: https://hpc.research.uts.edu.au. That way we do not have to add as much material to this document.
To create a PDF document of this file for distribution to participants you can use the pandoc command below. However it uses LaTeX as the backend and the styles are terrible. It’s better to use asciidoctor-pdf.
$ pandoc -o training_users.pdf --toc training_users.md
## Instructors' Setup Notes
This is what we need to set up prior to the course:
- Logins on the HPC for instructors.
- Instructors to be able to display their terminal to the large screen in the classroom, or share it via Zoom.
- Logins for course attendees. We are planning on only about a dozen attendees.
- A selection of jobs. More than just the example primes job.
- A subset of these notes sent to them as a PDF.
- Ensure HPC resources will be available. Details below.
We need to reserve HPC nodes and resources in case the HPC becomes particularly busy. We would need to put this in place a week or so earlier.
We can use an “Advance Reservation”.
$ pbs_rsub -R 0900 -D 08:00:00                        <== Reserve from 9 am for 8 hours.
$ pbs_rsub -R 0900 -D 08:00:00 -l select=1:ncpus=64   <== As above, but reserving 64 cores on one node.
Alternatively we could just set a keepout time on some nodes and, for the course, set those nodes to only accept jobs from the teaching queue. Preferably though we will use the normal queues, as it’s better to expose course participants to the other queues.
Perhaps 2 or 3 nodes. 64 cores on one node will provide 4 cores each for 16 participants.
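Once a reservation is confirmed, PBS creates a matching reservation queue that jobs can be submitted into. A minimal sketch, assuming the reservation came back with the ID R123 (the ID and script name below are placeholders):

$ pbs_rstat                  <== Check that the reservation was confirmed.
$ qsub -q R123 my_job.sh     <== Submit a job into the reservation’s queue.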
## Prerequisites
All participants have to test their login before training.
- Before the course we email them their login details, along with a link to the standard HPC documentation on how to log in.
- Ask them to check that their login works.
- If there are any issues they should contact Mike Lake.
Some familiarity with using the Linux command line and editing text files is useful. They need to be able to edit files in situ using either nano or vi.
## Course Topics
The suggested order of topics to cover would be as below.
### Logging into the HPC
This section is based on sections in https://hpc.research.uts.edu.au/getting_started/
**9:00 am Participant Activity** Participants should be logged into the HPC at this stage.
**9:05 am Talk ML** Logging into the HPC.
We will be referring to the HPC Help Pages (https://hpc.research.uts.edu.au) a lot: what is covered and where.
Start with those pages and go to “Getting Started / Logging In”.
Jargon that you will hear: “terminal” and “SSH”.
A “terminal” program lets you log in to remote computers and interact with the remote computer’s operating system through a command-line interface. The login is via “Secure Shell” (SSH for short).
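Logging in from a terminal looks something like the line below. The username is a placeholder; use the login details and hostname given in the “Getting Started / Logging In” page.

$ ssh u777999@hostname       <== Replace with your own login and the HPC’s hostname.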
PBS stands for “Portable Batch System”. It’s one of several job schedulers for HPC systems.
Why is a scheduler so useful in an HPC system, and how does it benefit users?
**9:15 am Demo JZ & Participant Activity** Demo the nano editor; participants can practice edits.
Text editors on the HPC: nano, vim and emacs. Give a quick demo of the nano editor (10 mins).
They can follow along and edit a practice file during this time.
For a practice file Mike will have some Project Gutenberg texts available.
### Basic Linux Commands
**9:30 am Demo JZ & Participant Activity** Participants try out Linux commands.
Participants follow along in their own terminal.
- `ls` – list files and directories
- `cd` – change directory
- `pwd` – print the current working directory
- `cp` – copy files
- `mv` – move or rename files
- `mkdir` – make a directory
- `rm` – remove files (this is an alias on the HPC)
- `grep` – search within files; the name means Global Regular Expression Print
References for Linux Commands can be found here: https://hpc.research.uts.edu.au/getting_started/training/
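A minimal sketch that ties these commands together, which participants could follow along with (the file and directory names are just examples):

$ pwd                            <== Print which directory you are currently in.
$ mkdir test_dir                 <== Make a new directory.
$ cd test_dir                    <== Change into it.
$ cp ../practice.txt .           <== Copy a practice file into it. Note the dot!
$ ls -l                          <== List the files here, with details.
$ grep whale practice.txt        <== Show the lines containing “whale”.
$ mv practice.txt renamed.txt    <== Rename (move) the file.
$ rm renamed.txt                 <== Remove the file.
$ cd ..                          <== Go back up one directory.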
### Running a Simple HPC Job
**9:50 am Participant Activity** Run a simple PBS job.
In this section we will submit a simple job to find the prime numbers between the integers 100,000 and 200,000. This compute job usually takes a few minutes.
Go to the page “Running a Simple HPC Job” at https://hpc.research.uts.edu.au/getting_started/running/. We will follow that page.
That page will show how to copy some example scripts to your own directory. Edit the job submission script to have your own email.
Determine the resource requirements: how many CPUs and how much memory will it need? How much walltime will it need?
For this small job we will use just 1 CPU core (ncpus=1), a small amount of memory (mem=5GB), and a walltime of 5 minutes.
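As a rough sketch, the PBS directives in the submission script will look something like the lines below. The job name and email are placeholders, and how the program is launched may differ; the example script you copied is the authoritative version.

#!/bin/bash
#PBS -N primes                     <== The job name (a placeholder).
#PBS -l select=1:ncpus=1:mem=5GB   <== One chunk with 1 CPU core and 5 GB of memory.
#PBS -l walltime=00:05:00          <== 5 minutes of walltime.
#PBS -M Your.Name@uts.edu.au       <== Put your own email here.
#PBS -m abe                        <== Email on abort, begin and end.

cd $PBS_O_WORKDIR                  <== Run from the directory the job was submitted from.
./primes.py                        <== Launch the program (may differ in the real script).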
Submit your job using `qsub`.
Monitor your job status and get detailed job status using `qstat` with its various command line options.
What happens if you exceed the walltime?
How many prime numbers were found? Hint: `cat primes.txt | wc -l`
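The submit-and-monitor cycle, in brief (the script name and job ID below are placeholders):

$ qsub submit_primes.sh      <== Submit; PBS prints the new job’s ID.
$ qstat -u u777999           <== Show just your own jobs.
$ qstat -f 123456            <== Show the full details of your job.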
**10:30 am to 11:00 am Break for Morning Tea / Coffee**
### Most Used PBS Commands
**11:00 am Talk and Demo ML** Discuss the most used PBS commands.
- `qsub` – submit a job
- `qstat` – show the status of jobs
- `qdel` – delete a job
Cover some of the most useful command line options:
$ qstat queue_name     <== Show the jobs in a particular queue.
$ qstat -p             <== Show the percentage completed of each job.
$ qstat -an1           <== Show all jobs, with their allocated nodes on one line.
$ qstat -f job_id      <== Show the full details of a job.
$ qstat -fx job_id     <== Show the full details of a finished job.
**11:15 am Participant Activity** Submit another job. Try out other PBS commands.
In this section we will submit a PBS job again but this time use some other PBS commands to get information about a running job and to delete a running job.
Using `qstat`: https://hpc.research.uts.edu.au/pbs/qstat/
Using `qdel`: https://hpc.research.uts.edu.au/pbs/qdel/
The PBS commands that give the same information as the HPC status page (https://hpc.research.uts.edu.au/status/) are `pbsnodes` and `qstat`.
Demo `man pbsnodes` and `pbsnodes`.
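A short sketch of this activity’s commands (the script name and job ID are placeholders):

$ qsub submit_primes.sh      <== Submit another job.
$ qstat -an1                 <== Note its job ID and which node it is on.
$ qdel 123456                <== Delete the running job using its ID.
$ pbsnodes -a                <== Show the state of every node, like the status page does.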
### Interactive Jobs
See https://hpc.research.uts.edu.au/pbs/access_nodes/#submitting-an-interactive-pbs-job. Leave this out unless we have time.
### PBS Queues and Job Routing
**11:30 am Talk ML** Short introduction to queues and job routing.
When you submit a job the job scheduler looks at how many CPUs and how much memory you are requesting, and routes your job to one of the regular HPC “queues”.
You do not need to specify a queue unless your job needs to be sent to one of our custom queues.
- `ciq` – Centre of Inflammation queue
- `gpuq` – the GPU queue
- `riskyq` – the risky queue
- `interq` – the interactive queue
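If a job does need one of these custom queues it is requested with the -q option, either on the command line or as a directive in the submission script. For example, using gpuq purely as an illustration (the script name is a placeholder):

$ qsub -q gpuq my_gpu_job.sh     <== On the command line, or ...
#PBS -q gpuq                     <== ... as a directive in the script itself.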
### Checking your Resource Usage
Explain how to check what resources were used after a job has finished, and how to use this information to estimate what one should request for future jobs.
How many cores do I need? How much memory do I need? What walltime do I need?
Explain the significance of the graph at https://hpc.research.uts.edu.au/software_specific/software_gromacs/
Commands you can use for the above are:
$ qstat -x -u u777999 <== Show your jobs, including finished ones.
$ qstat -fx | grep cpu
$ qstat -fx | grep mem
$ qstat -fx | grep walltime
Example of the information this will tell you:
ncpus=1
mem = 5,242,880 kb <== The 5 GB we asked for.
resources_used.mem = 17,328 kb <== The maximum used; PBS samples usage every 120 seconds.
resources_used.vmem = 692,980 kb <== Virtual memory: also includes swap and memory allocated but not used.
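Putting this together for a single finished job (the job ID is a placeholder):

$ qstat -fx 123456 | grep -E 'ncpus|mem|walltime'   <== Requested versus used resources in one go.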
**11:45 am Participant Activity** Check the resource usage of your previous jobs.
### Using Modules
This section is based on https://hpc.research.uts.edu.au/software_general/modules/
**12:00 pm Talk JZ** Touch a bit on Modules.
- What’s currently loaded: `module list`
- What’s available: `module avail`, e.g. `module avail | grep java`
- Load and unload: `module load java-latest` and `module unload java-latest`
- Getting help for a module: `module help java-1.8`
Note: For some modules there might not be any help written.
There is also a large collection of modules, developed by Leo Hardtke, under `/shared/c3/apps/modules/` and `/shared/rsg/apps/modules/`.
To make these visible, participants can edit their `.bash_profile` file and insert one of these lines.

For those in the Climate Change Cluster group, i.e. the `c3_users` group:

export MODULEPATH=/shared/c3/apps/modules:$MODULEPATH

For those in the Remote Sensing Group, i.e. in the `rsg_users` group:

export MODULEPATH=/shared/rsg/apps/modules:$MODULEPATH
When they log in again the `module avail` command will show these additional modules.
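To pick up the change without logging out and in again, the profile can simply be re-read in the current session:

$ source ~/.bash_profile     <== Re-read the profile.
$ module avail               <== The extra modules should now be listed.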
**12:30 pm to 1:00 pm Break for Lunch**
### Checkpointing
**XX:XX pm Participant Activity** Run a job which uses checkpointing.
Checkpointing is highly desirable on HPC systems, especially for long-running jobs.
Remember that earlier you created a `jobs` directory, and copied into there the `primes` directory from `/shared/eresearch/pbs_job_examples/primes/`.
We will use the same program in this example.
$ cd jobs/primes
Edit the `primes.py` program and change the “end” value to 250,000 or more.
Save and exit the editor. The program will now take longer to run.
Edit the submission script and change the walltime to be just 1 minute, i.e. `walltime=00:01:00`.
Submit the job.
If you type `qstat` you should see your job running. After a minute your job will have hit its walltime limit and will exit.
Compare the current output and error files (.o and .e files) to a previous successful run.
How many prime numbers were found this time compared to before?
Run the job again. Because the program checkpoints its progress, it should resume from where it stopped rather than starting from the beginning.
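The whole cycle, in brief. The submission script name below is a placeholder; use the one from the primes example.

$ nano primes.py             <== Change the “end” value to 250000 or more.
$ nano submit_primes.sh      <== Change the walltime to 00:01:00.
$ qsub submit_primes.sh      <== Submit the job.
$ qstat                      <== Watch it run until it hits its walltime.
$ cat primes.txt | wc -l     <== Count the primes found so far.
$ qsub submit_primes.sh      <== Resubmit; it should resume from the checkpoint.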
### MPI – The Real Power of HPCs
**XX:XX pm Participant Activity** Run an MPI job and check resource usage.
MPI stands for “Message Passing Interface”. This is a method for programs to run multiple copies of themselves and to pass data back and forth to enable parallel computation.
Remember earlier that you created a `jobs` directory and copied into there the `primes` directory. Let’s now do the same thing but with the `mpi` directory.
$ cd jobs
$ cp -r /shared/eresearch/pbs_job_examples/mpi . <== Note dot at end!
There is a README.md file in that directory which explains everything :-)
Run the bash shell script `compile_all.sh`, which will compile and build two executable programs from the two source files `primes_serial.f90` and `primes_mpi.f90`. You can use the command `cat compile_all.sh` to see what commands it runs to compile the two programs.
$ ./compile_all.sh
Now you will find two executable programs in your directory: `primes_serial` and `primes_mpi`.
They will calculate the prime numbers up to 100 million.
Note: You can’t run multiple copies of these programs at the same time, as they all write to the same output file, `primes.dat`.
Run the primes_serial program under PBS using the submit_primes_serial.sh submission script.
How long does it take to run?
Hint: `qstat -fx job_id | grep time`
Run the primes_mpi program under PBS using the submit_primes_mpi.sh submission script.
How long does it take to run?
How do these times compare to the Python program that calculated prime numbers?
During this talk we could introduce the `top`, `htop` and `ps` commands so participants can see the primes processes running.
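For example, run on the node where the job is executing (u777999 is a placeholder login; the -u flags limit the output to your own processes):

$ ps -u u777999      <== List your processes.
$ top -u u777999     <== A live view of your processes; press q to quit.
$ htop -u u777999    <== A friendlier version of top, if it is installed.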
## Delving Deeper
These won’t be covered in this short course but you can learn about how to use them from the HPC help pages.
- Array Jobs https://hpc.research.uts.edu.au/pbs/array_jobs/
- Job Reservations https://hpc.research.uts.edu.au/pbs/reservations/
- Passing Args to Jobs https://hpc.research.uts.edu.au/pbs/passing_args/
- GPU Computing https://hpc.research.uts.edu.au/gpu/
## Further Help
- Where to get further help? Make sure they know what the UTS HPC documentation pages cover and where to find what.
- The PBS manuals are under `/shared/eresearch/pbs_manuals/`. Download `PBSUserGuide2024.1.pdf` as this is the most useful document for participants.
## Glossary
GREP: Global Regular Expression Print
Its name comes from the `ed` command `g/re/p` (global regular expression search and print). It was written in 1973 so it’s a bit over 50 years old.
SSH: Secure Shell
ssh is a program for logging into a remote machine and for executing commands on it. It provides secure encrypted communications between two hosts over the network.
Terminal:
The term “terminal” in computing comes from its literal meaning as an
“endpoint,” from the Latin “terminus” (meaning boundary or end). In early
computing, terminals were physical endpoints – devices with just a screen and
keyboard that connected to mainframe computers.
Today’s “terminal” isn’t a physical device anymore; it’s a program (like Terminal on Mac or Command Prompt on Windows) that simulates those old physical terminals, carrying on that legacy of being the “endpoint” where human and computer meet for direct communication.
We keep the name because it serves the same purpose.
Adapted from: https://www.reddit.com/r/webdev/comments/1gzhqp7/til_why_is_the_terminal_called_that/
## References
UTS Sites:
https://hpc.research.uts.edu.au/ and https://hpc.research.uts.edu.au/status/
Carpentries Sites: “Introduction to High-Performance Computing”
https://carpentries-incubator.github.io/hpc-intro/
Its GitHub repo is at: https://github.com/carpentries-incubator/hpc-intro
Intersect Australia: “Getting started with HPC using PBS Pro” https://intersectaustralia.github.io/training/HPC201/index