# HPC User Training
This is our scratch sheet for developing a training course for UTS HPC users. This training will be developed and delivered by Jianzhou Zhao and Mike Lake sometime in April 2025. This will be a day-long course.
Date to email participants of this course: TODO
Date of course: TODO
Below I have indicated talks and user activities in bold, prefixed by the approximate start time. Example: “9:05 am Talk ML” and “9:30 am Participant Activity”. I have tried to intersperse those so participants don’t get bored. You can get a summary of those times with:
$ cat training_users.md | grep '\*\*'
Temporary Note: The Carpentries “Introduction to High-Performance Computing” course at https://carpentries-incubator.github.io/hpc-intro/ has suggested times for each topic. We should add some times into our doc here.
For the course notes on each topic we cover, we will direct participants to the corresponding topic in the HPC documentation pages: https://hpc.research.uts.edu.au. That way we do not have to add as much material to this document.
To create a PDF document of this file for distribution to participants you can use the pandoc command below. However it uses LaTeX as the backend and the styles are terrible. It’s better to use asciidoctor-pdf.
$ pandoc -o training_users.pdf --toc training_users.md
## Instructors' Setup Notes
This is what we need to set up prior to the course:
- Logins on the HPC for instructors.
- Instructors to be able to display their terminal to the large screen in the classroom, or share it via Zoom.
- Logins for course attendees. We are planning on only about a dozen attendees.
- A selection of jobs. More than just the example primes job.
- A subset of these notes sent to them as a PDF.
- Ensure HPC resources will be available. Details below.
We need to reserve HPC nodes and resources in case the HPC becomes particularly busy. We would need to put this in place a week or so earlier.
We can use an “Advance Reservation”.
$ pbs_rsub -R 0900 -D 08:00:00                        <== Reserve from 9 am for 8 hours.
$ pbs_rsub -R 0900 -D 08:00:00 -l select=1:ncpus=64   <== As above, but reserving 64 cores on one node.
Alternatively we could just set a keepout time on some nodes and, for the course, set those nodes to only accept jobs from the teaching queue. Preferably though we will use the normal queues, as it’s better to expose course participants to the other queues.
Perhaps 2 or 3 nodes. 64 cores on one node will provide 4 cores each for 16 participants.
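Once a reservation is confirmed, PBS creates a matching reservation queue that jobs can be submitted into. A minimal sketch, assuming the reservation came back with the ID R123 (the ID and script name below are placeholders):

$ pbs_rstat                  <== Check that the reservation was confirmed.
$ qsub -q R123 my_job.sh     <== Submit a job into the reservation’s queue.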
## Prerequisites
All participants have to test their login before training.
- Before the course we email them their login details, along with a link to the standard HPC documentation on how to log in.
- Ask them to check that their login works.
- If there are any issues they should contact Mike Lake.
Some familiarity with using the Linux command line and editing text files is useful. They need to be able to edit files in situ using either nano or vi.
## Course Topics
The suggested order of topics to cover would be as below.
### Logging into the HPC
This section is based on sections in https://hpc.research.uts.edu.au/getting_started/
**9:00 am Participant Activity** Participants should be logged into the HPC at this stage.
**9:05 am Talk ML** Logging into the HPC.
We will be referring to the HPC Help Pages (https://hpc.research.uts.edu.au) a lot: what is covered and where.
Start with those pages and go to “Getting Started / Logging In”.
Jargon that you will hear: “terminal” and “SSH”.
A “terminal” program lets you log in to remote computers and interact with the remote computer’s operating system through a command-line interface. The login is via “Secure Shell” (SSH for short).
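Logging in from a terminal looks something like the line below. The username is a placeholder; use the login details and hostname given in the “Getting Started / Logging In” page.

$ ssh u777999@hostname       <== Replace with your own login and the HPC’s hostname.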
PBS stands for “Portable Batch System”. It’s one of several job schedulers for HPC systems.
Why is a scheduler so useful in an HPC system, and how does it benefit users?
**9:15 am Demo JZ & Participant Activity** Demo the nano editor; participants can practice edits.
Text editors on the HPC: nano, vim and emacs. Give a quick demo of the nano editor (10 mins).
They can follow along and edit a practice file during this time.
For a practice file Mike will have some Project Gutenberg texts available.
### Basic Linux Commands
**9:30 am Demo JZ & Participant Activity** Participants try out Linux commands.
Participants follow along in their own terminal.
- `ls` – list files and directories
- `cd` – change directory
- `pwd` – print the current working directory
- `cp` – copy files
- `mv` – move or rename files
- `mkdir` – make a directory
- `rm` – remove files (this is an alias on the HPC)
- `grep` – search within files; the name means Global Regular Expression Print
References for Linux Commands can be found here: https://hpc.research.uts.edu.au/getting_started/training/
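A minimal sketch that ties these commands together, which participants could follow along with (the file and directory names are just examples):

$ pwd                            <== Print which directory you are currently in.
$ mkdir test_dir                 <== Make a new directory.
$ cd test_dir                    <== Change into it.
$ cp ../practice.txt .           <== Copy a practice file into it. Note the dot!
$ ls -l                          <== List the files here, with details.
$ grep whale practice.txt        <== Show the lines containing “whale”.
$ mv practice.txt renamed.txt    <== Rename (move) the file.
$ rm renamed.txt                 <== Remove the file.
$ cd ..                          <== Go back up one directory.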
### Running a Simple HPC Job
**9:50 am Participant Activity** Run a simple PBS job.
In this section we will submit a simple job to find the prime numbers between the integers 100,000 and 200,000. This compute job usually takes a few minutes.
Go to the page “Running a Simple HPC Job” at https://hpc.research.uts.edu.au/getting_started/running/. We will follow that page.
That page will show how to copy some example scripts to your own directory. Edit the job submission script to have your own email.
Determine the resource requirements: how many CPUs and how much memory will it need? How much walltime will it need?
For this small job we will use just 1 CPU core (ncpus=1), a small amount of memory (mem=5GB), and a walltime of 5 minutes.
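As a rough sketch, the PBS directives in the submission script will look something like the lines below. The job name and email are placeholders, and how the program is launched may differ; the example script you copied is the authoritative version.

#!/bin/bash
#PBS -N primes                     <== The job name (a placeholder).
#PBS -l select=1:ncpus=1:mem=5GB   <== One chunk with 1 CPU core and 5 GB of memory.
#PBS -l walltime=00:05:00          <== 5 minutes of walltime.
#PBS -M Your.Name@uts.edu.au       <== Put your own email here.
#PBS -m abe                        <== Email on abort, begin and end.

cd $PBS_O_WORKDIR                  <== Run from the directory the job was submitted from.
./primes.py                        <== Launch the program (may differ in the real script).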
Submit your job using `qsub`.
Monitor your job status and get detailed job status using `qstat` with its various command line options.
What happens if you exceed the walltime?
How many prime numbers were found? Hint: `cat primes.txt | wc -l`
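The submit-and-monitor cycle, in brief (the script name and job ID below are placeholders):

$ qsub submit_primes.sh      <== Submit; PBS prints the new job’s ID.
$ qstat -u u777999           <== Show just your own jobs.
$ qstat -f 123456            <== Show the full details of your job.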
**10:30 am to 11:00 am Break for Morning Tea / Coffee**
### Most Used PBS Commands
**11:00 am Talk and Demo ML** Discuss the most used PBS commands.
- `qsub` – submit a job
- `qstat` – show the status of jobs
- `qdel` – delete a job
Cover some of the most useful command line options:
$ qstat queue_name     <== Show the jobs in a particular queue.
$ qstat -p             <== Show the percentage completed of each job.
$ qstat -an1           <== Show all jobs, with their allocated nodes on one line.
$ qstat -f job_id      <== Show the full details of a job.
$ qstat -fx job_id     <== Show the full details of a finished job.
**11:15 am Participant Activity** Submit another job. Try out other PBS commands.
In this section we will submit a PBS job again but this time use some other PBS commands to get information about a running job and to delete a running job.
Using `qstat`: https://hpc.research.uts.edu.au/pbs/qstat/
Using `qdel`: https://hpc.research.uts.edu.au/pbs/qdel/
The PBS commands that give the same information as the HPC status page (https://hpc.research.uts.edu.au/status/) are `pbsnodes` and `qstat`.
Demo `man pbsnodes` and `pbsnodes`.
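A short sketch of this activity’s commands (the script name and job ID are placeholders):

$ qsub submit_primes.sh      <== Submit another job.
$ qstat -an1                 <== Note its job ID and which node it is on.
$ qdel 123456                <== Delete the running job using its ID.
$ pbsnodes -a                <== Show the state of every node, like the status page does.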
### Interactive Jobs
See https://hpc.research.uts.edu.au/pbs/access_nodes/#submitting-an-interactive-pbs-job. Leave this out unless we have time.
### PBS Queues and Job Routing
**11:30 am Talk ML** Short introduction to queues and job routing.
When you submit a job the job scheduler looks at how many CPUs and how much memory you are requesting, and routes your job to one of the regular HPC “queues”.
You do not need to specify a queue unless your job needs to be sent to one of our custom queues.
- `ciq` – Centre of Inflammation queue
- `gpuq` – the GPU queue
- `riskyq` – the risky queue
- `interq` – the interactive queue
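If a job does need one of these custom queues it is requested with the -q option, either on the command line or as a directive in the submission script. For example, using gpuq purely as an illustration (the script name is a placeholder):

$ qsub -q gpuq my_gpu_job.sh     <== On the command line, or ...
#PBS -q gpuq                     <== ... as a directive in the script itself.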
### Checking your Resource Usage
Explain how to check what resources were used after a job has finished, and how to use this information to estimate what one should request for future jobs.
How many cores do I need? How much memory do I need? What walltime do I need?
Explain the significance of the graph at https://hpc.research.uts.edu.au/software_specific/software_gromacs/
Commands you can use for the above are:
$ qstat -x -u u777999 <== Show your jobs, including finished ones.
$ qstat -fx | grep cpu
$ qstat -fx | grep mem
$ qstat -fx | grep walltime
Example of the information this will tell you:
ncpus=1
mem = 5,242,880 kb <== The 5 GB we asked for.
resources_used.mem = 17,328 kb <== The maximum used; PBS samples usage every 120 seconds.
resources_used.vmem = 692,980 kb <== Virtual memory: also includes swap and memory allocated but not used.
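Putting this together for a single finished job (the job ID is a placeholder):

$ qstat -fx 123456 | grep -E 'ncpus|mem|walltime'   <== Requested versus used resources in one go.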
**11:45 am Participant Activity** Check the resource usage of your previous jobs.
### Using Modules
This section is based on https://hpc.research.uts.edu.au/software_general/modules/
**12:00 pm Talk JZ** Touch a bit on Modules.
- What’s currently loaded: `module list`
- What’s available: `module avail`, e.g. `module avail | grep java`
- Load and unload: `module load java-latest` and `module unload java-latest`
- Getting help for a module: `module help java-1.8`
Note: For some modules there might not be any help written.
There is also a large collection of modules, developed by Leo Hardtke, under `/shared/c3/apps/modules/` and `/shared/rsg/apps/modules/`.
To make these visible, participants can edit their `.bash_profile` file and insert one of these lines.

For those in the Climate Change Cluster group, i.e. the `c3_users` group:

export MODULEPATH=/shared/c3/apps/modules:$MODULEPATH

For those in the Remote Sensing Group, i.e. in the `rsg_users` group:

export MODULEPATH=/shared/rsg/apps/modules:$MODULEPATH
When they log in again the `module avail` command will show these additional modules.
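To pick up the change without logging out and in again, the profile can simply be re-read in the current session:

$ source ~/.bash_profile     <== Re-read the profile.
$ module avail               <== The extra modules should now be listed.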
**12:30 pm to 1:00 pm Break for Lunch**
### Checkpointing
**XX:XX pm Participant Activity** Run a job which uses checkpointing.
Checkpointing is highly desirable on HPC systems, especially for long-running jobs.
Remember that earlier you created a `jobs` directory, and copied into there the `primes` directory from `/shared/eresearch/pbs_job_examples/primes/`.
We will use the same program in this example.
$ cd jobs/primes
Edit the `primes.py` program and change the “end” value to 250,000 or more.
Save and exit the editor. The program will now take longer to run.
Edit the submission script and change the walltime to be just 1 minute, i.e. `walltime=00:01:00`.
Submit the job.
If you type `qstat` you should see your job running. After a minute your job will have hit its walltime limit and will exit.
Compare the current output and error files (.o and .e files) to a previous successful run.
How many prime numbers were found this time compared to before?
Run the job again. Because the program checkpoints its progress, it should resume from where it stopped rather than starting from the beginning.
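The whole cycle, in brief. The submission script name below is a placeholder; use the one from the primes example.

$ nano primes.py             <== Change the “end” value to 250000 or more.
$ nano submit_primes.sh      <== Change the walltime to 00:01:00.
$ qsub submit_primes.sh      <== Submit the job.
$ qstat                      <== Watch it run until it hits its walltime.
$ cat primes.txt | wc -l     <== Count the primes found so far.
$ qsub submit_primes.sh      <== Resubmit; it should resume from the checkpoint.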
### MPI – The Real Power of HPCs
**XX:XX pm Participant Activity** Run an MPI job and check resource usage.
MPI stands for “Message Passing Interface”. This is a method for programs to run multiple copies of themselves and to pass data back and forth to enable parallel computation.
Remember earlier that you created a `jobs` directory and copied into there the `primes` directory. Let’s now do the same thing but with the `mpi` directory.
$ cd jobs
$ cp -r /shared/eresearch/pbs_job_examples/mpi . <== Note dot at end!
There is a README.md file in that directory which explains everything :-)
Run the bash shell script `compile_all.sh`, which will compile and build two executable programs from the two source files `primes_serial.f90` and `primes_mpi.f90`. You can use the command `cat compile_all.sh` to see what commands it runs to compile the two programs.
$ ./compile_all.sh
Now you will find two executable programs in your directory: `primes_serial` and `primes_mpi`.
They will calculate the prime numbers up to 100 million.
Note: You can’t run multiple copies of these programs at the same time, as they all write to the same output file, `primes.dat`.
Run the primes_serial program under PBS using the submit_primes_serial.sh submission script.
How long does it take to run?
Hint: `qstat -fx job_id | grep time`
Run the primes_mpi program under PBS using the submit_primes_mpi.sh submission script.
How long does it take to run?
How do these times compare to the Python program that calculated prime numbers?
During this talk we could introduce the `top`, `htop` and `ps` commands so participants can see the primes processes running.
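For example, run on the node where the job is executing (u777999 is a placeholder login; the -u flags limit the output to your own processes):

$ ps -u u777999      <== List your processes.
$ top -u u777999     <== A live view of your processes; press q to quit.
$ htop -u u777999    <== A friendlier version of top, if it is installed.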
## Delving Deeper
These won’t be covered in this short course but you can learn about how to use them from the HPC help pages.
- Array Jobs https://hpc.research.uts.edu.au/pbs/array_jobs/
- Job Reservations https://hpc.research.uts.edu.au/pbs/reservations/
- Passing Args to Jobs https://hpc.research.uts.edu.au/pbs/passing_args/
- GPU Computing https://hpc.research.uts.edu.au/gpu/
## Further Help
- Where to get further help? Make sure they know what the UTS HPC documentation pages cover and where to find what.
- The PBS manuals are under `/shared/eresearch/pbs_manuals/`. Download `PBSUserGuide2024.1.pdf` as this is the most useful document for participants.
## Glossary
GREP: Global Regular Expression Print
Its name comes from the `ed` command `g/re/p` (global regular expression search and print). It was written in 1973 so it’s a bit over 50 years old.
SSH: Secure Shell
ssh is a program for logging into a remote machine and for executing commands on it. It provides secure encrypted communications between two hosts over the network.
Terminal:
The term “terminal” in computing comes from its literal meaning as an
“endpoint,” from the Latin “terminus” (meaning boundary or end). In early
computing, terminals were physical endpoints – devices with just a screen and
keyboard that connected to mainframe computers.
Today’s “terminal” isn’t a physical device anymore; it’s a program (like Terminal on Mac or Command Prompt on Windows) that simulates those old physical terminals, carrying on that legacy of being the “endpoint” where human and computer meet for direct communication.
We keep the name because it serves the same purpose.
Adapted from: https://www.reddit.com/r/webdev/comments/1gzhqp7/til_why_is_the_terminal_called_that/
## References
UTS Sites:
https://hpc.research.uts.edu.au/ and https://hpc.research.uts.edu.au/status/
Carpentries Sites: “Introduction to High-Performance Computing”
https://carpentries-incubator.github.io/hpc-intro/
Its GitHub repo is at: https://github.com/carpentries-incubator/hpc-intro
Intersect Australia: “Getting started with HPC using PBS Pro” https://intersectaustralia.github.io/training/HPC201/index