Submitting Multiple Jobs¶
You might also want to look at Running Array Jobs as that page also covers submitting multiple jobs and using PBS Environment Variables.
The Task¶
One of our HPC users has 40 input files of genomic sequences like this:
sequence_7004M.fasta
sequence_70082.fasta
sequence_700N7.fasta
.....
They wish to process them using RAxML-NG and get the following output files:
test_7004M.fasta
test_70082.fasta
test_700N7.fasta
.....
With RAxML-NG they would run one of the files like this:
$ raxml-ng -s sequence_7004M.fasta -n test_7004M.fasta --model LG+I+G4 --threads 4 --redo
Here the option -s sequenceFileName specifies the input file name and -n outputFileName specifies the output file name. The value given to --model is something that genomics experts would understand. Just ignore that for this exercise :-)
The user knows, though, that they should not write 40 slightly different PBS submission scripts. A script can do this for them.
The Solution¶
We will have a single PBS submission script that is generic, in that it does not have the input and output files hardcoded. They are passed into that script as “environment variables”. We then use a “wrapper” script that reads an input file list, creates the input and output file names from it, and passes those to our PBS submission script.
What will happen is that when we run the wrapper script multiple PBS jobs will be submitted, one for each genome sequence that we wish to process.
Setup Procedure¶
Create a List of Sequence Files to Process¶
First create a file listing all of our input files. Do this by running this command in the directory containing the FASTA files.
$ ls -1 sequence_*.fasta > sequence_list.txt
Then edit the file to include a comment at the top. It’s always good to document what your files are.
The file sequence_list.txt now looks like this:
# Input list for RAxML-NG, created 20 Nov 2022
sequence_7004M.fasta
sequence_70082.fasta
sequence_700N7.fasta
PS. You can have blank or empty lines in this file. The script run_multiple_jobs.sh
will skip those lines along with any comment lines.
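The skipping rule can be tried on its own before any jobs are submitted. The following is a minimal sketch, not part of the original scripts: list_entries is a hypothetical helper name, and it uses a case statement rather than the wrapper's regular expression test, which should give the same result for lines that start with a #.

```shell
# list_entries is a hypothetical helper, not part of the original scripts.
# It prints only the usable entries of a file list, skipping comment lines
# and blank lines in the same way the wrapper script is described as doing.
list_entries() {
    # "read" trims leading and trailing whitespace, so a line holding only
    # spaces arrives here as an empty string. The "|| [ -n ... ]" part keeps
    # a final line that has no trailing newline.
    while read -r line || [ -n "$line" ]; do
        case "$line" in
            '#'*) continue ;;   # comment line
            '')   continue ;;   # blank line
        esac
        printf '%s\n' "$line"
    done < "$1"
}
```

Running list_entries sequence_list.txt | wc -l then gives the number of jobs the wrapper would try to submit.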
Write your Single PBS Submission Script¶
Now write your PBS submission script. Name this file sub_single_job.sh.
#!/bin/bash
# Set the resource requirements; 4 CPUs (to match --threads 4 below), 5 GB memory and 1 hour walltime.
#PBS -N testalign
#PBS -l ncpus=4
#PBS -l mem=5GB
#PBS -l walltime=01:00:00
# Send an email when this job aborts, begins or ends.
##PBS -m abe
##PBS -M Mike.Lake@uts.edu.au
cd "${PBS_O_WORKDIR}" || exit 1
source $HOME/miniconda3/bin/activate
conda activate bioinfo
# Now run raxml.
echo "Running raxml-ng with input $input ==> output $output"
raxml-ng -s "$input" -n "$output" --model LG+I+G4 --threads 4 --redo
conda deactivate
Note the raxml-ng line near the end of the script. That is where the $input and $output variables are used.
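If sub_single_job.sh is ever submitted directly, without qsub's -v option, $input and $output will be empty and raxml-ng will fail with a less obvious message. The following is a minimal sketch of a guard, assuming it is placed before the raxml-ng line. The name check_job_vars is hypothetical and not part of the original script.

```shell
# check_job_vars is a hypothetical guard, not part of the original script.
# It fails with a clear message when the two variables were not passed in.
check_job_vars() {
    if [ -z "${input:-}" ] || [ -z "${output:-}" ]; then
        echo "Error: input and output must be passed in with qsub -v" >&2
        return 1
    fi
    echo "Running raxml-ng with input $input ==> output $output"
}

# In the job script this would be called as:
#   check_job_vars || exit 1
```

With the guard in place the reason for the failure appears in the job's error file.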
Write the Wrapper Script¶
Now write the wrapper script. Name this file run_multiple_jobs.sh. The script is below and it’s well commented so you can read through it to understand what it does.
#!/bin/bash
# This script when run will read an input file containing the filenames
# of fasta input data to be submitted to a qsub submission script. For
# each filename in that input file it will submit a job for that file
# using the qsub submission script called sub_single_job.sh
# The filenames are passed to this submission script via environment variables.
#
# Create your input file list called sequence_list.txt.
# Note: You can place comments in there by placing a # at the very start of a line.
#
# $ ls -1 sequence_*.fasta > sequence_list.txt
#
# Usage: ./run_multiple_jobs.sh
# Do not use qsub with this script.
# Check we have our file list to read.
if [ ! -f sequence_list.txt ]; then
    echo "Missing input file list. Exiting"
    exit 1
fi
# Check we have our qsub submission script.
if [ ! -f sub_single_job.sh ]; then
    echo "Missing submission script. Exiting"
    exit 1
fi
# Now we read in the input filenames line by line from "sequence_list.txt".
# You will see that filename on the last line of this "while - done" loop.
# Note that this "read" will trim leading and trailing space from lines.
i=0
while read -r line; do
    i=$((i+1)) # Increment our line counter.
    # Skip lines starting with a # i.e. comment lines.
    if [[ $line =~ ^# ]]; then
        continue
    fi
    # Skip lines that are empty. As the line has been trimmed any
    # line that just contained whitespace will have zero length.
    if [ -z "$line" ]; then
        continue
    fi
    # Set environment variables for our input and output filenames.
    # For the output filename replace "sequence" with "test".
    input="$line"
    output=$(echo "$input" | sed 's/sequence/test/')
    # Check if our input file exists.
    if [ ! -f "$input" ]; then
        echo "Line $i: Input file \"$input\" does not exist so skipping this entry."
        continue
    fi
    # Submit our job, passing the filenames to the job using env variables.
    # Note: The list of environment variables must reach qsub as a single
    # argument, so it contains no spaces and is quoted.
    echo -n "Submitting $input ==> $output Job: "
    qsub -v "input=$input,output=$output" sub_single_job.sh
done < sequence_list.txt
Note the line near the end of the script where qsub is invoked to submit the job, and how we use qsub’s -v option to pass in the $input and $output variables.
Type man qsub for the manual page for qsub and search for the -v option.
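Before submitting 40 real jobs it can help to see what the wrapper would pass to qsub. The following is a minimal dry-run sketch that only prints the commands. The helper names make_output_name and build_qsub_vars are hypothetical, and the file names are the examples from this page.

```shell
# make_output_name and build_qsub_vars are hypothetical helpers,
# not part of the original wrapper script.
make_output_name() {
    # Replace the first "sequence" with "test", as the wrapper's sed call does.
    printf '%s\n' "$1" | sed 's/sequence/test/'
}

build_qsub_vars() {
    # Build the comma separated list for qsub's -v option, with no spaces.
    printf 'input=%s,output=%s\n' "$1" "$(make_output_name "$1")"
}

# Dry run: print the qsub commands instead of running them.
for f in sequence_7004M.fasta sequence_70082.fasta; do
    echo "qsub -v \"$(build_qsub_vars "$f")\" sub_single_job.sh"
done
```

Once the printed commands look right, the real wrapper can be run.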
Finally set our wrapper script to be executable. We do not need the sub_single_job.sh script to be executable.
$ chmod u+x run_multiple_jobs.sh
Submit Jobs¶
Now you can run the wrapper script. Remember, you don’t submit this to PBS – just run the wrapper script. It submits the jobs for you, one job for each genomic sequence.
$ ./run_multiple_jobs.sh
The output will be like this:
Submitting sequence_7004M.fasta ==> test_7004M.fasta Job: 2002164.hpcnode0
Submitting sequence_70082.fasta ==> test_70082.fasta Job: 2002165.hpcnode0
Submitting sequence_700N7.fasta ==> test_700N7.fasta Job: 2002166.hpcnode0
Running qstat should show the above jobs running. At the end of the runs you will have the PBS error and output files for each job like this:
$ ls -l
-rw------- 1 mlake mlake 0 Nov 25 testalign.e2002164
-rw------- 1 mlake mlake 0 Nov 25 testalign.e2002165
-rw------- 1 mlake mlake 0 Nov 25 testalign.e2002166
-rw------- 1 mlake mlake 106 Nov 25 testalign.o2002164
-rw------- 1 mlake mlake 106 Nov 25 testalign.o2002165
-rw------- 1 mlake mlake 106 Nov 25 testalign.o2002166
Good, all the error files are of zero size, so no errors. We can look at one of the output files and see it ran OK.
$ cat testalign.o2002164
Running raxml-ng with input sequence_7004M.fasta ==> output test_7004M.fasta
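Checking the error files by eye works for three jobs but gets tedious for 40. The following is a minimal sketch that reports any error file that is not empty. It assumes the job name testalign used on this page, and the function name report_nonempty_errors is hypothetical.

```shell
# report_nonempty_errors is a hypothetical helper, not part of the original page.
# $1 is the directory holding the PBS error files; it defaults to the current one.
report_nonempty_errors() {
    dir=${1:-.}
    count=0
    for f in "$dir"/testalign.e*; do
        [ -e "$f" ] || continue     # nothing matched, the pattern stays literal
        if [ -s "$f" ]; then        # -s is true for files larger than zero bytes
            echo "Non-empty error file: $f"
            count=$((count + 1))
        fi
    done
    echo "$count job(s) wrote to stderr"
}
```

Note that an empty error file only shows that nothing was written to stderr, so it is still worth looking at one or two of the output files.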
I hope you have been able to see how a little knowledge of bash scripting can save you much work :-)
Note about qsub -v Option¶
If you look at the man page for qsub and search for the -v option you will find it says:
-v <variable list>
Lists environment variables and shell functions to be exported to the job.
........
example:
qsub -v a=10, "var2='A,B'", c=20, HOME=/home/zzz job.sh
The example above is incorrect. You will just get a “usage” message listing the options for qsub.
The correct way to submit multiple environment variables is either to have no spaces in the list of variables being passed to the -v option, i.e.
qsub -v a=10,"var2='A,B'",c=20,HOME=/home/zzz job.sh
or to quote the entire env variable string like this:
$ qsub -v "a=10,var2='A,B',c=20,HOME=/home/zzz" job.sh
or
$ qsub -v "a=10, var2='A,B', c=20, HOME=/home/zzz" job.sh
This will pass the variables to job.sh and run the job.
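A likely reason the man page example fails is that the shell splits the command line on unquoted spaces before qsub ever sees it, so the variable list arrives as several separate arguments. The following sketch shows the splitting with a stand-in function, so it can be run without PBS. The name count_args is hypothetical.

```shell
# count_args is a hypothetical stand-in for qsub that only reports
# how many arguments the shell handed to it.
count_args() {
    echo "$#"
}

# The man page example: the spaces split the list into separate arguments.
count_args -v a=10, "var2='A,B'", c=20, HOME=/home/zzz job.sh      # prints 6

# No spaces in the list: -v, the whole list, and job.sh.
count_args -v a=10,"var2='A,B'",c=20,HOME=/home/zzz job.sh         # prints 3

# The whole list quoted: spaces inside the quotes do not split it.
count_args -v "a=10, var2='A,B', c=20, HOME=/home/zzz" job.sh      # prints 3
```

In the two working forms the -v option is followed by exactly one argument holding the whole list.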
PS. If you wish to just test your syntax for the env variables then you can use /bin/sleep 60 for the job. It will do nothing but sleep for 60 seconds.
$ qsub -v "a=10, var2='A,B', c=20, HOME=/home/zzz" -- /bin/sleep 60
References¶
RAxML Next Generation: https://github.com/amkozlov/raxml-ng
Altair PBS Professional 2024.1 User's Guide
See the section “6.12 Using Environment Variables”, page UG-128.
This PDF document can be found on the HPC here:
/shared/eresearch/pbs_manuals/PBSUserGuide2024.1.pdf