In your PBS submission scripts you need to specify the “resources” your job requires, such as walltime, CPU cores and memory.
If you specify insufficient walltime, the PBS scheduler will terminate your program before it has finished. If you don’t specify sufficient memory, your program might run slowly or it might even crash.
But if you specify far more memory, cores or walltime than your job requires then you will probably be waiting longer for your job to start running. The scheduler will not start your job until it knows that the resources that you have requested will be available.
So you need to know approximately what resources your jobs will need and then request a bit more than that in your job scripts to allow for some leeway. The way to do this is to run a few jobs and look at the job information after they have finished. That will guide you in what to request for future jobs.
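As a rough illustration of where these requests go, here is a minimal job script sketch. It assumes the PBS Professional `select` syntax; your cluster may expect a different form, so check its documentation. The job name `example_job` and the value `ncpus=4` are made up for the example, and the memory and walltime values are only placeholders.

```shell
#!/bin/bash
# Hypothetical example: the job name and all resource values are placeholders.
#PBS -N example_job
# One chunk with 4 CPU cores and 16 GB of memory (PBS Professional syntax, assumed).
#PBS -l select=1:ncpus=4:mem=16gb
# Maximum run time as hours:minutes:seconds.
#PBS -l walltime=50:00:00

# PBS sets PBS_O_WORKDIR to the directory qsub was run from;
# fall back to the current directory when run outside PBS.
cd "${PBS_O_WORKDIR:-.}" || exit 1

# Placeholder for the real workload.
echo "Job started on $(uname -n)"
```

The `#PBS` lines are comments to bash, so the script also runs unchanged outside the scheduler, which is handy for a quick test.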
It’s better for you and for other users on the cluster if you regularly check what resources your jobs are actually using and adjust your job scripts accordingly.
Here are some useful bash scripts which will help you to find out what walltime
your jobs need and what amount of memory they require. They are also an introduction
to using classic UNIX utilities such as
Save this script as check_walltime.sh and make it executable with chmod u+x check_walltime.sh. Run it:
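    #!/bin/bash
    # Show requested walltime and actual used time for finished jobs.

    # Job IDs are the first column of the history listing;
    # header lines do not start with a digit, so grep drops them.
    jobs=$(qstat -H | cut -d ' ' -f1 | grep '^[0-9]')
    for job in $jobs; do
        echo "--- $job ---"
        qstat -fH "$job" | grep walltime
    done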
    --- 247742.hpcnode0 ---
        resources_used.walltime = 33:13:57
        Resource_List.walltime = 200:00:00
    --- 248238.hpcnode0 ---
        resources_used.walltime = 26:49:59
        Resource_List.walltime = 120:00:00
    --- 248514.hpcnode0 ---
        resources_used.walltime = 35:42:52
        Resource_List.walltime = 200:00:00
You can see here that two of these jobs requested 200 hours of walltime and one requested 120 hours. The 200 hours is probably there because the user didn’t specify a required walltime and the default value is 200 hours. However, those jobs took at most about 36 hours to run. If you request a walltime of 50 hours, those jobs will still finish and they will probably get scheduled to run sooner.
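The arithmetic behind such a choice can be sketched in bash. The function names `walltime_to_seconds` and `suggest_walltime` and the default 40% margin are assumptions made for this example, not part of PBS; pick whatever margin suits your jobs.

```shell
#!/bin/bash
# Convert a PBS walltime string (HH:MM:SS) to seconds.
walltime_to_seconds() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    # 10# forces base 10 so values like "08" are not read as octal.
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Suggest a walltime request: used time plus a margin in percent
# (default 40, an arbitrary choice), rounded up to whole hours.
suggest_walltime() {
    local used_s margin_pct total_s hours
    used_s=$(walltime_to_seconds "$1")
    margin_pct=${2:-40}
    total_s=$(( used_s * (100 + margin_pct) / 100 ))
    hours=$(( (total_s + 3599) / 3600 ))
    printf '%d:00:00\n' "$hours"
}

suggest_walltime 35:42:52       # prints 50:00:00
suggest_walltime 35:42:52 100   # prints 72:00:00
```

Feed it the largest resources_used.walltime you have seen for that kind of job, not the average, so that the slowest run still fits.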
Save this script as check_memory.sh and make it executable with chmod u+x check_memory.sh. Run it:
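    #!/bin/bash
    # Show requested memory and actual used memory for finished jobs.
    # (Ignore vmem: the pattern '\.mem' matches the mem lines but not vmem.)

    # Job IDs are the first column of the history listing;
    # header lines do not start with a digit, so grep drops them.
    jobs=$(qstat -H | cut -d ' ' -f1 | grep '^[0-9]')
    for job in $jobs; do
        echo "--- $job ---"
        qstat -fH "$job" | grep '\.mem'
    done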
    --- 248734.hpcnode0 ---
        resources_used.mem = 7757784kb
        Resource_List.mem = 120gb
    --- 248752.hpcnode0 ---
        resources_used.mem = 7178944kb
        Resource_List.mem = 120gb
    --- 248753.hpcnode0 ---
        resources_used.mem = 7179252kb
        Resource_List.mem = 120gb
    --- 248754.hpcnode0 ---
        resources_used.mem = 7082440kb
        Resource_List.mem = 120gb
Here the user requested 120 GB of RAM but the program consistently used only about 7 GB of RAM (resources_used.mem = 7757784kb ≈ 7.4 GB, counting 1024 × 1024 kb per gb). The user might have had their jobs scheduled to run earlier if they had requested just 16 GB of RAM. That would have been plenty, with good leeway in case one run had used more memory.
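Converting the kb figures by hand gets tedious, so here is a small sketch that does it with awk. The function names `kb_to_gb` and `suggest_mem_gb` and the default safety factor of 2 are assumptions made for this example, and the conversion assumes that 1 gb is 1024 × 1024 kb.

```shell
#!/bin/bash
# Convert a PBS memory value in kb (e.g. "7757784kb") to GB with one decimal.
# Assumes 1 gb = 1024 * 1024 kb.
kb_to_gb() {
    local kb=${1%kb}    # strip a trailing "kb" suffix if present
    awk -v kb="$kb" 'BEGIN { printf "%.1f\n", kb / (1024 * 1024) }'
}

# Suggest a memory request in whole GB: used memory times a safety
# factor (default 2, an arbitrary choice), rounded up.
suggest_mem_gb() {
    local kb=${1%kb} factor=${2:-2}
    awk -v kb="$kb" -v f="$factor" 'BEGIN {
        gb = kb * f / (1024 * 1024)
        whole = int(gb)
        if (whole < gb) whole++
        printf "%dgb\n", whole
    }'
}

kb_to_gb 7757784kb         # prints 7.4
suggest_mem_gb 7757784kb   # prints 15gb
```

With a factor of 2 the largest run above comes out at 15gb, which is in line with the 16 GB suggested in the text.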
Writing a short bash script like the ones above to check how many CPU cores your program used is left as an exercise for the reader.