Checking GPU Usage¶
You can obtain basic information on the NVIDIA GPUs and their current usage using NVIDIA's “System Management Interface” program nvidia-smi.
Look at its man page for details (man nvidia-smi), or run it with the -h option, i.e. nvidia-smi -h, for help.
See Checking Device Capability to find detailed information on the card.
An alternative to Nvidia's monitoring tool is the interactive Nvidia GPU process viewer; see Using nvitop.
Using nvidia-smi¶
Note
Please do not use the nvidia-smi command with the -l option, nor with the watch command. Continuously running nvidia-smi in a loop consumes GPU resources and will slow down everyone else's jobs.
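If you want to keep an eye on your job, take occasional one-off snapshots instead. For example, nvidia-smi's query mode (documented in man nvidia-smi) prints a single machine-readable report and exits:
# a single snapshot, then exit - no looping
$ nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,memory.total --format=csv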
You need to be logged into a GPU node to run this command. Here is an example of its output.
GPUNode $ nvidia-smi
Mon Jun 28 14:13:56 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   61C    P0   155W / 250W |  17289MiB / 32510MiB |     91%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   62C    P0   152W / 250W |  17289MiB / 32510MiB |     89%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     57573      C   python                          17283MiB |
|    1   N/A  N/A     57573      C   python                          17283MiB |
+-----------------------------------------------------------------------------+
$
You can see that this node has two Tesla V100 GPUs installed. Both are running at about 50% memory usage and 90% GPU utilisation.
You can get a list of the two GPUs and their UUIDs with this:
hpcnode10 $ nvidia-smi --list-gpus
GPU 0: Tesla V100-PCIE-32GB (UUID: GPU-37f061b1-7948-e188-56a7-d30f5e0ffc70)
GPU 1: Tesla V100-PCIE-32GB (UUID: GPU-151b0546-4c5b-039a-e1e2-0acaa0098909)
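If you just want the UUIDs for scripting, the query interface can print them as plain CSV (the available field names are listed in man nvidia-smi):
# index and UUID of each GPU, one per line, no header row
$ nvidia-smi --query-gpu=index,uuid --format=csv,noheader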
You can specify what information you would like to see by using the --query option with --display parameters, e.g.:
$ nvidia-smi -q -d MEMORY,COMPUTE,UTILIZATION
The above will show the data for both GPUs. If you only wish to see the information for a specific GPU then you can specify the UUID to query:
$ nvidia-smi -q -d MEMORY,COMPUTE,UTILIZATION -i GPU-37f061b1-7948-e188-56a7-d30f5e0ffc70 --loop=600
Notice in the above example I have also used the --loop option. This can be very useful, but please do not use it continuously with small time intervals.
For all the details on this command see the manual page man nvidia-smi.
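The same query interface also covers the per-GPU process list, which is a quick way to see what is running without printing the full table. A minimal example (see nvidia-smi --help-query-compute-apps for all available fields):
# one line per compute process, with the GPU it is using and its memory usage
$ nvidia-smi --query-compute-apps=gpu_uuid,pid,process_name,used_memory --format=csv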
Using nvitop¶
An alternative to Nvidia's monitoring tool is the interactive Nvidia GPU process viewer from https://github.com/XuehaiPan/nvitop. The requirements for this are already installed on the HPC GPU nodes. Just go to the “Installation” section there, or read how I installed it below.
It needs to be installed into your own local Python environment.
If you're using Miniconda, you will need to load your base Miniconda Python environment and then activate your project conda environment.
(For a refresher on python environments see here:
https://hpc.research.uts.edu.au/software_general/python/python_miniconda/)
Be on a GPU node and load CUDA:
GPU node ~/$ module load cuda-latest
Load the base Miniconda and load your project environment. In this case mine is called “pytorch”:
$ source $HOME/miniconda3/etc/profile.d/conda.sh
$ conda activate pytorch
(pytorch)$
Now install nvitop:
$ pip3 install nvitop
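You can confirm which version pip installed with:
$ pip3 show nvitop
This prints the package name, version and install location.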
Now you can just type nvitop to run it. Type nvitop -h for some short usage help.
The full documentation for it is on Xuehai Pan’s website as above.
My version of nvitop is 1.3.2.
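In keeping with the note above about not leaving monitors looping, nvitop can also print a single report and exit rather than running interactively. In my version this is the --once (or -1) option, but check nvitop -h, as the options can change between releases:
# print one report and exit, instead of running the interactive monitor
$ nvitop -1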