Checking GPU Usage

You can obtain basic information about the NVIDIA GPUs and their current usage using NVIDIA’s “System Management Interface” program, nvidia-smi. See its man page for details (man nvidia-smi), or run it with the -h option, i.e. nvidia-smi -h, for help. See Checking Device Capability to find out detailed information on the card.

An alternative to Nvidia’s monitoring tool is the interactive Nvidia GPU process viewer; see Using nvitop below.

Using nvidia-smi

Note

Please do not run nvidia-smi with the -l option, or under the watch command. Continuously running nvidia-smi in a loop consumes GPU resources and will slow down everyone else's jobs.

You need to be logged into a GPU node to run this command. Here is an example of its output.

GPUNode $ nvidia-smi

Mon Jun 28 14:13:56 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   61C    P0   155W / 250W |  17289MiB / 32510MiB |     91%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   62C    P0   152W / 250W |  17289MiB / 32510MiB |     89%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     57573      C   python                          17283MiB |
|    1   N/A  N/A     57573      C   python                          17283MiB |
+-----------------------------------------------------------------------------+
$ 

You can see that this node has two Tesla V100 GPUs installed. Both are running at about 50% memory usage and 90% GPU utilisation.
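
The Processes table at the bottom of the output shows the PID of each process using a GPU. If you want to check whether one of those processes is yours, you can look up the PID with a standard Linux command (57573 is just the PID from the example output above; substitute the PID shown on your node):

$ ps -o user,pid,etime,cmd -p 57573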

You can get a list of the two GPUs and their UUIDs with this:

hpcnode10 $ nvidia-smi --list-gpus
GPU 0: Tesla V100-PCIE-32GB (UUID: GPU-37f061b1-7948-e188-56a7-d30f5e0ffc70)
GPU 1: Tesla V100-PCIE-32GB (UUID: GPU-151b0546-4c5b-039a-e1e2-0acaa0098909)

You can specify what information you would like to see by using the query option (-q) with display parameters (-d), e.g.:

$ nvidia-smi -q -d MEMORY,COMPUTE,UTILIZATION

The above will show the data for both GPUs. If you only wish to see the information for a specific GPU then you can specify the UUID to query:

$ nvidia-smi -q -d MEMORY,COMPUTE,UTILIZATION -i GPU-37f061b1-7948-e188-56a7-d30f5e0ffc70 --loop=600

Notice that in the above example I have also used the --loop option, which repeats the query at the given interval in seconds (here every 600 seconds, i.e. every 10 minutes). This can be very useful, but please do not run it continuously with small time intervals.

For all the details on this command see the manual page: man nvidia-smi.
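
If you want a compact, scriptable summary instead, nvidia-smi also has a --query-gpu option with CSV output. The fields below are just a small selection; run nvidia-smi --help-query-gpu to list every field supported by your driver version:

$ nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total --format=csv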

Using nvitop

An alternative to Nvidia’s monitoring tool is the interactive Nvidia GPU process viewer nvitop, from https://github.com/XuehaiPan/nvitop. Its requirements are already installed on the HPC GPU nodes. You can follow the “Installation” section there, or read how I installed it below.

It needs to be installed into your own local Python environment. If you're using Miniconda, you will need to load your base Miniconda Python environment and then activate your project conda environment.
(For a refresher on Python environments see here: https://hpc.research.uts.edu.au/software_general/python/python_miniconda/)

Be on a GPU node and load CUDA:

GPU node ~/$ module load cuda-latest

Load the base Miniconda environment and activate your project environment. In this case mine is called “pytorch”:

$ source $HOME/miniconda3/etc/profile.d/conda.sh
$ conda activate pytorch
(pytorch)$

Now install nvitop:

$ pip3 install nvitop
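
If you want to confirm which version was installed, you can ask pip (this is standard pip usage, nothing specific to nvitop):

$ pip3 show nvitop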

Now you can just type nvitop to run it. Type nvitop -h for some short usage help. The full documentation for it is on Xuehai Pan’s website as above.
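
If you only want a one-off snapshot rather than a continuously updating screen, nvitop’s documentation describes a batch mode; on my version this was the -1 (or --once) option, but check nvitop -h in case the options differ in your installed version:

$ nvitop -1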

My version of nvitop is 1.3.2.