HPC Cluster

The most up-to-date GPU documentation:

https://wiki.hpc.uconn.edu/index.php/HPC_Getting_Started#GPU_Cluster_Nodes

https://wiki.hpc.uconn.edu/index.php/GPU_Guide

The latest version of Python that includes the TensorFlow and auto-sklearn libraries cannot be run on the login nodes. To avoid the error messages, use the following submission script:

#!/bin/bash
#SBATCH -p gpu_rtx
#SBATCH --gres=gpu:1
#SBATCH -t 10:00
#SBATCH --mail-user=first.last@uconn.edu
#SBATCH --mail-type=ALL

module purge
module load gcc/9.2.0 libffi/3.2.1 bzip2/1.0.6 tcl/8.6.6.8606 sqlite/3.30.1 lzma/4.32.7 cudnn/7.1.2 cuda/10.0 java/1.8.0_162 swig/3.0.7 python/3.7.6-tf2.0-rhel7.7

python3 your_python_code.py
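Here, your_python_code.py is just a placeholder for your own script. As a minimal sketch (a hypothetical example, assuming the TensorFlow 2.0 build provided by the python/3.7.6-tf2.0-rhel7.7 module), a script that simply confirms TensorFlow loads and can see the requested GPU might look like:

# your_python_code.py (hypothetical example): confirm TensorFlow loads and sees the GPU
import tensorflow as tf

print("TensorFlow version:", tf.__version__)

# On a gpu_rtx node with --gres=gpu:1 this list should contain one GPU
print("GPUs visible:", tf.config.experimental.list_physical_devices("GPU"))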

singularity

TensorFlow requires a newer version of glibc than the one available on our HPC cluster, so we use “singularity” to work around this issue. Currently, HPC maintains a singularity container in which the deep-learning packages, such as TensorFlow and PyTorch, have been installed.

To use singularity, first load the module file named “singularity/3.1”.

Then decide whether your deep learning program should run in CPU mode or GPU mode.

For running in CPU mode:

singularity exec /apps2/singularity/general/general_python3 python3 my_tensorflow_program.py

In CPU mode, you can submit your job to any of our compute nodes.

For running in GPU mode:

singularity exec --nv -B /apps2/cuda/10.0/lib64:/home/extraLibs/cuda,/apps2/cudnn/7.4.2/cuda/lib64:/home/extraLibs/cudnn /apps2/singularity/general/general_python3 python3 my_tensorflow_program.py
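Either command runs the same script; the only difference is whether the container is given access to the GPU. As a hypothetical sketch (assuming the container provides TensorFlow 2.x), my_tensorflow_program.py might look like:

# my_tensorflow_program.py (hypothetical sketch): a tiny Keras model that runs in CPU or GPU mode
import numpy as np
import tensorflow as tf

# An empty list here means the program is running in CPU mode
print("GPUs:", tf.config.experimental.list_physical_devices("GPU"))

# Small synthetic regression problem
x = np.random.rand(1024, 8).astype("float32")
y = x.sum(axis=1, keepdims=True)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, epochs=2, batch_size=64, verbose=2)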

If the model is to run in GPU mode:

(1) Make sure to use the “gpu” partition.

For example:

#SBATCH -p gpu

(2) Make sure to request at least one GPU card.

For example:

#SBATCH --gres=gpu:1 #Request a single GPU card. Max value is 2 for K40m and 3 for V100 GPU nodes

(3) Make sure to load the module files below:

module load singularity/3.1 cuda/10.0

A sample sbatch script:

#!/bin/bash
#SBATCH -p gpu
#SBATCH --gres=gpu:1
#SBATCH --exclude=gpu[01-02]
#SBATCH --job-name=NeuralNetworksGPU
#SBATCH --output=NeuralNetworksGPU-%j.out
#SBATCH -N 1

module purge
module load singularity/3.1

singularity exec --nv -B /apps2/cuda/10.0/lib64:/home/extraLibs/cuda,/apps2/cudnn/7.4.2/cuda/lib64:/home/extraLibs/cudnn /apps2/singularity/general/general_python3 python3 yield_build_nn_gpu.py

In the above script, if you want to use only the K40m nodes, replace the --exclude line with:

#SBATCH --exclude=gpu[03-06]

To use only the V100 GPUs, use this line instead (as in the sample above):

#SBATCH --exclude=gpu[01-02]

Singularity can also be used for non-TensorFlow tasks. A sample sbatch script is as follows:

#!/bin/bash
#SBATCH --partition=HaswellPriority
#SBATCH --qos=xxxxx
#SBATCH --job-name=RandomForest
#SBATCH --output=RandomForest-%j.out
#SBATCH -N 1
#SBATCH -c 24

####SBATCH -p gpu
####SBATCH --gres=gpu:1
####SBATCH -N 1

module purge
module load singularity/3.1 cuda/10.0

singularity exec /apps2/singularity/general/general_python3 python3 yield_build_rf.py
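The program yield_build_rf.py above is just an example name. As a hypothetical sketch of such a non-TensorFlow task (assuming scikit-learn is installed inside the container), it might look like:

# yield_build_rf.py (hypothetical sketch): a small scikit-learn random forest
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the real dataset
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_jobs=-1 uses all of the cores requested with "#SBATCH -c 24"
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))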

Currently, we have access to the HaswellPriority, SkylakePriority, general, and gpu partitions.

fisbatch

We can run interactively by submitting an interactive “fisbatch” job to the specified partition:
$ fisbatch -p gpu_rtx --gres=gpu:1 -t 10:00 # <-- set a time limit up to the limit for the partition
Then, once the fisbatch session opens and assigns you to a node, load your modules:
$ module purge
$ module load gcc/9.2.0 libffi/3.2.1 bzip2/1.0.6 tcl/8.6.6.8606 sqlite/3.30.1 lzma/4.32.7 cudnn/7.1.2 cuda/10.0 java/1.8.0_162 swig/3.0.7 python/3.7.6-tf2.0-rhel7.7

Then we can run python3 interactively. Doing it this way will prevent the job from running on a login node.
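For example, a quick check to run inside the interactive python3 session (a hypothetical example, assuming the tf2.0 module stack above):

# Lines to type in the interactive python3 interpreter on the assigned compute node
import tensorflow as tf
print(tf.__version__)
# With --gres=gpu:1 on the gpu_rtx partition this should list one GPU
print(tf.config.experimental.list_physical_devices("GPU"))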