The most up-to-date GPU documentation:
https://wiki.hpc.uconn.edu/index.php/HPC_Getting_Started#GPU_Cluster_Nodes
https://wiki.hpc.uconn.edu/index.php/GPU_Guide
The latest version of Python, which includes the TensorFlow and auto-sklearn libraries, cannot be run on the login nodes. To avoid the resulting error messages, use a submission script such as the following:
#!/bin/bash
#SBATCH -p gpu_rtx
#SBATCH --gres=gpu:1
#SBATCH -t 10:00
#SBATCH --mail-user=first.last@uconn.edu
#SBATCH --mail-type=ALL
module purge
module load gcc/9.2.0 libffi/3.2.1 bzip2/1.0.6 tcl/8.6.6.8606 sqlite/3.30.1 lzma/4.32.7 cudnn/7.1.2 cuda/10.0 java/1.8.0_162 swig/3.0.7 python/3.7.6-tf2.0-rhel7.7
python3 your_python_code.py
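Assuming the script above is saved as submit_tf.sh (a placeholder name), it can be submitted and monitored with the standard SLURM commands:
sbatch submit_tf.sh      # submit the job; SLURM prints the assigned job ID
squeue -u $USER          # check whether the job is pending (PD) or running (R)
scancel <jobid>          # cancel the job by ID if needed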
singularity
TensorFlow requires a newer version of glibc than the one available on our HPC cluster, so we use Singularity to work around this. HPC currently maintains a Singularity container with the common deep-learning packages, such as TensorFlow and PyTorch, already installed inside.
To use Singularity, first load the module file singularity/3.1.
Then decide whether your deep learning program should run in CPU mode or GPU mode.
For running in CPU mode:
singularity exec /apps2/singularity/general/general_python3 python3 my_tensorflow_program.py
You can submit this job to any of our compute nodes.
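For example, a minimal CPU-mode sbatch script might look like the following; the job name and program file are placeholders, and the general partition is used only as an example:
#!/bin/bash
#SBATCH -p general
#SBATCH --job-name=TensorFlowCPU
#SBATCH --output=TensorFlowCPU-%j.out
#SBATCH -N 1
module purge
module load singularity/3.1
singularity exec /apps2/singularity/general/general_python3 python3 my_tensorflow_program.py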
For running in GPU mode:
singularity exec --nv -B /apps2/cuda/10.0/lib64:/home/extraLibs/cuda,/apps2/cudnn/7.4.2/cuda/lib64:/home/extraLibs/cudnn /apps2/singularity/general/general_python3 python3 my_tensorflow_program.py
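Before starting a long run, it can be useful to confirm that the GPU is actually visible from inside the container. A quick check (run from within a GPU job, not on a login node) is to call nvidia-smi through the container; this assumes, as is typical for Singularity, that the --nv flag binds the NVIDIA driver utilities into the container:
singularity exec --nv /apps2/singularity/general/general_python3 nvidia-smi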
If the model is to run in GPU mode:
(1) Make sure to use the gpu partition.
For example:
#SBATCH -p gpu
(2) Make sure to request at least one GPU card:
For example:
#SBATCH --gres=gpu:1 #Request a single GPU card. Max value is 2 for K40m and 3 for V100 GPU nodes
(3) Make sure to load the module files below:
module load singularity/3.1 cuda/10.0
A sample sbatch script:
#!/bin/bash
#SBATCH -p gpu
#SBATCH --gres=gpu:1
#SBATCH --exclude=gpu[01-02]
#SBATCH --job-name=NeuralNetworksGPU
#SBATCH --output=NeuralNetworksGPU-%j.out
#SBATCH -N 1
module purge
module load singularity/3.1
singularity exec --nv -B /apps2/cuda/10.0/lib64:/home/extraLibs/cuda,/apps2/cudnn/7.4.2/cuda/lib64:/home/extraLibs/cudnn /apps2/singularity/general/general_python3 python3 yield_build_nn_gpu.py
In the above, if you want to use only K40m nodes, add this line to your batch script
#SBATCH --exclude=gpu[03-06]
To use only V100 GPUs, add this line instead
#SBATCH --exclude=gpu[01-02]
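To confirm which GPU type each node carries (and therefore which node range to exclude), the node-to-GPU mapping can be queried from SLURM. The format string below is standard sinfo syntax; exactly how the GPU models are labeled in the gres field on this cluster is an assumption:
sinfo -p gpu -o "%n %G"    # print each GPU node's hostname and its gres (GPU) configuration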
Singularity can also be used for non-TensorFlow tasks. A sample sbatch script follows:
#!/bin/bash
#SBATCH --partition=HaswellPriority
#SBATCH --qos=xxxxx
#SBATCH --job-name=RandomForest
#SBATCH --output=RandomForest-%j.out
#SBATCH -N 1
#SBATCH -c 24
####SBATCH -p gpu
####SBATCH --gres=gpu:1
####SBATCH -N 1
module purge
module load singularity/3.1 cuda/10.0
singularity exec /apps2/singularity/general/general_python3 python3 yield_build_rf.py
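To check whether a particular package is available inside the container before submitting a job, it can be imported directly through singularity exec. scikit-learn is used here purely as an example; the exact set of packages installed in the container may differ:
singularity exec /apps2/singularity/general/general_python3 python3 -c "import sklearn; print(sklearn.__version__)"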
Currently we have access to HaswellPriority, SkylakePriority, general and gpu partitions.
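The partitions defined on the cluster, together with their time limits and node counts, can be listed with sinfo; note that submitting to a priority partition also requires the matching --qos value, as in the sample script above:
sinfo -s    # one summary line per partition: availability, time limit, node counts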
fisbatch
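fisbatch starts an interactive job on a compute node. A sketch of requesting an interactive GPU session is shown below; the exact invocation is an assumption based on fisbatch accepting the usual sbatch-style resource options, so adjust the partition, GPU count, and time limit as needed:
$ fisbatch -p gpu --gres=gpu:1 -t 1:00:00
Once the interactive session starts on a compute node, load the same modules used in the batch script: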
$ module load gcc/9.2.0 libffi/3.2.1 bzip2/1.0.6 tcl/8.6.6.8606 sqlite/3.30.1 lzma/4.32.7 cudnn/7.1.2 cuda/10.0 java/1.8.0_162 swig/3.0.7 python/3.7.6-tf2.0-rhel7.7
Then we can run python3 interactively within the session. Doing it this way prevents the job from running on a login node.