Template job submissions using GPUs in CHTC

GPU Job Templates

Templates for submitting jobs that use GPUs on CHTC's high-throughput computing (HTC) system.

See the general CHTC GPU guide: Jobs that use GPUs
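
As a rough illustration of the kind of payload script these templates run, here is a minimal, hypothetical GPU sanity check (assuming a container or environment with PyTorch installed; this is a sketch, not one of the repository's actual templates):

```python
# Hypothetical minimal GPU check; assumes PyTorch is available in the
# job's container or environment.
import torch

if not torch.cuda.is_available():
    raise SystemExit("No GPU visible to this job")

device = torch.device("cuda")
print("Using", torch.cuda.get_device_name(device))

# Small matrix multiply to exercise the assigned GPU end to end.
x = torch.randn(1024, 1024, device=device)
print("Checksum:", (x @ x).sum().item())
```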

Licenses

These examples are available under the MIT license. See the list of individual contributors who created the templates. Third-party examples are attributed in the source files.

Issues

Investigate timing in multi-GPU example

opened on 2022-10-07 03:46:44 by agitter

PR #24 adds a multi-GPU PyTorch example that demonstrates Distributed Data Parallel (DDP) training. However, in this example training with multiple GPUs is no faster than training with a single GPU. See https://github.com/CHTC/templates-GPUs/pull/24#issuecomment-1249509118

It would be worthwhile to monitor the training more closely, for instance by tracking GPU utilization, to understand why this is the case.
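
As a starting point, here is a hedged sketch of one way to sample per-GPU utilization from inside the training script, assuming the nvidia-ml-py (pynvml) bindings are installed; the function name and sampling points are illustrative choices:

```python
# Sketch: sample GPU utilization during training to diagnose the
# multi-GPU slowdown. Assumes pynvml (nvidia-ml-py) is installed.
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

def log_utilization():
    # nvmlDeviceGetUtilizationRates reports percent GPU and memory
    # activity over the driver's last sampling window.
    for i, h in enumerate(handles):
        util = pynvml.nvmlDeviceGetUtilizationRates(h)
        print(f"GPU {i}: {util.gpu}% compute, {util.memory}% memory")

# Call log_utilization() periodically from the training loop, e.g. once
# per epoch, to see whether all ranks are actually kept busy.
```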

TensorFlow Docker example on Ampere GPUs

opened on 2021-03-30 16:40:03 by agitter

Our docker/tensorflow_python/ example fails on the A100 servers in CHTC with the error:

```
2021-03-24 19:51:27.096512: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(8192, 8192), b.shape=(8192, 8192), m=8192, n=8192, k=8192
	 [[{{node MatMul}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test_tensorflow.py", line 41, in <module>
    sess.run(productg.op)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(8192, 8192), b.shape=(8192, 8192), m=8192, n=8192, k=8192
	 [[node MatMul (defined at test_tensorflow.py:22) ]]

Errors may have originated from an input operation.
Input Source operations connected to node MatMul:
 Variable/read (defined at test_tensorflow.py:20)
 Variable_1/read (defined at test_tensorflow.py:21)
```

I resolved the error by switching to the latest TensorFlow Docker image (2.4.1-gpu) and adding the two lines of TensorFlow migration code:

```python
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
```

We need to consider how to update this example. Should we have separate TensorFlow 1.x and 2.x examples? Do we need to constrain which servers these examples can run on?
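
For the 2.x path, here is a minimal sketch of what a native TensorFlow 2.x version of the GEMM check could look like (eager mode, no tf.compat.v1 shim; an illustration, not the repository's current example):

```python
# Sketch of a native TensorFlow 2.x GEMM smoke test; matrix sizes
# mirror the failing MatMul in the traceback above.
import tensorflow as tf

a = tf.random.normal((8192, 8192))
b = tf.random.normal((8192, 8192))
c = tf.matmul(a, b)
print("MatMul ran on", c.device, "- checksum:", float(tf.reduce_sum(c)))
```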

Question about device option

opened on 2020-03-11 19:53:13 by ChristinaLK

https://github.com/CHTC/templates-GPUs/blob/21c9139a9c24013e84cc62f10b3deb20eec8c740/docker/tensorflow_python/test_tensorflow.py#L19

Does this line make the script ALWAYS use the "first" GPU on a server? What if HTCondor has assigned you a different one (e.g., GPU device 3 instead of GPU device 0)?

@sameerd
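
One way to investigate, as a hedged sketch: HTCondor typically exports CUDA_VISIBLE_DEVICES (and _CONDOR_AssignedGPUs) into the job environment, so device ordinal 0 inside the job should already map to the assigned physical GPU. The snippet below prints what the job actually sees (it uses TensorFlow 2.x APIs, unlike the linked 1.x script):

```python
# Sketch: inspect which physical GPU(s) this job slot was assigned.
# Assumes HTCondor's usual behavior of exporting these variables;
# verify against the CHTC GPU guide.
import os

print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("_CONDOR_AssignedGPUs:", os.environ.get("_CONDOR_AssignedGPUs"))

import tensorflow as tf
# Devices TensorFlow can see inside the slot; '/gpu:0' refers to the
# first of these, not necessarily physical device 0 on the server.
print(tf.config.list_physical_devices("GPU"))
```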

NVIDIA GPU Cloud containers

opened on 2019-09-20 15:04:39 by agitter

We may want to explore the NVIDIA GPU Cloud to see whether there are containers we could use in our examples here. The catalog shows many deep learning frameworks, and we already have TensorFlow and PyTorch examples. However, I also noticed:

- RELION for cryo-EM
- GROMACS for molecular dynamics
- GAMESS and BigDFT for quantum chemistry (density functional theory)