Templates for submitting GPU jobs to CHTC's high throughput computing (HTC) system.
See the general CHTC GPU guide: Jobs that use GPUs
These examples are available under the MIT license. See the list of individual contributors who created the templates. Third-party examples are attributed in the source files.
It would be worthwhile to monitor the training more closely, for instance the GPU utilization, to understand why this is the case.
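One way to watch GPU utilization during a run might be to poll `nvidia-smi` from a small side script. The sketch below is only an illustration and is not part of the existing templates: the helper names `parse_gpu_stats` and `sample_gpu_stats` are hypothetical, and it assumes `nvidia-smi` is on the PATH of the execute node and reports numeric values for every queried field.

```python
import subprocess

# Fields requested from nvidia-smi; all three are standard --query-gpu properties.
QUERY_FIELDS = "index,utilization.gpu,memory.used"


def parse_gpu_stats(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into a list of dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, mem = [field.strip() for field in line.split(",")]
        stats.append({
            "index": int(index),
            "utilization_pct": int(util),   # percent of time the GPU was busy
            "memory_used_mib": int(mem),    # memory in use, in MiB
        })
    return stats


def sample_gpu_stats():
    """Run nvidia-smi once and return parsed stats (needs an NVIDIA driver)."""
    output = subprocess.check_output(
        ["nvidia-smi",
         f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_gpu_stats(output)
```

Calling `sample_gpu_stats()` every few seconds in a background thread or a separate process and logging the result alongside the training log would show whether the GPU sits idle for long stretches, for example while waiting on data loading.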
Our docker/tensorflow_python/ example fails on the A100 servers in CHTC with the error:
2021-03-24 19:51:27.096512: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(8192, 8192), b.shape=(8192, 8192), m=8192, n=8192, k=8192
  [[{{node MatMul}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test_tensorflow.py", line 41, in <module>
    sess.run(productg.op)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(8192, 8192), b.shape=(8192, 8192), m=8192, n=8192, k=8192
  [[node MatMul (defined at test_tensorflow.py:22) ]]
Errors may have originated from an input operation.
Input Source operations connected to node MatMul:
  Variable/read (defined at test_tensorflow.py:20)
  Variable_1/read (defined at test_tensorflow.py:21)
I resolved the error by switching to the latest TensorFlow Docker image (2.4.1-gpu) and adding these two lines of TensorFlow migration code:
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
We need to consider how to update this example. Should we have a TensorFlow 1.x example and a separate 2.x example? Do we need to constrain the servers these examples are all compatible with?
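One option for the question above is a single script that runs under both TensorFlow 1.x and 2.x by applying the migration shim only when it is needed. This is a minimal sketch rather than a decision for the repository. The helper names `tf_major_version` and `load_tf_v1_api` are hypothetical, and the approach assumes the example only needs the 1.x-style graph and session API.

```python
def tf_major_version(version):
    """Return the major version number from a string such as '2.4.1'."""
    return int(version.split(".")[0])


def load_tf_v1_api():
    """Import TensorFlow and return a module that exposes the 1.x-style API.

    TensorFlow is imported lazily so the version helper above can be used
    (and tested) on machines where TensorFlow is not installed.
    """
    import tensorflow

    if tf_major_version(tensorflow.__version__) >= 2:
        # Same two migration lines used in the fix described above.
        import tensorflow.compat.v1 as tf
        tf.disable_v2_behavior()
    else:
        # Under 1.x the top-level module already has the 1.x behavior.
        tf = tensorflow
    return tf
```

A script would then start with `tf = load_tf_v1_api()` in place of `import tensorflow as tf`. The trade-off is that the example keeps teaching the 1.x style, so a separate native 2.x example may still be worth having.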
https://github.com/CHTC/templates-GPUs/blob/21c9139a9c24013e84cc62f10b3deb20eec8c740/docker/tensorflow_python/test_tensorflow.py#L19
Does this line make this script ALWAYS use the "first" GPU on a server? What if HTCondor has assigned you a different one (e.g. GPU device 3 instead of GPU device 0)?
@sameerd
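A quick way to investigate this could be to print which GPUs the job can actually see. CUDA numbers the devices listed in `CUDA_VISIBLE_DEVICES` starting from 0 inside the process, so if HTCondor sets that variable for the job (an assumption worth verifying on CHTC), index 0 in the script would refer to the assigned GPU and not to physical device 0. The helper name `visible_gpus` below is hypothetical.

```python
import os


def visible_gpus(environ=None):
    """Return the entries of CUDA_VISIBLE_DEVICES, or None if it is unset.

    Each entry may be a physical index (e.g. '3') or a GPU UUID string.
    Its position in the list is the index CUDA uses inside this process.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None  # unset: CUDA exposes every GPU on the server
    return [entry.strip() for entry in raw.split(",") if entry.strip()]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("CUDA_VISIBLE_DEVICES is not set; all GPUs are visible")
    else:
        for in_process_index, device in enumerate(gpus):
            print(f"in-process device {in_process_index} -> {device}")
```

Running this at the top of the job, before TensorFlow is imported, and comparing the output with the GPU reported in the HTCondor job log would confirm or rule out the concern.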
We may want to explore the NVIDIA GPU Cloud to see whether there are containers we could use in our examples here. The catalog lists many deep learning frameworks, and we already have TensorFlow and PyTorch examples. However, I also noticed:
- RELION for cryo-EM
- GROMACS for molecular dynamics
- GAMESS and BigDFT for density functional theory (DFT)