Hey all, I am doing distributed training with TensorFlow 1.12 and Horovod 0.15.2 on 4 machines with 16 V100 GPUs, on CUDA 9.0 and cuDNN 7.14. Training runs fine, but at a specific iteration it hits the weird error shown below.
Has anyone seen this specific error? The fact that it happens at the same iteration every time makes me suspect it's something to do with the data, but to figure out what's wrong with the data I need to decode what this error message means internally inside warp_ctc. Any insight would be much appreciated!
Traceback (most recent call last):
  File "/home/ubuntu/deep-speech/tf_train.py", line 494, in <module>
    tf.app.run()
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/home/ubuntu/deep-speech/tf_train.py", line 491, in main
    run_training()
  File "/home/ubuntu/deep-speech/tf_train.py", line 405, in run_training
    is_training: True
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: warp_ctc error in compute_ctc_loss: unknown error
	 [[node WarpCTC (defined at <string>:58) = WarpCTC[blank_label=28, _device="/job:localhost/replica:0/task:0/device:GPU:0"](transpose, boolean_mask/GatherV2/_1519, Squeeze_1, Squeeze)]]
	 [[{{node gradients/AddN_80/_1853}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_20646_gradients/AddN_80", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op u'WarpCTC', defined at:
  File "/home/ubuntu/deep-speech/tf_train.py", line 494, in <module>
    tf.app.run()
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/home/ubuntu/deep-speech/tf_train.py", line 491, in main
    run_training()
  File "/home/ubuntu/deep-speech/tf_train.py", line 363, in run_training
    compile_train_op(train_inputs, train_targets, train_seq_len, train_label_lengths, is_training)
  File "/home/ubuntu/deep-speech/tf_train.py", line 299, in compile_train_op
    loss = tf.reduce_mean(warpctc_tensorflow.ctc(tf.cast(logits, tf.float32), targets, label_lengths, seq_len, blank_label=28))
  File "/home/ubuntu/mike.venv/lib/python2.7/site-packages/warpctc_tensorflow-0.1-py2.7-linux-x86_64.egg/warpctc_tensorflow/__init__.py", line 43, in ctc
    input_lengths, blank_label)
  File "<string>", line 58, in warp_ctc
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InternalError (see above for traceback): warp_ctc error in compute_ctc_loss: unknown error
	 [[node WarpCTC (defined at <string>:58) = WarpCTC[blank_label=28, _device="/job:localhost/replica:0/task:0/device:GPU:0"](transpose, boolean_mask/GatherV2/_1519, Squeeze_1, Squeeze)]]
	 [[{{node gradients/AddN_80/_1853}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_20646_gradients/AddN_80", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
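Since the failure always lands on the same iteration, one way to narrow down whether the data is at fault is to sanity-check the exact batch being fed to the `WarpCTC` op at that step. CTC can fail when a label sequence is longer than its input (counting the extra frame needed between repeated labels), when a label value falls outside the alphabet, or when a label is empty. The sketch below is a hypothetical helper, not part of warp-ctc itself; the argument names (`targets`, `label_lengths`, `seq_lens`) are assumptions that should be adapted to the actual arrays in the `feed_dict`, and `blank_label=28` matches the value in the traceback above.

```python
import numpy as np

def check_ctc_batch(targets, label_lengths, seq_lens, blank_label=28):
    """Sanity-check one CTC batch before it reaches compute_ctc_loss.

    targets       -- flat 1-D array of all labels in the batch, concatenated
    label_lengths -- per-utterance label counts
    seq_lens      -- per-utterance input (logit) time steps
    Returns a list of (utterance_index, description) problems.
    """
    problems = []
    offset = 0
    for i, (llen, slen) in enumerate(zip(label_lengths, seq_lens)):
        labels = targets[offset:offset + llen]
        offset += llen
        if llen == 0:
            problems.append((i, "empty label sequence"))
            continue
        # CTC needs the input to be at least as long as the label,
        # plus one extra frame for every pair of repeated labels.
        repeats = int(np.sum(labels[1:] == labels[:-1]))
        if slen < llen + repeats:
            problems.append(
                (i, "seq_len %d < label_len %d + %d repeats" % (slen, llen, repeats)))
        # Labels must lie in [0, blank_label); the blank index itself
        # must never appear in the targets.
        if labels.min() < 0 or labels.max() >= blank_label:
            problems.append((i, "label value outside [0, %d)" % blank_label))
    return problems
```

Calling this on the arrays right before the failing `session.run` (or dumping them to disk at that iteration and checking offline) would confirm or rule out a malformed utterance without having to interpret warp-ctc's "unknown error" internally.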