Hey all, I am doing distributed training with TensorFlow 1.12 and Horovod 0.15.2 on 4 machines with 16 V100 GPUs, on CUDA 9.0 and cuDNN 7.14. Training runs fine, but at a specific iteration it hits the weird error shown below.
Has anyone seen this specific error? The fact that it happens at the same iteration every time makes me suspect it's something to do with the data, but to figure out what's wrong with the data I need to decode what this error message means internally inside warp_ctc. Any insight would be much appreciated!
Traceback (most recent call last):
  File "/home/ubuntu/deep-speech/tf_train.py", line 494, in <module>
    tf.app.run()
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/home/ubuntu/deep-speech/tf_train.py", line 491, in main
    run_training()
  File "/home/ubuntu/deep-speech/tf_train.py", line 405, in run_training
    is_training: True
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: warp_ctc error in compute_ctc_loss: unknown error
	 [[node WarpCTC (defined at <string>:58) = WarpCTC[blank_label=28, _device="/job:localhost/replica:0/task:0/device:GPU:0"](transpose, boolean_mask/GatherV2/_1519, Squeeze_1, Squeeze)]]
	 [[{{node gradients/AddN_80/_1853}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_20646_gradients/AddN_80", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op u'WarpCTC', defined at:
  File "/home/ubuntu/deep-speech/tf_train.py", line 494, in <module>
    tf.app.run()
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/home/ubuntu/deep-speech/tf_train.py", line 491, in main
    run_training()
  File "/home/ubuntu/deep-speech/tf_train.py", line 363, in run_training
    compile_train_op(train_inputs, train_targets, train_seq_len, train_label_lengths, is_training)
  File "/home/ubuntu/deep-speech/tf_train.py", line 299, in compile_train_op
    loss = tf.reduce_mean(warpctc_tensorflow.ctc(tf.cast(logits, tf.float32), targets, label_lengths, seq_len, blank_label=28))
  File "/home/ubuntu/mike.venv/lib/python2.7/site-packages/warpctc_tensorflow-0.1-py2.7-linux-x86_64.egg/warpctc_tensorflow/__init__.py", line 43, in ctc
    input_lengths, blank_label)
  File "<string>", line 58, in warp_ctc
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InternalError (see above for traceback): warp_ctc error in compute_ctc_loss: unknown error
	 [[node WarpCTC (defined at <string>:58) = WarpCTC[blank_label=28, _device="/job:localhost/replica:0/task:0/device:GPU:0"](transpose, boolean_mask/GatherV2/_1519, Squeeze_1, Squeeze)]]
	 [[{{node gradients/AddN_80/_1853}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_20646_gradients/AddN_80", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
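Since the failure always lands on the same iteration, one way to narrow down whether the data is at fault is to sanity-check the exact batch being fed to the `WarpCTC` op at that step. CTC can fail when a label sequence is longer than its input (counting the extra frame needed between repeated labels), when a label value falls outside the alphabet, or when a label is empty. The sketch below is a hypothetical helper, not part of warp-ctc itself; the argument names (`targets`, `label_lengths`, `seq_lens`) are assumptions that should be adapted to the actual arrays in the `feed_dict`, and `blank_label=28` matches the value in the traceback above.

```python
import numpy as np

def check_ctc_batch(targets, label_lengths, seq_lens, blank_label=28):
    """Sanity-check one CTC batch before it reaches compute_ctc_loss.

    targets       -- flat 1-D array of all labels in the batch, concatenated
    label_lengths -- per-utterance label counts
    seq_lens      -- per-utterance input (logit) time steps
    Returns a list of (utterance_index, description) problems.
    """
    problems = []
    offset = 0
    for i, (llen, slen) in enumerate(zip(label_lengths, seq_lens)):
        labels = targets[offset:offset + llen]
        offset += llen
        if llen == 0:
            problems.append((i, "empty label sequence"))
            continue
        # CTC needs the input to be at least as long as the label,
        # plus one extra frame for every pair of repeated labels.
        repeats = int(np.sum(labels[1:] == labels[:-1]))
        if slen < llen + repeats:
            problems.append(
                (i, "seq_len %d < label_len %d + %d repeats" % (slen, llen, repeats)))
        # Labels must lie in [0, blank_label); the blank index itself
        # must never appear in the targets.
        if labels.min() < 0 or labels.max() >= blank_label:
            problems.append((i, "label value outside [0, %d)" % blank_label))
    return problems
```

Calling this on the arrays right before the failing `session.run` (or dumping them to disk at that iteration and checking offline) would confirm or rule out a malformed utterance without having to interpret warp-ctc's "unknown error" internally.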