warp_ctc error in compute_ctc_loss #133

Open
LearnedVector opened this issue Feb 7, 2019 · 4 comments

@LearnedVector

Hey all, I am doing distributed training with TensorFlow 1.12 and Horovod 0.15.2 on 4 machines with 16 V100 GPUs, CUDA 9.0, and cuDNN 7.14. Training runs fine, but at a specific iteration it hits the error shown below.

Has anyone seen this specific error? It happens at the same iteration, which makes me suspect it has something to do with the data. But to figure out what's wrong with the data, I need to decipher what this error message means inside warp_ctc. Any insight would be much appreciated!

Traceback (most recent call last):
  File "/home/ubuntu/deep-speech/tf_train.py", line 494, in <module>
    tf.app.run()
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/home/ubuntu/deep-speech/tf_train.py", line 491, in main
    run_training()
  File "/home/ubuntu/deep-speech/tf_train.py", line 405, in run_training
    is_training: True
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: warp_ctc error in compute_ctc_loss: unknown error
         [[node WarpCTC (defined at <string>:58)  = WarpCTC[blank_label=28, _device="/job:localhost/replica:0/task:0/device:GPU:0"](transpose, boolean_mask/GatherV2/_1519, Squeeze_1, Squeeze)]]
         [[{{node gradients/AddN_80/_1853}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_20646_gradients/AddN_80", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op u'WarpCTC', defined at:
  File "/home/ubuntu/deep-speech/tf_train.py", line 494, in <module>
    tf.app.run()
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/home/ubuntu/deep-speech/tf_train.py", line 491, in main
    run_training()
  File "/home/ubuntu/deep-speech/tf_train.py", line 363, in run_training
    compile_train_op(train_inputs, train_targets, train_seq_len, train_label_lengths, is_training)
  File "/home/ubuntu/deep-speech/tf_train.py", line 299, in compile_train_op
    loss = tf.reduce_mean(warpctc_tensorflow.ctc(tf.cast(logits, tf.float32), targets, label_lengths, seq_len, blank_label=28))
  File "/home/ubuntu/mike.venv/lib/python2.7/site-packages/warpctc_tensorflow-0.1-py2.7-linux-x86_64.egg/warpctc_tensorflow/__init__.py", line 43, in ctc
    input_lengths, blank_label)
  File "<string>", line 58, in warp_ctc
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InternalError (see above for traceback): warp_ctc error in compute_ctc_loss: unknown error
         [[node WarpCTC (defined at <string>:58)  = WarpCTC[blank_label=28, _device="/job:localhost/replica:0/task:0/device:GPU:0"](transpose, boolean_mask/GatherV2/_1519, Squeeze_1, Squeeze)]]
         [[{{node gradients/AddN_80/_1853}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_20646_gradients/AddN_80", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
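
Since the failure recurs at the same iteration, one way to test the data hypothesis is to validate the batch fed at that step against the general constraints CTC places on its inputs. The sketch below is plain Python and makes no claim about what warp_ctc actually checks internally; the function name `check_ctc_batch` and its arguments are hypothetical, modeled on the flat labels, label lengths, and input lengths passed to `warpctc_tensorflow.ctc` in the traceback above.

```python
def check_ctc_batch(flat_labels, label_lengths, input_lengths,
                    num_classes, blank_label):
    """Return a list of human-readable problems found in one CTC batch.

    flat_labels   : all label ids of the batch concatenated into one list
    label_lengths : number of labels per sample
    input_lengths : number of time steps (frames) per sample
    num_classes   : size of the last logits dimension (assumed known)
    blank_label   : index reserved for the CTC blank (28 in the issue)
    """
    problems = []
    if len(label_lengths) != len(input_lengths):
        problems.append("label_lengths and input_lengths differ in size")
        return problems
    if sum(label_lengths) != len(flat_labels):
        problems.append("sum(label_lengths)=%d but len(flat_labels)=%d"
                        % (sum(label_lengths), len(flat_labels)))
        return problems

    offset = 0
    for i, (n_lab, n_in) in enumerate(zip(label_lengths, input_lengths)):
        labels = list(flat_labels[offset:offset + n_lab])
        offset += n_lab

        if n_in <= 0:
            problems.append("sample %d: non-positive input length %d" % (i, n_in))
        out_of_range = [l for l in labels if l < 0 or l >= num_classes]
        if out_of_range:
            problems.append("sample %d: label ids out of range: %r"
                            % (i, out_of_range))
        if blank_label in labels:
            problems.append("sample %d: target contains the blank label" % i)
        # CTC needs one frame per label, plus one blank frame between
        # each pair of adjacent identical labels.
        repeats = sum(1 for a, b in zip(labels, labels[1:]) if a == b)
        if n_lab + repeats > n_in:
            problems.append("sample %d: needs %d frames but has only %d"
                            % (i, n_lab + repeats, n_in))
    return problems
```

Running this on the batch fetched just before the failing `session.run` call (for example by evaluating the label and length tensors alone) would show whether any sample violates these constraints. An empty result does not rule out other causes, such as a GPU-side failure unrelated to the data.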

@yetiancn

I have the same problem. Did you find a solution?

@LearnedVector
Author

@yetiancn Unfortunately no, I did not find a solution :/ Instead I just switched over to the TensorFlow CTC implementation.
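
For anyone making the same switch: in TensorFlow 1.x, `tf.nn.ctc_loss` takes its labels as a `SparseTensor` rather than the flat label vector plus label lengths that `warpctc_tensorflow.ctc` uses, and it reserves the last class index (`num_classes - 1`) as the blank instead of accepting a `blank_label` argument. A minimal sketch of the label conversion, in plain Python, is shown here; the helper name `flat_to_sparse` is hypothetical and this is not the code used in the training script above.

```python
def flat_to_sparse(flat_labels, label_lengths):
    """Convert warp-ctc style flat labels into SparseTensor components.

    Returns (indices, values, dense_shape), where each index is a
    (batch, position) pair, suitable for tf.SparseTensor(...).
    """
    indices, values = [], []
    offset = 0
    for b, n in enumerate(label_lengths):
        for t in range(n):
            indices.append((b, t))
            values.append(flat_labels[offset + t])
        offset += n
    max_len = max(label_lengths) if label_lengths else 0
    dense_shape = (len(label_lengths), max_len)
    return indices, values, dense_shape


# Assumed usage with TensorFlow 1.x (not executed here):
#   indices, values, shape = flat_to_sparse(flat_labels, label_lengths)
#   labels = tf.SparseTensor(indices, values, shape)
#   loss = tf.reduce_mean(tf.nn.ctc_loss(labels, logits, seq_len))
# logits are expected time-major, [max_time, batch, num_classes], by default.
```

With `blank_label=28` as in the original call, the two conventions only line up if the logits have exactly 29 classes; that class count is an assumption here, since the issue does not state it.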

@yetiancn

I've decided to try the TensorFlow CTC too. Thank you!

@MichaelGou1105

How can this be solved?
