Replies: 9 comments
>>> rajpuneet.sandhu
[December 10, 2020, 10:24pm]
I ran the following command and it seems to be some cuDNN issue, which is
strange since I used the provided Dockerfile.train as is. Am I missing
something here?
python3 DeepSpeech.py \
/datasets/deepspeech_wakeword_dataset/wakeword-train-other-accents.csv, \
/datasets/deepspeech_wakeword_dataset/wakeword-train.csv, \
/datasets/india_portal_2may2019-train.csv, \
/datasets/india_portal_2to9may2019-train.csv, \
/datasets/india_portal_9to19may2019-train.csv, \
/datasets/india_portal_19to24may2019-train.csv, \
/datasets/brazil_portal_20to26june2019-wakeword-train.csv, \
/datasets/brazil_portal_26juneto3july2019-wakeword-train.csv, \
/datasets/japan_portal_3july2019-wakeword-train.csv, \
/datasets/mixed_portal_backups_14_16_17_18_19_visteon_wakeword_dataset-train.csv, \
/datasets/alexa-train.csv, \
/datasets/alexa-polly-train.csv, \
/datasets/alexa-sns.csv, \
/datasets/india_portal_ww_data_04282020/custom_train.csv, \
/datasets/india_portal_ww_data_05042020/custom_train.csv, \
/datasets/india_portal_ww_data_05222020/custom_train.csv, \
/datasets/india_portal_ww_data_augmented_04282020/custom_train.csv, \
/datasets/india_portal_ww_data_augmented_04282020/custom_test.csv, \
/datasets/india_portal_ww_data_augmented_05042020/custom_train.csv, \
/datasets/india_portal_ww_data_augmented_05042020/custom_test.csv, \
/datasets/ww_gtts_data_google_siri/custom_train.csv, \
/datasets/ww_gtts_data_google_siri/custom_dev.csv, \
/datasets/ww_polly_data_google_siri/custom_train.csv, \
/datasets/ww_polly_data_google_siri/custom_test.csv \
/datasets/india_portal_2may2019-dev.csv, \
/datasets/india_portal_2to9may2019-dev.csv, \
/datasets/india_portal_9to19may2019-dev.csv, \
/datasets/india_portal_19to24may2019-dev.csv, \
/datasets/brazil_portal_20to26june2019-wakeword-dev.csv, \
/datasets/brazil_portal_26juneto3july2019-wakeword-dev.csv, \
/datasets/mixed_portal_backups_14_16_17_18_19_visteon_wakeword_dataset-dev.csv, \
/datasets/alexa-dev.csv, \
/datasets/india_portal_ww_data_augmented_04282020/custom_dev.csv, \
/datasets/india_portal_ww_data_augmented_05042020/custom_dev.csv, \
/datasets/india_portal_ww_data_05222020/custom_dev.csv, \
/datasets/ww_gtts_data_google_siri/custom_dev.csv, \
/datasets/ww_polly_data_google_siri/custom_dev.csv, \
/datasets/india_portal_ww_data_augmented_04282020/custom_dev.csv, \
/datasets/india_portal_ww_data_augmented_05042020/custom_dev.csv \
/datasets/alexa-train.csv, \
/datasets/alexa-polly-train.csv, \
/datasets/alexa-sns.csv, \
/datasets/alexa-dev.csv, \
/datasets/india_portal_ww_data_04282020/custom_train.csv, \
/datasets/india_portal_ww_data_05042020/custom_train.csv, \
/datasets/india_portal_ww_data_04282020/custom_dev.csv, \
/datasets/india_portal_ww_data_05042020/custom_dev.csv, \
/datasets/india_portal_ww_data_04282020/custom_test.csv, \
/datasets/india_portal_ww_data_05042020/custom_test.csv, \
/datasets/india_portal_ww_data_augmented_04282020/custom_train.csv, \
/datasets/india_portal_ww_data_augmented_04282020/custom_dev.csv, \
/datasets/india_portal_ww_data_augmented_04282020/custom_test.csv, \
/datasets/india_portal_ww_data_augmented_05042020/custom_train.csv, \
/datasets/india_portal_ww_data_augmented_05042020/custom_dev.csv, \
/datasets/india_portal_ww_data_augmented_05042020/custom_test.csv \
checkpoints
I Could not find best validating checkpoint.
I Could not find most recent checkpoint.
I Initializing all variables.
I STARTING Optimization
I Training epoch 0...
Traceback (most recent call last):
File '/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py', line 1365, in _do_call
return fn(*args)
File '/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py', line 1350, in _run_fn
target_list, run_metadata)
File '/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py', line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node tower_0/conv1d}}]]
[[concat/concat/_99]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node tower_0/conv1d}}]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File 'DeepSpeech.py', line 12, in <module>
ds_train.run_script()
File '/DeepSpeech/training/deepspeech_training/train.py', line 976, in run_script
absl.app.run(main)
File '/usr/local/lib/python3.6/dist-packages/absl/app.py', line 300, in run
_run_main(main, args)
File '/usr/local/lib/python3.6/dist-packages/absl/app.py', line 251, in _run_main
sys.exit(main(argv))
File '/DeepSpeech/training/deepspeech_training/train.py', line 948, in main
train()
File '/DeepSpeech/training/deepspeech_training/train.py', line 605, in train
train_loss, _ = run_set('train', epoch, train_init_op)
File '/DeepSpeech/training/deepspeech_training/train.py', line 570, in run_set
feed_dict=feed_dict)
File '/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py', line 956, in run
run_metadata_ptr)
File '/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py', line 1180, in _run
feed_dict_tensor, options, run_metadata)
File '/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py', line 1359, in _do_run
run_metadata)
File '/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py', line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node tower_0/conv1d (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[concat/concat/_99]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node tower_0/conv1d (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.
Original stack trace for 'tower_0/conv1d':
File 'DeepSpeech.py', line 12, in <module>
ds_train.run_script()
File '/DeepSpeech/training/deepspeech_training/train.py', line 976, in run_script
absl.app.run(main)
File '/usr/local/lib/python3.6/dist-packages/absl/app.py', line 300, in run
_run_main(main, args)
File '/usr/local/lib/python3.6/dist-packages/absl/app.py', line 251, in _run_main
sys.exit(main(argv))
File '/DeepSpeech/training/deepspeech_training/train.py', line 948, in main
train()
File '/DeepSpeech/training/deepspeech_training/train.py', line 483, in train
gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
File '/DeepSpeech/training/deepspeech_training/train.py', line 316, in get_tower_results
avg_loss, non_finite_files = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
File '/DeepSpeech/training/deepspeech_training/train.py', line 243, in calculate_mean_edit_distance_and_loss
logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl)
File '/DeepSpeech/training/deepspeech_training/train.py', line 171, in create_model
batch_x = create_overlapping_windows(batch_x)
File '/DeepSpeech/training/deepspeech_training/train.py', line 69, in create_overlapping_windows
batch_x = tf.nn.conv1d(input=batch_x, filters=eye_filter, stride=1, padding='SAME')
File '/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py', line 574, in new_func
return func(*args, **kwargs)
File '/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py', line 574, in new_func
return func(*args, **kwargs)
File '/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py', line 1681, in conv1d
name=name)
File '/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_nn_ops.py', line 1071, in conv2d
data_format=data_format, dilations=dilations, name=name)
File '/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py', line 794, in _apply_op_helper
op_def=op_def)
File '/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py', line 507, in new_func
return func(*args, **kwargs)
File '/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py', line 3357, in create_op
attrs, op_def, compute_device)
File '/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py', line 3426, in _create_op_internal
op_def=op_def)
File '/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py', line 1748, in __init__
self._traceback = tf_stack.extract_stack()
[This is an archived discussion thread from discourse.mozilla.org/t/error-on-starting-training-inside-docker-container-for-deepspeech-0-9-1using-gpu]
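The "Failed to get convolution algorithm ... cuDNN failed to initialize" error is raised the first time TensorFlow asks cuDNN for a workspace, so it usually points to the container not seeing the GPU at all, to a host driver/CUDA mismatch, or to the GPU memory already being fully reserved before cuDNN can allocate its own scratch space. A minimal way to isolate this outside DeepSpeech is to run a tiny conv1d in the same container. The sketch below assumes TensorFlow 1.x as shipped in the DeepSpeech 0.9.x training image; the file name check_cudnn.py and the tensor shapes are only illustrative.

# check_cudnn.py -- run inside the same container that fails,
# e.g. docker run --gpus all <image> python3 check_cudnn.py
import tensorflow as tf  # TF 1.15 in the DeepSpeech 0.9.x training image

# First check that TensorFlow can see a CUDA device at all.
print('GPU visible to TF:', tf.test.is_gpu_available(cuda_only=True))

# The failing op in the traceback is tf.nn.conv1d, so a tiny conv1d is
# enough to force cuDNN to initialize.
x = tf.random.normal([1, 100, 26])    # [batch, time, features] -- illustrative shape
w = tf.random.normal([19, 26, 494])   # [filter_width, in_channels, out_channels]
y = tf.nn.conv1d(input=x, filters=w, stride=1, padding='SAME')

config = tf.ConfigProto()
# Growing GPU memory on demand often avoids this exact error when another
# process (or TensorFlow itself) has already reserved the whole GPU.
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    print('conv1d output shape:', sess.run(y).shape)

If this small script fails with the same message, the problem is in the container or driver setup rather than in the training data: make sure the container is started with GPU access (docker run --gpus all, or --runtime=nvidia on older Docker) and that the host driver supports the CUDA 10.0 / cuDNN 7.x combination TensorFlow 1.15 expects. Exporting TF_FORCE_GPU_ALLOW_GROWTH=true before launching training is another common workaround for the memory-reservation case.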