Replies: 4 comments
-
>>> lissyx |
-
>>> keliji3451 |
-
>>> lissyx |
-
>>> keliji3451 |
-
>>> keliji3451
[January 21, 2021, 8:57am]
I have set up training on a Deep Learning AMI from AWS with a vanilla
Python 3.6 virtual environment. nvcc --version reports CUDA 10.0, and
the installed cuDNN library is version 7.5. I did not use a conda env
because I read here that it is not encouraged.
When I start my training with
> export TF_FORCE_GPU_ALLOW_GROWTH=true
>
> python3 -u /home/ubuntu/deepspeech/DeepSpeech/DeepSpeech.py \
> --train_files '/home/ubuntu/deepspeech/train.csv' \
> --dev_files '/home/ubuntu/deepspeech/dev.csv' \
> --test_files '/home/ubuntu/deepspeech/test.csv' \
> --scorer '/home/ubuntu/deepspeech/kenlm.scorer' \
> --alphabet_config_path '/home/ubuntu/deepspeech/alphabet.txt' \
> --train_batch_size 16 \
> --dev_batch_size 16 \
> --test_batch_size 4 \
> --learning_rate 0.0001 \
> --dropout_rate 0.3 \
> --epochs 15 \
> --train_cudnn True \
> --use_allow_growth True \
> --automatic_mixed_precision True
I get the "Fail to find the dnn implementation." error:
> I Enabling automatic mixed precision training.
> I Could not find best validating checkpoint.
> I Could not find most recent checkpoint.
> I Initializing all variables.
> Traceback (most recent call last):
>   File '/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py', line 1365, in _do_call
>     return fn(*args)
>   File '/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py', line 1350, in _run_fn
>     target_list, run_metadata)
>   File '/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py', line 1443, in _call_tf_sessionrun
>     run_metadata)
> tensorflow.python.framework.errors_impl.UnknownError: Fail to find the dnn implementation.
> [[{{node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams}}]]
>
> During handling of the above exception, another exception occurred:
>
> Traceback (most recent call last):
>   File '/home/ubuntu/deepspeech/DeepSpeech/DeepSpeech.py', line 12, in <module>
>     ds_train.run_script()
>   File '/home/ubuntu/deepspeech/DeepSpeech/training/deepspeech_training/train.py', line 976, in run_script
>     absl.app.run(main)
>   File '/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/absl/app.py', line 303, in run
>     _run_main(main, args)
>   File '/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/absl/app.py', line 251, in _run_main
>     sys.exit(main(argv))
>   File '/home/ubuntu/deepspeech/DeepSpeech/training/deepspeech_training/train.py', line 948, in main
>     train()
>   File '/home/ubuntu/deepspeech/DeepSpeech/training/deepspeech_training/train.py', line 527, in train
>     load_or_init_graph_for_training(session)
>   File '/home/ubuntu/deepspeech/DeepSpeech/training/deepspeech_training/util/checkpoints.py', line 137, in load_or_init_graph_for_training
>     _load_or_init_impl(session, methods, allow_drop_layers=True)
>   File '/home/ubuntu/deepspeech/DeepSpeech/training/deepspeech_training/util/checkpoints.py', line 112, in _load_or_init_impl
>     return _initialize_all_variables(session)
>   File '/home/ubuntu/deepspeech/DeepSpeech/training/deepspeech_training/util/checkpoints.py', line 88, in _initialize_all_variables
>     session.run(v.initializer)
>   File '/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py', line 956, in run
>     run_metadata_ptr)
>   File '/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py', line 1180, in _run
>     feed_dict_tensor, options, run_metadata)
>   File '/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py', line 1359, in _do_run
>     run_metadata)
>   File '/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py', line 1384, in _do_call
>     raise type(e)(node_def, op, message)
> tensorflow.python.framework.errors_impl.UnknownError: Fail to find the dnn implementation.
> [[node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams (defined at /venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Any ideas on how to solve or debug this would be appreciated.
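One thing that may be worth ruling out first is which cuDNN build the dynamic loader actually picks up inside the virtual environment, since nvcc only reports the CUDA toolkit version. The following is a minimal sketch, not a confirmed fix. It assumes the library is reachable under the soname libcudnn.so.7 and that the integer returned by cudnnGetVersion() is encoded as major*1000 + minor*100 + patch, which holds for cuDNN 7.x. The helper names parse_cudnn_version and loaded_cudnn_version are hypothetical.

```python
import ctypes


def parse_cudnn_version(version_int):
    """Split a cuDNN 7.x style version integer (e.g. 7500) into a tuple.

    Assumption: the value is major*1000 + minor*100 + patch.
    """
    major, rest = divmod(version_int, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def loaded_cudnn_version(soname="libcudnn.so.7"):
    """Return (major, minor, patch) of the cuDNN library found by the loader.

    Returns None if the library cannot be opened or does not export
    cudnnGetVersion. The default soname is an assumption for cuDNN 7.x.
    """
    try:
        lib = ctypes.CDLL(soname)  # honours LD_LIBRARY_PATH like any dlopen call
    except OSError:
        return None
    try:
        get_version = lib.cudnnGetVersion
    except AttributeError:
        return None
    get_version.restype = ctypes.c_size_t  # C signature: size_t cudnnGetVersion(void)
    return parse_cudnn_version(get_version())


if __name__ == "__main__":
    version = loaded_cudnn_version()
    if version is None:
        print("cuDNN could not be loaded from the current library path")
    else:
        print("cuDNN found by the loader: %d.%d.%d" % version)
```

Running this from the same shell and virtual environment as the training command shows whether the loader finds cuDNN at all, and whether the version matches the 7.5 expected from the AMI. If it prints a different version or finds nothing, LD_LIBRARY_PATH is a reasonable next place to look.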
[Training does not use GPU
[This is an archived TTS discussion thread from discourse.mozilla.org/t/problem-running-on-aws-ec2-dl-ami-fail-to-find-the-dnn-implementation]