Problem running on aws ec2 dl ami fail to find the dnn implementation #1724

JRMeyer · 2021-03-08T08:40:12Z


>>> keliji3451
[January 21, 2021, 8:57am]

I have set up training on an AWS Deep Learning AMI with a vanilla
Python 3.6 virtual environment. nvcc --version reports CUDA 10.0,
with a cuDNN library for 7.5. I did not use a conda env because I read
here that this is not encouraged.

When I start my training with

> export TF_FORCE_GPU_ALLOW_GROWTH=true
>
> python3 -u /home/ubuntu/deepspeech/DeepSpeech/DeepSpeech.py \
> --train_files '/home/ubuntu/deepspeech/train.csv' \
> --dev_files '/home/ubuntu/deepspeech/dev.csv' \
> --test_files '/home/ubuntu/deepspeech/test.csv' \
> --scorer '/home/ubuntu/deepspeech/kenlm.scorer' \
> --alphabet_config_path '/home/ubuntu/deepspeech/alphabet.txt' \
> --train_batch_size 16 \
> --dev_batch_size 16 \
> --test_batch_size 4 \
> --learning_rate 0.0001 \
> --dropout_rate 0.3 \
> --epochs 15 \
> --train_cudnn True \
> --use_allow_growth True \
> --automatic_mixed_precision True

I get the "Fail to find the dnn implementation" error:

> I Enabling automatic mixed precision training.
> I Could not find best validating checkpoint.
> I Could not find most recent checkpoint.
> I Initializing all variables.
> Traceback (most recent call last):
>   File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
>     return fn(*args)
>   File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
>     target_list, run_metadata)
>   File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
>     run_metadata)
> tensorflow.python.framework.errors_impl.UnknownError: Fail to find the dnn implementation.
>   [[{{node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams}}]]
>
> During handling of the above exception, another exception occurred:
>
> Traceback (most recent call last):
>   File "/home/ubuntu/deepspeech/DeepSpeech/DeepSpeech.py", line 12, in <module>
>     ds_train.run_script()
>   File "/home/ubuntu/deepspeech/DeepSpeech/training/deepspeech_training/train.py", line 976, in run_script
>     absl.app.run(main)
>   File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/absl/app.py", line 303, in run
>     _run_main(main, args)
>   File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
>     sys.exit(main(argv))
>   File "/home/ubuntu/deepspeech/DeepSpeech/training/deepspeech_training/train.py", line 948, in main
>     train()
>   File "/home/ubuntu/deepspeech/DeepSpeech/training/deepspeech_training/train.py", line 527, in train
>     load_or_init_graph_for_training(session)
>   File "/home/ubuntu/deepspeech/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 137, in load_or_init_graph_for_training
>     _load_or_init_impl(session, methods, allow_drop_layers=True)
>   File "/home/ubuntu/deepspeech/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 112, in _load_or_init_impl
>     return _initialize_all_variables(session)
>   File "/home/ubuntu/deepspeech/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 88, in _initialize_all_variables
>     session.run(v.initializer)
>   File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
>     run_metadata_ptr)
>   File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
>     feed_dict_tensor, options, run_metadata)
>   File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
>     run_metadata)
>   File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
>     raise type(e)(node_def, op, message)
> tensorflow.python.framework.errors_impl.UnknownError: Fail to find the dnn implementation.
>   [[node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams (defined at /venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Any ideas on how to solve or debug this would be appreciated.
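
One possible starting point for debugging, offered as a sketch rather than an official DeepSpeech or TensorFlow tool: list the cuDNN libraries visible through LD_LIBRARY_PATH, in search order, to see which version is likely to be picked up first. The function name find_cudnn_libraries is hypothetical, and the scan only covers LD_LIBRARY_PATH; it assumes the library files follow the usual libcudnn.so.<version> naming and does not look at the ldconfig cache or other loader search locations.

```python
import os
import re
from pathlib import Path

# Matches file names such as libcudnn.so, libcudnn.so.7 or libcudnn.so.7.6.5
CUDNN_PATTERN = re.compile(r"^libcudnn\.so(\.\d+)*$")


def find_cudnn_libraries(ld_library_path):
    """Return (directory, file name) pairs for cuDNN libraries, in search order."""
    found = []
    for directory in ld_library_path.split(os.pathsep):
        # Skip empty entries and directories that do not exist
        if not directory or not Path(directory).is_dir():
            continue
        for entry in sorted(Path(directory).iterdir()):
            if CUDNN_PATTERN.match(entry.name):
                found.append((directory, entry.name))
    return found


if __name__ == "__main__":
    search_path = os.environ.get("LD_LIBRARY_PATH", "")
    for directory, name in find_cudnn_libraries(search_path):
        print(f"{directory}: {name}")
```

If the first directory listed holds a 7.5 library while TensorFlow 1.15 expects a different version, that mismatch is a plausible cause of the error above.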


[This is an archived TTS discussion thread from discourse.mozilla.org/t/problem-running-on-aws-ec2-dl-ami-fail-to-find-the-dnn-implementation]

JRMeyer · 2021-03-08T08:40:15Z


>>> lissyx
[January 21, 2021, 9:29am]

Have you properly searched the documentation?

?highlight=cudnn#prerequisites-for-training-a-model>

[Archived Post]


JRMeyer · 2021-03-08T08:40:18Z


>>> keliji3451
[January 21, 2021, 9:42am]

Thanks for pointing that out. I had a look at the link to the TF
documentation and it states 7.4 for TF 1.15:

https://www.tensorflow.org/install/source#gpu

Is there any simple or recommended way to upgrade to CuDNN 7.6?

[Archived Post]


JRMeyer · 2021-03-08T08:40:20Z


>>> lissyx
[January 21, 2021, 10:14am]

Sorry, but read the link I shared: it's 7.6. We can't help with AWS
specifics. Please investigate the AMI; I guess it already provides a
tensorflow-gpu package. It needs to be 1.15. If that's the case, use
DS_NOTENSORFLOW=y when running the deepspeech install. Search the repo
for example usage of that. It will disable our install of the tensorflow
dependency, so the one provided by your AMI should be used instead.
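
Before relying on the TensorFlow that ships with the AMI, it may help to confirm that it really is a 1.15 release. The following is a minimal sketch, not part of DeepSpeech: the helper names is_supported_tensorflow and installed_tensorflow_version are hypothetical, and the check only compares the major and minor version numbers.

```python
def is_supported_tensorflow(version, required=(1, 15)):
    """Return True if a version string such as '1.15.4' has the required major.minor."""
    parts = version.split(".")
    try:
        major_minor = (int(parts[0]), int(parts[1]))
    except (IndexError, ValueError):
        # Not a recognisable major.minor version string
        return False
    return major_minor == required


def installed_tensorflow_version():
    """Return the installed TensorFlow version string, or None if it cannot be imported."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return tf.__version__


if __name__ == "__main__":
    version = installed_tensorflow_version()
    if version is None:
        print("TensorFlow is not importable in this environment")
    elif is_supported_tensorflow(version):
        print(f"TensorFlow {version} matches the 1.15 requirement")
    else:
        print(f"TensorFlow {version} is not a 1.15 release")
```

Run it with the same Python interpreter that will run the training, since a virtual environment may see a different TensorFlow than the system Python does.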

[Archived Post]


JRMeyer · 2021-03-08T08:40:23Z


>>> keliji3451
[January 21, 2021, 11:56am]

Solved it now.

You have to use 7.6 and you do that by:

nvidia.

others in LD-LIB in .dlamirc or your startup script so this CuDNN
version is found before the other one
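
Parts of the post above were lost in archiving, so the exact steps are not recoverable. As a rough sketch of the general idea, assuming "LD-LIB" refers to the LD_LIBRARY_PATH environment variable, the snippet below builds a search path with one directory moved to the front. The function name prepend_library_dir and the directory /usr/local/cuda-10.0/lib64 are hypothetical placeholders, not values taken from the thread.

```python
import os


def prepend_library_dir(current_value, new_dir):
    """Return a search-path string with new_dir first and any duplicates of it removed."""
    entries = [e for e in current_value.split(os.pathsep) if e and e != new_dir]
    return os.pathsep.join([new_dir] + entries)


if __name__ == "__main__":
    # Hypothetical location of the cuDNN 7.6 libraries; adjust to the actual AMI layout.
    cudnn_dir = "/usr/local/cuda-10.0/lib64"
    updated = prepend_library_dir(os.environ.get("LD_LIBRARY_PATH", ""), cudnn_dir)
    # Print an export line that can be pasted into a startup script
    print(f"export LD_LIBRARY_PATH={updated}")
```

The script prints an export line instead of modifying os.environ, because the dynamic loader generally reads LD_LIBRARY_PATH when a process starts; the variable therefore needs to be set in the shell or startup script before training is launched.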

[Archived Post]
