
Improvement: NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for best_dev_checkpoint #2338

Open
wasertech opened this issue Jan 22, 2023 · 2 comments
Labels
bug Something isn't working enhancement New feature or request

Comments

@wasertech
Collaborator

I'm trying to optimize my LM, but lm_optimizer.py throws a NotFoundError, apparently because the environment has CuDNN disabled.

Checkpoint loading failed due to missing tensors, retrying with --load_cudnn true - You should specify this flag whenever loading a checkpoint that was created with --train_cudnn true in an environment that has CuDNN disabled.

I want to use my GPU --'

FutureWarning: suggest_uniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use :func:`~optuna.trial.Trial.suggest_float` instead.

Related?
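The FutureWarning is Optuna's own deprecation notice and only concerns how the trial parameters are sampled, so it is most likely independent of the NotFoundError. Going by the warning text, the replacement is suggest_float. A minimal sketch of that change, assuming a hypothetical helper named suggest_lm_params (not part of lm_optimize.py) that receives the trial object and the two upper bounds:

```python
def suggest_lm_params(trial, lm_alpha_max, lm_beta_max):
    """Sample lm_alpha and lm_beta with suggest_float instead of suggest_uniform.

    `trial` is expected to be an optuna Trial (or anything exposing
    suggest_float(name, low, high)). The helper name is hypothetical.
    """
    lm_alpha = trial.suggest_float("lm_alpha", 0, lm_alpha_max)
    lm_beta = trial.suggest_float("lm_beta", 0, lm_beta_max)
    return lm_alpha, lm_beta
```

In the objective this would replace the two trial.suggest_uniform(...) calls, with the results assigned to Config.lm_alpha and Config.lm_beta as before.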

NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /mnt/checkpoints/best_dev-221133

I have a bad feeling about this one.

+ python -u /home/trainer/lm_optimizer.py --show_progressbar true --train_cudnn true --alphabet_config_path /mnt/models/fr/alphabet.txt --scorer_path /mnt/lm/fr/kenlm.scorer --feature_cache /mnt/sources/fr/feature_cache --test_files /mnt/extracted/fr/data/Assistant/train_test.csv --test_batch_size 64 --n_hidden 2048 --lm_alpha_max 2 --lm_beta_max 4 --n_trials 50 --checkpoint_dir /transfer-checkpoint
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
[I 2023-01-22 23:18:04,503] A new study created in memory with name: no-name-0f421b63-297c-468c-b30d-8aa59857a843
/home/trainer/stt/training/coqui_stt_training/util/lm_optimize.py:30: FutureWarning: suggest_uniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use :func:`~optuna.trial.Trial.suggest_float` instead.
  Config.lm_alpha = trial.suggest_uniform("lm_alpha", 0, Config.lm_alpha_max)
/home/trainer/stt/training/coqui_stt_training/util/lm_optimize.py:31: FutureWarning: suggest_uniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use :func:`~optuna.trial.Trial.suggest_float` instead.
  Config.lm_beta = trial.suggest_uniform("lm_beta", 0, Config.lm_beta_max)
I Loading best validating checkpoint from /mnt/checkpoints/best_dev-221133
W Checkpoint loading failed due to missing tensors, retrying with --load_cudnn true - You should specify this flag whenever loading a checkpoint that was created with --train_cudnn true in an environment that has CuDNN disabled.
[W 2023-01-22 23:18:05,201] Trial 0 failed with parameters: {'lm_alpha': 0.26985826312830485, 'lm_beta': 1.3371065634850314} because of the following error: NotFoundError().
Traceback (most recent call last):
  File "/home/trainer/stt/training/coqui_stt_training/util/checkpoints.py", line 121, in _load_checkpoint
    return _load_checkpoint_impl(
  File "/home/trainer/stt/training/coqui_stt_training/util/checkpoints.py", line 21, in _load_checkpoint_impl
    ckpt = tfv1.train.load_checkpoint(checkpoint_path)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/checkpoint_utils.py", line 66, in load_checkpoint
    return pywrap_tensorflow.NewCheckpointReader(filename)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 873, in NewCheckpointReader
    return CheckpointReader(compat.as_bytes(filepattern))
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 885, in __init__
    this = _pywrap_tensorflow_internal.new_CheckpointReader(filename)
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /mnt/checkpoints/best_dev-221133

To Reproduce
Steps to reproduce the behavior:
Full logs

Expected behavior
A study should start on the GPU for 50 trials.

Environment (please complete the following information): Docker

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Docker
  • TensorFlow installed from (our builds, or upstream TensorFlow): 22.02-tf1
  • TensorFlow version (use command below): 22.02-tf1
  • Python version: 3.8
  • Bazel version (if compiling from source): 5.0
  • GCC/Compiler version (if compiling from source): 10
  • CUDA/cuDNN version: 11.6.0.021
  • GPU model and memory: RTX 3060 12 GB
  • Exact command to reproduce: python -u /home/trainer/lm_optimizer.py --show_progressbar true --train_cudnn true --alphabet_config_path /mnt/models/fr/alphabet.txt --scorer_path /mnt/lm/fr/kenlm.scorer --feature_cache /mnt/sources/fr/feature_cache --test_files /mnt/extracted/fr/data/Assistant/train_test.csv --test_batch_size 64 --n_hidden 2048 --lm_alpha_max 2 --lm_beta_max 4 --n_trials 50 --checkpoint_dir /transfer-checkpoint

Additional context
Built using the Training Wizard for STT

@wasertech wasertech added the bug Something isn't working label Jan 22, 2023
@wasertech
Collaborator Author

I think /mnt/checkpoints/best_dev-221133 doesn't exist, but I can't seem to find where it comes from... The checkpoint file is in /transfer-checkpoint.

@wasertech
Collaborator Author

wasertech commented Jan 23, 2023

Yes, it's /transfer-checkpoint/best_dev_checkpoint that points to /mnt/checkpoints/best_dev-221133:

# /transfer-checkpoint/best_dev_checkpoint
model_checkpoint_path: "/mnt/checkpoints/best_dev-221133"
all_model_checkpoint_paths: "/mnt/checkpoints/best_dev-221133"
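
A quick way to confirm this kind of stale pointer before launching a study is to read the state file and check whether any checkpoint files exist for the recorded prefix. A minimal sketch, assuming the state file uses the two-line text format shown above and that checkpoint files are named after the prefix plus a suffix (for example .index or .meta); the helper names read_checkpoint_pointer and checkpoint_files are hypothetical:

```python
import glob
import os
import re

# Matches: model_checkpoint_path: "/mnt/checkpoints/best_dev-221133"
# (the all_model_checkpoint_paths line is skipped because of the ^ anchor)
_POINTER_RE = re.compile(r'^model_checkpoint_path:\s*"(.*)"\s*$', re.MULTILINE)


def read_checkpoint_pointer(checkpoint_dir, state_file="best_dev_checkpoint"):
    """Return the checkpoint prefix recorded in the state file, or None."""
    state_path = os.path.join(checkpoint_dir, state_file)
    with open(state_path, encoding="utf-8") as f:
        match = _POINTER_RE.search(f.read())
    if match is None:
        return None
    prefix = match.group(1)
    # Resolve relative prefixes against the checkpoint directory.
    if not os.path.isabs(prefix):
        prefix = os.path.join(checkpoint_dir, prefix)
    return prefix


def checkpoint_files(prefix):
    """List the files on disk that belong to the given checkpoint prefix."""
    return sorted(glob.glob(glob.escape(prefix) + ".*"))


if __name__ == "__main__":
    ckpt_dir = "/transfer-checkpoint"  # value passed as --checkpoint_dir
    prefix = read_checkpoint_pointer(ckpt_dir)
    print("recorded prefix:", prefix)
    print("files found:", checkpoint_files(prefix) if prefix else [])
```

An empty list for a recorded prefix means the state file still references the directory where the checkpoint was originally written. TensorFlow 1.x also provides tf.train.get_checkpoint_state(checkpoint_dir, latest_filename) for reading the same file; the sketch avoids it only so that it runs without TensorFlow.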

lm_optimizer should probably catch tensorflow.python.framework.errors_impl.NotFoundError here:

current_samples = evaluate([test_file], create_model)

Or directly when computing results, in main:
results = compute_lm_optimization()
print(
    "Best params: lm_alpha={} and lm_beta={} with WER={}".format(
        results.get("lm_alpha"),
        results.get("lm_beta"),
        results.get("wer"),
    )
)

Something like:

import sys
...
from tensorflow.python.framework.errors_impl import NotFoundError
...
try:
    results = compute_lm_optimization()
    print(
        "Best params: lm_alpha={} and lm_beta={} with WER={}".format(
            results.get("lm_alpha"),
            results.get("lm_beta"),
            results.get("wer"),
        )
    )
except NotFoundError as e:
    print("Your checkpoint /transfer-checkpoint/best_dev_checkpoint points to a missing checkpoint file /mnt/checkpoints/best_dev-221133\nMake sure you give a valid --checkpoint_dir path.")
    sys.exit(1)

Note: need to find variables holding /transfer-checkpoint/best_dev_checkpoint and /mnt/checkpoints/best_dev-221133. (filename and checkpoint_path?)
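
One possible way around hunting for those variables: the state-file path can be rebuilt from the --checkpoint_dir value, and the missing path already appears in the exception message. A rough sketch, assuming a hypothetical wrapper named run_or_exit; the exception class is passed in so the example runs without TensorFlow, and in lm_optimizer.py it would be the NotFoundError imported above:

```python
import os
import sys


def run_or_exit(run, checkpoint_dir, not_found_error=FileNotFoundError,
                state_file="best_dev_checkpoint"):
    """Call run() and exit with a readable message if the checkpoint is missing.

    run: zero-argument callable, e.g. compute_lm_optimization.
    checkpoint_dir: the directory given via --checkpoint_dir.
    not_found_error: exception class to catch (hypothetical parameter).
    """
    try:
        return run()
    except not_found_error as e:
        state_path = os.path.join(checkpoint_dir, state_file)
        print(
            "Could not load the checkpoint referenced by {}: {}\n"
            "Make sure --checkpoint_dir is valid and that the path recorded "
            "in that file exists.".format(state_path, e),
            file=sys.stderr,
        )
        sys.exit(1)
```

In main this would look like results = run_or_exit(compute_lm_optimization, Config.checkpoint_dir, NotFoundError), assuming Config.checkpoint_dir holds the flag value. Note that the log above shows Optuna reporting the trial as failed, so the exception may need to be re-raised or caught inside the objective for this wrapper to see it.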

@wasertech wasertech added the enhancement New feature or request label Jan 23, 2023
@wasertech wasertech changed the title Bug: NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for best_dev_checkpoint Improvement: NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for best_dev_checkpoint Jan 23, 2023