TypeError: 'type' object is not subscriptable #23472

Closed
flckv opened this issue May 19, 2023 · 9 comments

Comments
@flckv

flckv commented May 19, 2023

System Info

**Pre-training wav2vec2 demo**
Running the demo from https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-pretraining/README.md gives this error:

File "./run_wav2vec2_pretraining_no_trainer.py", line 783, in <module>
    main()
File "./run_wav2vec2_pretraining_no_trainer.py", line 510, in main
    vectorized_datasets = raw_datasets.map(
TypeError: 'type' object is not subscriptable

Who can help?

@sanchit-gandhi
@pacman100
@sgugger

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Just reproducing the demo example with the provided script and dataset:
https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-pretraining/README.md#demo

Expected behavior

The output should be a pre-trained wav2vec2 model on the LibriSpeech dataset.

@amyeroberts
Collaborator

Hi @flckv, thanks for raising this error.

I'm unable to reproduce this error when I run locally on the main branch. Could you share the running environment being used: run transformers-cli env in the terminal and copy-paste the output?

@flckv
Author

flckv commented May 20, 2023

Hi @amyeroberts, thanks for the quick reply.

The output of transformers-cli env:

- `transformers` version: 4.26.1
- Platform: Linux-5.4.204-ql-generic-12.0-19-x86_64-with-glibc2.31
- Python version: 3.9.7
- Huggingface_hub version: 0.10.1
- PyTorch version (GPU?): 1.11.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

I am running on a cluster with resources:

#SBATCH --job-name=ol            # Job name
#SBATCH --output=/home/flck/output_.%A.txt   # Standard output and error log
#SBATCH --nodes=1                   # Run all processes on a single node    
#SBATCH --ntasks=1                  # Run on a single CPU
#SBATCH --mem=64G                   # Total RAM to be used
#SBATCH --cpus-per-task=8          # Number of CPU cores
#SBATCH --gres=gpu:3                # Number of GPUs (per node)
#SBATCH -p gpu                      # Use the gpu partition
#SBATCH --time=12:00:00             # Specify the time needed for your experiment
#SBATCH --qos=gpu-8                 # To enable the use of up to 8 GPUs

The .sh file that I run on this cluster has these commands to reproduce the demo:

transformers-cli env
accelerate launch wav2vec/run_wav2vec2_pretraining_no_trainer.py --cache_dir="/dev/shm/" --dataset_name="librispeech_asr" --dataset_config_names clean clean --dataset_split_names validation test --model_name_or_path="patrickvonplaten/wav2vec2-base-v2" --output_dir="./wav2vec2-pretrained-demo" --max_train_steps="20000" --num_warmup_steps="32000" --gradient_accumulation_steps="8" --learning_rate="0.005" --weight_decay="0.01" --max_duration_in_seconds="20.0" --min_duration_in_seconds="2.0" --logging_steps="1" --saving_steps="10000" --per_device_train_batch_size="8" --per_device_eval_batch_size="8" --adam_beta1="0.9" --adam_beta2="0.98" --adam_epsilon="1e-06" --gradient_checkpointing --mask_time_prob="0.65" --mask_time_length="10"
transformers-cli env

Is this what you are asking for?




My guess

I think the error comes from the fact that the dataset preprocessing (line 473) requires the argument "args.audio_column_name", which is not specified in the demo command https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-pretraining/README.md#demo.

1. I tried specifying --audio_column_name= []

I got this error:

-- schema metadata --
huggingface: '{"info": {"features": {"id": {"dtype": "string", "_type": "' + 163
to
{'id': Value(dtype='string', id=None), 'audio': Audio(sampling_rate=16000, mono=True, decode=True, id=None), 'duration_ms': Value(dtype='int32', id=None), 'text': Value(dtype='string', id=None), '[]': Audio(sampling_rate=16000, mono=True, decode=True, id=None)}
because column names don't match

2. I tried specifying --audio_column_name=["audio", "duration_ms", "text"]

error: "run_wav2vec2_pretraining_no_trainer.py: error: unrecognized arguments: duration_ms, text]"

3. I tried specifying --audio_column_name=["audio"], which is the default setting

same issue as in 1.

line 478, in main
    raw_datasets = raw_datasets.cast_column(
raise ValueError(f"Couldn't cast\n{table.schema}\nto\n{features}\nbecause column names don't match")
ValueError: Couldn't cast
id: string
audio: struct<bytes: binary, path: string>
  child 0, bytes: binary
  child 1, path: string
duration_ms: int32
text: string
-- schema metadata --
huggingface: '{"info": {"features": {"id": {"dtype": "string", "_type": "' + 163
to
{'id': Value(dtype='string', id=None), 'audio': Audio(sampling_rate=16000, mono=True, decode=True, id=None), 'duration_ms': Value(dtype='int32', id=None), 'text': Value(dtype='string', id=None), '[audio]': Audio(sampling_rate=16000, mono=True, decode=True, id=None)}
because column names don't match

Any ideas? @amyeroberts @sanchit-gandhi @pacman100 @sgugger

@flckv
Author

flckv commented May 20, 2023

Here is a more detailed output log:

The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `3`
		More than one GPU was found, enabling multi-GPU training.
		If this was unintended please pass in `--num_processes=1`.
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
wandb: Currently logged in as: flckv. Use `wandb login --relogin` to force relogin
wandb: wandb version 0.15.3 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.15.2
wandb: Run data is saved locally in /home/flckv/wandb/run-20230519_175326-yuyk0qvn
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run rural-morning-6
wandb: ⭐️ View project at https://wandb.ai/flckv/wav2vec2-pretrained-demo
wandb: 🚀 View run at https://wandb.ai/flckv/wav2vec2-pretrained-demo/runs/yuyk0qvn
Downloading and preparing dataset librispeech_asr/clean to /dev/shm/librispeech_asr/clean/2.1.0/cff5df6e7955c80a67f80e27e7e655de71c689e2d2364bece785b972acb37fe7...

Downloading data files:   0%|          | 0/4 [00:00<?, ?it/s]
Downloading data files: 100%|██████████| 4/4 [00:00<00:00, 9828.48it/s]

Extracting data files:   0%|          | 0/4 [00:00<?, ?it/s]
Extracting data files: 100%|██████████| 4/4 [00:00<00:00, 2225.39it/s]


Generating train.100 split: 100%|██████████| 28539/28539 [00:17<00:00, 1834.73 examples/s]
                                                                                          

Generating train.360 split: 100%|██████████| 104014/104014 [01:00<00:00, 1634.61 examples/s]
                                                                                            

Generating validation split: 100%|██████████| 2703/2703 [00:01<00:00, 2341.79 examples/s]
                                                                                         

Generating test split:  93%|█████████▎| 2434/2620 [00:01<00:00, 2338.24 examples/s]
                                                                                   
Dataset librispeech_asr downloaded and prepared to /dev/shm/librispeech_asr/clean/2.1.0/cff5df6e7955c80a67f80e27e7e655de71c689e2d2364bece785b972acb37fe7. Subsequent calls will reuse this data.
Found cached dataset librispeech_asr (/dev/shm/librispeech_asr/clean/2.1.0/cff5df6e7955c80a67f80e27e7e655de71c689e2d2364bece785b972acb37fe7)
Found cached dataset librispeech_asr (/dev/shm/librispeech_asr/clean/2.1.0/cff5df6e7955c80a67f80e27e7e655de71c689e2d2364bece785b972acb37fe7)

Downloading:   0%|          | 0.00/214 [00:00<?, ?B/s]
Downloading: 100%|██████████| 214/214 [00:00<00:00, 171kB/s]
loading configuration file preprocessor_config.json from cache at /home/flckv/.cache/huggingface/hub/models--patrickvonplaten--wav2vec2-base-v2/snapshots/9371f1849947b4613f451680a8e96d907617ce86/preprocessor_config.json
Feature extractor Wav2Vec2FeatureExtractor {
  "do_normalize": true,
  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": true,
  "sampling_rate": 16000
}


Map:   0%|          | 0/5270 [00:00<?, ? examples/s]
Map:   0%|          | 0/5270 [00:01<?, ? examples/s]
                                                    

> 
> Traceback (most recent call last):
>   File "/home/flckv/wav2vec/run_wav2vec2_pretraining_no_trainer.py", line 783, in <module>
>     main()
>   File "/home/flckv/wav2vec/run_wav2vec2_pretraining_no_trainer.py", line 510, in main
>     vectorized_datasets = raw_datasets.map(
>   File "/home/flckv/.conda/envs/vcheckworthy/lib/python3.9/site-packages/datasets/dataset_dict.py", line 852, in map
>     {
>   File "/home/flckv/.conda/envs/vcheckworthy/lib/python3.9/site-packages/datasets/dataset_dict.py", line 853, in <dictcomp>
>     k: dataset.map(
>   File "/home/flckv/.conda/envs/vcheckworthy/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
>     out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
>   File "/home/flckv/.conda/envs/vcheckworthy/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
>     out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
>   File "/home/flckv/.conda/envs/vcheckworthy/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2953, in map
>     for rank, done, content in Dataset._map_single(**dataset_kwargs):
>   File "/home/flckv/.conda/envs/vcheckworthy/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3307, in _map_single
>     example = apply_function_on_filtered_inputs(example, i, offset=offset)
>   File "/home/flckv/.conda/envs/vcheckworthy/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3210, in apply_function_on_filtered_inputs
>     processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
>   File "/home/flckv/wav2vec/run_wav2vec2_pretraining_no_trainer.py", line 493, in prepare_dataset
>     sample = batch[args.audio_column_name]
>   File "/home/flckv/.conda/envs/vcheckworthy/lib/python3.9/site-packages/datasets/formatting/formatting.py", line 282, in __getitem__
>     value = self.format(key)
>   File "/home/flckv/.conda/envs/vcheckworthy/lib/python3.9/site-packages/datasets/formatting/formatting.py", line 380, in format
>     return self.formatter.format_column(self.pa_table.select([key]))[0]
>   File "/home/flckv/.conda/envs/vcheckworthy/lib/python3.9/site-packages/datasets/formatting/formatting.py", line 447, in format_column
>     column = self.python_features_decoder.decode_column(column, pa_table.column_names[0])
>   File "/home/flckv/.conda/envs/vcheckworthy/lib/python3.9/site-packages/datasets/formatting/formatting.py", line 228, in decode_column
>     return self.features.decode_column(column, column_name) if self.features else column
>   File "/home/flckv/.conda/envs/vcheckworthy/lib/python3.9/site-packages/datasets/features/features.py", line 1866, in decode_column
>     [decode_nested_example(self[column_name], value) if value is not None else None for value in column]
>   File "/home/flckv/.conda/envs/vcheckworthy/lib/python3.9/site-packages/datasets/features/features.py", line 1866, in <listcomp>
>     [decode_nested_example(self[column_name], value) if value is not None else None for value in column]
>   File "/home/flckv/.conda/envs/vcheckworthy/lib/python3.9/site-packages/datasets/features/features.py", line 1308, in decode_nested_example
>     return schema.decode_example(obj, token_per_repo_id=token_per_repo_id)
>   File "/home/flckv/.conda/envs/vcheckworthy/lib/python3.9/site-packages/datasets/features/audio.py", line 164, in decode_example
>     array, sampling_rate = self._decode_non_mp3_file_like(file)
>   File "/home/flckv/.conda/envs/vcheckworthy/lib/python3.9/site-packages/datasets/features/audio.py", line 290, in _decode_non_mp3_file_like
>     array = librosa.to_mono(array)
>   File "/home/flckv/.conda/envs/vcheckworthy/lib/python3.9/site-packages/lazy_loader/__init__.py", line 77, in __getattr__
>     attr = getattr(submod, name)
>   File "/home/flckv/.conda/envs/vcheckworthy/lib/python3.9/site-packages/lazy_loader/__init__.py", line 76, in __getattr__
>     submod = importlib.import_module(submod_path)
>   File "/home/flckv/.conda/envs/vcheckworthy/lib/python3.9/importlib/__init__.py", line 127, in import_module
>     return _bootstrap._gcd_import(name[level:], package, level)
>   File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
>   File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
>   File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
>   File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
>   File "<frozen importlib._bootstrap_external>", line 850, in exec_module
>   File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
>   File "/home/flckv/.conda/envs/vcheckworthy/lib/python3.9/site-packages/librosa/core/audio.py", line 19, in <module>
>     from .convert import frames_to_samples, time_to_samples
>   File "/home/flckv/.conda/envs/vcheckworthy/lib/python3.9/site-packages/librosa/core/convert.py", line 7, in <module>
>     from . import notation
>   File "/home/flckv/.conda/envs/vcheckworthy/lib/python3.9/site-packages/librosa/core/notation.py", line 8, in <module>
>     from .intervals import INTERVALS
>   File "/home/flckv/.conda/envs/vcheckworthy/lib/python3.9/site-packages/librosa/core/intervals.py", line 10, in <module>
>     from numpy.typing import ArrayLike
>   File "/home/flckv/.conda/envs/vcheckworthy/lib/python3.9/site-packages/numpy/typing/__init__.py", line 158, in <module>
>     from numpy._typing import (
>   File "/home/flckv/.conda/envs/vcheckworthy/lib/python3.9/site-packages/numpy/_typing/__init__.py", line 164, in <module>
>     from ._dtype_like import (
>   File "/home/flckv/.conda/envs/vcheckworthy/lib/python3.9/site-packages/numpy/_typing/_dtype_like.py", line 17, in <module>
>     from ._generic_alias import _DType as DType
>   File "/home/flckv/.conda/envs/vcheckworthy/lib/python3.9/site-packages/numpy/_typing/_generic_alias.py", line 241, in <module>
>     _DType = np.dtype[ScalarType]
> TypeError: 'type' object is not subscriptable
> wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
> wandb: 🚀 View run rural-morning-6 at: https://wandb.ai/flckv/wav2vec2-pretrained-demo/runs/yuyk0qvn
> wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
> wandb: Find logs at: ./wandb/run-20230519_175326-yuyk0qvn/logs
> WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3914267 closing signal SIGTERM
> WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3914268 closing signal SIGTERM
> ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3914266) of binary: /home/flckv/.conda/envs/vcheckworthy/bin/python
> Traceback (most recent call last):
>   File "/home/flckv/.conda/envs/vcheckworthy/bin/accelerate", line 8, in <module>
>     sys.exit(main())
>   File "/home/flckv/.conda/envs/vcheckworthy/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
>     args.func(args)
>   File "/home/flckv/.conda/envs/vcheckworthy/lib/python3.9/site-packages/accelerate/commands/launch.py", line 909, in launch_command
>     multi_gpu_launcher(args)
>   File "/home/flckv/.conda/envs/vcheckworthy/lib/python3.9/site-packages/accelerate/commands/launch.py", line 604, in multi_gpu_launcher
>     distrib_run.run(args)
>   File "/home/flckv/.conda/envs/vcheckworthy/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
>     elastic_launch(
>   File "/home/flckv/.conda/envs/vcheckworthy/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
>     return launch_agent(self._config, self._entrypoint, list(args))
>   File "/home/flckv/.conda/envs/vcheckworthy/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
>     raise ChildFailedError(
> torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
> 
> wav2vec/run_wav2vec2_pretraining_no_trainer.py FAILED
> ------------------------------------------------------------
> Failures:
>   <NO_OTHER_FAILURES>
> ------------------------------------------------------------
> Root Cause (first observed failure):
> [0]:
>   time      : 2023-05-19_17:55:10
>   host      : gpu-08
>   rank      : 0 (local_rank: 0)
>   exitcode  : 1 (pid: 3914266)
>   error_file: <N/A>
>   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
> 
> /var/lib/slurm-llnl/slurmd/job151161/slurm_script: line 42: EOL: command not found
> 

@sanchit-gandhi
Contributor

sanchit-gandhi commented May 22, 2023

Hey @flckv! Could you try first updating all your packages to the latest versions?

pip install --upgrade pip
pip install --upgrade soundfile librosa datasets accelerate numpy transformers

The error looks like it's happening when we decode the soundfile (i.e. as we read the soundfile with librosa). There was recently a big change to how we load audio samples with datasets that might fix this for you: huggingface/datasets#5573
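
If it helps, here is a quick way to confirm which versions the job environment actually resolves (a minimal sketch, assuming these packages are importable in the same environment that `accelerate launch` uses):

# quick version check for the packages involved in audio decoding
import numpy, librosa, soundfile, datasets, transformers, accelerate

for module in (numpy, librosa, soundfile, datasets, transformers, accelerate):
    print(module.__name__, module.__version__)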

@flckv
Author

flckv commented May 23, 2023

@sanchit-gandhi Thanks, but now the command is not working:

accelerate launch wav2vec/run_wav2vec2_pretraining_no_trainer.py --cache_dir="/dev/shm/" --dataset_name="librispeech_asr" --dataset_config_names test --dataset_split_names test --model_name_or_path="patrickvonplaten/wav2vec2-base-v2" --output_dir="./wav2vec2-pretrained-demo" --max_train_steps="20000" --num_warmup_steps="32000" --gradient_accumulation_steps="8" --learning_rate="0.005" --weight_decay="0.01" --max_duration_in_seconds="20.0" --min_duration_in_seconds="2.0" --logging_steps="1" --saving_steps="10000" --per_device_train_batch_size="8" --per_device_eval_batch_size="8" --adam_beta1="0.9" --adam_beta2="0.98" --adam_epsilon="1e-06" --gradient_checkpointing --mask_time_prob="0.65" --mask_time_length="10"

ERROR:

Traceback (most recent call last):
File "/home/flck/wav2vec/run_wav2vec2_pretraining_no_trainer.py", line 785, in
main()
File "/home/flck/wav2vec/run_wav2vec2_pretraining_no_trainer.py", line 513, in main
prepare_dataset(raw_datasets["train"]),  # loading the audio
File "/home/flck/wav2vec/run_wav2vec2_pretraining_no_trainer.py", line 493, in prepare_dataset
sample = batch['args.audio_column_name']
File "/home/flck/.conda/envs/vcheckworthy/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2778, in getitem
return self._getitem(key)
File "/home/flck/.conda/envs/vcheckworthy/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2762, in _getitem
pa_subtable = query_table(self._data, key, indices=self._indices if self._indices is not None else None)
File "/home/flck/.conda/envs/vcheckworthy/lib/python3.9/site-packages/datasets/formatting/formatting.py", line 575, in query_table
_check_valid_column_key(key, table.column_names)
File "/home/flck/.conda/envs/vcheckworthy/lib/python3.9/site-packages/datasets/formatting/formatting.py", line 515, in _check_valid_column_key

raise KeyError(f"Column {key} not in the dataset. Current columns in the dataset: {columns}")
KeyError: "Column args.audio_column_name not in the dataset. Current columns in the dataset: ['id', 'audio', 'duration_ms', 'text']"

Traceback (most recent call last):
File "/home/flck/.conda/envs/vcheckworthy/bin/accelerate", line 8, in
sys.exit(main())
File "/home/flck/.conda/envs/vcheckworthy/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/flck/.conda/envs/vcheckworthy/lib/python3.9/site-packages/accelerate/commands/launch.py", line 918, in launch_command
simple_launcher(args)
File "/home/flck/.conda/envs/vcheckworthy/lib/python3.9/site-packages/accelerate/commands/launch.py", line 580, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/flck/.conda/envs/vcheckworthy/bin/python', 'wav2vec/run_wav2vec2_pretraining_no_trainer.py', '--cache_dir=/dev/shm/', '--dataset_name=librispeech_asr', '--dataset_config_names', 'test', '--dataset_split_names', 'test', '--model_name_or_path=patrickvonplaten/wav2vec2-base-v2', '--output_dir=./wav2vec2-pretrained-demo', '--max_train_steps=20000', '--num_warmup_steps=32000', '--gradient_accumulation_steps=8', '--learning_rate=0.005', '--weight_decay=0.01', '--max_duration_in_seconds=20.0', '--min_duration_in_seconds=2.0', '--logging_steps=1', '--saving_steps=10000', '--per_device_train_batch_size=8', '--per_device_eval_batch_size=8', '--adam_beta1=0.9', '--adam_beta2=0.98', '--adam_epsilon=1e-06', '--gradient_checkpointing', '--mask_time_prob=0.65', '--mask_time_length=10']' returned non-zero exit status 1.

/var/lib/slurm-llnl/slurmd/job153086/slurm_script: line 45: EOL: command not found


When I specify this in the command args:

accelerate launch wav2vec/run_wav2vec2_pretraining_no_trainer.py --cache_dir="/dev/shm/" --dataset_name="librispeech_asr" --dataset_config_names test --dataset_split_names test --model_name_or_path="patrickvonplaten/wav2vec2-base-v2" --output_dir="./wav2vec2-pretrained-demo"--audio_column_name=["id", "audio", "duration_ms", "text"]--max_train_steps="20000" --num_warmup_steps="32000" --gradient_accumulation_steps="8" --learning_rate="0.005" --weight_decay="0.01" --max_duration_in_seconds="20.0" --min_duration_in_seconds="2.0" --logging_steps="1" --saving_steps="10000" --per_device_train_batch_size="8" --per_device_eval_batch_size="8" --adam_beta1="0.9" --adam_beta2="0.98" --adam_epsilon="1e-06" --gradient_checkpointing --mask_time_prob="0.65" --mask_time_length="10"

then the error is:

run_wav2vec2_pretraining_no_trainer.py: error: unrecognized arguments: audio, duration_ms, text]


When I only add "id":
accelerate launch wav2vec/run_wav2vec2_pretraining_no_trainer.py --cache_dir="/dev/shm/" --dataset_name="librispeech_asr" --dataset_config_names test --dataset_split_names test --model_name_or_path="patrickvonplaten/wav2vec2-base-v2" --output_dir="./wav2vec2-pretrained-demo"--audio_column_name=["id"]--max_train_steps="20000" --num_warmup_steps="32000" --gradient_accumulation_steps="8" --learning_rate="0.005" --weight_decay="0.01" --max_duration_in_seconds="20.0" --min_duration_in_seconds="2.0" --logging_steps="1" --saving_steps="10000" --per_device_train_batch_size="8" --per_device_eval_batch_size="8" --adam_beta1="0.9" --adam_beta2="0.98" --adam_epsilon="1e-06" --gradient_checkpointing --mask_time_prob="0.65" --mask_time_length="10"

Traceback (most recent call last):
File "/home/flck/wav2vec/run_wav2vec2_pretraining_no_trainer.py", line 785, in
main()
File "/home/flck/wav2vec/run_wav2vec2_pretraining_no_trainer.py", line 478, in main
raw_datasets = raw_datasets.cast_column(
File "/home/flck/.conda/envs/vcheckworthy/lib/python3.9/site-packages/datasets/dataset_dict.py", line 310, in cast_column
return DatasetDict({k: dataset.cast_column(column=column, feature=feature) for k, dataset in self.items()})
File "/home/flck/.conda/envs/vcheckworthy/lib/python3.9/site-packages/datasets/dataset_dict.py", line 310, in
return DatasetDict({k: dataset.cast_column(column=column, feature=feature) for k, dataset in self.items()})
File "/home/flck/.conda/envs/vcheckworthy/lib/python3.9/site-packages/datasets/fingerprint.py", line 511, in wrapper
out = func(dataset, *args, **kwargs)
File "/home/flck.conda/envs/vcheckworthy/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2082, in cast_column
dataset._data = dataset._data.cast(dataset.features.arrow_schema)
File "/home/flck/.conda/envs/vcheckworthy/lib/python3.9/site-packages/datasets/table.py", line 1152, in cast
return MemoryMappedTable(table_cast(self.table, *args, **kwargs), self.path, replays)
File "/home/flck/.conda/envs/vcheckworthy/lib/python3.9/site-packages/datasets/table.py", line 2290, in table_cast
return cast_table_to_schema(table, schema)
File "/home/flck/.conda/envs/vcheckworthy/lib/python3.9/site-packages/datasets/table.py", line 2248, in cast_table_to_schema
raise ValueError(f"Couldn't cast\n{table.schema}\nto\n{features}\nbecause column names don't match")

ValueError: Couldn't cast
id: string
audio: struct<bytes: binary, path: string>
  child 0, bytes: binary
  child 1, path: string
duration_ms: int32
text: string

-- schema metadata --
huggingface: '{"info": {"features": {"id": {"dtype": "string", "_type": "' + 163
to
{'id': Value(dtype='string', id=None), 'audio': Audio(sampling_rate=16000, mono=True, decode=True, id=None), 'duration_ms': Value(dtype='int32', id=None), 'text': Value(dtype='string', id=None), '[id]': Audio(sampling_rate=16000, mono=True, decode=True, id=None)}
because column names don't match
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: - 0.028 MB of 0.028 MB uploaded (0.000 MB deduped)
wandb: \ 0.028 MB of 0.032 MB uploaded (0.000 MB deduped)
wandb: | 0.035 MB of 0.035 MB uploaded (0.000 MB deduped)
wandb: 🚀 View run helpful-voice-29 at: https://wandb.ai/flck/wav2vec2-pretrained-demo/runs/tnmnebg6
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20230523_133136-tnmnebg6/logs
Traceback (most recent call last):
File "/home/flck/.conda/envs/vcheckworthy/bin/accelerate", line 8, in
sys.exit(main())
File "/home/flck.conda/envs/vcheckworthy/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/homeflck/.conda/envs/vcheckworthy/lib/python3.9/site-packages/accelerate/commands/launch.py", line 918, in launch_command
simple_launcher(args)
File "/home/flck.conda/envs/vcheckworthy/lib/python3.9/site-packages/accelerate/commands/launch.py", line 580, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/flck/.conda/envs/vcheckworthy/bin/python', 'wav2vec/run_wav2vec2_pretraining_no_trainer.py', '--cache_dir=/dev/shm/', '--dataset_name=librispeech_asr', '--dataset_config_names', 'test', '--dataset_split_names', 'test', '--model_name_or_path=patrickvonplaten/wav2vec2-base-v2', '--output_dir=./wav2vec2-pretrained-demo', '--audio_column_name=[id]', '--max_train_steps=20000', '--num_warmup_steps=32000', '--gradient_accumulation_steps=8', '--learning_rate=0.005', '--weight_decay=0.01', '--max_duration_in_seconds=20.0', '--min_duration_in_seconds=2.0', '--logging_steps=1', '--saving_steps=10000', '--per_device_train_batch_size=8', '--per_device_eval_batch_size=8', '--adam_beta1=0.9', '--adam_beta2=0.98', '--adam_epsilon=1e-06', '--gradient_checkpointing', '--mask_time_prob=0.65', '--mask_time_length=10']' returned non-zero exit status 1.

Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

  • transformers version: 4.29.2
  • Platform: Linux-5.4.204-ql-generic-12.0-19-x86_64-with-glibc2.31
  • Python version: 3.9.7
  • Huggingface_hub version: 0.14.1
  • Safetensors version: not installed
  • PyTorch version (GPU?): 1.11.0 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

/var/lib/slurm-llnl/slurmd/job153092/slurm_script: line 50: EOL: command not found
/var/lib/slurm-llnl/slurmd/job153092/slurm_script: line 53: /home/flck/wav2vec/run_wav2vec2_pretraining_no_trainer.py: Permission denied

@sanchit-gandhi
Contributor

Hey @flckv - great! Glad updating to the latest packages fixed the previous error. Can you try setting:

--audio_column_name="audio"

Here we just need to pick out the correct column name for the audio inputs (which in this case is "audio").
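
Note that the option takes a single plain string, not a list; passing something like ["audio"] makes the script look for a column literally named [audio], which is why the earlier cast failed. Roughly what happens inside the script with a correct value (a minimal sketch based on the cast_column call site in your traceback, not the exact upstream code; the toy dataset below is just for illustration):

# sketch: --audio_column_name names one existing column, which gets cast to an Audio feature
from datasets import Dataset, Audio

audio_column_name = "audio"  # value passed via --audio_column_name="audio"

ds = Dataset.from_dict({"id": ["x"], "audio": ["path/to/clip.flac"], "text": ["hi"]})
ds = ds.cast_column(audio_column_name, Audio(sampling_rate=16_000))  # cf. script line 478

# ds.cast_column("[audio]", Audio(sampling_rate=16_000)) would instead raise
# "Couldn't cast ... because column names don't match", since no column is named "[audio]"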

@flckv
Author

flckv commented May 24, 2023

Hey @sanchit-gandhi, yes, great! But the column name is still not interpreted:

I added what you said:

accelerate launch wav2vec/run_wav2vec2_pretraining_no_trainer.py --cache_dir="/dev/shm/" --dataset_name="librispeech_asr" --dataset_config_names test --dataset_split_names test --model_name_or_path="patrickvonplaten/wav2vec2-base-v2" --output_dir="./wav2vec2-pretrained-demo" --audio_column_name="audio" --max_train_steps="20000" --num_warmup_steps="32000" --gradient_accumulation_steps="8" --learning_rate="0.005" --weight_decay="0.01" --max_duration_in_seconds="20.0" --min_duration_in_seconds="2.0" --logging_steps="1" --saving_steps="10000" --per_device_train_batch_size="8" --per_device_eval_batch_size="8" --adam_beta1="0.9" --adam_beta2="0.98" --adam_epsilon="1e-06" --gradient_checkpointing --mask_time_prob="0.65" --mask_time_length="10"

I also tried:

--audio_column_name='audio'
--audio_column_name=['audio']
--audio_column_name=["audio"]


But I still get:


Traceback (most recent call last):
File "/home/flck/wav2vec/run_wav2vec2_pretraining_no_trainer.py", line 785, in
main()
File "/home/flck/wav2vec/run_wav2vec2_pretraining_no_trainer.py", line 513, in main
prepare_dataset(raw_datasets["train"]),
File "/home/flck/wav2vec/run_wav2vec2_pretraining_no_trainer.py", line 493, in prepare_dataset
sample = batch['args.audio_column_name']
File "/home/flck/.conda/envs/vcheckworthy/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2778, in getitem
return self._getitem(key)
File "/home/flck/.conda/envs/vcheckworthy/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2762, in _getitem
pa_subtable = query_table(self._data, key, indices=self._indices if self._indices is not None else None)
File "/home/flcks/.conda/envs/vcheckworthy/lib/python3.9/site-packages/datasets/formatting/formatting.py", line 575, in query_table
_check_valid_column_key(key, table.column_names)
File "/home/flck/.conda/envs/vcheckworthy/lib/python3.9/site-packages/datasets/formatting/formatting.py", line 515, in _check_valid_column_key
raise KeyError(f"Column {key} not in the dataset. Current columns in the dataset: {columns}")
KeyError: "Column args.audio_column_name not in the dataset. Current columns in the dataset: ['id', 'audio', 'duration_ms', 'text']"

@sanchit-gandhi
Contributor

Can you double check you haven't changed the parser args for audio_column_name?

parser.add_argument(
    "--audio_column_name",
    type=str,
    default="audio",
    help="Column in the dataset that contains speech file path. Defaults to 'audio'",
)

I can't see the check that is erroring out for you in the example script. Your error occurs on line 513, but if I check line 513 in the example, I see something completely different from the audio column name check.

Could you make sure you are using the latest version of the script? You can just copy it from main.
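
One more thing worth checking: your traceback shows the quoted literal 'args.audio_column_name' being used as the dictionary key inside prepare_dataset, while the upstream script indexes each example with the parsed variable (see line 493 in your first traceback). A minimal sketch of the difference, with a hypothetical args/batch purely for illustration:

from types import SimpleNamespace

args = SimpleNamespace(audio_column_name="audio")
batch = {"id": "x", "audio": {"array": [0.0], "sampling_rate": 16000}, "text": "hi"}

sample = batch[args.audio_column_name]      # correct: looks up the "audio" column
# sample = batch['args.audio_column_name']  # wrong: KeyError, no column has that literal name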

@flckv
Author

flckv commented May 31, 2023

@sanchit-gandhi thanks, you were right. It works now.

flckv closed this as completed May 31, 2023