Importing dataset gives unhelpful error message when filenames in metadata.csv are not found in the directory #7369

Open
svencornetsdegroot opened this issue Jan 14, 2025 · 1 comment

svencornetsdegroot commented Jan 14, 2025

Describe the bug

While importing an audiofolder dataset in which the names of the audio files don't correspond to the filenames in metadata.csv, we get an unclear error message that doesn't help with debugging:

ValueError: Instruction "train" corresponds to no data!

Steps to reproduce the bug

Assume an audiofolder containing audio files (filename1.mp3, filename2.mp3, etc.) and a metadata.csv file with the columns file_name and sentence. The file_name values are formatted like filename1.mp3, filename2.mp3, etc.
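
For anyone who wants to recreate the layout, here is a minimal sketch that builds such a folder with deliberately mismatched names. The helper name make_mismatched_audiofolder, the file_name1.mp3 style entries, and the sentences are made up for illustration, and the audio files are empty placeholders rather than valid MP3 data, so this only reproduces the directory structure described above.

```python
import csv
import tempfile
from pathlib import Path


def make_mismatched_audiofolder(root: Path) -> Path:
    """Create an audiofolder whose metadata.csv names differ from the files on disk."""
    data_dir = root / "audiofolder"
    data_dir.mkdir(parents=True, exist_ok=True)

    # Files on disk (empty placeholders, not valid MP3 data).
    for name in ("filename1.mp3", "filename2.mp3"):
        (data_dir / name).touch()

    # metadata.csv refers to names that do not exist in the folder.
    with open(data_dir / "metadata.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["file_name", "sentence"])
        writer.writerow(["file_name1.mp3", "first sentence"])
        writer.writerow(["file_name2.mp3", "second sentence"])
    return data_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        folder = make_mismatched_audiofolder(Path(tmp))
        print(sorted(p.name for p in folder.iterdir()))
```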

Load the dataset:

from datasets import load_dataset
load_dataset("audiofolder", data_dir='/path/to/audiofolder')

When the file_name values in the CSV are out of sync with the filenames in the audiofolder, we get this error message:

File /opt/conda/lib/python3.12/site-packages/datasets/arrow_reader.py:251, in BaseReader.read(self, name, instructions, split_infos, in_memory)
    249 if not files:
    250     msg = f'Instruction "{instructions}" corresponds to no data!'
--> 251     raise ValueError(msg)
    252 return self.read_files(files=files, original_instructions=instructions, in_memory=in_memory)

ValueError: Instruction "train" corresponds to no data!

Note that the call above does not pass split; the "train" in the message is the name of the split being read.

Expected behavior

It would be better to get an error message along these lines:

The metadata.csv file has different filenames than the files in the data directory.

It would have saved me 4 hours of debugging.
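
Until the library reports this itself, a check along these lines can be run before load_dataset. This is only a sketch using the standard library, not anything from datasets: find_missing_files is a hypothetical helper, and it assumes metadata.csv sits at the top of the data directory and that file_name entries are paths relative to it.

```python
import csv
from pathlib import Path


def find_missing_files(data_dir, metadata_name="metadata.csv"):
    """Return the file_name entries in the metadata file that do not exist under data_dir."""
    data_dir = Path(data_dir)
    with open(data_dir / metadata_name, newline="", encoding="utf-8") as f:
        referenced = [row["file_name"] for row in csv.DictReader(f)]
    # Assumption: file_name is relative to the directory that holds metadata.csv.
    return [name for name in referenced if not (data_dir / name).is_file()]
```

Calling find_missing_files('/path/to/audiofolder') returns a list of the referenced names that are not on disk; an empty list means the CSV and the folder agree.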

Environment info

  • datasets version: 3.2.0
  • Platform: Linux-5.14.0-427.40.1.el9_4.x86_64-x86_64-with-glibc2.39
  • Python version: 3.12.8
  • huggingface_hub version: 0.27.0
  • PyArrow version: 18.1.0
  • Pandas version: 2.2.3
  • fsspec version: 2024.9.0

d2a-raudenaerde commented Jan 14, 2025

I'd prefer even more verbose errors, like: "file123.mp3" is referenced in metadata.csv, but not found in the data directory '/path/to/audiofolder' (and 100+ more missing files). Or something along those lines.
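
As a rough sketch of how such a message could be assembled from a list of missing names (format_missing_files_error is a hypothetical helper for illustration, not part of datasets):

```python
def format_missing_files_error(missing, data_dir, metadata_name="metadata.csv"):
    """Build a message that names the first missing file and counts the rest."""
    if not missing:
        raise ValueError("missing must contain at least one file name")
    message = (
        f'"{missing[0]}" is referenced in {metadata_name}, '
        f"but not found in the data directory '{data_dir}'"
    )
    remaining = len(missing) - 1
    if remaining:
        # Summarise the rest instead of listing every missing file.
        message += f" (and {remaining} more missing file{'s' if remaining != 1 else ''})"
    return message
```

Naming only the first file keeps the message short even when hundreds of entries are out of sync.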
