Describe the bug
I have a dataset that consists of a bunch of text files, each representing one example. There is an undocumented sample_by argument for the TextConfig class that is used by Text to decide whether to split files into lines, split them into paragraphs, or take them whole. Passing sample_by="document" to load_dataset results in files getting split into lines regardless. I have edited src/datasets/packaged_modules/text/text.py for myself to switch the default, and it works fine.
As a side note, the if-else for sample_by will silently load an empty dataset if someone makes a typo in the argument, which is not ideal.
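To make the side note concrete, here is a minimal sketch of the three modes with an explicit check on the argument. It is not the code in text.py: split_text and VALID_SAMPLE_BY are hypothetical names, it works on an in-memory string rather than on file chunks, and the mode names "line" and "paragraph" are my assumption based on the description above ("document" is the one I actually pass).

```python
# Hypothetical illustration only; not the implementation in text.py.
VALID_SAMPLE_BY = ("line", "paragraph", "document")  # "line"/"paragraph" are assumed names


def split_text(text: str, sample_by: str = "line") -> list[str]:
    """Split one file's contents into examples according to sample_by."""
    # Fail loudly on a typo instead of falling through and yielding nothing.
    if sample_by not in VALID_SAMPLE_BY:
        raise ValueError(
            f"sample_by must be one of {VALID_SAMPLE_BY}, got {sample_by!r}"
        )
    if sample_by == "line":
        return text.splitlines()
    if sample_by == "paragraph":
        # Treat a blank line as the paragraph separator; drop empty pieces.
        return [p for p in text.split("\n\n") if p]
    # "document": the whole file is a single example.
    return [text]
```

With a check like this, a misspelled value such as sample_by="documnet" raises a ValueError right away rather than producing an empty dataset.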
Steps to reproduce the bug
Prepare data as a bunch of files in a directory.
Load that data via load_dataset("text", data_files=<data_dir>/<files_glob>, …, sample_by="document").
Inspect the resultant dataset: every item has the form {"text": <a line from a file>}.
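The steps above can be sketched as a small script. This is a sketch under assumptions, not the exact code I ran: make_corpus and reproduce are hypothetical helper names, the file names and contents are made up, and split="train" is added only so that a single dataset is returned.

```python
import tempfile
from pathlib import Path


def make_corpus(root: Path) -> list[Path]:
    """Write a few multi-line text files; each file is meant to be one example."""
    docs = {
        "doc_a.txt": "first line of A\nsecond line of A\n",
        "doc_b.txt": "first line of B\n\nsecond paragraph of B\n",
    }
    paths = []
    for name, body in docs.items():
        path = root / name
        path.write_text(body, encoding="utf-8")
        paths.append(path)
    return paths


def reproduce() -> None:
    """Load the corpus with sample_by="document" and print what comes back."""
    # Imported lazily so make_corpus works without the library installed.
    from datasets import load_dataset

    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        make_corpus(root)
        ds = load_dataset(
            "text",
            data_files=str(root / "*.txt"),
            split="train",
            sample_by="document",
        )
        # Expected: one row per file. The behaviour reported here is one row per line.
        print(len(ds))
        for row in ds:
            print(repr(row["text"]))
```

Calling reproduce() in an environment with datasets installed prints the number of rows followed by each row's text, which makes it easy to see whether the files were kept whole or split.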
Expected behavior
load_dataset("text", data_files=<data_dir>/<files_glob>, …, sample_by="document") should result in a dataset with items of the form {"text": <one document>}.
Environment info
datasets version: 2.18.0
huggingface_hub version: 0.21.4
fsspec version: 2024.2.0