Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] refactor of dataset builder and executor #537

Open
wants to merge 66 commits into
base: main
Choose a base branch
from

Conversation

cyruszhang
Copy link
Collaborator

Key elements of this PR:

  1. YAML显式定义dataset不同来源;local和remote分开定义
  2. 更加灵活开放的数据集参数化控制;根据不同来源,支持不同参数和相关验证;并留出口子支持更多追加/细节配置
  3. 解绑Executor的hardcode支持(目前RayExecutor只接受local json格式,并在代码层面hardcode绑定);Executor/RayExecutor不绑定dataset输入格式,但是根据formatter/downloader对于executor类型的支持来判断是否可加载
  4. 提高Executor框架扩展性,以更方便支持Nemo、Dask、Spark等其他引擎
  5. 支持数据格式验证
  6. 额外的数据来源支持
    a. 支持modelscope
    b. 支持arxiv,下载、解压、引入
    c. 支持wiki,下载、解压、引入
    d. 支持commoncrawl,下载、解压、引入
  7. 兼容命令行目前的dataset_path格式
  8. 兼容数据混搭,data mixture
  9. 兼容empty_formatter/generated_dataset_config通路

design doc: https://aliyuque.antfin.com/yilei.z/cnk4dn/qomvqql62lyglrh2?singleDoc# 《Dataset/Loader/Executor的重构设计》

@cyruszhang cyruszhang removed the request for review from drcege February 7, 2025 20:39

# Validate conversation structure
for item in dataset:
turns = self._parse_turns(item['text'])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These classes are still in progress, right? Do they need to be updated or implemented later?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dataset format of conversations can be referred here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These classes are still in progress, right? Do they need to be updated or implemented later?

Yes. data validator provides the boilerplate; we can add more validation where it fits. Will update the conversation validator per provided doc.

data_juicer/core/data/data_validator.py Outdated Show resolved Hide resolved
data_juicer/core/data/load_strategy.py Outdated Show resolved Hide resolved
data_juicer/core/data/load_strategy.py Outdated Show resolved Hide resolved
data_juicer/core/data/ray_dataset.py Outdated Show resolved Hide resolved
data_juicer/download/downloader.py Show resolved Hide resolved
data_juicer/download/arxiv.py Outdated Show resolved Hide resolved
data_juicer/download/wikipedia.py Show resolved Hide resolved
tests/core/test_dataset_builder.py Outdated Show resolved Hide resolved
tests/core/test_dataset_builder.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@HYLcool HYLcool left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized that the implementation now involves too many global imports of RayDataset, which might violate the decoupling of the default mode and ray mode. Maybe we need to test it in a new environment from scratch to see if the minimal requirements work on the demo config files without installing ray.

Those implicit violations I found are:

  1. data_juicer/core/data/data_validator.py
  2. data_juicer/core/data/dataset_builder.py
  3. data_juicer/core/data/load_strategy.py

If it doesn't work or it will install ray automatically when running the demo config, maybe we need to convert the global imports to local imports and replace some Union[NestedDataset, RayDataset] with DJDataset

if k == 0:
return []
k = min(k, len(self))
return self[column][:k]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommended for NestedDataset considering efficiency: self.take(k)[column]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dj:core issues/PRs about the core functions of Data-Juicer dj:dataset issues/PRs about the dj-dataset enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants