[WIP] refactor of dataset builder and executor #537

cyruszhang · 2025-01-09T21:18:57Z

Key elements of this PR:

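Define the different dataset sources explicitly in YAML; local and remote sources are defined separately
More flexible, open parameterized control over datasets; each source supports its own parameters and validation, with room left for additional or finer-grained configuration
Remove the hardcoded coupling in the Executor (RayExecutor currently accepts only local JSON input, and that binding is hardcoded); Executor/RayExecutor is no longer tied to a dataset input format, and instead decides whether a dataset can be loaded based on whether the formatter/downloader supports that executor type
Improve the extensibility of the Executor framework so that other engines such as Nemo, Dask, and Spark are easier to support
Support data format validation
Support additional data sources
a. Support modelscope
b. Support arxiv: download, extract, ingest
c. Support wiki: download, extract, ingest
d. Support commoncrawl: download, extract, ingest
Stay compatible with the current command-line dataset_path format
Stay compatible with data mixture
Stay compatible with the empty_formatter/generated_dataset_config path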
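
To make the routing idea in the list above concrete, here is a minimal sketch of a load-strategy registry keyed by executor type, dataset type, and source, with wildcard matching. It is not the code in this PR: `LoadStrategy`, `register`, `find_strategy`, `resolve`, and the config keys `type`, `source`, and `path` are hypothetical names chosen for illustration, and deriving the source of a local dataset from its file extension is an assumption based on the commit messages.

```python
import os
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Dict, Optional, Tuple

# Hypothetical key: (executor_type, data_type, data_source).
StrategyKey = Tuple[str, str, str]


@dataclass(frozen=True)
class LoadStrategy:
    """Hypothetical strategy: a name plus the config fields it requires."""
    name: str
    required_fields: Tuple[str, ...] = ("path",)

    def validate(self, ds_config: dict) -> None:
        # Per-source validation: each strategy declares its own fields.
        missing = [f for f in self.required_fields if f not in ds_config]
        if missing:
            raise ValueError(f"{self.name}: missing config fields {missing}")


_REGISTRY: Dict[StrategyKey, LoadStrategy] = {}


def register(executor_type: str, data_type: str, data_source: str,
             strategy: LoadStrategy) -> None:
    _REGISTRY[(executor_type, data_type, data_source)] = strategy


def find_strategy(executor_type: str, data_type: str,
                  data_source: str) -> Optional[LoadStrategy]:
    """Try an exact match first, then fall back to wildcard patterns."""
    key = (executor_type, data_type, data_source)
    if key in _REGISTRY:
        return _REGISTRY[key]
    for pattern, strategy in _REGISTRY.items():
        if all(fnmatchcase(value, pat) for value, pat in zip(key, pattern)):
            return strategy
    return None


def resolve(executor_type: str, ds_config: dict) -> LoadStrategy:
    """Pick and validate a strategy, or fail if the executor cannot load it."""
    data_type = ds_config.get("type", "local")
    if data_type == "local":
        # Assumption: local datasets are routed by file extension.
        ext = os.path.splitext(ds_config.get("path", ""))[1].lstrip(".")
        source = ext or "unknown"
    else:
        source = ds_config.get("source", "unknown")
    strategy = find_strategy(executor_type, data_type, source)
    if strategy is None:
        raise ValueError(
            f"executor '{executor_type}' cannot load {data_type}/{source}")
    strategy.validate(ds_config)
    return strategy


# Illustrative registrations only; the real support matrix may differ.
register("default", "local", "*", LoadStrategy("default_local"))
register("ray", "local", "json", LoadStrategy("ray_local_json"))
register("default", "remote", "modelscope", LoadStrategy("default_modelscope"))
```

With this shape, the executor no longer hardcodes an input format: supporting a new engine or source means registering another strategy, and an unsupported combination fails early with a clear error.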

design doc: https://aliyuque.antfin.com/yilei.z/cnk4dn/qomvqql62lyglrh2?singleDoc# "Refactoring Design of Dataset/Loader/Executor"

HYLcool · 2025-02-06T07:53:55Z

data_juicer/core/data/data_validator.py

+
+        # Validate conversation structure
+        for item in dataset:
+            turns = self._parse_turns(item['text'])


These classes are still in progress, right? Do they need to be updated or implemented later?

The dataset format for conversations can be found here.

> These classes are still in progress, right? Do they need to be updated or implemented later?

Yes. The data validator provides the boilerplate; we can add more validation where it fits. I will update the conversation validator per the provided doc.
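
As a rough illustration of the "boilerplate plus pluggable checks" idea in this reply, the sketch below shows a base validator with two concrete checks. All class names are hypothetical, and the conversation layout (a `messages` list of `role`/`content` dicts) is only an assumed placeholder; it is not the format from the doc the reviewer linked.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Mapping


class DataValidationError(Exception):
    """Raised when a sample does not match the expected format."""


class BaseValidator(ABC):
    """Hypothetical boilerplate: subclasses implement only `validate`."""

    def __init__(self, config: Mapping):
        self.config = dict(config)

    @abstractmethod
    def validate(self, dataset: Iterable[Mapping]) -> None:
        """Raise DataValidationError on the first invalid sample."""


class RequiredFieldsValidator(BaseValidator):
    def validate(self, dataset: Iterable[Mapping]) -> None:
        required = self.config.get("required_fields", [])
        for idx, item in enumerate(dataset):
            missing = [f for f in required if f not in item]
            if missing:
                raise DataValidationError(
                    f"sample {idx} is missing fields: {missing}")


class ConversationValidator(BaseValidator):
    # Assumed role names; adjust to the real conversation format.
    ALLOWED_ROLES = ("system", "user", "assistant")

    def validate(self, dataset: Iterable[Mapping]) -> None:
        key = self.config.get("messages_key", "messages")
        for idx, item in enumerate(dataset):
            turns = item.get(key)
            if not isinstance(turns, list) or not turns:
                raise DataValidationError(
                    f"sample {idx}: '{key}' must be a non-empty list")
            for turn in turns:
                valid = (isinstance(turn, Mapping)
                         and turn.get("role") in self.ALLOWED_ROLES
                         and isinstance(turn.get("content"), str))
                if not valid:
                    raise DataValidationError(
                        f"sample {idx}: malformed turn {turn!r}")
```

Running validators as a separate pre-processing step keeps format errors out of the operator pipeline and lets each data source opt into only the checks it needs.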

data_juicer/core/data/data_validator.py

data_juicer/core/data/load_strategy.py

data_juicer/core/data/ray_dataset.py

data_juicer/download/downloader.py

data_juicer/download/arxiv.py

data_juicer/download/wikipedia.py

tests/core/test_dataset_builder.py

HYLcool

I just realized that the implementation now involves too many global imports of RayDataset, which might violate the decoupling of the default mode and ray mode. Maybe we need to test it in a new environment from scratch to see if the minimal requirements work on the demo config files without installing ray.

The implicit violations I found are in:

data_juicer/core/data/data_validator.py
data_juicer/core/data/dataset_builder.py
data_juicer/core/data/load_strategy.py

If it doesn't work, or if running the demo config installs ray automatically, we may need to convert the global imports to local imports and replace some Union[NestedDataset, RayDataset] annotations with DJDataset.
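
One way to apply the suggestion above is to keep Ray out of module-level imports and import it only on the Ray code path. The sketch below is an assumption about how that could look, not the change made in this PR; `ray_is_available`, `load_with_ray`, and `load_dataset` are hypothetical helpers, while `ray.data.read_json` is the real Ray Data reader.

```python
import importlib
import importlib.util
import json
from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:
    # Seen by type checkers only; never executed at runtime,
    # so importing this module does not require ray.
    import ray.data


def ray_is_available() -> bool:
    """Check for ray without importing it."""
    return importlib.util.find_spec("ray") is not None


def load_with_ray(path: str) -> "ray.data.Dataset":
    """Local import: ray is loaded only when the Ray path is taken."""
    if not ray_is_available():
        raise ImportError("ray is required for executor_type='ray'")
    ray_data = importlib.import_module("ray.data")
    return ray_data.read_json(path)


def load_dataset(path: str, executor_type: str = "default") -> Any:
    """Hypothetical entry point that dispatches on the executor type."""
    if executor_type == "ray":
        return load_with_ray(path)
    # Default mode: plain JSON Lines reading, no ray involved.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Annotating against a shared base type such as DJDataset, rather than Union[NestedDataset, RayDataset], removes the remaining reason for default-mode modules to import the Ray classes at all.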

HYLcool · 2025-02-19T08:10:17Z

data_juicer/core/data/dj_dataset.py

+            if k == 0:
+                return []
+            k = min(k, len(self))
+            return self[column][:k]


Recommended for NestedDataset considering efficiency: self.take(k)[column]
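
The efficiency argument can be illustrated with a small stand-in class. `TinyDataset` below is hypothetical and only mimics a dataset whose column access reads every row, which is what the reviewer's remark implies for NestedDataset; it counts how many values each approach touches.

```python
from typing import Dict, Iterable, List, Optional


class TinyDataset:
    """Hypothetical stand-in: column access reads every row it holds."""

    def __init__(self, rows: Iterable[dict],
                 counter: Optional[Dict[str, int]] = None):
        self._rows = list(rows)
        # Shared counter so subsets created by take() report to the parent.
        self.counter = counter if counter is not None else {"values_read": 0}

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(self, column: str) -> List:
        self.counter["values_read"] += len(self._rows)
        return [row[column] for row in self._rows]

    def take(self, k: int) -> "TinyDataset":
        return TinyDataset(self._rows[:k], self.counter)


def first_k_slow(ds: TinyDataset, column: str, k: int) -> List:
    # Materializes the whole column, then slices it.
    return ds[column][:k]


def first_k_fast(ds: TinyDataset, column: str, k: int) -> List:
    # Narrows to k rows first, then reads only those rows.
    if k == 0:
        return []
    k = min(k, len(ds))
    return ds.take(k)[column]
```

Both functions return the same values; the difference is that the second one reads k values instead of the full column, which matters when the dataset is large and k is small.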

cyruszhang added 30 commits November 15, 2024 10:51

ignore __dj__produced_data__

d11f89c

add download framework; add wiki support

41dea26

refactor formatter; add dataset_builder

50f8d3d

merge with master

817caab

add config files and test entry

a089de4

initial dataset_builder

5a717d7

Merge branch 'main' into feat/cyruszhang/data-downloader

9c79844

add mixture dataset support; type/subtype

ffba7e7

RayExecutor with ExecutorBase

79ae980

get rid of subtype for local dataset; depending on ext for proper rou…

e6a6e71

…ting

use source instead of sub_type for remote dataset configs

eb300f0

arxiv downloader return Dataset instead of DJDataset

456eea1

rewrite CLI datapath with test cases

c25e40f

add executor and dataload strategy logic

75ffe3f

Merge branch 'main' into feat/cyruszhang/data-downloader

4ec1ef9

add layered load strategies

4fb6e17

Merge branch 'main' into feat/cyruszhang/data-downloader

84803cd

fix circular dependency; add dataset config test

cb5b80a

update dataset_path parsing in config

daf7a85

fix download test case; add wildcard matching for load strategy

7c48892

add test case for load strategy wild card matching

940b44d

add more test cases for datapath rewrite logic; fix rewrite to handle…

b80f991

… space in file name

materialize symlinks for duplicates

0d5d4ba

add load strategy validation framework

f3a4ec4

add DataValidator logic

70fffd2

data validator as separate pre-processing

bbc303d

update data validator logic and add/fix test cases

4b6065f

[nit] rename test

0b153ab

[nit] rename test again

171b361

add builder test cases; update ds config validation logic

6841d19

cyruszhang requested a deployment to Testing January 27, 2025 22:52 — with GitHub Actions Waiting

fix typo in configs

2963118

cyruszhang had a problem deploying to Testing January 29, 2025 18:20 — with GitHub Actions Failure

remove absolute path logic; remove dup test files

4472aef

cyruszhang requested a deployment to Testing February 7, 2025 02:05 — with GitHub Actions Waiting

update .gitignore for dup files in tests

7964867

cyruszhang had a problem deploying to Testing February 7, 2025 02:20 — with GitHub Actions Failure

cyruszhang added 3 commits February 7, 2025 10:57

fix RayDataset schema validation issue

96207ba

fix wiki downloader tests

9b1d738

remove mixture formatter; logic captured in dataloader

828e7ba

cyruszhang had a problem deploying to Testing February 7, 2025 20:23 — with GitHub Actions Failure

cyruszhang removed the request for review from drcege February 7, 2025 20:39

HYLcool reviewed Feb 12, 2025

HYLcool assigned cyruszhang Feb 12, 2025

cyruszhang added 3 commits February 13, 2025 09:09

remove unused mixture formatter

4ffb3cf

minor fixes for CR comments

7c16b23

resolve eager RayExecutor importing

f73dd41

cyruszhang had a problem deploying to Testing February 13, 2025 18:43 — with GitHub Actions Failure

cyruszhang added 2 commits February 13, 2025 12:00

bugfix: handle missing configs

8aae265

add schema support for datasets

1d65a3a

HYLcool reviewed Feb 14, 2025

bugfix: handle relative path problem in tests

96a4997

cyruszhang requested a deployment to Testing February 14, 2025 19:52 — with GitHub Actions Waiting

fix test cases

2f49eec

cyruszhang had a problem deploying to Testing February 14, 2025 20:26 — with GitHub Actions Failure

add schema support for DJDataset; remove eager Ray imports; add data …

643e7d7

…validation for conversation post tuning

cyruszhang requested a deployment to Testing February 19, 2025 02:11 — with GitHub Actions Waiting

revert relative path for demo multi-modal data

0412e36

cyruszhang had a problem deploying to Testing February 19, 2025 02:16 — with GitHub Actions Failure

HYLcool reviewed Feb 19, 2025
