[WIP] refactor of dataset builder and executor #537
base: main
… space in file name
# Validate conversation structure
for item in dataset:
    turns = self._parse_turns(item['text'])
These classes are still in progress, right? Do they need to be updated or implemented later?
The dataset format for conversations is documented here.
> These classes are still in progress, right? Do they need to be updated or implemented later?

Yes. The data validator provides the boilerplate; we can add more validation where it fits. I will update the conversation validator according to the provided doc.
I just realized that the implementation now has too many global imports of RayDataset, which may break the decoupling between the default mode and the ray mode. We should probably test it from scratch in a fresh environment to check whether the minimal requirements are enough to run the demo config files without installing ray.
The implicit violations I found are in:
data_juicer/core/data/data_validator.py
data_juicer/core/data/dataset_builder.py
data_juicer/core/data/load_strategy.py
If that does not work, or if running the demo config installs ray automatically, we may need to convert the global imports to local imports and replace some Union[NestedDataset, RayDataset] annotations with DJDataset.
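The local-import idea above could look roughly like the sketch below. It is only an illustration under stated assumptions: `require_backend`, `build_dataset`, and the stand-in `DJDataset` class here are hypothetical and are not the code in this PR. The point is that the optional backend is imported only inside the branch that needs it, and signatures refer to the shared base class instead of a Union that names the ray-specific type.

```python
import importlib
from types import ModuleType
from typing import Dict, List


class DJDataset:
    """Stand-in for the shared base class (hypothetical, for illustration only)."""

    def __init__(self, rows: List[Dict]) -> None:
        self.rows = list(rows)


def require_backend(module_name: str) -> ModuleType:
    """Import an optional backend at call time instead of at module import time."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"Backend '{module_name}' is not installed; "
            "install it or use the default mode."
        ) from exc


def build_dataset(rows: List[Dict], executor_type: str = "default") -> DJDataset:
    """Return a dataset typed as the base class, so callers never name the ray class."""
    if executor_type == "ray":
        # Local import: only reached when ray mode is explicitly requested.
        require_backend("ray")
        # A real implementation would construct the ray-backed dataset here.
    return DJDataset(rows)
```

With this layout, importing the module and calling `build_dataset(rows)` in the default mode never touches ray, which is the property the fresh-environment test is meant to confirm.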
…validation for conversation post tuning
if k == 0:
    return []
k = min(k, len(self))
return self[column][:k]
For NestedDataset, the following is recommended for efficiency: self.take(k)[column]
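A toy model of why the order matters is sketched below. `ToyDataset` and its `stats` counter are hypothetical, and the assumption that reading a column touches every row is a simplification, not a measurement of NestedDataset. Under that assumption, slicing after the column read pays for all rows, while taking k rows first pays only for k.

```python
from typing import Any, Dict, List, Optional


class ToyDataset:
    """Hypothetical dataset that counts how many rows a column read touches."""

    def __init__(self, rows: List[Dict[str, Any]],
                 stats: Optional[Dict[str, int]] = None) -> None:
        self._rows = list(rows)
        # Counter shared with datasets derived via take(), for easy inspection.
        self.stats = stats if stats is not None else {"decoded": 0}

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(self, column: str) -> List[Any]:
        # Assumption: reading a column materializes every row of this dataset.
        self.stats["decoded"] += len(self._rows)
        return [row[column] for row in self._rows]

    def take(self, k: int) -> "ToyDataset":
        # Select the first k rows without reading any column values.
        return ToyDataset(self._rows[:k], self.stats)

    def get_column_slice_last(self, column: str, k: int) -> List[Any]:
        """Same order as the diff: read the whole column, then slice."""
        if k == 0:
            return []
        k = min(k, len(self))
        return self[column][:k]

    def get_column_take_first(self, column: str, k: int) -> List[Any]:
        """Order suggested in the review: take k rows, then read the column."""
        if k == 0:
            return []
        k = min(k, len(self))
        return self.take(k)[column]
```

Both methods return the same values; only the number of rows touched differs in this model.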
Key elements of this PR:
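a. Support ModelScope
b. Support arXiv: download, decompress, and import
c. Support wiki: download, decompress, and import
d. Support CommonCrawl: download, decompress, and import
design doc: https://aliyuque.antfin.com/yilei.z/cnk4dn/qomvqql62lyglrh2?singleDoc# "Refactoring Design of Dataset/Loader/Executor"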