
Pretraining: "Running tokenizer on dataset" is executed twice #4221

Closed
CanvaChen opened this issue Jun 11, 2024 · 2 comments
Labels
solved This problem has been already solved

Comments

@CanvaChen

Running tokenizer on dataset (num_proc=48): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1621781/1621781 [07:42<00:00, 3507.62 examples/s]

After the step above finishes, the same step runs again:

Running tokenizer on dataset (num_proc=48): 37%|█████████████████████████████████████████████▋ | 607148/1621781 [00:14<00:06, 145837.70 examples/s]

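Both passes cover the same number of examples and take a similar amount of time, so it looks like the tokenization is being executed twice.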

@github-actions github-actions bot added the pending This problem is yet to be addressed label Jun 11, 2024
@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Jun 12, 2024
@hiyouga hiyouga closed this as completed Jun 12, 2024
@hiyouga hiyouga reopened this Jun 12, 2024
@hiyouga
Owner

hiyouga commented Jun 12, 2024

fixed

@adumans

adumans commented Aug 22, 2024

> fixed

@hiyouga Could you explain why it ran twice? Looking at the change, it seems the main part of the fix is `training_args.local_process_index`. Is that right?
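
For anyone reading later: the thread does not confirm the root cause, so the following is only a sketch of one plausible mechanism, not the actual patch. It assumes that the cache is being overwritten, that the first local process tokenizes before the others (as under a main-process-first barrier), and that a naive "load from cache" flag ignores the process index. `FakeTrainingArgs`, `naive_policy`, `index_aware_policy`, and `count_tokenizer_runs` are hypothetical names made up for illustration; only `local_process_index` comes from the discussion above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FakeTrainingArgs:
    """Hypothetical stand-in for the real training arguments object."""
    local_process_index: int  # 0 for the first process on a node


def naive_policy(overwrite_cache: bool, args: FakeTrainingArgs) -> bool:
    # Assumption: every process ignores the cache when overwrite is requested.
    return not overwrite_cache


def index_aware_policy(overwrite_cache: bool, args: FakeTrainingArgs) -> bool:
    # Assumption: only local process 0 may rebuild the cache; the others reuse it.
    return (not overwrite_cache) or (args.local_process_index != 0)


def count_tokenizer_runs(
    num_local_processes: int,
    overwrite_cache: bool,
    policy: Callable[[bool, FakeTrainingArgs], bool],
) -> int:
    """Simulate processes entering preprocessing one after another."""
    cache_exists = False
    runs = 0
    for index in range(num_local_processes):  # process 0 goes first
        args = FakeTrainingArgs(local_process_index=index)
        if policy(overwrite_cache, args) and cache_exists:
            continue  # this process loads the cached result
        runs += 1  # this process runs the tokenizer itself
        cache_exists = True
    return runs


if __name__ == "__main__":
    print(count_tokenizer_runs(2, True, naive_policy))
    print(count_tokenizer_runs(2, True, index_aware_policy))
```

Under these assumptions, the naive policy makes every process repeat the tokenization when the cache is overwritten, while the index-aware policy runs it once and lets the remaining processes read the result.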
