Running tokenizer on dataset hangs indefinitely, then subprocesses abruptly die during map operation #5308
Comments
I've hit this problem too. It looks like "Running tokenizer on dataset" takes too long, so multi-node communication waits past its limit (the ddp_timeout parameter) and the job eventually dies. The only options are probably to increase preprocessing_num_workers or to preprocess the data in advance? After reading several related issues, that seems to be the only way. |
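For reference, the ddp_timeout option mentioned above is, as far as I know, the field of the same name on transformers' TrainingArguments, which sets the collective-communication timeout in seconds. A minimal sketch of raising it (the value below is illustrative):

```python
from transformers import TrainingArguments

# Raise the DDP timeout so that slow preprocessing on one rank does not
# kill the whole multi-node job. The default is 1800 seconds.
args = TrainingArguments(
    output_dir="./output",
    ddp_timeout=18000,  # seconds; illustrative value
)
```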
I'm using 4 H100 machines and still hit this problem with preprocessing_num_workers set to 512. It works after shrinking the images, so the issue seems to occur when tokenizing extremely long inputs. |
Your images may be too large. I trained LLaVA on 8 V100s with a bit over 20k image-text pairs and relatively small images, and it was still very slow with preprocessing_num_workers set to 128; your scale is probably much larger than mine. Note that multi-process tokenization only uses the CPU, not the GPU. Another option is to set the dataset loading mode to streaming: true, but in my tests GPU utilization stays low with that setting because training spends most of its time loading the data for the current iteration. Perhaps the only option is to look at how libraries like xtuner handle this. |
I got this problem when I tried to increase max_samples. Any idea how to solve it?
+1 same issue. |
try removing the preprocessing_num_workers option |
@hiyouga I already removed preprocessing_num_workers from the YAML file, and that works. But when I try increasing max_samples, RAM overflows. I'm using qwen2vl_lora_sft.yaml. |
^ Same for me. I tried making it save to disk from time to time (as arrow files), but then I realised that even so, the whole dataset is still expected to be loaded into RAM for training - is that really necessary? I'm also trying to finetune qwen2 vl 7b. |
@Mihaiii Do you have any solution for the problem? |
Apparently there's a streaming param (https://huggingface.co/docs/datasets/v2.21.0/stream) that is made for this use case (i.e., load the dataset in chunks after it has been tokenized and saved on disk as arrow files), but I erased my temp disk with the training data and gave up on my fine-tune project for the moment, so I can't try it. |
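For context, a minimal sketch of that datasets streaming mode. The data file, the "text" column, and the tokenizer checkpoint below are all placeholders:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# streaming=True returns an IterableDataset: examples are read and tokenized
# lazily, so the full dataset never has to fit in RAM.
dataset = load_dataset("json", data_files="train.json", split="train", streaming=True)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")  # illustrative checkpoint

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True)

# map() on an IterableDataset is also lazy; nothing is computed until iteration.
tokenized = dataset.map(tokenize_function, batched=True)

for example in tokenized.take(4):
    print(len(example["input_ids"]))
```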
@huynhbaobk so I would first try to save in chunks (example generated by ChatGPT):

```python
import os

from datasets import Dataset

# Assume 'raw_datasets' is your original dataset and 'tokenizer' is an
# already-loaded tokenizer.

# Directory to save the tokenized dataset in chunks
output_dir = "tokenized_dataset"

# Create the directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding=True)

# Process and save the dataset in chunks of 1000 samples
for i in range(0, len(raw_datasets), 1000):
    # Slice the dataset into a chunk of up to 1000 samples
    chunk = raw_datasets.select(range(i, min(i + 1000, len(raw_datasets))))
    # Tokenize the chunk
    tokenized_chunk = chunk.map(tokenize_function, batched=True)
    # Save the tokenized chunk to disk (save_to_disk writes one directory per chunk)
    tokenized_chunk.save_to_disk(os.path.join(output_dir, f"chunk_{i // 1000}"))
```

Instead of this line (but, of course, keep the old map params):

LLaMA-Factory/src/llamafactory/data/loader.py Line 183 in c87023d
And then load the train dataset and eval dataset from disk with streaming enabled. |
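For completeness, a minimal sketch of loading those saved chunks back from disk, assuming the chunk directory naming from the snippet above. load_from_disk memory-maps the Arrow data rather than copying it into RAM:

```python
import os

from datasets import concatenate_datasets, load_from_disk

output_dir = "tokenized_dataset"

# Collect the chunk directories written by save_to_disk, in numeric order.
chunk_dirs = sorted(
    [os.path.join(output_dir, name) for name in os.listdir(output_dir) if name.startswith("chunk_")],
    key=lambda path: int(path.rsplit("_", 1)[-1]),
)

# Each chunk is memory-mapped from its Arrow files, so concatenation stays cheap on RAM.
train_dataset = concatenate_datasets([load_from_disk(d) for d in chunk_dirs])

# Optionally expose it as an iterable (streaming-style) dataset for training.
train_iterable = train_dataset.to_iterable_dataset()
```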
@Mihaiii LlamaFactory also supports streaming when you specify streaming: true in the config |
I solved the problem by setting |
I have the same problem and still don't know how to fix it. I can run the model with a small sample of around 1,500 examples, but loading the dataset takes a lot of RAM. If I use streaming, it still gets stuck with an error:
@hiyouga could you help us? |
fixed in #5346 |
Reminder
System Info
Reproduction
The dataset is about 30k image-text pairs in sharegpt format, with 2048*2048 images.
With preprocessing_num_workers set to 256 (or 128, 64, etc.), the run stalls at "Running tokenizer on dataset", and after a long wait it fails with "One of the subprocesses has abruptly died during map operation".
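For readers unfamiliar with that error message, it is raised by datasets when a worker process dies during a multi-process map. A minimal, self-contained sketch of the call shape involved, using toy data and a toy function rather than the actual LLaMA-Factory preprocessing:

```python
from datasets import Dataset

# Toy stand-in for the real preprocessing function; the actual one also encodes
# large images, which is what makes each worker slow and memory-hungry.
def preprocess(examples):
    return {"length": [len(t) for t in examples["text"]]}

dataset = Dataset.from_dict({"text": ["example text"] * 10_000})

# The reported hang/crash happens inside a multi-process map of this shape;
# num_proc corresponds to preprocessing_num_workers.
dataset = dataset.map(
    preprocess,
    batched=True,
    num_proc=8,
    desc="Running tokenizer on dataset",
)
```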
Expected behavior
No response
Others
No response