
Running tokenizer on dataset hangs, then subprocesses abruptly die during map operation #5308

Closed
1 task done
zuishusheng opened this issue Aug 29, 2024 · 18 comments · Fixed by #5346
Labels
solved This problem has been already solved

Comments

@zuishusheng

Reminder

  • I have read the README and searched the existing issues.

System Info

(screenshot: Snipaste_2024-08-30_01-20-17)

Reproduction

Using 2048*2048 images, a dataset of 30,000 image-text pairs in sharegpt format.
With preprocessing_num_workers set to 256 (or 128/64, etc.), the job stalls at "Running tokenizer on dataset", and after a long wait it fails with "One of the subprocesses has abruptly died during map operation".

Expected behavior

No response

Others

No response

@github-actions github-actions bot added the pending This problem is yet to be addressed label Aug 29, 2024
@wwwbq

wwwbq commented Aug 29, 2024

I just ran into this problem too. It seems "Running tokenizer on dataset" takes too long, so the multi-node communication waits too long (the ddp_timeout parameter) and the job finally dies. Maybe the only options are to increase preprocessing_num_workers or to preprocess the data in advance? After reading a few issues, it looks like that's the only way.

@zuishusheng
Author

I just ran into this problem too. It seems "Running tokenizer on dataset" takes too long, so the multi-node communication waits too long (the ddp_timeout parameter) and the job finally dies. Maybe the only options are to increase preprocessing_num_workers or to preprocess the data in advance? After reading a few issues, it looks like that's the only way.

I'm using 4 H100 machines. With preprocessing_num_workers set to 512 the problem still occurs; after shrinking the images it works, so the problem seems to be in tokenizing extremely long inputs.
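
For illustration, a minimal sketch of the image-shrinking workaround mentioned above, assuming the images live in a local folder; the directory names and the 1024-pixel cap are hypothetical choices, not LLaMA-Factory settings:

# Downscale images before preprocessing so the multimodal tokenizer
# produces far fewer vision tokens per sample.
import os
from PIL import Image

SRC_DIR, DST_DIR, MAX_SIDE = "raw_images", "resized_images", 1024  # placeholder values

os.makedirs(DST_DIR, exist_ok=True)
for name in os.listdir(SRC_DIR):
    img = Image.open(os.path.join(SRC_DIR, name)).convert("RGB")
    scale = MAX_SIDE / max(img.size)
    if scale < 1.0:  # only shrink, never enlarge
        img = img.resize((round(img.width * scale), round(img.height * scale)))
    img.save(os.path.join(DST_DIR, name))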

@wwwbq

wwwbq commented Aug 30, 2024

I just ran into this problem too. It seems "Running tokenizer on dataset" takes too long, so the multi-node communication waits too long (the ddp_timeout parameter) and the job finally dies. Maybe the only options are to increase preprocessing_num_workers or to preprocess the data in advance? After reading a few issues, it looks like that's the only way.

I'm using 4 H100 machines. With preprocessing_num_workers set to 512 the problem still occurs; after shrinking the images it works, so the problem seems to be in tokenizing extremely long inputs.

Your images are probably too large. I'm training llava on 8 V100s with 20k+ image-text pairs and relatively small images, and even with preprocessing_num_workers set to 128 it is still slow; your scale must be much larger than mine. Note that the multiprocess tokenization only uses the CPU, not the GPU. Another option is to set the dataset loading mode to streaming: true, but in my tests GPU utilization could not be kept up during training with that mode, since it basically spends all its time loading the current iteration's data. Perhaps the only way is to look at how libraries like xtuner handle this.
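
For context, a minimal sketch of what streaming does with the Hugging Face datasets library: the dataset becomes an IterableDataset whose samples are read lazily instead of being materialized in RAM. The JSON path below is a placeholder, not a file from this issue:

from datasets import load_dataset

# Streaming returns an IterableDataset; samples are read lazily,
# so the full set of image-text pairs never sits in RAM at once.
streamed = load_dataset("json", data_files="sharegpt_data.json", split="train", streaming=True)

for example in streamed.take(2):  # take() yields only the first N examples
    print(example.keys())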

@huynhbaobk

huynhbaobk commented Aug 30, 2024

I got the problem when I tried to increase max_samples. Any ideas on how to solve this?

@zuishusheng
Author

zuishusheng commented Aug 30, 2024

Debugging this further, I found that during multiprocess data processing some worker processes die, which makes the map operation time out and fail.
(screenshot: Snipaste_2024-08-31_04-33-48)

Setting preprocessing_num_workers too high easily exhausts memory and freezes the whole machine, leaving a reboot as the only option.
Increasing the processes' timeout does not solve the problem.
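
As a rough, hedged sketch of the memory trade-off described above: one could cap the worker count by available RAM instead of by CPU count. The 4 GiB per-worker figure is an arbitrary illustration, not a measured value:

import os
import psutil  # third-party: pip install psutil

# Each tokenizer worker holds decoded images and tokenized batches in memory,
# so cap the worker count by available RAM rather than by CPU count alone.
BYTES_PER_WORKER = 4 * 1024**3  # assumed ~4 GiB per worker, purely illustrative

available = psutil.virtual_memory().available
num_workers = max(1, min(os.cpu_count() or 1, available // BYTES_PER_WORKER))
print(f"suggested preprocessing_num_workers: {num_workers}")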

@Mihaiii

Mihaiii commented Aug 31, 2024

+1 same issue.

@hiyouga
Owner

hiyouga commented Aug 31, 2024

try removing the preprocessing_num_workers argument

@huynhbaobk

@hiyouga I already removed preprocessing_num_workers in the YAML file, and it works. But when I try increasing max_samples, the RAM overflows. I'm using qwen2vl_lora_sft.yaml.

@Mihaiii

Mihaiii commented Aug 31, 2024

^ Same for me. I tried to make it save to disk from time to time (in Arrow files), but then I realised that even if I do that, the whole dataset would still be expected to sit in RAM for training - is that really needed/necessary?

I'm also trying to finetune qwen2 vl 7b.

@huynhbaobk

@Mihaiii Do you have any solution for the problem?

@Mihaiii

Mihaiii commented Sep 1, 2024

@Mihaiii Do you have any solution for the problem?

Apparently there's a streaming param (https://huggingface.co/docs/datasets/v2.21.0/stream) that is made for this use case (that is, loading the dataset in chunks after it has been tokenized and saved to disk as Arrow files), but I erased my temp disk with the training data and gave up on my fine-tune project for the moment, so I can't try it.

@Mihaiii

Mihaiii commented Sep 1, 2024

@huynhbaobk so I would first try to save in chunks (example generated by ChatGPT):

from datasets import Dataset
import os

# Assume 'raw_datasets' is your original dataset (a datasets.Dataset)
# and 'tokenizer' is an already-loaded tokenizer.

# Directory to save the tokenized dataset in chunks
output_dir = "tokenized_dataset"

# Create the directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)

def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True, padding=True)

# Process and save the dataset in chunks of 1000 samples
for i in range(0, len(raw_datasets), 1000):
    # Slice the dataset into a chunk of up to 1000 samples
    chunk = raw_datasets.select(range(i, min(i + 1000, len(raw_datasets))))

    # Tokenize the chunk
    tokenized_chunk = chunk.map(tokenize_function, batched=True)

    # Save the tokenized chunk to disk (save_to_disk writes a directory per chunk)
    tokenized_chunk.save_to_disk(os.path.join(output_dir, f"chunk_{i // 1000}"))

Instead of this line (but, of course, keep the old map params):

dataset = dataset.map(preprocess_func, batched=True, remove_columns=column_names, **kwargs)

And then load the train dataset and eval dataset from disk with streaming=True.
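
A possible way to stream those saved chunks back, as a sketch only: the generic "arrow" loader in datasets can read the Arrow files that save_to_disk writes, though the exact file layout and naming may differ across datasets versions:

from datasets import load_dataset

# The glob assumes save_to_disk wrote one directory per chunk containing
# "data-*.arrow" shards; adjust the pattern to the actual on-disk layout.
streamed_train = load_dataset(
    "arrow",
    data_files={"train": "tokenized_dataset/chunk_*/data-*.arrow"},
    split="train",
    streaming=True,
)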

@hiyouga
Owner

hiyouga commented Sep 1, 2024

@Mihaiii LlamaFactory also supports streaming when you specify streaming: true

@zuishusheng
Author

@Mihaiii LlamaFactory also supports streaming when you specify streaming: true

There may be a problem: when streaming mode is used, the dataset type is 'IterableDataset', but the 'map' function has trouble resolving the sharegpt-format JSON file.
(screenshot of data/aligner.py)
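
One hedged guess at why this breaks: with streaming, the column names of an IterableDataset are not always known up front, so code that derives remove_columns from dataset.column_names can end up with None. A small sketch of that behaviour (the JSON path is a placeholder):

from datasets import load_dataset

ds = load_dataset("json", data_files="sharegpt_data.json", split="train", streaming=True)
print(ds.column_names)        # may be None for a streamed JSON dataset
print(next(iter(ds)).keys())  # reading one example reveals the actual fields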

@huynhbaobk

huynhbaobk commented Sep 2, 2024

I solved the problem by setting streaming: true in the config YAML.
In aligner.py, the align_dataset function passes remove_columns=column_names, which removes the images field from the final dataset. So I tried to keep the images column by adding the following at line 210:

column_names = list(next(iter(dataset)).keys())
column_names.remove("images")  # list.remove() works in place and returns None, so don't reassign
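
A small aside on that snippet: since list.remove() mutates the list in place, an equivalent form that avoids mutation altogether would be a comprehension (sketch only):

# Keep every field except "images" when building the remove_columns list.
column_names = [name for name in next(iter(dataset)).keys() if name != "images"]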

@zuishusheng
Author

I solved the problem by setting streaming: true in the config YAML. In aligner.py, the align_dataset function passes remove_columns=column_names, which removes the images field from the final dataset. So I tried to keep the images column by adding the following at line 210:

column_names = list(next(iter(dataset)).keys())
column_names.remove("images")  # list.remove() works in place and returns None, so don't reassign

When I set "column_names" this way, processing can continue, but it still times out, and it seems all of the data has to be tokenized before training.
(screenshot: Snipaste_2024-09-02_16-17-25)
Do you have any idea?

@huynhbaobk

I solved the problem by setting streaming: true in the config YAML. In aligner.py, the align_dataset function passes remove_columns=column_names, which removes the images field from the final dataset. So I tried to keep the images column by adding the following at line 210:

column_names = list(next(iter(dataset)).keys())
column_names.remove("images")  # list.remove() works in place and returns None, so don't reassign

When I set "column_names" this way, processing can continue, but it still times out, and it seems all of the data has to be tokenized before training. (screenshot: Snipaste_2024-09-02_16-17-25) Do you have any idea?

I have the same problem as you, and I still don't know how to fix it. I can run the model with a small sample of around 1,500 samples, but it takes a lot of RAM to load the dataset. If I use streaming, it still gets stuck with this error:

File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 318, in forward
    q = apply_rotary_pos_emb_vision(q.unsqueeze(0), rotary_pos_emb).squeeze(0)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 204, in apply_rotary_pos_emb_vision
    output = (tensor * cos) + (rotate_half(tensor) * sin)
RuntimeError: The size of tensor a (2) must match the size of tensor b (1512) at non-singleton dimension 1

@hiyouga could you help us?

@hiyouga
Owner

hiyouga commented Sep 3, 2024

fixed in #5346

@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Sep 3, 2024