Running tokenizer on dataset hangs indefinitely, then subprocesses abruptly die during map operation #5308
Comments
I've hit this problem too. It looks like "Running tokenizer on dataset" takes too long, so multi-node communication waits past its limit (the ddp_timeout parameter) and the job eventually dies. The only options are probably to increase preprocessing_num_workers or to preprocess the data in advance? After reading several related issues, that seems to be the only way. |
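For reference, the ddp_timeout option mentioned above is, as far as I know, the field of the same name on transformers' TrainingArguments, which sets the collective-communication timeout in seconds. A minimal sketch of raising it (the value below is illustrative):

```python
from transformers import TrainingArguments

# Raise the DDP timeout so that slow preprocessing on one rank does not
# kill the whole multi-node job. The default is 1800 seconds.
args = TrainingArguments(
    output_dir="./output",
    ddp_timeout=18000,  # seconds; illustrative value
)
```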
I'm using 4 H100 machines and still hit this problem with preprocessing_num_workers set to 512. It works after shrinking the images, so the issue seems to occur when tokenizing extremely long inputs. |
Your images may be too large. I trained LLaVA on 8 V100s with a bit over 20k image-text pairs and relatively small images, and it was still very slow with preprocessing_num_workers set to 128; your scale is probably much larger than mine. Note that multi-process tokenization only uses the CPU, not the GPU. Another option is to set the dataset loading mode to streaming: true, but in my tests GPU utilization stays low with that setting because training spends most of its time loading the data for the current iteration. Perhaps the only option is to look at how libraries like xtuner handle this. |
I got this problem when I tried to increase max_samples. Any idea how to solve it?
+1 same issue. |
try removing the preprocessing_num_workers option |
@hiyouga I already removed preprocessing_num_workers from the YAML file, and that works. But when I try increasing max_samples, RAM overflows. I'm using qwen2vl_lora_sft.yaml. |
^ Same for me. I tried making it save to disk from time to time (as arrow files), but then I realised that even so, the whole dataset is still expected to be loaded into RAM for training - is that really necessary? I'm also trying to finetune qwen2 vl 7b. |
@Mihaiii Do you have any solution for the problem? |
Apparently there's a streaming param (https://huggingface.co/docs/datasets/v2.21.0/stream) that is made for this use case (i.e., load the dataset in chunks after it has been tokenized and saved on disk as arrow files), but I erased my temp disk with the training data and gave up on my fine-tune project for the moment, so I can't try it. |
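For context, a minimal sketch of that datasets streaming mode. The data file, the "text" column, and the tokenizer checkpoint below are all placeholders:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# streaming=True returns an IterableDataset: examples are read and tokenized
# lazily, so the full dataset never has to fit in RAM.
dataset = load_dataset("json", data_files="train.json", split="train", streaming=True)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")  # illustrative checkpoint

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True)

# map() on an IterableDataset is also lazy; nothing is computed until iteration.
tokenized = dataset.map(tokenize_function, batched=True)

for example in tokenized.take(4):
    print(len(example["input_ids"]))
```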
@huynhbaobk so I would first try to save in chunks (example generated by ChatGPT):

```python
import os

from datasets import Dataset

# Assume 'raw_datasets' is your original dataset and 'tokenizer' is an
# already-loaded tokenizer.

# Directory to save the tokenized dataset in chunks
output_dir = "tokenized_dataset"

# Create the directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding=True)

# Process and save the dataset in chunks of 1000 samples
for i in range(0, len(raw_datasets), 1000):
    # Slice the dataset into a chunk of up to 1000 samples
    chunk = raw_datasets.select(range(i, min(i + 1000, len(raw_datasets))))
    # Tokenize the chunk
    tokenized_chunk = chunk.map(tokenize_function, batched=True)
    # Save the tokenized chunk to disk (save_to_disk writes one directory per chunk)
    tokenized_chunk.save_to_disk(os.path.join(output_dir, f"chunk_{i // 1000}"))
```

Instead of this line (but, of course, keep the old map params):

LLaMA-Factory/src/llamafactory/data/loader.py Line 183 in c87023d
And then load the train dataset and eval dataset from disk with streaming enabled. |
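For completeness, a minimal sketch of loading those saved chunks back from disk, assuming the chunk directory naming from the snippet above. load_from_disk memory-maps the Arrow data rather than copying it into RAM:

```python
import os

from datasets import concatenate_datasets, load_from_disk

output_dir = "tokenized_dataset"

# Collect the chunk directories written by save_to_disk, in numeric order.
chunk_dirs = sorted(
    [os.path.join(output_dir, name) for name in os.listdir(output_dir) if name.startswith("chunk_")],
    key=lambda path: int(path.rsplit("_", 1)[-1]),
)

# Each chunk is memory-mapped from its Arrow files, so concatenation stays cheap on RAM.
train_dataset = concatenate_datasets([load_from_disk(d) for d in chunk_dirs])

# Optionally expose it as an iterable (streaming-style) dataset for training.
train_iterable = train_dataset.to_iterable_dataset()
```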
@Mihaiii LlamaFactory also supports streaming when you specify streaming: true in the config |
I solved the problem by setting |
I have the same problem and still don't know how to fix it. I can run the model with a small sample of around 1,500 examples, but loading the dataset takes a lot of RAM. If I use streaming, it still gets stuck with an error:
@hiyouga could you help us? |
fixed in #5346 |
Reminder
System Info
Reproduction
The dataset is about 30k image-text pairs in sharegpt format, with 2048*2048 images.
With preprocessing_num_workers set to 256 (or 128, 64, etc.), the run stalls at "Running tokenizer on dataset", and after a long wait it fails with "One of the subprocesses has abruptly died during map operation".
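For readers unfamiliar with that error message, it is raised by datasets when a worker process dies during a multi-process map. A minimal, self-contained sketch of the call shape involved, using toy data and a toy function rather than the actual LLaMA-Factory preprocessing:

```python
from datasets import Dataset

# Toy stand-in for the real preprocessing function; the actual one also encodes
# large images, which is what makes each worker slow and memory-hungry.
def preprocess(examples):
    return {"length": [len(t) for t in examples["text"]]}

dataset = Dataset.from_dict({"text": ["example text"] * 10_000})

# The reported hang/crash happens inside a multi-process map of this shape;
# num_proc corresponds to preprocessing_num_workers.
dataset = dataset.map(
    preprocess,
    batched=True,
    num_proc=8,
    desc="Running tokenizer on dataset",
)
```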
Expected behavior
No response
Others
No response