[exp] Lazyload for multimodal inputs #5346

hiyouga · 2024-09-03T18:43:34Z

What does this PR do?

This PR moves the image processor step from the pre-processing phase to the training phase, which saves disk space for cached datasets. The trade-off is that training throughput may drop on machines with a slow CPU.
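The difference between the two phases can be sketched roughly as follows. This is an illustration only: `fake_image_processor`, `preprocess_eager`, `preprocess_lazy`, and `collate_lazy` are hypothetical names that do not come from this PR, and a list of floats stands in for real pixel values.

```python
from typing import Dict, List


def fake_image_processor(path: str) -> List[float]:
    """Hypothetical stand-in for an image processor; returns a large 'pixel' array."""
    return [float(len(path))] * 1024


def preprocess_eager(sample: Dict[str, str]) -> Dict[str, object]:
    # Eager: pixel values are computed during pre-processing,
    # so they end up in the cached dataset on disk.
    return {"text": sample["text"], "pixel_values": fake_image_processor(sample["image"])}


def preprocess_lazy(sample: Dict[str, str]) -> Dict[str, str]:
    # Lazy: only the image reference is kept, so the cache stays small.
    return {"text": sample["text"], "image": sample["image"]}


def collate_lazy(batch: List[Dict[str, str]]) -> Dict[str, list]:
    # The image processor runs here, at training time, once per batch.
    # This is the CPU work that can limit throughput.
    return {
        "text": [s["text"] for s in batch],
        "pixel_values": [fake_image_processor(s["image"]) for s in batch],
    }


samples = [
    {"text": "describe the image", "image": "a.jpg"},
    {"text": "what is shown?", "image": "bb.jpg"},
]
eager_cache = [preprocess_eager(s) for s in samples]
lazy_cache = [preprocess_lazy(s) for s in samples]
batch = collate_lazy(lazy_cache)
```

In this sketch the lazy cache holds only text and image paths, while the batch built at training time carries the same pixel values the eager path would have cached.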

We also introduce a preprocessing_batch_size argument that controls the batch size during the pre-processing phase, which helps prevent the process from hanging.

To use dataset streaming for multimodal datasets, the best practice is:

dataset: your_dataset
buffer_size: 128
preprocessing_batch_size: 128
streaming: true
accelerator_config:
  dispatch_batches: false

dispatch_batches: false is necessary for multimodal datasets since the batch size of images may differ from that of input ids.

https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.accelerator_config
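One way to picture the problem, as a rough sketch: when batches are dispatched from the main process, each field of the batch is split along its first dimension across processes, which only works if every field has the same first dimension. The toy `split_batch` below is a hypothetical simplification and not Accelerate's actual implementation; plain lists stand in for tensors.

```python
from typing import Dict, List


def split_batch(batch: Dict[str, list], num_processes: int) -> List[Dict[str, list]]:
    """Hypothetical simplification: slice every field using the input_ids batch size."""
    batch_size = len(batch["input_ids"])
    per_process = batch_size // num_processes
    shards = []
    for rank in range(num_processes):
        start, end = rank * per_process, (rank + 1) * per_process
        # Every field is sliced with the same indices, whatever its own length is.
        shards.append({key: value[start:end] for key, value in batch.items()})
    return shards


# Four text samples but six images: sample 0 has three images, the others one each.
batch = {
    "input_ids": [[1, 2], [3, 4], [5, 6], [7, 8]],
    "pixel_values": ["img0a", "img0b", "img0c", "img1", "img2", "img3"],
}
shards = split_batch(batch, num_processes=2)

# Count how many images survive the split.
images_after_split = sum(len(shard["pixel_values"]) for shard in shards)
```

Here the first shard receives two of sample 0's three images and none of sample 1's, and two images are dropped entirely. Letting each process load its own batches, as `dispatch_batches: false` does, avoids this kind of slicing.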

Before submitting

Did you read the contributor guideline?
Did you write any new necessary tests?

aliencaocao · 2024-10-12T08:19:11Z

Is there any way to restore this behaviour? Like an option somewhere?

lazy image load

47ea97f

hiyouga temporarily deployed to tests September 3, 2024 18:43 — with GitHub Actions Inactive

hiyouga merged commit ce7ed6e into main Sep 3, 2024
1 check passed

This was referenced Sep 3, 2024

"Running tokenizer on dataset" keeps hanging, then "subprocesses has abruptly died during map operation" #5308

Closed

Qwen2VL SFT on a large dataset: pyarrow raises an error, and using jsonl, splitting into smaller datasets, or streaming does not solve it #5331

Closed

hiyouga deleted the lazy_image branch September 3, 2024 19:04

hiyouga added the solved This problem has been already solved label Sep 3, 2024

ddddddreamcastle mentioned this pull request Sep 4, 2024

Running the mllm_demo example from examples with streaming data loading raises "Cannot find valid samples" #5353

Closed

1 task
