
[exp] Lazyload for multimodal inputs #5346

Merged
merged 1 commit on Sep 3, 2024
Conversation

hiyouga (Owner) commented Sep 3, 2024

What does this PR do?

Fixes #5308
Fixes #5331

In this PR, we moved the image processor's processing step from the dataset pre-processing phase to the training phase, which saves disk space for cached datasets. However, it may reduce training throughput on machines with weak CPUs.
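
A minimal sketch of the lazy-loading idea (not the exact implementation in this PR): the cached dataset keeps only lightweight references such as image paths, and the image processor runs inside the data collator when a batch is built. The field names (text, image_path) are illustrative.

from dataclasses import dataclass
from typing import Any

from PIL import Image


@dataclass
class LazyMultimodalCollator:
    # `processor` is any Hugging Face multimodal processor, e.g. one loaded
    # via AutoProcessor.from_pretrained(...); names here are placeholders.
    processor: Any

    def __call__(self, features: list[dict]) -> dict:
        texts = [feature["text"] for feature in features]
        # Images are opened and processed here, at training time,
        # rather than during dataset pre-processing, so nothing heavy
        # is written to the dataset cache on disk.
        images = [Image.open(feature["image_path"]).convert("RGB") for feature in features]
        return self.processor(text=texts, images=images, padding=True, return_tensors="pt")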

Moreover, we introduced a preprocessing_batch_size argument to control the batch size during the pre-processing phase, which helps prevent the process from hanging.
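
For illustration only (the exact wiring inside the code base is assumed here), this argument corresponds to the batch size passed to the batched datasets map call; the data file, column name, and tokenizer below are placeholders.

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer

def preprocess_fn(batch):
    # Tokenize text only; image processing is deferred to training time.
    return tokenizer(batch["text"], truncation=True)

dataset = load_dataset("json", data_files="your_dataset.json", split="train")
dataset = dataset.map(
    preprocess_fn,
    batched=True,
    batch_size=128,  # the value set by preprocessing_batch_size
)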

To use dataset streaming for multimodal datasets, the best practice is:

dataset: your_dataset
buffer_size: 128
preprocessing_batch_size: 128
streaming: true
accelerator_config:
  dispatch_batches: false

dispatch_batches: false is necessary for multimodal datasets, since the batch size of the images may differ from that of the input IDs.

https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.accelerator_config
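
For reference, the same accelerator setting expressed directly through transformers' TrainingArguments (a minimal sketch; output_dir is a placeholder):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",  # placeholder
    accelerator_config={"dispatch_batches": False},
)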

aliencaocao (Contributor) commented

Is there any way to restore this behaviour? Like an option somewhere?

Labels: solved (this problem has already been solved)