[Enhancement] Support ZeRO-3 when using BAdam #4352
Conversation
@@ -371,6 +371,12 @@ def _create_badam_optimizer(
         dict(params=decay_params, weight_decay=training_args.weight_decay),
     ]

+    ds_zero3_enabled = False
+    if hasattr(training_args, "deepspeed_plugin") and training_args.deepspeed_plugin is not None:
Why not use `from transformers.integrations import is_deepspeed_zero3_enabled`?
Thanks for the suggestion, I have changed it to use `is_deepspeed_zero3_enabled` for cleaner expressions.
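For context, a minimal sketch of what the suggested change amounts to, reusing the `ds_zero3_enabled` flag from the diff above; this is illustrative, not the exact PR code:

```python
from transformers.integrations import is_deepspeed_zero3_enabled

# Before (as in the diff): probe training_args for an attached DeepSpeed plugin.
# ds_zero3_enabled = False
# if hasattr(training_args, "deepspeed_plugin") and training_args.deepspeed_plugin is not None:
#     ...

# After (suggested): let transformers report whether a ZeRO-3 config is active.
ds_zero3_enabled = is_deepspeed_zero3_enabled()
```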
LGTM
What does this PR do?
This PR enables the BAdam algorithm to use model parallelism, based on the DeepSpeed ZeRO-3 implementation.
A sample script is provided in "examples/extras/badam/train_zero3.sh". The command generated by the current web UI works correctly as well.
When training Llama 3-8B on the "alpaca_en_demo" dataset with batch size 1, the maximum per-device allocated memory is about 13/10/8 GB when training with 2/3/4 RTX 3090 GPUs, respectively. I suppose it would be feasible to finetune a Llama 3-70B model given 8 RTX 3090s or 3 A100-80G GPUs using BAdam, though I haven't conducted a comprehensive test due to limited computation resources.
The main change in the code is to add a BAdam callback during the Trainer's initialization when `use_badam` and ZeRO-3 mode are both detected. Thanks for the review!
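As a rough illustration of that wiring (names like `BAdamCallback` and `finetuning_args.use_badam` are assumptions for this sketch, not necessarily the identifiers used in the PR):

```python
from transformers.integrations import is_deepspeed_zero3_enabled

def maybe_add_badam_callback(finetuning_args, callbacks: list) -> list:
    # Register the BAdam callback only when BAdam is requested and the run
    # uses DeepSpeed ZeRO-3, matching the condition described above.
    # "from badam import BAdamCallback" is a hypothetical import path.
    if getattr(finetuning_args, "use_badam", False) and is_deepspeed_zero3_enabled():
        from badam import BAdamCallback
        callbacks.append(BAdamCallback())
    return callbacks
```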
Before submitting