Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

POC: multiple model/configuration DeepSpeed support #3097

Merged
merged 65 commits into from
Sep 13, 2024

Conversation

muellerzr
Copy link
Collaborator

@muellerzr muellerzr commented Sep 10, 2024

Multiple model/multiple configuration support for DeepSpeed

What does this add?

While we await for potentially rewriting much of the deepspeed (DS) implementation to get rid of the engine mechanic, this PR provides a way for users to use multiple deepspeed plugins (potentially with a single accelerator) to have different models prepared different ways/using different DS configurations.

Who is it for?

Solves #2496 (Finally)

Why is it needed?

There are many cases where users may want to use different DS configs/prepare multiple models when using DS. One such case is TRL where we may want an untrained model distributed in Zero3, while another in Zero2.

Note: for the case of multiple models being trained, a user should create a second Accelerator object (since we require the prepared engine to do backward()). Since the plugins are stored in the state, you don't need to re-pass in the plugins just create a secondary Accelerator.

What parts of the API does this impact?

User-facing:

Users may now pass in a list of DeepSpeedPlugin's to the Accelerator. The first plugin will be the "default" (so if you don't set an active plugin, this is what's used during all calls to .prepare()). When a user wants a different configuration, simply call plugin.enable() and this will in-turn setup that particular plugin to be the one accelerator.prepare will call (see examples below)

Internal structure:

  • Accelerator/AcceleratorState now stores a list of DeepSpeedPlugins
  • get_active_deepspeed_plugin will go through and find of the list of DeepSpeedPlugins which one is enabled.
  • Only a singular DS plugin should be "active"
  • We always try and setup the cached transformers plugin since we may constantly be swapping out its internals being used

Basic Usage Example(s):

from accelerate import Accelerator, DeepSpeedPlugin
plugin1 = DeepSpeedPlugin(config_file_a)
plugin2 = DeepSpeedPlugin(config_file_b)

plugins = {"student": plugin1, "teacher": plugin2}

accelerator = Accelerator(deepspeed_plugin=plugins)
# The ones we're training on
model, optimizer, scheduler = accelerator.prepare(model, optimizer, scheduler)
# Our inference/secondary config
accelerator.state.enable_deepspeed_plugin("teacher")
model2 = accelerator.prepare(model2)

Anticipated maintenance burden? (What will happen in say, 3 months if something changes)

Ideally this stays how it is, and we can remove that secondary/n-model Accelerator's later once we've worked with the DS team to make the DS api more pytorch-ic

Who can review?

@BenjaminBossan
@SunMarc
@stas00

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

src/accelerate/accelerator.py Outdated Show resolved Hide resolved
src/accelerate/accelerator.py Outdated Show resolved Hide resolved
src/accelerate/accelerator.py Outdated Show resolved Hide resolved
src/accelerate/utils/deepspeed.py Outdated Show resolved Hide resolved
src/accelerate/accelerator.py Outdated Show resolved Hide resolved
src/accelerate/utils/dataclasses.py Outdated Show resolved Hide resolved
src/accelerate/utils/deepspeed.py Outdated Show resolved Hide resolved
Copy link
Contributor

@stas00 stas00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing: I think you want an actual test where you do fwd/bwd with 2 models. This is insufficient to test that it works correctly IMHO.

tests/deepspeed/test_deepspeed_multiple_model.py Outdated Show resolved Hide resolved
tests/deepspeed/test_deepspeed_multiple_model.py Outdated Show resolved Hide resolved
@muellerzr muellerzr linked an issue Sep 12, 2024 that may be closed by this pull request
4 tasks
Copy link
Member

@SunMarc SunMarc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the huge work @muellerzr ! Mostly went through the examples and the tutorial. Just in case (even though dict are ordered for python >=3.7), I think we should advise the user to always enable the deepspeed plugin before preparing the model accelerator_0.state.enable_deepspeed_plugin("model1").

Comment on lines 162 to 163
accelerator_0 = Accelerator(deepspeed_plugin=deepspeed_plugins)
accelerator_1 = Accelerator()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if you pass again a deepspeed_plugin in the second accelerator ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, now we'll raise an error :)

Comment on lines 143 to 144
# Run training loop
accelerator.state.enable_deepspeed_plugin("training")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we forget to enable the training plugin during the training, what happens ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the accelerate side, not particularly much/anything at all, since it's mostly for deepspeed.init(). But for the transformers side eventually I believe this should run into some errors between zero2 and zero3, since with zero3 we need to gather params, and this sets a flag that tells transformers we're using zero3 right now.

Comment on lines 201 to 202

zero3_accelerator = Accelerator()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add a quick comment that this accelerator will share the same AcceleratorState as zero2_accelerator, hence we don't need to pass deepspeed_plugins. However, it doesn't hare mixed_precision no ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For rn we assume the same mixed precision as noted earlier, will look towards configuring multiple precision types later

deepspeed_plugin.enable()
else:
for plugin in deepspeed_plugin.values():
plugin.set_mixed_precision(mixed_precision)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can two deepspeed plugins have different mixed precision ? Not sure about the use case also, just asking

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I suppose technically they could, e.g. one has the model in bf16 and another in float32.

A direct use case would be the zero3 model is in fp8 vs the trained model in bf16 (te fp8 specifically).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though at first glance... quite a challenge. I'll leave this to a future PR/enhancement

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that would require 2 accelerate configs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, which rn we don't/can't do

src/accelerate/accelerator.py Outdated Show resolved Hide resolved
docs/source/usage_guides/deepspeed_multiple_model.md Outdated Show resolved Hide resolved
docs/source/usage_guides/deepspeed_multiple_model.md Outdated Show resolved Hide resolved
docs/source/usage_guides/deepspeed_multiple_model.md Outdated Show resolved Hide resolved
docs/source/usage_guides/deepspeed_multiple_model.md Outdated Show resolved Hide resolved
docs/source/usage_guides/deepspeed_multiple_model.md Outdated Show resolved Hide resolved
docs/source/usage_guides/deepspeed_multiple_model.md Outdated Show resolved Hide resolved
```python
from accelerate import Accelerator

accelerator = Accelerator(deepspeed_plugin=deepspeed_plugins)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be fixed to deepspeed_plugins as the key, otherwise this is a poor API.

Perhaps for BC it could be either?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a DS bug due to sanity checking never getting updated for inference. Please share a repro and issue. Thanks!

docs/source/usage_guides/deepspeed_multiple_model.md Outdated Show resolved Hide resolved
@stas00
Copy link
Contributor

stas00 commented Sep 12, 2024

@muellerzr, where is zero3_init_flag: true (accelerate config) treated in this PR? As surely it's possible not all models will want the same treatment.

@muellerzr
Copy link
Collaborator Author

muellerzr commented Sep 12, 2024

@stas00 it's done in that tweak to when we create the HF config file (notice how it's done always, rather than just under zero3). We then update the deepspeed config reference transformers looks at. See here: https://github.com/huggingface/accelerate/pull/3097/files#diff-bcbdd609996df7224cc9c21ecd97092122d45263d1323892474021d510ec1eefR1199-R1224

And then the first thing enable does is set the reference on that plugin

"A wrapped DeepSpeed engine reference is currently tied for this `Accelerator()` instance.",
captured.output[0],
)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just flagging that this test is missing the crucial part of actually running training.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, that's for the later test/the script

Copy link
Member

@stevhliu stevhliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool use case examples! Left some suggestions to reduce wordiness and be more direct, and made it clearer that the sections correspond to these use cases :)

docs/source/usage_guides/deepspeed_multiple_model.md Outdated Show resolved Hide resolved
docs/source/usage_guides/deepspeed_multiple_model.md Outdated Show resolved Hide resolved
docs/source/usage_guides/deepspeed_multiple_model.md Outdated Show resolved Hide resolved
docs/source/usage_guides/deepspeed_multiple_model.md Outdated Show resolved Hide resolved
docs/source/usage_guides/deepspeed_multiple_model.md Outdated Show resolved Hide resolved
docs/source/usage_guides/deepspeed_multiple_model.md Outdated Show resolved Hide resolved
docs/source/usage_guides/deepspeed_multiple_model.md Outdated Show resolved Hide resolved
docs/source/usage_guides/deepspeed_multiple_model.md Outdated Show resolved Hide resolved
docs/source/usage_guides/deepspeed_multiple_model.md Outdated Show resolved Hide resolved
docs/source/usage_guides/deepspeed_multiple_model.md Outdated Show resolved Hide resolved
@muellerzr muellerzr linked an issue Sep 12, 2024 that may be closed by this pull request
Copy link
Member

@lewtun lewtun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really cool feature @muellerzr - this is going to unlock a ton of cool use cases for post-training LLMs! I only commented on the docs from the perspective of a user - overall looks great 🔥

docs/source/usage_guides/deepspeed_multiple_model.md Outdated Show resolved Hide resolved
docs/source/usage_guides/deepspeed_multiple_model.md Outdated Show resolved Hide resolved
docs/source/usage_guides/deepspeed_multiple_model.md Outdated Show resolved Hide resolved
docs/source/usage_guides/deepspeed_multiple_model.md Outdated Show resolved Hide resolved
docs/source/usage_guides/deepspeed_multiple_model.md Outdated Show resolved Hide resolved
```python
from accelerate.utils import DeepSpeedPlugin

zero2_plugin = DeepSpeedPlugin(hf_ds_config="zero2_config.json")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to document somewhere if/how accelerate launch works when two configs are needed. As a user, I rely heavily on accelerate launch --config_file=/path/to/deepspeed_zero{1,2,3}.yaml and I wonder if we can enable something like:

# comma separated?
accelerate launch --config_file=deepspeed_zero2.yaml,deepspeed_zero3.yaml

And then the training code assumes the first plugin comes from the first config, etc

Copy link
Collaborator Author

@muellerzr muellerzr Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't do that currently, it is solely based on creating DeepSpeedPlugin's manually. (at least for the second one)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most probable, setting up the following:

deepspeed_configs:
    config_1:
        deepspeed_config_file: ...
        name: ...
    config_2:
        deepspeed_config_file: ...
        name: ...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, having a "master" config file would be nice to have. That way I can have my Z2/Z3 configs fixed and just toggle their use based on the task at hand

docs/source/usage_guides/deepspeed_multiple_model.md Outdated Show resolved Hide resolved
docs/source/usage_guides/deepspeed_multiple_model.md Outdated Show resolved Hide resolved
@muellerzr muellerzr merged commit e9e5a73 into main Sep 13, 2024
28 checks passed
@muellerzr muellerzr deleted the muellerzr-multiple-model-deepspeed branch September 13, 2024 11:28
loadams added a commit to microsoft/DeepSpeed that referenced this pull request Sep 25, 2024
Adding the new tests in
huggingface/accelerate#3097 caused the
nv-accelerate-v100 tests to fail. Due to other CI issues we didn't
notice this at first. This just skips the problematic test for now.

cc: @stas00 / @muellerzr
@lzy37ld
Copy link

lzy37ld commented Dec 17, 2024

Thanks for this great effort!

I just have a question here (might be trivial): I still can not understand why training two disjoint models needs two accelerators while training the student model (teacher model is fixed) only needs one accelerator.

I see that each example would call the .prepare() function twice. From my current shallow understanding, the example about training two models need two optimizers. Is that the reason we need two accelerators?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
9 participants