POC: multiple model/configuration DeepSpeed support #3097
Conversation
Testing: I think you want an actual test where you do fwd/bwd with 2 models. This is insufficient to test that it works correctly IMHO.
Thanks for the huge work @muellerzr! Mostly went through the examples and the tutorial. Just in case (even though dicts are ordered for Python >= 3.7), I think we should advise the user to always enable the deepspeed plugin before preparing the model, e.g. `accelerator_0.state.enable_deepspeed_plugin("model1")`.
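For instance, a minimal sketch of that ordering, using names from this thread (`model1` stands in for whichever model that plugin governs):

```python
# Select the named plugin first, then prepare the model it applies to.
accelerator_0.state.enable_deepspeed_plugin("model1")
model1 = accelerator_0.prepare(model1)
```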
```python
accelerator_0 = Accelerator(deepspeed_plugin=deepspeed_plugins)
accelerator_1 = Accelerator()
```
What happens if you pass a `deepspeed_plugin` again in the second accelerator?
Well, now we'll raise an error :)
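A hedged illustration of that behaviour (the exact exception type is an assumption, not taken from the PR):

```python
accelerator_0 = Accelerator(deepspeed_plugin=deepspeed_plugins)
try:
    # The shared state already holds the plugins, so re-passing them is rejected.
    accelerator_1 = Accelerator(deepspeed_plugin=deepspeed_plugins)
except ValueError as err:  # assumed exception type
    print(f"Re-passing deepspeed plugins raises: {err}")
```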
```python
# Run training loop
accelerator.state.enable_deepspeed_plugin("training")
```
If we forget to enable the training plugin during training, what happens?
On the accelerate side, not particularly much/anything at all, since it's mostly for `deepspeed.init()`. But on the transformers side I believe this should eventually run into some errors between zero2 and zero3, since with zero3 we need to gather params, and enabling the plugin sets a flag that tells transformers we're using zero3 right now.
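For context, a minimal sketch of the flag being referred to (assuming a recent transformers layout; the import path has moved over time):

```python
# transformers consults a process-wide flag to decide whether ZeRO-3
# parameter gathering is needed; the active deepspeed plugin is what sets it.
from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled

if is_deepspeed_zero3_enabled():
    # parameters are sharded across ranks and must be gathered before use
    ...
```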
```python
zero3_accelerator = Accelerator()
```
Let's add a quick comment that this accelerator will share the same `AcceleratorState` as `zero2_accelerator`, hence we don't need to pass `deepspeed_plugins`. However, it doesn't share `mixed_precision`, no?
For right now we assume the same mixed precision, as noted earlier; we'll look towards configuring multiple precision types later.
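In code, the requested comment might look like this sketch (variable names follow the diff above):

```python
zero2_accelerator = Accelerator(deepspeed_plugin=deepspeed_plugins)
# Shares the same AcceleratorState (and thus the registered plugins) as
# zero2_accelerator, so no deepspeed plugins need to be passed again.
zero3_accelerator = Accelerator()
```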
```python
    deepspeed_plugin.enable()
else:
    for plugin in deepspeed_plugin.values():
        plugin.set_mixed_precision(mixed_precision)
```
Can two deepspeed plugins have different mixed precision? Not sure about the use case either, just asking.
Good question. I suppose technically they could, e.g. one has the model in bf16 and another in float32.
A direct use case would be the zero3 model in fp8 vs. the trained model in bf16 (`te` fp8 specifically).
Though at first glance... quite a challenge. I'll leave this to a future PR/enhancement
that would require 2 accelerate configs.
Yep, which right now we don't/can't do.
```python
from accelerate import Accelerator

accelerator = Accelerator(deepspeed_plugin=deepspeed_plugins)
```
Can this be fixed to `deepspeed_plugins` as the key? Otherwise this is a poor API. Perhaps for BC it could be either?
This is a DS bug due to sanity checking never getting updated for inference. Please share a repro and issue. Thanks!
@muellerzr, where is …
@stas00 it's done in that tweak to when we create the HF config file (notice how it's done always, rather than just under zero3). We then update the deepspeed config reference transformers looks at. See here: https://github.com/huggingface/accelerate/pull/3097/files#diff-bcbdd609996df7224cc9c21ecd97092122d45263d1323892474021d510ec1eefR1199-R1224 And then the first thing …
src/accelerate/test_utils/scripts/external_deps/test_ds_multiple_model.py
"A wrapped DeepSpeed engine reference is currently tied for this `Accelerator()` instance.", | ||
captured.output[0], | ||
) | ||
|
just flagging that this test is missing the crucial part of actually running training.
Correct, that's for the later test/the script
Very cool use case examples! Left some suggestions to reduce wordiness and be more direct, and made it clearer that the sections correspond to these use cases :)
Really cool feature @muellerzr - this is going to unlock a ton of cool use cases for post-training LLMs! I only commented on the docs from the perspective of a user - overall looks great 🔥
```python
from accelerate.utils import DeepSpeedPlugin

zero2_plugin = DeepSpeedPlugin(hf_ds_config="zero2_config.json")
```
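Continuing that snippet, a sketch of how the two plugins might be grouped and handed to a single `Accelerator`; the second config path and the dict keys are illustrative assumptions:

```python
from accelerate import Accelerator

# Hypothetical second config; group both plugins under user-chosen names.
zero3_plugin = DeepSpeedPlugin(hf_ds_config="zero3_config.json")
deepspeed_plugins = {"zero2": zero2_plugin, "zero3": zero3_plugin}

accelerator = Accelerator(deepspeed_plugin=deepspeed_plugins)
```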
Would be good to document somewhere if/how `accelerate launch` works when two configs are needed. As a user, I rely heavily on `accelerate launch --config_file=/path/to/deepspeed_zero{1,2,3}.yaml` and I wonder if we can enable something like:
```bash
# comma separated?
accelerate launch --config_file=deepspeed_zero2.yaml,deepspeed_zero3.yaml
```
And then the training code assumes the first plugin comes from the first config, etc
We can't do that currently; it is solely based on creating `DeepSpeedPlugin`s manually (at least for the second one).
Most probably, setting up the following:

```yaml
deepspeed_configs:
  config_1:
    deepspeed_config_file: ...
    name: ...
  config_2:
    deepspeed_config_file: ...
    name: ...
```
Indeed, having a "master" config file would be nice to have. That way I can have my Z2/Z3 configs fixed and just toggle their use based on the task at hand
Adding the new tests in huggingface/accelerate#3097 caused the nv-accelerate-v100 tests to fail. Due to other CI issues we didn't notice this at first. This just skips the problematic test for now. cc: @stas00 / @muellerzr
Thanks for this great effort! I just have a question here (might be trivial): I still cannot understand why training two disjoint models needs two accelerators, while training the student model (with the teacher model fixed) only needs one accelerator. I see that each example would call the …
Multiple model/multiple configuration support for DeepSpeed
What does this add?
While we await potentially rewriting much of the deepspeed (DS) implementation to get rid of the `engine` mechanic, this PR provides a way for users to use multiple deepspeed plugins (potentially with a single accelerator) to have different models prepared in different ways/using different DS configurations.

Who is it for?
Solves #2496 (Finally)
Why is it needed?
There are many cases where users may want to use different DS configs/prepare multiple models when using DS. One such case is TRL, where we may want an untrained model distributed in Zero3 while another is in Zero2.
Note: for the case of multiple models being trained, a user should create a second `Accelerator` object (since we require the prepared engine to do `backward()`). Since the plugins are stored in the `state`, you don't need to re-pass in the plugins; just create a secondary `Accelerator`.

What parts of the API does this impact?
User-facing:
Users may now pass in a list of `DeepSpeedPlugin`s to the `Accelerator`. The first plugin will be the "default" (so if you don't set an active plugin, this is what's used during all calls to `.prepare()`). When a user wants a different configuration, simply call `plugin.enable()` and this will in turn set up that particular plugin to be the one `accelerator.prepare` will use (see examples below).

Internal structure:
`Accelerator`/`AcceleratorState` now stores a list of `DeepSpeedPlugin`s. `get_active_deepspeed_plugin` goes through that list and finds which one is enabled.

Basic Usage Example(s):
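A minimal sketch of the flow, pieced together from the snippets quoted in the review thread; the config paths, plugin names, and the tiny student/teacher models are illustrative assumptions rather than the PR's exact example:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# Two configurations, e.g. a frozen reference model under ZeRO-3 and a
# trained model under ZeRO-2 (the TRL-style use case mentioned above).
zero3_plugin = DeepSpeedPlugin(hf_ds_config="zero3_config.json")
zero2_plugin = DeepSpeedPlugin(hf_ds_config="zero2_config.json")
deepspeed_plugins = {"teacher": zero3_plugin, "student": zero2_plugin}

accelerator = Accelerator(deepspeed_plugin=deepspeed_plugins)

teacher_model = torch.nn.Linear(16, 16)  # frozen reference model
student_model = torch.nn.Linear(16, 16)  # model being trained
optimizer = torch.optim.AdamW(student_model.parameters(), lr=1e-4)
dataloader = DataLoader(TensorDataset(torch.randn(64, 16)), batch_size=8)

# The first plugin is the default; switch explicitly before each prepare().
accelerator.state.enable_deepspeed_plugin("teacher")
teacher_model = accelerator.prepare(teacher_model)

accelerator.state.enable_deepspeed_plugin("student")
student_model, optimizer, dataloader = accelerator.prepare(student_model, optimizer, dataloader)

for (batch,) in dataloader:
    with torch.no_grad():
        target = teacher_model(batch)
    loss = torch.nn.functional.mse_loss(student_model(batch), target)
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()
```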
Anticipated maintenance burden? (What will happen in say, 3 months if something changes)
Ideally this stays how it is, and we can remove the secondary/n-model `Accelerator`s later once we've worked with the DS team to make the DS API more pytorch-ic.

Who can review?
@BenjaminBossan
@SunMarc
@stas00