AssertionError: no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3 #6793
Comments
My guess is wrong, please see thehir0's reply |
my code snippet:
With deepspeed version 0.16.0 I have the same error on: with deepspeed version 0.15.4:
Everything works if grad_accum = 1; if grad_accum > 1, these errors occur. |
Using deepspeed==0.15.4 solves the problem. |
I faced the same error with deepspeed==0.16.0, but it seems to be fine with deepspeed==0.15.4 |
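For anyone who wants to fail fast rather than hit the assertion mid-run, here is a minimal sketch of a version check. It only encodes what commenters in this thread reported (0.16.0 and 0.16.1 failing, 0.15.4 working); it is not an official compatibility list, and the helper names `parse_version` and `thread_status` are hypothetical.

```python
# Versions mentioned by commenters in this thread. Illustrative only,
# not an official DeepSpeed compatibility matrix.
REPORTED_FAILING = {(0, 16, 0), (0, 16, 1)}
REPORTED_WORKING = {(0, 15, 4)}


def parse_version(text):
    """Turn '0.16.1' or '0.16.1+abc' into (0, 16, 1).

    Assumes a plain numeric major.minor.patch core; pre-release tags
    such as 'rc1' are not handled in this sketch.
    """
    core = text.split("+", 1)[0]
    return tuple(int(part) for part in core.split(".")[:3])


def thread_status(version_text):
    """Classify a version string by what this thread reported about it."""
    version = parse_version(version_text)
    if version in REPORTED_FAILING:
        return "reported failing"
    if version in REPORTED_WORKING:
        return "reported working"
    return "not mentioned"


# Typical use, assuming deepspeed is installed:
#   import deepspeed
#   print(thread_status(deepspeed.__version__))
```

As noted further down the thread, pinning an older release is a workaround rather than a long-term fix.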
It works. |
Thank you, this is very helpful. |
Same issue in ZeRO-3 training; it is likely related to #6675. |
@66RomanReigns I think this issue should be re-opened; downgrading the version is not a long-term fix. It is also a problem for ZeRO Stage 2. |
Same problem, but with ZeRO stage 2. Solved by using deepspeed==0.15.4. Thanks! |
Fixed this issue by setting |
@66RomanReigns, @allblueee, @inkcherry the reason for this assertion is that the no_sync context manager is meant to disable gradient reduction during the backward pass. However, this behavior conflicts with the gradient partitioning of ZeRO2 and ZeRO3, which requires gradient reduction. That is why we added the assertion when adding proper support for the no_sync context manager. Can you explain why you need the no_sync context manager in your code? |
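To make the conflict concrete, here is a toy sketch of the guard described above. `ToyEngine` is a hypothetical stand-in, not the real DeepSpeed engine; only the method name `zero_optimization_partition_gradients` and the assertion message are taken from the traceback in this issue, and the "stage 2 or higher partitions gradients" rule is an assumption based on the comment above.

```python
from contextlib import contextmanager


class ToyEngine:
    """Hypothetical stand-in for a DeepSpeed engine (illustration only)."""

    def __init__(self, zero_stage):
        self.zero_stage = zero_stage
        self.reduce_enabled = True  # whether gradients are reduced on backward

    def zero_optimization_partition_gradients(self):
        # Assumption from the comment above: ZeRO 2 and ZeRO 3 partition
        # gradients, so they depend on gradient reduction in every backward.
        return self.zero_stage >= 2

    @contextmanager
    def no_sync(self):
        # Same shape as the assertion shown in the traceback of this issue.
        assert not self.zero_optimization_partition_gradients(), (
            "no_sync context manager is incompatible with gradient "
            f"partitioning logic of ZeRO stage {self.zero_stage}"
        )
        self.reduce_enabled = False  # skip reduction inside the block
        try:
            yield
        finally:
            self.reduce_enabled = True  # restore reduction on exit
```

In this sketch, entering `no_sync()` on a stage 0 or stage 1 engine just switches reduction off for the duration of the block, while stage 2 or 3 fails immediately on entry, which matches the behavior reported here when a wrapper calls `no_sync` during gradient accumulation.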
@thehir0, can you please open a separate ticket for your issue? |
Hi @tjruwase, I think the call to no_sync does not originate from the client code. In practice (as in this case), DeepSpeed does not need this context manager, because it has its own mechanism for reducing gradients and determining the gradient accumulation boundary. |
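As a rough illustration of what "its own mechanism" means, the sketch below shows a config with gradient accumulation enabled and a simplified boundary rule. The config keys are standard DeepSpeed config fields, but the values are illustrative, and `is_accumulation_boundary` and `optimizer_step_indices` are hypothetical helpers that only approximate the idea; they are not DeepSpeed's actual implementation.

```python
# Standard DeepSpeed config keys; the values here are illustrative only.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 4,
    "zero_optimization": {"stage": 3},
}


def is_accumulation_boundary(micro_step, accumulation_steps):
    """True when a 0-based micro-step completes an accumulation window."""
    return (micro_step + 1) % accumulation_steps == 0


def optimizer_step_indices(num_micro_steps, accumulation_steps):
    """Micro-steps at which the optimizer update would actually be applied."""
    return [
        step
        for step in range(num_micro_steps)
        if is_accumulation_boundary(step, accumulation_steps)
    ]


steps = optimizer_step_indices(8, ds_config["gradient_accumulation_steps"])
print(steps)  # [3, 7]
```

With a real engine, the training loop typically calls `engine.backward(loss)` and `engine.step()` on every micro-batch and lets the engine decide when the update happens, so an outer `no_sync` wrapper adds nothing in that setup.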
Downgrading to 0.15.4 worked for me, thanks all! Using ZeRO 1 and HF Trainer. |
Downgrading also worked for me. I was getting the error |
+1, hit this on deepspeed 0.16.1 with HF Trainer. |
The same problem with ZeRO 3, HF Trainer and deepspeed 0.16.1. Solved by downgrading to deepspeed 0.15.4. |
I think this issue can be fixed by pulling in huggingface/transformers#35157 |
I encountered an issue while using DeepSpeed with ZeRO Stage 3 optimization. I received the following error: no_sync is not compatible with ZeRO Stage 3. I’m not sure how to resolve this conflict.
If anyone has experience with this or knows how to resolve it, could you please guide me? Thank you in advance!
[rank0]: File "/root/miniconda3/envs/llama/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1997, in no_sync
[rank0]: assert not self.zero_optimization_partition_gradients(),
[rank0]: AssertionError: no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3
0%| | 0/168 [00:00<?, ?it/s]
W1126 23:28:07.821000 402381 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 402434 closing signal SIGTERM
E1126 23:28:11.641000 402381 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 402435) of binary: /root/miniconda3/envs/llama/bin/python