
Use non_reentrant_checkpoint to fix "requires_grad of input must be true" for activation checkpoint layers in pipeline training #4224

Merged
36 commits merged into microsoft:master on Sep 6, 2023

Conversation

@inkcherry (Contributor) commented Aug 26, 2023

From the original PR #4128, rebased on the feature proposed in #4118. Should be reviewed after #4118 is merged.

  • Set config ['pipeline']['use_reentrant'] = False to enable it; it won't affect the original stable workload (see the sketch after this list).
  • Added a CIFAR-10 training unit test, which passes. There seems to be no Megatron-DeepSpeed pipeline training unit test for reference, so I verified it locally (the model parameter update process is consistent). It can also save a small amount of memory on rank 0 (stage 0, also the rank with the highest memory usage) under some 3D parallelism configurations.
    @tohtana @tjruwase @hughpu
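
A minimal sketch of the flag and the PyTorch behavior it builds on; the config values and the tiny model here are illustrative assumptions, not taken from this PR:

```python
import torch
from torch.utils.checkpoint import checkpoint

# (1) Enabling the new path via the DeepSpeed config, per the PR description.
#     All keys other than pipeline.use_reentrant are illustrative placeholders.
ds_config = {
    "train_batch_size": 16,
    "pipeline": {
        "use_reentrant": False,  # False selects non-reentrant activation checkpointing
    },
}

# (2) The underlying PyTorch behavior. Reentrant checkpointing (the old
#     default) computes no gradients through the segment when no input has
#     requires_grad=True, which is why pipeline code had to force it on.
#     The non-reentrant variant works when only the segment's parameters
#     require grad.
layer = torch.nn.Linear(8, 8)
x = torch.randn(2, 8)  # requires_grad=False, like a stage's incoming activations

y = checkpoint(layer, x, use_reentrant=False)
y.sum().backward()
assert layer.weight.grad is not None  # gradients still reach the parameters
```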

hughpu and others added 29 commits August 7, 2023 19:20
* Pass correct node size

* formatting

---------

Co-authored-by: Connor Holmes <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>
* add deepspeed chat arxiv report

* add zeroquant v2 and fp

* add selective enhancement

* add ignore for 'Youn' in spell checker

---------

Co-authored-by: yaozhewei <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>
@tohtana (Contributor) commented Aug 31, 2023

@inkcherry Thank you for submitting this as a new PR! Sorry for the delay in my review.
It's great that the change looks much simpler now, but the new tests failed. Can you check them?

@inkcherry (Contributor, Author) commented Aug 31, 2023

Hi @tohtana,
Thanks for your reply! But I couldn't reproduce the errors that occurred in the CI workflow: running the test_pipe.py file or the test_pipe_use_reentrant function separately on my CUDA device always passes.
Does this unit test pass when run separately on nv-torch-latest-v100?
I'd appreciate any suggestions. Thanks!

@tohtana (Contributor) commented Sep 1, 2023

Hi @inkcherry,

I could reproduce the same error (AssertionError on line 109). How did you run the test? I ran the following:

pytest test_pipe.py::TestPipeCifar10::test_pipe_use_reentrant[topo_config0]

@inkcherry (Contributor, Author) commented

Hi @tohtana, thanks a lot for helping me find the cause! It seems my testing environment's version is a bit newer than CI's.
The issue can be reproduced with the following two Docker images plus a DeepSpeed install:

pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel
(same version as CI, always fails)
nvcr.io/nvidia/pytorch:23.07-py3
(newer than CI, always passes)

I have relaxed the check from exact parameter equality to loss convergence, and the test now passes. The parameter-equality check can be restored once the CI version is updated; a sketch of the relaxed check is below.
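
A hedged sketch of what the relaxed check might look like; the helper and variable names are hypothetical, and the actual assertions live in test_pipe.py:

```python
import torch

def check_loss_convergence(base_losses, test_losses, rtol=0.05):
    # Relaxed check: both runs converge and end with similar loss,
    # instead of requiring bit-identical parameters across PyTorch versions.
    assert test_losses[-1] < test_losses[0], "loss did not decrease"
    assert abs(test_losses[-1] - base_losses[-1]) <= rtol * abs(base_losses[-1])

def check_params_equal(model_a, model_b):
    # Strict check, to be restored once CI's PyTorch version is updated.
    for p_a, p_b in zip(model_a.parameters(), model_b.parameters()):
        assert torch.allclose(p_a, p_b)
```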

@tohtana self-requested a review September 6, 2023 17:04

@tohtana (Contributor) left a comment


Thank you @inkcherry, the changes look good to me and I have approved this PR.
It is not surprising that the updated parameters do not exactly match without setting options for reproducibility; a general sketch of those options follows.
Let's merge this PR after the CI tests complete.
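
For reference, the standard PyTorch reproducibility options alluded to above (a general sketch, not part of this PR):

```python
import os
import random
import numpy as np
import torch

def make_reproducible(seed: int = 42) -> None:
    # Seed every RNG that training touches.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Force deterministic kernels; cuBLAS needs this env var on CUDA >= 10.2.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False
```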

@tohtana added this pull request to the merge queue Sep 6, 2023
Merged via the queue into microsoft:master with commit 60a3e89 Sep 6, 2023
16 checks passed