use `non_reentrant_checkpoint` fix requires_grad of input must be true for activation checkpoint layer in pipeline train. #4224

inkcherry · 2023-08-26T17:34:41Z

This PR carries over the original PR #4128, rebased on the feature proposed in #4118. It should be reviewed after #4118 is merged.

Set `['pipeline']['use_reentrant'] = False` in the config to enable it; it does not affect existing stable workloads.
I added a CIFAR-10 training unit test, and it passes. There does not seem to be a Megatron-DeepSpeed pipeline training unit test to use as a reference, so I verified this locally (the model parameter update process is consistent). It can also still save a small amount of memory on rank 0 (stage 0, which is also the rank that uses the most memory) in some 3D configurations.
@tohtana @tjruwase @hughpu
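
For reference, the setting described above might look like this when the DeepSpeed config is written as a Python dict. This is only a sketch: the `['pipeline']['use_reentrant']` key comes from the description above, while the variable name `ds_config` and the batch-size value are illustrative assumptions.

```python
# Minimal sketch of a DeepSpeed config dict.
# Only the pipeline.use_reentrant key is taken from this PR description.
ds_config = {
    "train_batch_size": 4,  # placeholder value, assumed for illustration
    "pipeline": {
        # False opts in to non_reentrant_checkpoint for activation
        # checkpointing in pipeline training. Per the description above,
        # existing workloads that do not set this key are unaffected.
        "use_reentrant": False,
    },
}

print(ds_config["pipeline"]["use_reentrant"])
```

The dict would then be passed to the training setup in the usual way; that wiring is omitted here because it depends on the model and launcher in use.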

tohtana · 2023-08-31T00:42:13Z

@inkcherry Thank you for submitting this as a new PR! Sorry for the delay of my review.
It is great that the change looks much simpler now. But the new tests failed. Can you check it?

inkcherry · 2023-08-31T12:40:16Z

Hi @tohtana,
Thanks for your reply! I couldn't reproduce the errors that occurred in the CI workflow: running the test_pipe.py file or the test_pipe_use_reentrant function on its own always passes on my CUDA device.
Does this unit test pass when run on its own on nv-torch-latest-v100?
I'd appreciate any suggestions. Thanks!

tohtana · 2023-09-01T03:06:30Z

Hi @inkcherry,

I could reproduce the same error (AssertionError on line 109). How did you run the test? I ran the following:

pytest test_pipe.py::TestPipeCifar10::test_pipe_use_reentrant[topo_config0]

inkcherry · 2023-09-01T06:51:50Z

> Hi @inkcherry,
>
> I could reproduce the same error (AssertionError on line 109). How did you run the test? I ran the following:
> pytest test_pipe.py::TestPipeCifar10::test_pipe_use_reentrant[topo_config0]

Hi @tohtana, thank you very much for helping me find the cause. It seems the versions in my test environment are somewhat newer than those in CI.
The behaviour can be reproduced with the following two Docker images, each with DeepSpeed installed:

pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel
(same version as CI, always fails)
nvcr.io/nvidia/pytorch:23.07-py3
(newer version than CI, always passes)

I have relaxed the check from parameter equality to loss convergence, and the test now passes. The parameter-equality check can be restored once the version is updated.
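
The two check levels mentioned above can be sketched roughly as follows. This is not the test code from this PR: the helper names `params_equal` and `loss_converged` and all numeric values are hypothetical, and plain Python floats stand in for tensors.

```python
import math


def params_equal(params_a, params_b, rel_tol=0.0, abs_tol=0.0):
    """Strict check: every parameter matches (exactly, unless a tolerance is given)."""
    return len(params_a) == len(params_b) and all(
        math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, b in zip(params_a, params_b)
    )


def loss_converged(losses, min_drop=0.0):
    """Relaxed check: the final loss is lower than the first by more than min_drop."""
    return len(losses) >= 2 and (losses[0] - losses[-1]) > min_drop


# Hypothetical values for a reentrant run and a non-reentrant run.
params_reentrant = [0.5012, -1.2030]
params_non_reentrant = [0.5013, -1.2030]  # differs slightly in the first entry
losses_non_reentrant = [2.30, 1.95, 1.71, 1.52]

print(params_equal(params_reentrant, params_non_reentrant))
print(loss_converged(losses_non_reentrant))
```

With these made-up numbers the strict check fails while the relaxed one passes, which is the kind of situation the relaxed check is meant to tolerate.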

tohtana

Thank you @inkcherry, the changes look good to me and I approved this PR.
It is not surprising that the updated parameters do not exactly match without setting options for reproducibility.
Let's merge this PR after the CI tests complete.

hughpu and others added 29 commits August 7, 2023 19:20

feat: add non_reentrant_checkpoint

a20c79c

feat: add missing output postprocess and change the hook to record le…

8aeba5f

…af forward tensor refs

fix: make the multi_grad_hook registered after graph construction

ee04fa8

fix: backward compatibility for multi_tensor_hook

51f833d

fix: nonlocal reference error of deepspeed_saved_tensors

b29c1ef

fix: reduce repeating hook registration

37e7c23

Merge branch 'microsoft:master' into feat/non-reentrant-checkpoint

d7c5440

test: add test for `activation_checkpointing.checkpointing.non_reentr…

e22c487

…ant_checkpoint`

Pass correct node size for ZeRO++ (microsoft#4085)

4d2a274

* Pass correct node size * formatting --------- Co-authored-by: Connor Holmes <[email protected]> Co-authored-by: Michael Wyatt <[email protected]>

add deepspeed chat arxiv report (microsoft#4110)

d4d070b

* add deepspeed chat arxiv report * add zeroquant v2 and fp * add selective enhencement * add ignore for 'Youn' in spell checker --------- Co-authored-by: yaozhewei <[email protected]> Co-authored-by: Michael Wyatt <[email protected]>

style: change flake8 detected style missmatch

aaf309e

test: hack to clone the test_activation_checkpointing module for re…

a910922

…use and add regression tests

doc: explain the introduction of non_reentrant_checkpoint

fc919b1

doc: explain the test of non_reentrant_checkpoint

b6a0a44

Merge branch 'microsoft:master' into feat/non-reentrant-checkpoint

8ec86a4

Merge branch 'master' into feat/non-reentrant-checkpoint

78c0d65

Merge branch 'master' into feat/non-reentrant-checkpoint

e4eff23

Merge branch 'master' into feat/non-reentrant-checkpoint

a6c7871

Merge branch 'master' into feat/non-reentrant-checkpoint

fbbb760

Merge branch 'master' into feat/non-reentrant-checkpoint

a338097

Merge branch 'master' into feat/non-reentrant-checkpoint

a00cff1

Merge branch 'master' into feat/non-reentrant-checkpoint

c17cc3d

Merge branch 'master' into feat/non-reentrant-checkpoint

a680399

Merge branch 'master' into feat/non-reentrant-checkpoint

13e766d

Merge branch 'master' into feat/non-reentrant-checkpoint

a46e326

Merge branch 'master' into feat/non-reentrant-checkpoint

b5c03f4

Merge branch 'master' into feat/non-reentrant-checkpoint

13a026d

apply non_reentrant_checkpoint in pipeline parallel training

0c18dda

ut pass

71421bf

inkcherry requested a review from ShadenSmith as a code owner August 26, 2023 17:34

inkcherry requested review from duli2012, jeffra, tjruwase and mrwyattii as code owners August 26, 2023 17:34

inkcherry mentioned this pull request Aug 26, 2023

Fix requires_grad of input must be true for activation checkpoint layer in pipeline train. #4128

Closed

Merge branch 'master' into use_reentrant

18d64d3

inkcherry added 2 commits August 31, 2023 09:31

fix ci

d9edc63

Merge branch 'master' into use_reentrant

d98790d

inkcherry added 2 commits September 1, 2023 14:19

reduce check level for ci

c27d334

reduce check level for ci

79b427d

tohtana added 2 commits September 1, 2023 16:49

Merge branch 'master' into use_reentrant

80beb04

Merge branch 'master' into use_reentrant

bdc9db0

tohtana self-requested a review September 6, 2023 17:04

tohtana approved these changes Sep 6, 2023

tohtana enabled auto-merge September 6, 2023 17:06

tohtana added this pull request to the merge queue Sep 6, 2023

Merged via the queue into microsoft:master with commit 60a3e89 Sep 6, 2023
16 checks passed

delock mentioned this pull request Sep 20, 2024

[TRACKER] Customer support related PR tracker for Intel devices #6556

Open

23 tasks
