[ZeRO] Default disable elastic ckpt in stage 1+2 and reduce CPU memory overhead during ckpt load #1525

jeffra · 2021-11-05T05:53:25Z

Example scenario that this PR addresses:

A 10B-parameter model trained on 256 GPUs, with 8 GPUs per machine. We observe that each ZeRO rank checkpoint is 525MB. With elastic checkpointing enabled, we currently require each rank to load all ZeRO checkpoints so that the model can potentially be re-partitioned to a new world size. With elastic checkpointing disabled, we still load all ZeRO checkpoints and then end up throwing most of the state away.

With elastic checkpointing disabled, we currently attempt to load 525MB * 256 files * 8 GPUs, which requires roughly 1TB of CPU memory per machine just for optimizer state. On certain machines this results in OOM errors, as you can imagine.

This PR loads only the states each rank requires. We still need the fp32 master weights from all ZeRO partitions, so those are still loaded, but skipping the rest is enough to reduce the CPU memory requirement per node. For our 10B example this should be more along the lines of 40GB * 8 GPUs = 320GB.
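The two estimates above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the helper names are made up for illustration, decimal units are assumed, and the 40GB figure is assumed to be 10B parameters at 4 bytes each (fp32).

```python
MB = 10**6
GB = 10**9
TB = 10**12

def cpu_mem_loading_all_ckpts(ckpt_bytes_per_rank, world_size, gpus_per_node):
    """Old behavior: every local rank reads every rank's ZeRO checkpoint file."""
    return ckpt_bytes_per_rank * world_size * gpus_per_node

def cpu_mem_fp32_weights_only(num_params, gpus_per_node, bytes_per_param=4):
    """New behavior (assumed model): each local rank holds one full fp32 copy of the weights."""
    return num_params * bytes_per_param * gpus_per_node

before = cpu_mem_loading_all_ckpts(525 * MB, world_size=256, gpus_per_node=8)
after = cpu_mem_fp32_weights_only(10 * 10**9, gpus_per_node=8)

print(f"before: {before / TB:.2f} TB per node")
print(f"after:  {after / GB:.0f} GB per node")
```

The actual footprint depends on what else is resident at load time, so treat both numbers as rough bounds rather than measurements.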

This PR also turns off elastic checkpointing support for ZeRO-2 by default; we've seen issues with this mode and advise users to turn it off.
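For anyone who wants to set the mode explicitly rather than rely on the default, a config along these lines should work. It is a sketch that assumes the option is exposed as `elastic_checkpoint` under `zero_optimization`; the batch size is a placeholder value, not something taken from this PR.

```python
import json

# Assumption: `elastic_checkpoint` is the key under `zero_optimization`
# that controls this mode. Everything else here is a placeholder.
ds_config = {
    "train_batch_size": 256,  # hypothetical value
    "zero_optimization": {
        "stage": 2,
        # False keeps checkpoints tied to the world size they were saved with.
        "elastic_checkpoint": False,
    },
}

# The dict can be passed to DeepSpeed as a config, or dumped to a JSON file.
print(json.dumps(ds_config, indent=2))
```

Setting the key to `True` would opt back in to elastic checkpointing, with the caveats mentioned above.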

stas00 · 2022-01-16T22:17:45Z

This is interesting, as this PR seems to have fixed a bug where groups['step'] in apex's FusedAdam wasn't getting saved/restored. I was just debugging this and happened to sync my master and voila - the bug went away! Thank you, @jeffra

Not sure if an explicit test is needed since I think the fix was accidental and not intentional.

To diagnose the bug I was just dumping optimizer.optimizer.param_groups[0]['step'] after load_checkpoint, and on resume I was getting 1 instead of the iteration count. After updating to this commit it's now the iteration count.
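If an explicit check does turn out to be useful, the diagnosis above could be wrapped up roughly as follows. The helper names are hypothetical, and the objects built with `SimpleNamespace` are stand-ins for the real wrapped optimizer so that the snippet runs without DeepSpeed or apex installed.

```python
from types import SimpleNamespace

def first_group_step(wrapped_optimizer):
    """Return param_groups[0]['step'] of the inner optimizer, or None if the key is missing."""
    return wrapped_optimizer.optimizer.param_groups[0].get("step")

def step_was_restored(wrapped_optimizer, expected_iteration):
    """True if the restored step matches the iteration the checkpoint was saved at."""
    return first_group_step(wrapped_optimizer) == expected_iteration

def fake_wrapper(step):
    """Stand-in for a wrapper whose .optimizer exposes param_groups (illustration only)."""
    inner = SimpleNamespace(param_groups=[{"lr": 1e-3, "step": step}])
    return SimpleNamespace(optimizer=inner)

# Symptom described above: step comes back as 1 instead of the iteration count.
buggy = fake_wrapper(step=1)
# Expected behavior after the fix: step equals the saved iteration count.
fixed = fake_wrapper(step=5000)

print(step_was_restored(buggy, 5000))
print(step_was_restored(fixed, 5000))
```

In a real run the wrapper would be the engine's optimizer after load_checkpoint, and the expected iteration would come from the training loop's own counter.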

jeffra requested review from awan-10, cli99, conglongli, eltonzheng, minjiaz, niumanar, RezaYazdaniAminabadi, samyam, ShadenSmith and tjruwase as code owners November 5, 2021 05:53

tjruwase reviewed Nov 5, 2021

deepspeed/runtime/engine.py (outdated, resolved)

tjruwase approved these changes Nov 5, 2021

jeffra changed the title ~~Reduce CPU memory overhead during ZeRO checkpoint loading~~ [ZeRO] Default disable elastic ckpt in stage 1+2 and reduce CPU memory overhead during ckpt load Nov 5, 2021

jeffra force-pushed the zero-ckpt-cpu-issue branch from 67aa46d to 45a416e Compare November 18, 2021 19:54

jeffra mentioned this pull request Nov 21, 2021

Error while saving T5-11B checkpoint #1574

Open

jeffra mentioned this pull request Dec 1, 2021

Optimizer state loading fix for bitsandbytes 8-bit optimizers. #1582

Open

jeffra added 2 commits January 5, 2022 14:20

[squash] zero-ckpt-cpu-issue (#1673)

0fc11fa

formatting

dbd0823

jeffra force-pushed the zero-ckpt-cpu-issue branch from 09260b6 to dbd0823 Compare January 5, 2022 22:41

tjruwase added 5 commits January 6, 2022 04:36

Merge branch 'master' into zero-ckpt-cpu-issue

92d87f0

Reduce cpu memory of loading in rigid mode

a6b6770

Merge branch 'master' into zero-ckpt-cpu-issue

21e173b

Allocate tensor on param device

cd4ce85

Merge branch 'zero-ckpt-cpu-issue' of github.com:microsoft/DeepSpeed into zero-ckpt-cpu-issue

4b0d366

tjruwase reviewed Jan 6, 2022

tests/unit/simple_model.py (resolved)

jeffra commented Jan 7, 2022

deepspeed/runtime/zero/stage_1_and_2.py (outdated, resolved)

tjruwase and others added 4 commits January 7, 2022 09:45

Merge branch 'master' into zero-ckpt-cpu-issue

571b0a2

add WS check + several unit tests for ckpting (TODO: need to fix a few FT test cases)

a4b40fa

uncomment exception check in ckpt test

6497509

Merge branch 'master' into zero-ckpt-cpu-issue

477dc89

tjruwase and others added 4 commits January 11, 2022 15:31

Merge branch 'master' into zero-ckpt-cpu-issue

61bdfec

Merge branch 'master' into zero-ckpt-cpu-issue

c13305f

fixes for remaining unit tests

091071d

Merge branch 'master' into zero-ckpt-cpu-issue

4add930

jeffra enabled auto-merge (squash) January 14, 2022 18:44

jeffra disabled auto-merge January 14, 2022 19:04

jeffra merged commit 3293cf7 into master Jan 14, 2022

jeffra deleted the zero-ckpt-cpu-issue branch January 14, 2022 19:05

NZ99 mentioned this pull request Jul 31, 2022

IndexError when checkpointing aqlaboratory/openfold#184

Closed
