[ZeRO] Default disable elastic ckpt in stage 1+2 and reduce CPU memory overhead during ckpt load #1525

Merged: jeffra merged 15 commits into master from zero-ckpt-cpu-issue on Jan 14, 2022


jeffra (Collaborator) commented Nov 5, 2021

Example scenario that this PR addresses:

Consider a 10B-parameter model trained on 256 GPUs, with 8 GPUs per machine. We observe that each ZeRO rank checkpoint is 525 MB. If elastic checkpointing is enabled, each rank currently has to load all ZeRO checkpoints so that the model can be re-partitioned to a new world size if needed. If elastic checkpointing is disabled, we still load all ZeRO checkpoints and then throw most of the state away.

With elastic checkpointing disabled, each machine therefore currently attempts to load 525 MB * 256 files * 8 GPUs, which comes to roughly 1 TB of CPU memory per machine just for optimizer state. On some machines this results in OOM errors.

This PR loads only the states that each rank actually requires. We still need all ZeRO partitions for the fp32 master weights, so those are still loaded, but the change is enough to reduce the CPU memory requirement per node. For our 10B example this should be more along the lines of 40 GB * 8 GPUs = 320 GB.
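The two estimates above can be checked with a few lines of arithmetic. This is only a sketch: it uses decimal units (1 GB = 10^9 bytes), and it assumes the 40 GB per-rank figure corresponds to an fp32 copy of all 10B parameters at 4 bytes each, which the description implies but does not state outright.

```python
MB = 10**6
GB = 10**9

ranks = 256                  # one ZeRO checkpoint file per GPU
gpus_per_node = 8
ckpt_per_rank = 525 * MB     # observed size of one ZeRO rank checkpoint

# Before the PR: every rank on a node loads every ZeRO checkpoint file.
before_per_node = ckpt_per_rank * ranks * gpus_per_node

# After the PR (assumption): the cost per rank is dominated by the
# fp32 master weights, at 4 bytes per parameter.
params = 10 * 10**9
fp32_weights_per_rank = params * 4
after_per_node = fp32_weights_per_rank * gpus_per_node

print(f"before: {before_per_node / GB:.0f} GB per node")
print(f"after:  {after_per_node / GB:.0f} GB per node")
```

Under these assumptions the per-node requirement drops from about 1075 GB to 320 GB, which matches the "1 TB" and "320 GB" figures quoted above.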

This PR also turns off elastic checkpointing support for ZeRO-2 by default. We've seen issues with this mode and advise users to turn it off.
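For users who want to pin the setting explicitly rather than rely on the default, a config fragment might look like the sketch below. It assumes the `elastic_checkpoint` key under `zero_optimization` as described in the DeepSpeed configuration docs; the stage value is illustrative, and the key should be checked against the installed DeepSpeed version.

```python
import json

# Sketch of a DeepSpeed config fragment (not a complete config).
# Assumption: `elastic_checkpoint` lives under `zero_optimization`.
ds_config = {
    "zero_optimization": {
        "stage": 2,                   # illustrative; the PR title covers stage 1 and 2
        "elastic_checkpoint": False,  # explicitly disable elastic checkpointing
    },
}

# Print the fragment as it would appear in a JSON config file.
print(json.dumps(ds_config, indent=2))
```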

jeffra changed the title from "Reduce CPU memory overhead during ZeRO checkpoint loading" to "[ZeRO] Default disable elastic ckpt in stage 1+2 and reduce CPU memory overhead during ckpt load" on Nov 5, 2021
jeffra force-pushed the zero-ckpt-cpu-issue branch from 67aa46d to 45a416e on November 18, 2021 19:54
jeffra force-pushed the zero-ckpt-cpu-issue branch from 09260b6 to dbd0823 on January 5, 2022 22:41
jeffra enabled auto-merge (squash) on January 14, 2022 18:44
jeffra disabled auto-merge on January 14, 2022 19:04
jeffra merged commit 3293cf7 into master on Jan 14, 2022
jeffra deleted the zero-ckpt-cpu-issue branch on January 14, 2022 19:05

stas00 (Collaborator) commented Jan 16, 2022

This is interesting: this PR seems to have fixed a bug where groups['step'] in apex's FusedAdam wasn't getting saved/restored. I was just debugging this, happened to sync my master, and voilà, the bug went away! Thank you, @jeffra

Not sure if an explicit test is needed, since I think the fix was accidental rather than intentional.

To diagnose the bug I was dumping optimizer.optimizer.param_groups[0]['step'] after load_checkpoint, and on resume I was getting 1 instead of the iteration count. After updating to this commit it's now the iteration count.
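The check described above could be wrapped in a small helper along these lines. This is a sketch only: `first_group_step` is a hypothetical name, and the `SimpleNamespace` objects are stand-ins for the ZeRO optimizer wrapper and the inner FusedAdam instance so that the snippet runs without DeepSpeed or apex installed.

```python
from types import SimpleNamespace


def first_group_step(zero_optimizer):
    """Return param_groups[0]['step'] of the wrapped optimizer, or None if unset."""
    inner = zero_optimizer.optimizer  # the wrapped optimizer (e.g. FusedAdam)
    return inner.param_groups[0].get("step")


# Stand-ins for a real engine's optimizer after load_checkpoint (hypothetical values).
fake_inner = SimpleNamespace(param_groups=[{"lr": 1e-4, "step": 1200}])
fake_zero_optimizer = SimpleNamespace(optimizer=fake_inner)

resumed_step = first_group_step(fake_zero_optimizer)
print(f"step after resume: {resumed_step}")
```

In a real run the helper would be called on the engine's optimizer right after load_checkpoint, and the returned value compared against the iteration count recorded when the checkpoint was saved.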

3 participants