[ZeRO] Default disable elastic ckpt in stage 1+2 and reduce CPU memory overhead during ckpt load #1525
…into zero-ckpt-cpu-issue
This is interesting, as this PR seems to have fixed a bug where Not sure if an explicit test is needed since I think the fix was accidental and not intentional. To diagnose the bug I was just dumping |
Example scenario that this PR addresses:
A 10B-parameter model trained on 256 GPUs, with 8 GPUs per machine. We observe that each ZeRO rank checkpoint is 525MB. With elastic checkpointing enabled, each rank currently has to load all ZeRO checkpoints so that the model can potentially be re-partitioned to a new world size. With elastic checkpointing disabled, we still load all ZeRO checkpoints and then throw most of the state away.
If elastic checkpointing is disabled we currently attempt to load 525MB * 256 files * 8 GPUs, which requires roughly 1TB of CPU memory per machine just for optimizer state. On some machines this results in OOM errors.
This PR loads only the states each rank requires. All ZeRO partitions are still needed for the fp32 master weights, so those are still loaded, but the reduction is enough to lower the CPU memory requirement per node. For our 10B example this should be closer to 40GB * 8 GPUs = 320GB.
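The two estimates above can be reproduced with a quick back-of-the-envelope calculation. This is only a sketch of the arithmetic in this description: the constant and function names are made up for illustration, and the 40GB per-rank figure is taken directly from the estimate above, not measured.

```python
# Back-of-the-envelope CPU memory estimate per machine during checkpoint load.
# All names here are illustrative; the numbers come from the PR description.

GPUS_PER_MACHINE = 8        # ranks sharing one machine's CPU memory
WORLD_SIZE = 256            # total ZeRO ranks, one checkpoint file per rank
CKPT_MB_PER_RANK = 525      # observed size of one ZeRO rank checkpoint
AFTER_GB_PER_RANK = 40      # rough per-rank footprint after this PR (assumed from the text)


def before_mb_per_machine() -> int:
    """Every local rank loads every rank's ZeRO checkpoint file."""
    return CKPT_MB_PER_RANK * WORLD_SIZE * GPUS_PER_MACHINE


def after_gb_per_machine() -> int:
    """Every local rank loads only what it needs (about 40GB each)."""
    return AFTER_GB_PER_RANK * GPUS_PER_MACHINE


before_mb = before_mb_per_machine()
after_gb = after_gb_per_machine()
print(f"before: {before_mb} MB (~{before_mb / 1_000_000:.2f} TB) per machine")
print(f"after:  {after_gb} GB per machine")
```

The "before" figure works out to 1,075,200MB, which is where the roughly 1TB per machine comes from.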
This PR also turns off elastic checkpointing support for ZeRO-2 by default. We've seen issues with this mode and advise users to turn it off.
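For users who prefer to set this explicitly instead of relying on the default, a minimal config sketch follows. It assumes the flag is exposed as `elastic_checkpoint` under the `zero_optimization` section of the DeepSpeed config; the batch size and output filename are placeholders, so check the config documentation for your DeepSpeed version before relying on it.

```python
import json

# Minimal DeepSpeed-style config with elastic checkpointing explicitly disabled.
# Assumption: the flag lives at zero_optimization.elastic_checkpoint.
ds_config = {
    "train_batch_size": 256,  # placeholder value for illustration
    "zero_optimization": {
        "stage": 2,
        "elastic_checkpoint": False,
    },
}

# Write the config to a JSON file that can be passed to the training launcher.
# The filename is arbitrary.
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

Note that with elastic checkpointing off, a checkpoint is expected to be loaded with the same world size it was saved with, since the re-partitioning path described above is what the elastic mode provides.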