[zero++] Synchronize at the end of secondary partitioning and simplify the logic #5216

ByronHsu · 2024-03-01T02:34:33Z

1. Why?

We have a very long thread investigating the issue (microsoft#5059). To summarize, the root cause is:

a. The 2nd partitioning is asynchronous because it copies device-to-device from the full tensor to the 2nd (secondary) tensor.
b. With prefetching enabled, the all-gather of the 2nd tensor can start before the 2nd partitioning finishes. At that moment, the 2nd tensor may still contain bad values.
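
To make the ordering problem in (a) and (b) concrete, here is a toy model in plain Python. It is not CUDA or DeepSpeed code: `ToyStream` and `run_step` are hypothetical names, and the "worst-case schedule" is forced by hand so that the result is deterministic. It only illustrates why a consumer on another stream can read the secondary shard before the copy has landed.

```python
from collections import deque


class ToyStream:
    """A FIFO queue of pending operations, loosely modelling a device stream."""

    def __init__(self):
        self.pending = deque()

    def enqueue(self, op):
        # Enqueueing returns immediately; the work itself runs later.
        self.pending.append(op)

    def synchronize(self):
        # Block until every queued operation has run.
        while self.pending:
            self.pending.popleft()()


def run_step(sync_after_copy):
    full_tensor = [1.0, 2.0, 3.0, 4.0]
    secondary = [float("nan")] * 2  # secondary shard, not yet filled
    gathered = []

    copy_stream = ToyStream()
    prefetch_stream = ToyStream()

    def secondary_partition():
        # Stands in for the asynchronous device-to-device copy.
        secondary[:] = full_tensor[2:4]

    def all_gather():
        # Reads whatever the shard holds at the moment it runs.
        gathered.extend(secondary)

    copy_stream.enqueue(secondary_partition)
    if sync_after_copy:
        copy_stream.synchronize()
    prefetch_stream.enqueue(all_gather)

    # Forced worst-case schedule: the prefetch stream is drained first.
    prefetch_stream.synchronize()
    copy_stream.synchronize()
    return gathered


print(run_step(sync_after_copy=False))  # shard is read before the copy ran
print(run_step(sync_after_copy=True))   # shard is read after the copy ran
```

Without the synchronize call the gathered values are still the placeholder NaNs; with it, the gather sees the copied slice.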

We also found that the copy logic was incorrect and unnecessarily long, so we simplified it to only two lines.

Kudos to @yundai424, Haowen Ning, and @samadejacobs for the investigation effort.

2. What?

After multiple careful tests, we found that patching in `get_accelerator().synchronize()` at the end of the 2nd partitioning, which waits for all CUDA streams to finish, prevents the issue.
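
For reviewers, a minimal sketch of the copy-then-synchronize pattern in plain PyTorch, not the DeepSpeed code itself. The function name `copy_then_sync` and its parameters are hypothetical, and `torch.cuda.synchronize()` / `torch.cuda.current_stream().synchronize()` are used here as stand-ins for the accelerator abstraction. On a CPU-only machine the copy is already synchronous, so the sync branch is skipped.

```python
import torch


def copy_then_sync(full_param, secondary, start, all_streams=False):
    """Copy one shard of full_param into secondary, then wait for the copy."""
    flat = full_param.contiguous().view(-1)
    numel = secondary.numel()
    # A device-to-device copy is queued on the current stream and may return
    # before the data has actually landed in `secondary`.
    secondary.copy_(flat.narrow(0, start, numel), non_blocking=True)
    if secondary.is_cuda:
        if all_streams:
            # Wait for every stream on this device.
            torch.cuda.synchronize(secondary.device)
        else:
            # Wait only for the stream the copy was queued on.
            torch.cuda.current_stream(secondary.device).synchronize()
    return secondary


device = "cuda" if torch.cuda.is_available() else "cpu"
full = torch.arange(8, dtype=torch.float32, device=device)
shard = torch.empty(4, dtype=torch.float32, device=device)
copy_then_sync(full, shard, start=4)
print(shard.tolist())  # [4.0, 5.0, 6.0, 7.0]
```

The `all_streams=False` branch is the narrower option; the later commit in this PR titled "only sync current stream" appears to move in that direction.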

3. Tests

I validated the simplified 2nd partition logic: with the same random seed, the loss is exactly the same before and after the simplification.

Before

[
  {"loss": 2.0731},
  {"loss": 2.0288},
  {"loss": 1.927},
  {"loss": 1.8347},
  {"loss": 1.8347},
  {"loss": 1.7896},
  {"loss": 1.602},
  {"loss": 1.766},
  {"loss": 1.8751},
  {"loss": 1.6776}
]

After

[
  {"loss": 2.0731},
  {"loss": 2.0288},
  {"loss": 1.927},
  {"loss": 1.8347},
  {"loss": 1.8347},
  {"loss": 1.7896},
  {"loss": 1.602},
  {"loss": 1.766},
  {"loss": 1.8751},
  {"loss": 1.6776}
]
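
The exact-match check can also be scripted. A minimal sketch, assuming each run's losses are saved as a JSON list shaped like the two logs above; `first_mismatch` is a hypothetical helper, not something in this PR.

```python
import json

BEFORE = """[
  {"loss": 2.0731}, {"loss": 2.0288}, {"loss": 1.927}, {"loss": 1.8347},
  {"loss": 1.8347}, {"loss": 1.7896}, {"loss": 1.602}, {"loss": 1.766},
  {"loss": 1.8751}, {"loss": 1.6776}
]"""

AFTER = """[
  {"loss": 2.0731}, {"loss": 2.0288}, {"loss": 1.927}, {"loss": 1.8347},
  {"loss": 1.8347}, {"loss": 1.7896}, {"loss": 1.602}, {"loss": 1.766},
  {"loss": 1.8751}, {"loss": 1.6776}
]"""


def first_mismatch(before_json, after_json):
    """Return the first step whose loss differs, or None if the runs match."""
    before = [entry["loss"] for entry in json.loads(before_json)]
    after = [entry["loss"] for entry in json.loads(after_json)]
    if len(before) != len(after):
        # The shorter log ends where the two runs stop being comparable.
        return min(len(before), len(after))
    for step, (b, a) in enumerate(zip(before, after)):
        if b != a:  # exact comparison on purpose, no tolerance
            return step
    return None


print(first_mismatch(BEFORE, AFTER))
```

The comparison is deliberately exact rather than tolerance-based, since the claim is that the simplification does not change the numerics at all under a fixed seed.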

4. TODO

The issue needs further investigation, @samadejacobs:

1) Revisit the ZeRO-3 prefetch design.
2) Refactor hpz to reuse the primary tensor for the secondary partition.

Signed-off-by: byhsu <[email protected]>

tjruwase · 2024-03-01T17:26:59Z

@GuanhuaWang, FYI!

@yundai424

…y the logic (microsoft#5216)

ByronHsu added 3 commits February 29, 2024 18:25

Synchronize at the end of secondary partitioning

644a822

trim

18967fc

trim

d36ec0c

ByronHsu requested review from tjruwase and mrwyattii as code owners March 1, 2024 02:34

tjruwase approved these changes Mar 1, 2024

simplify partition logic

9837fcc

ByronHsu changed the title ~~[zero++] Synchronize at the end of secondary partitioning~~ [zero++] Synchronize at the end of secondary partitioning and simplify the logic Mar 1, 2024

only sync current stream

00e76e9

samadejacobs added this pull request to the merge queue Mar 1, 2024

Merged via the queue into microsoft:master with commit 4578c24 Mar 1, 2024
12 checks passed
