This repository has been archived by the owner on Oct 19, 2024. It is now read-only.
Ray spill out of disk error when using alpa to auto-parallelize llama #969
Comments
zigzagcai
changed the title
Ray spill out of disk error when using alpa to auto-parallelize llama
[Bug] Ray spill out of disk error when using alpa to auto-parallelize llama
Nov 21, 2023
Also, nvidia-smi shows that GPU memory is reserved but GPU utilization is always 0. Memory keeps leaking and the Ray spilled objects keep growing until the out-of-disk error is thrown.
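To confirm that the spill directory is what fills the disk, its size can be sampled while the job runs. The following is only a minimal sketch using the Python standard library: `dir_size_bytes` and `watch` are hypothetical helper names, and the directory to watch is an assumption, since the actual Ray spill location depends on how the cluster was started and configured.

```python
import os
import time


def dir_size_bytes(path):
    """Return the total size in bytes of all regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # A file may disappear between listing and stat; skip it.
                pass
    return total


def watch(path, interval_s=10.0, samples=3):
    """Sample the directory size `samples` times, `interval_s` seconds apart."""
    readings = []
    for i in range(samples):
        readings.append(dir_size_bytes(path))
        if i < samples - 1:
            time.sleep(interval_s)
    return readings


if __name__ == "__main__":
    # Hypothetical path: replace with the spill directory of your own cluster.
    spill_dir = "/path/to/ray/spill/dir"
    print(watch(spill_dir, interval_s=10.0, samples=3))
```

A steadily increasing sequence of readings while GPU utilization stays at 0 would be consistent with the behavior described here.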
zigzagcai
changed the title
[Bug] Ray spill out of disk error when using alpa to auto-parallelize llama
Ray spill out of disk error when using alpa to auto-parallelize llama
Nov 21, 2023
Update: I have solved this issue by specifying |
Please describe the bug
When I tried to use alpa to parallelize the llama-7b model on a Ray cluster (one node with 8 GPUs), disk usage kept growing without bound due to Ray object spilling. Eventually the program failed with an out-of-disk-space error.
Please describe the expected behavior
Alpa training should run normally.
System information and environment
To Reproduce
Steps to reproduce the behavior:
ray start --head
cd examples/llama_finetune
bash run_llama.sh
Error Logs
cd examples/llama_finetune && bash run_llama.sh