
Error: torch.distributed.elastic.multiprocessing.errors.ChildFailedError #32

Open
qingkongby opened this issue Jan 11, 2025 · 3 comments

Comments

@qingkongby

qingkongby commented Jan 11, 2025

I set up the environment following the requirements and launched the GKD training script, and hit the error above. Has this come up before, and how was it resolved?

@mst272
Owner

mst272 commented Jan 13, 2025

Is the GPU count set incorrectly, or could the CUDA version not match the torch version?
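A quick way to check the second point (a minimal sketch of my own, not from this repo) is to print what torch was built against and what it can actually see:

```python
# Environment sanity check (my own sketch, not part of LLM-Dojo):
# prints the torch version, the CUDA version torch was built with,
# and the GPUs visible to this process.
import torch

print("torch version:", torch.__version__)
print("torch built with CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
print("visible GPU count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  GPU {i}:", torch.cuda.get_device_name(i))
```

If `torch.version.cuda` is far from the driver's runtime shown by `nvidia-smi`, or the visible GPU count differs from what the launch script assumes, that mismatch alone can make the child processes die with ChildFailedError.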

@qingkongby
Author

qingkongby commented Jan 13, 2025

The GPU count was aligned by changing the "num_processes" field in the zero3 config file. With the current CUDA and torch versions I have run single-node multi-GPU SFT training through llama-factory without any problem. Are there other factors I should check?
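One more thing worth verifying here (a sketch of my own, assuming the same `accelerate launch` / `torchrun` command is used as for the GKD script): print the rank variables each child process receives, to confirm that `num_processes` and the GPUs actually visible to each worker agree.

```python
# Launcher check (my own sketch): run this file with the same launcher as the
# training script; every spawned process should print a consistent world size
# and see the expected number of GPUs.
import os
import torch

if __name__ == "__main__":
    rank = int(os.environ.get("RANK", "0"))
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    print(
        f"rank={rank} local_rank={local_rank} world_size={world_size} "
        f"visible_gpus={torch.cuda.device_count()}"
    )
```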

@mst272
Owner

mst272 commented Jan 13, 2025

1. Reduce the batch_size and the model size to test whether it is an OOM problem.
2. To rule out environment issues, try installing each of the following separately (see the smoke-test sketch below):
(1) https://github.com/mst272/LLM-Dojo/blob/main/requirements.txt
(2) https://github.com/mst272/LLM-Dojo/blob/main/rlhf/requirements.txt
(3) torch==2.2.2 deepspeed==0.14.2
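For the environment check in point 2, a small NCCL smoke test can separate the environment from the training code (a minimal sketch of my own; the file name and launch command are placeholders, not part of this repo):

```python
# smoke_test.py -- minimal NCCL all_reduce test (my own sketch).
# Launch with:  torchrun --nproc_per_node=<num_gpus> smoke_test.py
# If this hangs or crashes, the problem is in the driver/CUDA/torch/NCCL stack,
# not in the GKD training script.
import os
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    x = torch.ones(1, device=f"cuda:{local_rank}") * dist.get_rank()
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: all_reduce sum = {x.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

If the smoke test passes but the GKD script still fails, the real traceback is usually printed above the ChildFailedError wrapper in the log and should point at the actual cause (OOM, import error, etc.).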
