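I installed the environment according to the requirements file and launched training with the gkd startup script, and ran into the problem above. Has anyone encountered this before, and how was it resolved?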
Could the number of GPUs be set incorrectly, or could the CUDA version and the torch version be mismatched?
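A quick way to check both points is to print what torch itself sees. The following is only a minimal sketch: the helper name `collect_env_info` is made up for illustration, and the GPU count it reports still has to be compared by hand with the process count in the launch config.

```python
import importlib.util


def collect_env_info():
    """Return a dict describing the torch / CUDA setup visible to this process."""
    info = {"torch": None, "cuda_build": None, "cuda_available": False, "gpu_count": 0}
    if importlib.util.find_spec("torch") is None:
        return info  # torch is not installed in this environment

    import torch

    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda  # None for CPU-only builds
    info["cuda_available"] = torch.cuda.is_available()
    info["gpu_count"] = torch.cuda.device_count()
    return info


if __name__ == "__main__":
    for key, value in collect_env_info().items():
        print(f"{key}: {value}")
```

If `cuda_available` is False or `gpu_count` is lower than the number of processes being launched, the mismatch is likely in the environment rather than in the training script.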
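The GPU count is aligned: I updated the "num_processes" field in the zero3 config file to match. As for the current CUDA and torch versions, I have run single-node multi-GPU SFT training with llama-factory on them without any problems. Are there any other possible causes?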
1. Reduce batch_size and the model size to check whether this is an OOM problem.
2. Rule out environment problems by trying each of the following installs in turn:
   (1) https://github.com/mst272/LLM-Dojo/blob/main/requirements.txt
   (2) https://github.com/mst272/LLM-Dojo/blob/main/rlhf/requirements.txt
   (3) torch==2.2.2 deepspeed==0.14.2
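To see whether the active environment already matches the pins in option (3), something like the sketch below can help. It uses only the standard library; `PINS` and `check_pins` are illustrative names, and ignoring a local suffix such as "+cu121" when comparing versions is an assumption about how the torch wheel was built.

```python
from importlib import metadata

# Versions suggested in option (3) above.
PINS = {"torch": "2.2.2", "deepspeed": "0.14.2"}


def check_pins(pins):
    """Map each package name to (installed_version, expected_version, matches)."""
    report = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        # Compare only the base version, dropping any local suffix like "+cu121".
        base = installed.split("+")[0] if installed else None
        report[name] = (installed, expected, base == expected)
    return report


if __name__ == "__main__":
    for name, (installed, expected, ok) in check_pins(PINS).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: installed={installed} expected={expected} [{status}]")
```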