Runs on a single GPU, but multi-GPU fails with raise Exception('cublasLt ran into an error!') #3
Comments
@kaihe Thank you very much for raising this issue.
Thanks for the reply.
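@kaihe For now this still looks like a dependency problem, since the only difference between the single-GPU and multi-GPU runs is the DDP part. You could try setting up a Python 3.10 environment and check whether the error still occurs. We will provide more detailed version information later. Here is a Python 3.10 configuration that runs on multiple GPUs, for reference: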
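Found the problem: one of my 4 GPUs appears to be faulty. I tried different combinations of the cards and found that the error occurs whenever that particular card is used, and never occurs with the other cards.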
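For anyone who wants to repeat this kind of elimination, here is a minimal sketch of the idea, not the script actually used in this thread. It assumes a hypothetical callback `run_ok(group)` that runs the job on the given GPU indices and returns `True` if it finishes without the cublasLt error. The function name `find_suspect_gpus` and the stand-in `fake_run_ok` are made up for illustration.

```python
from itertools import combinations
from typing import Callable, Iterable, List, Set, Tuple


def find_suspect_gpus(
    gpu_ids: Iterable[int],
    group_size: int,
    run_ok: Callable[[Tuple[int, ...]], bool],
) -> Set[int]:
    """Try every group of `group_size` GPUs and narrow down the bad card(s).

    A GPU is a suspect if it appears in every failing group and in no
    passing group. `run_ok` is a hypothetical callback supplied by the caller.
    """
    failing: List[Set[int]] = []   # one set of GPU ids per failed run
    passing: Set[int] = set()      # every GPU id seen in a successful run
    for group in combinations(sorted(gpu_ids), group_size):
        if run_ok(group):
            passing.update(group)
        else:
            failing.append(set(group))
    if not failing:
        return set()
    # GPUs common to all failing runs, minus any that also worked elsewhere.
    return set.intersection(*failing) - passing


def fake_run_ok(group: Tuple[int, ...]) -> bool:
    # Stand-in for a real launch: pretend GPU 2 is the faulty card.
    return 2 not in group


print(find_suspect_gpus(range(4), 2, fake_run_ok))  # prints {2}
```

In a real run, `run_ok` could launch the training script in a subprocess with the `CUDA_VISIBLE_DEVICES` environment variable set to the comma-joined indices of `group`, and report whether the process exited cleanly. The exact launch command depends on your setup and is left out here.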
out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
File "/root/miniconda3/lib/python3.8/site-packages/bitsandbytes/functional.py", line 1410, in igemmlt
raise Exception('cublasLt ran into an error!')
The exception is raised at this point in bitsandbytes.functional, where has_error == 1.
The matrix tensors printed at the time of the error:
error detectedA: torch.Size([512, 4096]), B: torch.Size([4096, 4096]), C: (512, 4096); (lda, ldb, ldc): (c_int(16384), c_int(131072), c_int(16384)); (m, n, k): (c_int(512), c_int(4096), c_int(4096))