
Error during training on a 3070 Ti: cublasLt ran into an error! #41

Closed
vegech1cken opened this issue Apr 6, 2023 · 5 comments
Labels
good first issue Good for newcomers

Comments

@vegech1cken

I get an error when training with the finetune.py script.
Command: python finetune.py --data_path merge.json --test_size 20
Training environment:
3070 Ti, 8 GB VRAM
pytorch 2.0.0+cu117
cuda 11.3

Error:
error detected
Traceback (most recent call last):
File "finetune.py", line 274, in
trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
File "/hy-tmp/transformers/src/transformers/trainer.py", line 1659, in train
return inner_training_loop(
File "/hy-tmp/transformers/src/transformers/trainer.py", line 1926, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/hy-tmp/transformers/src/transformers/trainer.py", line 2696, in training_step
loss = self.compute_loss(model, inputs)
File "/hy-tmp/transformers/src/transformers/trainer.py", line 2728, in compute_loss
outputs = model(**inputs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/peft/peft_model.py", line 575, in forward
return self.base_model(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/hy-tmp/transformers/src/transformers/models/llama/modeling_llama.py", line 687, in forward
outputs = self.model(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/hy-tmp/transformers/src/transformers/models/llama/modeling_llama.py", line 569, in forward
layer_outputs = torch.utils.checkpoint.checkpoint(
File "/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py", line 249, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py", line 107, in forward
outputs = run_function(*args)
File "/hy-tmp/transformers/src/transformers/models/llama/modeling_llama.py", line 565, in custom_forward
return module(*inputs, output_attentions, None)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/hy-tmp/transformers/src/transformers/models/llama/modeling_llama.py", line 292, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/hy-tmp/transformers/src/transformers/models/llama/modeling_llama.py", line 196, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/peft/tuners/lora.py", line 591, in forward
result = super().forward(x)
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/nn/modules.py", line 242, in forward
out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
return MatMul8bitLt.apply(A, B, out, bias, state)
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/autograd/_functions.py", line 377, in forward
out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/functional.py", line 1410, in igemmlt
raise Exception('cublasLt ran into an error!')
Exception: cublasLt ran into an error!

@vegech1cken
Author

Could anyone help me figure this out? Thanks!

@Facico
Owner

Facico commented Apr 6, 2023

@vegech1cken This is a bug inside peft; it can be triggered by any of the following:
1. In a multi-GPU environment, if you don't specify which card to use, it will automatically load onto the other GPUs (check with nvidia-smi). See problems for details; pin the GPU with CUDA_VISIBLE_DEVICES=xxx (a sketch follows this list).
2. Not enough VRAM. For example, if another program is already running on the card, this error can appear once memory runs out.
3. One of the cards is faulty.
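For example, a minimal sketch reusing the command from this issue (the GPU index 0 is an assumption, pick whichever card is free):

```bash
# Pin the run to one card (GPU 0 here, adjust the index to your machine)
# so peft/bitsandbytes does not try to load weights onto other GPUs.
CUDA_VISIBLE_DEVICES=0 python finetune.py --data_path merge.json --test_size 20

# Check which cards are visible and how much memory is already in use.
nvidia-smi
```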

@Facico Facico added duplicate This issue or pull request already exists and removed duplicate This issue or pull request already exists labels Apr 11, 2023
@Facico Facico closed this as completed Apr 21, 2023
@Facico Facico added the good first issue Good for newcomers label Apr 21, 2023
@sightsIndeep

I found that installing torchvision fixed it.
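If you want to try the same thing, a minimal sketch (letting pip pick the torchvision version that pairs with torch 2.0.0+cu117 is an assumption):

```bash
# Install torchvision alongside the existing torch build; pin the version
# explicitly if your torch install requires a specific pairing.
pip install torchvision
```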

@zhangyue2709

Has this been resolved? I'm running into the same problem.

@vegech1cken
Author

vegech1cken commented Jun 24, 2023 via email
