
Error during training on a 3070 Ti: cublasLt ran into an error! #41

Closed
vegech1cken opened this issue Apr 6, 2023 · 5 comments
Labels
good first issue Good for newcomers

Comments

@vegech1cken

I get an error when training with the finetune.py script.
Command: python finetune.py --data_path merge.json --test_size 20
Training environment:
3070 Ti, 8 GB VRAM
pytorch 2.0.0+cu117
cuda 11.3

Error:
error detected
Traceback (most recent call last):
File "finetune.py", line 274, in
trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
File "/hy-tmp/transformers/src/transformers/trainer.py", line 1659, in train
return inner_training_loop(
File "/hy-tmp/transformers/src/transformers/trainer.py", line 1926, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/hy-tmp/transformers/src/transformers/trainer.py", line 2696, in training_step
loss = self.compute_loss(model, inputs)
File "/hy-tmp/transformers/src/transformers/trainer.py", line 2728, in compute_loss
outputs = model(**inputs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/peft/peft_model.py", line 575, in forward
return self.base_model(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/hy-tmp/transformers/src/transformers/models/llama/modeling_llama.py", line 687, in forward
outputs = self.model(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/hy-tmp/transformers/src/transformers/models/llama/modeling_llama.py", line 569, in forward
layer_outputs = torch.utils.checkpoint.checkpoint(
File "/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py", line 249, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py", line 107, in forward
outputs = run_function(*args)
File "/hy-tmp/transformers/src/transformers/models/llama/modeling_llama.py", line 565, in custom_forward
return module(*inputs, output_attentions, None)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/hy-tmp/transformers/src/transformers/models/llama/modeling_llama.py", line 292, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/hy-tmp/transformers/src/transformers/models/llama/modeling_llama.py", line 196, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/peft/tuners/lora.py", line 591, in forward
result = super().forward(x)
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/nn/modules.py", line 242, in forward
out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
return MatMul8bitLt.apply(A, B, out, bias, state)
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/autograd/_functions.py", line 377, in forward
out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/functional.py", line 1410, in igemmlt
raise Exception('cublasLt ran into an error!')
Exception: cublasLt ran into an error!

@vegech1cken
Author

Could anyone help me figure this out? Thanks!

@Facico
Owner

Facico commented Apr 6, 2023

@vegech1cken This is a bug inside peft; it can be triggered by any of the following:
1. In a multi-GPU environment, if you don't specify which card to use, it will automatically load onto the other GPUs (check with nvidia-smi). See problems for details; pin the GPU with CUDA_VISIBLE_DEVICES=xxx (a sketch follows this list).
2. Not enough VRAM. For example, if another program is already running on the card, this error can appear once memory runs out.
3. One of the cards is faulty.
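For example, a minimal sketch reusing the command from this issue (the GPU index 0 is an assumption, pick whichever card is free):

```bash
# Pin the run to one card (GPU 0 here, adjust the index to your machine)
# so peft/bitsandbytes does not try to load weights onto other GPUs.
CUDA_VISIBLE_DEVICES=0 python finetune.py --data_path merge.json --test_size 20

# Check which cards are visible and how much memory is already in use.
nvidia-smi
```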

@Facico Facico added duplicate This issue or pull request already exists and removed duplicate This issue or pull request already exists labels Apr 11, 2023
@Facico Facico closed this as completed Apr 21, 2023
@Facico Facico added the good first issue Good for newcomers label Apr 21, 2023
@sightsIndeep

I found that installing torchvision fixed it.
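If you want to try the same thing, a minimal sketch (letting pip pick the torchvision version that pairs with torch 2.0.0+cu117 is an assumption):

```bash
# Install torchvision alongside the existing torch build; pin the version
# explicitly if your torch install requires a specific pairing.
pip install torchvision
```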

@zhangyue2709

Has this been resolved? I'm running into the same problem.

@vegech1cken
Author

vegech1cken commented Jun 24, 2023 via email
