
"cublasLt ran into an error" with older GPU in 8-bit mode #379

Closed
1 task done
wk-mike opened this issue Mar 17, 2023 · 17 comments
Labels
bug, stale

Comments

@wk-mike

wk-mike commented Mar 17, 2023

Describe the bug

My device is a GTX 1650 4GB, i5-12400, 40GB RAM, Ubuntu 20.04, CUDA 11.8.

I have set up llama-7b according to the wiki.
I can run it with python server.py --listen --auto-devices --model llama-7b
and everything goes well!

But I can't run with --load-in-8bit, which according to #366 I should use.
When I start with python server.py --listen --auto-devices --model llama-7b --load-in-8bit
there is no error and everything seems fine, BUT as soon as I click the 'Generate' button in the web UI,

the error below appears in the terminal:

(textgen) wk:text-generation-webui$ python server.py --listen --auto-devices --model llama-7b --load-in-8bit
Loading llama-7b...
Auto-assiging --gpu-memory 3 for your GPU to try to prevent out-of-memory errors.
You can manually set other values.

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /home/wk/anaconda3/envs/textgen did not contain libcudart.so as expected! Searching further paths...
  warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
Loading checkpoint shards: 100%|████████████████| 33/33 [00:06<00:00,  4.81it/s]
Loaded the model in 7.58 seconds.
/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/gradio/deprecation.py:40: UserWarning: The 'type' parameter has been deprecated. Use the Number component instead.
  warnings.warn(value)
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
cuBLAS API failed with status 15
A: torch.Size([16, 4096]), B: torch.Size([4096, 4096]), C: (16, 4096); (lda, ldb, ldc): (c_int(512), c_int(131072), c_int(512)); (m, n, k): (c_int(16), c_int(4096), c_int(4096))
Exception in thread Thread-4 (gentask):
error detectedTraceback (most recent call last):
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/wk/data/text-generation-webui/modules/callbacks.py", line 64, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "/home/wk/data/text-generation-webui/modules/text_generation.py", line 196, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 1452, in generate
    return self.sample(
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 2468, in sample
    outputs = self(
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 772, in forward
    outputs = self.model(
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 621, in forward
    layer_outputs = decoder_layer(
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 316, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 216, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 242, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 377, in forward
    out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1410, in igemmlt
    raise Exception('cublasLt ran into an error!')
Exception: cublasLt ran into an error!

Is there an existing issue for this?

  • I have searched the existing issues

Reproduction

This does not only happen with llama-7b; it can easily be reproduced with other models,
for example:
run python server.py --listen --model opt-1.3b --load-in-8bit

There is no error at startup, BUT as soon as you enter anything in the web UI and click the 'Generate' button,

the error appears in the terminal. It seems the bug has something to do with cublasLt, like a CUDA bug.

And there is no bug with CPU: python server.py --listen --model opt-1.3b --load-in-8bit goes well.
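
To take the web UI out of the picture, a minimal script along the following lines should exercise the same load_in_8bit path. This is only a sketch: the local model path and the prompt are assumptions, not part of the original report.

```python
# Minimal sketch of the same 8-bit path outside the web UI.
# Assumptions: a local model folder like the webui's models/opt-1.3b, plus the
# transformers/accelerate/bitsandbytes stack from the webui requirements.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "models/opt-1.3b"  # assumed path, adjust to your setup
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    load_in_8bit=True,  # same option the --load-in-8bit flag turns on
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
# On the affected GPUs, generate() is where "cublasLt ran into an error!" is raised.
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0]))
```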

Screenshot

No response

Logs

(textgen) wk:text-generation-webui$ python server.py --listen  --model opt-1.3b --load-in-8bit
Loading opt-1.3b...

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /home/wk/anaconda3/envs/textgen did not contain libcudart.so as expected! Searching further paths...
  warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
Loaded the model in 3.34 seconds.
/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/gradio/deprecation.py:40: UserWarning: The 'type' parameter has been deprecated. Use the Number component instead.
  warnings.warn(value)
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
cuBLAS API failed with status 15
A: torch.Size([16, 2048]), B: torch.Size([2048, 2048]), C: (16, 2048); (lda, ldb, ldc): (c_int(512), c_int(65536), c_int(512)); (m, n, k): (c_int(16), c_int(2048), c_int(2048))
Exception in thread Thread-3 (gentask):
error detectedTraceback (most recent call last):
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/wk/data/text-generation-webui/modules/callbacks.py", line 64, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "/home/wk/data/text-generation-webui/modules/text_generation.py", line 196, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 1452, in generate
    return self.sample(
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 2468, in sample
    outputs = self(
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 930, in forward
    outputs = self.model.decoder(
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 696, in forward
    layer_outputs = decoder_layer(
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 326, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 171, in forward
    query_states = self.q_proj(hidden_states) * self.scaling
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 242, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 377, in forward
    out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1410, in igemmlt
    raise Exception('cublasLt ran into an error!')
Exception: cublasLt ran into an error!


System Info

My device is a GTX 1650 4GB, i5-12400, 40GB RAM, Ubuntu 20.04, CUDA 11.8.
wk-mike added the bug label Mar 17, 2023
@rafx85

rafx85 commented Mar 17, 2023

I got the same error. I'm using 2 GPUs and trying to run pygmalion-2.7b in 8-bit. I'm on Windows.

My start-webui.bat file:
call python server.py --auto-devices --cai-chat --share --gpu-memory 5 3 --load-in-8bit

I also followed this: https://www.reddit.com/r/PygmalionAI/comments/1115gom/running_pygmalion_6b_with_8gb_of_vram/
But I use libbitsandbytes_cudaall.dll for my GeForce 1660 + 960 cards.

@oobabooga
Owner

This also happens to me on a GTX 1650 GPU.

oobabooga changed the title "--load-in-8bit" might get conflict with CUDA GPU → cublasLt ran into an error with older GPU in 8-bit mode Mar 17, 2023
oobabooga changed the title cublasLt ran into an error with older GPU in 8-bit mode → "cublasLt ran into an error" with older GPU in 8-bit mode Mar 17, 2023
@sgsdxzy
Contributor

sgsdxzy commented Mar 17, 2023

I think 8-bit in bitsandbytes requires Turing (20xx) or later: https://github.com/TimDettmers/bitsandbytes#requirements--installation

LLM.int8(): NVIDIA Turing (RTX 20xx; T4) or Ampere GPU (RTX 30xx; A4-A100); (a GPU from 2018 or older).
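
If it helps anyone check their own card against that requirement, here is a quick sketch for printing what PyTorch reports; note that the CUDA SETUP lines in the logs above already show 7.5 for the GTX 1650.

```python
# Print the compute capability PyTorch reports for each visible GPU,
# to compare against the Turing/Ampere requirement quoted above.
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} -> compute capability {major}.{minor}")
```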

@Ph0rk0z
Contributor

Ph0rk0z commented Mar 17, 2023

On older GPUs it will NEVER work with the int8 threshold at 6. But I get a NaN error and not this error on my P6000. I am using the pre-"fixed" bitsandbytes that never completed the "CUDA setup" part.

I'll try it with the new bitsandbytes that I don't have to patch and see if I get this error instead.

But best believe that it is possible.

[screenshot: 8bitPascal]
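
For reference, the "threshold at 6" is the llm_int8_threshold that LLM.int8() uses for its fp16 outlier decomposition. A rough sketch of how it can be changed when loading directly through transformers, assuming a transformers version that exposes BitsAndBytesConfig (this is not the patch referred to in the comment above):

```python
# Rough sketch of overriding the LLM.int8() outlier threshold via transformers.
# Assumes a transformers version that exposes BitsAndBytesConfig.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=0.0,  # default is 6.0; 0.0 disables the mixed-precision outlier path
)
model = AutoModelForCausalLM.from_pretrained(
    "models/llama-7b",       # assumed local path
    device_map="auto",
    quantization_config=quant_config,
)
```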

@askmyteapot
Contributor

askmyteapot commented Mar 17, 2023

So I have been having this error too.

My setup is:
Ryzen 5800X, 32GB DDR4 (25GB ZRAM compressed swap), 3060 Ti (8GB) and 2080 Super (8GB)
Ubuntu 22.04
CUDA 11.8
PyTorch 2.0+cu118

I got it to generate by setting export CUDA_VISIBLE_DEVICES=1,0 (on Windows, use set CUDA_VISIBLE_DEVICES=1,0, but I haven't tested that yet).
Note that I swapped the numbers around; setting it to 0,1 always resulted in the error.

This doesn't help those with single GPUs, but it's a start, I hope.
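
The same device reordering can also be done from inside Python, as long as it happens before CUDA is initialised. A sketch (the 1,0 order is just the example from this comment):

```python
# Equivalent of `export CUDA_VISIBLE_DEVICES=1,0`, set from Python.
# Must run before torch initialises CUDA, e.g. at the very top of the launcher script.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1,0"  # swapped order, per the comment above

import torch
print([torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])
```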

@lolxdmainkaisemaanlu

lolxdmainkaisemaanlu commented Mar 17, 2023

I am on Windows 11 and I am able to load the LLama 7b model in 4bit on my GTX 1060 6GB using the 'allarch' 0.37.0 bitsandbytes from this repo - https://github.com/james-things/bitsandbytes-prebuilt-all_arch.

I thought it would be working natively on Linux since the author of bitsandbytes made the int8 function backward compatible so that even Pascal cards can run it. Perhaps you need to compile the .so again like windows users use a fixed .dll? Not sure.

I'm sure there is a solution to this 110%. My card is older than yours and 4bit is working fine on it. See if the instructions here - https://www.reddit.com/r/LocalLLaMA/comments/11o6o3f/how_to_install_llama_8bit_and_4bit/ help you? I was finally able to get 4bit working after following the instructions here.

@askmyteapot
Contributor

askmyteapot commented Mar 17, 2023

> I am on Windows 11 and I am able to load the LLama 7b model in 4bit on my GTX 1060 6GB using the 'allarch' 0.37.0 bitsandbytes from this repo - https://github.com/james-things/bitsandbytes-prebuilt-all_arch.
>
> I thought it would be working natively on Linux since the author of bitsandbytes made the int8 function backward compatible so that even Pascal cards can run it. Perhaps you need to compile the .so again like windows users use a fixed .dll
>
> I'm sure there is a solution to this 110%. My card is older than yours and 4bit is working fine on it.

I compiled bitsandbytes from source as well as trying the pip package, just to avoid any issue with the .so.
4-bit works; it's only 8-bit that was causing me headaches. And it looks like we need 8-bit to use LoRAs with LLaMA.

@lolxdmainkaisemaanlu

> I compiled from source for bitsandbytes as well as trying the pip package, just to avoid the issue of the .so 4bit works. its only 8bit that was causing me headaches. And it looks like we need 8bit to use LoRAs in LLaMa

I think your issue might be related to an improper installation, because from what I understand these 8-bit issues only occur on older GPUs from the 1xxx series and lower. Your 2080 Super and 3060 Ti are perfectly compatible even with the native int8 function from bitsandbytes; you shouldn't have any need to compile from source...

Perhaps try running in 16-bit. You have 16GB of VRAM, which should be more than enough.

@wk-mike
Author

wk-mike commented Mar 17, 2023

@lolxdmainkaisemaanlu thank you,

Can you tell me which folder I should put
bitsandbytes-prebuilt-all_arch/0.37.0/libbitsandbytes_cudaall.dll
into?

And do I have to change any code in this web UI?

@rafx85

rafx85 commented Mar 17, 2023

installer_files\env\lib\site-packages\bitsandbytes\

Put it there, but I still get the same bug.

@rafx85

rafx85 commented Mar 17, 2023

> I am on Windows 11 and I am able to load the LLama 7b model in 4bit on my GTX 1060 6GB using the 'allarch' 0.37.0 bitsandbytes from this repo - https://github.com/james-things/bitsandbytes-prebuilt-all_arch.
>
> I thought it would be working natively on Linux since the author of bitsandbytes made the int8 function backward compatible so that even Pascal cards can run it. Perhaps you need to compile the .so again like windows users use a fixed .dll? Not sure.
>
> I'm sure there is a solution to this 110%. My card is older than yours and 4bit is working fine on it. See if the instructions here - https://www.reddit.com/r/LocalLLaMA/comments/11o6o3f/how_to_install_llama_8bit_and_4bit/ help you? I was finally able to get 4bit working after following the instructions here.

Did you try to run it in 8-bit? Do you get the error then, or not?

@Ph0rk0z
Contributor

Ph0rk0z commented Mar 18, 2023

I came here to tell you that the newly accepted transformers is slow for me, and I have no clue what is wrong on your cards or why mine works.

I patch models.py like this: https://pastebin.com/siPxZvkc

And then I can generate away: https://pastebin.com/R3JCmJ9L

I can even use a LoRA just fine.
[screenshot: lora]

The fixed bitsandbytes from PyPI works, it's just more verbose in its messages.

@Mar2ck

Mar2ck commented Mar 30, 2023

I also get this error on a GTX 1660 Ti. I'm guessing this means the GTX 16XX series isn't compatible, despite also being Turing architecture.

@Mar2ck

Mar2ck commented Apr 12, 2023

Looks like the GTX 16XX series does support 8-bit; it just wasn't enabled in bitsandbytes until now (bitsandbytes-foundation/bitsandbytes#292). So starting with bitsandbytes 0.38.0 these GPUs should work.

EDIT: Just tested with bitsandbytes upgraded to 0.38.0.post2 on a GTX 1660 Ti and it works perfectly.
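
After upgrading (e.g. pip install "bitsandbytes>=0.38.0" inside the webui environment), a quick sanity check that the new version is actually the one being imported — just a sketch:

```python
# Sanity check that the upgraded bitsandbytes is the one the environment imports.
from importlib.metadata import version

print("bitsandbytes:", version("bitsandbytes"))  # should be >= 0.38.0 for GTX 16xx, per the PR above
```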

@darrenwang00

Try rebuilding bitsandbytes from https://github.com/TimDettmers/bitsandbytes
My env: GeForce RTX 3090, Driver Version: 510.47.03, CUDA Version: 11.6

Fix steps:
git clone https://github.com/timdettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=116 make cuda116
python setup.py install
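
After a rebuild like this, a small smoke test can confirm the 8-bit matmul path works without launching the whole web UI. A sketch; the layer sizes are arbitrary:

```python
# Smoke test for the rebuilt library: one 8-bit linear layer on the GPU.
# This goes through the same igemmlt path that raises "cublasLt ran into an error!".
import torch
import bitsandbytes as bnb

layer = bnb.nn.Linear8bitLt(256, 256, has_fp16_weights=False, threshold=6.0)
layer = layer.to("cuda")  # weights are quantized to int8 on transfer
x = torch.randn(4, 256, dtype=torch.float16, device="cuda")
print(layer(x).shape)  # expect torch.Size([4, 256]) if the 8-bit kernels work
```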

@bekhzod-olimov

I had the same issue when I wanted to load the model in 8-bit. Loading the model in 4-bit (load-in-4bit=True) solved my problem.
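
In the web UI that corresponds to the --load-in-4bit flag. When loading through transformers directly, the rough equivalent looks like this, assuming a transformers/bitsandbytes combination new enough to support 4-bit (newer than the versions shown in the logs above):

```python
# Rough 4-bit equivalent of load-in-4bit=True, assuming a transformers /
# bitsandbytes stack new enough to support 4-bit loading.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "models/llama-7b",  # assumed local path
    device_map="auto",
    load_in_4bit=True,
)
```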

github-actions bot added the stale label Jan 5, 2024

github-actions bot commented Jan 5, 2024

This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.
