Runs on a single GPU but fails on multiple GPUs: raise Exception('cublasLt ran into an error!') #3

kaihe · 2023-03-24T04:43:43Z

out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
File "/root/miniconda3/lib/python3.8/site-packages/bitsandbytes/functional.py", line 1410, in igemmlt
raise Exception('cublasLt ran into an error!')

The failure comes from this part of bitsandbytes.functional, where has_error ends up equal to 1:

if formatB == 'col_turing':
    if dtype == torch.int32:
        has_error = lib.cigemmlt_turing_32(
            ptr, m, n, k, ptrA, ptrB, ptrC, ptrRowScale, lda, ldb, ldc
        )
    else:
        has_error = lib.cigemmlt_turing_8(
            ptr, m, n, k, ptrA, ptrB, ptrC, ptrRowScale, lda, ldb, ldc
        )
elif formatB == "col_ampere":
    if dtype == torch.int32:
        has_error = lib.cigemmlt_ampere_32(
            ptr, m, n, k, ptrA, ptrB, ptrC, ptrRowScale, lda, ldb, ldc
        )
    else:
        has_error = lib.cigemmlt_ampere_8(
            ptr, m, n, k, ptrA, ptrB, ptrC, ptrRowScale, lda, ldb, ldc
        )

The matrix tensors printed when the error is detected:
error detectedA: torch.Size([512, 4096]), B: torch.Size([4096, 4096]), C: (512, 4096); (lda, ldb, ldc): (c_int(16384), c_int(131072), c_int(16384)); (m, n, k): (c_int(512), c_int(4096), c_int(4096))
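As a quick sanity check on the logged values, each leading dimension in the output above equals 32 times the row count of the corresponding matrix. The sketch below only verifies that arithmetic. The factor of 32 is an assumption inferred from the numbers (a 32-column tiled layout), not something confirmed in this thread, and `tiled_leading_dim` is a hypothetical helper name.

```python
# Sanity check: are the logged leading dimensions consistent with the shapes?
# ASSUMPTION: leading dimension = rows * 32 (inferred from the logged numbers).

def tiled_leading_dim(rows: int, tile_cols: int = 32) -> int:
    """Return rows * tile_cols, the assumed leading dimension of a tiled layout."""
    return rows * tile_cols

# Values copied from the error output in this issue.
logged = {"lda": 16384, "ldb": 131072, "ldc": 16384}
# Row counts of A (512 x 4096), B (4096 x 4096) and C (512 x 4096).
rows = {"lda": 512, "ldb": 4096, "ldc": 512}

consistent = {name: tiled_leading_dim(rows[name]) == logged[name] for name in logged}
print(consistent)
```

If all three entries come out consistent, the shapes and strides passed to the kernel look well-formed, which points away from a shape bug in the calling code.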


Facico · 2023-03-24T06:11:38Z

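@kaihe Thank you very much for raising this.
I found a similar issue that involves mixing CPU and GPU placement with 8-bit loading: huggingface/transformers#21371
Did you train with the original parameters? If not, could you share your parameters in more detail?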

kaihe · 2023-03-24T07:35:01Z

Thanks for the reply.
I used your original parameters and did not deliberately try to mix GPU and CPU.

device_map = "auto"
world_size = int(os.environ.get("WORLD_SIZE", 1))
ddp = world_size != 1
if ddp:
    device_map = {"": int(os.environ.get("LOCAL_RANK") or 0)}
    GRADIENT_ACCUMULATION_STEPS = GRADIENT_ACCUMULATION_STEPS // world_size
print(args.model_path)
model = LlamaForCausalLM.from_pretrained(
    args.model_path,
    load_in_8bit=True,
    device_map=device_map,
)
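To make the single-GPU versus multi-GPU difference explicit, the branch above can be isolated into a small pure function. This is a minimal sketch that mirrors the snippet's logic. `resolve_device_map` is a hypothetical name, and passing the environment as a mapping is an assumption made so the behaviour can be checked without launching a distributed job.

```python
from typing import Dict, Mapping, Tuple, Union

DeviceMap = Union[str, Dict[str, int]]


def resolve_device_map(env: Mapping[str, str], grad_accum_steps: int) -> Tuple[DeviceMap, int]:
    """Mirror the snippet: 'auto' on one process, one GPU per process under DDP."""
    world_size = int(env.get("WORLD_SIZE", 1))
    if world_size == 1:
        # Single process: placement is left to device_map="auto".
        return "auto", grad_accum_steps
    # DDP: pin the whole model ("" means every module) to this process's GPU
    # and split the accumulation steps across processes.
    local_rank = int(env.get("LOCAL_RANK") or 0)
    return {"": local_rank}, grad_accum_steps // world_size


# Single-GPU run: no WORLD_SIZE in the environment.
print(resolve_device_map({}, 32))
# Third process of a 4-GPU DDP run.
print(resolve_device_map({"WORLD_SIZE": "4", "LOCAL_RANK": "2"}, 32))
```

In the real script the call would be `resolve_device_map(os.environ, GRADIENT_ACCUMULATION_STEPS)`. Under DDP every process loads a full 8-bit copy of the model onto its own card, so a multi-GPU run touches cards that a single-GPU run may never use.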

Facico · 2023-03-24T08:10:08Z

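@kaihe For now this still looks like a dependency issue, since the only difference between the single-GPU and multi-GPU runs is the ddp branch. You could try setting up a Python 3.10 environment and see whether the problem persists. We will provide a more detailed version configuration later. Here is a Python 3.10 configuration that is known to run on multiple GPUs, for reference: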

torch                    1.13.1
torchtyping              0.1.4
torchvision              0.14.1
absl-py                  1.4.0
accelerate               0.15.0
aiodns                   3.0.0
aiofiles                 23.1.0
aiohttp                  3.8.3
aiosignal                1.3.1
altair                   4.2.2
anyio                    3.6.2
appdirs                  1.4.4
async-timeout            4.0.2
attrs                    22.2.0
beautifulsoup4           4.11.2
bitsandbytes             0.37.0
Brotli                   1.0.9
cachetools               5.3.0
certifi                  2022.12.7
cffi                     1.15.1
charset-normalizer       2.1.1
click                    8.1.3
contourpy                1.0.7
cpm-kernels              1.0.11
cycler                   0.11.0
datasets                 2.8.0
deepspeed                0.7.7
dill                     0.3.6
distlib                  0.3.6
docker-pycreds           0.4.0
einops                   0.6.0
entrypoints              0.4
evaluate                 0.4.0
fastapi                  0.95.0
ffmpy                    0.3.0
filelock                 3.9.0
fire                     0.5.0
flash-attn               0.2.8
fonttools                4.39.2
frozenlist               1.3.3
fsspec                   2023.3.0
gdown                    4.6.4
gensim                   3.8.2
gitdb                    4.0.10
GitPython                3.1.31
google-auth              2.16.2
google-auth-oauthlib     0.4.6
gradio                   3.23.0
grpcio                   1.51.3
h11                      0.14.0
hjson                    3.1.0
httpcore                 0.16.3
httpx                    0.23.3
huggingface-hub          0.13.3
icetk                    0.0.5
idna                     3.4
inflate64                0.3.1
Jinja2                   3.1.2
joblib                   1.2.0
jsonlines                3.1.0
jsonschema               4.17.3
kiwisolver               1.4.4
linkify-it-py            2.0.0
loguru                   0.6.0
loralib                  0.1.1
Markdown                 3.4.1
markdown-it-py           2.2.0
MarkupSafe               2.1.2
matplotlib               3.7.1
mdit-py-plugins          0.3.3
mdurl                    0.1.2
msgpack                  1.0.4
multidict                6.0.4
multiprocess             0.70.14
multivolumefile          0.2.3
networkx                 3.0
ninja                    1.11.1
nltk                     3.8.1
numpy                    1.24.2
nvidia-cublas-cu11       11.10.3.66
nvidia-cuda-nvrtc-cu11   11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11        8.5.0.96
nvidia-ml-py             11.525.84
nvitop                   1.0.0
oauthlib                 3.2.2
openai                   0.27.2
orjson                   3.8.8
packaging                23.0
pandas                   1.5.3
pathtools                0.1.2
peft                     0.3.0.dev0
Pillow                   9.4.0
pip                      22.3.1
platformdirs             3.1.0
protobuf                 3.20.1
psutil                   5.9.4
py-cpuinfo               9.0.0
py7zr                    0.20.4
pyarrow                  11.0.0
pyasn1                   0.4.8
pyasn1-modules           0.2.8
pybcj                    1.0.1
pycares                  4.3.0
pycparser                2.21
pycryptodomex            3.17
pydantic                 1.10.4
pydub                    0.25.1
Pygments                 2.14.0
pyparsing                3.0.9
pyppmd                   1.0.0
pyrsistent               0.19.3
PySocks                  1.7.1
python-dateutil          2.8.2
python-multipart         0.0.6
pytz                     2022.7.1
PyYAML                   6.0
pyzstd                   0.15.4
ray                      2.3.0
regex                    2022.10.31
requests                 2.28.2
requests-oauthlib        1.3.1
responses                0.18.0
rfc3986                  1.5.0
rich                     13.3.2
rouge-score              0.1.2
rsa                      4.9
scikit-learn             1.2.0
scipy                    1.10.1
semantic-version         2.10.0
sentencepiece            0.1.97
sentry-sdk               1.16.0
setproctitle             1.3.2
setuptools               65.6.3
six                      1.16.0
smart-open               6.3.0
smmap                    5.0.0
sniffio                  1.3.0
soupsieve                2.4
starlette                0.26.1
tabulate                 0.9.0
tensorboard              2.12.0
tensorboard-data-server  0.7.0
tensorboard-plugin-wit   1.8.1
termcolor                2.2.0
texttable                1.6.7
threadpoolctl            3.1.0
tokenizers               0.13.2
toolz                    0.12.0
tqdm                     4.65.0
transformers             4.28.0.dev0
trlx                     0.3.0
typeguard                2.13.3
typing_extensions        4.5.0
uc-micro-py              1.0.1
urllib3                  1.26.14
uvicorn                  0.21.1
virtualenv               20.20.0
wandb                    0.13.10
websockets               10.4
Werkzeug                 2.2.3
wheel                    0.38.4
xxhash                   3.2.0
yarl                     1.8.2
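One way to compare a local environment against the list above is to parse the two-column text and look up each package's installed version. The following is a rough sketch: `parse_pip_list`, `installed_version` and `diff_against_reference` are hypothetical helper names, and the parser assumes one `name version` pair per line as in the list above.

```python
from importlib import metadata
from typing import Callable, Dict, Optional, Tuple


def parse_pip_list(text: str) -> Dict[str, str]:
    """Parse 'name   version' lines; skip the header and separator rows of `pip list`."""
    versions: Dict[str, str] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2 or parts[0].startswith("-") or parts == ["Package", "Version"]:
            continue
        versions[parts[0]] = parts[1]
    return versions


def installed_version(name: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def diff_against_reference(
    reference: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each mismatching package to (reference version, locally found version)."""
    mismatches: Dict[str, Tuple[str, Optional[str]]] = {}
    for name, wanted in reference.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


reference = parse_pip_list("torch        1.13.1\nbitsandbytes 0.37.0\naccelerate   0.15.0\n")
for name, (wanted, found) in diff_against_reference(reference).items():
    print(f"{name}: reference {wanted}, installed {found}")
```

Checking only the packages most likely to matter here (torch, bitsandbytes, accelerate, transformers, peft) keeps the report short; the full list can be passed in the same way.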

kaihe · 2023-03-24T08:34:59Z

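Found the problem: one of my 4 GPUs appears to be faulty. I tried different combinations of cards and found that the error occurs whenever that card is included, and never occurs with the other cards.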
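The elimination described above can be sketched as a small script: run the job on every pair of cards, and treat any card that never appears in a passing run as a suspect. This is an illustrative sketch, not the procedure actually used in this thread. `find_suspect_gpus` and `make_subprocess_probe` are hypothetical names, and the command passed to the probe is whatever launches your training script.

```python
import os
import subprocess
from itertools import combinations
from typing import Callable, List, Sequence, Tuple


def find_suspect_gpus(
    gpu_ids: Sequence[int],
    run_ok: Callable[[Tuple[int, ...]], bool],
    group_size: int = 2,
) -> List[int]:
    """Return GPUs that never took part in a passing run (empty if nothing failed)."""
    cleared = set()
    saw_failure = False
    for combo in combinations(gpu_ids, group_size):
        if run_ok(combo):
            cleared.update(combo)  # every card in a passing run is considered healthy
        else:
            saw_failure = True
    if not saw_failure:
        return []
    return [gpu for gpu in gpu_ids if gpu not in cleared]


def make_subprocess_probe(command: Sequence[str]) -> Callable[[Tuple[int, ...]], bool]:
    """Build a probe that runs `command` with only the given GPUs visible."""
    def run_ok(combo: Tuple[int, ...]) -> bool:
        env = dict(os.environ)
        env["CUDA_VISIBLE_DEVICES"] = ",".join(str(gpu) for gpu in combo)
        return subprocess.run(list(command), env=env).returncode == 0
    return run_ok


# Simulated example: pretend card 2 is the faulty one.
simulated_probe = lambda combo: 2 not in combo
print(find_suspect_gpus([0, 1, 2, 3], simulated_probe))
```

With real hardware, the simulated probe would be replaced by `make_subprocess_probe([...])` wrapping a short training or inference command, so that each combination is exercised through `CUDA_VISIBLE_DEVICES`.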

kaihe closed this as completed Mar 24, 2023

NanoCode012 mentioned this issue Jun 24, 2023

[Bug] Exception: cublasLt ran into an error! during fine-tuning LLM in 8bit mode bitsandbytes-foundation/bitsandbytes#538

Closed

18065013 mentioned this issue Jul 17, 2023

Multi-GPU finetune_chat fails with mat1 and mat2 shapes cannot be multiplied (1024x2 and 1x11008) #240

Open
