Runs on a single GPU, but fails on multiple GPUs with raise Exception('cublasLt ran into an error!') #3

Closed
kaihe opened this issue Mar 24, 2023 · 4 comments

kaihe commented Mar 24, 2023

out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
File "/root/miniconda3/lib/python3.8/site-packages/bitsandbytes/functional.py", line 1410, in igemmlt
raise Exception('cublasLt ran into an error!')

The failure comes from this part of bitsandbytes.functional, where the call returns has_error == 1:

if formatB == 'col_turing':
    if dtype == torch.int32:
        has_error = lib.cigemmlt_turing_32(
            ptr, m, n, k, ptrA, ptrB, ptrC, ptrRowScale, lda, ldb, ldc
        )
    else:
        has_error = lib.cigemmlt_turing_8(
            ptr, m, n, k, ptrA, ptrB, ptrC, ptrRowScale, lda, ldb, ldc
        )
elif formatB == "col_ampere":
    if dtype == torch.int32:
        has_error = lib.cigemmlt_ampere_32(
            ptr, m, n, k, ptrA, ptrB, ptrC, ptrRowScale, lda, ldb, ldc
        )
    else:
        has_error = lib.cigemmlt_ampere_8(
            ptr, m, n, k, ptrA, ptrB, ptrC, ptrRowScale, lda, ldb, ldc
        )

The tensor and matrix information printed at the point of failure:
error detectedA: torch.Size([512, 4096]), B: torch.Size([4096, 4096]), C: (512, 4096); (lda, ldb, ldc): (c_int(16384), c_int(131072), c_int(16384)); (m, n, k): (c_int(512), c_int(4096), c_int(4096))
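
The leading dimensions in that printout can be checked by hand. The sketch below is only an illustration: `expected_leading_dims` is a hypothetical helper, not a bitsandbytes function, and it assumes the tiled int8 layouts use 32-column tiles, with the rows of B padded to a multiple of 8 for col_turing and 32 for col_ampere. Under that assumption the reported (lda, ldb, ldc) are what these shapes should produce, which suggests the arguments passed to cublasLt are well-formed and the error originates elsewhere.

```python
def expected_leading_dims(m: int, rows_b: int, format_b: str):
    """Return (lda, ldb, ldc) under the ASSUMED tiled-layout padding rules."""
    # Assumption: A and C use a 32-column tiled layout, so ld = m * 32.
    lda = m * 32
    ldc = m * 32
    # Assumption: rows of B are padded to a multiple of 8 (turing) or 32 (ampere).
    pad = 8 if format_b == "col_turing" else 32
    ldb = ((rows_b + pad - 1) // pad) * pad * 32
    return lda, ldb, ldc


# Shapes from the printout: A is [512, 4096], B is [4096, 4096].
print(expected_leading_dims(512, 4096, "col_ampere"))
# (16384, 131072, 16384), the same values as in the printout
```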

Owner

Facico commented Mar 24, 2023

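@kaihe Thank you very much for reporting this.
I found a similar issue that involves mixing CPU and GPU placement together with 8-bit loading: huggingface/transformers#21371
Did you train with the original parameters? If not, could you share the parameters you used in more detail?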

Author

kaihe commented Mar 24, 2023

Thanks for the reply.
I used your original parameters and did not deliberately try to mix GPU and CPU placement.

device_map = "auto"
world_size = int(os.environ.get("WORLD_SIZE", 1))
ddp = world_size != 1
if ddp:
    device_map = {"": int(os.environ.get("LOCAL_RANK") or 0)}
    GRADIENT_ACCUMULATION_STEPS = GRADIENT_ACCUMULATION_STEPS // world_size
print(args.model_path)
model = LlamaForCausalLM.from_pretrained(
    args.model_path,
    load_in_8bit=True,
    device_map=device_map,
)
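
To make the placement logic above easier to inspect, here is a minimal sketch that isolates it in a function. `resolve_device_map` is a hypothetical name introduced for illustration; the logic mirrors the snippet above. With more than one process, the map `{"": LOCAL_RANK}` pins the whole model to that process's GPU, so no CPU offload is requested in the DDP case.

```python
import os


def resolve_device_map(env=None):
    """Return the device_map the snippet above would pass to from_pretrained.

    `env` defaults to os.environ; passing a plain dict makes this easy to test.
    """
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", 1))
    if world_size != 1:
        # DDP: one process per GPU, entire model on the GPU given by LOCAL_RANK.
        return {"": int(env.get("LOCAL_RANK") or 0)}
    # Single process: let the library decide the placement.
    return "auto"


print(resolve_device_map({"WORLD_SIZE": "4", "LOCAL_RANK": "2"}))  # {'': 2}
print(resolve_device_map({}))  # auto
```

Printing this value on every rank before loading the model is a quick way to confirm that each process targets the GPU you expect.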

Owner

Facico commented Mar 24, 2023

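@kaihe For now this still looks like a dependency problem, since the only difference between the single-GPU and multi-GPU paths is the ddp branch. You could try setting up a Python 3.10 environment and see whether the problem persists. We will provide a more detailed version configuration later. For reference, here is a Python 3.10 configuration that runs on multiple GPUs: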

absl-py                  1.4.0
accelerate               0.15.0
aiodns                   3.0.0
aiofiles                 23.1.0
aiohttp                  3.8.3
aiosignal                1.3.1
altair                   4.2.2
anyio                    3.6.2
appdirs                  1.4.4
async-timeout            4.0.2
attrs                    22.2.0
beautifulsoup4           4.11.2
bitsandbytes             0.37.0
Brotli                   1.0.9
cachetools               5.3.0
certifi                  2022.12.7
cffi                     1.15.1
charset-normalizer       2.1.1
click                    8.1.3
contourpy                1.0.7
cpm-kernels              1.0.11
cycler                   0.11.0
datasets                 2.8.0
deepspeed                0.7.7
dill                     0.3.6
distlib                  0.3.6
docker-pycreds           0.4.0
einops                   0.6.0
entrypoints              0.4
evaluate                 0.4.0
fastapi                  0.95.0
ffmpy                    0.3.0
filelock                 3.9.0
fire                     0.5.0
flash-attn               0.2.8
fonttools                4.39.2
frozenlist               1.3.3
fsspec                   2023.3.0
gdown                    4.6.4
gensim                   3.8.2
gitdb                    4.0.10
GitPython                3.1.31
google-auth              2.16.2
google-auth-oauthlib     0.4.6
gradio                   3.23.0
grpcio                   1.51.3
h11                      0.14.0
hjson                    3.1.0
httpcore                 0.16.3
httpx                    0.23.3
huggingface-hub          0.13.3
icetk                    0.0.5
idna                     3.4
inflate64                0.3.1
Jinja2                   3.1.2
joblib                   1.2.0
jsonlines                3.1.0
jsonschema               4.17.3
kiwisolver               1.4.4
linkify-it-py            2.0.0
loguru                   0.6.0
loralib                  0.1.1
Markdown                 3.4.1
markdown-it-py           2.2.0
MarkupSafe               2.1.2
matplotlib               3.7.1
mdit-py-plugins          0.3.3
mdurl                    0.1.2
msgpack                  1.0.4
multidict                6.0.4
multiprocess             0.70.14
multivolumefile          0.2.3
networkx                 3.0
ninja                    1.11.1
nltk                     3.8.1
numpy                    1.24.2
nvidia-cublas-cu11       11.10.3.66
nvidia-cuda-nvrtc-cu11   11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11        8.5.0.96
nvidia-ml-py             11.525.84
nvitop                   1.0.0
oauthlib                 3.2.2
openai                   0.27.2
orjson                   3.8.8
packaging                23.0
pandas                   1.5.3
pathtools                0.1.2
peft                     0.3.0.dev0
Pillow                   9.4.0
pip                      22.3.1
platformdirs             3.1.0
protobuf                 3.20.1
psutil                   5.9.4
py-cpuinfo               9.0.0
py7zr                    0.20.4
pyarrow                  11.0.0
pyasn1                   0.4.8
pyasn1-modules           0.2.8
pybcj                    1.0.1
pycares                  4.3.0
pycparser                2.21
pycryptodomex            3.17
pydantic                 1.10.4
pydub                    0.25.1
Pygments                 2.14.0
pyparsing                3.0.9
pyppmd                   1.0.0
pyrsistent               0.19.3
PySocks                  1.7.1
python-dateutil          2.8.2
python-multipart         0.0.6
pytz                     2022.7.1
PyYAML                   6.0
pyzstd                   0.15.4
ray                      2.3.0
regex                    2022.10.31
requests                 2.28.2
requests-oauthlib        1.3.1
responses                0.18.0
rfc3986                  1.5.0
rich                     13.3.2
rouge-score              0.1.2
rsa                      4.9
scikit-learn             1.2.0
scipy                    1.10.1
semantic-version         2.10.0
sentencepiece            0.1.97
sentry-sdk               1.16.0
setproctitle             1.3.2
setuptools               65.6.3
six                      1.16.0
smart-open               6.3.0
smmap                    5.0.0
sniffio                  1.3.0
soupsieve                2.4
starlette                0.26.1
tabulate                 0.9.0
tensorboard              2.12.0
tensorboard-data-server  0.7.0
tensorboard-plugin-wit   1.8.1
termcolor                2.2.0
texttable                1.6.7
threadpoolctl            3.1.0
tokenizers               0.13.2
toolz                    0.12.0
torch                    1.13.1
torchtyping              0.1.4
torchvision              0.14.1
tqdm                     4.65.0
transformers             4.28.0.dev0
trlx                     0.3.0
typeguard                2.13.3
typing_extensions        4.5.0
uc-micro-py              1.0.1
urllib3                  1.26.14
uvicorn                  0.21.1
virtualenv               20.20.0
wandb                    0.13.10
websockets               10.4
Werkzeug                 2.2.3
wheel                    0.38.4
xxhash                   3.2.0
yarl                     1.8.2
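
To compare a local environment against the list above, a small script along these lines may help. It is a sketch: `PINNED` copies only a handful of entries from the list (the choice of which ones matter most is an assumption), and `installed_version` and `version_mismatches` are hypothetical helper names. The lookup uses the standard-library `importlib.metadata`.

```python
from importlib import metadata

# A few entries copied from the list above; extend as needed.
PINNED = {
    "torch": "1.13.1",
    "bitsandbytes": "0.37.0",
    "accelerate": "0.15.0",
    "transformers": "4.28.0.dev0",
    "peft": "0.3.0.dev0",
}


def installed_version(name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def version_mismatches(pinned, lookup=installed_version):
    """Map package name -> (wanted, found) for every package that differs."""
    mismatches = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in version_mismatches(PINNED).items():
        print(f"{name}: expected {wanted}, found {found}")
```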

Author

kaihe commented Mar 24, 2023

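Found the problem: one of my 4 GPUs appears to be faulty. I tried different combinations of the cards and found that the error occurs whenever that card is included, and never with the other cards.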
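
The elimination procedure described here can be sketched as follows. This is an illustration, not the script used in the thread: `gpu_subsets` and `suspect_gpus` are hypothetical helpers, and the `results` dictionary is made-up example data. Each string from `gpu_subsets` can be used as the value of the `CUDA_VISIBLE_DEVICES` environment variable for one training run.

```python
from itertools import combinations


def gpu_subsets(gpu_ids, size):
    """List every combination of `size` GPUs as a CUDA_VISIBLE_DEVICES string."""
    return [",".join(str(g) for g in combo) for combo in combinations(gpu_ids, size)]


def suspect_gpus(results):
    """Given {gpu_tuple: run_failed}, return GPUs present in every failing
    run and in no passing run."""
    failing = [set(combo) for combo, failed in results.items() if failed]
    passing = [set(combo) for combo, failed in results.items() if not failed]
    if not failing:
        return set()
    suspects = set.intersection(*failing)
    for ok in passing:
        suspects -= ok
    return suspects


print(gpu_subsets([0, 1, 2, 3], 2))

# Hypothetical outcomes for illustration only: True means the run failed.
results = {
    (0, 1): False, (0, 2): True, (0, 3): False,
    (1, 2): True, (1, 3): False, (2, 3): True,
}
print(suspect_gpus(results))  # {2}
```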
