auto_gptq 0.7.1: a model quantized with the export pipeline cannot be served; error: (RayWorkerVllm pid=6035) ERROR 05-10 03:37:55 ray_utils.py:44] ValueError: torch.bfloat16 is not supported for quantization method gptq. Supported dtypes: [torch.float16] [repeated 2x across cluster] #3674
Labels
solved
This problem has already been solved
Reminder
Reproduction
Hardware environment:
4 * RTX 3090 (on this setup I have already successfully run Qwen1.5-72B-Chat-GPTQ-Int4, i.e. the official INT4-quantized Qwen 72B)
This is the script I use to launch the Web UI:
#!/bin/bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python3.10 webui.py \
    --model_name_or_path ../model/qwen/Qwen1.5-72B-Chat-sft-INT4 \
    --template qwen \
    --use_fast_tokenizer True \
    --repetition_penalty 1.03 \
    --infer_backend vllm \
    --cutoff_len 8192 \
    --flash_attn auto
......
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader.py", line 70, in get_model
raise ValueError(
ValueError: torch.bfloat16 is not supported for quantization method gptq. Supported dtypes: [torch.float16]
(RayWorkerVllm pid=20366) INFO 05-10 03:50:19 selector.py:45] Cannot use FlashAttention because the package is not found. Please install it for better performance. [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
...
This is the script I used for quantization:
#!/bin/bash
# DO NOT use quantized model or quantization_bit when merging lora weights
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python3.10 export_model.py \
    --model_name_or_path /hy-tmp/models/Qwen1.5-72B-Chat-sft \
    --export_quantization_bit 4 \
    --export_quantization_dataset ../data/c4_demo.json \
    --template qwen \
    --export_dir ../../models/Qwen1.5-72B-Chat-sft-INT4 \
    --export_size 2 \
    --export_device cpu \
    --export_legacy_format False
I don't know where this is going wrong. Any ideas?
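The error message itself narrows the cause: vLLM's GPTQ path only accepts torch.float16, while the exported checkpoint's config.json presumably records "torch_dtype": "bfloat16" inherited from the bf16 base model, so vLLM infers bfloat16 and refuses to load. A minimal sketch of one possible workaround is to patch the exported config.json to advertise float16 (the helper name and the commented path are mine, not from this issue):

```python
import json
from pathlib import Path

def force_float16(model_dir: str) -> None:
    """Set torch_dtype to float16 in a checkpoint's config.json,
    so vLLM no longer infers bfloat16 for the GPTQ weights."""
    cfg_path = Path(model_dir) / "config.json"
    cfg = json.loads(cfg_path.read_text(encoding="utf-8"))
    cfg["torch_dtype"] = "float16"  # the only dtype vLLM's gptq method supports
    cfg_path.write_text(json.dumps(cfg, indent=2), encoding="utf-8")

# Hypothetical usage, matching the export_dir from the script above:
# force_float16("../model/qwen/Qwen1.5-72B-Chat-sft-INT4")
```

Alternatively, if the serving layer exposes vLLM's dtype option (vLLM itself accepts `dtype="float16"` / `--dtype float16`), forcing float16 there avoids editing the checkpoint.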
Expected behavior
On the current hardware, I expect to be able to run my own fine-tuned and then quantized Qwen1.5-72B-GPTQ-INT4 model.
System Info
(base) root@I19d213861d0060102e:/hy-tmp/LLaMA-Factory-main/src# python3.10 -m pip list
Package Version
accelerate 0.28.0
addict 2.4.0
aiofiles 23.2.1
aiohttp 3.9.3
aiosignal 1.3.1
aliyun-python-sdk-core 2.15.0
aliyun-python-sdk-kms 2.16.2
altair 5.2.0
annotated-types 0.6.0
anyio 4.3.0
async-timeout 4.0.3
attrs 23.2.0
auto_gptq 0.7.1
bitsandbytes 0.43.0
certifi 2019.11.28
cffi 1.16.0
chardet 3.0.4
charset-normalizer 3.3.2
click 8.1.7
cloudpickle 3.0.0
cmake 3.29.2
contourpy 1.2.0
crcmod 1.7
cryptography 42.0.5
cupy-cuda12x 12.1.0
cycler 0.12.1
datasets 2.18.0
dbus-python 1.2.16
deepspeed 0.14.0
dill 0.3.8
diskcache 5.6.3
distro 1.4.0
distro-info 0.23ubuntu1
docstring_parser 0.16
einops 0.7.0
exceptiongroup 1.2.0
fastapi 0.110.0
fastrlock 0.8.2
ffmpy 0.3.2
filelock 3.13.3
fire 0.6.0
fonttools 4.50.0
frozenlist 1.4.1
fsspec 2024.2.0
galore-torch 1.0
gast 0.5.4
gekko 1.0.7
gradio 3.50.2
gradio_client 0.6.1
h11 0.14.0
hjson 3.1.0
httpcore 1.0.4
httptools 0.6.1
httpx 0.27.0
huggingface-hub 0.22.0
idna 2.8
importlib_metadata 7.1.0
importlib_resources 6.4.0
interegular 0.3.3
Jinja2 3.1.3
jmespath 0.10.0
joblib 1.3.2
jsonschema 4.21.1
jsonschema-specifications 2023.12.1
kiwisolver 1.4.5
lark 1.1.9
llvmlite 0.42.0
markdown-it-py 3.0.0
MarkupSafe 2.1.5
matplotlib 3.8.3
mdurl 0.1.2
modelscope 1.13.3
mpmath 1.3.0
msgpack 1.0.8
multidict 6.0.5
multiprocess 0.70.16
nest-asyncio 1.6.0
networkx 3.2.1
ninja 1.11.1.1
numba 0.59.1
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.18.1
nvidia-nvjitlink-cu12 12.4.99
nvidia-nvtx-cu12 12.1.105
orjson 3.9.15
oss2 2.18.4
outlines 0.0.34
packaging 24.0
pandas 2.2.1
peft 0.10.0
pillow 10.2.0
pip 24.0
platformdirs 4.2.0
prometheus_client 0.20.0
protobuf 5.26.0
psutil 5.9.8
py-cpuinfo 9.0.0
pyarrow 15.0.2
pyarrow-hotfix 0.6
pycparser 2.21
pycryptodome 3.20.0
pydantic 2.6.4
pydantic_core 2.16.3
pydub 0.25.1
Pygments 2.17.2
PyGObject 3.36.0
pynvml 11.5.0
pyparsing 3.1.2
python-apt 2.0.1+ubuntu0.20.4.1
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
python-multipart 0.0.9
pytz 2024.1
PyYAML 6.0.1
ray 2.10.0
referencing 0.34.0
regex 2023.12.25
requests 2.31.0
requests-unixsocket 0.2.0
rich 13.7.1
rouge 1.0.1
rpds-py 0.18.0
safetensors 0.4.2
scipy 1.12.0
semantic-version 2.10.0
sentencepiece 0.2.0
setuptools 69.2.0
shtab 1.7.1
simplejson 3.19.2
six 1.14.0
sniffio 1.3.1
sortedcontainers 2.4.0
sse-starlette 2.0.0
ssh-import-id 5.10
starlette 0.36.3
sympy 1.12
termcolor 2.4.0
tiktoken 0.6.0
tokenizers 0.15.2
tomli 2.0.1
toolz 0.12.1
torch 2.1.2
tqdm 4.66.2
transformers 4.39.1
triton 2.1.0
trl 0.8.1
typing_extensions 4.10.0
tyro 0.7.3
tzdata 2024.1
unattended-upgrades 0.1
urllib3 2.2.1
uvicorn 0.29.0
uvloop 0.19.0
vllm 0.4.0
watchfiles 0.21.0
websockets 11.0.3
wheel 0.34.2
xformers 0.0.23.post1
xxhash 3.4.1
yapf 0.40.2
yarl 1.9.4
zipp 3.18.1
Others
Note: the original Qwen1.5-72B-GPTQ-INT4 has run successfully on this same hardware.