
Is multi-GPU inference on RTX 4090 supported (model: Qwen-72B fine-tuned with QLoRA)? Fine-tuning Qwen-72B with FSDP+QLoRA works fine; how can I deploy the fine-tuned model for inference on RTX 4090s? #3023

Closed
ConniePK opened this issue Mar 28, 2024 · 7 comments
Labels
solved This problem has been already solved

Comments

@ConniePK

ConniePK commented Mar 28, 2024

Is multi-GPU inference on RTX 4090 supported (model: Qwen-72B fine-tuned with QLoRA)? Fine-tuning Qwen-72B with FSDP+QLoRA works fine; how can I deploy the fine-tuned model for inference on RTX 4090s?
I tried multi-GPU inference with the following script:

CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch --config_file fsdp_config.yaml infer.py \
    --model_name_or_path '/root/.cache/modelscope/hub/qwen/Qwen-72B-Chat/' \
    --adapter_name_or_path '/home/work/bin/sft_model/v24_qlora_sft_qwen72b//' \
    --template qwen \
    --quantization_bit 4 > log.txt 2>&1 &

But it quickly fails with an OOM error, as shown below:

03/28/2024 09:14:58 - INFO - llmtuner.model.patcher - Quantizing model to 4 bit.
03/28/2024 09:14:58 - INFO - llmtuner.model.patcher - Using KV cache for faster generation.
03/28/2024 09:14:58 - INFO - llmtuner.model.patcher - Quantizing model to 4 bit.
03/28/2024 09:14:58 - INFO - llmtuner.model.patcher - Using KV cache for faster generation.
03/28/2024 09:14:58 - INFO - llmtuner.model.patcher - Quantizing model to 4 bit.
03/28/2024 09:14:58 - INFO - llmtuner.model.patcher - Using KV cache for faster generation.
[INFO|modeling_utils.py:3280] 2024-03-28 09:14:58,502 >> loading weights file /root/.cache/modelscope/hub/qwen/Qwen-72B-Chat/model.safetensors.index.json
[INFO|modeling_utils.py:1417] 2024-03-28 09:14:58,502 >> Instantiating QWenLMHeadModel model under default dtype torch.float16.
[INFO|configuration_utils.py:928] 2024-03-28 09:14:58,503 >> Generate config GenerationConfig {}

[INFO|modeling_utils.py:3280] 2024-03-28 09:14:58,507 >> loading weights file /root/.cache/modelscope/hub/qwen/Qwen-72B-Chat/model.safetensors.index.json
[INFO|modeling_utils.py:3280] 2024-03-28 09:14:58,507 >> loading weights file /root/.cache/modelscope/hub/qwen/Qwen-72B-Chat/model.safetensors.index.json
[INFO|modeling_utils.py:1417] 2024-03-28 09:14:58,508 >> Instantiating QWenLMHeadModel model under default dtype torch.float16.
[INFO|modeling_utils.py:1417] 2024-03-28 09:14:58,508 >> Instantiating QWenLMHeadModel model under default dtype torch.float16.
[INFO|configuration_utils.py:928] 2024-03-28 09:14:58,509 >> Generate config GenerationConfig {}

[INFO|configuration_utils.py:928] 2024-03-28 09:14:58,509 >> Generate config GenerationConfig {}

[INFO|modeling_utils.py:3280] 2024-03-28 09:14:58,509 >> loading weights file /root/.cache/modelscope/hub/qwen/Qwen-72B-Chat/model.safetensors.index.json
[INFO|modeling_utils.py:1417] 2024-03-28 09:14:58,510 >> Instantiating QWenLMHeadModel model under default dtype torch.float16.
[INFO|configuration_utils.py:928] 2024-03-28 09:14:58,511 >> Generate config GenerationConfig {}

Loading checkpoint shards:  58%|█████▊    | 11/19 [00:53<00:38,  4.82s/it]
Traceback (most recent call last):
  File "sft_model_interface_qwen_based_llama_factory.py", line 28, in <module>
    model = ChatModel()
  File "/home/work/bin/marketing-ai-store-assistant/mi_salesclerk_robot_llm/llmtuner/chat/chat_model.py", line 23, in __init__
    self.engine: "BaseEngine" = HuggingfaceEngine(model_args, data_args, finetuning_args, generating_args)
  File "/home/work/bin/marketing-ai-store-assistant/mi_salesclerk_robot_llm/llmtuner/chat/hf_engine.py", line 33, in __init__
    self.model, self.tokenizer = load_model_and_tokenizer(
  File "/home/work/bin/marketing-ai-store-assistant/mi_salesclerk_robot_llm/llmtuner/model/loader.py", line 149, in load_model_and_tokenizer
    model = load_model(tokenizer, model_args, finetuning_args, is_trainable, add_valuehead)
  File "/home/work/bin/marketing-ai-store-assistant/mi_salesclerk_robot_llm/llmtuner/model/loader.py", line 89, in load_model
    model = AutoModelForCausalLM.from_pretrained(model_args.model_name_or_path, config=config, **init_kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/auto_factory.py", line 558, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 3531, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 3958, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 814, in _load_state_dict_into_meta_model
    hf_quantizer.create_quantized_param(model, param, param_name, param_device, state_dict, unexpected_keys)
  File "/usr/local/lib/python3.8/dist-packages/transformers/quantizers/quantizer_bnb_4bit.py", line 219, in create_quantized_param
    new_value = bnb.nn.Params4bit(new_value, requires_grad=False, **kwargs).to(target_device)
  File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/nn/modules.py", line 313, in to
    return self._quantize(device)
  File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/nn/modules.py", line 280, in _quantize
    w_4bit, quant_state = bnb.functional.quantize_4bit(
  File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/functional.py", line 986, in quantize_4bit
    out = torch.zeros(((n+1)//mod, 1), dtype=quant_storage, device=A.device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 1 has a total capacty of 23.65 GiB of which 82.06 MiB is free. Process 76565 has 23.56 GiB memory in use. Of the allocated memory 23.06 GiB is allocated by PyTorch, and 71.07 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

When the checkpoint shards reach 11/19, GPU memory usage looks like this:
(screenshot of GPU memory usage, 2024-03-28 17:21)
It looks like every GPU is loading a full copy of the model, rather than the model being split evenly across the cards?
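For comparison, here is a minimal sketch of loading the same 4-bit model in a single process with device_map="auto", which shards the quantized weights across the visible GPUs instead of replicating them once per launched process. This assumes a plain transformers + bitsandbytes setup rather than the LLaMA-Factory loader, and the quantization settings are illustrative:

```python
# Minimal sketch: shard a 4-bit quantized model across visible GPUs with
# device_map="auto" instead of replicating it in every worker process.
# Plain transformers + bitsandbytes illustration; quantization settings are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_path = "/root/.cache/modelscope/hub/qwen/Qwen-72B-Chat/"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
)

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto",          # spread layers across the GPUs in CUDA_VISIBLE_DEVICES
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
model.eval()
```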

@hiyouga hiyouga added the pending This problem is yet to be addressed label Mar 28, 2024
@sxm7078

sxm7078 commented Mar 28, 2024

What versions of the dependencies are you using?

@ConniePK
Author

What versions of the dependencies are you using?

accelerate==0.28.0
addict==2.4.0
aiofiles==23.2.1
aiohttp==3.9.1
aiosignal==1.3.1
aliyun-python-sdk-core==2.14.0
aliyun-python-sdk-kms==2.16.2
altair==5.2.0
annotated-types==0.6.0
anyio==4.2.0
asttokens==2.4.1
async-timeout==4.0.3
attrs==23.2.0
bitsandbytes==0.43.0
Brotli==1.0.9
certifi==2023.11.17
cffi==1.16.0
charset-normalizer==2.0.4
click==8.1.7
cloudpickle==3.0.0
contourpy==1.2.0
crcmod==1.7
cryptography==41.0.7
cupy-cuda12x==12.1.0
cycler==0.12.1
datasets==2.16.1
decorator==5.1.1
deepspeed==0.14.0
dill==0.3.7
diskcache==5.6.3
docstring-parser==0.15
dropout-layer-norm==0.1
einops==0.7.0
exceptiongroup==1.2.0
executing==2.0.1
fastapi==0.109.0
fastrlock==0.8.2
ffmpy==0.3.1
filelock==3.13.1
fire==0.6.0
flash-attn==2.4.2
fonttools==4.47.2
frozenlist==1.4.1
fschat==0.2.36
fsspec==2023.10.0
gast==0.5.4
gmpy2==2.1.2
gradio==3.50.2
gradio_client==0.6.1
h11==0.14.0
hjson==3.1.0
httpcore==1.0.2
httptools==0.6.1
httpx==0.26.0
huggingface-hub==0.20.2
idna==3.4
importlib-metadata==7.0.1
importlib-resources==6.1.1
interegular==0.3.3
ipython==8.20.0
jedi==0.19.1
jieba==0.42.1
Jinja2==3.1.2
jmespath==0.10.0
joblib==1.3.2
jsonschema==4.20.0
jsonschema-specifications==2023.12.1
kiwisolver==1.4.5
lark==1.1.9
llvmlite==0.42.0
markdown-it-py==3.0.0
markdown2==2.4.13
MarkupSafe==2.1.3
matplotlib==3.8.2
matplotlib-inline==0.1.6
mdurl==0.1.2
mkl-fft==1.3.8
mkl-random==1.2.4
mkl-service==2.4.0
modelscope==1.11.1
mpmath==1.3.0
msgpack==1.0.8
multidict==6.0.4
multiprocess==0.70.15
nest-asyncio==1.6.0
networkx==3.1
nh3==0.2.15
ninja==1.11.1.1
nltk==3.8.1
numba==0.59.1
numpy==1.26.3
orjson==3.9.10
oss2==2.18.4
outlines==0.0.36
packaging==23.2
pandas==2.1.4
parso==0.8.3
peft==0.9.0
pexpect==4.9.0
Pillow==10.0.1
pip==24.0
platformdirs==4.1.0
prometheus_client==0.20.0
prompt-toolkit==3.0.43
protobuf==4.25.2
psutil==5.9.7
ptyprocess==0.7.0
pure-eval==0.2.2
py-cpuinfo==9.0.0
pyarrow==14.0.2
pyarrow-hotfix==0.6
pycparser==2.21
pycryptodome==3.20.0
pydantic==2.5.3
pydantic_core==2.14.6
pydub==0.25.1
Pygments==2.17.2
pynvml==11.5.0
pyOpenSSL==23.2.0
pyparsing==3.1.1
PySocks==1.7.1
python-dateutil==2.8.2
python-dotenv==1.0.1
python-multipart==0.0.6
pytz==2023.3.post1
PyYAML==6.0.1
ray==2.10.0
referencing==0.32.1
regex==2023.12.25
requests==2.31.0
rich==13.7.0
rotary-emb==0.1
rouge-chinese==1.0.3
rpds-py==0.16.2
safetensors==0.4.1
scipy==1.11.4
semantic-version==2.10.0
sentencepiece==0.1.99
setuptools==68.2.2
shortuuid==1.0.13
shtab==1.6.5
simplejson==3.19.2
six==1.16.0
sniffio==1.3.0
sortedcontainers==2.4.0
sse-starlette==1.8.2
stack-data==0.6.3
starlette==0.35.1
svgwrite==1.4.3
sympy==1.12
termcolor==2.4.0
tiktoken==0.5.2
tokenizers==0.15.0
tomli==2.0.1
toolz==0.12.0
torch==2.1.2
torchaudio==2.1.2
torchvision==0.16.2
tqdm==4.66.1
traitlets==5.14.1
transformers==4.40.0.dev0
transformers-stream-generator==0.0.4
triton==2.1.0
trl==0.8.1
typing_extensions==4.9.0
tyro==0.6.3
tzdata==2023.4
unsloth==2024.1
urllib3==1.26.18
uvicorn==0.25.0
uvloop==0.19.0
vllm==0.3.3
watchfiles==0.21.0
wavedrom==2.0.3.post3
wcwidth==0.2.13
websockets==11.0.3
wheel==0.41.2
xformers==0.0.23.post1
xxhash==3.4.1
yapf==0.40.2
yarl==1.9.4
zipp==3.17.0

@tianyabanbu

I have the same problem when LoRA fine-tuning Qwen1.5-72B-Chat on a single machine with 8x RTX 3090; switching to Yi-34B-Chat still OOMs. @ConniePK have you solved it?

@tianyabanbu

I also tried FSDP + QLoRA and it runs fine.

@ConniePK
Author

ConniePK commented Apr 1, 2024

I also tried FSDP + QLoRA and it runs fine.

Do you mean inference? Fine-tuning runs fine for me; the problem above appeared when running inference on two GPUs after fine-tuning finished.

@ConniePK ConniePK changed the title from "Using multiple RTX 4090s with FSDP+QLoRA I can fine-tune Qwen-72B fine; how can I deploy the fine-tuned model on RTX 4090? Is it supported?" to "Is multi-GPU inference on RTX 4090 supported (model: Qwen-72B fine-tuned with QLoRA)? Fine-tuning Qwen-72B with FSDP+QLoRA works fine; how can I deploy the fine-tuned model for inference on RTX 4090s?" Apr 1, 2024
hiyouga added a commit that referenced this issue Apr 1, 2024
@hiyouga
Owner

hiyouga commented Apr 1, 2024

The latest code supports multi-GPU inference of quantized models. Use

CUDA_VISIBLE_DEVICES=4,5,6,7 python cli_demo.py \
    --model_name_or_path 'Qwen-72B-Chat' \
    --adapter_name_or_path 'lora_model' \
    --template qwen \
    --quantization_bit 4 \
    --quantization_device_map auto

to run multi-GPU inference with the quantized model.
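The same can also be driven from Python; a rough sketch, assuming llmtuner's ChatModel accepts the CLI arguments above as a dict (the model and adapter paths 'Qwen-72B-Chat' and 'lora_model' are placeholders):

```python
# Rough sketch of multi-GPU 4-bit inference via llmtuner's ChatModel.
# Assumes ChatModel(args) takes the same arguments as the CLI flags above.
from llmtuner.chat.chat_model import ChatModel

chat_model = ChatModel(dict(
    model_name_or_path="Qwen-72B-Chat",       # placeholder path
    adapter_name_or_path="lora_model",        # placeholder path
    template="qwen",
    quantization_bit=4,
    quantization_device_map="auto",
))

# The Response fields below are assumed from llmtuner 0.6.x.
responses = chat_model.chat([{"role": "user", "content": "你好"}])
print(responses[0].response_text)
```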

@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Apr 1, 2024
@hiyouga hiyouga closed this as completed Apr 1, 2024
tybalex added a commit to sanjay920/LLaMA-Factory that referenced this issue Apr 10, 2024
* support infer 4bit model on GPUs hiyouga#3023
@xiaoliwe

xiaoliwe commented Jul 9, 2024

Currently, how many concurrent inference requests can a single RTX 4090 handle?
