
Error when inference using vllm with multi-gpu #4780

Closed · 1 task done
hzhaoy opened this issue Jul 11, 2024 · 2 comments · Fixed by #4781
Labels: solved (This problem has been already solved)

Comments

hzhaoy (Contributor) commented Jul 11, 2024

Reminder

  • I have read the README and searched the existing issues.

System Info

System: Ubuntu 20.04.2 LTS
GPU: NVIDIA A100-SXM4-80GB
Docker: 24.0.0
Docker Compose: v2.17.3
llamafactory: 0.8.2.dev0
vllm: 0.5.1

Reproduction

Dockerfile: https://github.com/hiyouga/LLaMA-Factory/blob/67040f149c0b3fbae443ba656ed0dcab0ebaf730/docker/docker-cuda/Dockerfile

Build Command:

docker build -f ./Dockerfile \
    --build-arg INSTALL_BNB=true \
    --build-arg INSTALL_VLLM=true \
    --build-arg INSTALL_DEEPSPEED=true \
    --build-arg INSTALL_FLASHATTN=true \
    --build-arg PIP_INDEX=https://pypi.tuna.tsinghua.edu.cn/simple \
    -t llamafactory:latest .

Launch Command:

docker run -dit --gpus=all \
    -v ./hf_cache:/root/.cache/huggingface \
    -v ./ms_cache:/root/.cache/modelscope \
    -v ./data:/app/data \
    -v ./output:/app/output \
    -p 7860:7860 \
    -p 8000:8000 \
    --shm-size 16G \
    --name llamafactory \
    llamafactory:latest

docker exec -it llamafactory bash

llamafactory-cli webui

The following error occurs when loading Qwen2-7B-Instruct in the Chat tab of the web UI using vLLM with multiple GPUs:

(VllmWorkerProcess pid=263) Process VllmWorkerProcess:
(VllmWorkerProcess pid=263) Traceback (most recent call last):
(VllmWorkerProcess pid=263)   File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(VllmWorkerProcess pid=263)     self.run()
(VllmWorkerProcess pid=263)   File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
(VllmWorkerProcess pid=263)     self._target(*self._args, **self._kwargs)
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 210, in _run_worker_process
(VllmWorkerProcess pid=263)     worker = worker_factory()
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 68, in _create_worker
(VllmWorkerProcess pid=263)     wrapper.init_worker(**self._get_worker_kwargs(local_rank, rank,
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 334, in init_worker
(VllmWorkerProcess pid=263)     self.worker = worker_class(*args, **kwargs)
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 85, in __init__
(VllmWorkerProcess pid=263)     self.model_runner: GPUModelRunnerBase = ModelRunnerClass(
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 217, in __init__
(VllmWorkerProcess pid=263)     self.attn_backend = get_attn_backend(
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 45, in get_attn_backend
(VllmWorkerProcess pid=263)     backend = which_attn_to_use(num_heads, head_size, num_kv_heads,
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 151, in which_attn_to_use
(VllmWorkerProcess pid=263)     if torch.cuda.get_device_capability()[0] < 8:
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 430, in get_device_capability
(VllmWorkerProcess pid=263)     prop = get_device_properties(device)
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 444, in get_device_properties
(VllmWorkerProcess pid=263)     _lazy_init()  # will define _get_device_properties
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 279, in _lazy_init
(VllmWorkerProcess pid=263)     raise RuntimeError(
(VllmWorkerProcess pid=263) RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
ERROR 07-11 13:53:53 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 263 died, exit code: 1
INFO 07-11 13:53:53 multiproc_worker_utils.py:123] Killing local vLLM worker processes
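The traceback shows PyTorch's fork-safety check firing: the vLLM worker process here is created with the 'fork' start method after CUDA has already been initialized in the parent process, and a forked child is not allowed to re-initialize CUDA. A minimal standalone sketch of this failure mode (plain PyTorch, not taken from vLLM; assumes a machine with at least one CUDA GPU):

import multiprocessing as mp
import torch

def probe_gpu():
    # With the 'fork' start method this call is where the error above appears:
    # "RuntimeError: Cannot re-initialize CUDA in forked subprocess. ..."
    print(torch.cuda.get_device_capability())

if __name__ == "__main__":
    torch.cuda.init()              # parent initializes CUDA first
    ctx = mp.get_context("spawn")  # using "fork" here reproduces the crash
    worker = ctx.Process(target=probe_gpu)
    worker.start()
    worker.join()

With 'spawn', the child starts a fresh interpreter and initializes CUDA on its own, which is why forcing vLLM to spawn its workers avoids the error.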

Expected behavior

The model loads successfully using vLLM with multiple GPUs.

Others

No response

github-actions bot added the "pending (This problem is yet to be addressed)" label on Jul 11, 2024
hzhaoy mentioned this issue on Jul 11, 2024
hzhaoy changed the title from "Error when inference using vllm" to "Error when inference using vllm with multi-gpu" on Jul 12, 2024
0sengseng0 commented

I have this problem too.

BestLemoon commented Jul 13, 2024

Same problem while loading the model for inference.
Fixed by adding VLLM_WORKER_MULTIPROC_METHOD=spawn before the command.
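Applied to the reproduction steps above, the workaround is to set the variable when launching the web UI inside the container, for example:

VLLM_WORKER_MULTIPROC_METHOD=spawn llamafactory-cli webui

(or pass -e VLLM_WORKER_MULTIPROC_METHOD=spawn to docker run). This forces vLLM to start its worker processes with 'spawn' instead of 'fork', so they no longer inherit an already-initialized CUDA context. This is a sketch of the workaround described in the comment above; see #4781 for the upstream fix.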

hiyouga added the "solved (This problem has been already solved)" label and removed the "pending (This problem is yet to be addressed)" label on Jul 13, 2024
xtchen96 pushed a commit to xtchen96/LLaMA-Factory that referenced this issue on Jul 17, 2024