
Error when inference using vllm with multi-gpu #4780

Closed · 1 task done
hzhaoy opened this issue Jul 11, 2024 · 2 comments · Fixed by #4781
Labels: solved (This problem has been already solved)

Comments

hzhaoy (Contributor) commented Jul 11, 2024

Reminder

  • I have read the README and searched the existing issues.

System Info

System: Ubuntu 20.04.2 LTS
GPU: NVIDIA A100-SXM4-80GB
Docker: 24.0.0
Docker Compose: v2.17.3
llamafactory: 0.8.2.dev0
vllm: 0.5.1

Reproduction

Dockerfile: https://github.com/hiyouga/LLaMA-Factory/blob/67040f149c0b3fbae443ba656ed0dcab0ebaf730/docker/docker-cuda/Dockerfile

Build Command:

docker build -f ./Dockerfile \
    --build-arg INSTALL_BNB=true \
    --build-arg INSTALL_VLLM=true \
    --build-arg INSTALL_DEEPSPEED=true \
    --build-arg INSTALL_FLASHATTN=true \
    --build-arg PIP_INDEX=https://pypi.tuna.tsinghua.edu.cn/simple \
    -t llamafactory:latest .

Launch Command:

docker run -dit --gpus=all \
    -v ./hf_cache:/root/.cache/huggingface \
    -v ./ms_cache:/root/.cache/modelscope \
    -v ./data:/app/data \
    -v ./output:/app/output \
    -p 7860:7860 \
    -p 8000:8000 \
    --shm-size 16G \
    --name llamafactory \
    llamafactory:latest

docker exec -it llamafactory bash

llamafactory-cli webui

The following error occurs when loading Qwen2-7B-Instruct in the Chat tab of the web UI using vLLM with multiple GPUs:

(VllmWorkerProcess pid=263) Process VllmWorkerProcess:
(VllmWorkerProcess pid=263) Traceback (most recent call last):
(VllmWorkerProcess pid=263)   File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(VllmWorkerProcess pid=263)     self.run()
(VllmWorkerProcess pid=263)   File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
(VllmWorkerProcess pid=263)     self._target(*self._args, **self._kwargs)
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 210, in _run_worker_process
(VllmWorkerProcess pid=263)     worker = worker_factory()
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 68, in _create_worker
(VllmWorkerProcess pid=263)     wrapper.init_worker(**self._get_worker_kwargs(local_rank, rank,
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 334, in init_worker
(VllmWorkerProcess pid=263)     self.worker = worker_class(*args, **kwargs)
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 85, in __init__
(VllmWorkerProcess pid=263)     self.model_runner: GPUModelRunnerBase = ModelRunnerClass(
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 217, in __init__
(VllmWorkerProcess pid=263)     self.attn_backend = get_attn_backend(
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 45, in get_attn_backend
(VllmWorkerProcess pid=263)     backend = which_attn_to_use(num_heads, head_size, num_kv_heads,
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 151, in which_attn_to_use
(VllmWorkerProcess pid=263)     if torch.cuda.get_device_capability()[0] < 8:
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 430, in get_device_capability
(VllmWorkerProcess pid=263)     prop = get_device_properties(device)
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 444, in get_device_properties
(VllmWorkerProcess pid=263)     _lazy_init()  # will define _get_device_properties
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 279, in _lazy_init
(VllmWorkerProcess pid=263)     raise RuntimeError(
(VllmWorkerProcess pid=263) RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
ERROR 07-11 13:53:53 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 263 died, exit code: 1
INFO 07-11 13:53:53 multiproc_worker_utils.py:123] Killing local vLLM worker processes
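The traceback shows PyTorch's fork-safety check firing: the vLLM worker process here is created with the 'fork' start method after CUDA has already been initialized in the parent process, and a forked child is not allowed to re-initialize CUDA. A minimal standalone sketch of this failure mode (plain PyTorch, not taken from vLLM; assumes a machine with at least one CUDA GPU):

import multiprocessing as mp
import torch

def probe_gpu():
    # With the 'fork' start method this call is where the error above appears:
    # "RuntimeError: Cannot re-initialize CUDA in forked subprocess. ..."
    print(torch.cuda.get_device_capability())

if __name__ == "__main__":
    torch.cuda.init()              # parent initializes CUDA first
    ctx = mp.get_context("spawn")  # using "fork" here reproduces the crash
    worker = ctx.Process(target=probe_gpu)
    worker.start()
    worker.join()

With 'spawn', the child starts a fresh interpreter and initializes CUDA on its own, which is why forcing vLLM to spawn its workers avoids the error.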

Expected behavior

The model loads successfully using vLLM with multiple GPUs.

Others

No response

github-actions bot added the "pending (This problem is yet to be addressed)" label on Jul 11, 2024
hzhaoy mentioned this issue on Jul 11, 2024
hzhaoy changed the title from "Error when inference using vllm" to "Error when inference using vllm with multi-gpu" on Jul 12, 2024
0sengseng0 commented

I have this problem too.

BestLemoon commented Jul 13, 2024

Same problem while loading the model for inference.
Fixed by adding VLLM_WORKER_MULTIPROC_METHOD=spawn before the command.
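Applied to the reproduction steps above, the workaround is to set the variable when launching the web UI inside the container, for example:

VLLM_WORKER_MULTIPROC_METHOD=spawn llamafactory-cli webui

(or pass -e VLLM_WORKER_MULTIPROC_METHOD=spawn to docker run). This forces vLLM to start its worker processes with 'spawn' instead of 'fork', so they no longer inherit an already-initialized CUDA context. This is a sketch of the workaround described in the comment above; see #4781 for the upstream fix.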

hiyouga added the "solved (This problem has been already solved)" label and removed the "pending (This problem is yet to be addressed)" label on Jul 13, 2024
xtchen96 pushed a commit to xtchen96/LLaMA-Factory that referenced this issue on Jul 17, 2024