[Model] DeepSeek-V3 Enhancements #11539

Open
6 of 10 tasks
simon-mo opened this issue Dec 27, 2024 · 43 comments

Labels: new model (requests for new models), performance (performance-related issues)

Comments

@simon-mo
Collaborator

simon-mo commented Dec 27, 2024

This issue tracks follow up enhancements after initial support for the Deepseek V3 model. Please feel free to chime in and contribute!

simon-mo added the performance and new model labels and removed the misc label on Dec 27, 2024
simon-mo changed the title from "[Model] Deepseek V3 Enhancements" to "[Model] DeepSeek-V3 Enhancements" on Dec 27, 2024
@july8023

If I want to deploy the DeepSeek 600B model with vLLM on RTX 4090s, are there any restrictions? At minimum, how many RTX 4090s would I need?

@fsaudm

fsaudm commented Dec 31, 2024

Is inference with A100s supported? How about quantization??

@mphilippnv

DeepSeek-V3 doesn't appear to support pipeline parallelism. I get this error when attempting to deploy to two 8x H100 nodes:

NotImplementedError: Pipeline parallelism is only supported for the following  architectures: ['AquilaForCausalLM', 'AquilaModel', 'DeepseekV2ForCausalLM', 'GPT2LMHeadModel', 'InternLM2ForCausalLM', 'InternLMForCausalLM', 'InternVLChatModel', 'JAISLMHeadModel', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'NemotronForCausalLM', 'Phi3ForCausalLM', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'QWenLMHeadModel', 'Qwen2VLForConditionalGeneration'].

I'm using --tensor-parallel-size 8 --pipeline-parallel-size 2

@simon-mo
Collaborator Author

@july8023 It should work on 4090s; generally the model takes about 600 GB of memory, then you want about 100-300 GB for KV cache, so feel free to plan around that.
@fsaudm A100s are not supported because this model requires FP8 tensor cores.
@mphilippnv Which version of vLLM are you using? You might need to update to v0.6.6 or higher.
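
For rough planning, here's a back-of-the-envelope sketch based on the numbers above (assuming 24 GB per RTX 4090; an estimate, not a verified configuration):

# rough RTX 4090 count for DeepSeek-V3: ~600 GB FP8 weights + 100-300 GB KV cache
import math
weights_gb = 600
kv_cache_gb_low, kv_cache_gb_high = 100, 300
per_gpu_gb = 24
print(math.ceil((weights_gb + kv_cache_gb_low) / per_gpu_gb))   # ~30 cards
print(math.ceil((weights_gb + kv_cache_gb_high) / per_gpu_gb))  # ~38 cards

In practice you would also need headroom for activations and runtime overhead, plus a TP/PP/EP layout that divides evenly across that many cards.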

@fsaudm

fsaudm commented Dec 31, 2024

@simon-mo Right, A100s don't support FP8. Would the arg --dtype bfloat16 suffice? If not, I found a bf16 version on Hugging Face; any insights on whether that would work?

@simon-mo
Collaborator Author

The model currently does not support --dtype bfloat16 because it is natively trained in fp8. Can you point me to the bf16 version?

@fsaudm

fsaudm commented Dec 31, 2024

@simon-mo on HF: https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16/tree/main

On the official repo they provide a script to cast FP8 to BF16, but of course you can't run it on A100s... my guess is a good soul did it and uploaded the result to HF. In the repo, see section 6.

https://github.com/deepseek-ai/DeepSeek-V3

@simon-mo
Collaborator Author

vLLM does support this bf16 model on A100. It looks like the config.json properly removed quantization_config, so it should already work.
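
For anyone who wants to try this, a minimal launch sketch (assuming the community bf16 checkpoint linked above and roughly 20x A100 80GB across nodes with Ray already set up, since the bf16 weights alone are about 1.3 TB; the flags are standard vLLM options, not a verified recipe):

python -m vllm.entrypoints.openai.api_server \
  --model opensourcerelease/DeepSeek-V3-bf16 \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 5 \
  --dtype bfloat16 \
  --trust-remote-code \
  --max-model-len 8192 \
  --distributed-executor-backend ray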

@mphilippnv

mphilippnv commented Dec 31, 2024

@july8023 It should work on 4090, generally the models takes about 600GB memory, then you want about 100-300GB for KV cache so feel free to plan around that. @fsaudm A100s are not supported because this models requires FP8 tensor cores. @mphilippnv which version of vLLM are you using? You might need to update to v0.6.6 or higher.

Using v0.6.6

EDIT: Apologies, I was using 0.6.2. Redeploying helm chart with 0.6.6.post1. Will see how it goes.

@fsaudm

fsaudm commented Dec 31, 2024

Is there a known working example of serving DeepSeek-V3 on A100s with vLLM? I'll try later, but any hints or help are very much appreciated.

@JamesBVMNetwork

Hi everyone,
I’m encountering the following error when trying to run the image vllm/vllm-openai:v0.6.6.post1 on a node equipped with 8x H100 SXM GPUs:

ValueError: Error in model execution (input dumped to /tmp/err_execute_model_input_20250102-072212.pkl): functional_call got multiple values for keys ['mlp.experts.e_score_correction_bias', 'mlp.gate.e_score_correction_bias'], which are tied. Consider using tie_weights=False
2025-01-02T15:22:12.753719474Z 

Here’s the command I used:

--model deepseek-ai/DeepSeek-V3-Base \
--tensor-parallel-size 8 \
--disable_log_requests \
--uvicorn_log_level error \
--max-model-len 16384 \
--cpu-offload-gb 400 \
--max_num_seqs 1 \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--enforce-eager

Does anyone have suggestions or solutions for resolving this issue?

Thanks in advance!

@glowwormX

Hi everyone, I’m encountering the following error when trying to run the image vllm/vllm-openai:v0.6.6.post1 on a node equipped with 8x H100 SMX GPUs:

ValueError: Error in model execution (input dumped to /tmp/err_execute_model_input_20250102-072212.pkl): functional_call got multiple values for keys ['mlp.experts.e_score_correction_bias', 'mlp.gate.e_score_correction_bias'], which are tied. Consider using tie_weights=False
2025-01-02T15:22:12.753719474Z 

Here’s the command I used:

--model deepseek-ai/DeepSeek-V3-Base \
--tensor-parallel-size 8 \
--disable_log_requests \
--uvicorn_log_level error \
--max-model-len 16384 \
--cpu-offload-gb 400 \
--max_num_seqs 1 \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--enforce-eager

Does anyone have suggestions or solutions for resolving this issue?

Thanks in advance!

I've had this problem, too. Is there a solution?

@ishaandatta

I've had this problem, too. Is there a solution?

I was getting this error too; it got resolved by removing CPU offloading... hoping for an explanation.

Also, any suggestions to increase token throughput & context length?
We're stuck at 6 tokens/second and a max 10k context length despite 1600 GB of VRAM.
I am currently running with tensor+pipeline parallelism on 5 nodes (4x A100 80GB each). The VMs are without InfiniBand.

Would having InfiniBand (i.e. higher inter-node bandwidth & lower latency) be the main solution to increase token throughput? And for context lengths > 40k, how much more VRAM would be required?
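
For reference, a sketch of the earlier launch command with --cpu-offload-gb removed, which is the change that reportedly resolved the tie_weights error. Note that without offloading the full weights must fit in aggregate GPU memory, so this generally needs more than a single 8-GPU node; the remaining flags are unchanged from the original report:

--model deepseek-ai/DeepSeek-V3-Base \
--tensor-parallel-size 8 \
--disable_log_requests \
--uvicorn_log_level error \
--max-model-len 16384 \
--max_num_seqs 1 \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--enforce-eager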

@shaowei-su

I've had this problem, too. Is there a solution?

Was getting this error- got resolved by removing cpu offloading... hoping for an explanation.

Also, any suggestions to increase token throughput & context length. We're stuck at 6 tokens/second, max 10k context length despite 1600GB VRAM. I am currently running with tensor+pipeline parallelism on 5 Nodes (4x A100 80GB each). The vm are without Infiniband.

Would having Infiniband (i.e. higher inter-node bandwidth & lower latency) be the main solution to increase token throughput? And for context length > 40k, how much more VRAM would be required..?

Hi @ishaandatta, could you share which model version you are using? I'm getting errors complaining that the fp8e4nv data type is not supported on CUDA arch < 89 when loading the model on A100 GPUs. Or maybe you are on the bf16 version? https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16/tree/main. Thanks

@merlintang

We also run into very slow token processing speed, around 3 tokens/s, even though we use H100s and IB. Any suggestions?

@lhl

lhl commented Jan 9, 2025

we also run int over very slow token processing speed like 3 token/s, even if we use h100 and IB. any suggestions?

I found tp16 to be about 2X faster than pp=2 tp=8 w/ 2 x H100 nodes. Here's my testing: https://llm-tracker.info/DeepSeek-V3-Testing

Here's vLLM vs SGLang at concurrency=64 atm:

[chart: vLLM vs SGLang output token throughput at concurrency=64]

Note, I found that vLLM has some stop token errors for output (that SGLang doesn't have) w/ some of my testing.

@fan-niu

fan-niu commented Jan 9, 2025

Same issue. I used 16 H100 GPUs, set TP=16, deployed using Ray in k8s, and enabled the IB network. I made a simple curl request with 10 input tokens and 242 output tokens, and it took 44 seconds. Can anyone help me figure out why?

@merlintang

Are the perf issues related to the MoE optimizations? They are not included in the current version, right?

@ishaandatta

ishaandatta commented Jan 9, 2025

@shaowei-su I'm using the bf16 version you linked.

@lhl thank you for sharing this! I'm currently using tp=4 pp=6 as we're aiming for context lengths > 64k.
Just to clarify, your benchmarks indicate ~5 output tokens/s on vLLM & around 10 for SGLang?
If so, I am wondering how deepseek-chat is able to achieve their throughput; I measured it at over 60 output tokens/sec.

@lhl

lhl commented Jan 10, 2025

Just to clarify, your benchmarks indicate ~5 output tokens/s on vLLM & around 10 for SGLang ?

for bs=1 SGLang outputs around 26 tok/s:

(sglang) ubuntu@ip-10-1-1-135:~$ python3 -m sglang.bench_serving --backend sglang --num-prompts 50 --max-concurrency 1 --port 8000
Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=8000, dataset_name='sharegpt', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=50, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=0.0, request_rate=inf, max_concurrency=1, seed=1, multi=False, request_rate_range='2,34,2', output_file=None, disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None)

#Input tokens: 10354
#Output tokens: 11509
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [07:20<00:00,  8.82s/it]

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max reqeuest concurrency:                1
Successful requests:                     50
Benchmark duration (s):                  440.98
Total input tokens:                      10354
Total generated tokens:                  11509
Total generated tokens (retokenized):    11467
Request throughput (req/s):              0.11
Input token throughput (tok/s):          23.48
Output token throughput (tok/s):         26.10
Total token throughput (tok/s):          49.58
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   8819.11
Median E2E Latency (ms):                 4817.32
---------------Time to First Token----------------
Mean TTFT (ms):                          318.37
Median TTFT (ms):                        259.02
P99 TTFT (ms):                           1658.59
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          36.41
Median TPOT (ms):                        36.97
P99 TPOT (ms):                           37.60
---------------Inter-token Latency----------------
Mean ITL (ms):                           37.18
Median ITL (ms):                         37.06
P99 ITL (ms):                            38.91
==================================================

You should read the infrastructure section of the DeepSeek technical report: they deploy in 320-GPU blocks with specialized/separated functions.

That being said, there are certainly optimizations that can be made for "regular" inference. On vLLM, when doing throughput optimization, with some tuning I can generate >7000 tok/s on a single H100 node for a Llama 3 70B class model at c=512. DSv3 has about half the activations, and at c=512 SGLang currently tops out at about 1100 tok/s on 2x H100 nodes (vLLM is about half of that). You could imagine that there might be a 5-10X throughput optimization available, based naively on activations per forward pass. This is before spec decode like EAGLE or Medusa is factored in.
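
A rough per-GPU sanity check of that headroom argument, using only the numbers quoted above (a sketch for intuition, not a benchmark):

# implied per-GPU output throughput at high concurrency
llama70b_tok_s = 7000      # Llama 3 70B class, 1 H100 node (8 GPUs), tuned vLLM, c=512
dsv3_sglang_tok_s = 1100   # DeepSeek-V3, 2 H100 nodes (16 GPUs), SGLang, c=512
dsv3_vllm_tok_s = 550      # vLLM reported at roughly half of SGLang
print(llama70b_tok_s / 8)      # ~875 tok/s per GPU
print(dsv3_sglang_tok_s / 16)  # ~69 tok/s per GPU
print(dsv3_vllm_tok_s / 16)    # ~34 tok/s per GPU
# DSv3 activates roughly half the parameters per token of a 70B dense model,
# so the per-GPU gap above is where the suggested 5-10x headroom comes from.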

@fan-niu

fan-niu commented Jan 10, 2025

@simon-mo Is there any way or plan to improve the speed of vllm on deepseek v3? Thanks a lot

@panpan0000
Contributor

We also see 3 tokens/s on 16x H20 with TP=8, PP=2.

@drikster80
Contributor

When I tested TP=16 on GH200 nodes (FP8 version), I was getting ~7.1 t/s (single batch). Ironically, when I used TP=8 (max_model_len=2048 so it all fit), I was getting slightly faster results, which seemed strange.

One of the issues that might be slowing vLLM down is that one of the MoE-specific CUDA kernels is hard-coded for DSv3 to force the use of global memory, which is significantly slower than shared memory. This is due to the limited amount of shared memory available (dependent on the GPU model; for example, the H100 has 227 KB of shared memory per block).
https://github.com/vllm-project/vllm/blob/main/csrc/moe/moe_align_sum_kernels.cu#L232

I don't know how much effect this has for this specific kernel, but it likely has some consequence. Techniques like distributed shared memory (H100+ specific) might be usable, or only keeping the active experts in shared memory... but unfortunately I don't know much about CUDA programming. I spent 2 days trying to implement the "active-expert only" approach, but it only served to slow things down to 4.5 t/s...

@xpmemeda

Hello. When deploying with vLLM, which parser should be used to support the tool call feature?

@WangxuP

WangxuP commented Jan 16, 2025

vLLM does support this bf16 model on A100. It looks like the config.json properly removed quantization_config so it would already.

Does vllm==0.6.6.post1 support this?

@teknium1

Can anyone explain why we can only get around 7 tok/s across 2 HGX nodes in any configuration, over verified 3.2 Tbps IB?

@tot0

tot0 commented Jan 28, 2025

Can anyone explain why we can only get like 7tok/s across 2hgxs in any configuration over verified 3.2tbps IB

I get 10.5 tok/s (1 sequence) on 8*MI300x using sglang, just for reference.

@pseudotensor

Same, about 13 tokens/sec (1 sequence, long output) on 8*MI300x using sglang. It has an unexpected TTFT lag for me of about 2 seconds though.

@Neo9061

Neo9061 commented Feb 2, 2025

we also run int over very slow token processing speed like 3 token/s, even if we use h100 and IB. any suggestions?

I found tp16 to be about 2X faster than pp=2 tp=8 w/ 2 x H100 nodes. Here's my testing: https://llm-tracker.info/DeepSeek-V3-Testing

Here's vLLM vs SGLang at concurrency=64 atm:

[chart: vLLM vs SGLang output token throughput at concurrency=64]

Note, I found that vLLM has some stop token errors for output (that SGLang doesn't have) w/ some of my testing.

Hi @lhl! Could you please share how you set up multi-node 8x H100 instances for vLLM? I followed the vLLM distributed inference document to set up two nodes of 8x H100, and I see the following with ray status.

Active:
 1 node_62bc8c92be4ee6912d3ac7sfsff5db8acf209daa9e
 1 node_ed25244254634eb76cfsfdfc7db4cf366b4a86c9b
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/384.0 CPU
 0.0/16.0 GPU
 0B/3.88TiB memory
 0B/19.46GiB object_store_memory

Demands:
 (no resource demands)

Then I try to deploy the vLLM engine with the following command:

MODEL_ID=/mylocal/DeepSeek/DeepSeek-R1, 
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
 python -m vllm.entrypoints.openai.api_server \
 --model $MODEL_ID \
 --port 8002 \
 --tensor-parallel-size 16 \
 --max-model-len 20000 \
 --trust-remote-code \
 --distributed-executor-backend ray

But I kept seeing

Started a local Ray instance. View the dashboard at 127.0.0.1:8266 
WARNING 02-02 19:54:20 ray_utils.py:315] The number of required GPUs exceeds the total number of available GPUs in the placement group.
INFO 02-02 19:54:30 ray_utils.py:212] Waiting for creating a placement group of specs for 10 seconds. specs=[{'GPU': 1.0, 'node:172.31.42.4': 0.001}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}]. Check `ray status` to see if you have enough resources.

Any instructions would be much appreciated! CC: @pseudotensor @teknium1

@lhl

lhl commented Feb 3, 2025

Hi @lhl ! Could you please share how you setup multi-nodes of 8 H100 instances for VLLM? I followed VLLM distributed inference document to setup two nodes of 8H100 and I am able to see following with ray status.

If you are having trouble with the docs https://docs.vllm.ai/en/latest/serving/distributed_serving.html#running-vllm-on-multiple-nodes and the referenced helper script, I'm not sure I can help - I spent a fair amount of effort adapting Ray to play nice with my Slurm setup, so it's not very applicable for raw nodes. I'd maybe search or start a "discussion" thread and see if you can get an answer.

Barring that, I will have to say that sglang's multi-node launching is dead simple, so you could give that a spin if you can't get vLLM working: https://docs.sglang.ai/backend/server_arguments.html#common-launch-commands

@yaochengji
Collaborator

Thanks @simon-mo, does the EP support include the optimization for the shared expert(s) as described in the DeepSeek-V3 paper?

@ehuaa

ehuaa commented Feb 7, 2025

@simon-mo on HF: https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16/tree/main

, on the official repo they provide a script to cast fp8 to bf16, but of course you can't do it on A100s... my guess is a good soul did it and uploaded it to HF. In the repo, see 6.

https://github.com/deepseek-ai/DeepSeek-V3

I wonder why this conversion cannot be performed on A100s; the script doesn't seem to need to load all of the DeepSeek-V3 model on GPU. @fsaudm Can you tell me the reason? Thanks

@lambert0312

@simon-mo on HF: https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16/tree/main
, on the official repo they provide a script to cast fp8 to bf16, but of course you can't do it on A100s... my guess is a good soul did it and uploaded it to HF. In the repo, see 6.
https://github.com/deepseek-ai/DeepSeek-V3

I wonder why this conversion cannot performed on A100s, this script seems don't need load all of the deepseek-v3 model on gpu. @fsaudm Can you show me the reason? Thanks

The script processes the weight files one by one: it reads the FP8 weights and converts them into BF16 format on the GPU. Because GPUs such as the A100/A800 cannot process the FP8 format, the script cannot be used on them. @ehuaa

@tjchuangplus

May I ask if anyone has used 4 nodes of 8x H100 for inference serving, and whether vLLM can run successfully?

@mphilippnv

May I ask if anyone has used 4 sets of 8 * H100 for inference services and if VLLM can run successfully?

Yes. It's incredibly slow though. Like 6 token/s.

@nowinkeyy

Hi @lhl ! Could you please share how you setup multi-nodes of 8 H100 instances for VLLM? I followed VLLM distributed inference document to setup two nodes of 8H100 and I am able to see following with ray status.

Active:
 1 node_62bc8c92be4ee6912d3ac7sfsff5db8acf209daa9e
 1 node_ed25244254634eb76cfsfdfc7db4cf366b4a86c9b
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/384.0 CPU
 0.0/16.0 GPU
 0B/3.88TiB memory
 0B/19.46GiB object_store_memory

Demands:
 (no resource demands)

Then I try to deploy the VLLM engine by following code

MODEL_ID=/mylocal/DeepSeek/DeepSeek-R1, 
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
 python -m vllm.entrypoints.openai.api_server \
 --model $MODEL_ID \
 --port 8002 \
 --tensor-parallel-size 16 \
 --max-model-len 20000 \
 --trust-remote-code \
 --distributed-executor-backend ray

But I kept seeing

Started a local Ray instance. View the dashboard at 127.0.0.1:8266 
WARNING 02-02 19:54:20 ray_utils.py:315] The number of required GPUs exceeds the total number of available GPUs in the placement group.
INFO 02-02 19:54:30 ray_utils.py:212] Waiting for creating a placement group of specs for 10 seconds. specs=[{'GPU': 1.0, 'node:172.31.42.4': 0.001}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}]. Check `ray status` to see if you have enough resources.

Any instruction will be much appreciated! CC: @pseudotensor @teknium1

@Neo9061 Hello, I think the CUDA_VISIBLE_DEVICES environment variable is misconfigured. CUDA_VISIBLE_DEVICES should only list the GPUs of the current node. For example, with two 8x H100 machines, the CUDA_VISIBLE_DEVICES environment variable on both machines should be 0,1,2,3,4,5,6,7. This is how I set it; you can try it.
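
To make that concrete, a minimal two-node sketch (assuming the head node's IP is 10.0.0.1; the IP and port are placeholders, and the launch flags mirror the command quoted above):

# on the head node (its 8 local GPUs)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ray start --head --port=6379

# on the second node (its 8 local GPUs)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ray start --address=10.0.0.1:6379

# then, on the head node only
python -m vllm.entrypoints.openai.api_server \
  --model /mylocal/DeepSeek/DeepSeek-R1 \
  --port 8002 \
  --tensor-parallel-size 16 \
  --max-model-len 20000 \
  --trust-remote-code \
  --distributed-executor-backend ray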

@tjchuangplus

May I ask if anyone has used 4 sets of 8 * H100 for inference services and if VLLM can run successfully?

Yes. It's incredibly slow though. Like 6 token/s.

May I ask whether this is using FP8 precision or BF16 precision? It seems that four H800 machines with FP8 precision cannot run it.

@mphilippnv

May I ask if anyone has used 4 sets of 8 * H100 for inference services and if VLLM can run successfully?

Yes. It's incredibly slow though. Like 6 token/s.

May I ask if this is using FP8 precision or BF16 precision? It seems that four H800 machines with FP8 precision cannot run

I think it was bf16. It was whatever the standard settings are.

@tjchuangplus

May I ask if anyone has used 4 sets of 8 * H100 for inference services and if VLLM can run successfully?

Yes. It's incredibly slow though. Like 6 token/s.

May I ask if this is using FP8 precision or BF16 precision? It seems that four H800 machines with FP8 precision cannot run

I think it was bf16. It was whatever the standard settings are.

Thank you, yes, I have also verified that BF16 is feasible, but FP8 cannot run smoothly, possibly because the required vLLM parallel strategy is not yet supported.

@aftersnow

Thank you @simon-mo ! Do you have plans to support sequence parallelism?

@javading

I've had this problem, too. Is there a solution?

Was getting this error- got resolved by removing cpu offloading... hoping for an explanation.
Also, any suggestions to increase token throughput & context length. We're stuck at 6 tokens/second, max 10k context length despite 1600GB VRAM. I am currently running with tensor+pipeline parallelism on 5 Nodes (4x A100 80GB each). The vm are without Infiniband.
Would having Infiniband (i.e. higher inter-node bandwidth & lower latency) be the main solution to increase token throughput? And for context length > 40k, how much more VRAM would be required..?

Hi @ishaandatta could you share which model version are you using? I'm getting errors complaining fp8e4nv data type is not supported on CUDA arch < 89 when loading the model on A100 GPUs. Or maybe you are on the bf16 version? https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16/tree/main. Thanks

Hi, I've had this problem too. Is there a solution?

@tjchuangplus

After vLLM has been running for a while, it can hit an error indicating insufficient CUDA memory. Has anyone encountered this before? How should we handle it?

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.28 GiB. GPU 0 has a total capacity of 95.22 GiB of which 3.11 GiB is free. Including non-PyTorch memory, this process has 92.10 GiB memory in use. Of the allocated memory 73.21 GiB is allocated by PyTorch, with 82.00 MiB allocated in private pools (e.g., CUDA Graphs), and 2.87 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/root/anaconda3/envs/deepseek/lib/python3.12/site-packages/vllm/engine/async_llm_engine.py", line 70, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
CRITICAL 02-16 02:55:40 launcher.py:74] AsyncLLMEngine has failed, terminating server process

@ztxdcyy

ztxdcyy commented Feb 18, 2025

After using VLLM for a period of time, there may be an error message indicating insufficient CUDA memory. Have you encountered this before? How should we handle it?

` torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.28 GiB. GPU 0 has a total capacity of 95.22 GiB of which 3.11 GiB is free. Including non-PyTorch memory, this process has 92.10 GiB memory in use. Of the allocated memory 73.21 GiB is allocated by PyTorch, with 82.00 MiB allocated in private pools (e.g., CUDA Graphs), and 2.87 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run File "/root/anaconda3/envs/deepseek/lib/python3.12/site-packages/vllm/engine/async_llm_engine.py", line 70, in _log_task_completion raise AsyncEngineDeadError( vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause. CRITICAL 02-16 02:55:40 launcher.py:74] AsyncLLMEngine has failed, terminating server process`

Try --enforce-eager, maybe that helps.
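
Putting the error message's own suggestion and the advice in this thread together, a sketch of the usual mitigations (values are illustrative, not tuned):

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True   # reduce fragmentation, per the error message

# and on the vLLM launch command:
#   --gpu-memory-utilization 0.90   (leave more headroom than 0.95)
#   --enforce-eager                 (skip CUDA graph capture, trading some speed for memory)
#   --max-model-len <smaller value> (shrinks the KV cache that must be reserved)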
