[Model] DeepSeek-V3 Enhancements #11539

Open
6 of 10 tasks
simon-mo opened this issue Dec 27, 2024 · 43 comments

Labels: new model (requests for new models), performance (performance-related issues)

Comments

@simon-mo
Collaborator

simon-mo commented Dec 27, 2024

This issue tracks follow up enhancements after initial support for the Deepseek V3 model. Please feel free to chime in and contribute!

simon-mo added the performance and new model labels and removed the misc label on Dec 27, 2024
simon-mo changed the title from "[Model] Deepseek V3 Enhancements" to "[Model] DeepSeek-V3 Enhancements" on Dec 27, 2024
@july8023

If I want to deploy the DeepSeek 600B model with vLLM on RTX 4090s, are there any restrictions? At minimum, how many RTX 4090s would I need?

@fsaudm

fsaudm commented Dec 31, 2024

Is inference with A100s supported? How about quantization??

@mphilippnv

DeepSeek-V3 doesn't appear to support pipeline parallelism. I get this error when attempting to deploy to two 8x H100 nodes:

NotImplementedError: Pipeline parallelism is only supported for the following  architectures: ['AquilaForCausalLM', 'AquilaModel', 'DeepseekV2ForCausalLM', 'GPT2LMHeadModel', 'InternLM2ForCausalLM', 'InternLMForCausalLM', 'InternVLChatModel', 'JAISLMHeadModel', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'NemotronForCausalLM', 'Phi3ForCausalLM', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'QWenLMHeadModel', 'Qwen2VLForConditionalGeneration'].

I'm using --tensor-parallel-size 8 --pipeline-parallel-size 2

@simon-mo
Collaborator Author

@july8023 It should work on 4090s; generally the model takes about 600 GB of memory, then you want about 100-300 GB for KV cache, so feel free to plan around that.
@fsaudm A100s are not supported because this model requires FP8 tensor cores.
@mphilippnv Which version of vLLM are you using? You might need to update to v0.6.6 or higher.
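
For rough planning, here's a back-of-the-envelope sketch based on the numbers above (assuming 24 GB per RTX 4090; an estimate, not a verified configuration):

# rough RTX 4090 count for DeepSeek-V3: ~600 GB FP8 weights + 100-300 GB KV cache
import math
weights_gb = 600
kv_cache_gb_low, kv_cache_gb_high = 100, 300
per_gpu_gb = 24
print(math.ceil((weights_gb + kv_cache_gb_low) / per_gpu_gb))   # ~30 cards
print(math.ceil((weights_gb + kv_cache_gb_high) / per_gpu_gb))  # ~38 cards

In practice you would also need headroom for activations and runtime overhead, plus a TP/PP/EP layout that divides evenly across that many cards.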

@fsaudm

fsaudm commented Dec 31, 2024

@simon-mo Right, A100s don't support FP8. Would the arg --dtype bfloat16 suffice? If not, I found a bf16 version on Hugging Face; any insights on whether that would work?

@simon-mo
Collaborator Author

The model currently does not support --dtype bfloat16 because it is natively trained in fp8. Can you point me to the bf16 version?

@fsaudm

fsaudm commented Dec 31, 2024

@simon-mo on HF: https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16/tree/main

On the official repo they provide a script to cast FP8 to BF16, but of course you can't run it on A100s... my guess is a good soul did it and uploaded the result to HF. In the repo, see section 6.

https://github.com/deepseek-ai/DeepSeek-V3

@simon-mo
Collaborator Author

vLLM does support this bf16 model on A100. It looks like the config.json properly removed quantization_config, so it should already work.
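
For anyone who wants to try this, a minimal launch sketch (assuming the community bf16 checkpoint linked above and roughly 20x A100 80GB across nodes with Ray already set up, since the bf16 weights alone are about 1.3 TB; the flags are standard vLLM options, not a verified recipe):

python -m vllm.entrypoints.openai.api_server \
  --model opensourcerelease/DeepSeek-V3-bf16 \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 5 \
  --dtype bfloat16 \
  --trust-remote-code \
  --max-model-len 8192 \
  --distributed-executor-backend ray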

@mphilippnv

mphilippnv commented Dec 31, 2024

@july8023 It should work on 4090, generally the models takes about 600GB memory, then you want about 100-300GB for KV cache so feel free to plan around that. @fsaudm A100s are not supported because this models requires FP8 tensor cores. @mphilippnv which version of vLLM are you using? You might need to update to v0.6.6 or higher.

Using v0.6.6

EDIT: Apologies, I was using 0.6.2. Redeploying helm chart with 0.6.6.post1. Will see how it goes.

@fsaudm

fsaudm commented Dec 31, 2024

Is there a known working example of serving DeepSeek-V3 on A100s with vLLM? I'll try later, but any hints or help are very much appreciated.

@JamesBVMNetwork

Hi everyone,
I’m encountering the following error when trying to run the image vllm/vllm-openai:v0.6.6.post1 on a node equipped with 8x H100 SXM GPUs:

ValueError: Error in model execution (input dumped to /tmp/err_execute_model_input_20250102-072212.pkl): functional_call got multiple values for keys ['mlp.experts.e_score_correction_bias', 'mlp.gate.e_score_correction_bias'], which are tied. Consider using tie_weights=False
2025-01-02T15:22:12.753719474Z 

Here’s the command I used:

--model deepseek-ai/DeepSeek-V3-Base \
--tensor-parallel-size 8 \
--disable_log_requests \
--uvicorn_log_level error \
--max-model-len 16384 \
--cpu-offload-gb 400 \
--max_num_seqs 1 \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--enforce-eager

Does anyone have suggestions or solutions for resolving this issue?

Thanks in advance!

@glowwormX

Hi everyone, I’m encountering the following error when trying to run the image vllm/vllm-openai:v0.6.6.post1 on a node equipped with 8x H100 SMX GPUs:

ValueError: Error in model execution (input dumped to /tmp/err_execute_model_input_20250102-072212.pkl): functional_call got multiple values for keys ['mlp.experts.e_score_correction_bias', 'mlp.gate.e_score_correction_bias'], which are tied. Consider using tie_weights=False
2025-01-02T15:22:12.753719474Z 

Here’s the command I used:

--model deepseek-ai/DeepSeek-V3-Base \
--tensor-parallel-size 8 \
--disable_log_requests \
--uvicorn_log_level error \
--max-model-len 16384 \
--cpu-offload-gb 400 \
--max_num_seqs 1 \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--enforce-eager

Does anyone have suggestions or solutions for resolving this issue?

Thanks in advance!

I've had this problem, too. Is there a solution?

@ishaandatta

I've had this problem, too. Is there a solution?

I was getting this error too; it got resolved by removing CPU offloading... hoping for an explanation.

Also, any suggestions to increase token throughput & context length?
We're stuck at 6 tokens/second and a max 10k context length despite 1600 GB of VRAM.
I am currently running with tensor+pipeline parallelism on 5 nodes (4x A100 80GB each). The VMs are without InfiniBand.

Would having InfiniBand (i.e. higher inter-node bandwidth & lower latency) be the main solution to increase token throughput? And for context lengths > 40k, how much more VRAM would be required?
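
For reference, a sketch of the earlier launch command with --cpu-offload-gb removed, which is the change that reportedly resolved the tie_weights error. Note that without offloading the full weights must fit in aggregate GPU memory, so this generally needs more than a single 8-GPU node; the remaining flags are unchanged from the original report:

--model deepseek-ai/DeepSeek-V3-Base \
--tensor-parallel-size 8 \
--disable_log_requests \
--uvicorn_log_level error \
--max-model-len 16384 \
--max_num_seqs 1 \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--enforce-eager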

@shaowei-su

I've had this problem, too. Is there a solution?

Was getting this error- got resolved by removing cpu offloading... hoping for an explanation.

Also, any suggestions to increase token throughput & context length. We're stuck at 6 tokens/second, max 10k context length despite 1600GB VRAM. I am currently running with tensor+pipeline parallelism on 5 Nodes (4x A100 80GB each). The vm are without Infiniband.

Would having Infiniband (i.e. higher inter-node bandwidth & lower latency) be the main solution to increase token throughput? And for context length > 40k, how much more VRAM would be required..?

Hi @ishaandatta, could you share which model version you are using? I'm getting errors complaining that the fp8e4nv data type is not supported on CUDA arch < 89 when loading the model on A100 GPUs. Or maybe you are on the bf16 version? https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16/tree/main. Thanks

@merlintang

We also run into very slow token processing speed, around 3 tokens/s, even though we use H100s and IB. Any suggestions?

@lhl

lhl commented Jan 9, 2025

we also run int over very slow token processing speed like 3 token/s, even if we use h100 and IB. any suggestions?

I found tp16 to be about 2X faster than pp=2 tp=8 w/ 2 x H100 nodes. Here's my testing: https://llm-tracker.info/DeepSeek-V3-Testing

Here's vLLM vs SGLang at concurrency=64 atm:

[chart: vLLM vs SGLang output token throughput at concurrency=64]

Note, I found that vLLM has some stop token errors for output (that SGLang doesn't have) w/ some of my testing.

@fan-niu

fan-niu commented Jan 9, 2025

Same issue. I used 16 H100 GPUs, set TP=16, deployed using Ray in k8s, and enabled the IB network. I made a simple curl request with 10 input tokens and 242 output tokens, and it took 44 seconds. Can anyone help me figure out why?

@merlintang

Are the perf issues related to the MoE optimizations? They are not included in the current version, right?

@ishaandatta

ishaandatta commented Jan 9, 2025

@shaowei-su I'm using the bf16 version you linked.

@lhl thank you for sharing this! I'm currently using tp=4 pp=6 as we're aiming for context lengths > 64k.
Just to clarify, your benchmarks indicate ~5 output tokens/s on vLLM & around 10 for SGLang?
If so, I am wondering how deepseek-chat is able to achieve their throughput; I measured it at over 60 output tokens/sec.

@lhl

lhl commented Jan 10, 2025

Just to clarify, your benchmarks indicate ~5 output tokens/s on vLLM & around 10 for SGLang ?

for bs=1 SGLang outputs around 26 tok/s:

(sglang) ubuntu@ip-10-1-1-135:~$ python3 -m sglang.bench_serving --backend sglang --num-prompts 50 --max-concurrency 1 --port 8000
Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=8000, dataset_name='sharegpt', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=50, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=0.0, request_rate=inf, max_concurrency=1, seed=1, multi=False, request_rate_range='2,34,2', output_file=None, disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None)

#Input tokens: 10354
#Output tokens: 11509
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [07:20<00:00,  8.82s/it]

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max reqeuest concurrency:                1
Successful requests:                     50
Benchmark duration (s):                  440.98
Total input tokens:                      10354
Total generated tokens:                  11509
Total generated tokens (retokenized):    11467
Request throughput (req/s):              0.11
Input token throughput (tok/s):          23.48
Output token throughput (tok/s):         26.10
Total token throughput (tok/s):          49.58
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   8819.11
Median E2E Latency (ms):                 4817.32
---------------Time to First Token----------------
Mean TTFT (ms):                          318.37
Median TTFT (ms):                        259.02
P99 TTFT (ms):                           1658.59
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          36.41
Median TPOT (ms):                        36.97
P99 TPOT (ms):                           37.60
---------------Inter-token Latency----------------
Mean ITL (ms):                           37.18
Median ITL (ms):                         37.06
P99 ITL (ms):                            38.91
==================================================

You should read the infrastructure section of the DeepSeek technical report: they deploy in 320-GPU blocks with specialized/separated functions.

That being said, there are certainly optimizations that can be made for "regular" inference. On vLLM, when doing throughput optimization, with some tuning I can generate >7000 tok/s on a single H100 node for a Llama 3 70B class model at c=512. DSv3 has about half the activations, and at c=512 SGLang currently tops out at about 1100 tok/s on 2x H100 nodes (vLLM is about half of that). You could imagine that there might be a 5-10X throughput optimization available, based naively on activations per forward pass. This is before spec decode like EAGLE or Medusa is factored in.
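
A rough per-GPU sanity check of that headroom argument, using only the numbers quoted above (a sketch for intuition, not a benchmark):

# implied per-GPU output throughput at high concurrency
llama70b_tok_s = 7000      # Llama 3 70B class, 1 H100 node (8 GPUs), tuned vLLM, c=512
dsv3_sglang_tok_s = 1100   # DeepSeek-V3, 2 H100 nodes (16 GPUs), SGLang, c=512
dsv3_vllm_tok_s = 550      # vLLM reported at roughly half of SGLang
print(llama70b_tok_s / 8)      # ~875 tok/s per GPU
print(dsv3_sglang_tok_s / 16)  # ~69 tok/s per GPU
print(dsv3_vllm_tok_s / 16)    # ~34 tok/s per GPU
# DSv3 activates roughly half the parameters per token of a 70B dense model,
# so the per-GPU gap above is where the suggested 5-10x headroom comes from.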

@fan-niu

fan-niu commented Jan 10, 2025

@simon-mo Is there any way or plan to improve the speed of vllm on deepseek v3? Thanks a lot

@panpan0000
Contributor

We also see 3 tokens/s on 16x H20 with TP=8, PP=2.

@drikster80
Contributor

When I tested TP=16 on GH200 nodes (FP8 version), I was getting ~7.1 t/s (single batch). Ironically, when I used TP=8 (max_model_len=2048 so it all fit), I was getting slightly faster results, which seemed strange.

One of the issues that might be slowing vLLM down is that one of the MoE-specific CUDA kernels is hard-coded for DSv3 to force the use of global memory, which is significantly slower than shared memory. This is due to the limited amount of shared memory available (dependent on the GPU model; for example, the H100 has 227 KB of shared memory per block).
https://github.com/vllm-project/vllm/blob/main/csrc/moe/moe_align_sum_kernels.cu#L232

I don't know how much effect this has for this specific kernel, but it likely has some consequence. Techniques like distributed shared memory (H100+ specific) might be usable, or only keeping the active experts in shared memory... but unfortunately I don't know much about CUDA programming. I spent 2 days trying to implement the "active-expert only" approach, but it only served to slow things down to 4.5 t/s...

@xpmemeda

Hello. When deploying with vLLM, which parser should be used to support the tool call feature?

@WangxuP

WangxuP commented Jan 16, 2025

vLLM does support this bf16 model on A100. It looks like the config.json properly removed quantization_config so it would already.

Does vllm==0.6.6.post1 support this?

@teknium1

Can anyone explain why we can only get around 7 tok/s across 2 HGX nodes in any configuration, over verified 3.2 Tbps IB?

@tot0

tot0 commented Jan 28, 2025

Can anyone explain why we can only get like 7tok/s across 2hgxs in any configuration over verified 3.2tbps IB

I get 10.5 tok/s (1 sequence) on 8*MI300x using sglang, just for reference.

@pseudotensor

Same, about 13 tokens/sec (1 sequence, long output) on 8*MI300x using sglang. It has an unexpected TTFT lag for me of about 2 seconds though.

@Neo9061

Neo9061 commented Feb 2, 2025

we also run int over very slow token processing speed like 3 token/s, even if we use h100 and IB. any suggestions?

I found tp16 to be about 2X faster than pp=2 tp=8 w/ 2 x H100 nodes. Here's my testing: https://llm-tracker.info/DeepSeek-V3-Testing

Here's vLLM vs SGLang at concurrency=64 atm:

[chart: vLLM vs SGLang output token throughput at concurrency=64]

Note, I found that vLLM has some stop token errors for output (that SGLang doesn't have) w/ some of my testing.

Hi @lhl! Could you please share how you set up multi-node 8x H100 instances for vLLM? I followed the vLLM distributed inference document to set up two nodes of 8x H100, and I see the following with ray status.

Active:
 1 node_62bc8c92be4ee6912d3ac7sfsff5db8acf209daa9e
 1 node_ed25244254634eb76cfsfdfc7db4cf366b4a86c9b
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/384.0 CPU
 0.0/16.0 GPU
 0B/3.88TiB memory
 0B/19.46GiB object_store_memory

Demands:
 (no resource demands)

Then I try to deploy the vLLM engine with the following command:

MODEL_ID=/mylocal/DeepSeek/DeepSeek-R1, 
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
 python -m vllm.entrypoints.openai.api_server \
 --model $MODEL_ID \
 --port 8002 \
 --tensor-parallel-size 16 \
 --max-model-len 20000 \
 --trust-remote-code \
 --distributed-executor-backend ray

But I kept seeing

Started a local Ray instance. View the dashboard at 127.0.0.1:8266 
WARNING 02-02 19:54:20 ray_utils.py:315] The number of required GPUs exceeds the total number of available GPUs in the placement group.
INFO 02-02 19:54:30 ray_utils.py:212] Waiting for creating a placement group of specs for 10 seconds. specs=[{'GPU': 1.0, 'node:172.31.42.4': 0.001}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}]. Check `ray status` to see if you have enough resources.

Any instructions would be much appreciated! CC: @pseudotensor @teknium1

@lhl

lhl commented Feb 3, 2025

Hi @lhl ! Could you please share how you setup multi-nodes of 8 H100 instances for VLLM? I followed VLLM distributed inference document to setup two nodes of 8H100 and I am able to see following with ray status.

If you are having trouble with the docs https://docs.vllm.ai/en/latest/serving/distributed_serving.html#running-vllm-on-multiple-nodes and the referenced helper script, I'm not sure I can help - I spent a fair amount of effort adapting Ray to play nice with my Slurm setup, so it's not very applicable for raw nodes. I'd maybe search or start a "discussion" thread and see if you can get an answer.

Barring that, I will have to say that sglang's multi-node launching is dead simple, so you could give that a spin if you can't get vLLM working: https://docs.sglang.ai/backend/server_arguments.html#common-launch-commands

@yaochengji
Collaborator

Thanks @simon-mo, does the EP support include the optimization for the shared expert(s) as described in the DeepSeek-V3 paper?

@ehuaa

ehuaa commented Feb 7, 2025

@simon-mo on HF: https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16/tree/main

, on the official repo they provide a script to cast fp8 to bf16, but of course you can't do it on A100s... my guess is a good soul did it and uploaded it to HF. In the repo, see 6.

https://github.com/deepseek-ai/DeepSeek-V3

I wonder why this conversion cannot be performed on A100s; the script doesn't seem to need to load all of the DeepSeek-V3 model on GPU. @fsaudm Can you tell me the reason? Thanks

@lambert0312

@simon-mo on HF: https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16/tree/main
, on the official repo they provide a script to cast fp8 to bf16, but of course you can't do it on A100s... my guess is a good soul did it and uploaded it to HF. In the repo, see 6.
https://github.com/deepseek-ai/DeepSeek-V3

I wonder why this conversion cannot performed on A100s, this script seems don't need load all of the deepseek-v3 model on gpu. @fsaudm Can you show me the reason? Thanks

The script processes the weight files one by one: it reads the FP8 weights and converts them into BF16 format on the GPU. Because GPUs such as the A100/A800 cannot process the FP8 format, the script cannot be used on them. @ehuaa

@tjchuangplus

May I ask if anyone has used 4 nodes of 8x H100 for inference serving, and whether vLLM can run successfully?

@mphilippnv

May I ask if anyone has used 4 sets of 8 * H100 for inference services and if VLLM can run successfully?

Yes. It's incredibly slow though. Like 6 token/s.

@nowinkeyy

Hi @lhl ! Could you please share how you setup multi-nodes of 8 H100 instances for VLLM? I followed VLLM distributed inference document to setup two nodes of 8H100 and I am able to see following with ray status.

Active:
 1 node_62bc8c92be4ee6912d3ac7sfsff5db8acf209daa9e
 1 node_ed25244254634eb76cfsfdfc7db4cf366b4a86c9b
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/384.0 CPU
 0.0/16.0 GPU
 0B/3.88TiB memory
 0B/19.46GiB object_store_memory

Demands:
 (no resource demands)

Then I try to deploy the VLLM engine by following code

MODEL_ID=/mylocal/DeepSeek/DeepSeek-R1, 
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
 python -m vllm.entrypoints.openai.api_server \
 --model $MODEL_ID \
 --port 8002 \
 --tensor-parallel-size 16 \
 --max-model-len 20000 \
 --trust-remote-code \
 --distributed-executor-backend ray

But I kept seeing

Started a local Ray instance. View the dashboard at 127.0.0.1:8266 
WARNING 02-02 19:54:20 ray_utils.py:315] The number of required GPUs exceeds the total number of available GPUs in the placement group.
INFO 02-02 19:54:30 ray_utils.py:212] Waiting for creating a placement group of specs for 10 seconds. specs=[{'GPU': 1.0, 'node:172.31.42.4': 0.001}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}]. Check `ray status` to see if you have enough resources.

Any instruction will be much appreciated! CC: @pseudotensor @teknium1

@Neo9061 Hello, I think the CUDA_VISIBLE_DEVICES environment variable is misconfigured. CUDA_VISIBLE_DEVICES should only list the GPUs of the current node. For example, with two 8x H100 machines, the CUDA_VISIBLE_DEVICES environment variable on both machines should be 0,1,2,3,4,5,6,7. This is how I set it; you can try it.
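
To make that concrete, a minimal two-node sketch (assuming the head node's IP is 10.0.0.1; the IP and port are placeholders, and the launch flags mirror the command quoted above):

# on the head node (its 8 local GPUs)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ray start --head --port=6379

# on the second node (its 8 local GPUs)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ray start --address=10.0.0.1:6379

# then, on the head node only
python -m vllm.entrypoints.openai.api_server \
  --model /mylocal/DeepSeek/DeepSeek-R1 \
  --port 8002 \
  --tensor-parallel-size 16 \
  --max-model-len 20000 \
  --trust-remote-code \
  --distributed-executor-backend ray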

@tjchuangplus

May I ask if anyone has used 4 sets of 8 * H100 for inference services and if VLLM can run successfully?

Yes. It's incredibly slow though. Like 6 token/s.

May I ask whether this is using FP8 precision or BF16 precision? It seems that four H800 machines with FP8 precision cannot run it.

@mphilippnv

May I ask if anyone has used 4 sets of 8 * H100 for inference services and if VLLM can run successfully?

Yes. It's incredibly slow though. Like 6 token/s.

May I ask if this is using FP8 precision or BF16 precision? It seems that four H800 machines with FP8 precision cannot run

I think it was bf16. It was whatever the standard settings are.

@tjchuangplus

May I ask if anyone has used 4 sets of 8 * H100 for inference services and if VLLM can run successfully?

Yes. It's incredibly slow though. Like 6 token/s.

May I ask if this is using FP8 precision or BF16 precision? It seems that four H800 machines with FP8 precision cannot run

I think it was bf16. It was whatever the standard settings are.

Thank you, yes, I have also verified that BF16 is feasible, but FP8 cannot run smoothly, possibly because the required vLLM parallel strategy is not yet supported.

@aftersnow

Thank you @simon-mo ! Do you have plans to support sequence parallelism?

@javading

I've had this problem, too. Is there a solution?

Was getting this error- got resolved by removing cpu offloading... hoping for an explanation.
Also, any suggestions to increase token throughput & context length. We're stuck at 6 tokens/second, max 10k context length despite 1600GB VRAM. I am currently running with tensor+pipeline parallelism on 5 Nodes (4x A100 80GB each). The vm are without Infiniband.
Would having Infiniband (i.e. higher inter-node bandwidth & lower latency) be the main solution to increase token throughput? And for context length > 40k, how much more VRAM would be required..?

Hi @ishaandatta could you share which model version are you using? I'm getting errors complaining fp8e4nv data type is not supported on CUDA arch < 89 when loading the model on A100 GPUs. Or maybe you are on the bf16 version? https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16/tree/main. Thanks

Hi, I've had this problem too. Is there a solution?

@tjchuangplus

After vLLM has been running for a while, it can hit an error indicating insufficient CUDA memory. Has anyone encountered this before? How should we handle it?

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.28 GiB. GPU 0 has a total capacity of 95.22 GiB of which 3.11 GiB is free. Including non-PyTorch memory, this process has 92.10 GiB memory in use. Of the allocated memory 73.21 GiB is allocated by PyTorch, with 82.00 MiB allocated in private pools (e.g., CUDA Graphs), and 2.87 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/root/anaconda3/envs/deepseek/lib/python3.12/site-packages/vllm/engine/async_llm_engine.py", line 70, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
CRITICAL 02-16 02:55:40 launcher.py:74] AsyncLLMEngine has failed, terminating server process

@ztxdcyy

ztxdcyy commented Feb 18, 2025

After using VLLM for a period of time, there may be an error message indicating insufficient CUDA memory. Have you encountered this before? How should we handle it?

` torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.28 GiB. GPU 0 has a total capacity of 95.22 GiB of which 3.11 GiB is free. Including non-PyTorch memory, this process has 92.10 GiB memory in use. Of the allocated memory 73.21 GiB is allocated by PyTorch, with 82.00 MiB allocated in private pools (e.g., CUDA Graphs), and 2.87 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run File "/root/anaconda3/envs/deepseek/lib/python3.12/site-packages/vllm/engine/async_llm_engine.py", line 70, in _log_task_completion raise AsyncEngineDeadError( vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause. CRITICAL 02-16 02:55:40 launcher.py:74] AsyncLLMEngine has failed, terminating server process`

Try --enforce-eager, maybe that helps.
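
Putting the error message's own suggestion and the advice in this thread together, a sketch of the usual mitigations (values are illustrative, not tuned):

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True   # reduce fragmentation, per the error message

# and on the vLLM launch command:
#   --gpu-memory-utilization 0.90   (leave more headroom than 0.95)
#   --enforce-eager                 (skip CUDA graph capture, trading some speed for memory)
#   --max-model-len <smaller value> (shrinks the KV cache that must be reserved)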
