the normal generation throughput reference #24
Comments
In fact, the demo is only meant to show the model structure. It is mainly CPU-bound, and with the torch profiler you can observe that kernel launching is very slow. For faster speed you can try sglang or vllm.
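If you want to confirm that CPU-side kernel launch overhead is the bottleneck, a minimal profiling sketch like the one below can help. The small placeholder model and prompt are assumptions for illustration, not the repo's actual demo code; substitute whatever you are running.

```python
import torch
from torch.profiler import profile, ProfilerActivity
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model just to demonstrate the profiling pattern; swap in the
# DeepSeek-V3 demo model you are actually running.
name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).cuda().eval()
input_ids = tok("Hello, world", return_tensors="pt").input_ids.cuda()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        model.generate(input_ids, max_new_tokens=32)

# If cudaLaunchKernel / CPU-side ops dominate over the GPU kernels themselves,
# generation is launch-bound rather than compute-bound.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))
```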
Actually, I do deploy the model with vllm/vllm-openai:v0.6.6, so I want to know how to improve the throughput.
I think CUDA graphs would be helpful if supported.
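In vLLM's offline API, CUDA graph capture for decode steps is used unless you pass `enforce_eager=True`. A minimal sketch is below; the model id, sampling settings, and batch are assumptions, not a tested configuration.

```python
from vllm import LLM, SamplingParams

# enforce_eager=False (the default) lets vLLM capture CUDA graphs for the
# decode steps, reducing per-step CPU-side kernel launch overhead.
llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # assumed model id; use your local path
    tensor_parallel_size=8,
    trust_remote_code=True,
    enforce_eager=False,
)

params = SamplingParams(temperature=0.7, max_tokens=128)
print(llm.generate(["Hello, my name is"], params)[0].outputs[0].text)
```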
Please refer to the vLLM work on DeepSeek-V3 enhancements: vllm-project/vllm#11539
Hi,
I deploy the V3 model with vLLM on 8*H200 (tp=8), and the generation throughput is around 10 tokens/s. I think this is somewhat slow, so could you give me a reference for the normal generation throughput, or some method to improve it? Thanks!
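For reference, one way to get a comparable tokens/s number is to time a batch of offline generations directly in vLLM. This is a rough sketch under assumed prompts, batch size, and model id; it reports measured throughput on your own hardware rather than any expected value.

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # assumed model id; use your local path
    tensor_parallel_size=8,
    trust_remote_code=True,
)
params = SamplingParams(temperature=0.0, max_tokens=256)
prompts = ["Explain the attention mechanism in transformers."] * 8  # batch size assumed

start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start

# Decode-side throughput: generated tokens per wall-clock second.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tokens/s")
```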