
the normal generation throughput reference #24

Closed
ltm920716 opened this issue Dec 29, 2024 · 4 comments

@ltm920716

hi,
I deployed the V3 model with vllm on 8×H200 (tp=8), and the generation throughput is around 10 tokens/s, which I think is somewhat slow. Could you give me a reference for the normal generation throughput, or some way to improve it? Thanks!
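For context, a minimal sketch of the kind of setup described above, using vLLM's offline API to measure decode throughput. The model name, prompts, and sampling settings here are illustrative assumptions, not taken from the issue:

```python
import time

from vllm import LLM, SamplingParams

# Assumed setup: DeepSeek-V3 with tensor parallelism across 8 GPUs.
llm = LLM(
    model="deepseek-ai/DeepSeek-V3",
    tensor_parallel_size=8,
    trust_remote_code=True,
)
params = SamplingParams(temperature=0.0, max_tokens=256)

# A small batch of identical prompts, purely for a rough tokens/s number.
prompts = ["Explain tensor parallelism in one paragraph."] * 8
start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"generation throughput: {generated / elapsed:.1f} tokens/s")
```

Note that this counts generated tokens across the whole batch, so the number it prints is aggregate throughput rather than per-request decode speed.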

@GeeeekExplorer
Contributor

GeeeekExplorer commented Dec 31, 2024

In fact, the demo is only meant to show the model structure. It is mainly limited by the CPU, and with the torch profiler you can observe that kernel launching is very slow. For faster speeds you can try sglang or vllm.
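A minimal, self-contained sketch of the kind of torch profiler check mentioned above; the stack of Linear layers is a stand-in for the demo model's forward pass, not the actual model:

```python
import torch
from torch.profiler import ProfilerActivity, profile

# Placeholder workload: many small layers so per-kernel launch overhead is visible.
model = torch.nn.Sequential(*(torch.nn.Linear(1024, 1024) for _ in range(32))).cuda()
x = torch.randn(1, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        _ = model(x)

# If host-side CPU time far exceeds the CUDA kernel time for the same
# operators, the run is launch-bound (CPU-limited) rather than compute-bound.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))
```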

@ltm920716
Author


Actually, I do deploy the model with vllm/vllm-openai:v0.6.6, so I want to know how to improve the throughput.

@GeeeekExplorer
Contributor

I think cudagraph would be helpful if supported.
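A toy sketch of what CUDA graph capture looks like in plain PyTorch (the shapes and the single matmul are placeholders for a decode step); replaying a captured graph removes most of the per-kernel host launch overhead discussed above. As an aside, vLLM's own CUDA graph path is controlled by its enforce_eager option:

```python
import torch

# Placeholder workload: one matmul standing in for a decode step.
static_x = torch.randn(16, 4096, device="cuda")
weight = torch.randn(4096, 4096, device="cuda")

# Warm up on a side stream before capture, as the PyTorch CUDA graph docs suggest.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_y = static_x @ weight
torch.cuda.current_stream().wait_stream(s)

# Capture the workload into a graph once...
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_y = static_x @ weight

# ...then replay it after copying new data into the static input buffer,
# so the whole step launches as a single graph instead of many kernels.
static_x.copy_(torch.randn(16, 4096, device="cuda"))
g.replay()
print(static_y.shape)
```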

@mowentian
Contributor

Please refer to the vLLM work on DeepSeek-V3 enhancements: vllm-project/vllm#11539
