the normal generation throughput reference #24
Comments
In fact, the demo is only meant to show the model structure. It is mainly CPU-bound, and with the torch profiler you can observe that kernel launching is very slow. For faster speed you can try sglang or vllm.
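If you want to confirm that CPU-side kernel launch overhead is the bottleneck, a minimal profiling sketch like the one below can help. The small placeholder model and prompt are assumptions for illustration, not the repo's actual demo code; substitute whatever you are running.

```python
import torch
from torch.profiler import profile, ProfilerActivity
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model just to demonstrate the profiling pattern; swap in the
# DeepSeek-V3 demo model you are actually running.
name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).cuda().eval()
input_ids = tok("Hello, world", return_tensors="pt").input_ids.cuda()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        model.generate(input_ids, max_new_tokens=32)

# If cudaLaunchKernel / CPU-side ops dominate over the GPU kernels themselves,
# generation is launch-bound rather than compute-bound.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))
```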
Actually, I do deploy the model with vllm/vllm-openai:v0.6.6, so I want to know how to improve the throughput.
I think CUDA graphs would be helpful if supported.
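In vLLM's offline API, CUDA graph capture for decode steps is used unless you pass `enforce_eager=True`. A minimal sketch is below; the model id, sampling settings, and batch are assumptions, not a tested configuration.

```python
from vllm import LLM, SamplingParams

# enforce_eager=False (the default) lets vLLM capture CUDA graphs for the
# decode steps, reducing per-step CPU-side kernel launch overhead.
llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # assumed model id; use your local path
    tensor_parallel_size=8,
    trust_remote_code=True,
    enforce_eager=False,
)

params = SamplingParams(temperature=0.7, max_tokens=128)
print(llm.generate(["Hello, my name is"], params)[0].outputs[0].text)
```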
Please refer to the vLLM work on DeepSeek-V3 enhancements: vllm-project/vllm#11539
Hi,
I deploy the V3 model with vLLM on 8*H200 (tp=8), and the generation throughput is around 10 tokens/s. I think this is somewhat slow, so could you give me a reference for the normal generation throughput, or some method to improve it? Thanks!
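For reference, one way to get a comparable tokens/s number is to time a batch of offline generations directly in vLLM. This is a rough sketch under assumed prompts, batch size, and model id; it reports measured throughput on your own hardware rather than any expected value.

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # assumed model id; use your local path
    tensor_parallel_size=8,
    trust_remote_code=True,
)
params = SamplingParams(temperature=0.0, max_tokens=256)
prompts = ["Explain the attention mechanism in transformers."] * 8  # batch size assumed

start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start

# Decode-side throughput: generated tokens per wall-clock second.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tokens/s")
```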