sgl-kernel adapt tensorrt llm custom allreduce #2481
Conversation
@yizhang2077 Could we split this into two PRs? One for the sgl-kernel, which requires a new version release, and another for updating the Python package replacement and dependencies.
OK
BTW, we might also update this: https://github.com/sgl-project/sglang/pull/2483/files#diff-b2b7e1471c20bf33dea3e63ed580e07dd360668f51ccd6c3347a031072651645R21
There are some strange logs raised by Ray, and I don't know how to silence them... But it seems they won't affect the test.
The test at size 4096 is worse than vLLM; the other cases are better.
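To compare timings fairly across message sizes and world sizes, it helps to normalize elapsed time into "bus bandwidth" using the ring all-reduce formula from nccl-tests. The sketch below is a hypothetical helper (not code from this PR); the example numbers (8192 bytes, 10 µs, 8 GPUs) are made up for illustration.

```python
# Hypothetical helper for normalizing all-reduce timings across message
# sizes, using the nccl-tests convention:
#   algbw = bytes / time
#   busbw = algbw * 2 * (n - 1) / n   (ring all-reduce)
def allreduce_bus_bandwidth_gbps(nbytes: int, seconds: float, world_size: int) -> float:
    algbw = nbytes / seconds                         # bytes/s per rank
    busbw = algbw * 2 * (world_size - 1) / world_size
    return busbw / 1e9                               # GB/s

# Example with made-up numbers: an 8192-byte tensor reduced in 10 µs on 8 GPUs.
print(round(allreduce_bus_bandwidth_gbps(8192, 10e-6, 8), 3))  # → 1.434
```

Plotting busbw (rather than raw latency) over a size sweep usually makes it obvious whether a regression like the 4096 case is a fixed-overhead issue or a bandwidth issue.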
Hi @HaiShaw ^ If we replace SGLang's current custom all-reduce implementation with the TRT-LLM custom all-reduce implementation from sgl-kernel, will it affect ROCm?
The third one has been fixed with #2487. We can add more unit tests to identify why it's slower in some cases. cc @yizhang2077
For the second, I think the most likely reason is that multi_gpu_barrier in TRT-LLM is more coarse-grained than vLLM's, while vLLM needs to use two barriers. TRT-LLM also provides PUSH_MODE, which uses a fine-grained barrier (though it needs an additional copy to the shared buffer), and we can try it.
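To make the two-barrier pattern concrete, here is an illustrative CPU-side simulation using Python threads, not the actual CUDA kernels; in the real kernels the "barriers" are GPU-side flag exchanges, and the function names here are hypothetical. The first barrier ensures every rank has published its input before any peer reads it; the second ensures no rank reuses its buffer while a peer may still be reading.

```python
# Illustrative sketch only: simulating the vLLM-style two-barrier
# all-reduce with Python threads and threading.Barrier.
import threading

WORLD_SIZE = 4

def two_barrier_allreduce(rank, inputs, outputs, barrier1, barrier2):
    # Barrier 1: wait until every rank has published its input buffer.
    barrier1.wait()
    # Each rank reads all peers' inputs and reduces them locally.
    outputs[rank] = sum(inputs)
    # Barrier 2: wait until all peers are done reading before any rank
    # is allowed to overwrite its input buffer again.
    barrier2.wait()

def run():
    inputs = [float(r + 1) for r in range(WORLD_SIZE)]  # rank r holds r+1
    outputs = [0.0] * WORLD_SIZE
    b1 = threading.Barrier(WORLD_SIZE)
    b2 = threading.Barrier(WORLD_SIZE)
    threads = [
        threading.Thread(target=two_barrier_allreduce,
                         args=(r, inputs, outputs, b1, b2))
        for r in range(WORLD_SIZE)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outputs

print(run())  # → [10.0, 10.0, 10.0, 10.0]
```

A single coarser barrier removes one synchronization point but forces ranks to hold their buffers longer; a push-mode variant trades that for an extra copy into a shared staging buffer, which is the tradeoff discussed above.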
Motivation
Modifications
Checklist
Next