[Feature, Hardware] Enable SGLang on XPU GPUs via PyTorch #1480
Conversation
Co-authored-by: Lianmin Zheng <[email protected]>
@liangan1 Thanks for the contribution. Could you fix the unit tests https://github.com/sgl-project/sglang/tree/main/test?
Sure. I will work on it and let you know when all UTs pass.
It is almost there! There are only a few remaining issues for multi-GPU test cases.
We will push a big refactor soon, starting from #1534. To prevent too many conflicts, it is better to merge this PR soon or split it into multiple smaller ones.
Sorry, I don't have enough GPUs to reproduce these tensor-parallel-related UTs. Do you have any comments about this issue? According to the UT logs, the distributed backend and model init have finished, and the timeout occurs during CUDA graph initialization, but I did not change anything in that part.
Force-pushed from d28c647 to 3bddad3.
Splitting this PR into smaller ones.
PyTorch has supported the XPU device since the 2.4 release, and XPU is also supported in OpenAI Triton, so it should work with the Triton attention backend in SGLang. In this PR, we add the 'xpu' device to SGLang.
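For context, the snippet below is a minimal, hypothetical sketch of how a device string such as 'xpu' can be resolved with plain PyTorch APIs; the helper name `resolve_device` and the fallback order are illustrative assumptions, not SGLang's actual device-handling code. PyTorch exposes `torch.xpu.is_available()` starting with the 2.4 release.

```python
import torch


def resolve_device(device: str = "auto") -> torch.device:
    """Hypothetical helper: pick the torch device for model execution."""
    if device == "auto":
        if torch.cuda.is_available():
            return torch.device("cuda")
        # torch.xpu exists in PyTorch >= 2.4 when built with XPU support.
        if hasattr(torch, "xpu") and torch.xpu.is_available():
            return torch.device("xpu")
        return torch.device("cpu")
    # Explicit request, e.g. "cuda", "xpu", or "cpu".
    return torch.device(device)
```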
Blocking issue: vLLM is only compatible with torch 2.3 for now, and binary wheel support for torch XPU should be available starting with torch 2.5 (expected in Oct 2024), so we need to wait for vLLM to become compatible with torch 2.5.
Status
Both XPU & CUDA work with the latency benchmark.
Llama-2-7b works for the latency benchmark.
VLLM_TEST_COMPILE_NO_CUSTOM_OPS=1 python -m sglang.bench_latency --model-path ~/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-hf/snapshots/6fdf2e60f86ff2481f2241aaee459f85b5b0bbb9/ --device xxxx
Both XPU & CUDA generate the same outputs with launch_server.
python -m sglang.launch_server --model-path ~/models/llama7b/6fdf2e60f86ff2481f2241aaee459f85b5b0bbb9/ --port 30000 --device xxx
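To compare outputs between an XPU run and a CUDA run of the server, one option is to send the same deterministic request to each instance. The snippet below is a sketch that assumes the server's `/generate` endpoint on port 30000 from the command above; greedy decoding (temperature 0) keeps the outputs comparable across devices.

```python
import requests

# Send an identical greedy-decoding request to a running SGLang server.
resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0, "max_new_tokens": 16},
    },
)
print(resp.json())
```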
To do in other PRs:
Functionality
Performance