
optimize custom allreduce kernel #2904

Merged · 1 commit into main · Jan 15, 2025

Conversation


@yizhang2077 yizhang2077 commented Jan 15, 2025

Motivation

Optimize the performance of the custom allreduce kernel.

Modifications

  1. Support copy-input mode (a rough kernel sketch follows this list).
  2. Adopt vLLM's CUDA-graph buffer registration to save a data copy from the input tensor to the IPC buffer.
  3. Optimize the performance of some TensorRT-LLM kernels.
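
For intuition, here is a minimal sketch of what copy-input mode (item 1) does inside the kernel. This is illustrative only, not the kernel shipped in this PR: the names, the float-only path, and the omitted cross-rank barrier are all assumptions.

```cuda
// Illustrative copy-input one-shot allreduce (hypothetical names; a real
// kernel also needs a system-scope cross-rank barrier between the stages).
template <int kWorldSize>
__global__ void one_shot_allreduce_copy_input(
    const float* __restrict__ local_input,  // this rank's input tensor
    float** ipc_buffers,   // kWorldSize staging buffers, visible to all peers
    float* __restrict__ output, int rank, size_t n) {
  const size_t start = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
  const size_t stride = (size_t)gridDim.x * blockDim.x;

  // Stage 1: copy the local input into this rank's IPC buffer inside the
  // kernel, replacing a separate host-launched memcpy.
  for (size_t i = start; i < n; i += stride)
    ipc_buffers[rank][i] = local_input[i];

  // (cross-rank barrier goes here: no rank may read a peer's staging buffer
  // before that peer has finished writing it)

  // Stage 2: one-shot reduce -- every rank sums all staged buffers.
  for (size_t i = start; i < n; i += stride) {
    float acc = 0.f;
#pragma unroll
    for (int r = 0; r < kWorldSize; ++r) acc += ipc_buffers[r][i];
    output[i] = acc;
  }
}
```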

Correctness test

Unit tests:

    python3 tests/test_trt_reduce.py
    python3 test/srt/test_custom_allreduce.py

End-to-end test:

    python3 python/sglang/llama3_eval.py --model-size 8b --provider sgl --task mmlu

Unit test performance on H100 (without CUDA graph)

    python3 tests/test_trt_reduce.py

TL;DR: with CUDA graph disabled, the custom allreduce is faster than vLLM's, because copy-input mode saves a kernel launch (sketched below).
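
The launch-count difference behind that claim, as a hedged host-side sketch; the `launch_*` wrappers are hypothetical, not SGLang APIs:

```cuda
#include <cuda_runtime.h>

// Hypothetical launch wrappers for the two kernel variants.
void launch_allreduce(float* ipc_buffer, float* output, cudaStream_t stream);
void launch_allreduce_copy_input(const float* input, float* ipc_buffer,
                                 float* output, cudaStream_t stream);

void all_reduce_without_copy_input(const float* input, float* ipc_buffer,
                                   float* output, size_t nbytes,
                                   cudaStream_t stream) {
  // Two launches per allreduce: a staging copy, then the reduce kernel.
  cudaMemcpyAsync(ipc_buffer, input, nbytes, cudaMemcpyDeviceToDevice, stream);
  launch_allreduce(ipc_buffer, output, stream);
}

void all_reduce_with_copy_input(const float* input, float* ipc_buffer,
                                float* output, cudaStream_t stream) {
  // One launch: the kernel stages the input and reduces in the same call.
  launch_allreduce_copy_input(input, ipc_buffer, output, stream);
}
```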

vLLM

| tp size | data size | latency (ms) |
|---------|-----------|--------------|
| 2/4/8   | 512       | 0.0095 / 0.0091 / 0.0101 |
| 2/4/8   | 4096      | 0.0088 / 0.0085 / 0.0093 |
| 2/4/8   | 32768     | 0.0087 / 0.0092 / 0.0118 |
| 2/4/8   | 262144    | 0.0129 / 0.0206 / 0.0241 |
| 2/4/8   | 524288    | 0.0177 / 0.0269 / 0.0304 |
| 2/4/8   | 1048576   | 0.0266 / 0.0402 / 0.0450 |

Custom allreduce

| tp size | data size | latency (ms) |
|---------|-----------|--------------|
| 2/4/8   | 512       | 0.0037 / 0.0048 / 0.0059 |
| 2/4/8   | 4096      | 0.0039 / 0.0037 / 0.0057 |
| 2/4/8   | 32768     | 0.0045 / 0.0048 / 0.0078 |
| 2/4/8   | 262144    | 0.0079 / 0.0139 / 0.0167 |
| 2/4/8   | 524288    | 0.0125 / 0.0202 / 0.0221 |
| 2/4/8   | 1048576   | 0.0199 / 0.0311 / 0.0338 |

End-to-end performance on H100

    python -m sglang.bench_one_batch --model-path Meta-Llama-3.1-8B-Instruct --batch 1 2 4 8 16 32 64 128 256 512 --input-len 128 --output-len 1024 --run-name test_run --tp x

TL;DR: the custom allreduce is faster than vLLM in nearly all cases (the margin is small), most clearly at batch sizes 256 and 512.

vLLM result

| tp size | batch size | median decode latency (s) |
|---------|------------|-----------------------------|
| 2/4/8   | 1          | 0.00525 / 0.00392 / 0.00333 |
| 2/4/8   | 2          | 0.00545 / 0.00419 / 0.00358 |
| 2/4/8   | 4          | 0.00549 / 0.00419 / 0.00369 |
| 2/4/8   | 8          | 0.00557 / 0.00431 / 0.00378 |
| 2/4/8   | 16         | 0.00573 / 0.00443 / 0.00395 |
| 2/4/8   | 32         | 0.00608 / 0.00483 / 0.00458 |
| 2/4/8   | 64         | 0.00678 / 0.00551 / 0.00481 |
| 2/4/8   | 128        | 0.00894 / 0.00641 / 0.00553 |
| 2/4/8   | 256        | 0.02143 / 0.02143 / 0.02123 |
| 2/4/8   | 512        | 0.02226 / 0.02176 / 0.02161 |

Custom allreduce result

| tp size | batch size | median decode latency (s) |
|---------|------------|-----------------------------|
| 2/4/8   | 1          | 0.00514 / 0.00385 / 0.00333 |
| 2/4/8   | 2          | 0.00535 / 0.00412 / 0.00358 |
| 2/4/8   | 4          | 0.00537 / 0.00416 / 0.00367 |
| 2/4/8   | 8          | 0.00544 / 0.00423 / 0.00373 |
| 2/4/8   | 16         | 0.00561 / 0.00431 / 0.00387 |
| 2/4/8   | 32         | 0.00595 / 0.00471 / 0.00404 |
| 2/4/8   | 64         | 0.00663 / 0.00510 / 0.00468 |
| 2/4/8   | 128        | 0.00879 / 0.00634 / 0.00554 |
| 2/4/8   | 256        | 0.02051 / 0.02028 / 0.02023 |
| 2/4/8   | 512        | 0.02140 / 0.02080 / 0.02048 |
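
Modification 2 (registering the CUDA-graph buffers, following vLLM's approach) is what removes the input copy from the graph-replay path in runs like these. Below is a rough sketch of the IPC exchange it relies on; only the CUDA runtime calls are real, every other name is illustrative:

```cuda
#include <cuda_runtime.h>

// `recorded_input_ptr`, `peer_handle`, and the exchange step are assumptions
// made for illustration; they are not the PR's actual code.
void register_graph_buffer_sketch(void* recorded_input_ptr,
                                  cudaIpcMemHandle_t peer_handle) {
  // Export an IPC handle for the input pointer recorded during graph capture.
  cudaIpcMemHandle_t local_handle;
  cudaIpcGetMemHandle(&local_handle, recorded_input_ptr);

  // ... all-gather the handles across ranks out-of-band (e.g. via a CPU
  // process group), yielding `peer_handle` for each remote rank ...

  // Open the peer's handle so its input can be read directly on graph replay,
  // skipping the input -> IPC-buffer copy.
  void* peer_input = nullptr;
  cudaIpcOpenMemHandle(&peer_input, peer_handle, cudaIpcMemLazyEnablePeerAccess);

  // Note: IPC handles refer to whole allocations, so a real implementation
  // must also exchange each pointer's offset from its allocation base.
}
```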

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@yizhang2077 yizhang2077 changed the title support register graph buffer for custom allreduce, support copy input in kernel, fix performance issue in kernel optimize custom allreduce kernel Jan 15, 2025
@zhyncs zhyncs merged commit 6cb3974 into main Jan 15, 2025
3 checks passed
@zhyncs zhyncs deleted the custom-allreduce-support-cuda-graph branch January 15, 2025 19:04