
optimize custom allreduce kernel #2904

Merged · 1 commit into main · Jan 15, 2025

Conversation


@yizhang2077 yizhang2077 commented Jan 15, 2025

Motivation

Optimize the performance of the custom allreduce kernel.

Modifications

  1. Support copy-input mode (a rough kernel sketch follows this list).
  2. Adopt vLLM's CUDA-graph buffer registration to save a data copy from the input tensor to the IPC buffer.
  3. Optimize the performance of some TensorRT-LLM kernels.
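
For intuition, here is a minimal sketch of what copy-input mode (item 1) does inside the kernel. This is illustrative only, not the kernel shipped in this PR: the names, the float-only path, and the omitted cross-rank barrier are all assumptions.

```cuda
// Illustrative copy-input one-shot allreduce (hypothetical names; a real
// kernel also needs a system-scope cross-rank barrier between the stages).
template <int kWorldSize>
__global__ void one_shot_allreduce_copy_input(
    const float* __restrict__ local_input,  // this rank's input tensor
    float** ipc_buffers,   // kWorldSize staging buffers, visible to all peers
    float* __restrict__ output, int rank, size_t n) {
  const size_t start = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
  const size_t stride = (size_t)gridDim.x * blockDim.x;

  // Stage 1: copy the local input into this rank's IPC buffer inside the
  // kernel, replacing a separate host-launched memcpy.
  for (size_t i = start; i < n; i += stride)
    ipc_buffers[rank][i] = local_input[i];

  // (cross-rank barrier goes here: no rank may read a peer's staging buffer
  // before that peer has finished writing it)

  // Stage 2: one-shot reduce -- every rank sums all staged buffers.
  for (size_t i = start; i < n; i += stride) {
    float acc = 0.f;
#pragma unroll
    for (int r = 0; r < kWorldSize; ++r) acc += ipc_buffers[r][i];
    output[i] = acc;
  }
}
```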

Correctness test

Unit tests:

    python3 tests/test_trt_reduce.py
    python3 test/srt/test_custom_allreduce.py

End-to-end test:

    python3 python/sglang/llama3_eval.py --model-size 8b --provider sgl --task mmlu

Unit test performance on H100 (without CUDA graph)

    python3 tests/test_trt_reduce.py

TL;DR: with CUDA graph disabled, the custom allreduce is faster than vLLM's, because copy-input mode saves a kernel launch (sketched below).
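
The launch-count difference behind that claim, as a hedged host-side sketch; the `launch_*` wrappers are hypothetical, not SGLang APIs:

```cuda
#include <cuda_runtime.h>

// Hypothetical launch wrappers for the two kernel variants.
void launch_allreduce(float* ipc_buffer, float* output, cudaStream_t stream);
void launch_allreduce_copy_input(const float* input, float* ipc_buffer,
                                 float* output, cudaStream_t stream);

void all_reduce_without_copy_input(const float* input, float* ipc_buffer,
                                   float* output, size_t nbytes,
                                   cudaStream_t stream) {
  // Two launches per allreduce: a staging copy, then the reduce kernel.
  cudaMemcpyAsync(ipc_buffer, input, nbytes, cudaMemcpyDeviceToDevice, stream);
  launch_allreduce(ipc_buffer, output, stream);
}

void all_reduce_with_copy_input(const float* input, float* ipc_buffer,
                                float* output, cudaStream_t stream) {
  // One launch: the kernel stages the input and reduces in the same call.
  launch_allreduce_copy_input(input, ipc_buffer, output, stream);
}
```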

vLLM

| tp size | data size | latency (ms) |
|---------|-----------|--------------|
| 2/4/8   | 512       | 0.0095 / 0.0091 / 0.0101 |
| 2/4/8   | 4096      | 0.0088 / 0.0085 / 0.0093 |
| 2/4/8   | 32768     | 0.0087 / 0.0092 / 0.0118 |
| 2/4/8   | 262144    | 0.0129 / 0.0206 / 0.0241 |
| 2/4/8   | 524288    | 0.0177 / 0.0269 / 0.0304 |
| 2/4/8   | 1048576   | 0.0266 / 0.0402 / 0.0450 |

Custom allreduce

| tp size | data size | latency (ms) |
|---------|-----------|--------------|
| 2/4/8   | 512       | 0.0037 / 0.0048 / 0.0059 |
| 2/4/8   | 4096      | 0.0039 / 0.0037 / 0.0057 |
| 2/4/8   | 32768     | 0.0045 / 0.0048 / 0.0078 |
| 2/4/8   | 262144    | 0.0079 / 0.0139 / 0.0167 |
| 2/4/8   | 524288    | 0.0125 / 0.0202 / 0.0221 |
| 2/4/8   | 1048576   | 0.0199 / 0.0311 / 0.0338 |

End-to-end performance on H100

    python -m sglang.bench_one_batch --model-path Meta-Llama-3.1-8B-Instruct --batch 1 2 4 8 16 32 64 128 256 512 --input-len 128 --output-len 1024 --run-name test_run --tp x

TL;DR: the custom allreduce is faster than vLLM in nearly all cases (the margin is small), most clearly at batch sizes 256 and 512.

vLLM result

| tp size | batch size | median decode latency (s) |
|---------|------------|-----------------------------|
| 2/4/8   | 1          | 0.00525 / 0.00392 / 0.00333 |
| 2/4/8   | 2          | 0.00545 / 0.00419 / 0.00358 |
| 2/4/8   | 4          | 0.00549 / 0.00419 / 0.00369 |
| 2/4/8   | 8          | 0.00557 / 0.00431 / 0.00378 |
| 2/4/8   | 16         | 0.00573 / 0.00443 / 0.00395 |
| 2/4/8   | 32         | 0.00608 / 0.00483 / 0.00458 |
| 2/4/8   | 64         | 0.00678 / 0.00551 / 0.00481 |
| 2/4/8   | 128        | 0.00894 / 0.00641 / 0.00553 |
| 2/4/8   | 256        | 0.02143 / 0.02143 / 0.02123 |
| 2/4/8   | 512        | 0.02226 / 0.02176 / 0.02161 |

Custom allreduce result

| tp size | batch size | median decode latency (s) |
|---------|------------|-----------------------------|
| 2/4/8   | 1          | 0.00514 / 0.00385 / 0.00333 |
| 2/4/8   | 2          | 0.00535 / 0.00412 / 0.00358 |
| 2/4/8   | 4          | 0.00537 / 0.00416 / 0.00367 |
| 2/4/8   | 8          | 0.00544 / 0.00423 / 0.00373 |
| 2/4/8   | 16         | 0.00561 / 0.00431 / 0.00387 |
| 2/4/8   | 32         | 0.00595 / 0.00471 / 0.00404 |
| 2/4/8   | 64         | 0.00663 / 0.00510 / 0.00468 |
| 2/4/8   | 128        | 0.00879 / 0.00634 / 0.00554 |
| 2/4/8   | 256        | 0.02051 / 0.02028 / 0.02023 |
| 2/4/8   | 512        | 0.02140 / 0.02080 / 0.02048 |
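
Modification 2 (registering the CUDA-graph buffers, following vLLM's approach) is what removes the input copy from the graph-replay path in runs like these. Below is a rough sketch of the IPC exchange it relies on; only the CUDA runtime calls are real, every other name is illustrative:

```cuda
#include <cuda_runtime.h>

// `recorded_input_ptr`, `peer_handle`, and the exchange step are assumptions
// made for illustration; they are not the PR's actual code.
void register_graph_buffer_sketch(void* recorded_input_ptr,
                                  cudaIpcMemHandle_t peer_handle) {
  // Export an IPC handle for the input pointer recorded during graph capture.
  cudaIpcMemHandle_t local_handle;
  cudaIpcGetMemHandle(&local_handle, recorded_input_ptr);

  // ... all-gather the handles across ranks out-of-band (e.g. via a CPU
  // process group), yielding `peer_handle` for each remote rank ...

  // Open the peer's handle so its input can be read directly on graph replay,
  // skipping the input -> IPC-buffer copy.
  void* peer_input = nullptr;
  cudaIpcOpenMemHandle(&peer_input, peer_handle, cudaIpcMemLazyEnablePeerAccess);

  // Note: IPC handles refer to whole allocations, so a real implementation
  // must also exchange each pointer's offset from its allocation base.
}
```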

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@yizhang2077 yizhang2077 changed the title support register graph buffer for custom allreduce, support copy input in kernel, fix performance issue in kernel optimize custom allreduce kernel Jan 15, 2025
@zhyncs zhyncs merged commit 6cb3974 into main Jan 15, 2025
3 checks passed
@zhyncs zhyncs deleted the custom-allreduce-support-cuda-graph branch January 15, 2025 19:04