## Motivation

Optimize the custom allreduce kernel.
## Modifications
### Correctness test

Unit tests:

```bash
python3 tests/test_trt_reduce.py
python3 test/srt/test_custom_allreduce.py
```
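A minimal sketch of the kind of correctness check these tests run, to be launched under `torchrun`. The `custom_all_reduce` callable is a placeholder for the kernel under test, not the actual API:

```python
# Hypothetical correctness check: the custom kernel's output must match
# NCCL's torch.distributed.all_reduce on random fp16 input.
import os
import torch
import torch.distributed as dist

def check_correctness(custom_all_reduce, numel=1 << 20):
    rank = int(os.environ["RANK"])
    torch.cuda.set_device(rank)
    dist.init_process_group(backend="nccl")

    inp = torch.randn(numel, dtype=torch.float16, device="cuda")
    ref = inp.clone()
    dist.all_reduce(ref)            # NCCL reference result

    out = custom_all_reduce(inp)    # placeholder: kernel under test
    torch.testing.assert_close(out, ref, rtol=1e-2, atol=1e-2)

    dist.destroy_process_group()
```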
End-to-end test:

```bash
python3 python/sglang/llama3_eval.py --model-size 8b --provider sgl --task mmlu
```
### Unit test performance on H100 (without CUDA graph)

```bash
python3 tests/test_trt_reduce.py
```
TL;DR: the custom allreduce is faster than vLLM's when CUDA graph is disabled, since the copy-input mode saves kernel launch time.
vllm results: (figure)

custom allreduce results: (figure)
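For reference, a minimal timing sketch of how such per-call latencies can be measured (not the actual benchmark code in `test_trt_reduce.py`); `allreduce_fn` is a placeholder:

```python
# Hypothetical micro-benchmark: mean per-call latency of an allreduce.
# Without CUDA graphs, kernel-launch overhead is paid on every call,
# which is where the copy-input mode saves time.
import torch

def time_allreduce(allreduce_fn, inp, iters=100, warmup=10):
    for _ in range(warmup):
        allreduce_fn(inp)           # warm up allocator and kernels
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        allreduce_fn(inp)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # mean latency in ms
```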
### End-to-end performance on H100

```bash
python -m sglang.bench_one_batch --model-path Meta-Llama-3.1-8B-Instruct --batch 1 2 4 8 16 32 64 128 256 512 --input-len 128 --output-len 1024 --run-name test_run --tp x
```
TL;DR: the custom allreduce is faster than vLLM in almost all cases (though the margin is small), especially at batch sizes 256 and 512.
vllm results: (figure)

custom allreduce results: (figure)
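For context on the CUDA-graph distinction above: once the decode step is captured in a CUDA graph, per-call launch overhead is amortized away, so the copy-input mode's launch-time savings matter less in graph mode than in eager mode. A minimal capture sketch, with `allreduce_fn` again a placeholder (real collectives need graph-capture-safe setup):

```python
# Hypothetical sketch: replaying an allreduce from a CUDA graph removes
# per-call launch overhead, which narrows the eager-mode advantage.
import torch

def build_graph(allreduce_fn, numel=1 << 20):
    static_inp = torch.randn(numel, dtype=torch.float16, device="cuda")
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):   # capture one allreduce call
        allreduce_fn(static_inp)
    return graph, static_inp

# graph.replay() re-runs the captured kernels with no Python or launch
# overhead; only static_inp's contents may change between replays.
```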
## Checklist