[misc] Add LoRA kernel micro benchmarks #11579
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
@jeejeelee This PR adds some tooling for benchmarking LoRA kernels. It should be useful for further optimizing LoRA kernels and for #11234. Note that this PR emulates the … @mgoin fyi
benchmarks/kernels/benchmark_lora.py (Outdated)
                args.with_cuda_graph))
        seq_len_timers.append(
            bench_optype(_ctx, args.arg_pool_size, bench_op,
                         args.with_cuda_graph))
Perhaps we need to ensure the computed results match.
For expand-related operations with add_inputs=True, testing the benchmarking results for correctness is hard because the benchmarked function is run an indeterminate number of times.
I have added a test_correctness method to the BenchmarkTensor class that can be invoked with the CLI argument --test-correctness. Note that this tests for correctness before the benchmarking is run, which should give us enough confidence in the validity of the results.
What do you think?
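For illustration, a minimal sketch of what such a pre-benchmark correctness check could look like (the function name, signature, reference computation, and tolerances are assumptions for this sketch, not the PR's actual BenchmarkTensor.test_correctness implementation):

# Sketch only: verify a LoRA expand-style op against a plain PyTorch
# reference *before* any benchmarking runs can mutate the outputs.
import torch

def check_expand_correctness(x: torch.Tensor,              # (num_tokens, rank)
                             lora_b: torch.Tensor,         # (num_loras, rank, hidden)
                             token_lora_ids: torch.Tensor, # (num_tokens,)
                             out: torch.Tensor,            # (num_tokens, hidden)
                             run_kernel,                   # callable that writes into `out`
                             add_inputs: bool) -> bool:
    # Reference: per-token LoRA selection followed by a matmul.
    ref = torch.einsum('tr,trh->th', x, lora_b[token_lora_ids])
    if add_inputs:
        ref = out.clone() + ref   # expand accumulates into the existing output
    run_kernel()                  # single invocation, before benchmarking starts
    return torch.allclose(out, ref, rtol=1e-2, atol=1e-2)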
Thanks @jeejeelee.
The numbers in the table are timings for 32 consecutive invocations of the benchmarking function, run from inside a CUDA graph. Dividing the timings by 32 yields the time per single invocation. Sorry about the confusion, I should have mentioned this earlier; I have added comments and print statements in the code to make this clear.
When run in CUDA graph mode, the graph is captured with N invocations of the benchmarking function.
vllm/benchmarks/kernels/utils.py, line 133 in befe705:
    with torch.cuda.stream(stream):
The reported time is the time taken for a single graph replay.
vllm/benchmarks/kernels/utils.py, line 145 in befe705:
    return TBenchmark.Timer(
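For readers unfamiliar with the pattern, here is a generic sketch of capture-then-replay timing with torch.cuda.CUDAGraph and torch.utils.benchmark (an illustration of the approach, not the actual code in benchmarks/kernels/utils.py; the warm-up that real graph capture usually needs is omitted for brevity):

# Capture N consecutive invocations of the benchmarked fn in one CUDA graph,
# then time a single graph replay; divide the reported time by N to get the
# per-invocation cost without per-launch overheads.
import torch
import torch.utils.benchmark as TBenchmark

def time_with_cuda_graph(fn, kwargs, n_ops: int = 32):
    stream = torch.cuda.Stream()
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.stream(stream):
        # NOTE: real code would warm up fn on a side stream before capture.
        with torch.cuda.graph(graph, stream=stream):
            for _ in range(n_ops):
                fn(**kwargs)
    # One replay() executes all n_ops captured invocations back to back.
    return TBenchmark.Timer(
        stmt="g.replay()",
        globals={"g": graph},
        label="cuda-graph replay",
        description=f"{n_ops} ops per replay").blocked_autorange(min_run_time=1)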
I ran the benchmarks for SGMV expand again. Please look at rows 51 to 91 for the normalized timings:
https://docs.google.com/spreadsheets/d/1gSUNdZ08H-057SUnxeWhPKBWrg5Hc3QkRJS6YnAq6_E/edit?usp=sharing
- In the table, for smaller problem shapes, you can see how the normalized CUDA graph timings avoid the Triton kernel launch overheads.
- You can also see how having a bigger pool of arguments helps mitigate caching effects during benchmarking.
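As an aside, the argument-pool idea is roughly the following (a sketch with placeholder shapes, not the PR's code): instead of invoking the kernel on the same tensors every iteration, which keeps them resident in cache, the benchmark cycles through a pool of identically shaped argument sets.

# Rotate through a pool of pre-allocated argument tensors so repeated
# benchmark iterations do not always touch the same (already cached) memory.
import itertools
import torch

def make_arg_pool(arg_pool_size: int, shape=(4096, 4096), dtype=torch.float16):
    return [torch.rand(shape, dtype=dtype, device="cuda")
            for _ in range(arg_pool_size)]

def run_pooled(fn, pool, n_iters: int):
    cycler = itertools.cycle(pool)
    for _ in range(n_iters):
        fn(next(cycler))  # each iteration sees a different argument tensor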
About testing: I have added the functionality to test the outputs after the benchmarking run anyway 👍
Let me use {"bs": 8192, "sl": 8, "m": 65536, "k": 16, "n": 16384, "num_loras": 4, "sort_by_lora": true, "num_slices": 1} as an example. If I understand correctly, it would require 8192 calls to torch.mm versus 1 call to the Triton kernel, right?
Not sure I understand, but for
{"bs": 8192, "sl": 8, "m": 65536, "k": 16, "n": 16384, "num_loras": 4, "sort_by_lora": true, "num_slices": 1}
torch.mm's M, K, N are "m": 65536, "k": 16, "n": 16384:
- A (65536 x 16) x B (16 x 16384) = C (65536 x 16384). It is a single torch.mm call to compute all of C.
For LoRA, the A matrix is the same size, but we have 4 B matrices (LoRA weights) and C is the output based on the LoRA ID mapping. Again, it is a single Triton call to compute all of C.
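To make the comparison concrete, here is a sketch of what the roofline measures versus what the multi-LoRA expand has to compute, using the shapes from the config above (the per-LoRA loop is plain PyTorch written only for illustration; the actual Triton kernel does all of it in one launch; scale m/n down if GPU memory is tight):

import torch

m, k, n, num_loras = 65536, 16, 16384, 4

# Roofline: one dense matmul over all tokens, as if there were a single LoRA.
a = torch.rand(m, k, dtype=torch.float16, device="cuda")
b = torch.rand(k, n, dtype=torch.float16, device="cuda")
c_roofline = torch.mm(a, b)  # single torch.mm call computes all of C

# Multi-LoRA case: same A, but each token row uses the B matrix of its
# assigned LoRA id. Shown as a per-LoRA loop purely for readability.
b_loras = torch.rand(num_loras, k, n, dtype=torch.float16, device="cuda")
token_lora_ids = torch.randint(0, num_loras, (m,), device="cuda")
c_lora = torch.empty(m, n, dtype=torch.float16, device="cuda")
for lora_id in range(num_loras):
    rows = token_lora_ids == lora_id
    c_lora[rows] = a[rows] @ b_loras[lora_id]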
    num_ops_in_cuda_graph=arg_pool_size) if with_cuda_graph else None
with Bench(cuda_graph_params, ctx.bench_label(),
           ctx.bench_sublabel(op_type), description, torch.mm,
           **mm_kwargs) as bench:
QQ: Does torch.mm support group gemm? If not, as a baseline, how does it compute the multi-LoRA gemm?
AFAIK, it does not. I meant for the torch.mm benchmark (just a matmul) to serve as a roofline. Sorry about the confusion; I have renamed the functions and added a comment.
    'max_seq_length': max_seq_len,
    'token_nums': num_tokens,
    'add_inputs': True,
}
If add_inputs is True, the expand-related kernel performs a group-gemm plus an addition into the existing outputs, rather than just the group-gemm alone.
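In other words, roughly (an illustrative pure-PyTorch rendering of the two behaviours for the single-LoRA case, not the kernel's code):

import torch

def expand_reference(out: torch.Tensor, x: torch.Tensor, lora_b: torch.Tensor,
                     add_inputs: bool) -> None:
    y = x @ lora_b
    if add_inputs:
        out += y       # group-gemm result accumulated into the existing output
    else:
        out.copy_(y)   # group-gemm result simply overwrites the output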
That was intentional, so that we benchmark the most used and most expensive version. But I see the value in passing this via the CLI; I have added an --expand-fn-add-inputs argument.
"case. It is provided as a roofline for comparing our LoRA Kernel " | ||
"implementations. It is expected that the LoRA kernels will be " | ||
"slower than torch.mm in cases where num_loras is big. But for " | ||
"small num_loras the goal should be to match the torch.mm numbers.") |
@jeejeelee I have added this note on how to interpret the torch.mm numbers. The console output looks like:
== All Results ====
[---------------------------------------------------------------------------------------- lora-torch.float16 | cugraph 32 ops ----------------------------------------------------------------------------------------]
| single-lora roofline using torch.mm (f16xf16=>f16) | SGMV_EXPAND(add_inputs=False) (f32xf16=>f16)
1 threads: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
{"bs": 16, "sl": 16, "m": 256, "k": 16, "n": 2048, "num_loras": 4, "sort_by_lora": true, "num_slices": 1} | 132.6 | 174.5
Times are in microseconds (us).
Note : The timings reported above are for 32 consecutive invocations of the benchmarking functions. Please divide by 32 for single invocation timings
Note on Comparison with torch.mm : The torch.mm numbers are benchmark numbers of a simple matmul emulating the single lora case. It is provided as a roofline for comparing our LoRA Kernel implementations. It is expected that the LoRA kernels will be slower than torch.mm in cases where num_loras is big. But for small num_loras the goal should be to match the torch.mm numbers.
Can we output this information just once?
assert all([
    bt.test_correctness(op_type, expand_fn_add_inputs)
    for bt in bench_tensors
])
@jeejeelee I have removed the correctness testing on the benchmarking results. Instead, we now test the benchmarking function before running the benchmarks.
Matching the outputs of the benchmarking runs is very flaky and intractable. The root cause is the updates to the output matrices, combined with the fact that the benchmarking script can run the benchmarking function multiple times:
- For expand-related functions, when add_inputs=True, the output matrix is updated an arbitrary number of times, making correctness testing intractable.
- For shrink functions, depending on whether SPLIT_K is used in the kernels, the results are either added to the output or stored directly. When results are added to the output, correctness testing becomes intractable.
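A toy illustration of why this is intractable (not the PR's code): once the op accumulates into its output and the timer invokes it an unknown number of times, the final buffer no longer matches a single-invocation reference.

import torch

x = torch.rand(16, 8)
w = torch.rand(8, 32)
out = torch.zeros(16, 32)

def accumulating_op():
    # Like expand with add_inputs=True, or shrink with SPLIT_K accumulation.
    out.add_(x @ w)

reference = x @ w   # what a single invocation should produce
n_timer_runs = 5    # the timer picks this; the benchmark cannot know it up front
for _ in range(n_timer_runs):
    accumulating_op()

# out is now ~n_timer_runs * reference, so comparing against `reference`
# after benchmarking fails even though the kernel itself is correct.
assert not torch.allclose(out, reference)
assert torch.allclose(out, n_timer_runs * reference)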
Thank you for your contribution and patience. Overall LGTM after completing the modifications below.
from vllm.lora.ops.sgmv_expand import sgmv_expand
from vllm.lora.ops.sgmv_shrink import sgmv_shrink
from vllm.lora.ops.utils import _LORA_A_PTR_DICT, _LORA_B_PTR_DICT
from vllm.utils import FlexibleArgumentParser
We recently merged #11100; these imports need to be reimplemented.
Thanks for the heads-up @jeejeelee. I have fixed it 🙌
Add LoRA kernel micro benchmarks for tuning/optimizing LoRA kernels
Added a utils.py in benchmarks/kernels/ that implements a Bench class. This Bench class is abstract enough to be reused in future benchmark implementations. The benchmarking script can run in one of 3 modes: range_bench, list_bench, and model_bench.
range_bench: use this to benchmark a range of hidden-dimension sizes and LoRA ranks.
Example:
python3 benchmarks/kernels/benchmark_lora.py range_bench --arg-pool-size 32 --batch-sizes 1 16 32 --dtype torch.float16 --num-loras 1 4 --op-types bgmv_shrink bgmv_expand sgmv_shrink sgmv_expand sgmv_expand_slice bgmv_expand_slice --seq-lengths 1 16 --sort-by-lora-id 1 --cuda-graph-nops 32 --hidden-sizes-start 1024 --hidden-sizes-end 4096 --hidden-sizes-increment 1024 --lora-ranks-start 8 --lora-ranks-end 24 --lora-ranks-increment 8
list_bench: when range benchmarking is too restrictive, use this mode to simply list the hidden-dimension sizes and LoRA-rank values.
Example:
python3 benchmarks/kernels/benchmark_lora.py list_bench --arg-pool-size 32 --batch-sizes 1 16 32 --dtype torch.float16 --hidden-sizes 2048 2049 4096 8192 --lora-ranks 2 8 16 20 --num-loras 1 4 --op-types bgmv_shrink bgmv_expand sgmv_shrink sgmv_expand sgmv_expand_slice bgmv_expand_slice --seq-lengths 1 16 --sort-by-lora-id 1 --cuda-graph-nops 32
model_bench: specify a model to use that model's weight shapes, to understand the model execution performance.
Example:
python3 benchmarks/kernels/benchmark_lora.py model_bench --models meta-llama/Llama-3-8b --arg-pool-size 32 --batch-sizes 1 16 32 --dtype torch.float16 --lora-ranks 16 --num-loras 1 4 --op-types bgmv_shrink bgmv_expand sgmv_shrink sgmv_expand sgmv_expand_slice bgmv_expand_slice --seq-lengths 1 16 --sort-by-lora-id 1 --cuda-graph-nops 32
Some benchmarks run on main, and later collated, can be found here: https://docs.google.com/spreadsheets/d/16iA8nZyuhfOctNg6KSJ1Y0Ve5udZKDOMsiDYDORNyks/edit?usp=sharing