
[misc] Add LoRA kernel micro benchmarks #11579

Merged · 17 commits · Jan 16, 2025

Conversation

Contributor
@varun-sundar-rabindranath commented Dec 28, 2024

Add LoRA kernel micro benchmarks for tuning/optimizing LoRA kernels

  • The benchmarking script creates a pool of tensors for each kernel argument and uses the tensors in order during benchmarking. A bigger argument pool helps mitigate caching effects during benchmarking (see the sketch after this list).
  • The benchmarking script can run the kernels inside a CUDA graph. This is particularly useful for benchmarking Triton kernels due to their launch overhead.
  • The benchmarking script also benchmarks torch.mm as a baseline.
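As a rough illustration of the argument-pool idea, here is a minimal sketch; it is not the PR's actual implementation, and the pool size, tensor shapes, and helper names are made up for illustration:

import torch

# Hypothetical sketch: pre-allocate a pool of distinct argument tensors and
# cycle through them so that back-to-back benchmark iterations do not keep
# re-reading the same cache-resident buffers.
ARG_POOL_SIZE = 32  # mirrors the --arg-pool-size CLI argument

arg_pool = [
    torch.randn(4096, 16, device="cuda", dtype=torch.float16)
    for _ in range(ARG_POOL_SIZE)
]

def run_from_pool(kernel_fn, num_iters: int) -> None:
    # Use the pooled tensors in order, wrapping around when the pool is exhausted.
    for i in range(num_iters):
        kernel_fn(arg_pool[i % ARG_POOL_SIZE])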

Added a utils.py in benchmarks/kernels/ that implements a Bench class. The Bench class is general enough to reuse in future benchmark implementations.
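For context, a minimal sketch of what a Bench-style context manager could look like on top of torch.utils.benchmark; the real class in benchmarks/kernels/utils.py also handles CUDA graph capture and argument pools, and its constructor differs (names below are illustrative only):

import torch.utils.benchmark as TBenchmark

class SimpleBench:
    """Illustrative stand-in for the PR's Bench class (not its real API)."""

    def __init__(self, label: str, sub_label: str, description: str, fn, **kwargs):
        self.timer = TBenchmark.Timer(
            stmt="fn(**kwargs)",
            globals={"fn": fn, "kwargs": kwargs},
            label=label,
            sub_label=sub_label,
            description=description,
        )

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        return False  # do not suppress exceptions

    def run(self, min_run_time: float = 1.0):
        # blocked_autorange repeats the statement until roughly min_run_time
        # seconds of measurements have been collected.
        return self.timer.blocked_autorange(min_run_time=min_run_time)

Measurements produced this way can be grouped and printed with TBenchmark.Compare, which yields tables similar to the console output shown later in this thread.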

The benchmarking script can run in one of three modes:

  1. range_bench
    Example: python3 benchmarks/kernels/benchmark_lora.py range_bench --arg-pool-size 32 --batch-sizes 1 16 32 --dtype torch.float16 --num-loras 1 4 --op-types bgmv_shrink bgmv_expand sgmv_shrink sgmv_expand sgmv_expand_slice bgmv_expand_slice --seq-lengths 1 16 --sort-by-lora-id 1 --cuda-graph-nops 32 --hidden-sizes-start 1024 --hidden-sizes-end 4096 --hidden-sizes-increment 1024 --lora-ranks-start 8 --lora-ranks-end 24 --lora-ranks-increment 8

Use this mode to benchmark a range of hidden-dimension sizes and LoRA ranks.

  2. list_bench
    Example: python3 benchmarks/kernels/benchmark_lora.py list_bench --arg-pool-size 32 --batch-sizes 1 16 32 --dtype torch.float16 --hidden-sizes 2048 2049 4096 8192 --lora-ranks 2 8 16 20 --num-loras 1 4 --op-types bgmv_shrink bgmv_expand sgmv_shrink sgmv_expand sgmv_expand_slice bgmv_expand_slice --seq-lengths 1 16 --sort-by-lora-id 1 --cuda-graph-nops 32

When range benchmarking is too restrictive, use this version to simply list the hidden-dimension sizes and lora-rank values.

  3. model_bench
    Example: python3 benchmarks/kernels/benchmark_lora.py model_bench --models meta-llama/Llama-3-8b --arg-pool-size 32 --batch-sizes 1 16 32 --dtype torch.float16 --lora-ranks 16 --num-loras 1 4 --op-types bgmv_shrink bgmv_expand sgmv_shrink sgmv_expand sgmv_expand_slice bgmv_expand_slice --seq-lengths 1 16 --sort-by-lora-id 1 --cuda-graph-nops 32

Specify a model so the benchmark uses that model's weight shapes, which helps in understanding model execution performance.

Some benchmarks were run on main using:

NUM_LORAS=(4)
BATCH_SIZES=(16 128 256 512 1024 2048 8192)
HIDDEN_SIZES=(1024 2048 4096 8192 16384)
RANKS=(16)

echo "Benchmarking bgmv punica kernels ..."
python3 benchmarks/kernels/benchmark_lora.py list_bench --dtype torch.float16 --arg-pool-size 32 --with-cuda-graph --num-loras ${NUM_LORAS[@]} --op-types bgmv_shrink bgmv_expand --seq-lengths 1 --hidden-sizes ${HIDDEN_SIZES[@]} --batch-sizes ${BATCH_SIZES[@]} --sort-by-lora-id 1

echo "Benchmarking sgmv punica kernels ..."
python3 benchmarks/kernels/benchmark_lora.py list_bench --dtype torch.float16 --arg-pool-size 32 --with-cuda-graph --num-loras ${NUM_LORAS[@]} --op-types sgmv_shrink sgmv_expand --seq-lengths 8 --hidden-sizes ${HIDDEN_SIZES[@]} --batch-sizes ${BATCH_SIZES[@]} --sort-by-lora-id 1

The results, collated afterwards, can be found here: https://docs.google.com/spreadsheets/d/16iA8nZyuhfOctNg6KSJ1Y0Ve5udZKDOMsiDYDORNyks/edit?usp=sharing


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@varun-sundar-rabindranath
Contributor Author

@jeejeelee This PR adds some tooling for benchmarking LoRA kernels. Should be useful for further optimizing LoRA kernels and for #11234 . Note that this PR emulates the *_expand_slice operations by calling the kernels back-to-back like in the tests. However, the change should be simple enough to support #11234. PTAL.

@mgoin fyi

seq_len_timers.append(
    bench_optype(_ctx, args.arg_pool_size, bench_op,
                 args.with_cuda_graph))
Collaborator
Perhaps we need to ensure the compute results are aligned

Contributor Author

For expand-related operations with add_inputs=True, testing the benchmarking results for correctness is hard because the function is run an indeterminate number of times.

I have added a test_correctness method to BenchmarkTensor class that can be invoked with a CLI argument --test-correctness. Note that this tests for correctness before the benchmarking is run. This should give us enough confidence about the validity of the results.

What do you think?

Collaborator

I'm particularly surprised by the execution times in the table, especially the result shown in A164. SGMV shouldn't be this slow, so I think we should first verify that the calculation results are correct.

Contributor Author

Thanks @jeejeelee

The numbers in the table are timings for 32 consecutive invocations of the benchmarking function, run from inside a cuda graph. Dividing the timings by 32 yields the time per single invocation. Sorry about the confusion, I should have mentioned this earlier. I have added comments and print statements in the code to make this clear.

When run in cuda graph mode, the graph is captured with N invocations of the benchmarking function (see the with torch.cuda.stream(stream): block in the script). The reported time is the time taken for a single graph replay (see the return TBenchmark.Timer(...) call).
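As a hedged sketch of that capture-then-replay scheme (warm-up, argument pools, and error handling are omitted, and kernel_fn is a placeholder for the benchmarked op, not a function from the PR):

import torch
import torch.utils.benchmark as TBenchmark

def time_n_ops_in_graph(kernel_fn, n_ops: int = 32):
    stream = torch.cuda.Stream()
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.stream(stream):
        # Capture n_ops back-to-back invocations into a single CUDA graph.
        with torch.cuda.graph(graph, stream=stream):
            for _ in range(n_ops):
                kernel_fn()
    # The timed statement is one graph replay, i.e. n_ops kernel invocations
    # without per-launch overhead; divide the result by n_ops to get the
    # per-invocation time.
    return TBenchmark.Timer(
        stmt="graph.replay()",
        globals={"graph": graph},
        label="cuda-graph replay",
        description=f"{n_ops} ops per replay",
    ).blocked_autorange(min_run_time=1.0)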

I ran the benchmarks for SGMV expand again - Please look at rows 51 to 91 for the normalized timings.
https://docs.google.com/spreadsheets/d/1gSUNdZ08H-057SUnxeWhPKBWrg5Hc3QkRJS6YnAq6_E/edit?usp=sharing

  • In the table, for smaller problem shapes, you can see how the normalized cuda graph timings do not include the Triton kernel launch overheads.
  • In the table, you can also see how having a bigger pool of arguments helps mitigate caching effects during benchmarking.

About testing: I have added the functionality to test the outputs after the benchmarking run anyway 👍

Collaborator

Let me use {"bs": 8192, "sl": 8, "m": 65536, "k": 16, "n": 16384, "num_loras": 4, "sort_by_lora": true, "num_slices": 1} as an example. If I understand correctly, it would require 8192 executions of torch.mm versus 1 execution of the triton kernel, right?

Contributor Author

Not sure I understand, but for

{"bs": 8192, "sl": 8, "m": 65536, "k": 16, "n": 16384, "num_loras": 4, "sort_by_lora": true, "num_slices": 1}

torch.mm's M, K, N are "m": 65536, "k": 16, "n": 16384, i.e. A (65536 x 16) x B (16 x 16384) = C (65536 x 16384). It is a single torch.mm call that computes all of C.

For LoRA, the A matrix is the same size, but we have 4 B matrices (LoRA weights) and C is produced based on the LoRA ID mapping. Again, it is a single triton kernel call that computes all of C.
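To make the shapes concrete, a rough sketch of the two computations being compared (scaled-down sizes so it runs on a small GPU; the Python loop over LoRA IDs is only a readable emulation of what the single Triton kernel launch does, not how the punica kernels are implemented):

import torch

# Scaled-down stand-ins for the example's m=65536, k=16, n=16384, num_loras=4.
m, k, n, num_loras = 4096, 16, 1024, 4

A = torch.randn(m, k, device="cuda", dtype=torch.float16)
B = torch.randn(k, n, device="cuda", dtype=torch.float16)
lora_B = torch.randn(num_loras, k, n, device="cuda", dtype=torch.float16)
lora_ids = torch.randint(0, num_loras, (m,), device="cuda")

# Roofline: one torch.mm call computes all of C against a single B matrix.
C_roofline = torch.mm(A, B)

# Multi-LoRA: each row of A is multiplied by the B matrix of its LoRA ID.
# The punica triton kernel does this in one launch; the loop below is just a
# slow reference computation producing the same result.
C_lora = torch.empty(m, n, device="cuda", dtype=torch.float16)
for lora_id in range(num_loras):
    rows = lora_ids == lora_id
    C_lora[rows] = A[rows] @ lora_B[lora_id]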

num_ops_in_cuda_graph=arg_pool_size) if with_cuda_graph else None
with Bench(cuda_graph_params, ctx.bench_label(),
           ctx.bench_sublabel(op_type), description, torch.mm,
           **mm_kwargs) as bench:
Collaborator

QQ: Does torch.mm support grouped GEMM? If not, as a baseline, how does it compute the multi-LoRA GEMM?

Contributor Author

AFAIK, it does not. I meant for the torch.mm (just a matmul) benchmark to serve as a roofline. Sorry about the confusion; I have renamed the functions and added a comment.

    'max_seq_length': max_seq_len,
    'token_nums': num_tokens,
    'add_inputs': True,
}
Collaborator
@jeejeelee commented Dec 31, 2024

If add_inputs is True, the expand-related kernel performs group-gemm plus accumulation into the existing outputs, rather than just group-gemm alone.

Contributor Author

That was intentional, so that we benchmark the most-used and most expensive version. But I see the value in passing this via the CLI; I have added an --expand-fn-add-inputs argument to the CLI.
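For reference, the behavioral difference being discussed, sketched with plain torch ops (shapes are arbitrary and the real semantics live in the expand kernels; this is only an approximation of the add_inputs flag):

import torch

A = torch.randn(256, 16, device="cuda", dtype=torch.float16)
B = torch.randn(16, 2048, device="cuda", dtype=torch.float16)
out = torch.zeros(256, 2048, device="cuda", dtype=torch.float16)

# add_inputs=False (roughly): the expand result overwrites the output buffer.
out.copy_(A @ B)

# add_inputs=True (roughly): the expand result is accumulated into the
# existing output, so repeatedly invoking the op keeps changing `out`.
out += A @ B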

"case. It is provided as a roofline for comparing our LoRA Kernel "
"implementations. It is expected that the LoRA kernels will be "
"slower than torch.mm in cases where num_loras is big. But for "
"small num_loras the goal should be to match the torch.mm numbers.")
Contributor Author

@jeejeelee I have added this note on how to interpret the torch.mm numbers. The console output looks like

== All Results ====
[---------------------------------------------------------------------------------------- lora-torch.float16 | cugraph 32 ops ----------------------------------------------------------------------------------------]
                                                                                                                 |  single-lora roofline using torch.mm (f16xf16=>f16)  |  SGMV_EXPAND(add_inputs=False) (f32xf16=>f16)
1 threads: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      {"bs": 16, "sl": 16, "m": 256, "k": 16, "n": 2048, "num_loras": 4, "sort_by_lora": true, "num_slices": 1}  |                        132.6                         |                     174.5                    

Times are in microseconds (us).

Note : The timings reported above is for 32 consecutive invocations of the benchmarking functions. Please divide by 32 for single invocation timings 
Note on Comparison with torch.mm : The torch.mm numbers are benchmark numbers of a simple matmul emulating the single lora case. It is provided as a roofline for comparing our LoRA Kernel implementations. It is expected that the LoRA kernels will be slower than torch.mm in cases where num_loras is big. But for small num_loras the goal should be to match the torch.mm numbers.
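For example, normalizing the SGMV_EXPAND cell of the sample row above is plain arithmetic, using the numbers printed in the sample output:

CUDA_GRAPH_NOPS = 32                  # --cuda-graph-nops used for the run
reported_us = 174.5                   # SGMV_EXPAND cell in the sample row
per_invocation_us = reported_us / CUDA_GRAPH_NOPS   # ~5.45 us per kernel call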

Collaborator

Can we output this information just once?

assert all([
    bt.test_correctness(op_type, expand_fn_add_inputs)
    for bt in bench_tensors
])
Contributor Author

@jeejeelee I have removed the correctness testing of benchmarking results. Instead, we now test the benchmarking function before running the benchmarks.

Matching the outputs of the benchmarking runs is very flaky and intractable. The root cause is the updates to the output matrices combined with the fact that the benchmarking script can run the benchmarking function multiple times:
  • For expand-related functions, when add_inputs=True, the output matrix is updated an arbitrary number of times, making correctness testing intractable.
  • For shrink functions, depending on whether SPLIT_K is used in the kernels, the results are either added to the output or stored directly. When results are added to the output, correctness testing becomes intractable.
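A hedged sketch of the test-before-benchmarking pattern described above (the real check lives on the BenchmarkTensors class and compares against vLLM's reference path; the function and argument names below are illustrative only):

import torch

def check_before_benchmark(kernel_fn, reference_fn, inputs, out,
                           atol: float = 1e-2, rtol: float = 1e-2) -> None:
    """Run the op once on copies of the output and compare against a reference.

    This is done before the benchmark loop because the benchmark may invoke
    the op an indeterminate number of times, repeatedly accumulating into
    the output buffer.
    """
    out_kernel = out.clone()
    out_ref = out.clone()
    kernel_fn(*inputs, out_kernel)
    reference_fn(*inputs, out_ref)
    torch.testing.assert_close(out_kernel, out_ref, atol=atol, rtol=rtol)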

Collaborator
@jeejeelee left a comment

Thank you for your contribution and patience. Overall LGTM after completing the modifications below.

from vllm.lora.ops.sgmv_expand import sgmv_expand
from vllm.lora.ops.sgmv_shrink import sgmv_shrink
from vllm.lora.ops.utils import _LORA_A_PTR_DICT, _LORA_B_PTR_DICT
from vllm.utils import FlexibleArgumentParser
Collaborator

We recently merged #11100; these imports need to be reimplemented.

Contributor Author

Thanks for the heads up @jeejeelee . I have fixed it 🙌

Varun Sundar Rabindranath added 12 commits January 16, 2025 10:54 (each Signed-off-by: Varun Sundar Rabindranath).
Varun Sundar Rabindranath added 5 commits January 16, 2025 10:54, including "test only benchmark tensors that participated" (each Signed-off-by: Varun Sundar Rabindranath).
@mgoin mgoin enabled auto-merge (squash) January 16, 2025 15:10
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 16, 2025
@mgoin mgoin merged commit 5fd24ec into vllm-project:main Jan 16, 2025
48 checks passed