[Misc] Kernel Benchmark for RMSNorm
#11241
Conversation
Co-authored-by: Xiaoyu Zhang <[email protected]> Signed-off-by: Roger Wang <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
LGTM!
Here are my results just running on H100:
python benchmark_rmsnorm.py
...
rmsnorm-perf-without-residual:
head_num batch_size seq_len HuggingFace FlashInfer vLLM
0 32.0 1.0 64.0 52.703999 9.792000 11.744000
1 32.0 1.0 128.0 46.208002 11.648000 13.824000
2 32.0 1.0 256.0 52.928001 12.032000 14.272000
3 32.0 1.0 512.0 64.736001 14.208000 18.784000
4 32.0 1.0 1024.0 91.807999 19.872000 27.584000
5 32.0 4.0 64.0 53.056002 13.120000 14.656000
6 32.0 4.0 128.0 65.920003 14.688000 19.200001
7 32.0 4.0 256.0 94.463997 19.904001 27.327999
8 32.0 4.0 512.0 184.064001 31.168001 46.176001
9 32.0 4.0 1024.0 333.792001 60.864002 97.152002
10 32.0 16.0 64.0 92.896000 19.680001 27.200000
11 32.0 16.0 128.0 183.904007 30.975999 45.791999
12 32.0 16.0 256.0 332.704008 60.864002 97.184002
13 32.0 16.0 512.0 618.336022 109.024003 179.296002
14 32.0 16.0 1024.0 1192.352057 205.791995 343.456000
15 32.0 64.0 64.0 333.472013 60.896002 97.280003
16 32.0 64.0 128.0 617.824018 109.024003 179.296002
17 32.0 64.0 256.0 1192.288041 205.888003 343.423992
18 32.0 64.0 512.0 2335.776091 399.295986 671.711981
19 32.0 64.0 1024.0 4625.023842 789.951980 1330.960035
20 48.0 1.0 64.0 48.608001 10.144000 12.704000
21 48.0 1.0 128.0 53.056002 11.744000 14.848000
22 48.0 1.0 256.0 62.463999 13.504000 16.303999
23 48.0 1.0 512.0 80.959998 17.503999 22.720000
24 48.0 1.0 1024.0 142.752007 26.208000 34.623999
25 48.0 4.0 64.0 62.431999 13.504000 16.272001
26 48.0 4.0 128.0 80.159999 17.535999 22.752000
27 48.0 4.0 256.0 143.360004 26.144000 34.527998
28 48.0 4.0 512.0 266.719997 52.703999 72.031997
29 48.0 4.0 1024.0 476.480007 93.631998 133.056000
30 48.0 16.0 64.0 142.848000 26.144000 34.623999
31 48.0 16.0 128.0 266.128004 52.687999 72.095998
32 48.0 16.0 256.0 477.151990 93.567997 133.151993
33 48.0 16.0 512.0 904.640019 173.823997 249.696001
34 48.0 16.0 1024.0 1763.872027 334.975988 484.351993
35 48.0 64.0 64.0 477.344006 93.631998 133.248001
36 48.0 64.0 128.0 903.551996 173.840001 249.791995
37 48.0 64.0 256.0 1764.448047 334.895998 484.320015
38 48.0 64.0 512.0 3475.487947 658.688009 953.232050
39 48.0 64.0 1024.0 6898.752213 1307.775974 1889.744043
FlashInfer clearly offers a benefit across these configurations.
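To put the gap in perspective, here is a quick calculation of the speedups over the HuggingFace baseline, taken from two rows of the table above (the smallest and largest configurations):

```python
# Speedup of FlashInfer and vLLM over the HuggingFace baseline,
# using two rows copied from the benchmark table above.
rows = {
    # (head_num, batch_size, seq_len): (huggingface, flashinfer, vllm)
    (32, 1, 64): (52.704, 9.792, 11.744),
    (48, 64, 1024): (6898.752, 1307.776, 1889.744),
}

for cfg, (hf, fi, vllm) in rows.items():
    print(cfg, f"FlashInfer {hf / fi:.1f}x, vLLM {hf / vllm:.1f}x faster than HF")
```

Both the smallest and largest shapes show roughly a 4-5x gap between the HuggingFace implementation and the two fused kernels, so the advantage is not limited to one regime.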
Needed to install flashinfer with: uv pip install flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4/
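For readers unfamiliar with the op being compared: all three backends compute the same RMSNorm, sketched below in NumPy for reference (`rmsnorm_ref` is an illustrative name, not a function from the benchmark script):

```python
import numpy as np

def rmsnorm_ref(x, weight, residual=None, eps=1e-6):
    """Reference RMSNorm: y = x / sqrt(mean(x**2, axis=-1) + eps) * weight.

    If `residual` is given it is added to `x` first, matching the fused
    add-residual variant that the benchmark toggles via use_residual.
    """
    if residual is not None:
        x = x + residual
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight
```

The benchmarked kernels fuse this normalization (and optionally the residual add) into a single launch, which is where the gap versus the eager HuggingFace implementation comes from.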
seq_len=128,
hidden_size=4096,
use_residual=args.use_residual)
Making these configurable through args would be perfect.
That's a good point! @jeejeelee
Added in 71af57f
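The suggested change amounts to exposing the hard-coded shape parameters as CLI flags. A minimal sketch of that pattern with argparse (flag names mirror the table columns; the actual PR code may differ):

```python
import argparse

# Hypothetical argparse sketch: expose the benchmark's fixed shape
# values as command-line flags instead of hard-coding them.
parser = argparse.ArgumentParser(description="RMSNorm kernel benchmark")
parser.add_argument("--head-num", type=int, default=32)
parser.add_argument("--batch-size", type=int, default=4)
parser.add_argument("--seq-len", type=int, default=128)
parser.add_argument("--hidden-size", type=int, default=4096)
parser.add_argument("--use-residual", action="store_true")

# An empty list is passed here only so the sketch runs without CLI input;
# a real script would call parser.parse_args() with no argument.
args = parser.parse_args([])
print(args.seq_len, args.hidden_size, args.use_residual)
```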
This is very good to know. The RMSNorm kernel (and the RoPE kernel) is not optimized enough. We should replace it with either the FlashInfer kernel or a Triton kernel.
This PR ports the RMSNorm kernel benchmark authored by @BBuf in sgl-project/sglang#2486 to the vLLM repo, to compare kernel performance between our custom op and FlashInfer.
Co-authored-by: @BBuf