
[Core] in batch prefix caching by delay scheduling #2442

Merged
4 commits merged into sgl-project:main on Dec 11, 2024

Conversation

rkooo567
Contributor

Motivation

Implement in-batch prefix caching.

If a new request matches only a short prefix in the existing cache, and more than one such request shares the same prefix, we deprioritize all of them except one.
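For intuition, here is a minimal sketch of the delay-scheduling idea (hypothetical names such as cache_match_len and inbatch_prefix_key, not the actual scheduler code): among waiting requests whose match against the existing radix cache is short, keep only one request per shared in-batch prefix and delay the rest.

    IN_BATCH_PREFIX_CACHING_THRESHOLD = 32  # tokens; arbitrary, see discussion below

    def schedule_with_delay(reqs, cache_match_len, inbatch_prefix_key):
        scheduled, delayed = [], []
        seen_prefixes = set()
        for r in reqs:
            if cache_match_len(r) > IN_BATCH_PREFIX_CACHING_THRESHOLD:
                # Already a good hit against the existing cache; schedule normally.
                scheduled.append(r)
                continue
            key = inbatch_prefix_key(r)  # prefix shared with other in-batch requests
            if key in seen_prefixes:
                # Deprioritize: let one sibling populate the cache first.
                delayed.append(r)
            else:
                seen_prefixes.add(key)
                scheduled.append(r)
        return scheduled, delayed

Once the one scheduled representative finishes its prefill, its prefix is in the radix cache, so the delayed siblings are later scheduled with a near-full prefix cache hit.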

The feature benefits workloads with many shared-prefix prompts. Here's an extreme example where the feature helps.

    prompts = [
        *["Hello, my name is " * 50 + chr(i) for i in range(65, 87)],
        *[
            "The president of the United States is " * 50 + chr(i)
            for i in range(65, 87)
        ],
        *["capital of France is " * 50 + chr(i) for i in range(65, 87)],
        *["The future of AI is " * 50 + chr(i) for i in range(65, 87)],
        *["dkfjklwerkj lskjdfsa " * 50 + chr(i) for i in range(65, 87)],
        *["What time is it now? " * 50 + chr(i) for i in range(65, 87)],
        *["hello it's good to see you " * 50 + chr(i) for i in range(65, 87)],
        *["xai is the " * 50 + chr(i) for i in range(65, 87)],
        *["oops " * 50 + chr(i) for i in range(65, 87)],
        *["hello it's me " * 50 + chr(i) for i in range(65, 87)],
    ]


With these prompts, we can see a 2x+ performance improvement:

# after
e2e takes 807.3112380225211 ms
# before
e2e takes 1828.76543502789 ms

I also added a bench_prefix.py script, originally written by @Ying1123. With prefix_len=1024, gen_len=128, and sampling_size=32, I could see a 10%+ performance improvement:

# lpm-2
Throughput: 94.79 requests/s, 109553.50 tokens/s
# lpm
Throughput: 83.51 requests/s, 96514.85 tokens/s

The feature adds slightly more overhead to calc_priority because it requires an additional radix cache operation. I believe it will be negligible with the overlap scheduler. I also verified there's no regression with python -m sglang.bench_offline_throughput --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --num-prompts 10, just in case.
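For reference, a rough way to time e2e numbers like the ones above with the offline engine (a sketch only; the exact Engine/generate signatures may differ across sglang versions):

    import time

    import sglang as sgl

    # Abbreviated version of the shared-prefix prompt set from above.
    prompts = ["Hello, my name is " * 50 + chr(i) for i in range(65, 87)]

    llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
    start = time.perf_counter()
    llm.generate(prompts, {"temperature": 0, "max_new_tokens": 32})
    print(f"e2e takes {(time.perf_counter() - start) * 1000} ms")
    llm.shutdown()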

Modifications

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@ByronHsu
Collaborator

Awesome enhancement! Can we use #1990 instead of adding a new bench script?

@ByronHsu
Collaborator

cc @MrAta I think this is useful for our case too

@rkooo567
Contributor Author

# dfs-weight after
e2e takes 778.6834560101852 ms
# dfs-weight before
e2e takes 1497.2664599772543 ms

@Ying1123
Member

Does #1990 also test the in-batch performance? We need a test to make sure this new algorithm is similar to or better than using a "prompt hint" - inserting the shared part ahead of the batch.
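For context, the "prompt hint" baseline could look roughly like this (a sketch, with the same hypothetical sgl.Engine setup as in the timing snippet above):

    import sglang as sgl

    llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

    shared = "Hello, my name is " * 50
    prompts = [shared + chr(i) for i in range(65, 87)]

    # "Prompt hint": run the shared part once, ahead of the batch, so it lands
    # in the radix cache before the real requests are scheduled.
    llm.generate(shared, {"temperature": 0, "max_new_tokens": 1})
    llm.generate(prompts, {"temperature": 0, "max_new_tokens": 32})
    llm.shutdown()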

# If there are more than 1 request that have small matching prefix from
# existing cache, but all those requests share the same prefix, we prefer
# to schedule only one of them so that we can increase the cache hit rate.
# We prefer to set IN_BATCH_PREFIX_CACHING_THRESHOLD > 0 because too small
Collaborator

`IN_BATCH_PREFIX_CACHING_THRESHOLD > 0`

Why > 0? Should this be > 32?

"It is kind of common when the engine is long running (e.g., imagine "the")."

What does imagine "the" mean?

Contributor Author

If the threshold is 0, this optimization is not applied to prefixes like "the", which are common.

Regarding the comment, I just meant == 0 is not ideal because it misses cases like the "the" prefix.
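In code terms, the gating condition is roughly this (a sketch with hypothetical names):

    IN_BATCH_PREFIX_CACHING_THRESHOLD = 32  # > 0, per the "the" example above

    def eligible_for_in_batch_delay(cache_match_len: int) -> bool:
        # With a threshold of 0, a prompt starting with a common cached word
        # like "the" already has a nonzero match length and would never be
        # considered; a small positive threshold keeps such requests eligible.
        return cache_match_len <= IN_BATCH_PREFIX_CACHING_THRESHOLD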

Contributor Author

32 is also an arbitrary value, actually. I didn't do much tuning here.

Collaborator

Makes sense

@MrAta
Contributor

MrAta commented Dec 11, 2024

This looks great!

@ByronHsu ByronHsu merged commit 9208618 into sgl-project:main Dec 11, 2024
15 checks passed
@rkooo567
Contributor Author

@merrymercy The current policy is that if we don't have more requests to schedule, we just schedule the shared-prefix requests together (instead of delaying them), but @merrymercy pointed out we should still schedule one request and delay the others. We will revert this PR and merge it again with this fix.
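A sketch of the corrected policy (hypothetical names): even when there are no other requests left to fill the batch, still schedule only one representative per shared prefix and delay the siblings.

    def pick_schedulable(reqs, inbatch_prefix_key):
        # Follow-up fix: never co-schedule shared-prefix siblings, even if
        # the batch would otherwise run underfilled; the delayed siblings
        # hit the cache on a later scheduling pass.
        seen, scheduled = set(), []
        for r in reqs:
            key = inbatch_prefix_key(r)
            if key in seen:
                continue  # delay unconditionally instead of co-scheduling
            seen.add(key)
            scheduled.append(r)
        return scheduled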
