
[Core] in batch prefix caching by delay scheduling #2442

Merged
4 commits merged into sgl-project:main on Dec 11, 2024

Conversation

rkooo567
Contributor

Motivation

Implement in-batch prefix caching.

If a new request matches only a short prefix in the existing cache, and more than one such request shares the same prefix, we deprioritize all of them except one.
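For intuition, here is a minimal sketch of the delay-scheduling idea (hypothetical names such as cache_match_len and inbatch_prefix_key, not the actual scheduler code): among waiting requests whose match against the existing radix cache is short, keep only one request per shared in-batch prefix and delay the rest.

    IN_BATCH_PREFIX_CACHING_THRESHOLD = 32  # tokens; arbitrary, see discussion below

    def schedule_with_delay(reqs, cache_match_len, inbatch_prefix_key):
        scheduled, delayed = [], []
        seen_prefixes = set()
        for r in reqs:
            if cache_match_len(r) > IN_BATCH_PREFIX_CACHING_THRESHOLD:
                # Already a good hit against the existing cache; schedule normally.
                scheduled.append(r)
                continue
            key = inbatch_prefix_key(r)  # prefix shared with other in-batch requests
            if key in seen_prefixes:
                # Deprioritize: let one sibling populate the cache first.
                delayed.append(r)
            else:
                seen_prefixes.add(key)
                scheduled.append(r)
        return scheduled, delayed

Once the one scheduled representative finishes its prefill, its prefix is in the radix cache, so the delayed siblings are later scheduled with a near-full prefix cache hit.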

The feature benefits workloads with many shared-prefix prompts. Here's an extreme example where the feature helps.

    prompts = [
        *["Hello, my name is " * 50 + chr(i) for i in range(65, 87)],
        *[
            "The president of the United States is " * 50 + chr(i)
            for i in range(65, 87)
        ],
        *["capital of France is " * 50 + chr(i) for i in range(65, 87)],
        *["The future of AI is " * 50 + chr(i) for i in range(65, 87)],
        *["dkfjklwerkj lskjdfsa " * 50 + chr(i) for i in range(65, 87)],
        *["What time is it now? " * 50 + chr(i) for i in range(65, 87)],
        *["hello it's good to see you " * 50 + chr(i) for i in range(65, 87)],
        *["xai is the " * 50 + chr(i) for i in range(65, 87)],
        *["oops " * 50 + chr(i) for i in range(65, 87)],
        *["hello it's me " * 50 + chr(i) for i in range(65, 87)],
    ]


With these prompts, we can see a 2x+ performance improvement:

# after
e2e takes 807.3112380225211 ms
# before
e2e takes 1828.76543502789 ms

I also added a bench_prefix.py script, originally written by @Ying1123. With prefix_len=1024, gen_len=128, and sampling_size=32, I could see a 10%+ performance improvement:

# lpm-2
Throughput: 94.79 requests/s, 109553.50 tokens/s
# lpm
Throughput: 83.51 requests/s, 96514.85 tokens/s

The feature adds slightly more overhead to calc_priority because it requires an additional radix cache operation. I believe it will be negligible with the overlap scheduler. I also verified there's no regression with python -m sglang.bench_offline_throughput --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --num-prompts 10, just in case.
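For reference, a rough way to time e2e numbers like the ones above with the offline engine (a sketch only; the exact Engine/generate signatures may differ across sglang versions):

    import time

    import sglang as sgl

    # Abbreviated version of the shared-prefix prompt set from above.
    prompts = ["Hello, my name is " * 50 + chr(i) for i in range(65, 87)]

    llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
    start = time.perf_counter()
    llm.generate(prompts, {"temperature": 0, "max_new_tokens": 32})
    print(f"e2e takes {(time.perf_counter() - start) * 1000} ms")
    llm.shutdown()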

Modifications

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@ByronHsu
Collaborator

Awesome enhancement! Can we use #1990 instead of adding a new bench script?

@ByronHsu
Collaborator

cc @MrAta I think this is useful for our case too

@rkooo567
Contributor Author

# dfs-weight after
e2e takes 778.6834560101852 ms
# dfs-weight before
e2e takes 1497.2664599772543 ms

@Ying1123
Member

Does #1990 also test the in-batch performance? We need a test to make sure this new algorithm is similar to or better than using a "prompt hint" - inserting the shared part ahead of the batch.
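For context, the "prompt hint" baseline could look roughly like this (a sketch, with the same hypothetical sgl.Engine setup as in the timing snippet above):

    import sglang as sgl

    llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

    shared = "Hello, my name is " * 50
    prompts = [shared + chr(i) for i in range(65, 87)]

    # "Prompt hint": run the shared part once, ahead of the batch, so it lands
    # in the radix cache before the real requests are scheduled.
    llm.generate(shared, {"temperature": 0, "max_new_tokens": 1})
    llm.generate(prompts, {"temperature": 0, "max_new_tokens": 32})
    llm.shutdown()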

# If there are more than 1 request that have small matching prefix from
# existing cache, but all those requests share the same prefix, we prefer
# to schedule only one of them so that we can increase the cache hit rate.
# We prefer to set IN_BATCH_PREFIX_CACHING_THRESHOLD > 0 because too small
Collaborator

`IN_BATCH_PREFIX_CACHING_THRESHOLD > 0`

Why > 0? Should this be > 32?

"It is kind of common when the engine is long running (e.g., imagine "the")."

What does imagine "the" mean?

Contributor Author

If the threshold is 0, this optimization is not applied to prefixes like "the", which are common.

Regarding the comment, I just meant == 0 is not ideal because it misses cases like the "the" prefix.
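In code terms, the gating condition is roughly this (a sketch with hypothetical names):

    IN_BATCH_PREFIX_CACHING_THRESHOLD = 32  # > 0, per the "the" example above

    def eligible_for_in_batch_delay(cache_match_len: int) -> bool:
        # With a threshold of 0, a prompt starting with a common cached word
        # like "the" already has a nonzero match length and would never be
        # considered; a small positive threshold keeps such requests eligible.
        return cache_match_len <= IN_BATCH_PREFIX_CACHING_THRESHOLD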

Contributor Author

32 is also an arbitrary value, actually. I didn't do much tuning here.

Collaborator

Makes sense

@MrAta
Contributor

MrAta commented Dec 11, 2024

This looks great!

@ByronHsu ByronHsu merged commit 9208618 into sgl-project:main Dec 11, 2024
15 checks passed
@rkooo567
Contributor Author

@merrymercy The current policy is that if we don't have more requests to schedule, we just schedule the shared-prefix requests together (instead of delaying them), but @merrymercy pointed out we should still schedule one request and delay the others. We will revert this PR and merge it again with this fix.
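A sketch of the corrected policy (hypothetical names): even when there are no other requests left to fill the batch, still schedule only one representative per shared prefix and delay the siblings.

    def pick_schedulable(reqs, inbatch_prefix_key):
        # Follow-up fix: never co-schedule shared-prefix siblings, even if
        # the batch would otherwise run underfilled; the delayed siblings
        # hit the cache on a later scheduling pass.
        seen, scheduled = set(), []
        for r in reqs:
            key = inbatch_prefix_key(r)
            if key in seen:
                continue  # delay unconditionally instead of co-scheduling
            seen.add(key)
            scheduled.append(r)
        return scheduled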
