[Core] in batch prefix caching by delay scheduling #2442
Conversation
Awesome enhancement! Can we use #1990 instead of adding a new bench script?
cc @MrAta I think this is useful for our case too.
Does #1990 also test the in-batch performance? We need a test to make sure this new algorithm is similar to or better than using a "prompt hint" (inserting the shared part ahead of the batch).
# If there are more than 1 request that have small matching prefix from
# existing cache, but all those requests share the same prefix, we prefer
# to schedule only one of them so that we can increase the cache hit rate.
# We prefer to set IN_BATCH_PREFIX_CACHING_THRESHOLD > 0 because too small
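To make the quoted comment concrete, here is a minimal sketch of the delay-scheduling idea. This is not the PR's actual code; `Req`, `match_len`, and `in_batch_prefix` are hypothetical stand-ins for the scheduler's real data structures.

```python
from dataclasses import dataclass

IN_BATCH_PREFIX_CACHING_THRESHOLD = 32  # tokens; the exact value is discussed below

@dataclass
class Req:
    match_len: int        # length of the prefix already matched in the radix cache
    in_batch_prefix: str  # prefix shared with other requests in this batch

def select_for_schedule(requests: list[Req]) -> tuple[list[Req], list[Req]]:
    """Schedule at most one request per shared, not-yet-cached prefix;
    delayed requests should hit the radix cache on a later pass."""
    seen_prefixes: set[str] = set()
    scheduled: list[Req] = []
    delayed: list[Req] = []
    for req in requests:
        # A request whose prefix is already well cached gains nothing from
        # being delayed, so schedule it normally. A threshold > 0 keeps very
        # common cached prefixes (e.g. "the") from disabling the optimization.
        if req.match_len > IN_BATCH_PREFIX_CACHING_THRESHOLD:
            scheduled.append(req)
        elif req.in_batch_prefix in seen_prefixes:
            delayed.append(req)  # another request already warms this prefix
        else:
            seen_prefixes.add(req.in_batch_prefix)
            scheduled.append(req)
    return scheduled, delayed
```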
`IN_BATCH_PREFIX_CACHING_THRESHOLD > 0`
Why > 0? Should this be > 32?
It is kind of common when the engine is long running (e.g., imagine "the").
What does 'imagine "the"' mean?
If the threshold is 0, this optimization is not applied to prefixes like "the", which are common.
Regarding the comment, I just meant that == 0 is not ideal because it misses cases like the "the" prefix.
32 is also an arbitrary value, actually; I didn't do much tuning here.
Makes sense
This looks great!
@merrymercy The current policy is that if we don't have more requests to schedule, it just schedules them together (instead of delaying), but @merrymercy pointed out it should still schedule one request and delay the others. We will revert this PR and merge it again with this fix.
Motivation
Implement in-batch prefix caching.
If a new request has only a short matching prefix in the existing cache, and more than one such request shares the same prefix, we deprioritize all of them except one.
The feature benefits workloads with many shared-prefix prompts. Here's an extreme example that benefits from this feature.
With this prompt, we can see a 2x+ perf improvement.
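As a hypothetical illustration of such a workload (the prompt below is made up, not the benchmark's actual prompt):

```python
# Hypothetical shared-prefix workload: 32 prompts that all start with the
# same long context and differ only in a short suffix.
shared_prefix = "You are a helpful assistant. Here is the document: ..." * 50
prompts = [shared_prefix + f"\nQuestion {i}: summarize section {i}." for i in range(32)]

# Without in-batch prefix caching, all 32 prompts can land in the same
# prefill batch, and each one recomputes the shared prefix.
# With delay scheduling, one prompt is prefilled first; once its KV cache
# is in the radix tree, the remaining 31 reuse it as prefix cache hits.
```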
I also added a `bench_prefix.py` script, originally written by @Ying1123. With prefix_len=1024, gen_len=128, and sampling_size=32, I could see a 10%+ performance improvement.
The feature adds slightly more overhead to `calc_priority` because it requires an additional radix cache operation. I believe it will be negligible with the overlap scheduler. I also verified there's no regression from `python -m sglang.bench_offline_throughput --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --num-prompts 10`, just in case.
Modifications
Checklist