[V1][Prefix Cache] Move the logic of num_computed_tokens into KVCacheManager #12003
Conversation
vllm/v1/core/kv_cache_manager.py (outdated)
```python
if len(block_hashes) * self.block_size == request.num_tokens:
    # When the prompt length is divisible by the block size and all
    # blocks are cached, we need to recompute the last token. This has
    # to be achieved by re-computing an entire block because
    # allocate_slots() assumes num_computed_tokens is always a multiple
    # of the block size. This limitation can potentially be removed in
    # the future to slightly improve performance. To achieve this, the
    # last block is removed from the computed block_hashes.
    block_hashes = ConstantList(block_hashes[:-1])
```
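To make the boundary condition in this comment concrete, here is a tiny standalone sketch (illustrative numbers only, not vLLM code):

```python
block_size = 16
num_prompt_tokens = 32        # divisible by block_size
num_cached_blocks = 2         # every prompt block is a cache hit

assert num_cached_blocks * block_size == num_prompt_tokens

# The model runner needs at least one prompt token left to compute in
# order to produce the first output token, so a 100% hit cannot be used
# as-is. Since allocate_slots() assumes num_computed_tokens is always a
# multiple of the block size, we cannot shave off a single token;
# instead the entire last block is treated as uncached and recomputed:
num_cached_blocks -= 1
num_computed_tokens = num_cached_blocks * block_size  # 16: recompute 16 tokens
```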
I was making a similar change, but @WoosukKwon prefers to have it in the scheduler, because this limitation is in the model runner rather than the KV cache manager.
Do you have any suggestions for supporting sliding window? I don't want to add complex special handling for sliding window to the scheduler.
What handling do you plan to add for sliding window in the scheduler? In general we should aim for modularization. For this particular logic, I think we could just do:
```python
class Scheduler:
    ...

    def maybe_recompute_last_block(self, computed_blocks, num_computed_tokens):
        ...

    def schedule(self):
        ...
        computed_blocks, num_computed_tokens = (
            self.kv_cache_manager.get_computed_blocks(...))
        computed_blocks, num_computed_tokens = self.maybe_recompute_last_block(
            computed_blocks, num_computed_tokens)
```
This interface is not very friendly to sliding window, since removing the last block requires redoing `get_computed_blocks`. It works for me to discuss which way is better after implementing `maybe_recompute_last_block` for sliding window. Is it OK to only change `computed_blocks = self.kv_cache_manager.get_computed_blocks(request)` to `computed_blocks, num_computed_tokens = self.kv_cache_manager.get_computed_blocks(request)` in this PR?
Sure, I'm OK with it.
@comaniac I've modified this PR to only change the return value of `get_computed_blocks`.
Otherwise LGTM
```python
def get_computed_blocks(
        self, request: Request) -> Tuple[List[KVCacheBlock], int]:
```
Please update the docstring about the return type.
Thanks for pointing it out. I've updated the docstring.
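For reference, the updated docstring might look roughly like this (a sketch; the exact wording in the PR may differ):

```python
def get_computed_blocks(
        self, request: Request) -> Tuple[List[KVCacheBlock], int]:
    """Get the computed (cached) blocks for the request.

    Note that the computed blocks must be a contiguous prefix of the
    request's prompt blocks.

    Args:
        request: The request to look up in the prefix cache.

    Returns:
        A tuple of:
            - The computed blocks.
            - The number of computed tokens.
    """
```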
This PR is a preparation for the hybrid memory allocator (Ref: #11382). It makes two moves:

- Move the logic for a full cache hit into `KVCacheManager`, to prepare for supporting sliding window. For sliding window, if the sliding window size is 2 and the `block_size` is 1, then when the computed blocks are `[1, 2, x, 4, 5]`, where `x` is an evicted block, `computed_blocks` should become `[1, 2]` instead of `[1, 2, x, 4]`. This can't be achieved by the current `computed_blocks.pop()`, and will be implemented by introducing a new HybridKVCacheManager in a following PR (see the first sketch after this list).
- Move `num_computed_tokens` into `KVCacheManager`. When there are multiple block tables (possibly with different `block_size` values, e.g. to support Jamba), `num_computed_tokens` will be more complex than `num_computed_tokens = len(computed_blocks) * self.block_size` (see the second sketch below).
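To make the sliding-window example concrete, here is a minimal standalone sketch (the helper name and the use of `None` to mark an evicted block are made up for illustration; this is not the planned HybridKVCacheManager API):

```python
from typing import List, Optional

def usable_computed_prefix(
        computed_blocks: List[Optional[int]]) -> List[Optional[int]]:
    """Keep only the contiguous cached prefix (None marks an evicted
    block). Once a block is missing, the later cached blocks cannot be
    counted as computed tokens of the prefix, so popping just the last
    block is not enough.
    """
    for i, block in enumerate(computed_blocks):
        if block is None:
            return computed_blocks[:i]
    return computed_blocks

# Sliding window size 2, block_size 1, block x evicted:
assert usable_computed_prefix([1, 2, None, 4, 5]) == [1, 2]
```

Similarly, a hedged sketch of why `num_computed_tokens` gets more involved with multiple block tables (all names and numbers here are hypothetical):

```python
# Hypothetical: two KV cache groups (e.g. full-attention layers and
# sliding-window layers in Jamba), each with its own block_size.
computed_blocks_per_group = [[0, 1, 2, 3], [10]]  # cached block IDs per group
block_sizes_per_group = [16, 32]

# The overall prefix-cache hit length is bounded by the shortest hit
# across groups, rather than len(computed_blocks) * block_size:
num_computed_tokens = min(
    len(blocks) * size
    for blocks, size in zip(computed_blocks_per_group, block_sizes_per_group)
)
assert num_computed_tokens == 32  # min(4 * 16, 1 * 32)
```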
CC @comaniac