
[RFC]: Support KV Cache Compaction #10646

Open · YaoJiayi opened this issue Nov 25, 2024 · 3 comments

@YaoJiayi

Motivation.

KV cache compaction (i.e., token dropping) can significantly reduce the memory footprint of LLM serving, especially for long-generation and large-batch workloads. The plan is to support recent KV compaction methods such as FastGen and DuoAttention, and to provide a flexible interface for developers to add their own compaction methods.

Proposed Change.

To support KV cache compaction, we need to:

  1. Expose intermediate logits (i.e., attention weights) from attention kernels, since many token-dropping decisions depend on the attention weights (see the sketch after this list).
  2. Support a free_and_reallocate functionality to reduce memory fragmentation after compaction. A workaround is to use block_size=1.
  3. Support a non-uniform memory layout. Currently, vLLM assumes the KV cache has the same layout across different heads and layers, but some compaction methods require dropping different tokens (i.e., their KV entries) at different heads and layers.
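
To make the interface point concrete, here is a minimal sketch of what a pluggable policy might look like. All names and signatures below are hypothetical, not existing vLLM APIs: the policy consumes the attention weights exposed by the kernel (item 1) and returns per-head eviction decisions, which is where the layout question in item 3 comes from.

```python
# Hypothetical sketch only -- not an existing vLLM API.
from abc import ABC, abstractmethod

import torch


class KVCompactionPolicy(ABC):
    """Decides which cached tokens to drop after a decoding step."""

    @abstractmethod
    def tokens_to_drop(
        self,
        attn_weights: torch.Tensor,  # [num_heads, q_len, kv_len]
        layer_idx: int,
    ) -> list[torch.Tensor]:
        """Return, for each head, the KV positions to evict."""


class H2OPolicy(KVCompactionPolicy):
    """Heavy-Hitter Oracle: keep the attention 'sink' tokens plus the
    tokens with the highest accumulated attention mass; drop the rest.
    The budget is the same for every head, so the number of kept tokens
    stays uniform even though the kept indices differ per head."""

    def __init__(self, num_sink: int = 4, budget: int = 256):
        self.num_sink = num_sink
        self.budget = budget

    def tokens_to_drop(self, attn_weights, layer_idx):
        num_heads, _, kv_len = attn_weights.shape
        if kv_len <= self.num_sink + self.budget:
            return [torch.empty(0, dtype=torch.long) for _ in range(num_heads)]
        # Total attention mass each KV position received, per head.
        scores = attn_weights.sum(dim=1)           # [num_heads, kv_len]
        scores[:, : self.num_sink] = float("inf")  # always keep the sinks
        drops = []
        for head_scores in scores:
            keep = torch.topk(head_scores, self.num_sink + self.budget).indices
            mask = torch.ones(kv_len, dtype=torch.bool)
            mask[keep] = False
            drops.append(mask.nonzero(as_tuple=True)[0])
        return drops
```

A DuoAttention-style policy would instead return different budgets per head, which is exactly what forces the non-uniform layout in item 3.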

A prototype is available at https://github.com/LMCache/LMCache/blob/compaction/examples/compactor/README.md.
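
On item 2, here is a toy sketch (hypothetical structures, not vLLM's actual allocator) of what a free_and_reallocate step could do at the block-table level: repack the slots that survive compaction into contiguous positions so that whole blocks can be returned to the free pool, instead of leaking partially used blocks.

```python
# Hypothetical sketch of "free_and_reallocate" at the block-table level.
# Names and structures are illustrative, not vLLM's actual allocator.
BLOCK_SIZE = 16


def free_and_reallocate(kept_slots: list[int]) -> tuple[dict[int, int], int]:
    """Map each surviving physical slot to a new, densely packed slot.

    Returns the old->new slot mapping and the number of blocks still needed.
    """
    mapping = {old: new for new, old in enumerate(sorted(kept_slots))}
    blocks_needed = -(-len(kept_slots) // BLOCK_SIZE)  # ceil division
    return mapping, blocks_needed


# Example: 40 tokens spread over 3 blocks; compaction keeps 18 of them.
kept = [0, 1, 2, 3, 17, 18, 19, 20, 21, 22, 25, 33, 34, 35, 36, 37, 38, 39]
mapping, blocks = free_and_reallocate(kept)
print(blocks)  # 2 -- one whole block is returned to the free pool
```

With block_size=1 every slot is its own block, so freed slots return to the pool immediately; the cost is a much larger block table, which is why a real free_and_reallocate is preferable.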

Feel free to share any thoughts and comments!

Feedback Period.

Several weeks.

CC List.

@simon-mo @KuntaiDu @comaniac @youkaichao

Any Other Things.

No response

YaoJiayi added the RFC label on Nov 25, 2024
@KuntaiDu (Collaborator)

@heheda12345 @ywang96 This seems to be quite related to your new memory allocator.

@simon-mo (Collaborator)

On the research side, @lynnliu030 is experimenting with token drop's impact on memory allocation. I think we can revisit this around EOY and discuss the exact API change.

Can you also list some example KV compaction methods you have in mind? I guess you currently have attention sink and H2O, but are there other types you expect to support, and how would they affect the design?

@YaoJiayi (Author)

The key challenge in supporting more methods is on the memory-management side. Both attention sink and H2O maintain the same number of tokens across different heads and layers. To support more advanced methods, we need a more flexible memory layout.
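
To make the layout point concrete, here is a toy illustration (shapes are hypothetical, not vLLM internals) of why per-head budgets break the dense per-layer cache tensor:

```python
import torch

# With uniform compaction (attention sink, H2O), every head keeps the same
# number of tokens, so the per-layer cache stays one dense tensor of shape
# [2, num_heads, kept_tokens, head_dim]. With per-head budgets (e.g., a
# DuoAttention-style split between retrieval and streaming heads), the
# cache becomes ragged; padding it back into a dense tensor wastes exactly
# the memory that compaction was supposed to save.
num_heads, head_dim = 4, 64
per_head_budget = [1024, 1024, 128, 128]  # ragged per-head token counts

# Ragged storage: one K and one V tensor per head instead of a dense tensor.
k_cache = [torch.empty(budget, head_dim) for budget in per_head_budget]
v_cache = [torch.empty(budget, head_dim) for budget in per_head_budget]

dense_elems = num_heads * max(per_head_budget) * head_dim
ragged_elems = sum(per_head_budget) * head_dim
print(f"padding overhead: {dense_elems / ragged_elems:.2f}x")  # ~1.78x
```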
