Motivation.
KV cache compaction (i.e., token dropping) can significantly reduce the memory footprint in LLM serving, especially for long-generation and large-batch-size workloads. The plan is to support the latest KV compaction methods, such as FastGen and DuoAttention, and to provide a flexible interface for developers to add their own compaction methods.
Proposed Change.
To support KV cache compaction, we need to:
1. Expose intermediate logits (i.e., attention weights) from attention kernels, since many token-dropping decisions depend on attention weights (see the sketch after this list).
2. Support a free_and_reallocate functionality to reduce memory fragmentation after compaction. A workaround is to use block_size=1.
3. Support a non-uniform memory layout. Currently, vLLM assumes the KV cache has the same layout across all heads and layers, but some compaction methods drop different tokens (i.e., their KV cache) at different heads and layers.
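To make item 1 concrete, here is a rough, illustrative sketch (not an existing vLLM API; the function name and defaults are made up) of how a compaction policy could use the exposed attention weights: it combines an attention-sink window, a recent window, and H2O-style heavy-hitter scores to choose which KV-cache positions to retain for one head of one layer.

```python
import torch

def select_tokens_to_keep(attn_weights: torch.Tensor,
                          num_sink: int = 4,
                          num_recent: int = 128,
                          num_heavy: int = 128) -> torch.Tensor:
    """Pick which KV-cache positions to keep for one head of one layer.

    attn_weights: [num_queries, seq_len] softmax-normalized attention
    weights exposed by the attention kernel (the "intermediate logits"
    from item 1). Returns a sorted 1-D LongTensor of positions to
    retain; all other positions can be dropped from the KV cache.
    """
    seq_len = attn_weights.shape[-1]

    # H2O-style score: accumulated attention mass received by each key.
    scores = attn_weights.sum(dim=0)

    # Always keep the attention sinks (first tokens) and the recent window.
    keep = set(range(min(num_sink, seq_len)))
    keep.update(range(max(0, seq_len - num_recent), seq_len))

    # Among the remaining positions, keep the heaviest hitters.
    middle = [i for i in range(seq_len) if i not in keep]
    middle.sort(key=lambda i: scores[i].item(), reverse=True)
    keep.update(middle[:num_heavy])

    return torch.tensor(sorted(keep), dtype=torch.long)
```

A driver would call something like this every N decoding steps and then compact the freed slots, which is where the free_and_reallocate functionality from item 2 comes in.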
A prototype is available at https://github.com/LMCache/LMCache/blob/compaction/examples/compactor/README.md .
Feel free to share any thoughts and comments!
Feedback Period.
Several weeks.
CC List.
@simon-mo @KuntaiDu @comaniac @youkaichao
Any Other Things.
No response
Before submitting a new issue...
Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
On the research side, @lynnliu030 is experimenting with token drop's impact on memory allocation. I think we can revisit this around EOY and discuss the exact API change.
Can you also list some example KV compaction methods you have in mind? I guess you currently have attention sink and H2O, but are there any other types you expect to support, and how would they affect the design?
The key challenge for supporting more methods is on the memory management side. Both attention sink and H2O maintain the same number of tokens across all heads and layers. If we want to support more advanced methods, we need a more flexible memory layout.
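As a rough illustration of that layout difference (the class and field names below are hypothetical, not vLLM internals): attention sink and H2O can share one retained-token set for the whole cache, while per-head methods such as FastGen or DuoAttention need separate bookkeeping, and therefore separate block tables, per (layer, head).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class UniformKVLayout:
    # One retained-token set shared by every layer and head; this is all
    # that attention-sink or H2O-style compaction needs.
    kept_positions: List[int] = field(default_factory=list)

@dataclass
class PerHeadKVLayout:
    # A separate retained-token set per (layer, head). Per-head methods
    # keep different tokens in different heads and layers, so each pair
    # may hold a different number of tokens and needs its own block table.
    kept_positions: Dict[Tuple[int, int], List[int]] = field(default_factory=dict)

    def keep(self, layer: int, head: int, positions: List[int]) -> None:
        self.kept_positions[(layer, head)] = sorted(positions)

    def num_tokens(self, layer: int, head: int) -> int:
        return len(self.kept_positions.get((layer, head), []))
```

With the uniform layout every (layer, head) slot can use the same offsets; with the per-head layout the allocator has to handle ragged per-head sizes, which is what the non-uniform memory layout in item 3 of the proposal is about.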