Motivation.
KV cache compaction (i.e., token dropping) can significantly reduce the memory footprint in LLM serving, especially for long-generation and large-batch-size workloads. The plan is to support the latest KV compaction methods, such as FastGen and DuoAttention, and to provide a flexible interface for developers to add their own compaction methods.
Proposed Change.
To support KV cache compaction, we need to:
1. Expose intermediate logits (i.e., attention weights) from attention kernels, since many token-dropping decisions depend on attention weights (see the sketch after this list).
2. Support a free_and_reallocate functionality to reduce memory fragmentation after compaction. A workaround is to use block_size=1.
3. Support a non-uniform memory layout. Currently, vLLM assumes the KV cache has the same layout across all heads and layers, but some compaction methods drop different tokens (i.e., their KV cache) at different heads and layers.
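To make item 1 concrete, here is a rough, illustrative sketch (not an existing vLLM API; the function name and defaults are made up) of how a compaction policy could use the exposed attention weights: it combines an attention-sink window, a recent window, and H2O-style heavy-hitter scores to choose which KV-cache positions to retain for one head of one layer.

```python
import torch

def select_tokens_to_keep(attn_weights: torch.Tensor,
                          num_sink: int = 4,
                          num_recent: int = 128,
                          num_heavy: int = 128) -> torch.Tensor:
    """Pick which KV-cache positions to keep for one head of one layer.

    attn_weights: [num_queries, seq_len] softmax-normalized attention
    weights exposed by the attention kernel (the "intermediate logits"
    from item 1). Returns a sorted 1-D LongTensor of positions to
    retain; all other positions can be dropped from the KV cache.
    """
    seq_len = attn_weights.shape[-1]

    # H2O-style score: accumulated attention mass received by each key.
    scores = attn_weights.sum(dim=0)

    # Always keep the attention sinks (first tokens) and the recent window.
    keep = set(range(min(num_sink, seq_len)))
    keep.update(range(max(0, seq_len - num_recent), seq_len))

    # Among the remaining positions, keep the heaviest hitters.
    middle = [i for i in range(seq_len) if i not in keep]
    middle.sort(key=lambda i: scores[i].item(), reverse=True)
    keep.update(middle[:num_heavy])

    return torch.tensor(sorted(keep), dtype=torch.long)
```

A driver would call something like this every N decoding steps and then compact the freed slots, which is where the free_and_reallocate functionality from item 2 comes in.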
A prototype is available at https://github.com/LMCache/LMCache/blob/compaction/examples/compactor/README.md .
Feel free to share any thoughts and comments!
Feedback Period.
Several weeks.
CC List.
@simon-mo @KuntaiDu @comaniac @youkaichao
Any Other Things.
No response
Before submitting a new issue...
Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
On the research side, @lynnliu030 is experimenting with token drop's impact on memory allocation. I think we can revisit this around EOY and discuss the exact API change.
Can you also list some example KV compaction methods you have in mind? I guess you currently have attention sink and H2O, but are there any other types you expect to support, and how would they affect the design?
The key challenge for supporting more methods is on the memory management side. Both attention sink and H2O maintain the same number of tokens across all heads and layers. If we want to support more advanced methods, we need a more flexible memory layout.
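As a rough illustration of that layout difference (the class and field names below are hypothetical, not vLLM internals): attention sink and H2O can share one retained-token set for the whole cache, while per-head methods such as FastGen or DuoAttention need separate bookkeeping, and therefore separate block tables, per (layer, head).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class UniformKVLayout:
    # One retained-token set shared by every layer and head; this is all
    # that attention-sink or H2O-style compaction needs.
    kept_positions: List[int] = field(default_factory=list)

@dataclass
class PerHeadKVLayout:
    # A separate retained-token set per (layer, head). Per-head methods
    # keep different tokens in different heads and layers, so each pair
    # may hold a different number of tokens and needs its own block table.
    kept_positions: Dict[Tuple[int, int], List[int]] = field(default_factory=dict)

    def keep(self, layer: int, head: int, positions: List[int]) -> None:
        self.kept_positions[(layer, head)] = sorted(positions)

    def num_tokens(self, layer: int, head: int) -> int:
        return len(self.kept_positions.get((layer, head), []))
```

With the uniform layout every (layer, head) slot can use the same offsets; with the per-head layout the allocator has to handle ragged per-head sizes, which is what the non-uniform memory layout in item 3 of the proposal is about.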