[Feature] Support for Evicting Specific KV Cache to Save GPU Memory #2510

ChenlongDeng · 2024-12-18T11:54:28Z

Checklist

1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
2. Please use English, otherwise it will be closed.

Motivation

Hi, congratulations on the amazing work!

I’d like to know if there is currently a feature that allows evicting specific parts of the KV cache (i.e., KV cache of some tokens) to save GPU memory. This capability is becoming increasingly important for many use cases involving KV cache compression, such as in methods like StreamingLLM and H2O.
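To make the request concrete, here is a minimal sketch of the kind of token-level eviction I have in mind, modeled on a StreamingLLM-style policy that keeps a few initial "sink" tokens plus a window of the most recent tokens. Everything here is hypothetical: the function names, the `num_sink` and `window` parameters, and the list-of-pairs cache layout are illustrative assumptions, not SGLang APIs or its actual memory layout.

```python
def select_tokens_to_keep(seq_len, num_sink, window):
    """Split token positions into (keep, evict) for a sink + recent-window policy.

    Hypothetical helper: keeps the first `num_sink` positions and the last
    `window` positions, and marks everything in between for eviction.
    """
    if num_sink < 0 or window < 0:
        raise ValueError("num_sink and window must be non-negative")
    if seq_len <= num_sink + window:
        # Nothing to evict yet: the whole sequence fits in the budget.
        return list(range(seq_len)), []
    keep = list(range(num_sink)) + list(range(seq_len - window, seq_len))
    evict = list(range(num_sink, seq_len - window))
    return keep, evict


def evict_kv(kv_cache, keep):
    """Return a compacted cache holding only the kept token entries.

    `kv_cache` is assumed to be a list of per-token (key, value) entries;
    a real engine would instead free the corresponding GPU memory slots.
    """
    return [kv_cache[i] for i in keep]


# Toy cache with 20 tokens; keep 2 sink tokens and the 4 most recent tokens.
cache = [(f"k{i}", f"v{i}") for i in range(20)]
keep, evict = select_tokens_to_keep(len(cache), num_sink=2, window=4)
compacted = evict_kv(cache, keep)
print(keep)            # positions retained
print(len(evict))      # number of positions freed
print(len(compacted))  # size of the cache after eviction
```

The important part for this request is the second step: the evicted positions should actually be released from GPU memory, not just skipped during attention.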

I noticed that a similar issue was raised previously and was addressed by the introduction of DoubleSparse (#1347, #1459). While DoubleSparse does reduce the computational cost of attention, it doesn’t seem to explicitly support evicting specific parts of the KV cache from GPU memory.

I’m curious if such functionality is achievable within the current design of SGLang. If not, are there any plans to support this feature in the future?

Thank you!

StreamingLLM: https://arxiv.org/abs/2309.17453
H2O: https://arxiv.org/abs/2306.14048
DoubleSparsity: https://arxiv.org/abs/2408.07092

Related resources

No response

shadowpa0327 mentioned this issue Jan 17, 2025

[Feature] Enhancement on Sparse Attention and KV-Cache Compression #2946

Open

2 tasks
