[Feature] Support for Evicting Specific KV Cache to Save GPU Memory #2510

Open
2 tasks done
ChenlongDeng opened this issue Dec 18, 2024 · 0 comments

Checklist

Motivation

Hi, congratulations on the amazing work!

I’d like to know if there is currently a feature that allows evicting specific parts of the KV cache (i.e., KV cache of some tokens) to save GPU memory. This capability is becoming increasingly important for many use cases involving KV cache compression, such as in methods like StreamingLLM and H2O.
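To make the request concrete, here is a minimal sketch of the kind of token-level eviction I have in mind, using a StreamingLLM-style policy (keep the first few "attention sink" tokens plus a window of the most recent tokens). This is not SGLang code: the function names `streaming_keep_indices` and `evict`, the parameters `n_sink` and `window`, and the use of plain Python lists in place of per-token KV tensors are all assumptions made for illustration.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def streaming_keep_indices(seq_len: int, n_sink: int = 4, window: int = 8) -> List[int]:
    """Return the token positions to retain under a sink + recent-window policy.

    Hypothetical helper: keeps the first `n_sink` positions and the last
    `window` positions; everything in between is a candidate for eviction.
    """
    if seq_len <= n_sink + window:
        # The cache still fits in the budget, so nothing is evicted.
        return list(range(seq_len))
    return list(range(n_sink)) + list(range(seq_len - window, seq_len))


def evict(kv_entries: Sequence[T], keep: Sequence[int]) -> List[T]:
    """Build a compacted cache containing only the kept per-token entries.

    `kv_entries` stands in for per-token KV slots; in a real engine the
    slots that are not kept would be returned to the GPU memory pool.
    """
    return [kv_entries[i] for i in keep]


# Toy usage: 20 cached tokens, budget of 2 sink tokens + 3 recent tokens.
cache = [f"kv_{i}" for i in range(20)]
keep = streaming_keep_indices(len(cache), n_sink=2, window=3)
compact = evict(cache, keep)
```

The point of the sketch is the memory side: after eviction only `n_sink + window` entries remain allocated, whereas a sparse-attention method could skip the other entries during computation while still holding them in GPU memory.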

I noticed that a similar issue was raised previously and was addressed by the introduction of DoubleSparse (#1347, #1459). While DoubleSparse does reduce the computational cost of attention, it doesn’t seem to explicitly support evicting specific parts of the KV cache from GPU memory.

I’m curious if such functionality is achievable within the current design of SGLang. If not, are there any plans to support this feature in the future?

Thank you!

StreamingLLM: https://arxiv.org/abs/2309.17453
H2O: https://arxiv.org/abs/2306.14048
DoubleSparsity: https://arxiv.org/abs/2408.07092

Related resources

No response
