[Feature] 4-bit quantized prefix cache #1374
Comments
@josephrocca Could you adjust the settings referenced in sglang/docs/en/hyperparameter_tuning.md, lines 28 to 32 (at commit dff2860)? This is a great feature and we welcome contributions on this.
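For readers hitting the same OOM: assuming the guidance referenced above concerns SGLang's `--mem-fraction-static` setting, a minimal sketch of passing a lower value through the Python entry point could look like this (the model path and the 0.7 value are illustrative assumptions, not values from this thread):

```python
# Sketch only: lowering SGLang's static KV-cache memory fraction, a common
# first knob when the server OOMs. The model path and the 0.7 value are
# assumed for illustration, not taken from this thread.
import sglang as sgl

runtime = sgl.Runtime(
    model_path="TheBloke/Llama-2-70B-Chat-AWQ",  # assumed AWQ checkpoint
    quantization="awq",
    mem_fraction_static=0.7,  # lower values leave more headroom for activations
)
# ... send requests against the runtime here ...
runtime.shutdown()
```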
To add to the OP's request: this feature would also enable a ~128K context window for 32B models on 24 GB cards (currently at around 20K).
See also #1459.
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.
Motivation
LMDeploy's 4-bit quantized prefix cache (along with 4-bit AWQ for the weights) allows running ~70B models in 48 GB of VRAM with good performance in many-user scenarios. The prefix cache can hold more than 40,000 context tokens.
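For reference, LMDeploy exposes this combination through its Python API; here is a minimal sketch, assuming a 4-bit AWQ checkpoint (the model id and cache fraction below are illustrative assumptions):

```python
# Minimal sketch: 4-bit AWQ weights plus 4-bit online KV/prefix-cache
# quantization in LMDeploy's TurboMind backend. The model id is an assumption.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig(
    model_format="awq",         # weights are a 4-bit AWQ checkpoint
    quant_policy=4,             # 4-bit online KV-cache quantization
    cache_max_entry_count=0.8,  # fraction of free GPU memory given to the KV cache
)
pipe = pipeline("TheBloke/Llama-2-70B-Chat-AWQ", backend_config=engine_config)
print(pipe(["Hello, my name is"]))
```

The same options are also exposed as `lmdeploy serve api_server` flags (`--model-format awq --quant-policy 4`), which is presumably how the docker image mentioned below would be run.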
This is very handy, since it's often easier to get a GPU (or a pair of GPUs) with 48 GB of VRAM than it is to get 80 GB+ GPUs.
Note that I've benchmarked the output quality/accuracy of the 4-bit prefix cache against no quantization, and there was no significant accuracy drop on my internal benchmarks. For my use case, at least, it's a free performance boost.
Today I wanted to compare SGLang's performance against LMDeploy's, but (for a 70B model on a 48 GB GPU) SGLang OOMs even with a small number of concurrent requests.
I'm testing with a Llama 2 AWQ model, with ~2K-token contexts and ~100-token outputs:
LMDeploy (handles 20 concurrent requests fine):
Using the latest (openmmlab/lmdeploy:v0.6.0a0-cu12) docker image on a 48 GB NVIDIA A40 GPU.
SGLang (OOM at >=4 concurrent requests):
Using the latest (lmsysorg/sglang:v0.3.0-cu121) docker image on a 48 GB NVIDIA A40 GPU.
For reference, here are some example OOM logs from SGLang that I'm seeing: https://gist.github.com/josephrocca/1c688e312f5d570ca9a4652485ff6a24
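Both images expose an OpenAI-compatible HTTP API, so the load pattern above can be reproduced with a small concurrent client along these lines (a sketch only; the URL, served model name, prompt, and concurrency are placeholder assumptions):

```python
# Rough sketch of the load pattern described above: N concurrent requests with
# ~2K-token prompts and ~100-token outputs against an OpenAI-compatible
# /v1/completions endpoint. URL, model name, and sizes are assumptions.
import concurrent.futures
import requests

URL = "http://localhost:8000/v1/completions"  # assumed port
PROMPT = "word " * 2000                        # stand-in for a ~2K-token prompt
CONCURRENCY = 20

def one_request(_):
    resp = requests.post(
        URL,
        json={
            "model": "llama-2-70b-awq",  # assumed served model name
            "prompt": PROMPT,
            "max_tokens": 100,
        },
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["usage"]["completion_tokens"]

with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    done = list(pool.map(one_request, range(CONCURRENCY)))
print(f"completed {len(done)} requests")
```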
It would be great if SGLang could become competitive with LMDeploy in this type of scenario, and I think it's hard to compete in many-user scenarios without a 4-bit quantized prefix cache.
Related resources
No response