[Feature] 4-bit quantized prefix cache #1374

Closed
josephrocca opened this issue Sep 10, 2024 · 5 comments
Labels: enhancement (New feature or request), inactive

Comments

@josephrocca
Contributor

josephrocca commented Sep 10, 2024

Motivation

LMDeploy's 4-bit quantized prefix cache (along with 4-bit AWQ for weights) allows running ~70B models on 48GB of GPU memory with good performance in many-user scenarios. The prefix cache can hold more than 40,000 context tokens.
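
To make the memory argument concrete, here is a rough back-of-the-envelope sketch of KV-cache size for a Llama-2-70B-style model. The layer count, KV-head count, and head dimension come from the public Llama 2 70B config, not from this issue, and the estimate ignores the extra scale/zero-point storage a real 4-bit cache needs, so treat the numbers as approximate.

```python
# Back-of-the-envelope KV-cache sizing for a Llama-2-70B-style model.
# Assumptions (not from this issue): architecture values taken from the
# public Llama 2 70B config; quantization scale/zero-point overhead ignored.

NUM_LAYERS = 80      # transformer layers
NUM_KV_HEADS = 8     # grouped-query attention KV heads
HEAD_DIM = 128       # dimension per head


def kv_bytes_per_token(bits_per_value: int) -> float:
    """Bytes of KV cache that one token occupies across all layers."""
    values_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM  # K and V
    return values_per_token * bits_per_value / 8


def cache_gib(num_tokens: int, bits_per_value: int) -> float:
    """Total KV-cache size in GiB for num_tokens cached tokens."""
    return num_tokens * kv_bytes_per_token(bits_per_value) / 2**30


if __name__ == "__main__":
    for bits in (16, 4):
        size = cache_gib(40_000, bits)
        print(f"{bits}-bit KV cache, 40,000 tokens: {size:.1f} GiB")
```

Under these assumptions, 40,000 cached tokens take roughly 12 GiB at 16 bits but only about 3 GiB at 4 bits, which is the difference that matters once 4-bit 70B weights already occupy most of a 48GB card.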

This is very handy, since it's often easier to get a GPU (or dual GPUs) with 48GB of memory than it is to get 80GB+ GPUs.

Note that I've benchmarked output quality/accuracy with the 4-bit prefix cache against no quantization, and there was no significant accuracy drop on my internal benchmarks. For my use case, at least, it's a free perf boost.

Today I wanted to try comparing SGLang performance to LMDeploy, but (for a 70B model on a 48GB GPU) SGLang OOMs with even a small number of concurrent requests.

I'm testing with a Llama 2 AWQ model with ~2k-token contexts and ~100-token outputs:

LMDeploy (handles 20 concurrent requests fine):

Using the latest (openmmlab/lmdeploy:v0.6.0a0-cu12) Docker image on a 48GB NVIDIA A40 GPU:

lmdeploy serve api_server lmdeploy/llama2-chat-70b-4bit --server-port 3000 --tp $(nvidia-smi -L | wc -l) --session-len 8192 --model-format awq --enable-prefix-caching --quant-policy 4 --log-level INFO

SGLang (OOM at >=4 concurrent requests):

Using the latest (lmsysorg/sglang:v0.3.0-cu121) Docker image on a 48GB NVIDIA A40 GPU:

python3 -m sglang.launch_server --model-path lmdeploy/llama2-chat-70b-4bit --context-length 8192 --host 0.0.0.0 --port 3000 --tp-size $(nvidia-smi -L | wc -l)

For reference, here are some example OOM logs from SGLang that I'm seeing: https://gist.github.com/josephrocca/1c688e312f5d570ca9a4652485ff6a24

It would be great if SGLang could become competitive with LMDeploy in this type of scenario, and I think it's hard to compete in a many-user scenario without a 4-bit quantized prefix cache.

Related resources

No response

@zhyncs
Member

zhyncs commented Sep 10, 2024

@merrymercy
Contributor

merrymercy commented Sep 10, 2024

This is a great feature and we welcome contributions on this.
For your OOM issue, can you try to tune some parameters?

### Avoid out-of-memory by tuning `--chunked-prefill-size`, `--mem-fraction-static`, `--max-running-requests`
If you see out of memory (OOM) errors, you can decrease these parameters.
If OOM happens during prefill, try to decrease `--chunked-prefill-size` to `4096` or `2048`.
If OOM happens during decoding, try to decrease `--max-running-requests`.
You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.
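
As a rough sketch of how the advice above could be applied to the launch command from the original post, the helper below assembles an SGLang command line with the three tuning flags set. The function name `build_sglang_command` and the default values for `--max-running-requests` and `--mem-fraction-static` are illustrative assumptions, not recommended settings; only `2048` for `--chunked-prefill-size` is taken from the text above.

```python
import shlex


def build_sglang_command(
    model_path: str,
    tp_size: int,
    chunked_prefill_size: int = 2048,   # value suggested in the text above
    max_running_requests: int = 8,      # illustrative assumption
    mem_fraction_static: float = 0.8,   # illustrative assumption
) -> str:
    """Return a shell-quoted SGLang launch command with OOM-related flags set."""
    args = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--context-length", "8192",
        "--host", "0.0.0.0",
        "--port", "3000",
        "--tp-size", str(tp_size),
        # Lower this if OOM happens during prefill.
        "--chunked-prefill-size", str(chunked_prefill_size),
        # Lower this if OOM happens during decoding.
        "--max-running-requests", str(max_running_requests),
        # Lower this to shrink the KV cache memory pool.
        "--mem-fraction-static", str(mem_fraction_static),
    ]
    return shlex.join(args)


if __name__ == "__main__":
    # Prints the command so it can be inspected or pasted into a shell.
    print(build_sglang_command("lmdeploy/llama2-chat-70b-4bit", tp_size=1))
```

Lowering one parameter at a time and re-running the same concurrent-request test makes it easier to tell whether the OOM comes from prefill or from decoding.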

merrymercy added the enhancement label on Sep 22, 2024

@smallstepman

smallstepman commented Sep 29, 2024

To add to the OP's request: this feature would also enable a ~128K context window for 32B models on 24GB cards (currently around 20K).

@merrymercy
Contributor

See also #1459.


github-actions bot commented Dec 6, 2024

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.
