[Misc]: Very High GPU RX/TX using vllm #11760
Comments
@alexpong0630 You can set a parameter like gpu_memory_utilization in Python as below. I was also getting the same GPU consumption initially; after setting it, it was reduced by half.
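(The snippet referenced above isn't preserved here; a minimal sketch of that kind of setup, assuming the Qwen model from this thread and the 8k / 35% values mentioned below, might look like this.)

```python
# Minimal sketch (not the original snippet): offline LLM with a capped
# context length and a reduced GPU memory fraction; the 0.35 / 8192 values
# echo the settings discussed in this thread and are only illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct-AWQ",  # model used in this thread
    gpu_memory_utilization=0.35,            # fraction of GPU memory vLLM may reserve
    max_model_len=8192,                     # cap the context length at 8k tokens
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```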
@vishalkumardas your script limited the context length to 8k, which is why the consumption dropped. I am just wondering how 10k input tokens turn into such a large transfer that it causes over 10 GiB/s RX.
@alexpong0630 I have explored it up to 25k tokens, and 25k tokens require at least 50% GPU utilization. For my use case 8k is enough, so I can set it to 35%; below that I get errors.
@vishalkumardas Actually, my problem is not related to GPU utilization. My question is how a 10k-token input becomes something so large that it causes over 10 GiB/s RX per request.
I think it triggered CPU offloading. Try setting cpu_offload_gb=0 to turn off CPU offloading.
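For reference, a sketch of that setting through the Python engine arguments (the CLI form, --cpu-offload-gb 0, appears in the reply below):

```python
# Sketch: disable CPU offloading so all weights stay in GPU memory
# (equivalent to passing --cpu-offload-gb 0 to `vllm serve`).
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct-AWQ",
    cpu_offload_gb=0,  # GiB of weights to offload to CPU; 0 disables offloading
)
```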
@noooop Thanks for the advice. I tried setting --cpu-offload-gb 0 and got the same result (very high RX per request):

~/llm/vllm$ vllm serve Qwen/Qwen2.5-32B-Instruct-AWQ --pipeline-parallel-size 2 --enable-auto-tool-choice --tool-call-parser hermes --gpu-memory-utilization 0.9 --max_model_len 32000 --max-num-seqs 5 --kv-cache-dtype fp8_e4m3 --cpu-offload-gb 0

INFO 01-07 14:56:33 model_runner.py:1077] Loading model weights took 10.5267 GB
Try using chunked prefill; you can reduce the related parameters according to your actual situation: enable_chunked_prefill = True
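A sketch of what that could look like through the Python engine arguments; the max_num_batched_tokens value is an illustrative assumption, and the CLI equivalents are --enable-chunked-prefill and --max-num-batched-tokens:

```python
# Sketch: enable chunked prefill so long prompts are processed in smaller
# batches instead of one large prefill step.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct-AWQ",
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,  # illustrative per-step token budget; tune to the workload
)
```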
@noooop Thanks for the answer. After enabling chunked prefill, it does reduce RX & TX, but it slightly increased the TTFT. I actually got the same result as this guy (10K tokens, around 13 seconds to the first token). Based on my observation, Ollama has much less data transfer for the whole generation process.
Anything you want to discuss about vllm.
I found there is a very large amount of data transferred to the GPU when making a request with 10K tokens.
vLLM shows a very high TTFT compared to Ollama.
I don't think this is a normal data size for 10K tokens.
vllm version: v0.6.4.post1
Here is how I run vllm:
vllm serve Qwen/Qwen2.5-32B-Instruct-AWQ --pipeline-parallel-size 2 --enable-auto-tool-choice --tool-call-parser hermes --gpu-memory-utilization 0.9 --max_model_len 32000 --max-num-seqs 5 --kv-cache-dtype fp8_e4m3
Here is my GPU receiving data (over 10 GiB/s RX):