
[Misc]: Very High GPU RX/TX using vllm #11760

Open · alexpong0630 opened this issue Jan 6, 2025 · 8 comments
alexpong0630 commented Jan 6, 2025

Anything you want to discuss about vllm.

I found a very large amount of data being transferred to the GPU when making a request with 10K tokens.
vLLM gives a very high TTFT compared to Ollama.

I don't think this is a normal data size for 10K tokens.

vllm version: v0.6.4.post1
Here is how I run vLLM:
vllm serve Qwen/Qwen2.5-32B-Instruct-AWQ --pipeline-parallel-size 2 --enable-auto-tool-choice --tool-call-parser hermes --gpu-memory-utilization 0.9 --max_model_len 32000 --max-num-seqs 5 --kv-cache-dtype fp8_e4m3

Here is my GPU receiving data (over 10 GiB/s RX):
[screenshot: GPU RX exceeding 10 GiB/s]

@vishalkumardas

@alexpong0630 You can set a parameter like gpu_memory_utilization in Python as below:

from vllm import LLM

engine = LLM(
    model=model_path,             # path to the model directory
    dtype="bfloat16",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.50,
    max_model_len=8192,
    enforce_eager=True,
    trust_remote_code=True,
)

I was also getting the same GPU consumption initially; after setting it, the consumption was reduced by half.

@alexpong0630 (Author)

@vishalkumardas your script limits the context length to 8K; that is why the consumption is reduced.

I just wonder how a 10K-token input turns into a data transfer so large that it causes over 10 GiB/s RX.

@vishalkumardas

@alexpong0630 I have explored it up to 25K tokens, and 25K tokens require at least 50% GPU memory utilization. For my use case 8K is enough, so I can set it to 35%; below that I get errors.

@alexpong0630 (Author)

@vishalkumardas actually my problem is not related to GPU utilization.

My question is about how the tokenizer turns a 10K-token input into a data transfer so large that it causes over 10 GiB/s RX per request.

[screenshot: per-request GPU RX/TX]
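For scale, a rough back-of-envelope sketch (my own arithmetic, not from vLLM; it assumes int32 token IDs, fp16 activations, and Qwen2.5-32B's hidden size of 5120) of how big "10K tokens" actually is:

# Back-of-envelope, not from vLLM: data sizes implied by a 10K-token prompt,
# assuming int32 token IDs and fp16 hidden states with hidden_size = 5120.
tokens = 10_000
token_id_bytes = tokens * 4                # ~0.04 MB of raw token IDs
hidden_state_bytes = tokens * 5120 * 2     # ~102 MB for one full hidden-state tensor
print(f"token IDs:     {token_id_bytes / 1e6:.2f} MB")
print(f"hidden states: {hidden_state_bytes / 1e6:.1f} MB")

Either number is far below 10 GiB, so the tokenized prompt by itself does not account for the observed RX.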

@noooop (Contributor) commented Jan 7, 2025

I think it triggered CPU offloading.

Try setting cpu_offload_gb=0 to turn off CPU offloading.
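A minimal sketch of where that knob sits, assuming the offline LLM API (the model name is taken from this thread; the rest of the setup is an assumption):

from vllm import LLM

# Minimal sketch (assumed setup): cpu_offload_gb=0 keeps all model weights
# in GPU memory, i.e. no weights are offloaded to CPU RAM.
llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct-AWQ",
    cpu_offload_gb=0,
)

The same setting maps to the --cpu-offload-gb 0 flag on vllm serve, as used in the command below.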

@alexpong0630 (Author)

@noooop Thanks for the advice.

I tried setting --cpu-offload-gb 0 and got the same result (very high RX per request).

~/llm/vllm$ vllm serve Qwen/Qwen2.5-32B-Instruct-AWQ --pipeline-parallel-size 2 --enable-auto-tool-choice --tool-call-parser hermes --gpu-memory-utilization 0.9 --max_model_len 32000 --max-num-seqs 5 --kv-cache-dtype fp8_e4m3 --cpu-offload-gb 0
INFO 01-07 14:56:07 api_server.py:585] vLLM API server version 0.6.4.post1
INFO 01-07 14:56:07 api_server.py:586] args: Namespace(subparser='serve', model_tag='Qwen/Qwen2.5-32B-Instruct-AWQ', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=[''], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=True, tool_call_parser='hermes', tool_parser_plugin='', model='Qwen/Qwen2.5-32B-Instruct-AWQ', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None,
tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='fp8_e4m3', quantization_param_path=None, max_model_len=32000, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=2, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0.0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=5, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function serve at 0x7d866b582c20>)
INFO 01-07 14:56:14 config.py:350] This model supports multiple tasks: {'generate', 'embedding'}. Defaulting to 'generate'.
WARNING 01-07 14:56:14 config.py:428] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 01-07 14:56:14 config.py:758] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
INFO 01-07 14:56:14 config.py:1020] Defaulting to use mp for distributed inference
WARNING 01-07 14:56:14 arg_utils.py:1075] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
WARNING 01-07 14:56:14 config.py:479] Async output processing can not be enabled with pipeline parallel
INFO 01-07 14:56:14 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='Qwen/Qwen2.5-32B-Instruct-AWQ', speculative_config=None, tokenizer='Qwen/Qwen2.5-32B-Instruct-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=2, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=fp8_e4m3, quantization_param_path=None, device_config=cuda,
decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2.5-32B-Instruct-AWQ, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=False, chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)
WARNING 01-07 14:56:15 multiproc_gpu_executor.py:56] Reducing Torch parallelism from 5 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 01-07 14:56:15 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO 01-07 14:56:15 selector.py:261] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 01-07 14:56:15 selector.py:144] Using XFormers backend.
(VllmWorkerProcess pid=12245) INFO 01-07 14:56:21 selector.py:261] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=12245) INFO 01-07 14:56:21 selector.py:144] Using XFormers backend.
(VllmWorkerProcess pid=12245) INFO 01-07 14:56:21 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 01-07 14:56:21 utils.py:961] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=12245) INFO 01-07 14:56:21 utils.py:961] Found nccl from library libnccl.so.2
INFO 01-07 14:56:21 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=12245) INFO 01-07 14:56:21 pynccl.py:69] vLLM is using nccl==2.21.5
INFO 01-07 14:56:21 model_runner.py:1072] Starting to load model Qwen/Qwen2.5-32B-Instruct-AWQ...
(VllmWorkerProcess pid=12245) INFO 01-07 14:56:21 model_runner.py:1072] Starting to load model Qwen/Qwen2.5-32B-Instruct-AWQ...
INFO 01-07 14:56:22 weight_utils.py:243] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=12245) INFO 01-07 14:56:22 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:03<00:15, 3.78s/it]
Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:07<00:11, 3.72s/it]
Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:07<00:04, 2.07s/it]
Loading safetensors checkpoint shards: 80% Completed | 4/5 [00:09<00:01, 1.99s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:09<00:00, 1.42s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:09<00:00, 1.97s/it]

INFO 01-07 14:56:33 model_runner.py:1077] Loading model weights took 10.5267 GB
(VllmWorkerProcess pid=12245) INFO 01-07 14:56:34 model_runner.py:1077] Loading model weights took 9.0756 GB
INFO 01-07 14:56:54 worker.py:232] Memory profiling results: total_gpu_memory=21.66GiB initial_memory_usage=10.96GiB peak_torch_memory=16.97GiB memory_usage_post_profile=10.96GiB non_torch_memory=0.43GiB kv_cache_size=2.10GiB gpu_memory_utilization=0.90
(VllmWorkerProcess pid=12245) INFO 01-07 14:56:54 worker.py:232] Memory profiling results: total_gpu_memory=21.66GiB initial_memory_usage=9.35GiB peak_torch_memory=15.82GiB memory_usage_post_profile=9.36GiB non_torch_memory=0.28GiB kv_cache_size=3.40GiB gpu_memory_utilization=0.90
INFO 01-07 14:56:55 distributed_gpu_executor.py:57] # GPU blocks: 2153, # CPU blocks: 4096
INFO 01-07 14:56:55 distributed_gpu_executor.py:61] Maximum concurrency for 32000 tokens per request: 1.08x
INFO 01-07 14:56:57 model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=12245) INFO 01-07 14:56:57 model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-07 14:56:57 model_runner.py:1404] If out-of-memory error occurs during cudagraph capture, consider decreasing gpu_memory_utilization or switching to eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
(VllmWorkerProcess pid=12245) INFO 01-07 14:56:57 model_runner.py:1404] If out-of-memory error occurs during cudagraph capture, consider decreasing gpu_memory_utilization or switching to eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
(VllmWorkerProcess pid=12245) INFO 01-07 14:57:00 model_runner.py:1518] Graph capturing finished in 3 secs, took 0.06 GiB
INFO 01-07 14:57:00 model_runner.py:1518] Graph capturing finished in 3 secs, took 0.06 GiB
INFO 01-07 14:57:00 serving_chat.py:70] "auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored.
INFO 01-07 14:57:00 launcher.py:19] Available routes are:
INFO 01-07 14:57:00 launcher.py:27] Route: /openapi.json, Methods: HEAD, GET
INFO 01-07 14:57:00 launcher.py:27] Route: /docs, Methods: HEAD, GET
INFO 01-07 14:57:00 launcher.py:27] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 01-07 14:57:00 launcher.py:27] Route: /redoc, Methods: HEAD, GET
INFO 01-07 14:57:00 launcher.py:27] Route: /health, Methods: GET
INFO 01-07 14:57:00 launcher.py:27] Route: /tokenize, Methods: POST
INFO 01-07 14:57:00 launcher.py:27] Route: /detokenize, Methods: POST
INFO 01-07 14:57:00 launcher.py:27] Route: /v1/models, Methods: GET
INFO 01-07 14:57:00 launcher.py:27] Route: /version, Methods: GET
INFO 01-07 14:57:00 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 01-07 14:57:00 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 01-07 14:57:00 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO: Started server process [12203]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

@noooop (Contributor) commented Jan 7, 2025

Try using chunked prefill. It may reduce the amount of data that needs to be transferred in each step.

You can reduce the following parameters as needed for your actual situation (see the sketch below):

enable_chunked_prefill=True
max_num_seqs=32
max_num_batched_tokens=2048  <- 2048 tokens per step is generally enough to saturate the GPU
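A minimal sketch of how these options could be passed, assuming the offline LLM API; on the vllm serve command line the same settings map to --enable-chunked-prefill, --max-num-seqs, and --max-num-batched-tokens.

from vllm import LLM

# Minimal sketch (assumed setup): chunked prefill splits a long prompt into
# smaller per-step token batches instead of one large prefill pass.
llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct-AWQ",
    enable_chunked_prefill=True,
    max_num_seqs=32,
    max_num_batched_tokens=2048,  # enough work per step to keep the GPU saturated
)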

@alexpong0630 (Author)

@noooop Thanks for the answer. After enabling chunked prefill it does reduce RX & TX, but it slightly increased the TTFT.

Actually I got the same result as this issue (10K tokens, around 13 seconds to the first token), whereas I get around 6-8 seconds for the same request on Ollama:
[Performance]: decoding speed on long context

Based on my observation, Ollama has much less data transfer over the whole generation process.
