[Performance]: reproducing vLLM performance benchmark #8176

Closed
1 task done
KuntaiDu opened this issue Sep 5, 2024 · 8 comments
Labels
performance Performance-related issues stale

Comments

@KuntaiDu
Collaborator

KuntaiDu commented Sep 5, 2024

Proposal to improve performance

No response

Report of performance regression

No response

Misc discussion on performance

To reproduce vLLM's performance benchmark, please launch a shell in the following docker images:

  • SGLang: lmsysorg/sglang:v0.3.0-cu124
  • lmdeploy: openmmlab/lmdeploy:v0.6.0a0-cu12
  • TensorRT-LLM: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
  • vLLM: vllm/vllm-openai:v0.6.0

Then run the following bash script (don't forget to replace the placeholder with a Hugging Face token that has Llama-3 model access):

export HF_TOKEN=<your HF TOKEN>
apt update
apt install -y wget unzip 
# download benchmarking code
wget -O benchmarking_code.zip https://buildkite.com/organizations/vllm/pipelines/performance-benchmark/builds/8532/jobs/0191bbbf-c603-4c15-9f5d-e0b2933ba097/artifacts/0191bd2a-d6cd-4f6d-b618-a7aa2c39456c
unzip benchmarking_code.zip
# remove previous results, if any (-f avoids an error when the directory does not exist)
rm -rf ./benchmarks/results
VLLM_SOURCE_CODE_LOC=$(pwd) bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh

Your benchmarking results will be written to ./benchmarks/results as files named xxx_nightly_results.json. They can be loaded and converted to a pandas DataFrame with pandas.DataFrame.from_dict(). Each benchmark run takes roughly 1 hour 10 minutes, assuming the model weights are already downloaded (and roughly 1 hour 30 minutes for TensorRT-LLM, as it also needs to convert the model into a Triton inference engine).
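As a rough illustration of the loading step, here is a minimal sketch. The helper names `read_nightly_results` and `to_dataframes` are made up for this example, and the exact layout of the JSON files is an assumption: the sketch only relies on each file holding something that `pandas.DataFrame.from_dict()` accepts.

```python
import json
from pathlib import Path


def read_nightly_results(results_dir="./benchmarks/results"):
    """Return {file stem: parsed JSON} for every *_nightly_results.json file."""
    parsed = {}
    for path in sorted(Path(results_dir).glob("*_nightly_results.json")):
        with path.open() as f:
            parsed[path.stem] = json.load(f)
    return parsed


def to_dataframes(parsed):
    """Convert each parsed JSON payload into a pandas DataFrame."""
    # Imported here so that reading the files works even without pandas.
    import pandas as pd

    # Assumption: every payload has a shape DataFrame.from_dict() understands.
    return {name: pd.DataFrame.from_dict(data) for name, data in parsed.items()}
```

Calling `to_dataframes(read_nightly_results())` from the directory where the benchmark ran should then give one DataFrame per result file, keyed by file name.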

When you run the H100 benchmark inside the TensorRT-LLM docker container, you may run into a memory leak (issue link). In that case, please add the following code

      # temporary fix for trt
      kill_gpu_processes
      bash -c "python3 /tensorrtllm_backend/scripts/launch_triton_server.py \
              --world_size=${tp} \
              --model_repo=/tensorrtllm_backend/triton_model_repo & " </dev/null >/dev/null 2>&1 &
      wait_for_server

to Line 211 (right after the for loop) in ./.buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh to force TensorRT-LLM to restart the server more often.

Known issue:

  • Across serving engines, the number of output tokens does not strictly align, even after setting ignore_eos or max_length, because these two flags are implemented imperfectly in different engines. That said, the number of tokens generated by vLLM is roughly aligned with the other engines, as all engines perform greedy sampling with the same model.

Your current environment (if you think it is necessary)

The output of `python collect_env.py`

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@KuntaiDu KuntaiDu added the performance Performance-related issues label Sep 5, 2024
@zhyncs
Contributor

zhyncs commented Sep 5, 2024

Hi all @WoosukKwon @zhuohan123 @KuntaiDu @alexm-neuralmagic cc @merrymercy @Ying1123 @hnyls2002

First of all, congratulations to vLLM on the improvement in offline throughput over the past month. However, there are some confusing points and errors in this blog post, which I have pointed out in this document.

We reproduced the benchmark results for SGLang v0.3.0 compared to vLLM v0.6.0.
In short, with multi-step enabled in online scenarios, the median TTFT of vLLM is 3 times that of SGLang, and the median ITL is 10 times that of SGLang (lower is better for both). Also, at maximum throughput, if vLLM uses its default configuration instead of separately setting GPU memory utilization to 0.95, its maximum throughput is lower than that of SGLang. vLLM's multi-step optimization did not improve throughput while also keeping median TTFT and ITL low.
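For readers unfamiliar with the two metrics, here is a minimal sketch of how median TTFT and median ITL can be computed from client-side timestamps. It assumes each request is recorded as its send time plus the arrival times of its streamed chunks; the function names are illustrative, and the actual benchmark scripts may measure things differently.

```python
from statistics import median


def ttft(request_sent, chunk_arrivals):
    """Time to first token: delay from sending the request to the first chunk."""
    return chunk_arrivals[0] - request_sent


def itls(chunk_arrivals):
    """Inter-token latencies, measured as gaps between consecutive chunks."""
    return [b - a for a, b in zip(chunk_arrivals, chunk_arrivals[1:])]


def summarize(requests):
    """requests: list of (request_sent, chunk_arrivals) tuples, times in seconds."""
    all_ttft = [ttft(sent, arrivals) for sent, arrivals in requests]
    all_itl = [gap for _, arrivals in requests for gap in itls(arrivals)]
    return {"median_ttft": median(all_ttft), "median_itl": median(all_itl)}
```

Note that the gaps are taken between chunks, so if a server packs several tokens into one chunk, the measured ITL reflects the chunk interval rather than the per-token interval.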

ref https://x.com/zhyncs42/status/1831754352278839778

https://github.com/sgl-project/sglang/blob/main/benchmark/benchmark_vllm_060/README.md


@KuntaiDu
Collaborator Author

KuntaiDu commented Sep 5, 2024

AFAIK, the current multi-step scheduler sends multiple tokens inside one network packet. As the benchmark uses 10 steps, this inflates both TTFT (the first token needs to wait) and ITL (ITL only measures inter-packet latency in the current vLLM implementation, so it will be 10x). That said, to me this inflation is not fundamental and can be reduced by, for example, streaming output token by token.

@zhyncs
Contributor

zhyncs commented Sep 5, 2024

@KuntaiDu Since each packet carries 10 tokens at once, adding a streaming simulator to output tokens sequentially, or introducing an initial delay for the first chunk, would significantly raise inter-token latency or TTFT. The crucial point is that vLLM processes tokens in chunks of ten, rather than generating them individually.

@KuntaiDu
Collaborator Author

KuntaiDu commented Sep 5, 2024

IIUC, vLLM currently still generates tokens one by one (the scheduling algorithm runs once per 10 steps, unless a new request arrives) but streams out 10 tokens together. I expect vLLM to stream out tokens one by one in the near future, and both TTFT and ITL will be reduced after that.
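To make the arithmetic concrete, here is a toy model of that behavior, not vLLM's actual scheduler. It assumes one token is produced per decode step at a fixed, made-up step time, ignores prefill and queuing, and only varies how many tokens are packed into each streamed chunk.

```python
from statistics import median


def chunk_arrivals(num_tokens, step_time, tokens_per_chunk):
    """Arrival time of each streamed chunk when one token is generated per step.

    Token i (1-based) is ready at i * step_time. A chunk is sent once it holds
    tokens_per_chunk tokens, or when generation ends.
    """
    arrivals = []
    for i in range(1, num_tokens + 1):
        if i % tokens_per_chunk == 0 or i == num_tokens:
            arrivals.append(i * step_time)
    return arrivals


def observed_latency(num_tokens, step_time, tokens_per_chunk):
    """TTFT and median inter-chunk latency as seen by a streaming client."""
    arrivals = chunk_arrivals(num_tokens, step_time, tokens_per_chunk)
    gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
    return {
        "ttft": arrivals[0],
        "median_itl": median(gaps) if gaps else None,
    }


# Hypothetical numbers: 100 output tokens, 20 ms per decode step.
per_token = observed_latency(100, 20, tokens_per_chunk=1)
per_ten = observed_latency(100, 20, tokens_per_chunk=10)
```

With these numbers the client sees 20 ms gaps when tokens are streamed individually and 200 ms gaps when they are streamed ten at a time, even though the generation speed is identical. In a real deployment TTFT also includes prefill, so its inflation would be smaller than the factor this toy model gives.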

@zhyncs
Contributor

zhyncs commented Sep 5, 2024

@KuntaiDu We are discussing the current situation here, and right now the ITL is very high. For an explanation of ITL, you can refer to sgl-project/sglang#1340 (comment). In this scenario, the high ITL introduces choppiness during online serving, leading to a degraded user experience. By the way, I am not challenging vLLM. On the contrary, I greatly appreciate a lot of the work done by vLLM, and I have always found the committers of vLLM such as @robertgshaw2-neuralmagic @ywang96 to be very open-minded. Looking forward to vLLM’s future improvements. Cheers.

@yao-matrix

Since most of the optimizations de-bottleneck the CPU, I suppose the CPU information for the benchmark will be important for us to reproduce and analyze the results. Could you add your CPU info alongside the GPU info for your data? Thanks.

@KuntaiDu
Collaborator Author

> Since most of the optimizations de-bottleneck the CPU, I suppose the CPU information for the benchmark will be important for us to reproduce and analyze the results. Could you add your CPU info alongside the GPU info for your data? Thanks.

We run the A100 benchmark on the vLLM CI platform (an AWS 8x A100 instance) and the H100 benchmark on the mosaic platform. However, as these instances come from cloud platforms and different machines have different CPUs (even with the same spec), I don't have an accurate CPU spec at hand.
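For anyone reproducing the numbers, one option is to record the CPU of the machine alongside each run. Below is a minimal sketch, assuming a Linux host that exposes /proc/cpuinfo; the function names and the returned keys are illustrative and not part of the benchmark scripts.

```python
import os
import platform


def parse_cpu_model(cpuinfo_text):
    """Return the first 'model name' value from /proc/cpuinfo text, or None."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model name":
            return value.strip()
    return None


def collect_cpu_info(cpuinfo_path="/proc/cpuinfo"):
    """Gather basic CPU facts to store next to the benchmark results."""
    try:
        with open(cpuinfo_path) as f:
            model = parse_cpu_model(f.read())
    except OSError:
        model = None  # not Linux, or /proc is unavailable
    return {
        "model": model,
        "logical_cores": os.cpu_count(),
        "architecture": platform.machine(),
    }
```

Saving the output of `collect_cpu_info()` with each result file would make it possible to tell apart runs that landed on different CPUs despite the same instance type.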


This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
