LLAVA is slow due to unnecessary output tokens #1118
Comments
I have tested with vllm, and trt-llm is much slower than vllm at large batch sizes. |
Did you test multi-modal models or just language models? |
just LLM |
Which version of tensorrt_llm are you using? You can check it with the following code:
|
It's [TensorRT-LLM] TensorRT-LLM version: |
@Gutianpei Can you please share the steps to reproduce the numbers you get? |
@Gutianpei @x-transformers @springsprite we have observed a similar behavior (i.e., running non-stop until reaching max_new_tokens) and have fixed that in the upcoming release. Can you please test again on top of the main branch after tomorrow's weekly release? |
Thanks for the help. This is the script I used; I only changed the logging in run.py to output the img/sec throughput:
|
Thanks for the help. Running non-stop until reaching max_new_tokens does not appear to be the cause: I set max_new_tokens to 100 in my experiment, and the throughput is still much lower than expected even if the model outputs 100 new tokens every time. I'll try the latest release once it is published; please also take a look at my testing script above to reproduce the issue. |
@Gutianpei we pushed an update to the main branch, can you please try again on the latest main branch and see if the issue persists? Thank you. |
Thanks for the fix! The throughput is clearly improved: I got 12.38 img/sec versus 9.8 previously. Unfortunately, I think it's still much slower than it should be. I can get 11.2 img/sec from sglang and much higher throughput from vllm, and in theory an int8/fp8 trt-llm engine should be much faster. Can you take a look at the script I used above to see if I got any parameters wrong? Also, do you think the very long output token length slows down generation? Thank you! |
@Gutianpei can you disable
Regarding, can you share some measurement stats on this observation? For example, is the comparison apples-to-apples in terms of batch size, precision, etc.? As for throughput, we're working on inflight batching serving for enc-dec and multimodal models, which should help. |
Also, as we discussed under #1123 (comment), we compared against HF transformers and saw that the correct speed should be ~5x faster for llava. Is this OK for your use case? |
Thanks for the reply. The fix you pushed this week definitely accelerates generation; it is 1.5x faster on my end.
Here is the logging from the above script I used (
I set However, I think there is a lot of room for improving TRT-LLM; I just want to make sure there is no bug and all my settings are correct. Excited to check out inflight batching in the future! |
@Gutianpei, some explanation on your perf question: the first thing to clarify is latency vs. throughput. Latency is absolute kernel/model performance, where I believe TRT-LLM is SOTA. Throughput-wise, TRT-LLM is also SOTA for models with inflight batching ENABLED. What's the difference between (1) your current run of TRT-LLM llava at a certain batch size and (2) a future run of TRT-LLM llava with inflight batching enabled, or a vllm run (I'm less familiar with sglang)? Serving optimization matters a lot for throughput. Think of a batch of input images that will generate output lengths of 10, 50, and 100: (1) has to wait until the entire batch finishes, while (2) can continuously process new batches/images as soon as the 10- and 50-token requests finish. So we should keep this in mind when comparing throughput. Meanwhile, for absolute kernel/model perf, your message is well received and we're working on improvements. For example, we found that the data transfer of visual embeddings from the visual engine to the LLM engine can be optimized, and we expect that to narrow the gap you observed. |
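The batching difference described above can be illustrated with a toy model. The following is a minimal sketch, not TensorRT-LLM code: the function names are hypothetical, each request is reduced to its output length, and only decode steps are counted (prefill cost, KV-cache limits, and scheduling overhead are ignored).

```python
def static_batching_steps(output_lengths, batch_size):
    """Decode steps when every batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        # The whole batch is held until the slowest request is done.
        steps += max(output_lengths[start:start + batch_size])
    return steps


def inflight_batching_steps(output_lengths, batch_size):
    """Decode steps when a finished request's slot is refilled immediately."""
    pending = list(output_lengths)   # requests waiting for a slot (FIFO)
    active = []                      # remaining tokens of running requests
    steps = 0
    while pending or active:
        # Fill any free slots before the next decode step.
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))
        steps += 1
        # Each active request emits one token; drop the ones that just finished.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Assumed workload: 12 requests with the output lengths from the example above.
lengths = [10, 50, 100] * 4
print(static_batching_steps(lengths, batch_size=3))    # 400
print(inflight_batching_steps(lengths, batch_size=3))  # 250
```

With these assumed lengths, the static schedule is dominated by the longest request in every batch, whereas the in-flight schedule moves toward the lower bound of total tokens divided by batch size (640 / 3, about 214 steps).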
@Gutianpei Can you try other quantisation methods like SmoothQuant or INT4 AWQ? |
@symphonylyh
I really appreciate your help and the trt-llm support. Will definitely try inflight batching once it's available. I'm closing this issue since my concerns are all addressed, thank you so much! @amukkara |
May I ask how you verify that the inference result of TRT is correct? I can generate output using the method provided by the demo, but the output differs from transformers. How can I ensure that the TRT output is correct? |
System Info
Who can help?
@kaiy
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
Use the official code to run LLAVA1.5-13B
Expected behavior
Much higher throughput -- currently I get ~9.8 img/sec with batchsize=48, whereas sglang reaches 18.6 img/sec. TensorRT-LLM should be at least 2x faster than sglang or vllm.
actual behavior
See above
additional notes
I also benchmarked llama2 and the throughput is as expected. Looking into the code, I found that the output ids contain all the image tokens, whereas the official llava code only contains the text tokens. Is it possible that the LLM part is predicting the image tokens as well, and that this causes the slowdown?
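One way to check whether the extra ids are generated or merely echoed: if the runtime returns the prompt ids (including image-feature placeholder ids) concatenated with the generated ids, slicing each sequence at its input length leaves only the new tokens. The sketch below assumes that layout; `strip_prompt_tokens` and the sample ids are hypothetical and not part of TensorRT-LLM.

```python
def strip_prompt_tokens(output_ids, input_lengths):
    """Drop the echoed prompt from each sequence, keeping only generated ids.

    Assumes output_ids[i] is the prompt ids followed by the generated ids,
    and input_lengths[i] is the number of prompt ids in sequence i.
    """
    if len(output_ids) != len(input_lengths):
        raise ValueError("need one input length per sequence")
    return [ids[n:] for ids, n in zip(output_ids, input_lengths)]


# Made-up ids: the first sequence has a 4-token prompt, the second a 2-token prompt.
batch = [[1, 900, 901, 5, 42, 43], [1, 7, 99]]
generated = strip_prompt_tokens(batch, [4, 2])
print(generated)                       # [[42, 43], [99]]
print(sum(len(g) for g in generated))  # 3
```

Counting only the sliced ids gives the number of tokens the LLM actually decoded, which is the quantity relevant to the throughput question.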