Performance degradation on dGPU Arc A770 after loading more than one LLM model #12660
```python
import os

WHISPER_SAMPLING_RATE = 16000

# Function bodies are omitted here; see the attached test script.
def test_chatglm(llm_model, llm_tokenizer, report, is_report): ...
def test_sd(sd_model, report, is_report): ...
def test_minicpm(model, tokenizer, report, is_report): ...
def test_whisper(whisper_processor, whisper_model, report, is_report): ...

if __name__ == '__main__':
    ...
```
Please use the values below to run the test cases (to run a case, set its flag to True):
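A minimal sketch of what those switches might look like (the flag names below are assumptions for illustration, not the names used in the attached script):

```python
# Hypothetical boolean switches selecting which test cases to run
# (names are assumptions, not the flags from the original test script).
RUN_CHATGLM = True
RUN_SD = False
RUN_MINICPM = False
RUN_WHISPER = True

def main():
    # Each enabled branch would load its model and call the matching
    # test_* function from the script above.
    if RUN_CHATGLM:
        print("running chatglm test case")
    if RUN_SD:
        print("running stable-diffusion test case")
    if RUN_MINICPM:
        print("running minicpm test case")
    if RUN_WHISPER:
        print("running whisper test case")

if __name__ == '__main__':
    main()
```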
The code you provided has confusing indentation. Could you provide a properly formatted version?
Please rename the attached test.txt to test.py before running it.
Have you made any progress on this issue? Is there any other information you need from me?
Thank you, we have reproduced the results you provided; the root cause is still under analysis.
You need to comment out your warm-up code. After commenting out all of the warm-up code, the test results show the following: after loading multiple models, model performance does not decrease. Whisper shows abnormal performance when loaded together with other non-text large models, but performs normally when loaded together with ChatGLM, so the problem should not be related to the XPU.
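For reference, "commenting out the warm-up code" here means disabling the extra inference pass normally run before timing, roughly like this (a minimal sketch; the model, input names, and use of torch.xpu are assumptions based on the Arc GPU setup above, not the original script):

```python
import time
import torch

def timed_generate(model, inputs, max_new_tokens=32):
    """Time a single generate() call; assumes model and inputs are already on the XPU."""
    # Warm-up pass, commented out as suggested above:
    # model.generate(**inputs, max_new_tokens=max_new_tokens)

    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    torch.xpu.synchronize()  # wait for XPU kernels to finish before stopping the timer
    return output, time.perf_counter() - start
```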
Thanks for debugging, that is a very useful discovery! I will follow your suggestion and retry. Thanks!
Hi @qing-xu-intel, you could refer to our optimized text-to-speech model examples on Intel GPU (e.g. SpeechT5 and Bark) for more information :)
Hi, I tried the workaround of removing all the warm-up code; however, inference latency is still very long. See the results below:
// with warm-up
// without warm-up
// whisper only
// chatglm only
// minicpm only
// sd only
Whisper latency without warm-up (1.82 s) is roughly 2.5× the whisper-only latency (0.71 s), i.e. 1.82 / 0.71 ≈ 2.6.
https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/install_linux_gpu.md
https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/LLM/chatglm3
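For context, the linked ChatGLM3 example loads the model through ipex-llm's transformers-style API roughly like this (a sketch based on the ipex-llm HuggingFace GPU examples; the exact model id and arguments are assumptions and may differ from the linked code):

```python
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModel  # ipex-llm drop-in for transformers' AutoModel

model_path = "THUDM/chatglm3-6b"  # model id assumed from the example's name

# Load with 4-bit weight-only quantization and move the model to the Arc GPU.
model = AutoModel.from_pretrained(model_path,
                                  load_in_4bit=True,
                                  trust_remote_code=True)
model = model.half().to("xpu")

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

with torch.inference_mode():
    inputs = tokenizer("What is AI?", return_tensors="pt").to("xpu")
    output = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```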
If more than one model is loaded, the inference latency increases:
llm infer 1.22 s
wsp infer 1.01 s
llm infer 2.07 s
cpm infer 2.97 s
sd infer 0.74 s
wsp infer 1.93 s
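One way to sanity-check these numbers is to time each model with explicit device synchronization, so kernels queued by a previously used model do not bleed into the next measurement. This is a sketch, assuming the models are already loaded on the XPU and torch.xpu is available via intel-extension-for-pytorch:

```python
import time
import torch

def measure_latency(run_inference, label, iterations=5):
    """Average the latency of an inference callable, synchronizing the XPU around each run."""
    latencies = []
    for _ in range(iterations):
        torch.xpu.synchronize()   # make sure work from any previously used model is finished
        start = time.perf_counter()
        run_inference()
        torch.xpu.synchronize()   # wait for this model's kernels before stopping the timer
        latencies.append(time.perf_counter() - start)
    print(f"{label}: {sum(latencies) / len(latencies):.2f} s average over {iterations} runs")

# Example usage (the lambdas stand in for the llm / wsp / cpm / sd calls measured above):
# measure_latency(lambda: llm_model.generate(**llm_inputs, max_new_tokens=32), "llm infer")
# measure_latency(lambda: whisper_model.generate(**wsp_inputs), "wsp infer")
```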