non-stream is not supported #12723
Comments
Non-streaming requests fail, while the stream style is supported:
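For reference, the only difference between the two request styles against the OpenAI-compatible endpoint is the `stream` field. A minimal sketch (the port, model name, and prompt below are placeholders, not the exact values from this report):

```bash
# Streaming request (works): tokens are returned incrementally as SSE chunks
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen2-7B-Instruct", "messages": [{"role": "user", "content": "Hello"}], "stream": true}'

# Non-streaming request (the failing case): the full completion is returned in one JSON response
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen2-7B-Instruct", "messages": [{"role": "user", "content": "Hello"}], "stream": false}'
```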
The issue should be resolved by PR #11748. You might want to update ipex-llm to a version later than 2.1.0b20240810, or simply upgrade to the latest version.
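A minimal sketch of the upgrade, assuming a pip-based install (the exact extras tag and index URL for a vLLM CPU setup may differ; check the ipex-llm docs for your environment):

```bash
# Upgrade ipex-llm to the latest pre-release build
# (builds later than 2.1.0b20240810 should include the fix from PR #11748)
pip install --pre --upgrade ipex-llm
```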
I just tried to update ipex-llm to 2.1.0 with:
What version of
I tried this command, but it seems the newly installed transformers doesn't support the Qwen2 model.
Here are the versions:
You may need to reinstall vllm after updating ipex-llm. It seems that the versions of transformers and torch are lower than recommended. Below are the recommended versions for these libraries:
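One way to check what is actually installed (these are generic commands, not the exact version list from this thread; if memory serves, Qwen2 support requires a relatively recent transformers, around 4.37 or later):

```bash
# Show the installed versions of the relevant libraries
pip list | grep -Ei "ipex-llm|transformers|torch|vllm"

# Or query the versions directly from Python
python -c "import transformers, torch; print(transformers.__version__, torch.__version__)"
```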
And it works in my environment:
OK. Reinstalling vllm after updating ipex-llm to
But the latest stable version
And the latest pre-release version
You could use
@xiangyuT it seems that
And the package versions:
Here are the error logs:
Understood. We are planning to update vllm-cpu to the latest version and address these issues.
I used the following command to start the server, and the above error no longer occurs.
There are 2 NUMA nodes and 112 CPU cores on my machine. Are there any methods or parameters to improve the throughput? @xiangyuT
It's recommended to use a single NUMA node for the vLLM server, to avoid cross-NUMA-node memory access. You can configure this with:

```bash
export OMP_NUM_THREADS=56   # <number of CPU cores in a single NUMA node>
numactl -C 0-55 -m 0 python3 -m ipex_llm.vllm.cpu.entrypoints.openai.api_server ...
```

Additionally, you can increase the memory allocated for the KV cache (the default is 4 GB) by setting an environment variable:

```bash
export VLLM_CPU_KVCACHE_SPACE=64
```
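To find the right core range and node index for those flags, you can first inspect the machine's topology with standard Linux tools (not specific to ipex-llm or vLLM):

```bash
# Show NUMA nodes and which CPU cores belong to each
numactl --hardware
lscpu | grep -i numa
```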
I changed to
I use ipex-llm==2.1.0b20240805 + vllm 0.4.2 to run Qwen2-7B-Instruct on CPU, then use curl to send an HTTP request to the OpenAI-compatible API.
The server start command:
The curl command:
Then the server raised an error after the inference finished:
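The original commands were not preserved above. As a rough sketch only, a CPU server launched through the entrypoint mentioned earlier in this thread might look like the following (the model path, port, and any ipex-llm specific flags such as low-bit loading are placeholders and omissions, not the reporter's exact command); the failing non-streaming curl request has the same shape as the sketch near the top of the thread:

```bash
# Hypothetical ipex-llm vLLM CPU server launch; --model and --port are standard vLLM server arguments
python3 -m ipex_llm.vllm.cpu.entrypoints.openai.api_server \
  --model /path/to/Qwen2-7B-Instruct \
  --port 8000
```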