non-stream is not supported #12723

Open

zengqingfu1442 opened this issue Jan 20, 2025 · 15 comments

@zengqingfu1442 commented Jan 20, 2025

I use ipex-llm==2.1.0b20240805 + vllm 0.4.2 to run Qwen2-7B-Instruct on CPU, then use curl to send an HTTP request to the OpenAI-compatible API.
The server start command:

python -m ipex_llm.vllm.cpu.entrypoints.openai.api_server \
  --model /datamnt/Qwen2-7B-Instruct --port 8080 \
  --served-model-name 'Qwen/Qwen2-7B-Instruct' \
  --load-format 'auto' --device cpu --dtype bfloat16 \
  --load-in-low-bit sym_int4 \
  --max-num-batched-tokens 32768

The curl command:

time curl http://172.16.30.28:8080/v1/chat/completions  -H "Content-Type: application/json" -d '{
    "model": "Qwen/Qwen2-7B-Instruct",
    "messages": [
        {"role": "system", "content": "你是一个写作助手"},
        {"role": "user", "content": "请帮忙写一篇描述江南春天的小作文"}
    ],
    "top_k": 1,
    "max_tokens": 256,
    "stream": false}'

Then the server raised an error after the inference finished:

INFO 01-17 09:51:07 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 14.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%
INFO 01-17 09:51:09 async_llm_engine.py:120] Finished request cmpl-a6703cc7cb0140adaebbfdd9dbf1f1e5.
INFO:     172.16.30.28:47694 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 409, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/applications.py", line 113, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/routing.py", line 715, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/routing.py", line 735, in app
    await route.handle(scope, receive, send)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/routing.py", line 73, in app
    response = await f(request)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/fastapi/routing.py", line 301, in app
    raw_response = await run_endpoint_function(
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/fastapi/routing.py", line 212, in run_endpoint_function
    return await dependant.call(**values)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/ipex_llm/vllm/cpu/entrypoints/openai/api_server.py", line 117, in create_chat_completion
    invalidInputError(isinstance(generator, ChatCompletionResponse))
TypeError: invalidInputError() missing 1 required positional argument: 'errMsg'
@zengqingfu1442 (Author)

In contrast, streaming requests work:

time curl http://172.16.30.28:8080/v1/chat/completions  -H "Content-Type: application/json" -d '{
    "model": "Qwen/Qwen2-7B-Instruct",
    "messages": [
        {"role": "system", "content": "你是一个写作助手"},
        {"role": "user", "content": "请帮忙写一篇描述江南春天的小作文"}
    ],
    "top_k": 1,
    "max_tokens": 256,
    "stream": true}'

@xiangyuT (Contributor)

The issue should be resolved by PR #11748. You might want to update ipex-llm to a version later than 2.1.0b20240810, or simply upgrade to the latest version.

@zengqingfu1442 (Author)

The issue should be resolved by PR #11748. You might want to update ipex-llm to a version later than 2.1.0b20240810, or simply upgrade to the latest version.

I just tried updating ipex-llm to 2.1.0 with pip install ipex-llm -U, but new errors appeared when running:

2025-01-21 03:14:14,108 - INFO - vLLM API server version 0.4.2
2025-01-21 03:14:14,109 - INFO - args: Namespace(host=None, port=8081, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/datamnt/qingfu.zeng/qwen2.5-7b/Qwen2-7B-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='bfloat16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=32768, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, fully_sharded_loras=False, device='cpu', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, model_loader_extra_config=None, served_model_name=['Qwen/Qwen2-7B-Instruct'], engine_use_ray=False, disable_log_requests=False, max_log_len=None, load_in_low_bit='sym_int4')
INFO 01-21 03:14:14 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='/datamnt/qingfu.zeng/qwen2.5-7b/Qwen2-7B-Instruct', speculative_config=None, tokenizer='/datamnt/qingfu.zeng/qwen2.5-7b/Qwen2-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=Qwen/Qwen2-7B-Instruct)
WARNING 01-21 03:14:14 cpu_executor.py:116] CUDA graph is not supported on CPU, fallback to the eager mode.
WARNING 01-21 03:14:14 cpu_executor.py:143] Environment variable VLLM_CPU_KVCACHE_SPACE (GB) for CPU backend is not set, using 4 by default.
INFO 01-21 03:14:14 selector.py:42] Using Torch SDPA backend.
[W ProcessGroupGloo.cpp:721] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
2025-01-21 03:14:15,631 - INFO - Converting the current model to sym_int4 format......
2025-01-21 03:14:15,632 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
2025-01-21 03:14:19,854 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
INFO 01-21 03:14:19 cpu_executor.py:72] # CPU blocks: 4681
INFO 01-21 03:14:20 serving_chat.py:388] Using default chat template:
INFO 01-21 03:14:20 serving_chat.py:388] {% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
INFO 01-21 03:14:20 serving_chat.py:388] You are a helpful assistant.<|im_end|>
INFO 01-21 03:14:20 serving_chat.py:388] ' }}{% endif %}{{'<|im_start|>' + message['role'] + '
INFO 01-21 03:14:20 serving_chat.py:388] ' + message['content'] + '<|im_end|>' + '
INFO 01-21 03:14:20 serving_chat.py:388] '}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
INFO 01-21 03:14:20 serving_chat.py:388] ' }}{% endif %}
INFO:     Started server process [1606134]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8081 (Press CTRL+C to quit)
INFO 01-21 03:14:27 async_llm_engine.py:529] Received request cmpl-eeadab04ed494514a105f9e7fb97508c: prompt: '<|im_start|>system\n你是一个写作助手<|im_end|>\n<|im_start|>user\n请帮忙写一篇描述江南春天的小作文<|im_end|>\n<|im_start|>assistant\n', sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=256, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [151644, 8948, 198, 56568, 101909, 105293, 110498, 151645, 198, 151644, 872, 198, 14880, 106128, 61443, 101555, 53481, 105811, 105303, 104006, 104745, 151645, 198, 151644, 77091, 198], lora_request: None.
INFO 01-21 03:14:27 pynccl_utils.py:17] Failed to import NCCL library: NCCL only supports CUDA and ROCm backends.
INFO 01-21 03:14:27 pynccl_utils.py:18] It is expected if you are not running on NVIDIA GPUs.
ERROR 01-21 03:14:27 async_llm_engine.py:43] Engine background task failed
ERROR 01-21 03:14:27 async_llm_engine.py:43] Traceback (most recent call last):
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
ERROR 01-21 03:14:27 async_llm_engine.py:43]     task.result()
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 501, in run_engine_loop
ERROR 01-21 03:14:27 async_llm_engine.py:43]     has_requests_in_progress = await asyncio.wait_for(
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/root/miniconda3/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
ERROR 01-21 03:14:27 async_llm_engine.py:43]     return fut.result()
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 475, in engine_step
ERROR 01-21 03:14:27 async_llm_engine.py:43]     request_outputs = await self.engine.step_async()
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 221, in step_async
ERROR 01-21 03:14:27 async_llm_engine.py:43]     output = await self.model_executor.execute_model_async(
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/executor/cpu_executor.py", line 101, in execute_model_async
ERROR 01-21 03:14:27 async_llm_engine.py:43]     output = await make_async(self.driver_worker.execute_model
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/root/miniconda3/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 01-21 03:14:27 async_llm_engine.py:43]     result = self.fn(*self.args, **self.kwargs)
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 01-21 03:14:27 async_llm_engine.py:43]     return func(*args, **kwargs)
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/worker/cpu_worker.py", line 290, in execute_model
ERROR 01-21 03:14:27 async_llm_engine.py:43]     output = self.model_runner.execute_model(seq_group_metadata_list,
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 01-21 03:14:27 async_llm_engine.py:43]     return func(*args, **kwargs)
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/worker/cpu_model_runner.py", line 332, in execute_model
ERROR 01-21 03:14:27 async_llm_engine.py:43]     hidden_states = model_executable(**execute_model_kwargs)
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 01-21 03:14:27 async_llm_engine.py:43]     return self._call_impl(*args, **kwargs)
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 01-21 03:14:27 async_llm_engine.py:43]     return forward_call(*args, **kwargs)
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 316, in forward
ERROR 01-21 03:14:27 async_llm_engine.py:43]     hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 01-21 03:14:27 async_llm_engine.py:43]     return self._call_impl(*args, **kwargs)
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 01-21 03:14:27 async_llm_engine.py:43]     return forward_call(*args, **kwargs)
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 253, in forward
ERROR 01-21 03:14:27 async_llm_engine.py:43]     hidden_states, residual = layer(
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 01-21 03:14:27 async_llm_engine.py:43]     return self._call_impl(*args, **kwargs)
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 01-21 03:14:27 async_llm_engine.py:43]     return forward_call(*args, **kwargs)
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 206, in forward
ERROR 01-21 03:14:27 async_llm_engine.py:43]     hidden_states = self.self_attn(
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 01-21 03:14:27 async_llm_engine.py:43]     return self._call_impl(*args, **kwargs)
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 01-21 03:14:27 async_llm_engine.py:43]     return forward_call(*args, **kwargs)
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/ipex_llm/vllm/cpu/model_convert.py", line 88, in _Qwen2_Attention_forward
ERROR 01-21 03:14:27 async_llm_engine.py:43]     qkv = self.qkv_proj(hidden_states).to(dtype=kv_cache.dtype)
ERROR 01-21 03:14:27 async_llm_engine.py:43] AttributeError: 'tuple' object has no attribute 'to'
2025-01-21 03:14:27,729 - ERROR - Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7efbbd3a76d0>, error_callback=<bound method AsyncLLMEngine._error_callback of <ipex_llm.vllm.cpu.engine.engine.IPEXLLMAsyncLLMEngine object at 0x7efba730bfa0>>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7efbbd3a76d0>, error_callback=<bound method AsyncLLMEngine._error_callback of <ipex_llm.vllm.cpu.engine.engine.IPEXLLMAsyncLLMEngine object at 0x7efba730bfa0>>)>
Traceback (most recent call last):
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
    task.result()
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 501, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/root/miniconda3/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 475, in engine_step
    request_outputs = await self.engine.step_async()
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 221, in step_async
    output = await self.model_executor.execute_model_async(
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/executor/cpu_executor.py", line 101, in execute_model_async
    output = await make_async(self.driver_worker.execute_model
  File "/root/miniconda3/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/worker/cpu_worker.py", line 290, in execute_model
    output = self.model_runner.execute_model(seq_group_metadata_list,
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/worker/cpu_model_runner.py", line 332, in execute_model
    hidden_states = model_executable(**execute_model_kwargs)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 316, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 253, in forward
    hidden_states, residual = layer(
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 206, in forward
    hidden_states = self.self_attn(
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/ipex_llm/vllm/cpu/model_convert.py", line 88, in _Qwen2_Attention_forward
    qkv = self.qkv_proj(hidden_states).to(dtype=kv_cache.dtype)
AttributeError: 'tuple' object has no attribute 'to'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 45, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 01-21 03:14:27 async_llm_engine.py:154] Aborted request cmpl-eeadab04ed494514a105f9e7fb97508c.
INFO:     172.16.30.28:37206 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error

@xiangyuT (Contributor)

I just tried updating ipex-llm to 2.1.0 with pip install ipex-llm -U, but new errors appeared when running:

What version of ipex-llm are you using right now? Maybe you could try pip install --pre --upgrade ipex-llm[all]==2.1.0b20240810 --extra-index-url https://download.pytorch.org/whl/cpu

@zengqingfu1442 (Author)

I just tried updating ipex-llm to 2.1.0 with pip install ipex-llm -U, but new errors appeared when running:

What version of ipex-llm are you using right now? Maybe you could try pip install --pre --upgrade ipex-llm[all]==2.1.0b20240810 --extra-index-url https://download.pytorch.org/whl/cpu

I tried this command, but it seems the newly installed transformers doesn't support the qwen2 model:

/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
2025-01-21 04:03:20,201 - INFO - vLLM API server version 0.4.2
2025-01-21 04:03:20,201 - INFO - args: Namespace(host=None, port=8081, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/datamnt/qingfu.zeng/qwen2.5-7b/Qwen2-7B-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='bfloat16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=32768, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, fully_sharded_loras=False, device='cpu', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, model_loader_extra_config=None, served_model_name=['Qwen/Qwen2-7B-Instruct'], engine_use_ray=False, disable_log_requests=False, max_log_len=None, load_in_low_bit='sym_int4')
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/ipex_llm/vllm/cpu/entrypoints/openai/api_server.py", line 176, in <module>
    engine = IPEXLLMAsyncLLMEngine.from_engine_args(
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/ipex_llm/vllm/cpu/engine/engine.py", line 44, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/arg_utils.py", line 520, in create_engine_config
    model_config = ModelConfig(
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/config.py", line 121, in __init__
    self.hf_config = get_config(self.model, trust_remote_code, revision,
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/transformers_utils/config.py", line 23, in get_config
    config = AutoConfig.from_pretrained(
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1098, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 795, in __getitem__
    raise KeyError(key)
KeyError: 'qwen2'

Here are the versions:

ipex-llm                          2.1.0b20240810
torch                             2.1.2+cpu
transformers                      4.36.2
vllm                              0.4.2+cpu

@xiangyuT (Contributor) commented Jan 21, 2025

I just tried updating ipex-llm to 2.1.0 with pip install ipex-llm -U, but new errors appeared when running:

What version of ipex-llm are you using right now? Maybe you could try pip install --pre --upgrade ipex-llm[all]==2.1.0b20240810 --extra-index-url https://download.pytorch.org/whl/cpu

I tried this command, but it seems the newly installed transformers doesn't support the qwen2 model:

Here are the versions:

ipex-llm                          2.1.0b20240810
torch                             2.1.2+cpu
transformers                      4.36.2
vllm                              0.4.2+cpu

You may need to reinstall vllm after updating ipex-llm. It seems that the versions of transformers and torch are lower than recommended.

Below are the recommended versions for these libraries:

ipex-llm                          2.1.0b20240810
torch                             2.3.0+cpu
transformers                      4.40.0
vllm                              0.4.2+cpu
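
For reference, here is a minimal sketch of pinning these versions with pip (assumptions: the same --extra-index-url suggested earlier in this thread, and that vllm 0.4.2+cpu is reinstalled afterwards as noted above):

pip install --pre --upgrade "ipex-llm[all]==2.1.0b20240810" \
    --extra-index-url https://download.pytorch.org/whl/cpu
pip install "torch==2.3.0" --index-url https://download.pytorch.org/whl/cpu
pip install "transformers==4.40.0"
# Then reinstall vllm (0.4.2+cpu), since vllm must be reinstalled after updating ipex-llm.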

And it works in my environment:

time curl http://localhost:8080/v1/chat/completions  -H "Content-Type: application/json" -d '{
    "model": "Qwen/Qwen1.5-7B-Chat",
    "messages": [
        {"role": "system", "content": "你是一个写作助手"},
        {"role": "user", "content": "请帮忙写一篇描述江南春天的小作文"}
    ],
    "top_k": 1,
    "max_tokens": 256,
    "stream": false}'
{"id":"cmpl-5dbfbc00c74c4a10a9ab610cce3b4a2b","object":"chat.completion","created":1737439601,"model":"Qwen/Qwen1.5-7B-Chat","choices":[{"index":0,"message":{"role":"assistant","content":"标题:江南春韵——一幅细腻的水墨画卷\n\n江南,一个如诗如画的地方,她的春天,就像一幅淡雅的水墨画,静静地展现在世人面前,让人沉醉,让人向往。\n\n春天的江南,是温柔的诗。当冬日的寒霜渐渐消融,大地披上了一层嫩绿的轻纱。湖面上,柳丝轻拂,倒映着天空的蓝,湖水的碧,仿佛是诗人的笔尖轻轻一挥,就绘出了一幅淡雅的水墨画。湖边的桃花、樱花争艳斗丽,红的如火,粉的似霞,白的如雪,它们在春风中轻轻摇曳,仿佛在低语着春天的故事。空气中弥漫着淡淡的花香,那是春天的气息,清新,甜美,让人心旷神怡。\n\n春天的江南,是细腻的画。古镇的石板路,被岁月磨砺得光滑如玉,每一步都踏着历史的韵律。青瓦白墙,粉墙黛瓦,仿佛是画家的笔触,细腻而深沉。小桥流水,流水人家,水面上漂浮着几片嫩绿的荷叶,那是"},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":26,"total_tokens":282,"completion_tokens":256}}
real    0m15.528s
user    0m0.007s
sys     0m0.004s

@zengqingfu1442 (Author)

Ok. Reinstalling vllm after updating ipex-llm to 2.1.0b20240810 really works.

@zengqingfu1442 (Author)

But the latest stable version ipex-llm==2.1.0 does not work.

@zengqingfu1442 (Author)

And the latest pre-release version ipex-llm==2.2.0b20250120 does not work either.

@xiangyuT (Contributor)

But the latest stable version ipex-llm==2.1.0 does not work.

You could use the 2.1.0b20240810 version for now. We will look into the issue and plan to update vllm-cpu in the future.

@zengqingfu1442 (Author) commented Jan 21, 2025

@xiangyuT it seems that low-bit does not work when the client sends many async requests. My server start command is:

python3 -m ipex_llm.vllm.cpu.entrypoints.openai.api_server --model /models/Qwen2-7B-Instruct --port 8000 --served-model-name 'Qwen/Qwen2-7B-Instruct' --load-format 'auto' --device cpu --dtype bfloat16 --load-in-low-bit sym_int4 --max-num-batched-tokens 32768

And the package versions:

ipex-llm                          2.1.0b20240810
numpy                             1.26.4
torch                             2.3.0+cpu
transformers                      4.40.0
vllm                              0.4.2+cpu

Here are the error logs:

ERROR 01-21 07:17:07 async_llm_engine.py:43] Engine background task failed
ERROR 01-21 07:17:07 async_llm_engine.py:43] Traceback (most recent call last):
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
ERROR 01-21 07:17:07 async_llm_engine.py:43]     task.result()
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 501, in run_engine_loop
ERROR 01-21 07:17:07 async_llm_engine.py:43]     has_requests_in_progress = await asyncio.wait_for(
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
ERROR 01-21 07:17:07 async_llm_engine.py:43]     return fut.result()
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 475, in engine_step
ERROR 01-21 07:17:07 async_llm_engine.py:43]     request_outputs = await self.engine.step_async()
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 221, in step_async
ERROR 01-21 07:17:07 async_llm_engine.py:43]     output = await self.model_executor.execute_model_async(
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/executor/cpu_executor.py", line 101, in execute_model_async
ERROR 01-21 07:17:07 async_llm_engine.py:43]     output = await make_async(self.driver_worker.execute_model
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 01-21 07:17:07 async_llm_engine.py:43]     result = self.fn(*self.args, **self.kwargs)
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 01-21 07:17:07 async_llm_engine.py:43]     return func(*args, **kwargs)
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/worker/cpu_worker.py", line 290, in execute_model
ERROR 01-21 07:17:07 async_llm_engine.py:43]     output = self.model_runner.execute_model(seq_group_metadata_list,
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 01-21 07:17:07 async_llm_engine.py:43]     return func(*args, **kwargs)
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/worker/cpu_model_runner.py", line 332, in execute_model
ERROR 01-21 07:17:07 async_llm_engine.py:43]     hidden_states = model_executable(**execute_model_kwargs)
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 01-21 07:17:07 async_llm_engine.py:43]     return self._call_impl(*args, **kwargs)
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 01-21 07:17:07 async_llm_engine.py:43]     return forward_call(*args, **kwargs)
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 316, in forward
ERROR 01-21 07:17:07 async_llm_engine.py:43]     hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 01-21 07:17:07 async_llm_engine.py:43]     return self._call_impl(*args, **kwargs)
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 01-21 07:17:07 async_llm_engine.py:43]     return forward_call(*args, **kwargs)
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 253, in forward
ERROR 01-21 07:17:07 async_llm_engine.py:43]     hidden_states, residual = layer(
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 01-21 07:17:07 async_llm_engine.py:43]     return self._call_impl(*args, **kwargs)
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 01-21 07:17:07 async_llm_engine.py:43]     return forward_call(*args, **kwargs)
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 206, in forward
ERROR 01-21 07:17:07 async_llm_engine.py:43]     hidden_states = self.self_attn(
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 01-21 07:17:07 async_llm_engine.py:43]     return self._call_impl(*args, **kwargs)
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 01-21 07:17:07 async_llm_engine.py:43]     return forward_call(*args, **kwargs)
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/ipex_llm/vllm/cpu/model_convert.py", line 88, in _Qwen2_Attention_forward
ERROR 01-21 07:17:07 async_llm_engine.py:43]     qkv = self.qkv_proj(hidden_states).to(dtype=kv_cache.dtype)
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 01-21 07:17:07 async_llm_engine.py:43]     return self._call_impl(*args, **kwargs)
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 01-21 07:17:07 async_llm_engine.py:43]     return forward_call(*args, **kwargs)
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/ipex_llm/transformers/low_bit_linear.py", line 801, in forward
ERROR 01-21 07:17:07 async_llm_engine.py:43]     result = F.linear(x, x0_fp32)
ERROR 01-21 07:17:07 async_llm_engine.py:43] RuntimeError: expected m1 and m2 to have the same dtype, but got: c10::BFloat16 != float
2025-01-21 07:17:07,836 - ERROR - Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7f062047f6d0>, error_callback=<bound method AsyncLLMEngine._error_callback of <ipex_llm.vllm.cpu.engine.engine.IPEXLLMAsyncLLMEngine object at 0x7f0618061570>>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7f062047f6d0>, error_callback=<bound method AsyncLLMEngine._error_callback of <ipex_llm.vllm.cpu.engine.engine.IPEXLLMAsyncLLMEngine object at 0x7f0618061570>>)>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
    task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 501, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 475, in engine_step
    request_outputs = await self.engine.step_async()
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 221, in step_async
    output = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/executor/cpu_executor.py", line 101, in execute_model_async
    output = await make_async(self.driver_worker.execute_model
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/worker/cpu_worker.py", line 290, in execute_model
    output = self.model_runner.execute_model(seq_group_metadata_list,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/worker/cpu_model_runner.py", line 332, in execute_model
    hidden_states = model_executable(**execute_model_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 316, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 253, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 206, in forward
    hidden_states = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ipex_llm/vllm/cpu/model_convert.py", line 88, in _Qwen2_Attention_forward
    qkv = self.qkv_proj(hidden_states).to(dtype=kv_cache.dtype)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ipex_llm/transformers/low_bit_linear.py", line 801, in forward
    result = F.linear(x, x0_fp32)
RuntimeError: expected m1 and m2 to have the same dtype, but got: c10::BFloat16 != float

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 45, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 01-21 07:17:07 async_llm_engine.py:154] Aborted request cmpl-9779de511d3440918525b446930d12f7.
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 259, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 255, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 232, in listen_for_disconnect
    message = await receive()
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 563, in receive
    await self.message_event.wait()
  File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f05f2f21030

During handling of the above exception, another exception occurred:

  + Exception Group Traceback (most recent call last):
  |   File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 409, in run_asgi
  |     result = await app(  # type: ignore[func-returns-value]
  |   File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
  |     return await self.app(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
  |     await super().__call__(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 113, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 187, in __call__
  |     raise exc
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 165, in __call__
  |     await self.app(scope, receive, _send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
  |     await self.app(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
  |     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
  |     raise exc
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 715, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 735, in app
  |     await route.handle(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 288, in handle
  |     await self.app(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 76, in app
  |     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
  |     raise exc
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
  |     await response(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 252, in __call__
  |     async with anyio.create_task_group() as task_group:
  |   File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 767, in __aexit__
  |     raise BaseExceptionGroup(
  | exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 255, in wrap
    |     await func()
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 244, in stream_response
    |     async for chunk in self.body_iterator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/entrypoints/openai/serving_chat.py", line 167, in chat_completion_stream_generator
    |     async for res in result_generator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 666, in generate
    |     raise e
    |   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 660, in generate
    |     async for request_output in stream:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 77, in __anext__
    |     raise result
    |   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
    |     task.result()
    |   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 501, in run_engine_loop
    |     has_requests_in_progress = await asyncio.wait_for(
    |   File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    |     return fut.result()
    |   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 475, in engine_step
    |     request_outputs = await self.engine.step_async()
    |   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 221, in step_async
    |     output = await self.model_executor.execute_model_async(
    |   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/executor/cpu_executor.py", line 101, in execute_model_async
    |     output = await make_async(self.driver_worker.execute_model
    |   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    |     result = self.fn(*self.args, **self.kwargs)
    |   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    |     return func(*args, **kwargs)
    |   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/worker/cpu_worker.py", line 290, in execute_model
    |     output = self.model_runner.execute_model(seq_group_metadata_list,
    |   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    |     return func(*args, **kwargs)
    |   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/worker/cpu_model_runner.py", line 332, in execute_model
    |     hidden_states = model_executable(**execute_model_kwargs)
    |   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    |     return forward_call(*args, **kwargs)
    |   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 316, in forward
    |     hidden_states = self.model(input_ids, positions, kv_caches,
    |   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    |     return forward_call(*args, **kwargs)
    |   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 253, in forward
    |     hidden_states, residual = layer(
    |   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    |     return forward_call(*args, **kwargs)
    |   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 206, in forward
    |     hidden_states = self.self_attn(
    |   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    |     return forward_call(*args, **kwargs)
    |   File "/usr/local/lib/python3.10/dist-packages/ipex_llm/vllm/cpu/model_convert.py", line 88, in _Qwen2_Attention_forward
    |     qkv = self.qkv_proj(hidden_states).to(dtype=kv_cache.dtype)
    |   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    |     return forward_call(*args, **kwargs)
    |   File "/usr/local/lib/python3.10/dist-packages/ipex_llm/transformers/low_bit_linear.py", line 801, in forward
    |     result = F.linear(x, x0_fp32)
    | RuntimeError: expected m1 and m2 to have the same dtype, but got: c10::BFloat16 != float
    +------------------------------------
INFO:     172.16.30.194:43850 - "POST /v1/chat/completions HTTP/1.1" 200 OK

@xiangyuT (Contributor)

@xiangyuT it seems that low-bit does not work when the client sends many async requests. My server start command is:

python3 -m ipex_llm.vllm.cpu.entrypoints.openai.api_server --model /models/Qwen2-7B-Instruct --port 8000 --served-model-name 'Qwen/Qwen2-7B-Instruct' --load-format 'auto' --device cpu --dtype bfloat16 --load-in-low-bit sym_int4 --max-num-batched-tokens 32768

And the package versions:

ipex-llm                          2.1.0b20240810
numpy                             1.26.4
torch                             2.3.0+cpu
transformers                      4.40.0
vllm                              0.4.2+cpu

Understood. We are planning to update vllm-cpu to the latest version and address these issues.

@zengqingfu1442 (Author) commented Jan 21, 2025

I use the following command to start the server, and the above error no longer occurs.

python3 -m ipex_llm.vllm.cpu.entrypoints.openai.api_server \
  --model /models/Qwen2-7B-Instruct --port 8000 \
  --served-model-name 'Qwen/Qwen2-7B-Instruct' \
  --trust-remote-code --device cpu \
  --dtype bfloat16 \
  --enforce-eager \
  --load-in-low-bit bf16 \
  --max-num-batched-tokens 32768

And there are 2 NUMA nodes and 112 CPU cores on my machine. Are there any methods or parameters to improve the throughput? @xiangyuT

@xiangyuT (Contributor)

I use the following command to start the server, and the above error no longer occurs.

python3 -m ipex_llm.vllm.cpu.entrypoints.openai.api_server \
  --model /models/Qwen2-7B-Instruct --port 8000 \
  --served-model-name 'Qwen/Qwen2-7B-Instruct' \
  --trust-remote-code --device cpu \
  --dtype bfloat16 \
  --enforce-eager \
  --load-in-low-bit bf16 \
  --max-num-batched-tokens 32768

My machine has 2 NUMA nodes and 112 CPU cores. Are there any methods or parameters to improve the throughput? @xiangyuT

It's recommended to run the vLLM server on a single NUMA node to avoid cross-NUMA-node memory access. You can configure this using numactl with the following command:

export OMP_NUM_THREADS=56 # <number of CPU cores in a single NUMA node>
numactl -C 0-55 -m 0 python3 -m ipex_llm.vllm.cpu.entrypoints.openai.api_server ...
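
If you are not sure which core IDs belong to which NUMA node, lscpu prints the per-node CPU ranges. The output below is illustrative for a 2-node, 112-core machine; the actual ranges depend on your hardware:

lscpu | grep -i "NUMA node"
# Illustrative output:
# NUMA node(s):          2
# NUMA node0 CPU(s):     0-55
# NUMA node1 CPU(s):     56-111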

Additionally, you can increase the memory allocated for the KV cache (default: 4 GB) by setting the environment variable VLLM_CPU_KVCACHE_SPACE:

export VLLM_CPU_KVCACHE_SPACE=64
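
Putting both suggestions together, a full single-NUMA-node launch could look like the sketch below. The core count (56), core range (0-55) and 64 GB KV cache are assumptions based on the 112-core, 2-node machine described above; adjust them to your hardware:

export OMP_NUM_THREADS=56          # assumption: 112 cores / 2 NUMA nodes
export VLLM_CPU_KVCACHE_SPACE=64   # KV cache budget in GB

numactl -C 0-55 -m 0 python3 -m ipex_llm.vllm.cpu.entrypoints.openai.api_server \
  --model /models/Qwen2-7B-Instruct --port 8000 \
  --served-model-name 'Qwen/Qwen2-7B-Instruct' \
  --trust-remote-code --device cpu \
  --dtype bfloat16 \
  --enforce-eager \
  --load-in-low-bit bf16 \
  --max-num-batched-tokens 32768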

@zengqingfu1442
Author

I used the following command to start the server, and the above error no longer occurs.

python3 -m ipex_llm.vllm.cpu.entrypoints.openai.api_server \
--model /models/Qwen2-7B-Instruct --port 8000 \
--served-model-name 'Qwen/Qwen2-7B-Instruct' \
--trust-remote-code --device cpu \
--dtype bfloat16 \
--enforce-eager \
--load-in-low-bit bf16 \
--max-num-batched-tokens 32768

My machine has 2 NUMA nodes and 112 CPU cores. Are there any methods or parameters to improve the throughput? @xiangyuT

I changed to --load-in-low-bit bf16 and the above errors disappeared, but the following errors occurred:

ERROR 01-21 09:46:05 async_llm_engine.py:504] Engine iteration timed out. This should never happen!
ERROR 01-21 09:46:05 async_llm_engine.py:43] Engine background task failed
ERROR 01-21 09:46:05 async_llm_engine.py:43] Traceback (most recent call last):
ERROR 01-21 09:46:05 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 475, in engine_step
ERROR 01-21 09:46:05 async_llm_engine.py:43]     request_outputs = await self.engine.step_async()
ERROR 01-21 09:46:05 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 221, in step_async
ERROR 01-21 09:46:05 async_llm_engine.py:43]     output = await self.model_executor.execute_model_async(
ERROR 01-21 09:46:05 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/executor/cpu_executor.py", line 101, in execute_model_async
ERROR 01-21 09:46:05 async_llm_engine.py:43]     output = await make_async(self.driver_worker.execute_model
ERROR 01-21 09:46:05 async_llm_engine.py:43] asyncio.exceptions.CancelledError
ERROR 01-21 09:46:05 async_llm_engine.py:43]
ERROR 01-21 09:46:05 async_llm_engine.py:43] During handling of the above exception, another exception occurred:
ERROR 01-21 09:46:05 async_llm_engine.py:43]
ERROR 01-21 09:46:05 async_llm_engine.py:43] Traceback (most recent call last):
ERROR 01-21 09:46:05 async_llm_engine.py:43]   File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
ERROR 01-21 09:46:05 async_llm_engine.py:43]     return fut.result()
ERROR 01-21 09:46:05 async_llm_engine.py:43] asyncio.exceptions.CancelledError
ERROR 01-21 09:46:05 async_llm_engine.py:43]
ERROR 01-21 09:46:05 async_llm_engine.py:43] The above exception was the direct cause of the following exception:
ERROR 01-21 09:46:05 async_llm_engine.py:43]
ERROR 01-21 09:46:05 async_llm_engine.py:43] Traceback (most recent call last):
ERROR 01-21 09:46:05 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
ERROR 01-21 09:46:05 async_llm_engine.py:43]     task.result()
ERROR 01-21 09:46:05 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 501, in run_engine_loop
ERROR 01-21 09:46:05 async_llm_engine.py:43]     has_requests_in_progress = await asyncio.wait_for(
ERROR 01-21 09:46:05 async_llm_engine.py:43]   File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
ERROR 01-21 09:46:05 async_llm_engine.py:43]     raise exceptions.TimeoutError() from exc
ERROR 01-21 09:46:05 async_llm_engine.py:43] asyncio.exceptions.TimeoutError
2025-01-21 09:46:05,104 - ERROR - Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7f72512336d0>, error_callback=<bound method AsyncLLMEngine._error_callback of <ipex_llm.vllm.cpu.engine.engine.IPEXLLMAsyncLLMEngine object at 0x7f7248c055a0>>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7f72512336d0>, error_callback=<bound method AsyncLLMEngine._error_callback of <ipex_llm.vllm.cpu.engine.engine.IPEXLLMAsyncLLMEngine object at 0x7f7248c055a0>>)>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 475, in engine_step
    request_outputs = await self.engine.step_async()
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 221, in step_async
    output = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/executor/cpu_executor.py", line 101, in execute_model_async
    output = await make_async(self.driver_worker.execute_model
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
    task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 501, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 45, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 01-21 09:46:05 async_llm_engine.py:154] Aborted request cmpl-2ae9cb0e967348cf925e84d25e4b593f.
INFO 01-21 09:46:05 async_llm_engine.py:154] Aborted request cmpl-8121301dc77f4c67a580c3fc1d54fd94.
INFO 01-21 09:46:05 async_llm_engine.py:154] Aborted request cmpl-44ca3bb6588f48dd8cbe872fbdaaf30c.
INFO 01-21 09:46:05 async_llm_engine.py:154] Aborted request cmpl-0826632f3284405d94b50d516f1e5c5a.
INFO:     172.16.30.194:35144 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 475, in engine_step
