[Bug]: Llama 3.2 90b crash #10648
Comments
cc @heheda12345
Is multi-LoRA supported for this model or for this
This model doesn't support LoRA. Please check the Supported Models page for details on each model.
This happens under model load for me as well; we don't use LoRA for this deployment, so the issue is not LoRA-related.
Can you share the log with
@wallashss and I have found a way to reproduce this crash by processing a request with more (groups of) `<|image|>` tokens than input images.
(NB: this crashes it with both the 90B and 11B models)
@ywang96 @DarkLight1337 @Isotr0py Any suggestions on rejecting requests whose number of image tokens does not match the number of input images, instead of letting the server crash?
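A minimal sketch of the kind of pre-flight check being asked about, assuming the prompt text and image list are available before the request reaches the engine (the placeholder string, function shape, and error type below are illustrative assumptions, not vLLM's actual API):

```python
# Minimal pre-flight check: reject a multimodal request whose number of
# <|image|> placeholders does not match the number of supplied images.
# The placeholder string, argument shapes, and error type are illustrative,
# not vLLM's actual API.

IMAGE_PLACEHOLDER = "<|image|>"


class MismatchedImageTokensError(ValueError):
    """Raised when the placeholder count and the image count disagree."""


def validate_image_placeholders(prompt: str, images: list) -> None:
    num_placeholders = prompt.count(IMAGE_PLACEHOLDER)
    num_images = len(images)
    if num_placeholders != num_images:
        raise MismatchedImageTokensError(
            f"prompt contains {num_placeholders} image placeholder(s) "
            f"but {num_images} image(s) were provided"
        )


# Example: two placeholders but only one image -> rejected up front
# instead of reaching the engine.
# validate_image_placeholders("<|image|><|image|> describe this", [image])
```

A check like this would turn the mismatch into a 4xx response instead of a crashed engine, which is the behavior being requested here.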
An update:
We see that in production too, even with checks that there are not two <|image|> tags in the array. From what we can tell, it's happening under heavy load.
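For reference, a minimal sketch of the kind of request that can slip past a simple duplicate-tag check: the text part carries a literal `<|image|>` string while only one image is attached, so (assuming the chat template adds one placeholder per image_url part) the prompt ends up with more image placeholders than images. The host, port, image URL, and exact payload are assumptions for illustration, not a guaranteed reproducer.

```python
# Sketch of a /v1/chat/completions request where the text content smuggles in
# an extra literal <|image|> placeholder alongside a single attached image.
# Host, port, and the image URL are placeholders; the model name mirrors the
# served-model-name from the command in the issue body below.
import requests

payload = {
    "model": "meta-llama/Llama-3.2-90B-Vision-Instruct",
    "max_tokens": 64,
    "messages": [
        {
            "role": "user",
            "content": [
                # The literal token below adds a second image placeholder even
                # though only one image_url part is attached.
                {"type": "text", "text": "<|image|> What is shown in this picture?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
            ],
        }
    ],
}

resp = requests.post("http://localhost:80/v1/chat/completions", json=payload, timeout=120)
print(resp.status_code, resp.text[:500])
```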
@wallashss and I got another bug fix merged for another way of triggering the crash: #12347. We still need to deploy it to see whether we get crashes in production, but hopefully this is the last fix we need for this issue. cc @yessenzhar in case you are able to test it as well.
Your current environment
The output of `python collect_env.py`
Model Input Dumps
No response
🐛 Describe the bug
When running the Llama 3.2 90B Vision model with vLLM, using the command below:
python3 -m vllm.entrypoints.openai.api_server --served-model-name=meta-llama/Llama-3.2-90B-Vision-Instruct --model=/data/001 --tensor-parallel-size=4 --max-num-seqs=96 --max-log-len=0 --load-format=safetensors --host=0.0.0.0 --port=80 --max-num-seqs=64 --gpu-memory-utilization=0.95 --enforce-eager --max-model-len=32768
It occasionally crashes; see the logs below.
INFO 11-25 16:02:04 logger.py:37] Received request cmpl-6ccba6262268459ab8a7317db4e6a4dd-0: prompt: '', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=0.9, top_k=10, min_p=0.0, seed=3497136763981421277, stop=['<|eot_id|>', '<|end_of_text|>', '<|eom_id|>'], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [], lora_request: None, prompt_adapter_request: None.
INFO 11-25 16:02:04 logger.py:37] Received request chatcmpl-0f4e61f625404415ae19e6c65a805c48: prompt: '', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.3, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=['<|eot_id|>', '<|end_of_text|>', '<|eom_id|>'], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=750, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 11-25 16:02:05 engine.py:275] Added request cmpl-6ccba6262268459ab8a7317db4e6a4dd-0.
INFO 11-25 16:02:05 engine.py:275] Added request chatcmpl-0f4e61f625404415ae19e6c65a805c48.
INFO 11-25 16:02:05 logger.py:37] Received request chatcmpl-8599dad45f1d4e808952b46daef59e09: prompt: '', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.3, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=['<|eot_id|>', '<|end_of_text|>', '<|eom_id|>'], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=750, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
CRITICAL 11-25 16:02:05 launcher.py:99] MQLLMEngine is already dead, terminating server process
INFO: 10.244.54.175:43202 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 11-25 16:02:05 launcher.py:99] MQLLMEngine is already dead, terminating server process
INFO: 10.244.54.169:53022 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 11-25 16:02:05 launcher.py:99] MQLLMEngine is already dead, terminating server process
INFO: 10.244.51.218:57838 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 11-25 16:02:05 launcher.py:99] MQLLMEngine is already dead, terminating server process
INFO: 10.244.53.240:41778 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 11-25 16:02:05 launcher.py:99] MQLLMEngine is already dead, terminating server process
INFO: 10.244.10.36:41008 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 11-25 16:02:05 launcher.py:99] MQLLMEngine is already dead, terminating server process
INFO: 10.244.54.171:36942 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 11-25 16:02:05 launcher.py:99] MQLLMEngine is already dead, terminating server process
INFO: 10.244.51.220:60982 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR 11-25 16:02:05 engine.py:143] RuntimeError('CUDA error: invalid configuration argument\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n')
ERROR 11-25 16:02:05 engine.py:143] Traceback (most recent call last):
ERROR 11-25 16:02:05 engine.py:143] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 141, in start
ERROR 11-25 16:02:05 engine.py:143] self.run_engine_loop()
ERROR 11-25 16:02:05 engine.py:143] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 204, in run_engine_loop
ERROR 11-25 16:02:05 engine.py:143] request_outputs = self.engine_step()
ERROR 11-25 16:02:05 engine.py:143] ^^^^^^^^^^^^^^^^^^
ERROR 11-25 16:02:05 engine.py:143] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 222, in engine_step
ERROR 11-25 16:02:05 engine.py:143] raise e
ERROR 11-25 16:02:05 engine.py:143] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 213, in engine_step
ERROR 11-25 16:02:05 engine.py:143] return self.engine.step()
ERROR 11-25 16:02:05 engine.py:143] ^^^^^^^^^^^^^^^^^^
ERROR 11-25 16:02:05 engine.py:143] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1456, in step
ERROR 11-25 16:02:05 engine.py:143] outputs = self.model_executor.execute_model(
ERROR 11-25 16:02:05 engine.py:143] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-25 16:02:05 engine.py:143] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/distributed_gpu_executor.py", line 82, in execute_model
ERROR 11-25 16:02:05 engine.py:143] driver_outputs = self._driver_execute_model(execute_model_req)
ERROR 11-25 16:02:05 engine.py:143] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-25 16:02:05 engine.py:143] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 155, in _driver_execute_model
ERROR 11-25 16:02:05 engine.py:143] return self.driver_worker.execute_model(execute_model_req)
ERROR 11-25 16:02:05 engine.py:143] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-25 16:02:05 engine.py:143] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 343, in execute_model
ERROR 11-25 16:02:05 engine.py:143] output = self.model_runner.execute_model(
ERROR 11-25 16:02:05 engine.py:143] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-25 16:02:05 engine.py:143] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 11-25 16:02:05 engine.py:143] return func(*args, **kwargs)
ERROR 11-25 16:02:05 engine.py:143] ^^^^^^^^^^^^^^^^^^^^^
ERROR 11-25 16:02:05 engine.py:143] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/enc_dec_model_runner.py", line 201, in execute_model
ERROR 11-25 16:02:05 engine.py:143] hidden_or_intermediate_states = model_executable(
ERROR 11-25 16:02:05 engine.py:143] ^^^^^^^^^^^^^^^^^
ERROR 11-25 16:02:05 engine.py:143] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 11-25 16:02:05 engine.py:143] return self._call_impl(*args, **kwargs)
ERROR 11-25 16:02:05 engine.py:143] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-25 16:02:05 engine.py:143] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 11-25 16:02:05 engine.py:143] return forward_call(*args, **kwargs)
ERROR 11-25 16:02:05 engine.py:143] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-25 16:02:05 engine.py:143] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 1363, in forward
ERROR 11-25 16:02:05 engine.py:143] outputs = self.language_model(
ERROR 11-25 16:02:05 engine.py:143] ^^^^^^^^^^^^^^^^^^^^
ERROR 11-25 16:02:05 engine.py:143] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 11-25 16:02:05 engine.py:143] return self._call_impl(*args, **kwargs)
ERROR 11-25 16:02:05 engine.py:143] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-25 16:02:05 engine.py:143] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 11-25 16:02:05 engine.py:143] return forward_call(*args, **kwargs)
ERROR 11-25 16:02:05 engine.py:143] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-25 16:02:05 engine.py:143] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 1030, in forward
ERROR 11-25 16:02:05 engine.py:143] hidden_states = self.model(
ERROR 11-25 16:02:05 engine.py:143] ^^^^^^^^^^^
ERROR 11-25 16:02:05 engine.py:143] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 11-25 16:02:05 engine.py:143] return self._call_impl(*args, **kwargs)
ERROR 11-25 16:02:05 engine.py:143] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-25 16:02:05 engine.py:143] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 11-25 16:02:05 engine.py:143] return forward_call(*args, **kwargs)
ERROR 11-25 16:02:05 engine.py:143] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-25 16:02:05 engine.py:143] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 962, in forward
ERROR 11-25 16:02:05 engine.py:143] hidden_states = decoder_layer(
ERROR 11-25 16:02:05 engine.py:143] ^^^^^^^^^^^^^^
ERROR 11-25 16:02:05 engine.py:143] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 11-25 16:02:05 engine.py:143] return self._call_impl(*args, **kwargs)
ERROR 11-25 16:02:05 engine.py:143] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-25 16:02:05 engine.py:143] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 11-25 16:02:05 engine.py:143] return forward_call(*args, **kwargs)
ERROR 11-25 16:02:05 engine.py:143] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-25 16:02:05 engine.py:143] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 880, in forward
ERROR 11-25 16:02:05 engine.py:143] hidden_states = self.cross_attn(
ERROR 11-25 16:02:05 engine.py:143] ^^^^^^^^^^^^^^^^
ERROR 11-25 16:02:05 engine.py:143] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 11-25 16:02:05 engine.py:143] return self._call_impl(*args, **kwargs)
ERROR 11-25 16:02:05 engine.py:143] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-25 16:02:05 engine.py:143] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 11-25 16:02:05 engine.py:143] return forward_call(*args, **kwargs)
ERROR 11-25 16:02:05 engine.py:143] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-25 16:02:05 engine.py:143] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 764, in forward
ERROR 11-25 16:02:05 engine.py:143] output = self.attention_with_mask(q, k, v, kv_cache,
ERROR 11-25 16:02:05 engine.py:143] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-25 16:02:05 engine.py:143] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 807, in attention_with_mask
ERROR 11-25 16:02:05 engine.py:143] self.head_dim).contiguous()
ERROR 11-25 16:02:05 engine.py:143] ^^^^^^^^^^^^
ERROR 11-25 16:02:05 engine.py:143] RuntimeError: CUDA error: invalid configuration argument
ERROR 11-25 16:02:05 engine.py:143] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 11-25 16:02:05 engine.py:143] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 11-25 16:02:05 engine.py:143] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 11-25 16:02:05 engine.py:143]
(VllmWorkerProcess pid=352) ERROR 11-25 16:02:05 multiproc_worker_utils.py:229] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop.
(VllmWorkerProcess pid=352) ERROR 11-25 16:02:05 multiproc_worker_utils.py:229] Traceback (most recent call last):
(VllmWorkerProcess pid=352) ERROR 11-25 16:02:05 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=352) ERROR 11-25 16:02:05 multiproc_worker_utils.py:229] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=352) ERROR 11-25 16:02:05 multiproc_worker_utils.py:229] ^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=352) ERROR 11-25 16:02:05 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=352) ERROR 11-25 16:02:05 multiproc_worker_utils.py:229] return func(*args, **kwargs)
(VllmWorkerProcess pid=352) ERROR 11-25 16:02:05 multiproc_worker_utils.py:229] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=352) ERROR 11-25 16:02:05 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 85, in start_worker_execution_loop
(VllmWorkerProcess pid=352) ERROR 11-25 16:02:05 multiproc_worker_utils.py:229] output = self.execute_model(execute_model_req=None)
(VllmWorkerProcess pid=352) ERROR 11-25 16:02:05 multiproc_worker_utils.py:229] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=352) ERROR 11-25 16:02:05 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 343, in execute_model
(VllmWorkerProcess pid=352) ERROR 11-25 16:02:05 multiproc_worker_utils.py:229] output = self.model_runner.execute_model(
(VllmWorkerProcess pid=352) ERROR 11-25 16:02:05 multiproc_worker_utils.py:229] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=352) ERROR 11-25 16:02:05 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=352) ERROR 11-25 16:02:05 multiproc_worker_utils.py:229] return func(*args, **kwargs)
(VllmWorkerProcess pid=352) ERROR 11-25 16:02:05 multiproc_worker_utils.py:229] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=352) ERROR 11-25 16:02:05 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/enc_dec_model_runner.py", line 201, in execute_model
(VllmWorkerProcess pid=352) ERROR 11-25 16:02:05 multiproc_worker_utils.py:229] hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=352) ERROR 11-25 16:02:05 multiproc_worker_utils.py:229] ^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=352) ERROR 11-25 16:02:05 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
(VllmWorkerProcess pid=352) ERROR 11-25 16:02:05 multiproc_worker_utils.py:229] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=352) ERROR 11-25 16:02:05 multiproc_worker_utils.py:229] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=352) ERROR 11-25 16:02:05 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
(VllmWorkerProcess pid=352) ERROR 11-25 16:02:05 multiproc_worker_utils.py:229] return forward_call(*args, **kwargs)
...