Your current environment
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pyzmq==26.2.0
[pip3] torch==2.5.1
[pip3] torchvision==0.20.1
[pip3] transformers==4.46.2
[pip3] triton==3.1.0
How would you like to use vllm
I want to run inference of a Llama 3.2 Vision Instruct model. Meta's prompt guide says: "It's important to position the <|image|> tag appropriately in the prompt. The image will only attend to the subsequent text tokens."
That means the prompt should be formatted like {"content": [{"type": "image"}, {"type": "text", "text": "..."}]}, but in practice the reversed format {"content": [{"type": "text", "text": "..."}, {"type": "image"}]} gives better results. Has anyone run into this? If anyone knows why the practical performance differs from the theory, please tell me. Thanks in advance.
@heheda12345 The current mllama implementation supports my input format. But mllama is a late-fusion model, so I think it is heavily affected by the ordering of the prompt content.