Your current environment
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pyzmq==26.2.0
[pip3] torch==2.5.1
[pip3] torchvision==0.20.1
[pip3] transformers==4.46.2
[pip3] triton==3.1.0
How would you like to use vllm
I want to run inference of a Llama 3.2 Vision Instruct model. Meta's prompt guide says: "It's important to position the <|image|> tag appropriately in the prompt. The image will only attend to the subsequent text tokens."
That means the prompt should be formatted like {"content": [{"type": "image"}, {"type": "text", "text": "..."}]}, but in practice the reversed format {"content": [{"type": "text", "text": "..."}, {"type": "image"}]} gives better results. Has anyone run into this? If anyone knows why the practical performance differs from the theory, please tell me. Thanks in advance.
@heheda12345 The current mllama implementation supports my input format. But mllama is a late-fusion model, so I think it is heavily affected by the ordering of the prompt content.