[Bug]: PixtralHF accuracy on MMMU regressed since 0.6.4.post1 #11816

Closed
mgoin opened this issue Jan 7, 2025 · 2 comments · Fixed by #11891
Labels
bug Something isn't working

Comments

mgoin commented Jan 7, 2025

Your current environment

The output of `python collect_env.py` was not provided.

Model Input Dumps

No response

🐛 Describe the bug

It seems that pixtral_hf accuracy has regressed since the last known good result on 0.6.4.post1.

For reference, the HF model card reports `MMMU (CoT) ~= 51%`. Evals were run using mistral-evals.

vLLM 0.6.4.post1, server and eval:

> uv pip install vllm==0.6.4.post1
> vllm serve nm-testing/pixtral-12b-FP8-dynamic --max-num-seqs 30 --max-model-len 30000 --limit-mm-per-prompt image=5 --port 9000

> python -m eval.run eval_vllm --model_name nm-testing/pixtral-12b-FP8-dynamic --url http://0.0.0.0:9000 --output_dir output/ --eval_name "mmmu"
...
================================================================================
Metrics:
{
    "explicit_prompt_relaxed_correctness": 0.5044444444444445,
    "anywhere_in_answer_relaxed_correctness": 0.5044444444444445
}
================================================================================
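
For context, the eval harness talks to vLLM's OpenAI-compatible chat endpoint. A minimal sanity-check request against the server started above might look like this (the prompt and image URL are hypothetical placeholders; note that the message content is a list of typed chunks, i.e. the OpenAI schema, not a plain string):

```python
# Minimal sanity check against the vLLM OpenAI-compatible server started above.
# Requires `pip install openai`; the image URL is a hypothetical placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:9000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="nm-testing/pixtral-12b-FP8-dynamic",
    messages=[{
        "role": "user",
        # OpenAI-schema content: a list of typed chunks, not a plain string.
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```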

vLLM 0.6.5, server and eval:

> uv pip install vllm==0.6.5
> vllm serve nm-testing/pixtral-12b-FP8-dynamic --max-num-seqs 30 --max-model-len 30000 --limit-mm-per-prompt image=5 --port 9000

> python -m eval.run eval_vllm --model_name nm-testing/pixtral-12b-FP8-dynamic --url http://0.0.0.0:9000 --output_dir output/ --eval_name "mmmu"
...
================================================================================
Metrics:
{
    "explicit_prompt_relaxed_correctness": 0.0011111111111111111,
    "anywhere_in_answer_relaxed_correctness": 0.3466666666666667
}
================================================================================

vLLM with #11741 applied, server and eval:

> uv pip install vllm==0.6.5
> vllm serve nm-testing/pixtral-12b-FP8-dynamic --max-num-seqs 30 --max-model-len 30000 --limit-mm-per-prompt image=5 --port 9000

> python -m eval.run eval_vllm --model_name nm-testing/pixtral-12b-FP8-dynamic --url http://0.0.0.0:9000 --output_dir output/ --eval_name "mmmu"
...
================================================================================
Metrics:
{
    "explicit_prompt_relaxed_correctness": 0.0011111111111111111,
    "anywhere_in_answer_relaxed_correctness": 0.3466666666666667
}
================================================================================

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@mgoin mgoin added the bug Something isn't working label Jan 7, 2025

DarkLight1337 commented Jan 8, 2025

Bisection results (updated as I make more progress):

#10347 PASS
#10371 PASS
#9919 <--
#10386 FAIL
#10361 FAIL
#10415 FAIL
#10180 FAIL
#10128 FAIL
#10973 FAIL

It appears that the chat template content format for Pixtral-HF is parsed as openai format instead of string format. Upon further inspection, the chat template is indeed in openai format. Looking into why that results in incorrect output...
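
To illustrate the distinction (a rough sketch, not vLLM's actual detection logic): in string format the message content is a single string with image placeholders inlined, while in openai format it is a list of typed chunks.

```python
# Rough sketch of the two chat-message content formats; illustrative only,
# not vLLM's actual detection code.

string_format = {"role": "user", "content": "Describe the image.\n[IMG]"}

openai_format = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe the image."},
        {"type": "image"},  # image payload omitted for brevity
    ],
}

def detect_content_format(message: dict) -> str:
    """Hypothetical helper: classify a message's content format."""
    return "string" if isinstance(message["content"], str) else "openai"

assert detect_content_format(string_format) == "string"
assert detect_content_format(openai_format) == "openai"
```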

DarkLight1337 commented Jan 9, 2025

I found that the chat template actually has a typo in it.

          {%- if message["content"] is not string %}
              {%- for chunk in message["content"] %}
                  {%- if chunk["type"] == "text" %}
-                     {{- chunk["content"] }}
+                     {{- chunk["text"] }}
                  {%- elif chunk["type"] == "image" %}
                      {{- "[IMG]" }}
                  {%- else %}
                      {{- raise_exception("Unrecognized content type!") }}
                  {%- endif %}
              {%- endfor %}
          {%- else %}
              {{- message["content"] }}
          {%- endif %}

To be compatible with the OpenAI schema, the inner key should be `text`, not `content`.
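
A minimal Jinja2 repro of the failure mode (simplified template, not the full Pixtral one): with the typo, the missing `content` key renders as Jinja2's empty undefined value, so the text chunk is silently dropped and the model only sees the `[IMG]` tokens.

```python
# Demonstrates the typo's effect: chunk["content"] is undefined for
# OpenAI-schema text chunks, so Jinja2 renders it as an empty string and the
# user's text is silently dropped. Simplified template, not the full Pixtral one.
from jinja2 import Template

buggy = Template(
    '{% for chunk in content %}'
    '{% if chunk["type"] == "text" %}{{ chunk["content"] }}'
    '{% elif chunk["type"] == "image" %}[IMG]{% endif %}'
    '{% endfor %}'
)
fixed = Template(
    '{% for chunk in content %}'
    '{% if chunk["type"] == "text" %}{{ chunk["text"] }}'
    '{% elif chunk["type"] == "image" %}[IMG]{% endif %}'
    '{% endfor %}'
)

content = [{"type": "text", "text": "Describe the image."}, {"type": "image"}]

print(repr(buggy.render(content=content)))  # '[IMG]' -- text silently dropped
print(repr(fixed.render(content=content)))  # 'Describe the image.[IMG]'
```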

Update: Reposted this on a similar thread in the Pixtral-HF repo.
