[Feature] Update o1 evaluation with JudgeLLM #1795

tonysy · 2024-12-30T06:26:02Z

Thanks for your contribution; we appreciate it a lot. Following the instructions below will make your pull request healthier and help it get feedback more easily. If you do not understand some items, don't worry: just open the pull request and ask the maintainers for help.

Motivation

Please describe the motivation of this PR and the goal you want to achieve through this PR.

Modification

Please briefly describe what modification is made in this PR.

BC-breaking (Optional)

Does the modification introduce changes that break the backward compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here and update the documentation.

Checklist

Before PR:

Pre-commit or other linting tools are used to fix the potential lint issues.
Bug fixes are fully covered by unit tests, and the case that caused the bug has been added to the unit tests.
The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
The documentation has been modified accordingly, like docstring or example tutorials.

After PR:

If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects.
CLA has been signed and all committers have signed the CLA in this PR.

MaiziXiao · 2024-12-30T08:17:51Z

opencompass/evaluator/generic_llm_evaluator.py

+def count_chinese_characters(text):
+    words = re.findall(r'[\u4e00-\u9fff]', text)
+    return len(words)
+
+
+def count_english_words(text):
+    words = re.findall(r'\b[a-zA-Z]+\b', text)
+    return len(words)


What are these two functions for?

MaiziXiao · 2024-12-30T08:18:41Z

opencompass/configs/datasets/mmlu/mmlu_stem_0shot_gen_216503.py

+    Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
+    A: CORRECT 
+    B: INCORRECT
+    Just return the letters "A" or "B", with no text around it.
+
+    Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.


Would this evaluation prompt confuse the model about whether to reply with "A" / "B" or with "CORRECT" / "INCORRECT"?
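If the prompt keeps both forms, one way to guard against this is to have the post-processing accept either reply style. The following is only a sketch under that assumption; `parse_judge_reply` is a hypothetical helper, not a function from this PR or from OpenCompass.

```python
import re


def parse_judge_reply(reply):
    """Map a judge reply to True (correct), False (incorrect), or None (unparseable).

    Hypothetical helper: accepts both the letter form ("A" / "B") and the
    word form ("CORRECT" / "INCORRECT") that the prompt mentions.
    """
    text = reply.strip().upper()
    # Word labels take priority; the word boundaries keep CORRECT from
    # matching inside INCORRECT.
    if re.search(r'\bINCORRECT\b', text):
        return False
    if re.search(r'\bCORRECT\b', text):
        return True
    # Otherwise accept a leading standalone letter, optionally wrapped in
    # quotes or other punctuation.
    match = re.match(r'^[^A-Z]*([AB])\b', text)
    if match:
        return match.group(1) == 'A'
    return None


print(parse_judge_reply('A'))            # True
print(parse_judge_reply('B'))            # False
print(parse_judge_reply('incorrect'))    # False
print(parse_judge_reply('no idea'))      # None
```

Returning `None` for anything unrecognised keeps unparseable replies visible instead of silently counting them as incorrect.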

MaiziXiao · 2024-12-30T08:20:41Z

opencompass/tasks/subjective_eval.py

+        if self.keep_judger_postfix:
+            return self.name_prefix + task_name + \
+                '--judge-by--' + model_abbr_from_cfg(self.judge_cfg)
+        else:
+            return self.name_prefix + task_name


Is there any case where we would set keep_judger_postfix to False?
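To make the effect of the flag concrete, here is a standalone sketch of the naming logic from the diff. `build_task_name` and the example values are hypothetical, and the judge abbreviation is passed in as a plain string in place of `model_abbr_from_cfg(self.judge_cfg)`.

```python
def build_task_name(name_prefix, task_name, judge_abbr, keep_judger_postfix):
    # Mirrors the branch in the diff: optionally append the judge model's
    # abbreviation to the task name.
    if keep_judger_postfix:
        return name_prefix + task_name + '--judge-by--' + judge_abbr
    return name_prefix + task_name


# Hypothetical values for illustration only.
print(build_task_name('Eval_', 'mmlu_stem', 'judge-model', True))
# Eval_mmlu_stem--judge-by--judge-model
print(build_task_name('Eval_', 'mmlu_stem', 'judge-model', False))
# Eval_mmlu_stem
```

With the postfix dropped, runs that use different judge models produce the same task name, which is the trade-off the question above is asking about.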

tonysy added 2 commits December 30, 2024 06:20

Update Generic LLM Evaluator

f2cd241

Update o1 style evaluator

a75fc62

mm-assistant bot assigned bittersweet1999 Dec 30, 2024

tonysy temporarily deployed to prod December 30, 2024 06:26 — with GitHub Actions Inactive

tonysy requested a review from MaiziXiao December 30, 2024 06:26

MaiziXiao reviewed Dec 30, 2024

liushz approved these changes Dec 30, 2024

liushz merged commit 98435dd into open-compass:main Dec 30, 2024
8 checks passed
