Korbench #1713

Merged: 13 commits into open-compass:main on Nov 25, 2024

Conversation

epsilondylan (Contributor) commented:

Motivation

This PR introduces the implementation of the KOR-Bench dataset and evaluator into the project. The goal is to support the evaluation of models on KOR-Bench tasks, enabling researchers and developers to assess model performance on knowledge-orthogonal reasoning tasks, i.e., rule-based reasoning that is decoupled from memorized domain knowledge.

Modification

  • Dataset Implementation:

    • Added the korbenchDataset class for loading and processing the KOR-Bench dataset.
    • Included support for multiple tasks (cipher, logic, operation, puzzle, counterfactual, and mixed) and modes (zero-shot, three-shot, subquestions).
    • Implemented data-loading functions such as read_yaml, read_json_or_jsonl, and read_json_or_jsonl_with_idx to handle dataset files and configurations (see the loader sketch after this list).
  • Evaluator Implementation:

    • Developed the KorBenchEvaluator class for evaluating model responses against the KOR-Bench benchmarks.
    • Integrated evaluation functions such as evaluate_responses and evaluate_response_vs_answer to compare model outputs with ground-truth answers (sketched after this list).
    • Added support for calculating accuracy and pass rates, and for handling special cases such as mixed modes and counterfactual reasoning.
  • Utilities and Helper Functions:

    • Added helper functions for text extraction, cleaning, and comparison (e.g., extract_text_from_brackets, compare_math_expressions); a sketch follows this list.
    • Included logging for better debugging and information tracking.
    • Ensured compatibility with the OpenCompass framework by registering the dataset and evaluator modules.
  • Documentation and Testing:

    • Updated docstrings and comments for better code understanding.
    • Added unit tests to cover the new functionalities and ensure correctness.
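
To make the loader behavior concrete, here is a minimal sketch of how these helpers might be shaped. The file encoding, the .jsonl suffix check, and the 'idx' field name are assumptions for illustration, not the exact code merged in this PR.

```python
# Hypothetical sketch of the loading helpers; signatures are assumptions.
import json

import yaml


def read_yaml(path):
    """Load a task's rule/template configuration from a YAML file."""
    with open(path, 'r', encoding='utf-8') as f:
        return yaml.safe_load(f)


def read_json_or_jsonl(path):
    """Accept either a JSON array or a JSONL file with one record per line."""
    with open(path, 'r', encoding='utf-8') as f:
        if path.endswith('.jsonl'):
            return [json.loads(line) for line in f if line.strip()]
        return json.load(f)


def read_json_or_jsonl_with_idx(path, idx):
    """Fetch a single record by its 'idx' field (field name is an assumption)."""
    for record in read_json_or_jsonl(path):
        if str(record.get('idx')) == str(idx):
            return record
    raise KeyError(f'No record with idx={idx} in {path}')
```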
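
The extraction and comparison helpers could look roughly like the following. The [[...]] answer delimiter and the sympy-based equivalence check are assumptions made for this sketch.

```python
# Hypothetical sketch of the comparison helpers; the [[...]] delimiter and
# the sympy-based equivalence check are assumptions, not the merged code.
import re

from sympy import simplify, sympify


def extract_text_from_brackets(text):
    """Pull the final answer out of a [[...]] wrapper, if one is present."""
    match = re.search(r'\[\[(.*?)\]\]', text, re.DOTALL)
    return match.group(1).strip() if match else text.strip()


def compare_math_expressions(response, answer):
    """Treat two strings as equal if they parse to equivalent expressions."""
    try:
        return simplify(sympify(response) - sympify(answer)) == 0
    except Exception:
        # Fall back to plain string comparison for non-parsable inputs.
        return response.strip() == answer.strip()
```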
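
Building on the helper sketch above, the response-vs-answer comparison and the accuracy aggregation might be shaped like this; the 'prediction' and 'answer' record fields are assumptions.

```python
# Hypothetical sketch of the evaluation loop, reusing the helpers sketched
# above; the 'prediction' and 'answer' record fields are assumptions.
def evaluate_response_vs_answer(response, answer):
    """Compare one model response with one gold answer after normalization."""
    pred = extract_text_from_brackets(response).lower()
    gold = extract_text_from_brackets(answer).lower()
    return pred == gold or compare_math_expressions(pred, gold)


def evaluate_responses(records):
    """Score a list of records and report accuracy as a percentage."""
    results = [
        evaluate_response_vs_answer(r['prediction'], r['answer'])
        for r in records
    ]
    accuracy = 100.0 * sum(results) / max(len(results), 1)
    return {'accuracy': accuracy, 'details': results}
```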

BC-breaking (Optional)

This PR does not introduce any backward-compatibility issues. Existing functionality and modules remain unaffected.

Use cases (Optional)

  • Model Evaluation: Researchers can now evaluate their models on KOR-Bench tasks using the provided dataset and evaluator classes.
  • Benchmarking: The implementation allows for consistent benchmarking across different models and configurations on KOR-Bench's reasoning tasks.
  • Flexible Configurations: Supports various modes (zero-shot, three-shot, subquestions) and tasks, providing flexibility in evaluation settings; a config sketch follows this list.
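
As one possible usage, an OpenCompass-style dataset config could enumerate tasks and modes along these lines. The import path, the dataset root, and the parameter names (category, prompt_mode) are assumptions for illustration, not the exact config merged here.

```python
# Hypothetical config sketch: the import path, dataset root, and the
# parameter names ('category', 'prompt_mode') are assumptions.
from opencompass.datasets import korbenchDataset  # assumed import path

korbench_datasets = []
for task in ['cipher', 'logic', 'operation', 'puzzle', 'counterfactual']:
    korbench_datasets.append(
        dict(
            type=korbenchDataset,
            abbr=f'korbench_{task}_zero-shot',
            path='data/korbench',         # assumed dataset root
            category=task,                # assumed parameter name
            prompt_mode='zero-shot',      # also 'three-shot' / 'subquestions'
        )
    )
```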

Checklist

Before PR:

  • Pre-commit or other linting tools are used to fix the potential lint issues.
  • Bug fixes are fully covered by unit tests, and the case that caused the bug is added to the unit tests.
  • The modification is covered by complete unit tests; the added tests ensure the correctness of the new code.
  • The documentation has been modified accordingly, including docstrings and example tutorials.

After PR:

  • If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects.
  • The CLA has been signed by all committers in this PR.

MaiziXiao (Collaborator) left a comment:

LGTM

MaiziXiao merged commit 300adc3 into open-compass:main on Nov 25, 2024.
8 checks passed