update readme
JohnTang93 committed May 31, 2024
1 parent 2dbe154 commit 9659063
Showing 3 changed files with 36 additions and 2 deletions.
9 changes: 7 additions & 2 deletions README.md
@@ -1,12 +1,14 @@
# MTVQA
MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering

<img src="./images/mtvqa_examples.png" width="95%" height="95%">

> Text-Centric Visual Question Answering (TEC-VQA) in its proper format not only facilitates human-machine interaction in text-centric visual environments but also serves as a de facto gold proxy to evaluate AI models in the domain of text-centric scene understanding. Nonetheless, most existing TEC-VQA benchmarks have focused on high-resource languages like English and Chinese. Despite pioneering works to expand multilingual QA pairs in non-text-centric VQA datasets through translation engines, the translation-based protocol encounters a substantial ''Visual-textual misalignment'' problem when applied to TEC-VQA. Specifically, it prioritizes the text in question-answer pairs while disregarding the visual text present in images. Furthermore, it fails to address complexities related to nuanced meaning, contextual distortion, language bias, and question-type diversity. In this work, we tackle multilingual TEC-VQA by introducing MTVQA, the first benchmark featuring high-quality human expert annotations across 9 diverse languages. Further, by comprehensively evaluating numerous state-of-the-art Multimodal Large Language Models (MLLMs), including GPT-4o, GPT-4V, Claude3, and Gemini, on the MTVQA dataset, it is evident that there is still large room for performance improvement, underscoring the value of the dataset. Additionally, we supply multilingual training data within the MTVQA dataset, demonstrating that straightforward fine-tuning with this data can substantially enhance multilingual TEC-VQA performance. We aspire that MTVQA will offer the research community fresh insights and stimulate further exploration in multilingual visual text comprehension.

**[Project Page [This Page]](https://github.com/bytedance/MTVQA)** | **[Paper](https://arxiv.org/abs/2405.11985)** |**[Dataset and Leaderboard](https://huggingface.co/datasets/ByteDance/MTVQA)**|

# Data
| RawData |[Train]() |[Test]()
| [Huggingface Dataset](https://huggingface.co/datasets/ByteDance/MTVQA)
| [RawData](https://drive.google.com/file/d/1u09EVNVj17ws_AHEB7Y0eZiSPseTJUTx/view?usp=sharing) | [Huggingface Dataset](https://huggingface.co/datasets/ByteDance/MTVQA)

# Evaluation
The test code for evaluating models in the paper can be found in [scripts](./scripts)
@@ -24,3 +26,6 @@ If you wish to refer to the baseline results published here, please use the foll
primaryClass={cs.CV}
}
```

# Bias, Risks, and Limitations
Your access to and use of this dataset are at your own risk. We do not guarantee the accuracy of this dataset. The dataset is provided “as is” and we make no warranty or representation to you with respect to it and we expressly disclaim, and hereby expressly waive, all warranties, express, implied, statutory or otherwise. This includes, without limitation, warranties of quality, performance, merchantability or fitness for a particular purpose, non-infringement, absence of latent or other defects, accuracy, or the presence or absence of errors, whether or not known or discoverable. In no event will we be liable to you on any legal theory (including, without limitation, negligence) or otherwise for any direct, special, indirect, incidental, consequential, punitive, exemplary, or other losses, costs, expenses, or damages arising out of this public license or use of the licensed material. The disclaimer of warranties and limitation of liability provided above shall be interpreted in a manner that, to the extent possible, most closely approximates an absolute disclaimer and waiver of all liability.
Binary file added images/mtvqa_examples.png
29 changes: 29 additions & 0 deletions scripts/acc.py
@@ -0,0 +1,29 @@
# Copyright (2024) Bytedance Ltd.
import os
import argparse
import json
import numpy as np
def evaluate_exact_match_accuracy(entries):
    # Despite the name, this is a case-insensitive substring test: an entry scores 1.0 if any annotation occurs inside the answer.
    scores = []
    for elem in entries:
        if isinstance(elem['annotation'], str):
            elem['annotation'] = [elem['annotation']]  # accept a single reference string
        answer = elem['answer'].strip().lower().replace(".", "")  # periods are dropped from the answer only
        score = max(1.0 if ann.strip().lower() in answer else 0.0 for ann in elem['annotation'])
        scores.append(score)
    return sum(scores) / len(scores)


parser = argparse.ArgumentParser()
parser.add_argument("--model_result_path", default='evaluation/sample_result')
args = parser.parse_args()
acc_items = []
for line in os.listdir(args.model_result_path):  # each file is scored as one language's results
    with open(os.path.join(args.model_result_path, line)) as f:
        items = json.load(f)
    lang_acc = evaluate_exact_match_accuracy(items)
    acc_items.append(lang_acc)
    # splitext drops only the extension; str.strip('.json') would also eat leading or trailing '.', j, s, o, n characters
    print(os.path.splitext(line)[0], 'acc:', lang_acc)
print('Average acc:', np.mean(acc_items))
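
To make the scoring rule in `scripts/acc.py` concrete, here is a minimal standalone sketch of the same containment check run on a few toy entries. The field names `annotation` and `answer` come from the script. The function name `containment_accuracy` and all of the sample data are hypothetical and exist only for illustration; this is not the repository's own test data.

```python
# Standalone re-statement of the scoring rule in scripts/acc.py.
# `containment_accuracy` and `toy_entries` are hypothetical names and data.
def containment_accuracy(entries):
    """Fraction of entries whose answer contains at least one reference annotation."""
    scores = []
    for elem in entries:
        refs = elem["annotation"]
        if isinstance(refs, str):
            refs = [refs]  # a single reference string is treated as a one-item list
        # Lowercase the model answer and drop periods, as the script does.
        answer = elem["answer"].strip().lower().replace(".", "")
        scores.append(1.0 if any(r.strip().lower() in answer for r in refs) else 0.0)
    return sum(scores) / len(scores)


toy_entries = [
    {"annotation": "Paris", "answer": "The city is Paris."},  # hit: substring match
    {"annotation": ["12", "twelve"], "answer": "Twelve"},     # hit: second reference matches
    {"annotation": "3.5", "answer": "3.5"},                    # miss: "." is removed from the answer only
    {"annotation": "cat", "answer": "concatenate"},            # hit: containment also matches inside words
]

accuracy = containment_accuracy(toy_entries)
print(accuracy)  # 0.75
```

Two behaviors are worth noting when reading reported numbers. Because periods are removed from the answer but not from the annotation, a reference such as `3.5` cannot match. And because the test is containment rather than equality, a short reference can match inside an unrelated longer word.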
