
Calculate squad evaluation metrics overall and separately for text answers and no answers #698

Merged (3 commits into master, Feb 1, 2021)

Conversation

@julian-risch (Member) commented on Jan 26, 2021:

SQuAD evaluation metrics for QA are now calculated a) overall (as before), b) for questions with a text answer, and c) for questions with no answer.

Questions with no answer are identified by (start, end) == (-1, -1), and the metrics are calculated by splitting the predictions and labels accordingly into two sets.
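For illustration, here is a minimal sketch of that splitting logic; the function and variable names are hypothetical, not the exact ones used in the PR, and the per-example EM/F1 scoring functions are assumed to exist:

```python
# Minimal sketch (hypothetical names, not the exact implementation in this PR).
# A question counts as no_answer if any of its gold spans is (-1, -1);
# EM and F1 are then averaged per subset in addition to the overall numbers.
NO_ANSWER = (-1, -1)

def split_by_answer_type(preds, labels):
    """Partition (pred, gold_spans) pairs by whether the question is no_answer."""
    text_answer, no_answer = [], []
    for pred, gold_spans in zip(preds, labels):
        subset = no_answer if any(span == NO_ANSWER for span in gold_spans) else text_answer
        subset.append((pred, gold_spans))
    return text_answer, no_answer

def subset_metrics(pairs, em_fn, f1_fn):
    """Average EM/F1 over a subset; em_fn/f1_fn score one prediction against its gold spans."""
    if not pairs:
        return {"EM": 0.0, "f1": 0.0, "total": 0}
    return {
        "EM": sum(em_fn(p, g) for p, g in pairs) / len(pairs),
        "f1": sum(f1_fn(p, g) for p, g in pairs) / len(pairs),
        "total": len(pairs),
    }
```

The overall metrics are still computed over the union of both subsets, as before.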

Also: fixed a bug that appeared when processing ground truth labels where either the very first or the very last token in the text is the correct (and complete) answer. These cases were wrongly handled as impossible_to_answer. Example IDs in dev-v2.json: '57340d124776f419006617bf', '57377ec7c3c5551400e51f09'.

Limitations: the relationship between the number of tokens in a passage (passage_len_t) and the index of the last answer token (answer_end_t) is counterintuitive; there are cases where answer_end_t == passage_len_t.
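To make the boundary condition concrete, a hedged sketch of the kind of check involved (variable names follow the description above; the actual condition in the repository may differ):

```python
# Sketch only (names as described above; the exact check in the code may differ).
# A label is treated as no_answer only if it is explicitly encoded as (-1, -1);
# an answer that starts at token 0 or ends at the last token must not be
# mistaken for impossible_to_answer. Because of the counterintuitive indexing,
# answer_end_t == passage_len_t is allowed.
def is_answerable(answer_start_t: int, answer_end_t: int, passage_len_t: int) -> bool:
    if (answer_start_t, answer_end_t) == (-1, -1):
        return False  # explicit no_answer label
    return 0 <= answer_start_t <= answer_end_t <= passage_len_t
```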

Closes #686

…dictions for short documents

Checking whether any of the ground truth labels is (-1,-1) to identify no_answer questions (instead of checking only the first label)
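In code terms, that change amounts to something like the following (illustrative, with hypothetical variable names):

```python
# Illustrative only (variable names are hypothetical).
# Before: only the first gold label decided whether a question is no_answer.
is_no_answer = gold_spans[0] == (-1, -1)

# After: any gold label of (-1, -1) marks the question as no_answer.
is_no_answer = any(span == (-1, -1) for span in gold_spans)
```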
@julian-risch (Member, Author) commented:
I ran the question_answering_accuracy.py benchmark and can confirm that the numbers are in the same range as the expected gold values:

Expected (gold) values:
gold_EM = 0.784721
gold_f1 = 0.826671
gold_tnacc = 0.843594  # top 1 recall

Measured with this PR:
'EM': 0.7843005137707403
'f1': 0.8260896852846605
'top_n_accuracy': 0.8430893624189337

…n output.

Fixed a case where some text_answers were wrongly handled as no_answers when the answer was the first or last token
@julian-risch (Member, Author) commented:
The benchmark results for no_answer questions are now exactly the same for our implementation and the official SQuAD evaluation script. The results for text_answer questions still differ slightly.

Our evaluation:
'EM': 0.7847216373283922, 'f1': 0.8268405564698051, 'top_n_accuracy': 0.8437631601111766,
'EM_text_answer': 0.7513495276653172, 'f1_text_answer': 0.8357081523221991, 'top_n_accuracy_text_answer': 0.8696018893387314, 'Total_text_answer': 5928,
'EM_no_answer': 0.8179983179142136, 'f1_no_answer': 0.8179983179142136, 'top_n_accuracy_no_answer': 0.8179983179142136, 'Total_no_answer': 5945,

Official squad evaluation:
{
"exact": 79.87029394424324,
"f1": 82.91251169582613,
"total": 11873,
"HasAns_exact": 77.93522267206478,
"HasAns_f1": 84.02838248389763,
"HasAns_total": 5928,
"NoAns_exact": 81.79983179142137,
"NoAns_f1": 81.79983179142137,
"NoAns_total": 5945
}
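Note that our metrics are reported as fractions in [0, 1] while the official script reports percentages; scaled up, the no_answer numbers match to floating-point precision. A quick illustrative check (not part of the PR), using the numbers pasted above:

```python
# Quick illustrative check (not part of the PR): our metrics are fractions,
# the official SQuAD script reports percentages.
ours_em_no_answer = 0.8179983179142136     # 'EM_no_answer' from our evaluation
official_noans_exact = 81.79983179142137   # 'NoAns_exact' from the official script

assert abs(ours_em_no_answer * 100 - official_noans_exact) < 1e-9
print("no_answer EM matches the official NoAns_exact")
```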

@Timoeller (Contributor) left a comment:


Nice feature, also good catch of the conversion bug. LG!

@julian-risch merged commit 5ecc1ed into master on Feb 1, 2021.
Linked issue: Add no_answer scores to QA evaluation (#686)