
Calculate squad evaluation metrics overall and separately for text answers and no answers #698

Merged (3 commits into master, Feb 1, 2021)

Conversation

@julian-risch (Member) commented on Jan 26, 2021:

SQuAD evaluation metrics for QA are now calculated a) overall (as before), b) for questions with a text answer, and c) for questions with no answer.

Questions with no answer are identified by (start, end) == (-1, -1), and the metrics are calculated by splitting the predictions and labels accordingly into two sets.
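For illustration, here is a minimal sketch of that splitting logic; the function and variable names are hypothetical, not the exact ones used in the PR, and the per-example EM/F1 scoring functions are assumed to exist:

```python
# Minimal sketch (hypothetical names, not the exact implementation in this PR).
# A question counts as no_answer if any of its gold spans is (-1, -1);
# EM and F1 are then averaged per subset in addition to the overall numbers.
NO_ANSWER = (-1, -1)

def split_by_answer_type(preds, labels):
    """Partition (pred, gold_spans) pairs by whether the question is no_answer."""
    text_answer, no_answer = [], []
    for pred, gold_spans in zip(preds, labels):
        subset = no_answer if any(span == NO_ANSWER for span in gold_spans) else text_answer
        subset.append((pred, gold_spans))
    return text_answer, no_answer

def subset_metrics(pairs, em_fn, f1_fn):
    """Average EM/F1 over a subset; em_fn/f1_fn score one prediction against its gold spans."""
    if not pairs:
        return {"EM": 0.0, "f1": 0.0, "total": 0}
    return {
        "EM": sum(em_fn(p, g) for p, g in pairs) / len(pairs),
        "f1": sum(f1_fn(p, g) for p, g in pairs) / len(pairs),
        "total": len(pairs),
    }
```

The overall metrics are still computed over the union of both subsets, as before.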

Also: fixed a bug that appeared when processing ground truth labels where either the very first or the very last token in the text is the correct (and complete) answer. These cases were wrongly handled as impossible_to_answer. Example IDs in dev-v2.json: '57340d124776f419006617bf', '57377ec7c3c5551400e51f09'.

Limitations: the relationship between the number of tokens in a passage (passage_len_t) and the index of the last answer token (answer_end_t) is counterintuitive; there are cases where answer_end_t == passage_len_t.
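To make the boundary condition concrete, a hedged sketch of the kind of check involved (variable names follow the description above; the actual condition in the repository may differ):

```python
# Sketch only (names as described above; the exact check in the code may differ).
# A label is treated as no_answer only if it is explicitly encoded as (-1, -1);
# an answer that starts at token 0 or ends at the last token must not be
# mistaken for impossible_to_answer. Because of the counterintuitive indexing,
# answer_end_t == passage_len_t is allowed.
def is_answerable(answer_start_t: int, answer_end_t: int, passage_len_t: int) -> bool:
    if (answer_start_t, answer_end_t) == (-1, -1):
        return False  # explicit no_answer label
    return 0 <= answer_start_t <= answer_end_t <= passage_len_t
```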

Closes #686

…dictions for short documents

Checking whether any of the ground truth labels is (-1,-1) to identify no_answer questions (instead of checking only the first label)
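In code terms, that change amounts to something like the following (illustrative, with hypothetical variable names):

```python
# Illustrative only (variable names are hypothetical).
# Before: only the first gold label decided whether a question is no_answer.
is_no_answer = gold_spans[0] == (-1, -1)

# After: any gold label of (-1, -1) marks the question as no_answer.
is_no_answer = any(span == (-1, -1) for span in gold_spans)
```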
@julian-risch (Member, Author) commented:
I ran the question_answering_accuracy.py benchmark and can confirm that the numbers are in the same range as the expected gold values:

Expected (gold) values:
gold_EM = 0.784721
gold_f1 = 0.826671
gold_tnacc = 0.843594  # top 1 recall

Measured with this PR:
'EM': 0.7843005137707403
'f1': 0.8260896852846605
'top_n_accuracy': 0.8430893624189337

…n output.

Fixed a case where some text_answers were wrongly handled as no_answers when the answer was the first or last token
@julian-risch (Member, Author) commented:
The benchmark results for no_answer questions are now exactly the same for our implementation and the official SQuAD evaluation script. The results for text_answer questions still differ slightly.

Our evaluation:
'EM': 0.7847216373283922, 'f1': 0.8268405564698051, 'top_n_accuracy': 0.8437631601111766,
'EM_text_answer': 0.7513495276653172, 'f1_text_answer': 0.8357081523221991, 'top_n_accuracy_text_answer': 0.8696018893387314, 'Total_text_answer': 5928,
'EM_no_answer': 0.8179983179142136, 'f1_no_answer': 0.8179983179142136, 'top_n_accuracy_no_answer': 0.8179983179142136, 'Total_no_answer': 5945,

Official squad evaluation:
{
"exact": 79.87029394424324,
"f1": 82.91251169582613,
"total": 11873,
"HasAns_exact": 77.93522267206478,
"HasAns_f1": 84.02838248389763,
"HasAns_total": 5928,
"NoAns_exact": 81.79983179142137,
"NoAns_f1": 81.79983179142137,
"NoAns_total": 5945
}
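Note that our metrics are reported as fractions in [0, 1] while the official script reports percentages; scaled up, the no_answer numbers match to floating-point precision. A quick illustrative check (not part of the PR), using the numbers pasted above:

```python
# Quick illustrative check (not part of the PR): our metrics are fractions,
# the official SQuAD script reports percentages.
ours_em_no_answer = 0.8179983179142136     # 'EM_no_answer' from our evaluation
official_noans_exact = 81.79983179142137   # 'NoAns_exact' from the official script

assert abs(ours_em_no_answer * 100 - official_noans_exact) < 1e-9
print("no_answer EM matches the official NoAns_exact")
```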

@Timoeller (Contributor) left a comment:


Nice feature, also good catch of the conversion bug. LG!

@julian-risch merged commit 5ecc1ed into master on Feb 1, 2021.
Linked issue: Add no_answer scores to QA evaluation (#686)