Calculate squad evaluation metrics overall and separately for text answers and no answers #698
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Squad evaluation metrics for QA are now calculated a) overall (as before), b) for questions with text answer and c) for questions with no answer.
Questions with no answer are identified by (start,end) == (-1,-1) and the calculation of the metrics is done by splitting the predictions and labels accordingly into two sets.
Also: Fixing a bug that appears when processing ground truth labels where either the first token in the text is the correct (and complete) answer or the very last token. These cases were wrongly handled as impossible_to_answer. Example IDs in dev-v2.json: '57340d124776f419006617bf', '57377ec7c3c5551400e51f09'
Limitations: The number of tokens in a passage
passage_len_t
and the index of the last tokenanswer_end_t
are counterintuitive. There are cases whenanswer_end_t == passage_len_t
.Closes #686