This code implements the methods described in (Dror et al., 2017):
"Replicability Analysis for Natural Language Processing: Testing Significance with Multiple Datasets." Rotem Dror, Gili Baumer, Marina Bogomolov and Roi Reichart. Accepted to the Transactions of the Association for Computational Linguistics (TACL).
The implemented methods help researchers and engineers draw statistically sound conclusions about the difference in performance between two algorithms when they are compared on multiple datasets. In each comparison, both algorithms are applied to a dataset, and the researcher records the performance of each algorithm (e.g. an accuracy measure) together with the p-value of a statistical significance test (e.g. a t-test or a bootstrap test) that assesses how robust the observed difference is. The methods implemented here are the Bonferroni and Fisher tests for counting the number of datasets on which one algorithm is significantly better than the other, and the Holm procedure for identifying these datasets.
Please see the paper for an explanation of why simply selecting the datasets on which one algorithm outperforms the other with a p-value below a desired threshold is not a statistically sound way to solve this problem.
Our code requires a list of p-values from the comparisons of the two algorithms on multiple datasets, with a single p-value for each dataset. If you are unsure which significance test to use, please consult our paper, which discusses this issue for four representative NLP applications.
The Input:
- Write down a comma separated list of the p-values.
- Write down the desired significance level (alpha).
The algorithm computes two estimators:
- B (Bonferroni), to be used if the datasets are dependent.
- F (Fisher), to be used if the datasets are independent.
The algorithm will output:
- An estimate (K estimator) of the number of datasets with a significant effect according to the Bonferroni method.
- An estimate (K estimator) of the number of datasets with a significant effect according to the Fisher method.
- The indices of the datasets identified by the Holm procedure (rejection list).
Note that the number of datasets identified by the Holm procedure should be exactly K-Bonferroni, whereas K-Fisher can be equal to, smaller than, or larger than this number (see the paper for more details).
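The two K estimators can be sketched as follows. This is a minimal illustration under stated assumptions, not the code shipped in this repository. It assumes the partial conjunction p-value for "at least u of the N datasets show an effect" is (N - u + 1) * p_(u) for Bonferroni, and the chi-square upper tail of -2 * sum of ln p_(i) over i >= u with 2(N - u + 1) degrees of freedom for Fisher, where p_(i) is the i-th smallest p-value. Monotonicity in u is enforced with a running maximum, and the estimator is the largest u whose adjusted value is at most alpha. All function names are hypothetical.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper tail P(X >= x) of a chi-square variable with an even df.

    For df = 2m the tail has the closed form exp(-x/2) * sum_{j<m} (x/2)^j / j!,
    so no external statistics library is needed.
    """
    half_x = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half_x / j
        total += term
    return math.exp(-half_x) * total


def partial_conjunction_pvalue(sorted_pvals, u, method):
    """p-value for 'at least u datasets show an effect' (assumed form)."""
    n = len(sorted_pvals)
    if method == "B":  # Bonferroni: valid under dependence between datasets
        return min(1.0, (n - u + 1) * sorted_pvals[u - 1])
    if method == "F":  # Fisher: assumes independent datasets, p-values > 0
        stat = -2.0 * sum(math.log(p) for p in sorted_pvals[u - 1:])
        return chi2_sf_even_df(stat, 2 * (n - u + 1))
    raise ValueError("method must be 'B' or 'F'")


def k_estimator(pvals, alpha, method):
    """Largest u whose monotone-adjusted partial conjunction p-value <= alpha."""
    sorted_pvals = sorted(pvals)
    k, running_max = 0, 0.0
    for u in range(1, len(sorted_pvals) + 1):
        running_max = max(
            running_max, partial_conjunction_pvalue(sorted_pvals, u, method)
        )
        if running_max <= alpha:
            k = u
    return k


pvals = [0.168, 0.297, 0.357, 0.019, 0.218, 0.001]
print(k_estimator(pvals, 0.05, "B"))  # 1
print(k_estimator(pvals, 0.05, "F"))  # 2
```

The closed-form chi-square tail is used only to keep the sketch free of dependencies. It applies here because the Fisher statistic always has an even number of degrees of freedom.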
Enter p-values :
0.168,0.297,0.357,0.019,0.218,0.001
{'dataset1': 0.168, 'dataset2': 0.297, 'dataset3': 0.357, 'dataset4': 0.019, 'dataset5': 0.218, 'dataset6': 0.001}
Enter significance level:
0.05
The Bonferroni-k estimator for the number of datasets with effect is: 1
The Fisher-k estimator for the number of datasets with effect is: 2
The rejections list according to the Holm procedure is:
dataset6
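The rejection list in this example is what the standard Holm step-down procedure yields for these p-values. The sketch below assumes the textbook form of the procedure: sort the p-values in ascending order and reject the i-th smallest while it is at most alpha / (N - i + 1), stopping at the first failure. The function name `holm_rejections` is hypothetical, and the dictionary layout simply mirrors the mapping printed above.

```python
def holm_rejections(pvals_by_dataset, alpha):
    """Return the dataset names rejected by the Holm step-down procedure."""
    # Sort datasets by p-value, smallest first.
    ordered = sorted(pvals_by_dataset.items(), key=lambda item: item[1])
    n = len(ordered)
    rejected = []
    for i, (name, p) in enumerate(ordered):
        # With 0-based i, the threshold for the (i + 1)-th smallest p-value
        # is alpha / (n - i).
        if p > alpha / (n - i):
            break  # step-down: stop at the first p-value that fails
        rejected.append(name)
    return rejected


example = {
    "dataset1": 0.168, "dataset2": 0.297, "dataset3": 0.357,
    "dataset4": 0.019, "dataset5": 0.218, "dataset6": 0.001,
}
print(holm_rejections(example, 0.05))  # ['dataset6']
```

Here 0.001 is below 0.05 / 6, so dataset6 is rejected, while 0.019 exceeds 0.05 / 5 = 0.01, so the procedure stops there.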
If you use this software for research purposes, we would appreciate it if you cite the following paper:
@Article{Q17-1033,
author = "Dror, Rotem
and Baumer, Gili
and Bogomolov, Marina
and Reichart, Roi",
title = "Replicability Analysis for Natural Language Processing: Testing Significance with Multiple Datasets",
journal = "Transactions of the Association for Computational Linguistics",
year = "2017",
volume = "5",
pages = "471--486",
url = "http://aclweb.org/anthology/Q17-1033"
}
- 0.1.0 The first proper release.
- 0.2.0 Output both k estimators.
This file and the code were written by Rotem Dror. The methods are described in the above paper (Dror et al., 2017). For questions, please write to: [email protected]