diff --git a/.gitignore b/.gitignore index 79830d6..1c6d0f0 100644 --- a/.gitignore +++ b/.gitignore @@ -3,6 +3,8 @@ __pycache__/ *.py[cod] *$py.class +*/.DS_Store + # C extensions *.so diff --git a/README.md b/README.md index eb2ae47..dab2ebc 100644 --- a/README.md +++ b/README.md @@ -47,14 +47,15 @@ Reinforcement Learning (Henderson et al., 2018) and Computer Vision (Borji, 2017 To help mitigate this problem, this package supplies fully-tested re-implementations of useful functions for significance testing: -* Statistical Significance tests such as Almost Stochastic Order (Dror et al., 2019), bootstrap (Efron & Tibshirani, 1994) and - permutation-randomization (Noreen, 1989). +* Statistical Significance tests such as Almost Stochastic Order (del Barrio et al, 2017; Dror et al., 2019), + bootstrap (Efron & Tibshirani, 1994) and permutation-randomization (Noreen, 1989). * Bonferroni correction methods for multiplicity in datasets (Bonferroni, 1936). * Bootstrap power analysis (Yuan & Hayashi, 2003) and other functions to determine the right sample size. All functions are fully tested and also compatible with common deep learning data structures, such as PyTorch / Tensorflow tensors as well as NumPy and Jax arrays. For examples about the usage, consult the documentation -[here](https://deep-significance.readthedocs.io/en/latest/) or the scenarios in the section [Examples](#examples). +[here](https://deep-significance.readthedocs.io/en/latest/) , the scenarios in the section [Examples](#examples) or +the [demo Jupyter notebook](https://github.com/Kaleidophon/deep-significance/tree/main/paper/deep-significance%20demo.ipynb). ## :inbox_tray: Installation @@ -74,46 +75,51 @@ Another option is to clone the repository and install the package locally: --- **tl;dr**: Use `aso()` to compare scores for two models. If the returned `eps_min < 0.5`, A is better than B. The lower -`eps_min`, the more confident the result. +`eps_min`, the more confident the result (we recommend to check `eps_min < 0.2` and record `eps_min` alongside +experimental results). :warning: Testing models with only one set of hyperparameters and only one test set will be able to guarantee superiority in all settings. See [General Recommendations & other notes](#general-recommendations). --- -In the following, I will lay out three scenarios that describe common use cases for ML practitioners and how to apply +In the following, we will lay out three scenarios that describe common use cases for ML practitioners and how to apply the methods implemented in this package accordingly. For an introduction into statistical hypothesis testing, please refer to resources such as [this blog post](https://machinelearningmastery.com/statistical-hypothesis-tests/) for a general overview or [Dror et al. (2018)](https://www.aclweb.org/anthology/P18-1128.pdf) for a NLP-specific point of view. -In general, in statistical significance testing, we usually compare two algorithms and on a dataset using -some evaluation metric (we assume a higher = better). The difference between the two algorithms on the -data is then defined as - -
-$\delta(X) = \mathcal{M}(A, X) - \mathcal{M}(B, X)$
-
-where $\delta(X)$ is our test statistic. We then test the following **null hypothesis**:
-
-$H_0: \delta(X) \le 0$
-
-Thus, we assume our algorithm A to be equally as good or worse than algorithm B and reject the null hypothesis if A
-is better than B (what we actually would like to see). Most statistical significance tests operate using
-*p-values*, which define the probability that under the null-hypothesis, the $\delta(X)$ expected by the test is larger than or
-equal to the observed difference $\delta_\text{obs}$ (that is, for a one-sided test, i.e. we assume A to be better than B):
-
-$P(\delta(X) \ge \delta_\text{obs} | H_0)$
-
-We can interpret this equation as follows: Assuming that A is *not* better than B, the test assumes a corresponding distribution
-of differences that $\delta(X)$ is drawn from. How does our actually observed difference $\delta_\text{obs}$ fit in there?
-This is what the p-value is expressing: If this probability is high, $\delta_\text{obs}$ is in line with what we expected under
-the null hypothesis, so we conclude A not to better than B. If the
-probability is low, that means that $\delta_\text{obs}$ is quite unlikely under the null hypothesis and that the reverse
-case is more likely - i.e. that it is
-likely *larger* than $\delta(X)$ - and we conclude that A is indeed better than B. Note that **the p-value does not
-express whether the null hypothesis is true**.
-
-To decide when we trust A to be better than B, we set a threshold that will determine when the p-value is small enough
-for us to reject the null hypothesis, this is called the significance level $\alpha$ and it is often set to be 0.05.
+We assume that we have two sets of scores we would like to compare, $\mathbb{S}_\mathbb{A}$ and $\mathbb{S}_\mathbb{B}$,
+for instance obtained by running two models $\mathbb{A}$ and $\mathbb{B}$ multiple times with a different random seed.
+We can then define a one-sided test statistic $\delta(\mathbb{S}_\mathbb{A}, \mathbb{S}_\mathbb{B})$ based on the gathered observations.
+An example of such test statistics is for instance the difference in observation means. We then formulate the following null-hypothesis:
+
+$H_0: \delta(\mathbb{S}_\mathbb{A}, \mathbb{S}_\mathbb{B}) \le 0$
+
+That means that we actually assume the opposite of our desired case, namely that $\mathbb{A}$ is not better than $\mathbb{B}$,
+but equally as good or worse, as indicated by the value of the test statistic.
+Usually, the goal becomes to reject this null hypothesis using a statistical significance test (SST).
+*p*-value testing is a frequentist method in the realm of SST.
+It introduces the notion of data that *could have been observed* if we were to repeat our experiment again using
+the same conditions, which we will write with superscript $\text{rep}$ in order to distinguish them from our actually
+observed scores (Gelman et al., 2021).
+We then define the *p*-value as the probability that, under the null hypothesis, the test statistic using the replicated
+observations is larger than or equal to the *observed* test statistic:
+
+$p(\delta(\mathbb{S}_\mathbb{A}^\text{rep}, \mathbb{S}_\mathbb{B}^\text{rep}) \ge \delta(\mathbb{S}_\mathbb{A}, \mathbb{S}_\mathbb{B}) | H_0)$
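+
+For intuition, such a p-value can be estimated by resampling. Below is a minimal sketch with simulated scores (the
+numbers are made up for illustration), using the paired permutation-randomization test that also ships with this package:
+
+```python
+import numpy as np
+from deepsig import permutation_test
+
+seed = 1234
+np.random.seed(seed)
+
+# Hypothetical scores of two models across five random seeds
+my_model_scores = np.random.normal(loc=0.9, scale=0.8, size=5)
+baseline_scores = np.random.normal(loc=0, scale=1, size=5)
+
+# Estimate the p-value by repeatedly swapping scores between A and B
+p_value = permutation_test(my_model_scores, baseline_scores, seed=seed)
+
+if p_value < 0.05:  # Conventional significance level alpha
+    print("The observed difference is unlikely under the null hypothesis")
+```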
+
+We can interpret the expression above as follows: Assuming that $\mathbb{A}$ is not better than $\mathbb{B}$, the test
+assumes a corresponding distribution of statistics that $\delta$ is drawn from. So how does the observed test statistic
+$\delta(\mathbb{S}_\mathbb{A}, \mathbb{S}_\mathbb{B})$ fit in here? This is what the $p$-value expresses: When the
+probability is high, $\delta(\mathbb{S}_\mathbb{A}, \mathbb{S}_\mathbb{B})$ is in line with what we expected under the
+null hypothesis, so we can *not* reject the null hypothesis, or in other words, we *cannot* conclude
+$\mathbb{A}$ to be better than $\mathbb{B}$. If the probability is low, that means that the observed
+$\delta(\mathbb{S}_\mathbb{A}, \mathbb{S}_\mathbb{B})$ is quite unlikely under the null hypothesis and that the reverse case is
+more likely - i.e. that it is likely larger than $\delta(\mathbb{S}_\mathbb{A}^\text{rep}, \mathbb{S}_\mathbb{B}^\text{rep})$ - and we conclude that $\mathbb{A}$ is indeed better than
+$\mathbb{B}$. Note that **the $p$-value does not express whether the null hypothesis is true**. To make our decision
+about whether or not to reject the null hypothesis, we typically determine a threshold - the significance level
+$\alpha$, often set to 0.05 - that the *p*-value has to fall below. However, it has been argued that a better practice
+involves reporting the *p*-value alongside the results without a pigeonholing of results into significant and non-significant
+(Wasserstein et al., 2019).
 
 ### Intermezzo: Almost Stochastic Order - a better significance test for Deep Neural Networks
 
@@ -121,8 +127,8 @@ for us to reject the null hypothesis, this is called the significance level
 we know the algorithm A to score higher. However, in practice these cases are rarely so clear-cut (imagine e.g.
 two normal distributions with the same mean but different variances).
-For this reason, Dror et al. (2019) consider the notion of *almost stochastic dominance* by quantifying the extent to
-which stochastic order is being violated (red area):
+For this reason, del Barrio et al. (2017) and Dror et al. (2019) consider the notion of *almost stochastic dominance*
+by quantifying the extent to which stochastic order is being violated (red area):
 
 ![](img/aso.png)
 
-ASO returns a value $\epsilon_\text{min}$, which expresses the amount of violation of stochastic order. If
-$\epsilon_\text{min} < 0.5$, A is stochastically dominant over B in more cases than vice versa, then the corresponding algorithm can be declared as
+ASO returns a value $\epsilon_\text{min}$, which expresses (an upper bound to) the amount of violation of stochastic order. If
+$\epsilon_\text{min} < \tau$ (where $\tau$ is 0.5 or less), A is stochastically dominant over B in more cases than vice versa, then the corresponding algorithm can be declared as
 superior. We can also interpret $\epsilon_\text{min}$ as a *confidence score*. The lower it is, the more sure we can be that A
 is better than B. Note: **ASO does not compute p-values.** Instead, the null hypothesis formulated as
 
-$H_0: \epsilon_\text{min} \ge 0.5$
+$H_0: \epsilon_\text{min} \ge \tau$
 
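+
+In code, the decision for a given threshold then boils down to a single comparison of the returned score. The
+following is a hypothetical sketch (simulated scores; `tau` is a value you choose, with 0.2 used here in line with
+the recommendation further below):
+
+```python
+import numpy as np
+from deepsig import aso
+
+seed = 1234
+np.random.seed(seed)
+
+# Simulated scores of two models across five random seeds
+my_model_scores = np.random.normal(loc=0.9, scale=0.8, size=5)
+baseline_scores = np.random.normal(loc=0, scale=1, size=5)
+
+eps_min = aso(my_model_scores, baseline_scores, seed=seed)
+
+tau = 0.2  # Stricter rejection threshold than 0.5
+if eps_min < tau:
+    print("A is almost stochastically dominant over B")
+```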
-If we want to be more confident about the result of ASO, we can also set the rejection threshold to be lower than 0.5. +If we want to be more confident about the result of ASO, we can also set the rejection threshold to be lower than 0.5 +(see the discussion in [this section](#general-recommendations)). Furthermore, the significance level is determined as an input argument when running ASO and actively influence the resulting . @@ -159,12 +166,15 @@ We can now simply apply the ASO test: import numpy as np from deepsig import aso +seed = 1234 +np.random.seed(seed) + # Simulate scores N = 5 # Number of random seeds my_model_scores = np.random.normal(loc=0.9, scale=0.8, size=N) baseline_scores = np.random.normal(loc=0, scale=1, size=N) -min_eps = aso(my_model_scores, baseline_scores) # min_eps = 0.0, so A is better +min_eps = aso(my_model_scores, baseline_scores, seed=seed) # min_eps = 0.225, so A is better ``` Note that ASO **does not make any assumptions about the distributions of the scores**. @@ -185,6 +195,9 @@ which corresponds to the Bonferroni correction (Bonferroni et al., 1936): import numpy as np from deepsig import aso +seed = 1234 +np.random.seed(seed) + # Simulate scores for three datasets M = 3 # Number of datasets N = 5 # Number of random seeds @@ -192,8 +205,8 @@ my_model_scores_per_dataset = [np.random.normal(loc=0.3, scale=0.8, size=N) for baseline_scores_per_dataset = [np.random.normal(loc=0, scale=1, size=N) for _ in range(M)] # epsilon_min values with Bonferroni correction -eps_min = [aso(a, b, confidence_level=0.05 / M) for a, b in zip(my_model_scores_per_dataset, baseline_scores_per_dataset)] -# eps_min = [0.1565800030782686, 1, 0.0] +eps_min = [aso(a, b, confidence_level=0.95, num_comparisons=M, seed=seed) for a, b in zip(my_model_scores_per_dataset, baseline_scores_per_dataset)] +# eps_min = [0.006370113450148568, 0.6534772728574852, 0.0] ``` ### Scenario 3 - Comparing sample-level scores @@ -212,6 +225,9 @@ from itertools import product import numpy as np from deepsig import aso +seed = 1234 +np.random.seed(seed) + # Simulate scores for three datasets M = 40 # Number of data points N = 3 # Number of random seeds @@ -220,7 +236,9 @@ baseline_scored_samples_per_run = [np.random.normal(loc=0, scale=1, size=M) for pairs = list(product(my_model_scored_samples_per_run, baseline_scored_samples_per_run)) # epsilon_min values with Bonferroni correction -eps_min = [aso(a, b, confidence_level=0.05 / len(pairs)) for a, b in pairs] +eps_min = [aso(a, b, confidence_level=0.95, num_comparisons=len(pairs), seed=seed) for a, b in pairs] +# eps_min = [0.3831678636198528, 0.07194780234194881, 0.9152792807128325, 0.5273463008857844, 0.14946944524461184, 1.0, +# 0.6099543280369378, 0.22387448804041898, 1.0] ``` ### Scenario 4 - Comparing more than two models @@ -249,6 +267,9 @@ Let's look at an example: ```python import numpy as np from deepsig import multi_aso + +seed = 1234 +np.random.seed(seed) N = 5 # Number of random seeds M = 3 # Number of different models / algorithms @@ -257,20 +278,19 @@ M = 3 # Number of different models / algorithms # Here, we will sample from N(0.1, 0.8), N(0.15, 0.8), N(0.2, 0.8) my_models_scores = np.array([np.random.normal(loc=loc, scale=0.8, size=N) for loc in np.arange(0.1, 0.1 + 0.05 * M, step=0.05)]) -eps_min = multi_aso(my_models_scores, confidence_level=0.05) +eps_min = multi_aso(my_models_scores, confidence_level=0.95, seed=seed) # eps_min = -# array([[1., 1., 1.], -# [0., 1., 1.], -# [0., 0., 1.]]) +# array([[1. , 0.92621655, 1. ], +# [1. , 1. 
, 1. ], +# [0.82081635, 0.73048716, 1. ]]) ``` In the example, `eps_min` is now a matrix, containing the score between all pairs of models (for the same model, it set to 1 by default). The matrix is always to be read as ASO(row, column). The function applies the bonferroni correction for multiple comparisons by -default, but this can be turned off by using `use_bonferroni=False`. In order to save compute, the above symmetry -property is used as well, but this can also be disabled by `use_symmetry=False`. +default, but this can be turned off by using `use_bonferroni=False`. Lastly, when the `scores` argument is a dictionary and the function is called with `return_df=True`, the resulting matrix is given as a `pandas.DataFrame` for increased readability: @@ -278,6 +298,9 @@ given as a `pandas.DataFrame` for increased readability: ```python import numpy as np from deepsig import multi_aso + +seed = 1234 +np.random.seed(seed) N = 5 # Number of random seeds M = 3 # Number of different models / algorithms @@ -294,14 +317,14 @@ my_models_scores = { # ... # } -eps_min = multi_aso(my_models_scores, confidence_level=0.05, return_df=True) +eps_min = multi_aso(my_models_scores, confidence_level=0.95, return_df=True, seed=seed) # This is now a DataFrame! # eps_min = -# model 1 model 2 model 3 -# model 1 1.0 1.0 1.0 -# model 2 0.0 1.0 1.0 -# model 3 1.0 0.0 1.0 +# model 1 model 2 model 3 +# model 1 1.000000 0.926217 1.0 +# model 2 1.000000 1.000000 1.0 +# model 3 0.820816 0.730487 1.0 ``` @@ -315,7 +338,7 @@ score. Below lists some example snippets reporting the results of scenarios 1 an We compared all pairs of models based on five random seeds each using ASO with a confidence level of $\alpha = 0.05$ (before adjusting for all pair-wise comparisons using the Bonferroni correction). Almost stochastic - dominance ($\epsilon_\text{min} < 0.5)$ is indicated in table X. + dominance ($\epsilon_\text{min} < \tau$ with $\tau = 0.2$) is indicated in table X. ### :control_knobs: Sample size @@ -384,11 +407,11 @@ from deepsig import aso import numpy as np from timeit import timeit -a = np.random.normal(size=5) -b = np.random.normal(size=5) +a = np.random.normal(size=1000) +b = np.random.normal(size=1000) -print(timeit(lambda: aso(a, b, num_jobs=1, show_progress=False), number=5)) # 146.6909574989986 -print(timeit(lambda: aso(a, b, num_jobs=4, show_progress=False), number=5)) # 50.416724971000804 +print(timeit(lambda: aso(a, b, num_jobs=1, show_progress=False), number=5)) # 393.6318126 +print(timeit(lambda: aso(a, b, num_jobs=4, show_progress=False), number=5)) # 139.73514621799995n ``` #### :electric_plug: Compatibility with PyTorch, Tensorflow, Jax & Numpy @@ -438,11 +461,15 @@ as many scores as possible should be collected, especially if the variance betwe Because this is usually infeasible in practice, Bouthilier et al. (2020) recommend to **vary all other sources of variation** between runs to obtain the most trustworthy estimate of the "true" performance, such as data shuffling, weight initialization etc. -* `num_samples` and `num_bootstrap_iterations` can be reduced to increase the speed of `aso()`. However, this is not +* `num_bootstrap_iterations` can be reduced to increase the speed of `aso()`. However, this is not recommended as the result of the test will also become less accurate. Technically, is a upper bound that becomes tighter with the number of samples and bootstrap iterations (del Barrio et al., 2017). Thus, increasing the number of jobs with `num_jobs` instead is always preferred. 
+* While we could declare a model stochastically dominant with , we found this to have a comparatively high +Type I error (false positives). Tests [in our paper](https://arxiv.org/pdf/2204.06815.pdf) have shown that a more useful threshold that trades of Type I and + Type II error between different scenarios might be . + * Bootstrap and permutation-randomization are all non-parametric tests, i.e. they don't make any assumptions about the distribution of our test metric. Nevertheless, they differ in their *statistical power*, which is defined as the probability that the null hypothesis is being rejected given that there is a difference between A and B. In other words, the more powerful @@ -454,7 +481,17 @@ the distribution of our test metric. Nevertheless, they differ in their *statist ### :mortar_board: Cite -If you use the ASO test via `aso()`, please cite the original work: +Using this package in general, please cite the following: + + @article{ulmer2022deep, + title={deep-significance-Easy and Meaningful Statistical Significance Testing in the Age of Neural Networks}, + author={Ulmer, Dennis and Hardmeier, Christian and Frellsen, Jes}, + journal={arXiv preprint arXiv:2204.06815}, + year={2022} + } + + +If you use the ASO test via `aso()` or `multi_aso, please cite the original works: @inproceedings{dror2019deep, author = {Rotem Dror and @@ -475,21 +512,20 @@ If you use the ASO test via `aso()`, please cite the original work: timestamp = {Tue, 28 Jan 2020 10:27:52 +0100}, } -Using this package in general, please cite the following: - - @software{dennis_ulmer_2021_4638709, - author = {Dennis Ulmer}, - title = {{deep-significance: Easy and Better Significance - Testing for Deep Neural Networks}}, - month = mar, - year = 2021, - note = {https://github.com/Kaleidophon/deep-significance}, - publisher = {Zenodo}, - version = {v1.0.0a}, - doi = {10.5281/zenodo.4638709}, - url = {https://doi.org/10.5281/zenodo.4638709} + @incollection{del2018optimal, + title={An optimal transportation approach for assessing almost stochastic order}, + author={Del Barrio, Eustasio and Cuesta-Albertos, Juan A and Matr{\'a}n, Carlos}, + booktitle={The Mathematics of the Uncertain}, + pages={33--44}, + year={2018}, + publisher={Springer} } +For instance, you can write + + In order to compare models, we use the Almost Stochastic Order test \citep{del2018optimal, dror2019deep} as + implemented by \citet{ulmer2022deep}. + ### :medal_sports: Acknowledgements This package was created out of discussions of the [NLPnorth group](https://nlpnorth.github.io/) at the IT University @@ -526,6 +562,9 @@ Dror, Rotem, Shlomov, Segev, and Reichart, Roi. "Deep dominance-how to properly Efron, Bradley, and Robert J. Tibshirani. "An introduction to the bootstrap." CRC press, 1994. +Andrew Gelman, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, Donald B Rubin, John +Carlin, Hal Stern, Donald Rubin, and David Dunson. Bayesian data analysis third edition, 2021. + Henderson, Peter, et al. "Deep reinforcement learning that matters." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 32. No. 1. 2018. Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, Tom Goldstein. "Visualizing the Loss Landscape of Neural Nets." NeurIPS 2018: 6391-6401 @@ -534,4 +573,7 @@ Narang, Sharan, et al. "Do Transformer Modifications Transfer Across Implementat Noreen, Eric W. "Computer intensive methods for hypothesis testing: An introduction." Wiley, New York (1989). +Ronald L Wasserstein, Allen L Schirm, and Nicole A Lazar. 
Moving to a world beyond “p< 0.05”, +2019 + Yuan, Ke‐Hai, and Kentaro Hayashi. "Bootstrap approach to inference and power analysis based on three test statistics for covariance structure models." British Journal of Mathematical and Statistical Psychology 56.1 (2003): 93-110. \ No newline at end of file diff --git a/README_RAW.md b/README_RAW.md index bd8908d..abdbe96 100644 --- a/README_RAW.md +++ b/README_RAW.md @@ -47,14 +47,15 @@ Reinforcement Learning (Henderson et al., 2018) and Computer Vision (Borji, 2017 To help mitigate this problem, this package supplies fully-tested re-implementations of useful functions for significance testing: -* Statistical Significance tests such as Almost Stochastic Order (Dror et al., 2019), bootstrap (Efron & Tibshirani, 1994) and - permutation-randomization (Noreen, 1989). +* Statistical Significance tests such as Almost Stochastic Order (del Barrio et al, 2017; Dror et al., 2019), + bootstrap (Efron & Tibshirani, 1994) and permutation-randomization (Noreen, 1989). * Bonferroni correction methods for multiplicity in datasets (Bonferroni, 1936). * Bootstrap power analysis (Yuan & Hayashi, 2003) and other functions to determine the right sample size. All functions are fully tested and also compatible with common deep learning data structures, such as PyTorch / Tensorflow tensors as well as NumPy and Jax arrays. For examples about the usage, consult the documentation -[here](https://deep-significance.readthedocs.io/en/latest/) or the scenarios in the section [Examples](#examples). +[here](https://deep-significance.readthedocs.io/en/latest/) , the scenarios in the section [Examples](#examples) or +the [demo Jupyter notebook](https://github.com/Kaleidophon/deep-significance/tree/main/paper/deep-significance%20demo.ipynb). ## :inbox_tray: Installation @@ -74,52 +75,55 @@ Another option is to clone the repository and install the package locally: --- **tl;dr**: Use `aso()` to compare scores for two models. If the returned `eps_min < 0.5`, A is better than B. The lower -`eps_min`, the more confident the result. +`eps_min`, the more confident the result (we recommend to check `eps_min < 0.2` and record `eps_min` alongside +experimental results). :warning: Testing models with only one set of hyperparameters and only one test set will be able to guarantee superiority in all settings. See [General Recommendations & other notes](#general-recommendations). --- -In the following, I will lay out three scenarios that describe common use cases for ML practitioners and how to apply +In the following, we will lay out three scenarios that describe common use cases for ML practitioners and how to apply the methods implemented in this package accordingly. For an introduction into statistical hypothesis testing, please refer to resources such as [this blog post](https://machinelearningmastery.com/statistical-hypothesis-tests/) for a general overview or [Dror et al. (2018)](https://www.aclweb.org/anthology/P18-1128.pdf) for a NLP-specific point of view. -In general, in statistical significance testing, we usually compare two algorithms $A$ and $B$ on a dataset $X$ using -some evaluation metric $\mathcal{M}$ (we assume a higher = better). The difference between the two algorithms on the -data is then defined as +We assume that we have two sets of scores we would like to compare, $\mathbb{S}_\mathbb{A}$ and $\mathbb{S}_\mathbb{B}$, +for instance obtained by running two models $\mathbb{A}$ and $\mathbb{B}$ multiple times with a different random seed. 
+We can then define a one-sided test statistic $\delta(\mathbb{S}_\mathbb{A}, \mathbb{S}_\mathbb{B})$ based on the gathered observations. +An example of such test statistics is for instance the difference in observation means. We then formulate the following null-hypothesis: $$ -\delta(X) = \mathcal{M}(A, X) - \mathcal{M}(B, X) +H_0: \delta(\mathbb{S}_\mathbb{A}, \mathbb{S}_\mathbb{B}) \le 0 $$ -where $\delta(X)$ is our test statistic. We then test the following **null hypothesis**: +That means that we actually assume the opposite of our desired case, namely that $\mathbb{A}$ is not better than $\mathbb{B}$, +but equally as good or worse, as indicated by the value of the test statistic. +Usually, the goal becomes to reject this null hypothesis using the SST. +*p*-value testing is a frequentist method in the realm of SST. +It introduces the notion of data that *could have been observed* if we were to repeat our experiment again using +the same conditions, which we will write with superscript $\text{rep}$ in order to distinguish them from our actually +observed scores (Gelman et al., 2021). +We then define the *p*-value as the probability that, under the null hypothesis, the test statistic using replicated +observation is larger than or equal to the *observed* test statistic: $$ -H_0: \delta(X) \le 0 +p(\delta(\mathbb{S}_\mathbb{A}^\text{rep}, \mathbb{S}_\mathbb{B}^\text{rep}) \ge \delta(\mathbb{S}_\mathbb{A}, \mathbb{S}_\mathbb{B})|H_0) $$ -Thus, we assume our algorithm A to be equally as good or worse than algorithm B and reject the null hypothesis if A -is better than B (what we actually would like to see). Most statistical significance tests operate using -*p-values*, which define the probability that under the null-hypothesis, the $\delta(X)$ expected by the test is larger than or -equal to the observed difference $\delta_{\text{obs}}$ (that is, for a one-sided test, i.e. we assume A to be better than B): - -$$ -P(\delta(X) \ge \delta_\text{obs}| H_0) -$$ - -We can interpret this equation as follows: Assuming that A is *not* better than B, the test assumes a corresponding distribution -of differences that $\delta(X)$ is drawn from. How does our actually observed difference $\delta_\text{obs}$ fit in there? -This is what the p-value is expressing: If this probability is high, $\delta_\text{obs}$ is in line with what we expected under -the null hypothesis, so we conclude A not to better than B. If the -probability is low, that means that $\delta_\text{obs}$ is quite unlikely under the null hypothesis and that the reverse -case is more likely - i.e. that it is -likely *larger* than $\delta(X)$ - and we conclude that A is indeed better than B. Note that **the p-value does not -express whether the null hypothesis is true**. - -To decide when we trust A to be better than B, we set a threshold that will determine when the p-value is small enough -for us to reject the null hypothesis, this is called the significance level $\alpha$ and it is often set to be 0.05. +We can interpret this expression as follows: Assuming that $\mathbb{A}$ is not better than $\mathbb{B}$, the test +assumes a corresponding distribution of statistics that $\delta$ is drawn from. So how does the observed test statistic +$\delta(\mathbb{S}_\mathbb{A}, \mathbb{S}_\mathbb{B})$ fit in here? 
This is what the $p$-value expresses: When the +probability is high, $\delta(\mathbb{S}_\mathbb{A}, \mathbb{S}_\mathbb{B})$ is in line with what we expected under the +null hypothesis, so we can *not* reject the null hypothesis, or in other words, we \emph{cannot} conclude +$\mathbb{A}$ to be better than $\mathbb{B}$. If the probability is low, that means that the observed +$\delta(\mathbb{S}, \mathbb{S}_\mathbb{B})$ is quite unlikely under the null hypothesis and that the reverse case is +more likely - i.e. that it is likely larger than - and we conclude that $\mathbb{A}$ is indeed better than +$\mathbb{B}$. Note that **the $p$-value does not express whether the null hypothesis is true**. To make our decision +about whether or not to reject the null hypothesis, we typically determine a threshold - the significance level +$\alpha$, often set to 0.05 - that the *p*-value has to fall below. However, it has been argued that a better practice +involves reporting the *p*-value alongside the results without a pidgeonholing of results into significant and non-significant +(Wasserstein et al., 2019). ### Intermezzo: Almost Stochastic Order - a better significance test for Deep Neural Networks @@ -127,8 +131,8 @@ for us to reject the null hypothesis, this is called the significance level $\al Deep neural networks are highly non-linear models, having their performance highly dependent on hyperparameters, random seeds and other (stochastic) factors. Therefore, comparing the means of two models across several runs might not be enough to decide if a model A is better than B. In fact, **even aggregating more statistics like standard deviation, minimum -or maximum might not be enough** to make a decision. For this reason, Dror et al. (2019) introduced *Almost Stochastic -Order* (ASO), a test to compare two score distributions. +or maximum might not be enough** to make a decision. For this reason, del Barrio et al. (2017) and Dror et al. (2019) +introduced *Almost Stochastic Order* (ASO), a test to compare two score distributions. It builds on the concept of *stochastic order*: We can compare two distributions and declare one as *stochastically dominant* by comparing their cumulative distribution functions: @@ -138,21 +142,22 @@ by comparing their cumulative distribution functions: Here, the CDF of A is given in red and in green for B. If the CDF of A is lower than B for every $x$, we know the algorithm A to score higher. However, in practice these cases are rarely so clear-cut (imagine e.g. two normal distributions with the same mean but different variances). -For this reason, Dror et al. (2019) consider the notion of *almost stochastic dominance* by quantifying the extent to -which stochastic order is being violated (red area): +For this reason, del Barrio et al. (2017) and Dror et al. (2019) consider the notion of *almost stochastic dominance* +by quantifying the extent to which stochastic order is being violated (red area): ![](img/aso.png) -ASO returns a value $\epsilon_\text{min}$, which expresses the amount of violation of stochastic order. If -$\epsilon_\text{min} < 0.5$, A is stochastically dominant over B in more cases than vice versa, then the corresponding algorithm can be declared as +ASO returns a value $\epsilon_\text{min}$, which expresses (an upper bound to) the amount of violation of stochastic order. 
If +$\epsilon_\text{min} < \tau$ (where \tau is 0.5 or less), A is stochastically dominant over B in more cases than vice versa, then the corresponding algorithm can be declared as superior. We can also interpret $\epsilon_\text{min}$ as a *confidence score*. The lower it is, the more sure we can be that A is better than B. Note: **ASO does not compute p-values.** Instead, the null hypothesis formulated as $$ -H_0: \epsilon_\text{min} \ge 0.5 +H_0: \epsilon_\text{min} \ge \tau $$ -If we want to be more confident about the result of ASO, we can also set the rejection threshold to be lower than 0.5. +If we want to be more confident about the result of ASO, we can also set the rejection threshold to be lower than 0.5 +(see the discussion in [this section](#general-recommendations)). Furthermore, the significance level $\alpha$ is determined as an input argument when running ASO and actively influence the resulting $\epsilon_\text{min}$. @@ -167,12 +172,15 @@ We can now simply apply the ASO test: import numpy as np from deepsig import aso +seed = 1234 +np.random.seed(seed) + # Simulate scores N = 5 # Number of random seeds my_model_scores = np.random.normal(loc=0.9, scale=0.8, size=N) baseline_scores = np.random.normal(loc=0, scale=1, size=N) -min_eps = aso(my_model_scores, baseline_scores) # min_eps = 0.0, so A is better +min_eps = aso(my_model_scores, baseline_scores, seed=seed) # min_eps = 0.225, so A is better ``` Note that ASO **does not make any assumptions about the distributions of the scores**. @@ -193,6 +201,9 @@ which corresponds to the Bonferroni correction (Bonferroni et al., 1936): import numpy as np from deepsig import aso +seed = 1234 +np.random.seed(seed) + # Simulate scores for three datasets M = 3 # Number of datasets N = 5 # Number of random seeds @@ -200,8 +211,8 @@ my_model_scores_per_dataset = [np.random.normal(loc=0.3, scale=0.8, size=N) for baseline_scores_per_dataset = [np.random.normal(loc=0, scale=1, size=N) for _ in range(M)] # epsilon_min values with Bonferroni correction -eps_min = [aso(a, b, confidence_level=0.05 / M) for a, b in zip(my_model_scores_per_dataset, baseline_scores_per_dataset)] -# eps_min = [0.1565800030782686, 1, 0.0] +eps_min = [aso(a, b, confidence_level=0.95, num_comparisons=M, seed=seed) for a, b in zip(my_model_scores_per_dataset, baseline_scores_per_dataset)] +# eps_min = [0.006370113450148568, 0.6534772728574852, 0.0] ``` ### Scenario 3 - Comparing sample-level scores @@ -220,6 +231,9 @@ from itertools import product import numpy as np from deepsig import aso +seed = 1234 +np.random.seed(seed) + # Simulate scores for three datasets M = 40 # Number of data points N = 3 # Number of random seeds @@ -228,7 +242,9 @@ baseline_scored_samples_per_run = [np.random.normal(loc=0, scale=1, size=M) for pairs = list(product(my_model_scored_samples_per_run, baseline_scored_samples_per_run)) # epsilon_min values with Bonferroni correction -eps_min = [aso(a, b, confidence_level=0.05 / len(pairs)) for a, b in pairs] +eps_min = [aso(a, b, confidence_level=0.95, num_comparisons=len(pairs), seed=seed) for a, b in pairs] +# eps_min = [0.3831678636198528, 0.07194780234194881, 0.9152792807128325, 0.5273463008857844, 0.14946944524461184, 1.0, +# 0.6099543280369378, 0.22387448804041898, 1.0] ``` ### Scenario 4 - Comparing more than two models @@ -259,6 +275,9 @@ Let's look at an example: ```python import numpy as np from deepsig import multi_aso + +seed = 1234 +np.random.seed(seed) N = 5 # Number of random seeds M = 3 # Number of different models / algorithms @@ 
-267,20 +286,19 @@ M = 3 # Number of different models / algorithms # Here, we will sample from N(0.1, 0.8), N(0.15, 0.8), N(0.2, 0.8) my_models_scores = np.array([np.random.normal(loc=loc, scale=0.8, size=N) for loc in np.arange(0.1, 0.1 + 0.05 * M, step=0.05)]) -eps_min = multi_aso(my_models_scores, confidence_level=0.05) +eps_min = multi_aso(my_models_scores, confidence_level=0.95, seed=seed) # eps_min = -# array([[1., 1., 1.], -# [0., 1., 1.], -# [0., 0., 1.]]) +# array([[1. , 0.92621655, 1. ], +# [1. , 1. , 1. ], +# [0.82081635, 0.73048716, 1. ]]) ``` In the example, `eps_min` is now a matrix, containing the $\epsilon_\text{min}$ score between all pairs of models (for the same model, it set to 1 by default). The matrix is always to be read as ASO(row, column). The function applies the bonferroni correction for multiple comparisons by -default, but this can be turned off by using `use_bonferroni=False`. In order to save compute, the above symmetry -property is used as well, but this can also be disabled by `use_symmetry=False`. +default, but this can be turned off by using `use_bonferroni=False`. Lastly, when the `scores` argument is a dictionary and the function is called with `return_df=True`, the resulting matrix is given as a `pandas.DataFrame` for increased readability: @@ -288,6 +306,9 @@ given as a `pandas.DataFrame` for increased readability: ```python import numpy as np from deepsig import multi_aso + +seed = 1234 +np.random.seed(seed) N = 5 # Number of random seeds M = 3 # Number of different models / algorithms @@ -304,14 +325,14 @@ my_models_scores = { # ... # } -eps_min = multi_aso(my_models_scores, confidence_level=0.05, return_df=True) +eps_min = multi_aso(my_models_scores, confidence_level=0.95, return_df=True, seed=seed) # This is now a DataFrame! # eps_min = -# model 1 model 2 model 3 -# model 1 1.0 1.0 1.0 -# model 2 0.0 1.0 1.0 -# model 3 1.0 0.0 1.0 +# model 1 model 2 model 3 +# model 1 1.000000 0.926217 1.0 +# model 2 1.000000 1.000000 1.0 +# model 3 0.820816 0.730487 1.0 ``` @@ -325,7 +346,7 @@ score. Below lists some example snippets reporting the results of scenarios 1 an We compared all pairs of models based on five random seeds each using ASO with a confidence level of $\alpha = 0.05$ (before adjusting for all pair-wise comparisons using the Bonferroni correction). Almost stochastic - dominance ($\epsilon_\text{min} < 0.5)$ is indicated in table X. + dominance ($\epsilon_\text{min} < \tau$ with $\tau = 0.2$) is indicated in table X. ### :control_knobs: Sample size @@ -394,11 +415,11 @@ from deepsig import aso import numpy as np from timeit import timeit -a = np.random.normal(size=5) -b = np.random.normal(size=5) +a = np.random.normal(size=1000) +b = np.random.normal(size=1000) -print(timeit(lambda: aso(a, b, num_jobs=1, show_progress=False), number=5)) # 146.6909574989986 -print(timeit(lambda: aso(a, b, num_jobs=4, show_progress=False), number=5)) # 50.416724971000804 +print(timeit(lambda: aso(a, b, num_jobs=1, show_progress=False), number=5)) # 393.6318126 +print(timeit(lambda: aso(a, b, num_jobs=4, show_progress=False), number=5)) # 139.73514621799995n ``` #### :electric_plug: Compatibility with PyTorch, Tensorflow, Jax & Numpy @@ -448,11 +469,15 @@ as many scores as possible should be collected, especially if the variance betwe Because this is usually infeasible in practice, Bouthilier et al. 
(2020) recommend to **vary all other sources of variation** between runs to obtain the most trustworthy estimate of the "true" performance, such as data shuffling, weight initialization etc. -* `num_samples` and `num_bootstrap_iterations` can be reduced to increase the speed of `aso()`. However, this is not +* `num_bootstrap_iterations` can be reduced to increase the speed of `aso()`. However, this is not recommended as the result of the test will also become less accurate. Technically, $\epsilon_\text{min}$ is a upper bound that becomes tighter with the number of samples and bootstrap iterations (del Barrio et al., 2017). Thus, increasing the number of jobs with `num_jobs` instead is always preferred. +* While we could declare a model stochastically dominant with $\epsilon_\text{min} < 0.5$, we found this to have a comparatively high +Type I error (false positives). Tests [in our paper](https://arxiv.org/pdf/2204.06815.pdf) have shown that a more useful threshold that trades of Type I and + Type II error between different scenarios might be $\tau = 0.2$. + * Bootstrap and permutation-randomization are all non-parametric tests, i.e. they don't make any assumptions about the distribution of our test metric. Nevertheless, they differ in their *statistical power*, which is defined as the probability that the null hypothesis is being rejected given that there is a difference between A and B. In other words, the more powerful @@ -464,7 +489,17 @@ the distribution of our test metric. Nevertheless, they differ in their *statist ### :mortar_board: Cite -If you use the ASO test via `aso()`, please cite the original work: +Using this package in general, please cite the following: + + @article{ulmer2022deep, + title={deep-significance-Easy and Meaningful Statistical Significance Testing in the Age of Neural Networks}, + author={Ulmer, Dennis and Hardmeier, Christian and Frellsen, Jes}, + journal={arXiv preprint arXiv:2204.06815}, + year={2022} + } + + +If you use the ASO test via `aso()` or `multi_aso, please cite the original works: @inproceedings{dror2019deep, author = {Rotem Dror and @@ -485,21 +520,20 @@ If you use the ASO test via `aso()`, please cite the original work: timestamp = {Tue, 28 Jan 2020 10:27:52 +0100}, } -Using this package in general, please cite the following: - - @software{dennis_ulmer_2021_4638709, - author = {Dennis Ulmer}, - title = {{deep-significance: Easy and Better Significance - Testing for Deep Neural Networks}}, - month = mar, - year = 2021, - note = {https://github.com/Kaleidophon/deep-significance}, - publisher = {Zenodo}, - version = {v1.0.0a}, - doi = {10.5281/zenodo.4638709}, - url = {https://doi.org/10.5281/zenodo.4638709} + @incollection{del2018optimal, + title={An optimal transportation approach for assessing almost stochastic order}, + author={Del Barrio, Eustasio and Cuesta-Albertos, Juan A and Matr{\'a}n, Carlos}, + booktitle={The Mathematics of the Uncertain}, + pages={33--44}, + year={2018}, + publisher={Springer} } +For instance, you can write + + In order to compare models, we use the Almost Stochastic Order test \citep{del2018optimal, dror2019deep} as + implemented by \citet{ulmer2022deep}. + ### :medal_sports: Acknowledgements This package was created out of discussions of the [NLPnorth group](https://nlpnorth.github.io/) at the IT University @@ -536,6 +570,9 @@ Dror, Rotem, Shlomov, Segev, and Reichart, Roi. "Deep dominance-how to properly Efron, Bradley, and Robert J. Tibshirani. "An introduction to the bootstrap." CRC press, 1994. 
+Andrew Gelman, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, Donald B Rubin, John +Carlin, Hal Stern, Donald Rubin, and David Dunson. Bayesian data analysis third edition, 2021. + Henderson, Peter, et al. "Deep reinforcement learning that matters." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 32. No. 1. 2018. Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, Tom Goldstein. "Visualizing the Loss Landscape of Neural Nets." NeurIPS 2018: 6391-6401 @@ -544,4 +581,7 @@ Narang, Sharan, et al. "Do Transformer Modifications Transfer Across Implementat Noreen, Eric W. "Computer intensive methods for hypothesis testing: An introduction." Wiley, New York (1989). +Ronald L Wasserstein, Allen L Schirm, and Nicole A Lazar. Moving to a world beyond “p< 0.05”, +2019 + Yuan, Ke‐Hai, and Kentaro Hayashi. "Bootstrap approach to inference and power analysis based on three test statistics for covariance structure models." British Journal of Mathematical and Statistical Psychology 56.1 (2003): 93-110. \ No newline at end of file diff --git a/deepsig/__init__.py b/deepsig/__init__.py index d64b5ce..be84c06 100644 --- a/deepsig/__init__.py +++ b/deepsig/__init__.py @@ -5,5 +5,5 @@ from deepsig.permutation import permutation_test from deepsig.sample_size import aso_uncertainty_reduction, bootstrap_power_analysis -__version__ = "1.2.3" +__version__ = "1.2.5" __author__ = "Dennis Ulmer" diff --git a/deepsig/aso.py b/deepsig/aso.py index a92708d..c67a51e 100644 --- a/deepsig/aso.py +++ b/deepsig/aso.py @@ -8,7 +8,7 @@ from warnings import warn # EXT -from joblib import Parallel, delayed +from joblib import Parallel, delayed, wrap_non_picklable_objects from joblib.externals.loky import set_loky_pickler import numpy as np import pandas as pd @@ -20,9 +20,8 @@ ArrayLike, ScoreCollection, score_pair_conversion, - ALLOWED_TYPES, - CONVERSIONS, ) +from deepsig.utils import _progress_iter, _get_num_models # MISC set_loky_pickler("dill") # Avoid weird joblib error with multi_aso @@ -32,7 +31,8 @@ def aso( scores_a: ArrayLike, scores_b: ArrayLike, - confidence_level: float = 0.05, + confidence_level: float = 0.95, + num_comparisons: int = 1, num_samples: int = 1000, num_bootstrap_iterations: int = 1000, dt: float = 0.005, @@ -60,7 +60,9 @@ def aso( scores_b: List[float] Scores of algorithm B. confidence_level: float - Desired confidence level of test. Set to 0.05 by default. + Desired confidence level of test. Set to 0.95 by default. + num_comparisons: int + Number of comparisons that the test is being used for. Is used to perform a Bonferroni correction. num_samples: int Number of samples from the score distributions during every bootstrap iteration when estimating sigma. num_bootstrap_iterations: int @@ -84,15 +86,15 @@ def aso( assert ( len(scores_a) > 0 and len(scores_b) > 0 ), "Both lists of scores must be non-empty." 
- assert num_samples > 0, "num_samples must be positive, {} found.".format( - num_samples - ) assert ( num_bootstrap_iterations > 0 ), "num_samples must be positive, {} found.".format(num_bootstrap_iterations) assert num_jobs > 0, "Number of jobs has to be at least 1, {} found.".format( num_jobs ) + assert ( + num_comparisons > 0 + ), "Number of comparisons has to be at least 1, {} found.".format(num_comparisons) # TODO: Remove in future version if num_samples != 1000: @@ -101,83 +103,47 @@ def aso( DeprecationWarning, ) - violation_ratio = compute_violation_ratio(scores_a, scores_b, dt) + # TODO: Remove in future version + if confidence_level < 0.95: + warn( + "'confidence_level' was refactored in version 1.2.4 to be more intuitive and usually should be in the .95 -" + f".99 range, but {confidence_level} was found. If you tried to adjust the confidence level for multiple " + f"comparisons, try the new num_comparisons argument instead.", + UserWarning, + ) + + if num_comparisons > 1: + confidence_level += (1 - confidence_level) / num_comparisons + + violation_ratio = compute_violation_ratio( + scores_a=scores_a, scores_b=scores_b, dt=dt + ) # Based on the actual number of samples quantile_func_a = get_quantile_function(scores_a) quantile_func_b = get_quantile_function(scores_b) - def _progress_iter(high: int, progress_bar: tqdm): - """ - This function is used when a shared progress bar is passed from multi_aso() - every time the iterator yields an - element, the progress bar is updated by one. It essentially behaves like a simplified range() function. - - Parameters - ---------- - high: int - Number of elements in iterator. - progress_bar: tqdm - Shared progress bar. - """ - current = 0 - - while current < high: - yield current - current += 1 - progress_bar.update(1) - - # Add progress bar if applicable - if show_progress and _progress_bar is None: - iters = tqdm(range(num_bootstrap_iterations), desc="Bootstrap iterations") - - # Shared progress bar when called from multi_aso() - elif _progress_bar is not None: - iters = _progress_iter(num_bootstrap_iterations, _progress_bar) - - else: - iters = range(num_bootstrap_iterations) - - # Set seeds for different jobs if applicable - # "Sub-seeds" for jobs are just seed argument + job index - seeds = ( - [None] * num_bootstrap_iterations - if seed is None - else [seed + offset for offset in range(1, num_bootstrap_iterations + 1)] + samples = get_bootstrapped_violation_ratios( + scores_a, + scores_b, + quantile_func_a, + quantile_func_b, + num_bootstrap_iterations, + dt, + num_jobs, + show_progress, + seed, + _progress_bar, ) - - def _bootstrap_iter(seed: Optional[int] = None): - """ - One bootstrap iteration. Wrapped in a function so it can be handed to joblib.Parallel. 
- """ - # When running multiple jobs, these modules have to be re-imported for some reason to avoid an error - # Use dir() to check whether module is available in local scope: - # https://stackoverflow.com/questions/30483246/how-to-check-if-a-module-has-been-imported - if "numpy" not in dir() or "deepsig" not in dir(): - import numpy as np - from deepsig.aso import compute_violation_ratio - - if seed is not None: - np.random.seed(seed) - - sampled_scores_a = quantile_func_a(np.random.uniform(0, 1, len(scores_a))) - sampled_scores_b = quantile_func_b(np.random.uniform(0, 1, len(scores_b))) - sample = compute_violation_ratio( - sampled_scores_a, - sampled_scores_b, - dt, - ) - - return sample - - # Initialize worker pool and start iterations - parallel = Parallel(n_jobs=num_jobs) - samples = parallel(delayed(_bootstrap_iter)(seed) for seed, _ in zip(seeds, iters)) + samples = np.array(samples) const = np.sqrt(len(scores_a) * len(scores_b) / (len(scores_a) + len(scores_b))) sigma_hat = np.std(const * (samples - violation_ratio)) # Compute eps_min and make sure it stays in [0, 1] min_epsilon = np.clip( - violation_ratio - (1 / const) * sigma_hat * normal.ppf(confidence_level), 0, 1 + violation_ratio - (1 / const) * sigma_hat * normal.ppf(1 - confidence_level), + 0, + 1, ) return min_epsilon @@ -185,7 +151,7 @@ def _bootstrap_iter(seed: Optional[int] = None): def multi_aso( scores: ScoreCollection, - confidence_level: float = 0.05, + confidence_level: float = 0.95, use_bonferroni: bool = True, use_symmetry: bool = True, num_samples: int = 1000, @@ -207,7 +173,7 @@ def multi_aso( Collection of model scores. Should be either dictionary of model name to model scores, nested Python list, 2D numpy or Jax array, or 2D Tensorflow or PyTorch tensor. confidence_level: float - Desired confidence level of test. Set to 0.05 by default. + Desired confidence level of test. Set to 0.95 by default. use_bonferroni: bool Indicate whether Bonferroni correction should be applied to confidence level in order to adjust for the number of comparisons. Default is True. 
@@ -243,12 +209,28 @@ def multi_aso( DeprecationWarning, ) + # TODO: Remove in future version + if not use_symmetry: + warn( + "'use_symmetry' argument is being ignored in the current version and will be deprecated in version 1.3!", + DeprecationWarning, + ) + + # TODO: Remove in future version + if confidence_level < 0.95: + warn( + "'confidence_level' was refactored in version 1.2.4 to be more intuitive and usually should be in the .95 -" + f".99 range, but {confidence_level} was found.", + UserWarning, + ) + num_models = _get_num_models(scores) num_comparisons = num_models * (num_models - 1) / 2 eps_min = np.eye(num_models) # Initialize score matrix if use_bonferroni: - confidence_level /= num_comparisons + # Increase the confidence level based in oder to mitigate the multiple comparisons problem + confidence_level += (1 - confidence_level) / num_comparisons # Iterate over simple indices or dictionary keys depending on type of scores argument indices = list(range(num_models)) if type(scores) != dict else list(scores.keys()) @@ -266,38 +248,57 @@ def multi_aso( for i, key_i in enumerate(indices): for j, key_j in enumerate(indices[(i + 1) :], start=i + 1): scores_a, scores_b = scores[key_i], scores[key_j] + quantile_func_a = get_quantile_function(scores_a) + quantile_func_b = get_quantile_function(scores_b) + const = np.sqrt( + len(scores_a) * len(scores_b) / (len(scores_a) + len(scores_b)) + ) - eps_min[i, j] = aso( + violation_ratio_ab = compute_violation_ratio( + dt=dt, + quantile_func_a=quantile_func_a, + quantile_func_b=quantile_func_b, + ) + violation_ratio_ba = ( + 1 - violation_ratio_ab + ) # Exploit symmetry of violation ratio here + samples_ab = get_bootstrapped_violation_ratios( scores_a, scores_b, - confidence_level=confidence_level, - num_samples=1000, # TODO: Avoid double warning, remove in future version - num_bootstrap_iterations=num_bootstrap_iterations, - dt=dt, - num_jobs=num_jobs, - show_progress=False, - seed=seed, - _progress_bar=progress_bar, + quantile_func_a, + quantile_func_b, + num_bootstrap_iterations, + dt, + num_jobs, + show_progress, + seed, + progress_bar, + ) + samples_ab = np.array(samples_ab) + + # This quantity is the same for both, so we only have to compute it once, see + # (samples_ab - violation_ratio_ab) + # = (1 - samples_ba - 1 + violation_ratio_ba) + # = (samples_ba - violation_ratio_ba) + sigma_hat = np.std(const * (samples_ab - violation_ratio_ab)) + + # Compute eps_min and make sure it stays in [0, 1] + min_epsilon_ab = np.clip( + violation_ratio_ab + - (1 / const) * sigma_hat * normal.ppf(1 - confidence_level), + 0, + 1, + ) + min_epsilon_ba = np.clip( + violation_ratio_ba + - (1 / const) * sigma_hat * normal.ppf(1 - confidence_level), + 0, + 1, ) - # Use ASO(A, B, alpha) = 1 - ASO(B, A, alpha) - if use_symmetry: - eps_min[j, i] = 1 - eps_min[i, j] - - # Compute ASO(B, A, alpha) separately - else: - eps_min[i, j] = aso( - scores_b, - scores_a, - confidence_level=confidence_level, - num_samples=1000, # TODO: Avoid double warning, remove in future version - num_bootstrap_iterations=num_bootstrap_iterations, - dt=dt, - num_jobs=num_jobs, - show_progress=False, - seed=seed, - _progress_bar=progress_bar, - ) + # Set values + eps_min[i, j] = min_epsilon_ab + eps_min[j, i] = min_epsilon_ba if type(scores) == dict and return_df: eps_min = pd.DataFrame(data=eps_min, index=list(scores.keys())) @@ -306,37 +307,61 @@ def multi_aso( return eps_min -def compute_violation_ratio(scores_a: np.array, scores_b: np.array, dt: float) -> float: +def 
compute_violation_ratio( + scores_a: Optional[np.array] = None, + scores_b: Optional[np.array] = None, + quantile_func_a: Optional[Callable] = None, + quantile_func_b: Optional[Callable] = None, + dt: float = 0.001, +) -> float: """ Compute the violation ration e_W2 (equation 4 + 5). Parameters ---------- - scores_a: List[float] + scores_a: Optional[np.array] Scores of algorithm A. - scores_b: List[float] + scores_b: Optional[np.array] Scores of algorithm B. dt: float Differential for t during integral calculation. + quantile_func_a: Optional[Callable] + Quantile function based on the first set of scores. + quantile_func_b: Optional[Callable] + Quantile function based on the second set of scores. Returns ------- float Return violation ratio. """ - squared_wasserstein_dist = 0 - int_violation_set = 0 # Integral over violation set A_X - quantile_func_a = get_quantile_function(scores_a) - quantile_func_b = get_quantile_function(scores_b) + assert ( + scores_a is not None or quantile_func_a is not None + ), "Either scores or quantile function are required for the first sample, neither found." + + assert ( + scores_b is not None or quantile_func_b is not None + ), "Either scores or quantile function are required for the second sample, neither found." + + if quantile_func_a is None: + quantile_func_a = get_quantile_function(scores_a) + + if quantile_func_b is None: + quantile_func_b = get_quantile_function(scores_b) + + t = np.arange(dt, 1, dt) # Points we integrate over + f = quantile_func_a(t) # F-1(t) + g = quantile_func_b(t) # G-1(t) + diff = g - f + squared_wasserstein_dist = np.sum(diff ** 2 * dt) - for p in np.arange(0, 1, dt): - diff = quantile_func_b(p) - quantile_func_a(p) - squared_wasserstein_dist += (diff ** 2) * dt - int_violation_set += (max(diff, 0) ** 2) * dt + # Now only consider points where stochastic order is being violated and set the rest to 0 + diff[f >= g] = 0 + int_violation_set = np.sum(diff[1:] ** 2 * dt) # Ignore t = 0 since t in (0, 1) if squared_wasserstein_dist == 0: warn("Division by zero encountered in violation ratio.") - violation_ratio = 0 + violation_ratio = 0.5 else: violation_ratio = int_violation_set / squared_wasserstein_dist @@ -361,7 +386,7 @@ def get_quantile_function(scores: np.array) -> Callable: # When running multiple jobs via joblib, numpy has to be re-imported for some reason to avoid an error # Use dir() to check whether module is available in local scope: # https://stackoverflow.com/questions/30483246/how-to-check-if-a-module-has-been-imported - if "numpy" not in dir(): + if "np" not in dir(): import numpy as np def _quantile_function(p: float) -> float: @@ -369,54 +394,100 @@ def _quantile_function(p: float) -> float: num = len(scores) index = int(np.ceil(num * p)) - return cdf[min(num - 1, max(0, index - 1))] + return cdf[np.clip(index - 1, 0, num - 1)] return np.vectorize(_quantile_function) -def _get_num_models(scores: ScoreCollection) -> int: +def get_bootstrapped_violation_ratios( + scores_a: ArrayLike, + scores_b: ArrayLike, + quantile_func_a: Callable, + quantile_func_b: Callable, + num_bootstrap_iterations: int, + dt: float, + num_jobs: int, + show_progress: bool, + seed: Optional[int], + _progress_bar: Optional[tqdm], +) -> List[float]: """ - Retrieve the number of models from a ScoreCollection for multi_aso(). + Retrieve violation ratios computed based on a number of bootstrap samples. Parameters ---------- - scores: ScoreCollection - Collection of model scores. 
Should be either dictionary of model name to model scores, nested Python list, - 2D numpy or Jax array, or 2D Tensorflow or PyTorch tensor. + scores_a: List[float] + Scores of algorithm A. + scores_b: List[float] + Scores of algorithm B. + quantile_func_a: Callable + Quantile function based on the first set of scores. + quantile_func_b: Callable + Quantile function based on the second set of scores. + num_bootstrap_iterations: int + Number of bootstrap iterations when estimating sigma. + dt: float + Differential for t during integral calculation. + num_jobs: int + Number of threads that bootstrap iterations are divided among. + show_progress: bool + Show progress bar. Default is True. + seed: Optional[int] + Set seed for reproducibility purposes. Default is None (meaning no seed is used). + _progress_bar: Optional[tqdm] + Hands over a progress bar object when called by multi_aso(). Only for internal use. Returns ------- - int - Number of models. + List[float] + Bootstrapped violation ratios. """ - # Python dictionary - if isinstance(scores, dict): - if len(scores) < 2: - raise ValueError( - "'scores' argument should contain at least two sets of scores, but only {} found.".format( - len(scores) - ) - ) + # Add progress bar if applicable + if show_progress and _progress_bar is None: + iters = tqdm(range(num_bootstrap_iterations), desc="Bootstrap iterations") - return len(scores) + # Shared progress bar when called from multi_aso() + elif _progress_bar is not None: + iters = _progress_iter(num_bootstrap_iterations, _progress_bar) - # (Nested) python list - elif isinstance(scores, list): - if not isinstance(scores[0], list): - raise TypeError( - "'scores' argument must be nested list of scores when Python lists are used, but elements of type {} " - "found".format(type(scores[0]).__name__) - ) + else: + iters = range(num_bootstrap_iterations) - return len(scores) + # Set seeds for different jobs if applicable + # "Sub-seeds" for jobs are just seed argument + job index + seeds = ( + [None] * num_bootstrap_iterations + if seed is None + else [seed + offset for offset in range(1, num_bootstrap_iterations + 1)] + ) - # Numpy / Jax arrays, Tensorflow / PyTorch tensor - elif type(scores) in ALLOWED_TYPES: - scores = CONVERSIONS[type(scores)](scores) # Convert to numpy array + @wrap_non_picklable_objects + def _bootstrap_iter(seed: Optional[int] = None): + """ + One bootstrap iteration. Wrapped in a function so it can be handed to joblib.Parallel. 
+ """ + # When running multiple jobs, these modules have to be re-imported for some reason to avoid an error + # Use dir() to check whether module is available in local scope: + # https://stackoverflow.com/questions/30483246/how-to-check-if-a-module-has-been-imported + if "numpy" not in dir() or "deepsig" not in dir(): + import numpy as np + from deepsig.aso import compute_violation_ratio - return scores.shape[0] + if seed is not None: + np.random.seed(seed) - raise TypeError( - "Invalid type for 'scores', should be nested Python list, dict, Jax / Numpy array or Tensorflow / PyTorch " - "tensor, '{}' found.".format(type(scores).__name__) - ) + sampled_scores_a = quantile_func_a(np.random.uniform(0, 1, len(scores_a))) + sampled_scores_b = quantile_func_b(np.random.uniform(0, 1, len(scores_b))) + sample = compute_violation_ratio( + scores_a=sampled_scores_a, + scores_b=sampled_scores_b, + dt=dt, + ) + + return sample + + # Initialize worker pool and start iterations + parallel = Parallel(n_jobs=num_jobs) + samples = parallel(delayed(_bootstrap_iter)(seed) for seed, _ in zip(seeds, iters)) + + return samples diff --git a/deepsig/bootstrap.py b/deepsig/bootstrap.py index aec8459..578d0ba 100644 --- a/deepsig/bootstrap.py +++ b/deepsig/bootstrap.py @@ -3,7 +3,11 @@ `(Efron & Tibshirani, 1994) `_. """ +# STD +from typing import Optional + # EXT +from joblib import Parallel, delayed import numpy as np # PKG @@ -12,7 +16,11 @@ @score_pair_conversion def bootstrap_test( - scores_a: ArrayLike, scores_b: ArrayLike, num_samples: int = 1000 + scores_a: ArrayLike, + scores_b: ArrayLike, + num_samples: int = 1000, + num_jobs: int = 1, + seed: Optional[int] = None, ) -> float: """ Implementation of paired bootstrap test. A p-value is being estimated by comparing the mean of scores @@ -26,10 +34,14 @@ def bootstrap_test( ---------- scores_a: ArrayLike Scores of algorithm A. - scores_b: ArrrayLike + scores_b: ArrayLike Scores of algorithm B. num_samples: int Number of bootstrap samples used for estimation. + num_jobs: int + Number of threads that bootstrap iterations are divided among. + seed: Optional[int] + Set seed for reproducibility purposes. Default is None (meaning no seed is used). Returns ------- @@ -46,17 +58,42 @@ def bootstrap_test( N = len(scores_a) delta = np.mean(scores_a) - np.mean(scores_b) - num_larger = 0 - for _ in range(num_samples): + # Set seeds for different jobs if applicable + # "Sub-seeds" for jobs are just seed argument + job index + seeds = ( + [None] * num_samples + if seed is None + else [seed + offset for offset in range(1, num_samples + 1)] + ) + + def _bootstrap_iter(delta: float, seed: Optional[int] = None): + """ + One bootstrap iteration. Wrapped in a function so it can be handed to joblib.Parallel. 
+ """ + # When running multiple jobs, modules have to be re-imported for some reason to avoid an error + # Use dir() to check whether module is available in local scope: + # https://stackoverflow.com/questions/30483246/how-to-check-if-a-module-has-been-imported + if "np" not in dir(): + import numpy as np + + if seed is not None: + np.random.seed(seed) + resampled_scores_a = np.random.choice(scores_a, N) resampled_scores_b = np.random.choice(scores_b, N) new_delta = np.mean(resampled_scores_a - resampled_scores_b) - if new_delta >= 2 * delta: - num_larger += 1 + return int(new_delta >= 2 * delta) + + # Initialize worker pool and start iterations + parallel = Parallel(n_jobs=num_jobs) + samples = parallel( + delayed(_bootstrap_iter)(delta, seed) + for _, seed in zip(range(num_samples), seeds) + ) - p_value = num_larger / num_samples + p_value = sum(samples) / num_samples return p_value diff --git a/deepsig/permutation.py b/deepsig/permutation.py index 2cc66bb..2cebb20 100644 --- a/deepsig/permutation.py +++ b/deepsig/permutation.py @@ -2,7 +2,11 @@ Implementation of paired sign test. """ +# STD +from typing import Optional + # EXT +from joblib import Parallel, delayed import numpy as np # PKG @@ -11,7 +15,11 @@ @score_pair_conversion def permutation_test( - scores_a: ArrayLike, scores_b: ArrayLike, num_samples: int = 1000 + scores_a: ArrayLike, + scores_b: ArrayLike, + num_samples: int = 1000, + num_jobs: int = 1, + seed: Optional[int] = None, ) -> float: """ Implementation of a permutation-randomization test. Scores of A and B will be randomly swapped and the difference @@ -28,6 +36,10 @@ def permutation_test( Scores of algorithm B. num_samples: int Number of permutations used for estimation. + num_jobs: int + Number of threads that bootstrap iterations are divided among. + seed: Optional[int] + Set seed for reproducibility purposes. Default is None (meaning no seed is used). Returns ------- @@ -44,11 +56,28 @@ def permutation_test( N = len(scores_a) delta = np.mean(scores_a - scores_b) - num_larger = 0 - # Do the permutations - for _ in range(num_samples): - # Swap entries of a and b with 50 % probability + # Set seeds for different jobs if applicable + # "Sub-seeds" for jobs are just seed argument + job index + seeds = ( + [None] * num_samples + if seed is None + else [seed + offset for offset in range(1, num_samples + 1)] + ) + + def _bootstrap_iter(delta: float, seed: Optional[int] = None): + """ + One bootstrap iteration. Wrapped in a function so it can be handed to joblib.Parallel. 
+ """ + # When running multiple jobs, modules have to be re-imported for some reason to avoid an error + # Use dir() to check whether module is available in local scope: + # https://stackoverflow.com/questions/30483246/how-to-check-if-a-module-has-been-imported + if "np" not in dir(): + import numpy as np + + if seed is not None: + np.random.seed(seed) + swapped_a, swapped_b = zip( *[ (scores_a[i], scores_b[i]) @@ -59,9 +88,15 @@ def permutation_test( ) swapped_a, swapped_b = np.array(swapped_a), np.array(swapped_b) - if np.mean(swapped_a - swapped_b) >= delta: - num_larger += 1 + return int(np.mean(swapped_a - swapped_b) >= delta) + + # Initialize worker pool and start iterations + parallel = Parallel(n_jobs=num_jobs) + samples = parallel( + delayed(_bootstrap_iter)(delta, seed) + for _, seed in zip(range(num_samples), seeds) + ) - p_value = (num_larger + 1) / (num_samples + 1) + p_value = (sum(samples) + 1) / (num_samples + 1) return p_value diff --git a/deepsig/sample_size.py b/deepsig/sample_size.py index e395ea2..9c79a0a 100644 --- a/deepsig/sample_size.py +++ b/deepsig/sample_size.py @@ -8,7 +8,7 @@ # EXT import numpy as np -from scipy.stats import ttest_ind +from scipy.stats import ttest_rel from tqdm import tqdm # PROJECT @@ -115,8 +115,8 @@ def bootstrap_power_analysis( # Set default significance test to Welch's t-test if significance_test is None: - significance_test = lambda scores_a, scores_b: ttest_ind( - scores_a, scores_b, equal_var=False, alternative="greater" + significance_test = lambda scores_a, scores_b: ttest_rel( + scores_a, scores_b, alternative="greater" ).pvalue iters = ( diff --git a/deepsig/tests/test_aso.py b/deepsig/tests/test_aso.py index 8f00f79..4bb0cf0 100644 --- a/deepsig/tests/test_aso.py +++ b/deepsig/tests/test_aso.py @@ -3,6 +3,7 @@ """ # STD +from itertools import product import unittest # EXT @@ -11,7 +12,7 @@ import tensorflow as tf # import jax.numpy as jnp -from scipy.stats import wasserstein_distance, pearsonr +from scipy.stats import wasserstein_distance, pearsonr, norm, laplace, rayleigh # PKG from deepsig.aso import ( @@ -42,21 +43,38 @@ def test_assertions(self): aso([3, 4], []) with self.assertRaises(AssertionError): - aso([1, 2, 3], [3, 4, 5], num_samples=-1, show_progress=False) + aso([1, 2, 3], [3, 4, 5], num_bootstrap_iterations=-1, show_progress=False) with self.assertRaises(AssertionError): - aso([1, 2, 3], [3, 4, 5], num_samples=0, show_progress=False) + aso([1, 2, 3], [3, 4, 5], num_bootstrap_iterations=0, show_progress=False) with self.assertRaises(AssertionError): - aso([1, 2, 3], [3, 4, 5], num_bootstrap_iterations=-1, show_progress=False) + aso([1, 2, 3], [3, 4, 5], num_jobs=0, show_progress=False) + def test_argument_combos(self): + """ + Try different combinations of inputs arguments for compute_violation_ratio(). 
+ """ + scores_a = np.random.normal(size=5) + scores_b = np.random.normal(size=5) + quantile_func_a = norm.ppf + quantile_func_b = norm.ppf + + # All of these should work + for kwarg1, kwarg2 in product( + [{"scores_a": scores_a}, {"quantile_func_a": quantile_func_a}], + [{"scores_b": scores_b}, {"quantile_func_b": quantile_func_b}], + ): + compute_violation_ratio(**{**kwarg1, **kwarg2}) + + # These should create errors with self.assertRaises(AssertionError): - aso([1, 2, 3], [3, 4, 5], num_bootstrap_iterations=0, show_progress=False) + compute_violation_ratio(scores_a=scores_a, quantile_func_a=quantile_func_a) with self.assertRaises(AssertionError): - aso([1, 2, 3], [3, 4, 5], num_jobs=0, show_progress=False) + compute_violation_ratio(scores_b=scores_b, quantile_func_b=quantile_func_b) - def test_compute_violation_ratio(self): + def test_compute_violation_ratio_correlation(self): """ Test whether violation ratio is being computed correctly. """ @@ -81,6 +99,61 @@ def test_compute_violation_ratio(self): rho, _ = pearsonr(violation_ratios, inv_sqw_dists) self.assertGreaterEqual(rho, 0.85) + def test_compute_violation_ratio_exact(self): + """ + Test the value of the violation ratio given some exact CDFs. + """ + test_dists = [ + ( + np.random.normal, + norm.ppf, + {"loc": 0.275, "scale": 1.5}, + {"loc": 0.25, "scale": 1}, + ), + ( + np.random.laplace, + laplace.ppf, + {"loc": 0.275, "scale": 1.5}, + {"loc": 0.25, "scale": 1}, + ), + (np.random.rayleigh, rayleigh.ppf, {"scale": 1.05}, {"scale": 1}), + ] + + for sample_func, ppf, params_a, params_b in test_dists: + quantile_func_a = lambda x: ppf(x, **params_a) + quantile_func_b = lambda x: ppf(x, **params_b) + violation_ratio_ab_exact = compute_violation_ratio( + quantile_func_a=quantile_func_a, quantile_func_b=quantile_func_b + ) + violation_ratio_ba_exact = compute_violation_ratio( + quantile_func_a=quantile_func_b, quantile_func_b=quantile_func_a + ) + + samples_a = sample_func(size=self.num_samples, **params_a) + samples_b = sample_func(size=self.num_samples, **params_b) + violation_ratio_ab_sampled = compute_violation_ratio( + scores_a=samples_a, scores_b=samples_b + ) + violation_ratio_ba_sampled = compute_violation_ratio( + scores_a=samples_b, scores_b=samples_a + ) + + # Check symmetries + self.assertAlmostEqual( + violation_ratio_ab_exact, 1 - violation_ratio_ba_exact, delta=0.05 + ) + self.assertAlmostEqual( + violation_ratio_ab_sampled, 1 - violation_ratio_ba_sampled, delta=0.05 + ) + + # Check closeness to exact value + self.assertAlmostEqual( + violation_ratio_ab_exact, violation_ratio_ab_sampled, delta=0.05 + ) + self.assertAlmostEqual( + violation_ratio_ba_exact, violation_ratio_ba_sampled, delta=0.05 + ) + def test_get_quantile_function(self): """ Test whether quantile function is working correctly. 
Values for normal distribution taken from @@ -150,9 +223,9 @@ class MultiASOTests(unittest.TestCase): def setUp(self) -> None: self.aso_kwargs = { - "num_samples": 100, "num_bootstrap_iterations": 100, "num_jobs": 4, + "show_progress": False, } self.num_models = 3 self.num_seeds = 100 @@ -164,13 +237,6 @@ def setUp(self) -> None: self.scores_dict = { "model{}".format(i): scores for i, scores in enumerate(self.scores) } - # Test case based on https://github.com/Kaleidophon/deep-significance/issues/7 - self.mikes_scores_dict = { - "x": np.array([59.13, 58.03, 59.18, 58.78, 58.5]), - "y": np.array([58.13, 59.19, 59.94, 60.08, 59.85]), - "z": np.array([58.77, 58.86, 59.58, 59.59, 59.64]), - "w": np.array([58.16, 58.49, 59.87, 58.94, 58.96]), - } self.scores_numpy = np.array(self.scores) self.scores_torch = torch.from_numpy(self.scores_numpy) self.scores_tensorflow = tf.convert_to_tensor(self.scores_numpy) @@ -201,59 +267,6 @@ def test_bonferroni_correction(self): ) self.assertTrue(np.all(corrected_scores >= uncorrected_scores)) - def test_symmetry(self): - """ - Test flag that toggles the use of the symmetry property. - """ - seed = 4321 - asymmetric_scores = multi_aso( - self.scores_numpy, seed=seed, use_symmetry=False, **self.aso_kwargs - ) - symmetric_scores = multi_aso(self.scores_numpy, seed=seed, **self.aso_kwargs) - - self.assertTrue( - np.all( - np.tril(symmetric_scores, -1) == np.tril((1 - symmetric_scores).T, -1) - ) - ) - self.assertTrue( - np.any( - np.tril(asymmetric_scores, -1) == np.tril((1 - asymmetric_scores).T, -1) - ) - ) - self.assertTrue( - np.all(np.diag(symmetric_scores) == 1) - ) # Check all diagonals to be one - self.assertTrue( - np.all(np.diag(asymmetric_scores) == 1) - ) # Check all diagonals to be one - - # Cover Mike's test case: https://github.com/Kaleidophon/deep-significance/issues/7 - mikes_asymmetric_scores = multi_aso( - self.mikes_scores_dict, seed=seed, use_symmetry=False, **self.aso_kwargs - ) - mikes_symmetric_scores = multi_aso( - self.mikes_scores_dict, seed=seed, **self.aso_kwargs - ) - self.assertTrue( - np.all( - np.tril(mikes_symmetric_scores, -1) - == np.tril((1 - mikes_symmetric_scores).T, -1) - ) - ) - self.assertTrue( - np.any( - np.tril(mikes_asymmetric_scores, -1) - == np.tril((1 - mikes_asymmetric_scores).T, -1) - ) - ) - self.assertTrue( - np.all(np.diag(mikes_symmetric_scores) == 1) - ) # Check all diagonals to be one - self.assertTrue( - np.all(np.diag(mikes_asymmetric_scores) == 1) - ) # Check all diagonals to be one - def test_result_df(self): """ Test the creation of a results DataFrame. @@ -312,9 +325,9 @@ def test_extreme_cases(self): ) self.assertAlmostEqual(eps_min2, 0, delta=0.01) - def test_dependency_on_alpha(self): + def test_dependency_on_confidence_level(self): """ - Make sure that the minimum epsilon threshold increases as we increase the confidence level. + Make sure that the minimum epsilon threshold decreases as we increase the confidence level. 
""" samples_normal1 = np.random.normal( loc=0.1, size=self.num_samples @@ -325,11 +338,11 @@ def test_dependency_on_alpha(self): min_epsilons = [] seed = 6666 - for alpha in np.arange(0.8, 0.1, -0.1): + for confidence_level in np.arange(0.1, 0.8, 0.1): min_eps = aso( samples_normal1, samples_normal2, - confidence_level=alpha, + confidence_level=confidence_level, num_bootstrap_iterations=100, show_progress=False, num_jobs=4, @@ -340,65 +353,3 @@ def test_dependency_on_alpha(self): self.assertEqual( list(sorted(min_epsilons)), min_epsilons ) # Make sure min_epsilon decreases - - def test_dependency_on_samples(self): - """ - Make sure that the minimum epsilon threshold decreases as we increase the number of samples. - """ - min_epsilons = [] - seed = 7890 - - for num_samples in [80, 1000, 8000]: - samples_normal2 = np.random.normal( - loc=0, scale=1.1, size=num_samples - ) # Scores for algorithm B - samples_normal1 = samples_normal2 + 1e-3 - - min_eps = aso( - samples_normal1, - samples_normal2, - num_bootstrap_iterations=100, - show_progress=False, - num_jobs=4, - seed=seed, - ) - min_epsilons.append(min_eps) - - self.assertEqual( - list(sorted(min_epsilons, reverse=True)), min_epsilons - ) # Make sure min_epsilon decreases - - def test_symmetry(self): - """ - Test whether ASO(A, B, alpha) = 1 - ASO(B, A, alpha) holds. - """ - parameters = [ - ((0, 0.5), (0, 1)), - ((-0.5, 0.1), (-0.6, 0.2)), - ((0.5, 0.21), (0.7, 0.1)), - ((0.1, 0.3), (0.2, 0.1)), - ] - - for (loc1, scale1), (loc2, scale2) in parameters: - samples_normal1 = np.random.normal( - loc=loc1, scale=scale1, size=2000 - ) # New scores for algorithm A - samples_normal2 = np.random.normal( - loc=loc2, scale=scale2, size=2000 - ) # Scores for algorithm B - - eps_min1 = aso( - samples_normal1, - samples_normal2, - show_progress=True, # Show progress so travis CI build doesn't time out - num_jobs=4, - num_bootstrap_iterations=1000, - ) - eps_min2 = aso( - samples_normal2, - samples_normal1, - show_progress=True, # Show progress so travis CI build doesn't time out - num_jobs=4, - num_bootstrap_iterations=1000, - ) - self.assertAlmostEqual(eps_min1, 1 - eps_min2, delta=0.2) diff --git a/deepsig/utils.py b/deepsig/utils.py new file mode 100644 index 0000000..eb930a4 --- /dev/null +++ b/deepsig/utils.py @@ -0,0 +1,77 @@ +""" +Module comprising test-unrelated utility functions. +""" + +# EXT +from tqdm import tqdm + +# PKG +from deepsig.conversion import ScoreCollection, ALLOWED_TYPES, CONVERSIONS + + +def _progress_iter(high: int, progress_bar: tqdm): + """ + This function is used when a shared progress bar is passed from multi_aso() - every time the iterator yields an + element, the progress bar is updated by one. It essentially behaves like a simplified range() function. + + Parameters + ---------- + high: int + Number of elements in iterator. + progress_bar: tqdm + Shared progress bar. + """ + current = 0 + + while current < high: + yield current + current += 1 + progress_bar.update(1) + + +def _get_num_models(scores: ScoreCollection) -> int: + """ + Retrieve the number of models from a ScoreCollection for multi_aso(). + + Parameters + ---------- + scores: ScoreCollection + Collection of model scores. Should be either dictionary of model name to model scores, nested Python list, + 2D numpy or Jax array, or 2D Tensorflow or PyTorch tensor. + + Returns + ------- + int + Number of models. 
+ """ + # Python dictionary + if isinstance(scores, dict): + if len(scores) < 2: + raise ValueError( + "'scores' argument should contain at least two sets of scores, but only {} found.".format( + len(scores) + ) + ) + + return len(scores) + + # (Nested) python list + elif isinstance(scores, list): + if not isinstance(scores[0], list): + raise TypeError( + "'scores' argument must be nested list of scores when Python lists are used, but elements of type {} " + "found".format(type(scores[0]).__name__) + ) + + return len(scores) + + # Numpy / Jax arrays, Tensorflow / PyTorch tensor + elif type(scores) in ALLOWED_TYPES: + scores = CONVERSIONS[type(scores)](scores) # Convert to numpy array + + return scores.shape[0] + + raise TypeError( + "Invalid type for 'scores', should be nested Python list, dict, Jax / Numpy array or Tensorflow / PyTorch " + "tensor, '{}' found.".format(type(scores).__name__) + ) diff --git a/docs/README_DOCS.md b/docs/README_DOCS.md index b344874..61cb439 100644 --- a/docs/README_DOCS.md +++ b/docs/README_DOCS.md @@ -47,14 +47,15 @@ Reinforcement Learning (Henderson et al., 2018) and Computer Vision (Borji, 2017 To help mitigate this problem, this package supplies fully-tested re-implementations of useful functions for significance testing: -* Statistical Significance tests such as Almost Stochastic Order (Dror et al., 2019), bootstrap (Efron & Tibshirani, 1994) and - permutation-randomization (Noreen, 1989). +* Statistical Significance tests such as Almost Stochastic Order (del Barrio et al, 2017; Dror et al., 2019), + bootstrap (Efron & Tibshirani, 1994) and permutation-randomization (Noreen, 1989). * Bonferroni correction methods for multiplicity in datasets (Bonferroni, 1936). * Bootstrap power analysis (Yuan & Hayashi, 2003) and other functions to determine the right sample size. All functions are fully tested and also compatible with common deep learning data structures, such as PyTorch / Tensorflow tensors as well as NumPy and Jax arrays. For examples about the usage, consult the documentation -[here](https://deep-significance.readthedocs.io/en/latest/) or the scenarios in the section [Examples](#examples). +[here](https://deep-significance.readthedocs.io/en/latest/) , the scenarios in the section [Examples](#examples) or +the [demo Jupyter notebook](https://github.com/Kaleidophon/deep-significance/tree/main/paper/deep-significance%20demo.ipynb). ## |:inbox_tray:| Installation @@ -74,46 +75,51 @@ Another option is to clone the repository and install the package locally: --- **tl;dr**: Use `aso()` to compare scores for two models. If the returned `eps_min < 0.5`, A is better than B. The lower -`eps_min`, the more confident the result. +`eps_min`, the more confident the result (we recommend to check `eps_min < 0.2` and record `eps_min` alongside +experimental results). |:warning:| Testing models with only one set of hyperparameters and only one test set will be able to guarantee superiority in all settings. See [General Recommendations & other notes](#general-recommendations). --- -In the following, I will lay out three scenarios that describe common use cases for ML practitioners and how to apply +In the following, we will lay out three scenarios that describe common use cases for ML practitioners and how to apply the methods implemented in this package accordingly. 
For an introduction into statistical hypothesis testing, please refer to resources such as [this blog post](https://machinelearningmastery.com/statistical-hypothesis-tests/) for a general overview or [Dror et al. (2018)](https://www.aclweb.org/anthology/P18-1128.pdf) for a NLP-specific point of view. -In general, in statistical significance testing, we usually compare two algorithms and on a dataset using -some evaluation metric (we assume a higher = better). The difference between the two algorithms on the -data is then defined as - -
$\delta(X) = \mathcal{M}(A, X) - \mathcal{M}(B, X)$
- -where is our test statistic. We then test the following **null hypothesis**: - -
$H_0: \delta(X) \leq 0$
- -Thus, we assume our algorithm A to be equally as good or worse than algorithm B and reject the null hypothesis if A -is better than B (what we actually would like to see). Most statistical significance tests operate using -*p-values*, which define the probability that under the null-hypothesis, the expected by the test is larger than or -equal to the observed difference (that is, for a one-sided test, i.e. we assume A to be better than B): - -
$p\big(\delta(X) \geq \delta_\text{observed} \mid H_0\big)$
- -We can interpret this equation as follows: Assuming that A is *not* better than B, the test assumes a corresponding distribution -of differences that is drawn from. How does our actually observed difference fit in there? -This is what the p-value is expressing: If this probability is high, is in line with what we expected under -the null hypothesis, so we conclude A not to better than B. If the -probability is low, that means that is quite unlikely under the null hypothesis and that the reverse -case is more likely - i.e. that it is -likely *larger* than - and we conclude that A is indeed better than B. Note that **the p-value does not -express whether the null hypothesis is true**. - -To decide when we trust A to be better than B, we set a threshold that will determine when the p-value is small enough -for us to reject the null hypothesis, this is called the significance level and it is often set to be 0.05. +We assume that we have two sets of scores we would like to compare, and , +for instance obtained by running two models and multiple times with a different random seed. +We can then define a one-sided test statistic based on the gathered observations. +An example of such test statistics is for instance the difference in observation means. We then formulate the following null-hypothesis: + +
$H_0: \delta(\mathbb{S}_A, \mathbb{S}_B) \leq 0$
+ +That means that we actually assume the opposite of our desired case, namely that is not better than , +but equally as good or worse, as indicated by the value of the test statistic. +Usually, the goal becomes to reject this null hypothesis using the SST. +*p*-value testing is a frequentist method in the realm of SST. +It introduces the notion of data that *could have been observed* if we were to repeat our experiment again using +the same conditions, which we will write with superscript in order to distinguish them from our actually +observed scores (Gelman et al., 2021). +We then define the *p*-value as the probability that, under the null hypothesis, the test statistic using replicated +observation is larger than or equal to the *observed* test statistic: + +
$p\big(\delta(\mathbb{S}_A^\text{rep}, \mathbb{S}_B^\text{rep}) \geq \delta(\mathbb{S}_A, \mathbb{S}_B) \mid H_0\big)$
+ +We can interpret this expression as follows: Assuming that is not better than , the test +assumes a corresponding distribution of statistics that is drawn from. So how does the observed test statistic + fit in here? This is what the -value expresses: When the +probability is high, is in line with what we expected under the +null hypothesis, so we can *not* reject the null hypothesis, or in other words, we \emph{cannot} conclude + to be better than . If the probability is low, that means that the observed + is quite unlikely under the null hypothesis and that the reverse case is +more likely - i.e. that it is likely larger than - and we conclude that is indeed better than +. Note that **the -value does not express whether the null hypothesis is true**. To make our decision +about whether or not to reject the null hypothesis, we typically determine a threshold - the significance level +, often set to 0.05 - that the *p*-value has to fall below. However, it has been argued that a better practice +involves reporting the *p*-value alongside the results without a pidgeonholing of results into significant and non-significant +(Wasserstein et al., 2019). ### Intermezzo: Almost Stochastic Order - a better significance test for Deep Neural Networks @@ -121,8 +127,8 @@ for us to reject the null hypothesis, this is called the significance level , we know the algorithm A to score higher. However, in practice these cases are rarely so clear-cut (imagine e.g. two normal distributions with the same mean but different variances). -For this reason, Dror et al. (2019) consider the notion of *almost stochastic dominance* by quantifying the extent to -which stochastic order is being violated (red area): +For this reason, del Barrio et al. (2017) and Dror et al. (2019) consider the notion of *almost stochastic dominance* +by quantifying the extent to which stochastic order is being violated (red area): ![](img/aso.png) -ASO returns a value , which expresses the amount of violation of stochastic order. If -, A is stochastically dominant over B in more cases than vice versa, then the corresponding algorithm can be declared as +ASO returns a value , which expresses (an upper bound to) the amount of violation of stochastic order. If + (where \tau is 0.5 or less), A is stochastically dominant over B in more cases than vice versa, then the corresponding algorithm can be declared as superior. We can also interpret as a *confidence score*. The lower it is, the more sure we can be that A is better than B. Note: **ASO does not compute p-values.** Instead, the null hypothesis formulated as -
$H_0: \epsilon_\text{min} \geq 0.5$
+
$H_0: \epsilon_\text{min} \geq \tau$
-If we want to be more confident about the result of ASO, we can also set the rejection threshold to be lower than 0.5. +If we want to be more confident about the result of ASO, we can also set the rejection threshold to be lower than 0.5 +(see the discussion in [this section](#general-recommendations)). Furthermore, the significance level is determined as an input argument when running ASO and actively influence the resulting . @@ -159,12 +166,15 @@ We can now simply apply the ASO test: import numpy as np from deepsig import aso +seed = 1234 +np.random.seed(seed) + # Simulate scores N = 5 # Number of random seeds my_model_scores = np.random.normal(loc=0.9, scale=0.8, size=N) baseline_scores = np.random.normal(loc=0, scale=1, size=N) -min_eps = aso(my_model_scores, baseline_scores) # min_eps = 0.0, so A is better +min_eps = aso(my_model_scores, baseline_scores, seed=seed) # min_eps = 0.225, so A is better ``` Note that ASO **does not make any assumptions about the distributions of the scores**. @@ -185,6 +195,9 @@ which corresponds to the Bonferroni correction (Bonferroni et al., 1936): import numpy as np from deepsig import aso +seed = 1234 +np.random.seed(seed) + # Simulate scores for three datasets M = 3 # Number of datasets N = 5 # Number of random seeds @@ -192,8 +205,8 @@ my_model_scores_per_dataset = [np.random.normal(loc=0.3, scale=0.8, size=N) for baseline_scores_per_dataset = [np.random.normal(loc=0, scale=1, size=N) for _ in range(M)] # epsilon_min values with Bonferroni correction -eps_min = [aso(a, b, confidence_level=0.05 / M) for a, b in zip(my_model_scores_per_dataset, baseline_scores_per_dataset)] -# eps_min = [0.1565800030782686, 1, 0.0] +eps_min = [aso(a, b, confidence_level=0.95, num_comparisons=M, seed=seed) for a, b in zip(my_model_scores_per_dataset, baseline_scores_per_dataset)] +# eps_min = [0.006370113450148568, 0.6534772728574852, 0.0] ``` ### Scenario 3 - Comparing sample-level scores @@ -212,6 +225,9 @@ from itertools import product import numpy as np from deepsig import aso +seed = 1234 +np.random.seed(seed) + # Simulate scores for three datasets M = 40 # Number of data points N = 3 # Number of random seeds @@ -220,7 +236,9 @@ baseline_scored_samples_per_run = [np.random.normal(loc=0, scale=1, size=M) for pairs = list(product(my_model_scored_samples_per_run, baseline_scored_samples_per_run)) # epsilon_min values with Bonferroni correction -eps_min = [aso(a, b, confidence_level=0.05 / len(pairs)) for a, b in pairs] +eps_min = [aso(a, b, confidence_level=0.95, num_comparisons=len(pairs), seed=seed) for a, b in pairs] +# eps_min = [0.3831678636198528, 0.07194780234194881, 0.9152792807128325, 0.5273463008857844, 0.14946944524461184, 1.0, +# 0.6099543280369378, 0.22387448804041898, 1.0] ``` ### Scenario 4 - Comparing more than two models @@ -249,6 +267,9 @@ Let's look at an example: ```python import numpy as np from deepsig import multi_aso + +seed = 1234 +np.random.seed(seed) N = 5 # Number of random seeds M = 3 # Number of different models / algorithms @@ -257,20 +278,19 @@ M = 3 # Number of different models / algorithms # Here, we will sample from N(0.1, 0.8), N(0.15, 0.8), N(0.2, 0.8) my_models_scores = np.array([np.random.normal(loc=loc, scale=0.8, size=N) for loc in np.arange(0.1, 0.1 + 0.05 * M, step=0.05)]) -eps_min = multi_aso(my_models_scores, confidence_level=0.05) +eps_min = multi_aso(my_models_scores, confidence_level=0.95, seed=seed) # eps_min = -# array([[1., 1., 1.], -# [0., 1., 1.], -# [0., 0., 1.]]) +# array([[1. , 0.92621655, 1. ], +# [1. , 1. 
, 1. ], +# [0.82081635, 0.73048716, 1. ]]) ``` In the example, `eps_min` is now a matrix, containing the score between all pairs of models (for the same model, it set to 1 by default). The matrix is always to be read as ASO(row, column). The function applies the bonferroni correction for multiple comparisons by -default, but this can be turned off by using `use_bonferroni=False`. In order to save compute, the above symmetry -property is used as well, but this can also be disabled by `use_symmetry=False`. +default, but this can be turned off by using `use_bonferroni=False`. Lastly, when the `scores` argument is a dictionary and the function is called with `return_df=True`, the resulting matrix is given as a `pandas.DataFrame` for increased readability: @@ -278,6 +298,9 @@ given as a `pandas.DataFrame` for increased readability: ```python import numpy as np from deepsig import multi_aso + +seed = 1234 +np.random.seed(seed) N = 5 # Number of random seeds M = 3 # Number of different models / algorithms @@ -294,14 +317,14 @@ my_models_scores = { # ... # } -eps_min = multi_aso(my_models_scores, confidence_level=0.05, return_df=True) +eps_min = multi_aso(my_models_scores, confidence_level=0.95, return_df=True, seed=seed) # This is now a DataFrame! # eps_min = -# model 1 model 2 model 3 -# model 1 1.0 1.0 1.0 -# model 2 0.0 1.0 1.0 -# model 3 1.0 0.0 1.0 +# model 1 model 2 model 3 +# model 1 1.000000 0.926217 1.0 +# model 2 1.000000 1.000000 1.0 +# model 3 0.820816 0.730487 1.0 ``` @@ -315,7 +338,7 @@ score. Below lists some example snippets reporting the results of scenarios 1 an We compared all pairs of models based on five random seeds each using ASO with a confidence level of $\alpha = 0.05$ (before adjusting for all pair-wise comparisons using the Bonferroni correction). Almost stochastic - dominance ($\epsilon_\text{min} < 0.5)$ is indicated in table X. + dominance ($\epsilon_\text{min} < \tau$ with $\tau = 0.2$) is indicated in table X. ### |:control_knobs:| Sample size @@ -384,11 +407,11 @@ from deepsig import aso import numpy as np from timeit import timeit -a = np.random.normal(size=5) -b = np.random.normal(size=5) +a = np.random.normal(size=1000) +b = np.random.normal(size=1000) -print(timeit(lambda: aso(a, b, num_jobs=1, show_progress=False), number=5)) # 146.6909574989986 -print(timeit(lambda: aso(a, b, num_jobs=4, show_progress=False), number=5)) # 50.416724971000804 +print(timeit(lambda: aso(a, b, num_jobs=1, show_progress=False), number=5)) # 393.6318126 +print(timeit(lambda: aso(a, b, num_jobs=4, show_progress=False), number=5)) # 139.73514621799995n ``` #### |:electric_plug:| Compatibility with PyTorch, Tensorflow, Jax & Numpy @@ -438,11 +461,15 @@ as many scores as possible should be collected, especially if the variance betwe Because this is usually infeasible in practice, Bouthilier et al. (2020) recommend to **vary all other sources of variation** between runs to obtain the most trustworthy estimate of the "true" performance, such as data shuffling, weight initialization etc. -* `num_samples` and `num_bootstrap_iterations` can be reduced to increase the speed of `aso()`. However, this is not +* `num_bootstrap_iterations` can be reduced to increase the speed of `aso()`. However, this is not recommended as the result of the test will also become less accurate. Technically, is a upper bound that becomes tighter with the number of samples and bootstrap iterations (del Barrio et al., 2017). Thus, increasing the number of jobs with `num_jobs` instead is always preferred. 
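As a rough sketch of this trade-off (the argument values below are purely illustrative, not recommended settings):

```python
import numpy as np
from deepsig import aso

scores_a = np.random.normal(loc=0.2, size=10)
scores_b = np.random.normal(loc=0.0, size=10)

# Faster but less accurate: fewer bootstrap iterations
eps_min_quick = aso(scores_a, scores_b, num_bootstrap_iterations=200, show_progress=False, seed=123)

# Preferred: keep the default number of bootstrap iterations and parallelize across jobs instead
eps_min = aso(scores_a, scores_b, num_jobs=4, show_progress=False, seed=123)
```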
+* While we could declare a model stochastically dominant with , we found this to have a comparatively high +Type I error (false positives). Tests [in our paper](https://arxiv.org/pdf/2204.06815.pdf) have shown that a more useful threshold that trades of Type I and + Type II error between different scenarios might be . + * Bootstrap and permutation-randomization are all non-parametric tests, i.e. they don't make any assumptions about the distribution of our test metric. Nevertheless, they differ in their *statistical power*, which is defined as the probability that the null hypothesis is being rejected given that there is a difference between A and B. In other words, the more powerful @@ -454,7 +481,17 @@ the distribution of our test metric. Nevertheless, they differ in their *statist ### |:mortar_board:| Cite -If you use the ASO test via `aso()`, please cite the original work: +Using this package in general, please cite the following: + + @article{ulmer2022deep, + title={deep-significance-Easy and Meaningful Statistical Significance Testing in the Age of Neural Networks}, + author={Ulmer, Dennis and Hardmeier, Christian and Frellsen, Jes}, + journal={arXiv preprint arXiv:2204.06815}, + year={2022} + } + + +If you use the ASO test via `aso()` or `multi_aso, please cite the original works: @inproceedings{dror2019deep, author = {Rotem Dror and @@ -475,21 +512,20 @@ If you use the ASO test via `aso()`, please cite the original work: timestamp = {Tue, 28 Jan 2020 10:27:52 +0100}, } -Using this package in general, please cite the following: - - @software{dennis_ulmer_2021_4638709, - author = {Dennis Ulmer}, - title = {{deep-significance: Easy and Better Significance - Testing for Deep Neural Networks}}, - month = mar, - year = 2021, - note = {https://github.com/Kaleidophon/deep-significance}, - publisher = {Zenodo}, - version = {v1.0.0a}, - doi = {10.5281/zenodo.4638709}, - url = {https://doi.org/10.5281/zenodo.4638709} + @incollection{del2018optimal, + title={An optimal transportation approach for assessing almost stochastic order}, + author={Del Barrio, Eustasio and Cuesta-Albertos, Juan A and Matr{\'a}n, Carlos}, + booktitle={The Mathematics of the Uncertain}, + pages={33--44}, + year={2018}, + publisher={Springer} } +For instance, you can write + + In order to compare models, we use the Almost Stochastic Order test \citep{del2018optimal, dror2019deep} as + implemented by \citet{ulmer2022deep}. + ### |:medal_sports:| Acknowledgements This package was created out of discussions of the [NLPnorth group](https://nlpnorth.github.io/) at the IT University @@ -526,6 +562,9 @@ Dror, Rotem, Shlomov, Segev, and Reichart, Roi. "Deep dominance-how to properly Efron, Bradley, and Robert J. Tibshirani. "An introduction to the bootstrap." CRC press, 1994. +Andrew Gelman, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, Donald B Rubin, John +Carlin, Hal Stern, Donald Rubin, and David Dunson. Bayesian data analysis third edition, 2021. + Henderson, Peter, et al. "Deep reinforcement learning that matters." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 32. No. 1. 2018. Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, Tom Goldstein. "Visualizing the Loss Landscape of Neural Nets." NeurIPS 2018: 6391-6401 @@ -534,4 +573,7 @@ Narang, Sharan, et al. "Do Transformer Modifications Transfer Across Implementat Noreen, Eric W. "Computer intensive methods for hypothesis testing: An introduction." Wiley, New York (1989). +Ronald L Wasserstein, Allen L Schirm, and Nicole A Lazar. 
Moving to a world beyond “p< 0.05”, +2019 + Yuan, Ke‐Hai, and Kentaro Hayashi. "Bootstrap approach to inference and power analysis based on three test statistics for covariance structure models." British Journal of Mathematical and Statistical Psychology 56.1 (2003): 93-110. \ No newline at end of file diff --git a/docs/build/html/.buildinfo b/docs/build/html/.buildinfo index 7aeb335..0e310f4 100644 --- a/docs/build/html/.buildinfo +++ b/docs/build/html/.buildinfo @@ -1,4 +1,4 @@ # Sphinx build info version 1 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done. -config: fae97644b57b58ebf06591c09d196b41 +config: 173784e3cf37332c2864e4679c2177e7 tags: 645f666f9bcd5a90fca523b33c5a78b7 diff --git a/docs/build/html/README_DOCS.html b/docs/build/html/README_DOCS.html index b80af70..f10858c 100644 --- a/docs/build/html/README_DOCS.html +++ b/docs/build/html/README_DOCS.html @@ -4,22 +4,22 @@ - + + deep-significance: Easy and Better Significance Testing for Deep Neural Networks — deep-significance 0.9 documentation - - + + - + - @@ -146,7 +146,7 @@

Table Of Contents

-
+

deep-significance: Easy and Better Significance Testing for Deep Neural Networks

Build Status Coverage Status @@ -177,7 +177,7 @@

deep-significance: Easy and Better Significance Testing for Deep Neural Netw
  • |:people_holding_hands:| Papers using deep-significance

  • |:books:| Bibliography

  • - -
    +here , the scenarios in the section Examples or +the demo Jupyter notebook.

    +

    +

    |:inbox_tray:| Installation

    The package can simply be installed using pip by running

    pip3 install deepsig
    @@ -216,62 +217,72 @@ 

    |:inbox_tray:| Installation +

    +

    |:bookmark:| Examples


    tl;dr: Use aso() to compare scores for two models. If the returned eps_min < 0.5, A is better than B. The lower -eps_min, the more confident the result.

    +eps_min, the more confident the result (we recommend to check eps_min < 0.2 and record eps_min alongside +experimental results).

    |:warning:| Testing models with only one set of hyperparameters and only one test set will be able to guarantee superiority in all settings. See General Recommendations & other notes.


    -

    In the following, I will lay out three scenarios that describe common use cases for ML practitioners and how to apply +

    In the following, we will lay out three scenarios that describe common use cases for ML practitioners and how to apply the methods implemented in this package accordingly. For an introduction into statistical hypothesis testing, please refer to resources such as this blog post for a general overview or Dror et al. (2018) for a NLP-specific point of view.

    -

    In general, in statistical significance testing, we usually compare two algorithms and on a dataset using -some evaluation metric (we assume a higher = better). The difference between the two algorithms on the -data is then defined as

    -

    where is our test statistic. We then test the following null hypothesis:

    -

    Thus, we assume our algorithm A to be equally as good or worse than algorithm B and reject the null hypothesis if A -is better than B (what we actually would like to see). Most statistical significance tests operate using -p-values, which define the probability that under the null-hypothesis, the expected by the test is larger than or -equal to the observed difference (that is, for a one-sided test, i.e. we assume A to be better than B):

    -

    We can interpret this equation as follows: Assuming that A is not better than B, the test assumes a corresponding distribution -of differences that is drawn from. How does our actually observed difference fit in there? -This is what the p-value is expressing: If this probability is high, is in line with what we expected under -the null hypothesis, so we conclude A not to better than B. If the -probability is low, that means that is quite unlikely under the null hypothesis and that the reverse -case is more likely - i.e. that it is -likely larger than - and we conclude that A is indeed better than B. Note that the p-value does not -express whether the null hypothesis is true.

    -

    To decide when we trust A to be better than B, we set a threshold that will determine when the p-value is small enough -for us to reject the null hypothesis, this is called the significance level and it is often set to be 0.05.

    -
    +

    We assume that we have two sets of scores we would like to compare, and , +for instance obtained by running two models and multiple times with a different random seed. +We can then define a one-sided test statistic based on the gathered observations. +An example of such test statistics is for instance the difference in observation means. We then formulate the following null-hypothesis:

    +

    That means that we actually assume the opposite of our desired case, namely that is not better than , +but equally as good or worse, as indicated by the value of the test statistic. +Usually, the goal becomes to reject this null hypothesis using the SST. +p-value testing is a frequentist method in the realm of SST. +It introduces the notion of data that could have been observed if we were to repeat our experiment again using +the same conditions, which we will write with superscript in order to distinguish them from our actually +observed scores (Gelman et al., 2021). +We then define the p-value as the probability that, under the null hypothesis, the test statistic using replicated +observation is larger than or equal to the observed test statistic:

    +

    We can interpret this expression as follows: Assuming that is not better than , the test +assumes a corresponding distribution of statistics that is drawn from. So how does the observed test statistic + fit in here? This is what the -value expresses: When the +probability is high, is in line with what we expected under the +null hypothesis, so we can not reject the null hypothesis, or in other words, we \emph{cannot} conclude + to be better than . If the probability is low, that means that the observed + is quite unlikely under the null hypothesis and that the reverse case is +more likely - i.e. that it is likely larger than - and we conclude that is indeed better than +. Note that the -value does not express whether the null hypothesis is true. To make our decision +about whether or not to reject the null hypothesis, we typically determine a threshold - the significance level +, often set to 0.05 - that the p-value has to fall below. However, it has been argued that a better practice +involves reporting the p-value alongside the results without a pidgeonholing of results into significant and non-significant +(Wasserstein et al., 2019).

    +

    Intermezzo: Almost Stochastic Order - a better significance test for Deep Neural Networks

    Deep neural networks are highly non-linear models, having their performance highly dependent on hyperparameters, random seeds and other (stochastic) factors. Therefore, comparing the means of two models across several runs might not be enough to decide if a model A is better than B. In fact, even aggregating more statistics like standard deviation, minimum -or maximum might not be enough to make a decision. For this reason, Dror et al. (2019) introduced Almost Stochastic -Order (ASO), a test to compare two score distributions.

    +or maximum might not be enough to make a decision. For this reason, del Barrio et al. (2017) and Dror et al. (2019) +introduced Almost Stochastic Order (ASO), a test to compare two score distributions.

    It builds on the concept of stochastic order: We can compare two distributions and declare one as stochastically dominant by comparing their cumulative distribution functions:

    _images/so.png

    Here, the CDF of A is given in red and in green for B. If the CDF of A is lower than B for every , we know the algorithm A to score higher. However, in practice these cases are rarely so clear-cut (imagine e.g. two normal distributions with the same mean but different variances). -For this reason, Dror et al. (2019) consider the notion of almost stochastic dominance by quantifying the extent to -which stochastic order is being violated (red area):

    +For this reason, del Barrio et al. (2017) and Dror et al. (2019) consider the notion of almost stochastic dominance +by quantifying the extent to which stochastic order is being violated (red area):

    _images/aso.png

    -

    ASO returns a value , which expresses the amount of violation of stochastic order. If -, A is stochastically dominant over B in more cases than vice versa, then the corresponding algorithm can be declared as +

    ASO returns a value , which expresses (an upper bound to) the amount of violation of stochastic order. If + (where \tau is 0.5 or less), A is stochastically dominant over B in more cases than vice versa, then the corresponding algorithm can be declared as superior. We can also interpret as a confidence score. The lower it is, the more sure we can be that A is better than B. Note: ASO does not compute p-values. Instead, the null hypothesis formulated as

    -

    If we want to be more confident about the result of ASO, we can also set the rejection threshold to be lower than 0.5. +

    If we want to be more confident about the result of ASO, we can also set the rejection threshold to be lower than 0.5 +(see the discussion in this section). Furthermore, the significance level is determined as an input argument when running ASO and actively influence the resulting .

    -
    -
    +
    +

    Scenario 1 - Comparing multiple runs of two models

    In the simplest scenario, we have retrieved a set of scores from a model A and a baseline B on a dataset, stemming from various model runs with different seeds. We want to test whether our model A is better than B (higher scores = better)- @@ -279,12 +290,15 @@

    Scenario 1 - Comparing multiple runs of two models
    import numpy as np
     from deepsig import aso
     
    +seed = 1234
    +np.random.seed(seed)
    +
     # Simulate scores
     N = 5  # Number of random seeds
     my_model_scores = np.random.normal(loc=0.9, scale=0.8, size=N)
     baseline_scores = np.random.normal(loc=0, scale=1, size=N)
     
    -min_eps = aso(my_model_scores, baseline_scores)  # min_eps = 0.0, so A is better
    +min_eps = aso(my_model_scores, baseline_scores, seed=seed)  # min_eps = 0.225, so A is better
     

    Note that ASO does not make any assumptions about the distributions of the scores. @@ -292,8 +306,8 @@

    Scenario 1 - Comparing multiple runs of two models -

    - -
    -
    + +

    Scenario 3 - Comparing sample-level scores

    In previous examples, we have assumed that we compare two algorithms A and B based on their performance per run, i.e. we run each algorithm once per random seed and obtain exactly one score on our test set. In some cases however, @@ -328,6 +345,9 @@

    Scenario 3 - Comparing sample-level scoresimport numpy as np from deepsig import aso +seed = 1234 +np.random.seed(seed) + # Simulate scores for three datasets M = 40 # Number of data points N = 3 # Number of random seeds @@ -336,11 +356,13 @@

    Scenario 3 - Comparing sample-level scorespairs = list(product(my_model_scored_samples_per_run, baseline_scored_samples_per_run)) # epsilon_min values with Bonferroni correction -eps_min = [aso(a, b, confidence_level=0.05 / len(pairs)) for a, b in pairs] +eps_min = [aso(a, b, confidence_level=0.95, num_comparisons=len(pairs), seed=seed) for a, b in pairs] +# eps_min = [0.3831678636198528, 0.07194780234194881, 0.9152792807128325, 0.5273463008857844, 0.14946944524461184, 1.0, +# 0.6099543280369378, 0.22387448804041898, 1.0]

    - -
    + +

    Scenario 4 - Comparing more than two models

    Similarly, when comparing multiple models (now again on a per-seed basis), we can use a similar approach like in the previous example. For instance, for three models, we can create a matrix and fill the entries @@ -358,6 +380,9 @@

    Scenario 4 - Comparing more than two models

    In the example, eps_min is now a matrix, containing the score between all pairs of models (for the same model, it set to 1 by default). The matrix is always to be read as ASO(row, column).

    The function applies the bonferroni correction for multiple comparisons by -default, but this can be turned off by using use_bonferroni=False. In order to save compute, the above symmetry -property is used as well, but this can also be disabled by use_symmetry=False.

    +default, but this can be turned off by using use_bonferroni=False.

    Lastly, when the scores argument is a dictionary and the function is called with return_df=True, the resulting matrix is given as a pandas.DataFrame for increased readability:

    - -
    + +

    |:newspaper:| How to report results

    When ASO used, two important details have to be reported, namely the confidence level and the score. Below lists some example snippets reporting the results of scenarios 1 and 4:

    @@ -419,11 +446,11 @@

    |:newspaper:| How to report results +

    +

    |:control_knobs:| Sample size

    It can be hard to determine whether the currently collected set of scores is large enough to allow for reliable significance testing or whether more scores are required. For this reason, deep-significance also implements functions to aid the decision of whether to @@ -469,10 +496,10 @@

    |:control_knobs:| Sample size# But adding two runs to scores2 only increases tightness by 1.06! So spending two more runs on scores1 is better

    - -
    + +

    |:sparkles:| Other features

    -
    - -
    + +

    |:electric_plug:| Compatibility with PyTorch, Tensorflow, Jax & Numpy

    All tests implemented in this package also can take PyTorch / Tensorflow tensors and Jax or NumPy arrays as arguments:

    from deepsig import aso 
    @@ -501,13 +528,13 @@ 

    |:electric_plug:| Compatibility with PyTorch, Tensorflow, Jax & Numpyaso(a, b) # It just works!

    -
    -
    + +

    |:woman_farmer:| Setting seeds for replicability

    In order to ensure replicability, both aso() and multi_aso() supply as seed argument. This even works when multiple jobs are used!

    -
    -
    + +

    |:game_die:| Permutation and bootstrap test

    Should you be suspicious of ASO and want to revert to the good old faithful tests, this package also implements the paired-bootstrap as well as the permutation randomization test. Note that as discussed in the next section, these @@ -523,9 +550,9 @@

    |:game_die:| Permutation and bootstrap testprint(bootstrap_test(a, b)) # 0.103

    - - -
    + + +

    General recommendations & other notes

    -
    -
    + +

    |:mortar_board:| Cite

    -

    If you use the ASO test via aso(), please cite the original work:

    +

    Using this package in general, please cite the following:

    +
    @article{ulmer2022deep,
    +  title={deep-significance-Easy and Meaningful Statistical Significance Testing in the Age of Neural Networks},
    +  author={Ulmer, Dennis and Hardmeier, Christian and Frellsen, Jes},
    +  journal={arXiv preprint arXiv:2204.06815},
    +  year={2022}
    +}
    +
    +
    +

    If you use the ASO test via aso() or `multi_aso, please cite the original works:

    @inproceedings{dror2019deep,
       author    = {Rotem Dror and
                    Segev Shlomov and
    @@ -569,25 +608,24 @@ 

    |:mortar_board:| Citedoi = {10.18653/v1/p19-1266}, timestamp = {Tue, 28 Jan 2020 10:27:52 +0100}, } -

    -
    -

    Using this package in general, please cite the following:

    -
    @software{dennis_ulmer_2021_4638709,
    -  author       = {Dennis Ulmer},
    -  title        = {{deep-significance: Easy and Better Significance 
    -                   Testing for Deep Neural Networks}},
    -  month        = mar,
    -  year         = 2021,
    -  note         = {https://github.com/Kaleidophon/deep-significance},
    -  publisher    = {Zenodo},
    -  version      = {v1.0.0a},
    -  doi          = {10.5281/zenodo.4638709},
    -  url          = {https://doi.org/10.5281/zenodo.4638709}
    +
    +@incollection{del2018optimal,
    +  title={An optimal transportation approach for assessing almost stochastic order},
    +  author={Del Barrio, Eustasio and Cuesta-Albertos, Juan A and Matr{\'a}n, Carlos},
    +  booktitle={The Mathematics of the Uncertain},
    +  pages={33--44},
    +  year={2018},
    +  publisher={Springer}
     }
     
    +

    For instance, you can write

    +
    In order to compare models, we use the Almost Stochastic Order test \citep{del2018optimal, dror2019deep} as 
    +implemented by \citet{ulmer2022deep}.
    +
    -
    +
    +

    |:medal_sports:| Acknowledgements

    This package was created out of discussions of the NLPnorth group at the IT University Copenhagen, whose members I want to thank for their feedback. The code in this repository is in multiple places based on @@ -597,8 +635,8 @@

    |:medal_sports:| Acknowledgementshere. The inline latex equations were rendered using readme2latex.

    -

    -
    + +

    |:people_holding_hands:| Papers using deep-significance

    In this last section of the readme, I would like to refer to works already using deep-significance. Open an issue or pull request if you would like to see your work added here!

    @@ -606,8 +644,8 @@

    |:people_holding_hands:| Papers using deep-significance

    “From Masked Language Modeling to Translation: Non-English Auxiliary Tasks Improve Zero-shot Spoken Language Understanding” (van der Groot et al., 2021)

  • “Cartography Active Learning” (Zhang & Plank, 2021)

  • -

    -
    + +

|:books:| Bibliography

Del Barrio, Eustasio, Juan A. Cuesta-Albertos, and Carlos Matrán. “An optimal transportation approach for assessing almost stochastic order.” The Mathematics of the Uncertain. Springer, Cham, 2018. 33-44.

Bonferroni, Carlo. “Teoria statistica delle classi e calcolo delle probabilita.” Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commericiali di Firenze 8 (1936): 3-62.

Dror, Rotem, et al. “The hitchhiker’s guide to testing statistical significance in natural language processing.” Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018.

Dror, Rotem, Shlomov, Segev, and Reichart, Roi. “Deep dominance - how to properly compare deep neural models.” Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.

Efron, Bradley, and Robert J. Tibshirani. “An introduction to the bootstrap.” CRC press, 1994.

Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. “Bayesian Data Analysis.” Third edition, 2021.

Henderson, Peter, et al. “Deep reinforcement learning that matters.” Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 32. No. 1. 2018.

Li, Hao, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. “Visualizing the Loss Landscape of Neural Nets.” NeurIPS 2018: 6391-6401.

Narang, Sharan, et al. “Do Transformer Modifications Transfer Across Implementations and Applications?” arXiv preprint arXiv:2102.11972 (2021).

Noreen, Eric W. “Computer intensive methods for hypothesis testing: An introduction.” Wiley, New York (1989).

Wasserstein, Ronald L., Allen L. Schirm, and Nicole A. Lazar. “Moving to a world beyond ‘p < 0.05’.” The American Statistician 73.sup1 (2019): 1-19.

Yuan, Ke-Hai, and Kentaro Hayashi. “Bootstrap approach to inference and power analysis based on three test statistics for covariance structure models.” British Journal of Mathematical and Statistical Psychology 56.1 (2003): 93-110.

    D

    G

    +
  • Documentation
  • @@ -141,7 +141,7 @@

    Table Of Contents

    -
    +

    deep-significance: Easy and Better Significance Testing for Deep Neural Networks

    Coverage Status License: GPL v3 @@ -170,7 +170,7 @@

    deep-significance: Easy and Better Significance Testing for Deep Neural Netw
  • |:people_holding_hands:| Papers using deep-significance

  • |:books:| Bibliography

  • -
    +

    ⁉️ Why?

    Although Deep Learning has undergone spectacular growth in the recent decade, a large portion of experimental evidence is not supported by statistical hypothesis tests. Instead, @@ -187,15 +187,16 @@

    ⁉️ Why? -
  • Statistical Significance tests such as Almost Stochastic Order (Dror et al., 2019), bootstrap (Efron & Tibshirani, 1994) and -permutation-randomization (Noreen, 1989).

  • +
  • Statistical Significance tests such as Almost Stochastic Order (del Barrio et al, 2017; Dror et al., 2019), +bootstrap (Efron & Tibshirani, 1994) and permutation-randomization (Noreen, 1989).

  • Bonferroni correction methods for multiplicity in datasets (Bonferroni, 1936).

  • Bootstrap power analysis (Yuan & Hayashi, 2003) and other functions to determine the right sample size.

  • All functions are fully tested and also compatible with common deep learning data structures, such as PyTorch / Tensorflow tensors as well as NumPy and Jax arrays. For examples about the usage, consult the documentation -here or the scenarios in the section Examples.

    -
    +here , the scenarios in the section Examples or +the demo Jupyter notebook.

    +

    📥 Installation

    The package can simply be installed using pip by running

    pip3 install deepsig
    @@ -208,64 +209,74 @@ 

    📥 Installation +

    +

    🔖 Examples


    tl;dr: Use aso() to compare scores for two models. If the returned eps_min < 0.5, A is better than B. The lower -eps_min, the more confident the result.

    +eps_min, the more confident the result (we recommend to check eps_min < 0.2 and record eps_min alongside +experimental results).

    ⚠️ Testing models with only one set of hyperparameters and only one test set will be able to guarantee superiority in all settings. See General Recommendations & other notes.


    -

    In the following, I will lay out three scenarios that describe common use cases for ML practitioners and how to apply +

    In the following, we will lay out three scenarios that describe common use cases for ML practitioners and how to apply the methods implemented in this package accordingly. For an introduction into statistical hypothesis testing, please refer to resources such as this blog post for a general overview or Dror et al. (2018) for a NLP-specific point of view.

    -

    In general, in statistical significance testing, we usually compare two algorithms and on a dataset using -some evaluation metric (we assume a higher = better). The difference between the two algorithms on the -data is then defined as

    -

    where is our test statistic. We then test the following null hypothesis:

    -

    Thus, we assume our algorithm A to be equally as good or worse than algorithm B and reject the null hypothesis if A -is better than B (what we actually would like to see). Most statistical significance tests operate using -p-values, which define the probability that under the null-hypothesis, the expected by the test is larger than or -equal to the observed difference (that is, for a one-sided test, i.e. we assume A to be better than B):

    -

    We can interpret this equation as follows: Assuming that A is not better than B, the test assumes a corresponding distribution -of differences that is drawn from. How does our actually observed difference fit in there? -This is what the p-value is expressing: If this probability is high, is in line with what we expected under -the null hypothesis, so we conclude A not to better than B. If the -probability is low, that means that is quite unlikely under the null hypothesis and that the reverse -case is more likely - i.e. that it is -likely larger than - and we conclude that A is indeed better than B. Note that the p-value does not -express whether the null hypothesis is true.

    -

    To decide when we trust A to be better than B, we set a threshold that will determine when the p-value is small enough -for us to reject the null hypothesis, this is called the significance level and it is often set to be 0.05.

    -
    -

    -
    +

    We assume that we have two sets of scores we would like to compare, and , +for instance obtained by running two models and multiple times with a different random seed. +We can then define a one-sided test statistic based on the gathered observations. +An example of such test statistics is for instance the difference in observation means. We then formulate the following null-hypothesis:

    +

    That means that we actually assume the opposite of our desired case, namely that is not better than , +but equally as good or worse, as indicated by the value of the test statistic. +Usually, the goal becomes to reject this null hypothesis using the SST. +p-value testing is a frequentist method in the realm of SST. +It introduces the notion of data that could have been observed if we were to repeat our experiment again using +the same conditions, which we will write with superscript in order to distinguish them from our actually +observed scores (Gelman et al., 2021). +We then define the p-value as the probability that, under the null hypothesis, the test statistic using replicated +observation is larger than or equal to the observed test statistic:

    +

    We can interpret this expression as follows: Assuming that is not better than , the test +assumes a corresponding distribution of statistics that is drawn from. So how does the observed test statistic + fit in here? This is what the -value expresses: When the +probability is high, is in line with what we expected under the +null hypothesis, so we can not reject the null hypothesis, or in other words, we emph{cannot} conclude + to be better than . If the probability is low, that means that the observed + is quite unlikely under the null hypothesis and that the reverse case is +more likely - i.e. that it is likely larger than - and we conclude that is indeed better than +. Note that the :raw-html-m2r:`<img src=”2ec6e630f199f589a2402fdf3e0289d5.svg?invert_in_darkmode” align=middle width=8.270567249999992pt height=14.15524440000002pt/>`-value does not express whether the null hypothesis is true. To make our decision +about whether or not to reject the null hypothesis, we typically determine a threshold - the significance level +, often set to 0.05 - that the p-value has to fall below. However, it has been argued that a better practice +involves reporting the p-value alongside the results without a pidgeonholing of results into significant and non-significant +(Wasserstein et al., 2019).

    +

    + +

    Intermezzo: Almost Stochastic Order - a better significance test for Deep Neural Networks

    Deep neural networks are highly non-linear models, having their performance highly dependent on hyperparameters, random seeds and other (stochastic) factors. Therefore, comparing the means of two models across several runs might not be enough to decide if a model A is better than B. In fact, even aggregating more statistics like standard deviation, minimum -or maximum might not be enough to make a decision. For this reason, Dror et al. (2019) introduced Almost Stochastic -Order (ASO), a test to compare two score distributions.

    +or maximum might not be enough to make a decision. For this reason, del Barrio et al. (2017) and Dror et al. (2019) +introduced Almost Stochastic Order (ASO), a test to compare two score distributions.

    It builds on the concept of stochastic order: We can compare two distributions and declare one as stochastically dominant by comparing their cumulative distribution functions:

    Here, the CDF of A is given in red and in green for B. If the CDF of A is lower than B for every , we know the algorithm A to score higher. However, in practice these cases are rarely so clear-cut (imagine e.g. two normal distributions with the same mean but different variances). -For this reason, Dror et al. (2019) consider the notion of almost stochastic dominance by quantifying the extent to -which stochastic order is being violated (red area):

    +For this reason, del Barrio et al. (2017) and Dror et al. (2019) consider the notion of almost stochastic dominance +by quantifying the extent to which stochastic order is being violated (red area):

    -

    ASO returns a value , which expresses the amount of violation of stochastic order. If -, A is stochastically dominant over B in more cases than vice versa, then the corresponding algorithm can be declared as +

    ASO returns a value , which expresses (an upper bound to) the amount of violation of stochastic order. If + (where tau is 0.5 or less), A is stochastically dominant over B in more cases than vice versa, then the corresponding algorithm can be declared as superior. We can also interpret as a confidence score. The lower it is, the more sure we can be that A is better than B. Note: ASO does not compute p-values. Instead, the null hypothesis formulated as

    -

    If we want to be more confident about the result of ASO, we can also set the rejection threshold to be lower than 0.5. +

    If we want to be more confident about the result of ASO, we can also set the rejection threshold to be lower than 0.5 +(see the discussion in this section). Furthermore, the significance level is determined as an input argument when running ASO and actively influence the resulting .

    -
    -
    + +

    Scenario 1 - Comparing multiple runs of two models

    In the simplest scenario, we have retrieved a set of scores from a model A and a baseline B on a dataset, stemming from various model runs with different seeds. We want to test whether our model A is better than B (higher scores = better)- @@ -273,12 +284,15 @@

    Scenario 1 - Comparing multiple runs of two models
    import numpy as np
     from deepsig import aso
     
    +seed = 1234
    +np.random.seed(seed)
    +
     # Simulate scores
     N = 5  # Number of random seeds
     my_model_scores = np.random.normal(loc=0.9, scale=0.8, size=N)
     baseline_scores = np.random.normal(loc=0, scale=1, size=N)
     
    -min_eps = aso(my_model_scores, baseline_scores)  # min_eps = 0.0, so A is better
    +min_eps = aso(my_model_scores, baseline_scores, seed=seed)  # min_eps = 0.225, so A is better
     

    Note that ASO does not make any assumptions about the distributions of the scores. @@ -286,8 +300,8 @@

    Scenario 1 - Comparing multiple runs of two models -

    - -
    - - -

    In the example, eps_min is now a matrix, containing the score between all pairs of models (for the same model, it set to 1 by default). The matrix is always to be read as ASO(row, column).

    The function applies the bonferroni correction for multiple comparisons by -default, but this can be turned off by using use_bonferroni=False. In order to save compute, the above symmetry -property is used as well, but this can also be disabled by use_symmetry=False.

    +default, but this can be turned off by using use_bonferroni=False.

    Lastly, when the scores argument is a dictionary and the function is called with return_df=True, the resulting matrix is given as a pandas.DataFrame for increased readability:

    - -
    + +

    📰 How to report results

    When ASO used, two important details have to be reported, namely the confidence level and the score. Below lists some example snippets reporting the results of scenarios 1 and 4:

    @@ -413,11 +440,11 @@

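For instance (a made-up wording for illustration, not one of the package's own snippets), a result from Scenario 1 could be reported along these lines:

"We compare our model to the baseline over N = 5 random seeds using ASO with a confidence level of 0.95 and report the returned ε_min alongside the scores; values below τ = 0.2 are taken to indicate almost stochastic dominance of our model over the baseline."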

    🎛️ Sample size

It can be hard to determine whether the currently collected set of scores is large enough to allow for reliable significance testing, or whether more scores are required. For this reason, deep-significance also implements functions to aid the decision of whether to collect more scores.

@@ -463,8 +490,8 @@

# But adding two runs to scores2 only increases tightness by 1.06! So spending two more runs on scores1 is better

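As a rough sketch of that comparison (the run counts below are made up; aso_uncertainty_reduction is assumed to be exposed at the package's top level, with the signature documented further down):

from deepsig import aso_uncertainty_reduction

# Suppose scores1 and scores2 currently hold 5 runs each.
# Option 1: spend two more runs on scores1; Option 2: spend them on scores2.
red1 = aso_uncertainty_reduction(m_old=5, n_old=5, m_new=7, n_new=5)
red2 = aso_uncertainty_reduction(m_old=5, n_old=5, m_new=5, n_new=7)

# The option with the larger value tightens the ASO estimate more.
print(red1, red2)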

All tests implemented in this package can also take PyTorch / Tensorflow tensors and Jax or NumPy arrays as arguments:

    @@ -506,8 +533,8 @@

print(bootstrap_test(a, b))  # 0.103
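A small sketch of what such a call could look like, here with PyTorch tensors and made-up scores (the p-value of 0.103 above stems from the original example, not from this snippet):

import torch
from deepsig import bootstrap_test

torch.manual_seed(1234)
a = torch.randn(5) + 1  # scores of model A as a PyTorch tensor
b = torch.randn(5)      # scores of baseline B

p_value = bootstrap_test(a, b, num_samples=1000, seed=1234)
print(p_value)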

    General recommendations & other notes

    -
    -
    + +

    🎓 Cite

-If you use the ASO test via aso(), please cite the original work:
+Using this package in general, please cite the following:

    +
    @article{ulmer2022deep,
    +  title={deep-significance-Easy and Meaningful Statistical Significance Testing in the Age of Neural Networks},
    +  author={Ulmer, Dennis and Hardmeier, Christian and Frellsen, Jes},
    +  journal={arXiv preprint arXiv:2204.06815},
    +  year={2022}
    +}
    +
    +
    +

+If you use the ASO test via aso() or multi_aso(), please cite the original works:

    @inproceedings{dror2019deep,
       author    = {Rotem Dror and
                    Segev Shlomov and
    @@ -551,26 +590,25 @@ 

 doi       = {10.18653/v1/p19-1266},
 timestamp = {Tue, 28 Jan 2020 10:27:52 +0100},
 }
-
-Using this package in general, please cite the following:
-
-@software{dennis_ulmer_2021_4638709,
    -  author       = {Dennis Ulmer},
    -  title        = {{deep-significance: Easy and Better Significance
    -                   Testing for Deep Neural Networks}},
    -  month        = mar,
    -  year         = 2021,
    -  note         = {https://github.com/Kaleidophon/deep-significance},
    -  publisher    = {Zenodo},
    -  version      = {v1.0.0a},
    -  doi          = {10.5281/zenodo.4638709},
    -  url          = {https://doi.org/10.5281/zenodo.4638709}
    +
    +@incollection{del2018optimal,
    +  title={An optimal transportation approach for assessing almost stochastic order},
    +  author={Del Barrio, Eustasio and Cuesta-Albertos, Juan A and Matr{\'a}n, Carlos},
    +  booktitle={The Mathematics of the Uncertain},
    +  pages={33--44},
    +  year={2018},
    +  publisher={Springer}
     }
     
    +

    For instance, you can write

    +
    In order to compare models, we use the Almost Stochastic Order test \citep{del2018optimal, dror2019deep} as
    +implemented by \citet{ulmer2022deep}.
    +
    -
    +
    +

    🏅 Acknowledgements

This package was created out of discussions within the NLPnorth group at the IT University Copenhagen, whose members I want to thank for their feedback. The code in this repository is in multiple places based on several of Rotem Dror's repositories.

@@ -579,18 +617,18 @@

The inline LaTeX equations were rendered using readme2latex.

    -

    -
    -

    🧑‍🤝‍🧑 Papers using deep-significance


    In this last section of the readme, I would like to refer to works already using deep-significance. Open an issue or pull request if you would like to see your work added here!

    -
    -
    -

    📚 Bibliography


    Del Barrio, Eustasio, Juan A. Cuesta-Albertos, and Carlos Matrán. “An optimal transportation approach for assessing almost stochastic order.” The Mathematics of the Uncertain. Springer, Cham, 2018. 33-44.

    Bonferroni, Carlo. “Teoria statistica delle classi e calcolo delle probabilita.” Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commericiali di Firenze 8 (1936): 3-62.

    Borji, Ali. “Negative results in computer vision: A perspective.” Image and Vision Computing 69 (2018): 1-8.

    @@ -598,20 +636,24 @@


    Documentation

    Re-implementation of Almost Stochastic Order (ASO) by Dror et al. (2019). The code here heavily borrows from their original code base.

    -
    -deepsig.aso.aso(scores_a: Union[tensorflow.python.framework.ops.EagerTensor, tensorflow.python.framework.ops.Tensor, torch.Tensor, torch.LongTensor, torch.FloatTensor, List[float], numpy.array], scores_b: Union[tensorflow.python.framework.ops.EagerTensor, tensorflow.python.framework.ops.Tensor, torch.Tensor, torch.LongTensor, torch.FloatTensor, List[float], numpy.array], confidence_level: float = 0.05, num_samples: int = 1000, num_bootstrap_iterations: int = 1000, dt: float = 0.005, num_jobs: int = 1, show_progress: bool = True, seed: Optional[int] = None, _progress_bar: Optional[tqdm.std.tqdm] = None) → float
    +
+deepsig.aso.aso(scores_a: Union[jax.interpreters.xla._DeviceArray, torch.Tensor, torch.LongTensor, torch.FloatTensor, List[float], numpy.array], scores_b: Union[jax.interpreters.xla._DeviceArray, torch.Tensor, torch.LongTensor, torch.FloatTensor, List[float], numpy.array], confidence_level: float = 0.95, num_comparisons: int = 1, num_samples: int = 1000, num_bootstrap_iterations: int = 1000, dt: float = 0.005, num_jobs: int = 1, show_progress: bool = True, seed: Optional[int] = None, _progress_bar: Optional[tqdm.std.tqdm] = None) → float

Performs the Almost Stochastic Order test by Dror et al. (2019). The function takes two lists of scores as input (they do not have to be of the same length) and returns an upper bound to the violation ratio - the minimum epsilon threshold. scores_a should contain scores of the algorithm which we suspect to be better (in this setup, higher scores = better).

@@ -628,7 +670,9 @@
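A brief sketch of a call that sets the most relevant arguments explicitly; the score values are illustrative and only parameters from the signature above are used:

import numpy as np
from deepsig import aso

rng_seed = 1234
np.random.seed(rng_seed)
scores_a = np.random.normal(loc=0.9, scale=0.8, size=10)
scores_b = np.random.normal(loc=0.0, scale=1.0, size=10)

eps_min = aso(
    scores_a,
    scores_b,
    confidence_level=0.95,          # confidence level used when estimating eps_min
    num_bootstrap_iterations=1000,  # bootstrap iterations for the violation ratio estimate
    num_jobs=2,                     # spread bootstrap iterations over two workers
    seed=rng_seed,
)
print(eps_min)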

-
    -deepsig.aso.compute_violation_ratio(scores_a: numpy.array, scores_b: numpy.array, dt: float) → float
    +
+deepsig.aso.compute_violation_ratio(scores_a: Optional[numpy.array] = None, scores_b: Optional[numpy.array] = None, quantile_func_a: Optional[Callable] = None, quantile_func_b: Optional[Callable] = None, dt: float = 0.001) → float

Compute the violation ratio e_W2 (equation 4 + 5).

Parameters

scores_a: Optional[np.array]
    Scores of algorithm A.

scores_b: Optional[np.array]
    Scores of algorithm B.

dt: float
    Differential for t during integral calculation.

quantile_func_a: Optional[Callable]
    Quantile function based on the first set of scores.

quantile_func_b: Optional[Callable]
    Quantile function based on the second set of scores.

Returns

float
    Return violation ratio.
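A short sketch of both ways of calling the function; the score values are made up, and the imports follow the module path given above:

import numpy as np
from deepsig.aso import compute_violation_ratio, get_quantile_function

np.random.seed(1234)
scores_a = np.random.normal(loc=0.5, scale=1.0, size=50)
scores_b = np.random.normal(loc=0.0, scale=1.0, size=50)

# Either pass the raw scores directly ...
e_w2 = compute_violation_ratio(scores_a=scores_a, scores_b=scores_b, dt=0.001)

# ... or pre-compute the quantile functions and pass those instead.
quantile_func_a = get_quantile_function(scores_a)
quantile_func_b = get_quantile_function(scores_b)
e_w2_alt = compute_violation_ratio(quantile_func_a=quantile_func_a, quantile_func_b=quantile_func_b, dt=0.001)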

+
+deepsig.aso.get_bootstrapped_violation_ratios(scores_a: Union[jax.interpreters.xla._DeviceArray, torch.Tensor, torch.LongTensor, torch.FloatTensor, List[float], numpy.array], scores_b: Union[jax.interpreters.xla._DeviceArray, torch.Tensor, torch.LongTensor, torch.FloatTensor, List[float], numpy.array], quantile_func_a: Callable, quantile_func_b: Callable, num_bootstrap_iterations: int, dt: float, num_jobs: int, show_progress: bool, seed: Optional[int], _progress_bar: Optional[tqdm.std.tqdm]) → List[float]

Retrieve violation ratios computed based on a number of bootstrap samples.

Parameters

scores_a: List[float]
    Scores of algorithm A.

scores_b: List[float]
    Scores of algorithm B.

quantile_func_a: Callable
    Quantile function based on the first set of scores.

quantile_func_b: Callable
    Quantile function based on the second set of scores.

num_bootstrap_iterations: int
    Number of bootstrap iterations when estimating sigma.

dt: float
    Differential for t during integral calculation.

num_jobs: int
    Number of threads that bootstrap iterations are divided among.

show_progress: bool
    Show progress bar. Default is True.

seed: Optional[int]
    Set seed for reproducibility purposes. Default is None (meaning no seed is used).

_progress_bar: Optional[tqdm]
    Hands over a progress bar object when called by multi_aso(). Only for internal use.

Returns

-float
-    Return violation ratio.
+List[float]
+    Bootstrapped violation ratios.
    @@ -680,8 +766,8 @@

-
    -deepsig.aso.get_quantile_function(scores: numpy.array) → Callable
    +
+deepsig.aso.get_quantile_function(scores: numpy.array) → Callable

    Return the quantile function corresponding to an empirical distribution of scores.

    Parameters
    @@ -700,8 +786,8 @@
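For illustration, a quantile function obtained this way maps a probability in [0, 1] back to a score; the scores below are made up, and the behaviour of the returned callable is described only loosely in the comment:

import numpy as np
from deepsig.aso import get_quantile_function

scores = np.array([0.70, 0.72, 0.74, 0.76, 0.80])
quantile_func = get_quantile_function(scores)

# Evaluating the quantile function at 0.5 yields (roughly) the median of the empirical score distribution
print(quantile_func(0.5))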

-
    -deepsig.aso.multi_aso(scores: Union[Dict[str, List[float]], Dict[str, numpy.array], numpy.array, List[List[float]]], confidence_level: float = 0.05, use_bonferroni: bool = True, use_symmetry: bool = True, num_samples: int = 1000, num_bootstrap_iterations: int = 1000, dt: float = 0.005, num_jobs: int = 1, return_df: bool = False, show_progress: bool = True, seed: Optional[int] = None) → Union[numpy.array, pandas.core.frame.DataFrame]
    +
+deepsig.aso.multi_aso(scores: Union[Dict[str, List[float]], Dict[str, numpy.array], numpy.array, List[List[float]]], confidence_level: float = 0.95, use_bonferroni: bool = True, use_symmetry: bool = True, num_samples: int = 1000, num_bootstrap_iterations: int = 1000, dt: float = 0.005, num_jobs: int = 1, return_df: bool = False, show_progress: bool = True, seed: Optional[int] = None) → Union[numpy.array, pandas.core.frame.DataFrame]

Provides an easy function to compare the scores of multiple models at once. Scores can be supplied in various forms (dictionary, nested list, 2D arrays or tensors). Returns a matrix (or pandas.DataFrame) with results. Applies Bonferroni correction to confidence level by default, but can be disabled by use_bonferroni=False.

    @@ -711,7 +797,7 @@


    Implementation of paired bootstrap test (Efron & Tibshirani, 1994).

    -
    -deepsig.bootstrap.bootstrap_test(scores_a: Union[tensorflow.python.framework.ops.EagerTensor, tensorflow.python.framework.ops.Tensor, torch.Tensor, torch.LongTensor, torch.FloatTensor, List[float], numpy.array], scores_b: Union[tensorflow.python.framework.ops.EagerTensor, tensorflow.python.framework.ops.Tensor, torch.Tensor, torch.LongTensor, torch.FloatTensor, List[float], numpy.array], num_samples: int = 1000) → float
    +
+deepsig.bootstrap.bootstrap_test(scores_a: Union[jax.interpreters.xla._DeviceArray, torch.Tensor, torch.LongTensor, torch.FloatTensor, List[float], numpy.array], scores_b: Union[jax.interpreters.xla._DeviceArray, torch.Tensor, torch.LongTensor, torch.FloatTensor, List[float], numpy.array], num_samples: int = 1000, num_jobs: int = 1, seed: Optional[int] = None) → float

Implementation of paired bootstrap test. A p-value is estimated by comparing the mean of scores for two algorithms to the means of resampled populations, where num_samples determines the number of times we resample.

    @@ -761,10 +847,14 @@

scores_a: ArrayLike
    Scores of algorithm A.

-scores_b: ArrrayLike
-    Scores of algorithm B.
+scores_b: ArrayLike
+    Scores of algorithm B.

num_samples: int
    Number of bootstrap samples used for estimation.

num_jobs: int
    Number of threads that bootstrap iterations are divided among.

seed: Optional[int]
    Set seed for reproducibility purposes. Default is None (meaning no seed is used).

    Returns
    @@ -781,8 +871,8 @@

this codebase corresponding to the Dror et al. (2017) publication.

    -
    -deepsig.correction.bonferroni_correction(p_values: Union[tensorflow.python.framework.ops.EagerTensor, tensorflow.python.framework.ops.Tensor, torch.Tensor, torch.LongTensor, torch.FloatTensor, List[float], numpy.array]) → numpy.array
    +
+deepsig.correction.bonferroni_correction(p_values: Union[jax.interpreters.xla._DeviceArray, torch.Tensor, torch.LongTensor, torch.FloatTensor, List[float], numpy.array]) → numpy.array

    Correct for multiple comparisons based on Bonferroni’s method.

    Parameters
    @@ -801,8 +891,8 @@
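A minimal sketch of its use, assuming the function is also exported at the package's top level and that it returns the adjusted p-values (the input p-values are made up):

import numpy as np
from deepsig import bonferroni_correction

p_values = np.array([0.01, 0.04, 0.20])  # p-values from three separate comparisons
corrected = bonferroni_correction(p_values)

# The corrected values are then compared against the significance threshold as usual
print(corrected)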

-
    -deepsig.correction.calculate_partial_conjunction(sorted_p_values: numpy.array, u: int) → float
    +
+deepsig.correction.calculate_partial_conjunction(sorted_p_values: numpy.array, u: int) → float

    Calculate the partial conjunction p-value for u out of N.

    Parameters
    @@ -824,8 +914,8 @@


    Implementation of paired sign test.

    -
    -deepsig.permutation.permutation_test(scores_a: Union[tensorflow.python.framework.ops.EagerTensor, tensorflow.python.framework.ops.Tensor, torch.Tensor, torch.LongTensor, torch.FloatTensor, List[float], numpy.array], scores_b: Union[tensorflow.python.framework.ops.EagerTensor, tensorflow.python.framework.ops.Tensor, torch.Tensor, torch.LongTensor, torch.FloatTensor, List[float], numpy.array], num_samples: int = 1000) → float
    +
+deepsig.permutation.permutation_test(scores_a: Union[jax.interpreters.xla._DeviceArray, torch.Tensor, torch.LongTensor, torch.FloatTensor, List[float], numpy.array], scores_b: Union[jax.interpreters.xla._DeviceArray, torch.Tensor, torch.LongTensor, torch.FloatTensor, List[float], numpy.array], num_samples: int = 1000, num_jobs: int = 1, seed: Optional[int] = None) → float

    Implementation of a permutation-randomization test. Scores of A and B will be randomly swapped and the difference in samples is then compared to the original difference.

The test is single-tailed, where we want to verify that the algorithm corresponding to scores_a is better than the one corresponding to scores_b.

@@ -839,6 +929,10 @@

Returns

@@ -852,8 +946,8 @@
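A short sketch with made-up scores, using only the num_samples and seed arguments from the signature above:

import numpy as np
from deepsig import permutation_test

np.random.seed(1234)
scores_a = np.random.normal(loc=0.9, scale=0.8, size=5)  # scores of the presumably better algorithm
scores_b = np.random.normal(loc=0.0, scale=1.0, size=5)

p_value = permutation_test(scores_a, scores_b, num_samples=1000, seed=1234)
print(p_value)  # small p-values speak against the null hypothesis that A is not better than B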


    Implement functions to help determine the right sample size for experiments.

    -
    -deepsig.sample_size.aso_uncertainty_reduction(m_old: int, n_old: int, m_new: int, n_new: int) → float
    +
+deepsig.sample_size.aso_uncertainty_reduction(m_old: int, n_old: int, m_new: int, n_new: int) → float

    Compute the reduction of uncertainty of tightness of estimate for violation ratio e_W2(F, G). This is based on the CLT in del Barrio et al. (2018) Theorem 2.4 / eq. 9.

    @@ -879,8 +973,8 @@

-
    -deepsig.sample_size.bootstrap_power_analysis(scores: Union[tensorflow.python.framework.ops.EagerTensor, tensorflow.python.framework.ops.Tensor, torch.Tensor, torch.LongTensor, torch.FloatTensor, List[float], numpy.array], scalar: float = 1.25, num_bootstrap_iterations: int = 5000, significance_threshold: float = 0.05, significance_test: Optional[Callable] = None, show_progress: bool = True, seed: Optional[int] = None) → float
    +
+deepsig.sample_size.bootstrap_power_analysis(scores: Union[jax.interpreters.xla._DeviceArray, torch.Tensor, torch.LongTensor, torch.FloatTensor, List[float], numpy.array], scalar: float = 1.25, num_bootstrap_iterations: int = 5000, significance_threshold: float = 0.05, significance_test: Optional[Callable] = None, show_progress: bool = True, seed: Optional[int] = None) → float

Perform bootstrap power analysis [1] to see whether the amount of collected scores is sufficient. It determines the statistical power of the sample, i.e. the probability of a statistically significant effect being found given that there is one (that is, the lower the power, the higher the probability of a Type II error).
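A small sketch of a call with made-up scores; the significance_test argument is left at its default, and only parameters from the signature above are used:

import numpy as np
from deepsig import bootstrap_power_analysis

np.random.seed(1234)
scores = np.random.normal(loc=0.5, scale=0.3, size=5)  # scores collected so far

power = bootstrap_power_analysis(scores, num_bootstrap_iterations=5000, seed=1234)
print(power)  # a low power suggests collecting more scores before running significance tests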

    @@ -926,7 +1020,7 @@

diff --git a/docs/build/html/objects.inv b/docs/build/html/objects.inv
index bb9f936..4ccfc1f 100644
Binary files a/docs/build/html/objects.inv and b/docs/build/html/objects.inv differ
diff --git a/docs/build/html/search.html b/docs/build/html/search.html
index 88f25a4..a23960a 100644
--- a/docs/build/html/search.html
+++ b/docs/build/html/search.html
@@ -11,15 +11,14 @@
 Search — deep-significance 0.9 documentation
@@ -134,7 +133,7 @@