diff --git a/.gitignore b/.gitignore index 79830d6..1c6d0f0 100644 --- a/.gitignore +++ b/.gitignore @@ -3,6 +3,8 @@ __pycache__/ *.py[cod] *$py.class +*/.DS_Store + # C extensions *.so diff --git a/README.md b/README.md index eb2ae47..dab2ebc 100644 --- a/README.md +++ b/README.md @@ -47,14 +47,15 @@ Reinforcement Learning (Henderson et al., 2018) and Computer Vision (Borji, 2017 To help mitigate this problem, this package supplies fully-tested re-implementations of useful functions for significance testing: -* Statistical Significance tests such as Almost Stochastic Order (Dror et al., 2019), bootstrap (Efron & Tibshirani, 1994) and - permutation-randomization (Noreen, 1989). +* Statistical Significance tests such as Almost Stochastic Order (del Barrio et al, 2017; Dror et al., 2019), + bootstrap (Efron & Tibshirani, 1994) and permutation-randomization (Noreen, 1989). * Bonferroni correction methods for multiplicity in datasets (Bonferroni, 1936). * Bootstrap power analysis (Yuan & Hayashi, 2003) and other functions to determine the right sample size. All functions are fully tested and also compatible with common deep learning data structures, such as PyTorch / Tensorflow tensors as well as NumPy and Jax arrays. For examples about the usage, consult the documentation -[here](https://deep-significance.readthedocs.io/en/latest/) or the scenarios in the section [Examples](#examples). +[here](https://deep-significance.readthedocs.io/en/latest/) , the scenarios in the section [Examples](#examples) or +the [demo Jupyter notebook](https://github.com/Kaleidophon/deep-significance/tree/main/paper/deep-significance%20demo.ipynb). ## :inbox_tray: Installation @@ -74,46 +75,51 @@ Another option is to clone the repository and install the package locally: --- **tl;dr**: Use `aso()` to compare scores for two models. If the returned `eps_min < 0.5`, A is better than B. The lower -`eps_min`, the more confident the result. +`eps_min`, the more confident the result (we recommend to check `eps_min < 0.2` and record `eps_min` alongside +experimental results). :warning: Testing models with only one set of hyperparameters and only one test set will be able to guarantee superiority in all settings. See [General Recommendations & other notes](#general-recommendations). --- -In the following, I will lay out three scenarios that describe common use cases for ML practitioners and how to apply +In the following, we will lay out three scenarios that describe common use cases for ML practitioners and how to apply the methods implemented in this package accordingly. For an introduction into statistical hypothesis testing, please refer to resources such as [this blog post](https://machinelearningmastery.com/statistical-hypothesis-tests/) for a general overview or [Dror et al. (2018)](https://www.aclweb.org/anthology/P18-1128.pdf) for a NLP-specific point of view. -In general, in statistical significance testing, we usually compare two algorithms and on a dataset using -some evaluation metric (we assume a higher = better). The difference between the two algorithms on the -data is then defined as - -
- -where is our test statistic. We then test the following **null hypothesis**: - - - -Thus, we assume our algorithm A to be equally as good or worse than algorithm B and reject the null hypothesis if A -is better than B (what we actually would like to see). Most statistical significance tests operate using -*p-values*, which define the probability that under the null-hypothesis, the expected by the test is larger than or -equal to the observed difference (that is, for a one-sided test, i.e. we assume A to be better than B): - - - -We can interpret this equation as follows: Assuming that A is *not* better than B, the test assumes a corresponding distribution -of differences that is drawn from. How does our actually observed difference fit in there? -This is what the p-value is expressing: If this probability is high, is in line with what we expected under -the null hypothesis, so we conclude A not to better than B. If the -probability is low, that means that is quite unlikely under the null hypothesis and that the reverse -case is more likely - i.e. that it is -likely *larger* than - and we conclude that A is indeed better than B. Note that **the p-value does not -express whether the null hypothesis is true**. - -To decide when we trust A to be better than B, we set a threshold that will determine when the p-value is small enough -for us to reject the null hypothesis, this is called the significance level and it is often set to be 0.05. +We assume that we have two sets of scores we would like to compare, and , +for instance obtained by running two models and multiple times with a different random seed. +We can then define a one-sided test statistic based on the gathered observations. +An example of such test statistics is for instance the difference in observation means. We then formulate the following null-hypothesis: + + + +That means that we actually assume the opposite of our desired case, namely that is not better than , +but equally as good or worse, as indicated by the value of the test statistic. +Usually, the goal becomes to reject this null hypothesis using the SST. +*p*-value testing is a frequentist method in the realm of SST. +It introduces the notion of data that *could have been observed* if we were to repeat our experiment again using +the same conditions, which we will write with superscript in order to distinguish them from our actually +observed scores (Gelman et al., 2021). +We then define the *p*-value as the probability that, under the null hypothesis, the test statistic using replicated +observation is larger than or equal to the *observed* test statistic: + + + +We can interpret this expression as follows: Assuming that is not better than , the test +assumes a corresponding distribution of statistics that is drawn from. So how does the observed test statistic + fit in here? This is what the -value expresses: When the +probability is high, is in line with what we expected under the +null hypothesis, so we can *not* reject the null hypothesis, or in other words, we \emph{cannot} conclude + to be better than . If the probability is low, that means that the observed + is quite unlikely under the null hypothesis and that the reverse case is +more likely - i.e. that it is likely larger than - and we conclude that is indeed better than +. Note that **the -value does not express whether the null hypothesis is true**. 
To make our decision +about whether or not to reject the null hypothesis, we typically determine a threshold - the significance level +, often set to 0.05 - that the *p*-value has to fall below. However, it has been argued that a better practice +involves reporting the *p*-value alongside the results without a pidgeonholing of results into significant and non-significant +(Wasserstein et al., 2019). ### Intermezzo: Almost Stochastic Order - a better significance test for Deep Neural Networks @@ -121,8 +127,8 @@ for us to reject the null hypothesis, this is called the significance level , we know the algorithm A to score higher. However, in practice these cases are rarely so clear-cut (imagine e.g. two normal distributions with the same mean but different variances). -For this reason, Dror et al. (2019) consider the notion of *almost stochastic dominance* by quantifying the extent to -which stochastic order is being violated (red area): +For this reason, del Barrio et al. (2017) and Dror et al. (2019) consider the notion of *almost stochastic dominance* +by quantifying the extent to which stochastic order is being violated (red area): ![](img/aso.png) -ASO returns a value , which expresses the amount of violation of stochastic order. If -, A is stochastically dominant over B in more cases than vice versa, then the corresponding algorithm can be declared as +ASO returns a value , which expresses (an upper bound to) the amount of violation of stochastic order. If + (where \tau is 0.5 or less), A is stochastically dominant over B in more cases than vice versa, then the corresponding algorithm can be declared as superior. We can also interpret as a *confidence score*. The lower it is, the more sure we can be that A is better than B. Note: **ASO does not compute p-values.** Instead, the null hypothesis formulated as - + -If we want to be more confident about the result of ASO, we can also set the rejection threshold to be lower than 0.5. +If we want to be more confident about the result of ASO, we can also set the rejection threshold to be lower than 0.5 +(see the discussion in [this section](#general-recommendations)). Furthermore, the significance level is determined as an input argument when running ASO and actively influence the resulting . @@ -159,12 +166,15 @@ We can now simply apply the ASO test: import numpy as np from deepsig import aso +seed = 1234 +np.random.seed(seed) + # Simulate scores N = 5 # Number of random seeds my_model_scores = np.random.normal(loc=0.9, scale=0.8, size=N) baseline_scores = np.random.normal(loc=0, scale=1, size=N) -min_eps = aso(my_model_scores, baseline_scores) # min_eps = 0.0, so A is better +min_eps = aso(my_model_scores, baseline_scores, seed=seed) # min_eps = 0.225, so A is better ``` Note that ASO **does not make any assumptions about the distributions of the scores**. 
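To make the *p*-value definition from the beginning of this section more concrete, here is a minimal plain-NumPy sketch (deliberately not using `deepsig`'s own permutation-randomization test, whose exact interface we do not assume here) that computes a one-sided permutation *p*-value for the difference-in-means statistic:

```python
# Plain-NumPy sketch, for illustration only: a one-sided permutation p-value for the
# difference-in-means statistic. The p-value is the share of permutations whose
# replicated statistic is at least as large as the observed one.
import numpy as np

rng = np.random.default_rng(1234)
my_model_scores = rng.normal(loc=0.9, scale=0.8, size=5)
baseline_scores = rng.normal(loc=0.0, scale=1.0, size=5)

observed = my_model_scores.mean() - baseline_scores.mean()
pooled = np.concatenate([my_model_scores, baseline_scores])
n_a = len(my_model_scores)

num_resamples = 10_000
replicated = np.empty(num_resamples)
for i in range(num_resamples):
    permuted = rng.permutation(pooled)
    replicated[i] = permuted[:n_a].mean() - permuted[n_a:].mean()

p_value = np.mean(replicated >= observed)  # Small p-value -> evidence against H0: A is not better than B
```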
@@ -185,6 +195,9 @@ which corresponds to the Bonferroni correction (Bonferroni et al., 1936): import numpy as np from deepsig import aso +seed = 1234 +np.random.seed(seed) + # Simulate scores for three datasets M = 3 # Number of datasets N = 5 # Number of random seeds @@ -192,8 +205,8 @@ my_model_scores_per_dataset = [np.random.normal(loc=0.3, scale=0.8, size=N) for baseline_scores_per_dataset = [np.random.normal(loc=0, scale=1, size=N) for _ in range(M)] # epsilon_min values with Bonferroni correction -eps_min = [aso(a, b, confidence_level=0.05 / M) for a, b in zip(my_model_scores_per_dataset, baseline_scores_per_dataset)] -# eps_min = [0.1565800030782686, 1, 0.0] +eps_min = [aso(a, b, confidence_level=0.95, num_comparisons=M, seed=seed) for a, b in zip(my_model_scores_per_dataset, baseline_scores_per_dataset)] +# eps_min = [0.006370113450148568, 0.6534772728574852, 0.0] ``` ### Scenario 3 - Comparing sample-level scores @@ -212,6 +225,9 @@ from itertools import product import numpy as np from deepsig import aso +seed = 1234 +np.random.seed(seed) + # Simulate scores for three datasets M = 40 # Number of data points N = 3 # Number of random seeds @@ -220,7 +236,9 @@ baseline_scored_samples_per_run = [np.random.normal(loc=0, scale=1, size=M) for pairs = list(product(my_model_scored_samples_per_run, baseline_scored_samples_per_run)) # epsilon_min values with Bonferroni correction -eps_min = [aso(a, b, confidence_level=0.05 / len(pairs)) for a, b in pairs] +eps_min = [aso(a, b, confidence_level=0.95, num_comparisons=len(pairs), seed=seed) for a, b in pairs] +# eps_min = [0.3831678636198528, 0.07194780234194881, 0.9152792807128325, 0.5273463008857844, 0.14946944524461184, 1.0, +# 0.6099543280369378, 0.22387448804041898, 1.0] ``` ### Scenario 4 - Comparing more than two models @@ -249,6 +267,9 @@ Let's look at an example: ```python import numpy as np from deepsig import multi_aso + +seed = 1234 +np.random.seed(seed) N = 5 # Number of random seeds M = 3 # Number of different models / algorithms @@ -257,20 +278,19 @@ M = 3 # Number of different models / algorithms # Here, we will sample from N(0.1, 0.8), N(0.15, 0.8), N(0.2, 0.8) my_models_scores = np.array([np.random.normal(loc=loc, scale=0.8, size=N) for loc in np.arange(0.1, 0.1 + 0.05 * M, step=0.05)]) -eps_min = multi_aso(my_models_scores, confidence_level=0.05) +eps_min = multi_aso(my_models_scores, confidence_level=0.95, seed=seed) # eps_min = -# array([[1., 1., 1.], -# [0., 1., 1.], -# [0., 0., 1.]]) +# array([[1. , 0.92621655, 1. ], +# [1. , 1. , 1. ], +# [0.82081635, 0.73048716, 1. ]]) ``` In the example, `eps_min` is now a matrix, containing the score between all pairs of models (for the same model, it set to 1 by default). The matrix is always to be read as ASO(row, column). The function applies the bonferroni correction for multiple comparisons by -default, but this can be turned off by using `use_bonferroni=False`. In order to save compute, the above symmetry -property is used as well, but this can also be disabled by `use_symmetry=False`. +default, but this can be turned off by using `use_bonferroni=False`. 
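If you want to turn such an `eps_min` matrix into a list of "wins", a small helper along the following lines could be used. This is a hypothetical convenience function, not part of `deepsig`; the threshold `tau=0.2` follows the recommendation in the general recommendations section below, and the matrix values are the (rounded) ones from the example above.

```python
# Hypothetical helper (not part of deepsig): list the pairs where the row model is
# almost stochastically dominant over the column model, i.e. eps_min[row, column] < tau.
import numpy as np

def dominant_pairs(eps_min: np.ndarray, names, tau: float = 0.2):
    wins = []
    for i, row_name in enumerate(names):
        for j, col_name in enumerate(names):
            if i != j and eps_min[i, j] < tau:
                wins.append((row_name, col_name, float(eps_min[i, j])))
    return wins

eps_min = np.array([
    [1.0, 0.926, 1.0],
    [1.0, 1.0, 1.0],
    [0.821, 0.730, 1.0],
])  # Rounded matrix from the example above
print(dominant_pairs(eps_min, ["model 1", "model 2", "model 3"]))  # [] -> no model is declared dominant here
```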
Lastly, when the `scores` argument is a dictionary and the function is called with `return_df=True`, the resulting matrix is given as a `pandas.DataFrame` for increased readability: @@ -278,6 +298,9 @@ given as a `pandas.DataFrame` for increased readability: ```python import numpy as np from deepsig import multi_aso + +seed = 1234 +np.random.seed(seed) N = 5 # Number of random seeds M = 3 # Number of different models / algorithms @@ -294,14 +317,14 @@ my_models_scores = { # ... # } -eps_min = multi_aso(my_models_scores, confidence_level=0.05, return_df=True) +eps_min = multi_aso(my_models_scores, confidence_level=0.95, return_df=True, seed=seed) # This is now a DataFrame! # eps_min = -# model 1 model 2 model 3 -# model 1 1.0 1.0 1.0 -# model 2 0.0 1.0 1.0 -# model 3 1.0 0.0 1.0 +# model 1 model 2 model 3 +# model 1 1.000000 0.926217 1.0 +# model 2 1.000000 1.000000 1.0 +# model 3 0.820816 0.730487 1.0 ``` @@ -315,7 +338,7 @@ score. Below lists some example snippets reporting the results of scenarios 1 an We compared all pairs of models based on five random seeds each using ASO with a confidence level of $\alpha = 0.05$ (before adjusting for all pair-wise comparisons using the Bonferroni correction). Almost stochastic - dominance ($\epsilon_\text{min} < 0.5)$ is indicated in table X. + dominance ($\epsilon_\text{min} < \tau$ with $\tau = 0.2$) is indicated in table X. ### :control_knobs: Sample size @@ -384,11 +407,11 @@ from deepsig import aso import numpy as np from timeit import timeit -a = np.random.normal(size=5) -b = np.random.normal(size=5) +a = np.random.normal(size=1000) +b = np.random.normal(size=1000) -print(timeit(lambda: aso(a, b, num_jobs=1, show_progress=False), number=5)) # 146.6909574989986 -print(timeit(lambda: aso(a, b, num_jobs=4, show_progress=False), number=5)) # 50.416724971000804 +print(timeit(lambda: aso(a, b, num_jobs=1, show_progress=False), number=5)) # 393.6318126 +print(timeit(lambda: aso(a, b, num_jobs=4, show_progress=False), number=5)) # 139.73514621799995n ``` #### :electric_plug: Compatibility with PyTorch, Tensorflow, Jax & Numpy @@ -438,11 +461,15 @@ as many scores as possible should be collected, especially if the variance betwe Because this is usually infeasible in practice, Bouthilier et al. (2020) recommend to **vary all other sources of variation** between runs to obtain the most trustworthy estimate of the "true" performance, such as data shuffling, weight initialization etc. -* `num_samples` and `num_bootstrap_iterations` can be reduced to increase the speed of `aso()`. However, this is not +* `num_bootstrap_iterations` can be reduced to increase the speed of `aso()`. However, this is not recommended as the result of the test will also become less accurate. Technically, is a upper bound that becomes tighter with the number of samples and bootstrap iterations (del Barrio et al., 2017). Thus, increasing the number of jobs with `num_jobs` instead is always preferred. +* While we could declare a model stochastically dominant with , we found this to have a comparatively high +Type I error (false positives). Tests [in our paper](https://arxiv.org/pdf/2204.06815.pdf) have shown that a more useful threshold that trades of Type I and + Type II error between different scenarios might be . + * Bootstrap and permutation-randomization are all non-parametric tests, i.e. they don't make any assumptions about the distribution of our test metric. 
Nevertheless, they differ in their *statistical power*, which is defined as the probability that the null hypothesis is being rejected given that there is a difference between A and B. In other words, the more powerful @@ -454,7 +481,17 @@ the distribution of our test metric. Nevertheless, they differ in their *statist ### :mortar_board: Cite -If you use the ASO test via `aso()`, please cite the original work: +Using this package in general, please cite the following: + + @article{ulmer2022deep, + title={deep-significance-Easy and Meaningful Statistical Significance Testing in the Age of Neural Networks}, + author={Ulmer, Dennis and Hardmeier, Christian and Frellsen, Jes}, + journal={arXiv preprint arXiv:2204.06815}, + year={2022} + } + + +If you use the ASO test via `aso()` or `multi_aso, please cite the original works: @inproceedings{dror2019deep, author = {Rotem Dror and @@ -475,21 +512,20 @@ If you use the ASO test via `aso()`, please cite the original work: timestamp = {Tue, 28 Jan 2020 10:27:52 +0100}, } -Using this package in general, please cite the following: - - @software{dennis_ulmer_2021_4638709, - author = {Dennis Ulmer}, - title = {{deep-significance: Easy and Better Significance - Testing for Deep Neural Networks}}, - month = mar, - year = 2021, - note = {https://github.com/Kaleidophon/deep-significance}, - publisher = {Zenodo}, - version = {v1.0.0a}, - doi = {10.5281/zenodo.4638709}, - url = {https://doi.org/10.5281/zenodo.4638709} + @incollection{del2018optimal, + title={An optimal transportation approach for assessing almost stochastic order}, + author={Del Barrio, Eustasio and Cuesta-Albertos, Juan A and Matr{\'a}n, Carlos}, + booktitle={The Mathematics of the Uncertain}, + pages={33--44}, + year={2018}, + publisher={Springer} } +For instance, you can write + + In order to compare models, we use the Almost Stochastic Order test \citep{del2018optimal, dror2019deep} as + implemented by \citet{ulmer2022deep}. + ### :medal_sports: Acknowledgements This package was created out of discussions of the [NLPnorth group](https://nlpnorth.github.io/) at the IT University @@ -526,6 +562,9 @@ Dror, Rotem, Shlomov, Segev, and Reichart, Roi. "Deep dominance-how to properly Efron, Bradley, and Robert J. Tibshirani. "An introduction to the bootstrap." CRC press, 1994. +Andrew Gelman, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, Donald B Rubin, John +Carlin, Hal Stern, Donald Rubin, and David Dunson. Bayesian data analysis third edition, 2021. + Henderson, Peter, et al. "Deep reinforcement learning that matters." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 32. No. 1. 2018. Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, Tom Goldstein. "Visualizing the Loss Landscape of Neural Nets." NeurIPS 2018: 6391-6401 @@ -534,4 +573,7 @@ Narang, Sharan, et al. "Do Transformer Modifications Transfer Across Implementat Noreen, Eric W. "Computer intensive methods for hypothesis testing: An introduction." Wiley, New York (1989). +Ronald L Wasserstein, Allen L Schirm, and Nicole A Lazar. Moving to a world beyond “p< 0.05”, +2019 + Yuan, Ke‐Hai, and Kentaro Hayashi. "Bootstrap approach to inference and power analysis based on three test statistics for covariance structure models." British Journal of Mathematical and Statistical Psychology 56.1 (2003): 93-110. 
\ No newline at end of file diff --git a/README_RAW.md b/README_RAW.md index bd8908d..abdbe96 100644 --- a/README_RAW.md +++ b/README_RAW.md @@ -47,14 +47,15 @@ Reinforcement Learning (Henderson et al., 2018) and Computer Vision (Borji, 2017 To help mitigate this problem, this package supplies fully-tested re-implementations of useful functions for significance testing: -* Statistical Significance tests such as Almost Stochastic Order (Dror et al., 2019), bootstrap (Efron & Tibshirani, 1994) and - permutation-randomization (Noreen, 1989). +* Statistical Significance tests such as Almost Stochastic Order (del Barrio et al, 2017; Dror et al., 2019), + bootstrap (Efron & Tibshirani, 1994) and permutation-randomization (Noreen, 1989). * Bonferroni correction methods for multiplicity in datasets (Bonferroni, 1936). * Bootstrap power analysis (Yuan & Hayashi, 2003) and other functions to determine the right sample size. All functions are fully tested and also compatible with common deep learning data structures, such as PyTorch / Tensorflow tensors as well as NumPy and Jax arrays. For examples about the usage, consult the documentation -[here](https://deep-significance.readthedocs.io/en/latest/) or the scenarios in the section [Examples](#examples). +[here](https://deep-significance.readthedocs.io/en/latest/) , the scenarios in the section [Examples](#examples) or +the [demo Jupyter notebook](https://github.com/Kaleidophon/deep-significance/tree/main/paper/deep-significance%20demo.ipynb). ## :inbox_tray: Installation @@ -74,52 +75,55 @@ Another option is to clone the repository and install the package locally: --- **tl;dr**: Use `aso()` to compare scores for two models. If the returned `eps_min < 0.5`, A is better than B. The lower -`eps_min`, the more confident the result. +`eps_min`, the more confident the result (we recommend to check `eps_min < 0.2` and record `eps_min` alongside +experimental results). :warning: Testing models with only one set of hyperparameters and only one test set will be able to guarantee superiority in all settings. See [General Recommendations & other notes](#general-recommendations). --- -In the following, I will lay out three scenarios that describe common use cases for ML practitioners and how to apply +In the following, we will lay out three scenarios that describe common use cases for ML practitioners and how to apply the methods implemented in this package accordingly. For an introduction into statistical hypothesis testing, please refer to resources such as [this blog post](https://machinelearningmastery.com/statistical-hypothesis-tests/) for a general overview or [Dror et al. (2018)](https://www.aclweb.org/anthology/P18-1128.pdf) for a NLP-specific point of view. -In general, in statistical significance testing, we usually compare two algorithms $A$ and $B$ on a dataset $X$ using -some evaluation metric $\mathcal{M}$ (we assume a higher = better). The difference between the two algorithms on the -data is then defined as +We assume that we have two sets of scores we would like to compare, $\mathbb{S}_\mathbb{A}$ and $\mathbb{S}_\mathbb{B}$, +for instance obtained by running two models $\mathbb{A}$ and $\mathbb{B}$ multiple times with a different random seed. +We can then define a one-sided test statistic $\delta(\mathbb{S}_\mathbb{A}, \mathbb{S}_\mathbb{B})$ based on the gathered observations. +An example of such test statistics is for instance the difference in observation means. 
We then formulate the following null-hypothesis: $$ -\delta(X) = \mathcal{M}(A, X) - \mathcal{M}(B, X) +H_0: \delta(\mathbb{S}_\mathbb{A}, \mathbb{S}_\mathbb{B}) \le 0 $$ -where $\delta(X)$ is our test statistic. We then test the following **null hypothesis**: +That means that we actually assume the opposite of our desired case, namely that $\mathbb{A}$ is not better than $\mathbb{B}$, +but equally as good or worse, as indicated by the value of the test statistic. +Usually, the goal becomes to reject this null hypothesis using the SST. +*p*-value testing is a frequentist method in the realm of SST. +It introduces the notion of data that *could have been observed* if we were to repeat our experiment again using +the same conditions, which we will write with superscript $\text{rep}$ in order to distinguish them from our actually +observed scores (Gelman et al., 2021). +We then define the *p*-value as the probability that, under the null hypothesis, the test statistic using replicated +observation is larger than or equal to the *observed* test statistic: $$ -H_0: \delta(X) \le 0 +p(\delta(\mathbb{S}_\mathbb{A}^\text{rep}, \mathbb{S}_\mathbb{B}^\text{rep}) \ge \delta(\mathbb{S}_\mathbb{A}, \mathbb{S}_\mathbb{B})|H_0) $$ -Thus, we assume our algorithm A to be equally as good or worse than algorithm B and reject the null hypothesis if A -is better than B (what we actually would like to see). Most statistical significance tests operate using -*p-values*, which define the probability that under the null-hypothesis, the $\delta(X)$ expected by the test is larger than or -equal to the observed difference $\delta_{\text{obs}}$ (that is, for a one-sided test, i.e. we assume A to be better than B): - -$$ -P(\delta(X) \ge \delta_\text{obs}| H_0) -$$ - -We can interpret this equation as follows: Assuming that A is *not* better than B, the test assumes a corresponding distribution -of differences that $\delta(X)$ is drawn from. How does our actually observed difference $\delta_\text{obs}$ fit in there? -This is what the p-value is expressing: If this probability is high, $\delta_\text{obs}$ is in line with what we expected under -the null hypothesis, so we conclude A not to better than B. If the -probability is low, that means that $\delta_\text{obs}$ is quite unlikely under the null hypothesis and that the reverse -case is more likely - i.e. that it is -likely *larger* than $\delta(X)$ - and we conclude that A is indeed better than B. Note that **the p-value does not -express whether the null hypothesis is true**. - -To decide when we trust A to be better than B, we set a threshold that will determine when the p-value is small enough -for us to reject the null hypothesis, this is called the significance level $\alpha$ and it is often set to be 0.05. +We can interpret this expression as follows: Assuming that $\mathbb{A}$ is not better than $\mathbb{B}$, the test +assumes a corresponding distribution of statistics that $\delta$ is drawn from. So how does the observed test statistic +$\delta(\mathbb{S}_\mathbb{A}, \mathbb{S}_\mathbb{B})$ fit in here? This is what the $p$-value expresses: When the +probability is high, $\delta(\mathbb{S}_\mathbb{A}, \mathbb{S}_\mathbb{B})$ is in line with what we expected under the +null hypothesis, so we can *not* reject the null hypothesis, or in other words, we \emph{cannot} conclude +$\mathbb{A}$ to be better than $\mathbb{B}$. 
If the probability is low, that means that the observed +$\delta(\mathbb{S}_\mathbb{A}, \mathbb{S}_\mathbb{B})$ is quite unlikely under the null hypothesis and that the reverse case is +more likely - i.e. that it is likely larger than $\delta(\mathbb{S}_\mathbb{A}^\text{rep}, \mathbb{S}_\mathbb{B}^\text{rep})$ - and we conclude that $\mathbb{A}$ is indeed better than +$\mathbb{B}$. Note that **the $p$-value does not express whether the null hypothesis is true**. To make our decision +about whether or not to reject the null hypothesis, we typically determine a threshold - the significance level +$\alpha$, often set to 0.05 - that the *p*-value has to fall below. However, it has been argued that a better practice +involves reporting the *p*-value alongside the results without a pigeonholing of results into significant and non-significant +(Wasserstein et al., 2019). ### Intermezzo: Almost Stochastic Order - a better significance test for Deep Neural Networks @@ -127,8 +131,8 @@ Deep neural networks are highly non-linear models, having their performance highly dependent on hyperparameters, random seeds and other (stochastic) factors. Therefore, comparing the means of two models across several runs might not be enough to decide if a model A is better than B. In fact, **even aggregating more statistics like standard deviation, minimum -or maximum might not be enough** to make a decision. For this reason, Dror et al. (2019) introduced *Almost Stochastic -Order* (ASO), a test to compare two score distributions. +or maximum might not be enough** to make a decision. For this reason, del Barrio et al. (2017) and Dror et al. (2019) +introduced *Almost Stochastic Order* (ASO), a test to compare two score distributions. It builds on the concept of *stochastic order*: We can compare two distributions and declare one as *stochastically dominant* by comparing their cumulative distribution functions: @@ -138,21 +142,22 @@ by comparing their cumulative distribution functions: ![](img/aso.png) Here, the CDF of A is given in red and in green for B. If the CDF of A is lower than B for every $x$, we know the algorithm A to score higher. However, in practice these cases are rarely so clear-cut (imagine e.g. two normal distributions with the same mean but different variances). -For this reason, Dror et al. (2019) consider the notion of *almost stochastic dominance* by quantifying the extent to -which stochastic order is being violated (red area): +For this reason, del Barrio et al. (2017) and Dror et al. (2019) consider the notion of *almost stochastic dominance* +by quantifying the extent to which stochastic order is being violated (red area): ![](img/aso.png) -ASO returns a value $\epsilon_\text{min}$, which expresses the amount of violation of stochastic order. If -$\epsilon_\text{min} < 0.5$, A is stochastically dominant over B in more cases than vice versa, then the corresponding algorithm can be declared as +ASO returns a value $\epsilon_\text{min}$, which expresses (an upper bound to) the amount of violation of stochastic order. If +$\epsilon_\text{min} < \tau$ (where $\tau$ is 0.5 or less), A is stochastically dominant over B in more cases than vice versa, and the corresponding algorithm can be declared as superior. We can also interpret $\epsilon_\text{min}$ as a *confidence score*. The lower it is, the more sure we can be that A is better than B.
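As a point of reference for what is being estimated here - a sketch that follows the implementation of `compute_violation_ratio()` in `deepsig/aso.py` further down in this diff, with $F^{-1}$ and $G^{-1}$ denoting the empirical quantile functions of the scores of A and B - the violation ratio is

$$
e_{W_2}(F, G) = \frac{\int_0^1 \big(G^{-1}(t) - F^{-1}(t)\big)_+^2 \, dt}{\int_0^1 \big(G^{-1}(t) - F^{-1}(t)\big)^2 \, dt},
$$

so that $e_{W_2} = 0$ corresponds to A fully dominating B and $e_{W_2} = 1$ to the reverse; $\epsilon_\text{min}$ is the bootstrap-based upper confidence bound that `aso()` places on this quantity.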
Note: **ASO does not compute p-values.** Instead, the null hypothesis formulated as $$ -H_0: \epsilon_\text{min} \ge 0.5 +H_0: \epsilon_\text{min} \ge \tau $$ -If we want to be more confident about the result of ASO, we can also set the rejection threshold to be lower than 0.5. +If we want to be more confident about the result of ASO, we can also set the rejection threshold to be lower than 0.5 +(see the discussion in [this section](#general-recommendations)). Furthermore, the significance level $\alpha$ is determined as an input argument when running ASO and actively influence the resulting $\epsilon_\text{min}$. @@ -167,12 +172,15 @@ We can now simply apply the ASO test: import numpy as np from deepsig import aso +seed = 1234 +np.random.seed(seed) + # Simulate scores N = 5 # Number of random seeds my_model_scores = np.random.normal(loc=0.9, scale=0.8, size=N) baseline_scores = np.random.normal(loc=0, scale=1, size=N) -min_eps = aso(my_model_scores, baseline_scores) # min_eps = 0.0, so A is better +min_eps = aso(my_model_scores, baseline_scores, seed=seed) # min_eps = 0.225, so A is better ``` Note that ASO **does not make any assumptions about the distributions of the scores**. @@ -193,6 +201,9 @@ which corresponds to the Bonferroni correction (Bonferroni et al., 1936): import numpy as np from deepsig import aso +seed = 1234 +np.random.seed(seed) + # Simulate scores for three datasets M = 3 # Number of datasets N = 5 # Number of random seeds @@ -200,8 +211,8 @@ my_model_scores_per_dataset = [np.random.normal(loc=0.3, scale=0.8, size=N) for baseline_scores_per_dataset = [np.random.normal(loc=0, scale=1, size=N) for _ in range(M)] # epsilon_min values with Bonferroni correction -eps_min = [aso(a, b, confidence_level=0.05 / M) for a, b in zip(my_model_scores_per_dataset, baseline_scores_per_dataset)] -# eps_min = [0.1565800030782686, 1, 0.0] +eps_min = [aso(a, b, confidence_level=0.95, num_comparisons=M, seed=seed) for a, b in zip(my_model_scores_per_dataset, baseline_scores_per_dataset)] +# eps_min = [0.006370113450148568, 0.6534772728574852, 0.0] ``` ### Scenario 3 - Comparing sample-level scores @@ -220,6 +231,9 @@ from itertools import product import numpy as np from deepsig import aso +seed = 1234 +np.random.seed(seed) + # Simulate scores for three datasets M = 40 # Number of data points N = 3 # Number of random seeds @@ -228,7 +242,9 @@ baseline_scored_samples_per_run = [np.random.normal(loc=0, scale=1, size=M) for pairs = list(product(my_model_scored_samples_per_run, baseline_scored_samples_per_run)) # epsilon_min values with Bonferroni correction -eps_min = [aso(a, b, confidence_level=0.05 / len(pairs)) for a, b in pairs] +eps_min = [aso(a, b, confidence_level=0.95, num_comparisons=len(pairs), seed=seed) for a, b in pairs] +# eps_min = [0.3831678636198528, 0.07194780234194881, 0.9152792807128325, 0.5273463008857844, 0.14946944524461184, 1.0, +# 0.6099543280369378, 0.22387448804041898, 1.0] ``` ### Scenario 4 - Comparing more than two models @@ -259,6 +275,9 @@ Let's look at an example: ```python import numpy as np from deepsig import multi_aso + +seed = 1234 +np.random.seed(seed) N = 5 # Number of random seeds M = 3 # Number of different models / algorithms @@ -267,20 +286,19 @@ M = 3 # Number of different models / algorithms # Here, we will sample from N(0.1, 0.8), N(0.15, 0.8), N(0.2, 0.8) my_models_scores = np.array([np.random.normal(loc=loc, scale=0.8, size=N) for loc in np.arange(0.1, 0.1 + 0.05 * M, step=0.05)]) -eps_min = multi_aso(my_models_scores, confidence_level=0.05) 
+eps_min = multi_aso(my_models_scores, confidence_level=0.95, seed=seed) # eps_min = -# array([[1., 1., 1.], -# [0., 1., 1.], -# [0., 0., 1.]]) +# array([[1. , 0.92621655, 1. ], +# [1. , 1. , 1. ], +# [0.82081635, 0.73048716, 1. ]]) ``` In the example, `eps_min` is now a matrix, containing the $\epsilon_\text{min}$ score between all pairs of models (for the same model, it set to 1 by default). The matrix is always to be read as ASO(row, column). The function applies the bonferroni correction for multiple comparisons by -default, but this can be turned off by using `use_bonferroni=False`. In order to save compute, the above symmetry -property is used as well, but this can also be disabled by `use_symmetry=False`. +default, but this can be turned off by using `use_bonferroni=False`. Lastly, when the `scores` argument is a dictionary and the function is called with `return_df=True`, the resulting matrix is given as a `pandas.DataFrame` for increased readability: @@ -288,6 +306,9 @@ given as a `pandas.DataFrame` for increased readability: ```python import numpy as np from deepsig import multi_aso + +seed = 1234 +np.random.seed(seed) N = 5 # Number of random seeds M = 3 # Number of different models / algorithms @@ -304,14 +325,14 @@ my_models_scores = { # ... # } -eps_min = multi_aso(my_models_scores, confidence_level=0.05, return_df=True) +eps_min = multi_aso(my_models_scores, confidence_level=0.95, return_df=True, seed=seed) # This is now a DataFrame! # eps_min = -# model 1 model 2 model 3 -# model 1 1.0 1.0 1.0 -# model 2 0.0 1.0 1.0 -# model 3 1.0 0.0 1.0 +# model 1 model 2 model 3 +# model 1 1.000000 0.926217 1.0 +# model 2 1.000000 1.000000 1.0 +# model 3 0.820816 0.730487 1.0 ``` @@ -325,7 +346,7 @@ score. Below lists some example snippets reporting the results of scenarios 1 an We compared all pairs of models based on five random seeds each using ASO with a confidence level of $\alpha = 0.05$ (before adjusting for all pair-wise comparisons using the Bonferroni correction). Almost stochastic - dominance ($\epsilon_\text{min} < 0.5)$ is indicated in table X. + dominance ($\epsilon_\text{min} < \tau$ with $\tau = 0.2$) is indicated in table X. ### :control_knobs: Sample size @@ -394,11 +415,11 @@ from deepsig import aso import numpy as np from timeit import timeit -a = np.random.normal(size=5) -b = np.random.normal(size=5) +a = np.random.normal(size=1000) +b = np.random.normal(size=1000) -print(timeit(lambda: aso(a, b, num_jobs=1, show_progress=False), number=5)) # 146.6909574989986 -print(timeit(lambda: aso(a, b, num_jobs=4, show_progress=False), number=5)) # 50.416724971000804 +print(timeit(lambda: aso(a, b, num_jobs=1, show_progress=False), number=5)) # 393.6318126 +print(timeit(lambda: aso(a, b, num_jobs=4, show_progress=False), number=5)) # 139.73514621799995n ``` #### :electric_plug: Compatibility with PyTorch, Tensorflow, Jax & Numpy @@ -448,11 +469,15 @@ as many scores as possible should be collected, especially if the variance betwe Because this is usually infeasible in practice, Bouthilier et al. (2020) recommend to **vary all other sources of variation** between runs to obtain the most trustworthy estimate of the "true" performance, such as data shuffling, weight initialization etc. -* `num_samples` and `num_bootstrap_iterations` can be reduced to increase the speed of `aso()`. However, this is not +* `num_bootstrap_iterations` can be reduced to increase the speed of `aso()`. However, this is not recommended as the result of the test will also become less accurate. 
Technically, $\epsilon_\text{min}$ is a upper bound that becomes tighter with the number of samples and bootstrap iterations (del Barrio et al., 2017). Thus, increasing the number of jobs with `num_jobs` instead is always preferred. +* While we could declare a model stochastically dominant with $\epsilon_\text{min} < 0.5$, we found this to have a comparatively high +Type I error (false positives). Tests [in our paper](https://arxiv.org/pdf/2204.06815.pdf) have shown that a more useful threshold that trades of Type I and + Type II error between different scenarios might be $\tau = 0.2$. + * Bootstrap and permutation-randomization are all non-parametric tests, i.e. they don't make any assumptions about the distribution of our test metric. Nevertheless, they differ in their *statistical power*, which is defined as the probability that the null hypothesis is being rejected given that there is a difference between A and B. In other words, the more powerful @@ -464,7 +489,17 @@ the distribution of our test metric. Nevertheless, they differ in their *statist ### :mortar_board: Cite -If you use the ASO test via `aso()`, please cite the original work: +Using this package in general, please cite the following: + + @article{ulmer2022deep, + title={deep-significance-Easy and Meaningful Statistical Significance Testing in the Age of Neural Networks}, + author={Ulmer, Dennis and Hardmeier, Christian and Frellsen, Jes}, + journal={arXiv preprint arXiv:2204.06815}, + year={2022} + } + + +If you use the ASO test via `aso()` or `multi_aso, please cite the original works: @inproceedings{dror2019deep, author = {Rotem Dror and @@ -485,21 +520,20 @@ If you use the ASO test via `aso()`, please cite the original work: timestamp = {Tue, 28 Jan 2020 10:27:52 +0100}, } -Using this package in general, please cite the following: - - @software{dennis_ulmer_2021_4638709, - author = {Dennis Ulmer}, - title = {{deep-significance: Easy and Better Significance - Testing for Deep Neural Networks}}, - month = mar, - year = 2021, - note = {https://github.com/Kaleidophon/deep-significance}, - publisher = {Zenodo}, - version = {v1.0.0a}, - doi = {10.5281/zenodo.4638709}, - url = {https://doi.org/10.5281/zenodo.4638709} + @incollection{del2018optimal, + title={An optimal transportation approach for assessing almost stochastic order}, + author={Del Barrio, Eustasio and Cuesta-Albertos, Juan A and Matr{\'a}n, Carlos}, + booktitle={The Mathematics of the Uncertain}, + pages={33--44}, + year={2018}, + publisher={Springer} } +For instance, you can write + + In order to compare models, we use the Almost Stochastic Order test \citep{del2018optimal, dror2019deep} as + implemented by \citet{ulmer2022deep}. + ### :medal_sports: Acknowledgements This package was created out of discussions of the [NLPnorth group](https://nlpnorth.github.io/) at the IT University @@ -536,6 +570,9 @@ Dror, Rotem, Shlomov, Segev, and Reichart, Roi. "Deep dominance-how to properly Efron, Bradley, and Robert J. Tibshirani. "An introduction to the bootstrap." CRC press, 1994. +Andrew Gelman, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, Donald B Rubin, John +Carlin, Hal Stern, Donald Rubin, and David Dunson. Bayesian data analysis third edition, 2021. + Henderson, Peter, et al. "Deep reinforcement learning that matters." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 32. No. 1. 2018. Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, Tom Goldstein. "Visualizing the Loss Landscape of Neural Nets." 
NeurIPS 2018: 6391-6401 @@ -544,4 +581,7 @@ Narang, Sharan, et al. "Do Transformer Modifications Transfer Across Implementat Noreen, Eric W. "Computer intensive methods for hypothesis testing: An introduction." Wiley, New York (1989). +Ronald L Wasserstein, Allen L Schirm, and Nicole A Lazar. Moving to a world beyond “p< 0.05”, +2019 + Yuan, Ke‐Hai, and Kentaro Hayashi. "Bootstrap approach to inference and power analysis based on three test statistics for covariance structure models." British Journal of Mathematical and Statistical Psychology 56.1 (2003): 93-110. \ No newline at end of file diff --git a/deepsig/__init__.py b/deepsig/__init__.py index d64b5ce..be84c06 100644 --- a/deepsig/__init__.py +++ b/deepsig/__init__.py @@ -5,5 +5,5 @@ from deepsig.permutation import permutation_test from deepsig.sample_size import aso_uncertainty_reduction, bootstrap_power_analysis -__version__ = "1.2.3" +__version__ = "1.2.5" __author__ = "Dennis Ulmer" diff --git a/deepsig/aso.py b/deepsig/aso.py index a92708d..c67a51e 100644 --- a/deepsig/aso.py +++ b/deepsig/aso.py @@ -8,7 +8,7 @@ from warnings import warn # EXT -from joblib import Parallel, delayed +from joblib import Parallel, delayed, wrap_non_picklable_objects from joblib.externals.loky import set_loky_pickler import numpy as np import pandas as pd @@ -20,9 +20,8 @@ ArrayLike, ScoreCollection, score_pair_conversion, - ALLOWED_TYPES, - CONVERSIONS, ) +from deepsig.utils import _progress_iter, _get_num_models # MISC set_loky_pickler("dill") # Avoid weird joblib error with multi_aso @@ -32,7 +31,8 @@ def aso( scores_a: ArrayLike, scores_b: ArrayLike, - confidence_level: float = 0.05, + confidence_level: float = 0.95, + num_comparisons: int = 1, num_samples: int = 1000, num_bootstrap_iterations: int = 1000, dt: float = 0.005, @@ -60,7 +60,9 @@ def aso( scores_b: List[float] Scores of algorithm B. confidence_level: float - Desired confidence level of test. Set to 0.05 by default. + Desired confidence level of test. Set to 0.95 by default. + num_comparisons: int + Number of comparisons that the test is being used for. Is used to perform a Bonferroni correction. num_samples: int Number of samples from the score distributions during every bootstrap iteration when estimating sigma. num_bootstrap_iterations: int @@ -84,15 +86,15 @@ def aso( assert ( len(scores_a) > 0 and len(scores_b) > 0 ), "Both lists of scores must be non-empty." - assert num_samples > 0, "num_samples must be positive, {} found.".format( - num_samples - ) assert ( num_bootstrap_iterations > 0 ), "num_samples must be positive, {} found.".format(num_bootstrap_iterations) assert num_jobs > 0, "Number of jobs has to be at least 1, {} found.".format( num_jobs ) + assert ( + num_comparisons > 0 + ), "Number of comparisons has to be at least 1, {} found.".format(num_comparisons) # TODO: Remove in future version if num_samples != 1000: @@ -101,83 +103,47 @@ def aso( DeprecationWarning, ) - violation_ratio = compute_violation_ratio(scores_a, scores_b, dt) + # TODO: Remove in future version + if confidence_level < 0.95: + warn( + "'confidence_level' was refactored in version 1.2.4 to be more intuitive and usually should be in the .95 -" + f".99 range, but {confidence_level} was found. 
If you tried to adjust the confidence level for multiple " + f"comparisons, try the new num_comparisons argument instead.", + UserWarning, + ) + + if num_comparisons > 1: + confidence_level += (1 - confidence_level) / num_comparisons + + violation_ratio = compute_violation_ratio( + scores_a=scores_a, scores_b=scores_b, dt=dt + ) # Based on the actual number of samples quantile_func_a = get_quantile_function(scores_a) quantile_func_b = get_quantile_function(scores_b) - def _progress_iter(high: int, progress_bar: tqdm): - """ - This function is used when a shared progress bar is passed from multi_aso() - every time the iterator yields an - element, the progress bar is updated by one. It essentially behaves like a simplified range() function. - - Parameters - ---------- - high: int - Number of elements in iterator. - progress_bar: tqdm - Shared progress bar. - """ - current = 0 - - while current < high: - yield current - current += 1 - progress_bar.update(1) - - # Add progress bar if applicable - if show_progress and _progress_bar is None: - iters = tqdm(range(num_bootstrap_iterations), desc="Bootstrap iterations") - - # Shared progress bar when called from multi_aso() - elif _progress_bar is not None: - iters = _progress_iter(num_bootstrap_iterations, _progress_bar) - - else: - iters = range(num_bootstrap_iterations) - - # Set seeds for different jobs if applicable - # "Sub-seeds" for jobs are just seed argument + job index - seeds = ( - [None] * num_bootstrap_iterations - if seed is None - else [seed + offset for offset in range(1, num_bootstrap_iterations + 1)] + samples = get_bootstrapped_violation_ratios( + scores_a, + scores_b, + quantile_func_a, + quantile_func_b, + num_bootstrap_iterations, + dt, + num_jobs, + show_progress, + seed, + _progress_bar, ) - - def _bootstrap_iter(seed: Optional[int] = None): - """ - One bootstrap iteration. Wrapped in a function so it can be handed to joblib.Parallel. 
- """ - # When running multiple jobs, these modules have to be re-imported for some reason to avoid an error - # Use dir() to check whether module is available in local scope: - # https://stackoverflow.com/questions/30483246/how-to-check-if-a-module-has-been-imported - if "numpy" not in dir() or "deepsig" not in dir(): - import numpy as np - from deepsig.aso import compute_violation_ratio - - if seed is not None: - np.random.seed(seed) - - sampled_scores_a = quantile_func_a(np.random.uniform(0, 1, len(scores_a))) - sampled_scores_b = quantile_func_b(np.random.uniform(0, 1, len(scores_b))) - sample = compute_violation_ratio( - sampled_scores_a, - sampled_scores_b, - dt, - ) - - return sample - - # Initialize worker pool and start iterations - parallel = Parallel(n_jobs=num_jobs) - samples = parallel(delayed(_bootstrap_iter)(seed) for seed, _ in zip(seeds, iters)) + samples = np.array(samples) const = np.sqrt(len(scores_a) * len(scores_b) / (len(scores_a) + len(scores_b))) sigma_hat = np.std(const * (samples - violation_ratio)) # Compute eps_min and make sure it stays in [0, 1] min_epsilon = np.clip( - violation_ratio - (1 / const) * sigma_hat * normal.ppf(confidence_level), 0, 1 + violation_ratio - (1 / const) * sigma_hat * normal.ppf(1 - confidence_level), + 0, + 1, ) return min_epsilon @@ -185,7 +151,7 @@ def _bootstrap_iter(seed: Optional[int] = None): def multi_aso( scores: ScoreCollection, - confidence_level: float = 0.05, + confidence_level: float = 0.95, use_bonferroni: bool = True, use_symmetry: bool = True, num_samples: int = 1000, @@ -207,7 +173,7 @@ def multi_aso( Collection of model scores. Should be either dictionary of model name to model scores, nested Python list, 2D numpy or Jax array, or 2D Tensorflow or PyTorch tensor. confidence_level: float - Desired confidence level of test. Set to 0.05 by default. + Desired confidence level of test. Set to 0.95 by default. use_bonferroni: bool Indicate whether Bonferroni correction should be applied to confidence level in order to adjust for the number of comparisons. Default is True. 
@@ -243,12 +209,28 @@ def multi_aso( DeprecationWarning, ) + # TODO: Remove in future version + if not use_symmetry: + warn( + "'use_symmetry' argument is being ignored in the current version and will be deprecated in version 1.3!", + DeprecationWarning, + ) + + # TODO: Remove in future version + if confidence_level < 0.95: + warn( + "'confidence_level' was refactored in version 1.2.4 to be more intuitive and usually should be in the .95 -" + f".99 range, but {confidence_level} was found.", + UserWarning, + ) + num_models = _get_num_models(scores) num_comparisons = num_models * (num_models - 1) / 2 eps_min = np.eye(num_models) # Initialize score matrix if use_bonferroni: - confidence_level /= num_comparisons + # Increase the confidence level based in oder to mitigate the multiple comparisons problem + confidence_level += (1 - confidence_level) / num_comparisons # Iterate over simple indices or dictionary keys depending on type of scores argument indices = list(range(num_models)) if type(scores) != dict else list(scores.keys()) @@ -266,38 +248,57 @@ def multi_aso( for i, key_i in enumerate(indices): for j, key_j in enumerate(indices[(i + 1) :], start=i + 1): scores_a, scores_b = scores[key_i], scores[key_j] + quantile_func_a = get_quantile_function(scores_a) + quantile_func_b = get_quantile_function(scores_b) + const = np.sqrt( + len(scores_a) * len(scores_b) / (len(scores_a) + len(scores_b)) + ) - eps_min[i, j] = aso( + violation_ratio_ab = compute_violation_ratio( + dt=dt, + quantile_func_a=quantile_func_a, + quantile_func_b=quantile_func_b, + ) + violation_ratio_ba = ( + 1 - violation_ratio_ab + ) # Exploit symmetry of violation ratio here + samples_ab = get_bootstrapped_violation_ratios( scores_a, scores_b, - confidence_level=confidence_level, - num_samples=1000, # TODO: Avoid double warning, remove in future version - num_bootstrap_iterations=num_bootstrap_iterations, - dt=dt, - num_jobs=num_jobs, - show_progress=False, - seed=seed, - _progress_bar=progress_bar, + quantile_func_a, + quantile_func_b, + num_bootstrap_iterations, + dt, + num_jobs, + show_progress, + seed, + progress_bar, + ) + samples_ab = np.array(samples_ab) + + # This quantity is the same for both, so we only have to compute it once, see + # (samples_ab - violation_ratio_ab) + # = (1 - samples_ba - 1 + violation_ratio_ba) + # = (samples_ba - violation_ratio_ba) + sigma_hat = np.std(const * (samples_ab - violation_ratio_ab)) + + # Compute eps_min and make sure it stays in [0, 1] + min_epsilon_ab = np.clip( + violation_ratio_ab + - (1 / const) * sigma_hat * normal.ppf(1 - confidence_level), + 0, + 1, + ) + min_epsilon_ba = np.clip( + violation_ratio_ba + - (1 / const) * sigma_hat * normal.ppf(1 - confidence_level), + 0, + 1, ) - # Use ASO(A, B, alpha) = 1 - ASO(B, A, alpha) - if use_symmetry: - eps_min[j, i] = 1 - eps_min[i, j] - - # Compute ASO(B, A, alpha) separately - else: - eps_min[i, j] = aso( - scores_b, - scores_a, - confidence_level=confidence_level, - num_samples=1000, # TODO: Avoid double warning, remove in future version - num_bootstrap_iterations=num_bootstrap_iterations, - dt=dt, - num_jobs=num_jobs, - show_progress=False, - seed=seed, - _progress_bar=progress_bar, - ) + # Set values + eps_min[i, j] = min_epsilon_ab + eps_min[j, i] = min_epsilon_ba if type(scores) == dict and return_df: eps_min = pd.DataFrame(data=eps_min, index=list(scores.keys())) @@ -306,37 +307,61 @@ def multi_aso( return eps_min -def compute_violation_ratio(scores_a: np.array, scores_b: np.array, dt: float) -> float: +def 
compute_violation_ratio(
+    scores_a: Optional[np.array] = None,
+    scores_b: Optional[np.array] = None,
+    quantile_func_a: Optional[Callable] = None,
+    quantile_func_b: Optional[Callable] = None,
+    dt: float = 0.001,
+) -> float:
    """
    Compute the violation ratio e_W2 (equation 4 + 5).

    Parameters
    ----------
-    scores_a: List[float]
+    scores_a: Optional[np.array]
        Scores of algorithm A.
-    scores_b: List[float]
+    scores_b: Optional[np.array]
        Scores of algorithm B.
    dt: float
        Differential for t during integral calculation.
+    quantile_func_a: Optional[Callable]
+        Quantile function based on the first set of scores.
+    quantile_func_b: Optional[Callable]
+        Quantile function based on the second set of scores.

    Returns
    -------
    float
        Return violation ratio.
    """
-    squared_wasserstein_dist = 0
-    int_violation_set = 0  # Integral over violation set A_X
-    quantile_func_a = get_quantile_function(scores_a)
-    quantile_func_b = get_quantile_function(scores_b)
+    assert (
+        scores_a is not None or quantile_func_a is not None
+    ), "Either scores or quantile function are required for the first sample, neither found."
+
+    assert (
+        scores_b is not None or quantile_func_b is not None
+    ), "Either scores or quantile function are required for the second sample, neither found."
+
+    if quantile_func_a is None:
+        quantile_func_a = get_quantile_function(scores_a)
+
+    if quantile_func_b is None:
+        quantile_func_b = get_quantile_function(scores_b)
+
+    t = np.arange(dt, 1, dt)  # Points we integrate over
+    f = quantile_func_a(t)  # F^-1(t)
+    g = quantile_func_b(t)  # G^-1(t)
+    diff = g - f
+    squared_wasserstein_dist = np.sum(diff ** 2 * dt)

-    for p in np.arange(0, 1, dt):
-        diff = quantile_func_b(p) - quantile_func_a(p)
-        squared_wasserstein_dist += (diff ** 2) * dt
-        int_violation_set += (max(diff, 0) ** 2) * dt
+    # Now only consider points where stochastic order is being violated and set the rest to 0
+    diff[f >= g] = 0
+    int_violation_set = np.sum(diff[1:] ** 2 * dt)  # Ignore t = 0 since t in (0, 1)

    if squared_wasserstein_dist == 0:
        warn("Division by zero encountered in violation ratio.")
-        violation_ratio = 0
+        violation_ratio = 0.5

    else:
        violation_ratio = int_violation_set / squared_wasserstein_dist

@@ -361,7 +386,7 @@ def get_quantile_function(scores: np.array) -> Callable:
    # When running multiple jobs via joblib, numpy has to be re-imported for some reason to avoid an error
    # Use dir() to check whether module is available in local scope:
    # https://stackoverflow.com/questions/30483246/how-to-check-if-a-module-has-been-imported
-    if "numpy" not in dir():
+    if "np" not in dir():
        import numpy as np

    def _quantile_function(p: float) -> float:
@@ -369,54 +394,100 @@ def _quantile_function(p: float) -> float:
        num = len(scores)
        index = int(np.ceil(num * p))

-        return cdf[min(num - 1, max(0, index - 1))]
+        return cdf[np.clip(index - 1, 0, num - 1)]

    return np.vectorize(_quantile_function)


-def _get_num_models(scores: ScoreCollection) -> int:
+def get_bootstrapped_violation_ratios(
+    scores_a: ArrayLike,
+    scores_b: ArrayLike,
+    quantile_func_a: Callable,
+    quantile_func_b: Callable,
+    num_bootstrap_iterations: int,
+    dt: float,
+    num_jobs: int,
+    show_progress: bool,
+    seed: Optional[int],
+    _progress_bar: Optional[tqdm],
+) -> List[float]:
    """
-    Retrieve the number of models from a ScoreCollection for multi_aso().
+    Retrieve violation ratios computed based on a number of bootstrap samples.

    Parameters
    ----------
-    scores: ScoreCollection
-        Collection of model scores. Should be either dictionary of model name to model scores, nested Python list,
-        2D numpy or Jax array, or 2D Tensorflow or PyTorch tensor.
+    scores_a: List[float]
+        Scores of algorithm A.
+    scores_b: List[float]
+        Scores of algorithm B.
+    quantile_func_a: Callable
+        Quantile function based on the first set of scores.
+    quantile_func_b: Callable
+        Quantile function based on the second set of scores.
+    num_bootstrap_iterations: int
+        Number of bootstrap iterations when estimating sigma.
+    dt: float
+        Differential for t during integral calculation.
+    num_jobs: int
+        Number of threads that bootstrap iterations are divided among.
+    show_progress: bool
+        Show progress bar. Default is True.
+    seed: Optional[int]
+        Set seed for reproducibility purposes. Default is None (meaning no seed is used).
+    _progress_bar: Optional[tqdm]
+        Hands over a progress bar object when called by multi_aso(). Only for internal use.

    Returns
    -------
-    int
-        Number of models.
+    List[float]
+        Bootstrapped violation ratios.
    """
-    # Python dictionary
-    if isinstance(scores, dict):
-        if len(scores) < 2:
-            raise ValueError(
-                "'scores' argument should contain at least two sets of scores, but only {} found.".format(
-                    len(scores)
-                )
-            )
+    # Add progress bar if applicable
+    if show_progress and _progress_bar is None:
+        iters = tqdm(range(num_bootstrap_iterations), desc="Bootstrap iterations")

-        return len(scores)
+    # Shared progress bar when called from multi_aso()
+    elif _progress_bar is not None:
+        iters = _progress_iter(num_bootstrap_iterations, _progress_bar)

-    # (Nested) python list
-    elif isinstance(scores, list):
-        if not isinstance(scores[0], list):
-            raise TypeError(
-                "'scores' argument must be nested list of scores when Python lists are used, but elements of type {} "
-                "found".format(type(scores[0]).__name__)
-            )
+    else:
+        iters = range(num_bootstrap_iterations)

-        return len(scores)
+    # Set seeds for different jobs if applicable
+    # "Sub-seeds" for jobs are just seed argument + job index
+    seeds = (
+        [None] * num_bootstrap_iterations
+        if seed is None
+        else [seed + offset for offset in range(1, num_bootstrap_iterations + 1)]
+    )

-    # Numpy / Jax arrays, Tensorflow / PyTorch tensor
-    elif type(scores) in ALLOWED_TYPES:
-        scores = CONVERSIONS[type(scores)](scores)  # Convert to numpy array
+    @wrap_non_picklable_objects
+    def _bootstrap_iter(seed: Optional[int] = None):
+        """
+        One bootstrap iteration. Wrapped in a function so it can be handed to joblib.Parallel.
+        """
+        # When running multiple jobs, these modules have to be re-imported for some reason to avoid an error
+        # Use dir() to check whether module is available in local scope:
+        # https://stackoverflow.com/questions/30483246/how-to-check-if-a-module-has-been-imported
+        if "numpy" not in dir() or "deepsig" not in dir():
+            import numpy as np
+            from deepsig.aso import compute_violation_ratio

-        return scores.shape[0]
+        if seed is not None:
+            np.random.seed(seed)

-    raise TypeError(
-        "Invalid type for 'scores', should be nested Python list, dict, Jax / Numpy array or Tensorflow / PyTorch "
-        "tensor, '{}' found.".format(type(scores).__name__)
-    )
+        sampled_scores_a = quantile_func_a(np.random.uniform(0, 1, len(scores_a)))
+        sampled_scores_b = quantile_func_b(np.random.uniform(0, 1, len(scores_b)))
+        sample = compute_violation_ratio(
+            scores_a=sampled_scores_a,
+            scores_b=sampled_scores_b,
+            dt=dt,
+        )
+
+        return sample
+
+    # Initialize worker pool and start iterations
+    parallel = Parallel(n_jobs=num_jobs)
+    samples = parallel(delayed(_bootstrap_iter)(seed) for seed, _ in zip(seeds, iters))
+
+    return samples
diff --git a/deepsig/bootstrap.py b/deepsig/bootstrap.py
index aec8459..578d0ba 100644
--- a/deepsig/bootstrap.py
+++ b/deepsig/bootstrap.py
@@ -3,7 +3,11 @@
`(Efron & Tibshirani, 1994)
Although Deep Learning has undergone spectacular growth in the recent decade, a large portion of experimental evidence is not supported by statistical hypothesis tests. Instead,
@@ -194,16 +194,17 @@
-Statistical Significance tests such as Almost Stochastic Order (Dror et al., 2019), bootstrap (Efron & Tibshirani, 1994) and
-permutation-randomization (Noreen, 1989).
+Statistical Significance tests such as Almost Stochastic Order (del Barrio et al., 2017; Dror et al., 2019),
+bootstrap (Efron & Tibshirani, 1994) and permutation-randomization (Noreen, 1989).
Bonferroni correction methods for multiplicity in datasets (Bonferroni, 1936).
Bootstrap power analysis (Yuan & Hayashi, 2003) and other functions to determine the right sample size.
All functions are fully tested and also compatible with common deep learning data structures, such as PyTorch / Tensorflow tensors as well as NumPy and Jax arrays. For usage examples, consult the documentation here or the scenarios in the section Examples.
-The package can simply be installed using pip
by running
pip3 install deepsig
@@ -216,62 +217,72 @@ |:inbox_tray:| Installation
+
+
|:bookmark:| Examples
tl;dr: Use aso() to compare scores for two models. If the returned eps_min < 0.5, A is better than B. The lower
-eps_min, the more confident the result.
+eps_min, the more confident the result (we recommend checking eps_min < 0.2 and recording eps_min alongside
+experimental results).
|:warning:| Testing models with only one set of hyperparameters and only one test set will not be able to guarantee superiority
in all settings. See General Recommendations & other notes.
-In the following, I will lay out three scenarios that describe common use cases for ML practitioners and how to apply
+In the following, we will lay out three scenarios that describe common use cases for ML practitioners and how to apply
the methods implemented in this package accordingly. For an introduction into statistical hypothesis testing, please
refer to resources such as this blog post for a general
overview or Dror et al. (2018) for a NLP-specific point of view.
-In general, in statistical significance testing, we usually compare two algorithms and on a dataset using
-some evaluation metric (we assume a higher = better). The difference between the two algorithms on the
-data is then defined as
-where is our test statistic. We then test the following null hypothesis:
-Thus, we assume our algorithm A to be equally as good or worse than algorithm B and reject the null hypothesis if A
-is better than B (what we actually would like to see). Most statistical significance tests operate using
-p-values, which define the probability that under the null-hypothesis, the expected by the test is larger than or
-equal to the observed difference (that is, for a one-sided test, i.e. we assume A to be better than B):
-We can interpret this equation as follows: Assuming that A is not better than B, the test assumes a corresponding distribution
-of differences that is drawn from. How does our actually observed difference fit in there?
-This is what the p-value is expressing: If this probability is high, is in line with what we expected under
-the null hypothesis, so we conclude A not to better than B. If the
-probability is low, that means that is quite unlikely under the null hypothesis and that the reverse
-case is more likely - i.e. that it is
-likely larger than - and we conclude that A is indeed better than B. Note that the p-value does not
-express whether the null hypothesis is true.
-To decide when we trust A to be better than B, we set a threshold that will determine when the p-value is small enough
-for us to reject the null hypothesis, this is called the significance level and it is often set to be 0.05.
-
+We assume that we have two sets of scores we would like to compare, $S_A$ and $S_B$,
+for instance obtained by running two models A and B multiple times with a different random seed.
+We can then define a one-sided test statistic $\delta(S_A, S_B)$ based on the gathered observations.
+An example of such a test statistic is the difference in observation means. We then formulate the following null hypothesis:
+
+$H_0: \delta(S_A, S_B) \leq 0$
+
+That means that we actually assume the opposite of our desired case, namely that A is not better than B,
+but equally as good or worse, as indicated by the value of the test statistic.
+Usually, the goal becomes to reject this null hypothesis using the SST.
+p-value testing is a frequentist method in the realm of SST.
+It introduces the notion of data that could have been observed if we were to repeat our experiment again using
+the same conditions, which we will write with superscript $\text{rep}$ in order to distinguish them from our actually
+observed scores (Gelman et al., 2021).
+We then define the p-value as the probability that, under the null hypothesis, the test statistic using replicated
+observations is larger than or equal to the observed test statistic:
+
+$p(\delta(S_A^\text{rep}, S_B^\text{rep}) \geq \delta(S_A, S_B) \mid H_0)$
+
+We can interpret this expression as follows: Assuming that A is not better than B, the test
+assumes a corresponding distribution of statistics that $\delta$ is drawn from. So how does the observed test statistic
+$\delta(S_A, S_B)$ fit in here? This is what the p-value expresses: When the
+probability is high, $\delta(S_A, S_B)$ is in line with what we expected under the
+null hypothesis, so we can not reject the null hypothesis, or in other words, we cannot conclude
+A to be better than B. If the probability is low, that means that the observed
+$\delta(S_A, S_B)$ is quite unlikely under the null hypothesis and that the reverse case is
+more likely - i.e. that it is likely larger than $\delta(S_A^\text{rep}, S_B^\text{rep})$ - and we conclude that A is indeed better than
+B. Note that the p-value does not express whether the null hypothesis is true. To make our decision
+about whether or not to reject the null hypothesis, we typically determine a threshold - the significance level
+$\alpha$, often set to 0.05 - that the p-value has to fall below. However, it has been argued that a better practice
+involves reporting the p-value alongside the results without a pigeonholing of results into significant and non-significant
+(Wasserstein et al., 2019).
+
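To make this concrete, here is a minimal, self-contained sketch of a one-sided permutation test that estimates such a p-value for the difference in means. It uses plain NumPy and is not part of the deepsig API; `permutation_p_value` is a hypothetical helper name.

```python
import numpy as np


def permutation_p_value(scores_a, scores_b, num_permutations=10000, seed=123):
    """One-sided permutation test for delta = mean(A) - mean(B) (illustrative only)."""
    rng = np.random.default_rng(seed)
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    observed = scores_a.mean() - scores_b.mean()
    pooled = np.concatenate([scores_a, scores_b])
    count = 0
    for _ in range(num_permutations):
        permuted = rng.permutation(pooled)  # "replicated" data under the null hypothesis
        delta_rep = permuted[: len(scores_a)].mean() - permuted[len(scores_a):].mean()
        count += delta_rep >= observed
    return (count + 1) / (num_permutations + 1)  # add-one smoothing avoids p = 0


scores_a = np.random.normal(loc=0.9, scale=0.8, size=5)
scores_b = np.random.normal(loc=0.0, scale=1.0, size=5)
print(permutation_p_value(scores_a, scores_b))
```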
Intermezzo: Almost Stochastic Order - a better significance test for Deep Neural Networks
Deep neural networks are highly non-linear models whose performance depends heavily on hyperparameters, random
seeds and other (stochastic) factors. Therefore, comparing the means of two models across several runs might not be
enough to decide if a model A is better than B. In fact, even aggregating more statistics like standard deviation, minimum
-or maximum might not be enough to make a decision. For this reason, Dror et al. (2019) introduced Almost Stochastic
-Order (ASO), a test to compare two score distributions.
+or maximum might not be enough to make a decision. For this reason, del Barrio et al. (2017) and Dror et al. (2019)
+introduced Almost Stochastic Order (ASO), a test to compare two score distributions.
It builds on the concept of stochastic order: We can compare two distributions and declare one as stochastically dominant
by comparing their cumulative distribution functions:
Here, the CDF of A is given in red and the CDF of B in green. If the CDF of A is lower than that of B for every $x$, we know the
algorithm A to score higher. However, in practice these cases are rarely so clear-cut (imagine e.g. two normal
distributions with the same mean but different variances).
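As a small illustration of this idea (a NumPy-only sketch, independent of the package), one can compare empirical CDFs on a shared grid and check where stochastic order is violated:

```python
import numpy as np

rng = np.random.default_rng(0)
scores_a = rng.normal(loc=0.9, scale=0.8, size=1000)  # algorithm A
scores_b = rng.normal(loc=0.0, scale=1.0, size=1000)  # algorithm B

grid = np.linspace(
    min(scores_a.min(), scores_b.min()), max(scores_a.max(), scores_b.max()), num=200
)
cdf_a = np.searchsorted(np.sort(scores_a), grid, side="right") / len(scores_a)
cdf_b = np.searchsorted(np.sort(scores_b), grid, side="right") / len(scores_b)

print(np.all(cdf_a <= cdf_b))  # strict stochastic dominance of A: rarely exactly True
print(np.mean(cdf_a > cdf_b))  # fraction of the grid where the order is violated
```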
-For this reason, Dror et al. (2019) consider the notion of almost stochastic dominance by quantifying the extent to
-which stochastic order is being violated (red area):
+For this reason, del Barrio et al. (2017) and Dror et al. (2019) consider the notion of almost stochastic dominance
+by quantifying the extent to which stochastic order is being violated (red area):
-ASO returns a value , which expresses the amount of violation of stochastic order. If
-, A is stochastically dominant over B in more cases than vice versa, then the corresponding algorithm can be declared as
+ASO returns a value $\epsilon_\text{min}$, which expresses (an upper bound to) the amount of violation of stochastic order. If
+$\epsilon_\text{min} < \tau$ (where $\tau$ is 0.5 or less), A is stochastically dominant over B in more cases than vice versa, and the corresponding algorithm can be declared as
superior. We can also interpret $\epsilon_\text{min}$ as a confidence score. The lower it is, the more sure we can be
that A is better than B. Note: ASO does not compute p-values. Instead, the null hypothesis is formulated as
$H_0: \epsilon_\text{min} \geq \tau$.
-If we want to be more confident about the result of ASO, we can also set the rejection threshold to be lower than 0.5.
+If we want to be more confident about the result of ASO, we can also set the rejection threshold $\tau$ to be lower than 0.5
+(see the discussion in this section).
Furthermore, the significance level $\alpha$ is determined as an input argument when running ASO and actively influences
the resulting $\epsilon_\text{min}$.
-
-
+
+
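The violation ratio itself can be written down compactly. Below is a self-contained sketch of the integral approximation, mirroring the vectorized computation in the aso.py diff above but using NumPy's empirical quantiles instead of the package's own quantile function, so the numbers will differ slightly. Note that this is only the raw violation ratio; aso() additionally uses bootstrapped violation ratios (see get_bootstrapped_violation_ratios above) to turn it into the reported eps_min upper bound.

```python
import numpy as np


def violation_ratio_sketch(scores_a, scores_b, dt=0.001):
    """Approximate e_W2: the share of the squared Wasserstein distance between the
    two score distributions that lies in the region violating stochastic order of A over B."""
    t = np.arange(dt, 1, dt)        # integration grid over (0, 1)
    f = np.quantile(scores_a, t)    # empirical F^-1(t) for algorithm A
    g = np.quantile(scores_b, t)    # empirical G^-1(t) for algorithm B
    diff = g - f
    squared_wasserstein = np.sum(diff ** 2 * dt)
    if squared_wasserstein == 0:
        return 0.5                  # same convention as in the diff above
    violation = np.where(f >= g, 0.0, diff)  # keep only points where order is violated
    return np.sum(violation ** 2 * dt) / squared_wasserstein


rng = np.random.default_rng(1234)
print(violation_ratio_sketch(rng.normal(0.9, 0.8, 500), rng.normal(0.0, 1.0, 500)))
```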
Scenario 1 - Comparing multiple runs of two models
In the simplest scenario, we have retrieved a set of scores from a model A and a baseline B on a dataset, stemming from
various model runs with different seeds. We want to test whether our model A is better than B (higher scores = better).
@@ -279,12 +290,15 @@ Scenario 1 - Comparing multiple runs of two models
import numpy as np
from deepsig import aso
+seed = 1234
+np.random.seed(seed)
+
# Simulate scores
N = 5 # Number of random seeds
my_model_scores = np.random.normal(loc=0.9, scale=0.8, size=N)
baseline_scores = np.random.normal(loc=0, scale=1, size=N)
-min_eps = aso(my_model_scores, baseline_scores) # min_eps = 0.0, so A is better
+min_eps = aso(my_model_scores, baseline_scores, seed=seed) # min_eps = 0.225, so A is better
Note that ASO does not make any assumptions about the distributions of the scores.
@@ -292,8 +306,8 @@ Scenario 1 - Comparing multiple runs of two models
-
When comparing models across datasets, we formulate one null hypothesis per dataset. However, we have to make sure not to fall prey to the multiple comparisons problem: In short,
@@ -303,6 +317,9 @@
import numpy as np
from deepsig import aso
+seed = 1234
+np.random.seed(seed)
+
# Simulate scores for three datasets
M = 3 # Number of datasets
N = 5 # Number of random seeds
@@ -310,12 +327,12 @@ Scenario 2 - Comparing multiple runs across datasets
baseline_scores_per_dataset = [np.random.normal(loc=0, scale=1, size=N) for _ in range(M)]
# epsilon_min values with Bonferroni correction
-eps_min = [aso(a, b, confidence_level=0.05 / M) for a, b in zip(my_model_scores_per_dataset, baseline_scores_per_dataset)]
-# eps_min = [0.1565800030782686, 1, 0.0]
+eps_min = [aso(a, b, confidence_level=0.95, num_comparisons=M, seed=seed) for a, b in zip(my_model_scores_per_dataset, baseline_scores_per_dataset)]
+# eps_min = [0.006370113450148568, 0.6534772728574852, 0.0]
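For reference, the classic Bonferroni adjustment that motivates the num_comparisons argument looks as follows (a conceptual sketch, not the package's internal code):

```python
alpha = 0.05               # family-wise significance level
num_comparisons = 3        # one comparison per dataset
alpha_corrected = alpha / num_comparisons  # Bonferroni: each test uses the stricter level
print(alpha_corrected)     # 0.016666666666666666
```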
In previous examples, we have assumed that we compare two algorithms A and B based on their performance per run, i.e. we run each algorithm once per random seed and obtain exactly one score on our test set. In some cases however,
@@ -328,6 +345,9 @@
Similarly, when comparing multiple models (now again on a per-seed basis), we can use an approach similar to the previous example. For instance, for three models, we can create a $3 \times 3$ matrix and fill the entries
@@ -358,6 +380,9 @@
import numpy as np
from deepsig import multi_aso
+
+seed = 1234
+np.random.seed(seed)
N = 5 # Number of random seeds
M = 3 # Number of different models / algorithms
@@ -366,23 +391,25 @@ Scenario 4 - Comparing more than two models
# Here, we will sample from N(0.1, 0.8), N(0.15, 0.8), N(0.2, 0.8)
my_models_scores = np.array([np.random.normal(loc=loc, scale=0.8, size=N) for loc in np.arange(0.1, 0.1 + 0.05 * M, step=0.05)])
-eps_min = multi_aso(my_models_scores, confidence_level=0.05)
+eps_min = multi_aso(my_models_scores, confidence_level=0.95, seed=seed)
# eps_min =
-# array([[1., 1., 1.],
-# [0., 1., 1.],
-# [0., 0., 1.]])
+# array([[1. , 0.92621655, 1. ],
+# [1. , 1. , 1. ],
+# [0.82081635, 0.73048716, 1. ]])
In the example, eps_min is now a matrix, containing the $\epsilon_\text{min}$ score between all pairs of models (for
the same model, it is set to 1 by default). The matrix is always to be read as ASO(row, column).
-The function applies the Bonferroni correction for multiple comparisons by default, but this can be turned off by using use_bonferroni=False. In order to save compute, the above symmetry property is used as well, but this can also be disabled by use_symmetry=False.
+The function applies the Bonferroni correction for multiple comparisons by default, but this can be turned off by using use_bonferroni=False.
Lastly, when the scores argument is a dictionary and the function is called with return_df=True, the resulting matrix is given as a pandas.DataFrame for increased readability:
import numpy as np
from deepsig import multi_aso
+
+seed = 1234
+np.random.seed(seed)
N = 5 # Number of random seeds
M = 3 # Number of different models / algorithms
@@ -399,18 +426,18 @@ Scenario 4 - Comparing more than two models
# ...
# }
-eps_min = multi_aso(my_models_scores, confidence_level=0.05, return_df=True)
+eps_min = multi_aso(my_models_scores, confidence_level=0.95, return_df=True, seed=seed)
# This is now a DataFrame!
# eps_min =
-# model 1 model 2 model 3
-# model 1 1.0 1.0 1.0
-# model 2 0.0 1.0 1.0
-# model 3 1.0 0.0 1.0
+# model 1 model 2 model 3
+# model 1 1.000000 0.926217 1.0
+# model 2 1.000000 1.000000 1.0
+# model 3 0.820816 0.730487 1.0
When ASO is used, two important details have to be reported, namely the confidence level $\alpha$ and the $\epsilon_\text{min}$ score. Below are some example snippets reporting the results of scenarios 1 and 4:
@@ -419,11 +446,11 @@
It can be hard to determine whether the currently collected set of scores is large enough to allow for reliable
significance testing or whether more scores are required. For this reason, deep-significance
also implements functions to aid the decision of whether to
@@ -469,10 +496,10 @@
Waiting for all the bootstrap iterations to finish can feel tedious, especially when doing many comparisons. Therefore,
ASO supports multithreading using joblib
@@ -481,15 +508,15 @@
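For example (a sketch; num_jobs and seed are the arguments shown elsewhere in this diff, all other defaults are left untouched):

```python
import numpy as np
from deepsig import aso

scores_a = np.random.normal(loc=0.9, scale=0.8, size=5)
scores_b = np.random.normal(loc=0.0, scale=1.0, size=5)

# Spread the bootstrap iterations of a single comparison over four joblib workers
min_eps = aso(scores_a, scores_b, num_jobs=4, seed=1234)
```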
All tests implemented in this package can also take PyTorch / Tensorflow tensors and Jax or NumPy arrays as arguments:
from deepsig import aso
@@ -501,13 +528,13 @@ |:electric_plug:| Compatibility with PyTorch, Tensorflow, Jax & Numpy
aso(a, b)  # It just works!
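The definitions of a and b are elided by the diff above; the following is a hedged sketch of what such a call can look like with PyTorch tensors, one of the supported input types (converted to NumPy arrays internally via the conversion logic visible in the aso.py diff):

```python
import torch
from deepsig import aso

a = torch.randn(5) + 1  # scores of model A as a PyTorch tensor
b = torch.randn(5)      # scores of baseline B as a PyTorch tensor

min_eps = aso(a, b, seed=1234)  # tensors are converted to NumPy arrays internally
```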
In order to ensure replicability, both aso() and multi_aso() supply a seed argument. This even works when multiple jobs are used!
Should you be suspicious of ASO and want to revert to the good old faithful tests, this package also implements the paired bootstrap as well as the permutation-randomization test. Note that as discussed in the next section, these
@@ -523,9 +550,9 @@
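A hedged usage sketch, assuming the two tests are exposed as bootstrap_test and permutation_test and follow the same (scores_a, scores_b) calling convention as aso(); check the API reference for the exact names and arguments:

```python
import numpy as np
from deepsig import bootstrap_test, permutation_test  # assumed export names

scores_a = np.random.normal(loc=0.9, scale=0.8, size=5)
scores_b = np.random.normal(loc=0.0, scale=1.0, size=5)

# Both tests return a p-value for the one-sided comparison of A against B
p_boot = bootstrap_test(scores_a, scores_b)
p_perm = permutation_test(scores_a, scores_b)
```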
Naturally, the CDFs built from scores_a and scores_b can only be approximations of the true distributions. Therefore,
@@ -534,10 +561,13 @@
num_bootstrap_iterations can be reduced to increase the speed of aso(). However, this is not
recommended as the result of the test will also become less accurate. Technically, $\epsilon_\text{min}$ is an upper bound
that becomes tighter with the number of samples and bootstrap iterations (del Barrio et al., 2017). Thus, increasing
the number of jobs with num_jobs instead is always preferred.
While we could declare a model stochastically dominant with $\epsilon_\text{min} < 0.5$, we found this to have a comparatively high Type I error (false positives). Tests in our paper have shown that a more useful threshold that trades off Type I and Type II error between different scenarios might be $\tau = 0.2$.
Bootstrap and permutation-randomization are both non-parametric tests, i.e. they don't make any assumptions about the distribution of our test metric. Nevertheless, they differ in their statistical power, which is defined as the probability that the null hypothesis is being rejected given that there is a difference between A and B. In other words, the more powerful
@@ -547,10 +577,19 @@
-If you use the ASO test via aso(), please cite the original work:
+Using this package in general, please cite the following:
+@article{ulmer2022deep,
+ title={deep-significance - Easy and Meaningful Statistical Significance Testing in the Age of Neural Networks},
+ author={Ulmer, Dennis and Hardmeier, Christian and Frellsen, Jes},
+ journal={arXiv preprint arXiv:2204.06815},
+ year={2022}
+}
+
If you use the ASO test via aso() or multi_aso(), please cite the original works:
@inproceedings{dror2019deep,
author = {Rotem Dror and
Segev Shlomov and
@@ -569,25 +608,24 @@ |:mortar_board:| Cite
doi = {10.18653/v1/p19-1266},
timestamp = {Tue, 28 Jan 2020 10:27:52 +0100},
}
-
Using this package in general, please cite the following:
-@software{dennis_ulmer_2021_4638709,
- author = {Dennis Ulmer},
- title = {{deep-significance: Easy and Better Significance
- Testing for Deep Neural Networks}},
- month = mar,
- year = 2021,
- note = {https://github.com/Kaleidophon/deep-significance},
- publisher = {Zenodo},
- version = {v1.0.0a},
- doi = {10.5281/zenodo.4638709},
- url = {https://doi.org/10.5281/zenodo.4638709}
+
+@incollection{del2018optimal,
+ title={An optimal transportation approach for assessing almost stochastic order},
+ author={Del Barrio, Eustasio and Cuesta-Albertos, Juan A and Matr{\'a}n, Carlos},
+ booktitle={The Mathematics of the Uncertain},
+ pages={33--44},
+ year={2018},
+ publisher={Springer}
}
For instance, you can write
+In order to compare models, we use the Almost Stochastic Order test \citep{del2018optimal, dror2019deep} as
+implemented by \citet{ulmer2022deep}.
+
This package was created out of discussions of the NLPnorth group at the IT University Copenhagen, whose members I want to thank for their feedback. The code in this repository is in multiple places based on @@ -597,8 +635,8 @@
In this last section of the readme, I would like to refer to works already using deep-significance
. Open an issue or
pull request if you would like to see your work added here!
Del Barrio, Eustasio, Juan A. Cuesta-Albertos, and Carlos Matrán. “An optimal transportation approach for assessing almost stochastic order.” The Mathematics of the Uncertain. Springer, Cham, 2018. 33-44.
Bonferroni, Carlo. “Teoria statistica delle classi e calcolo delle probabilita.” Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commericiali di Firenze 8 (1936): 3-62.
@@ -616,14 +654,18 @@Dror, Rotem, Shlomov, Segev, and Reichart, Roi. “Deep dominance-how to properly compare deep neural models.” Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
Efron, Bradley, and Robert J. Tibshirani. “An introduction to the bootstrap.” CRC press, 1994.
Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. "Bayesian data analysis." Third edition, 2021.
Henderson, Peter, et al. “Deep reinforcement learning that matters.” Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 32. No. 1. 2018.
Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, Tom Goldstein. “Visualizing the Loss Landscape of Neural Nets.” NeurIPS 2018: 6391-6401
Narang, Sharan, et al. “Do Transformer Modifications Transfer Across Implementations and Applications?.” arXiv preprint arXiv:2102.11972 (2021).
Noreen, Eric W. “Computer intensive methods for hypothesis testing: An introduction.” Wiley, New York (1989).
Wasserstein, Ronald L., Allen L. Schirm, and Nicole A. Lazar. "Moving to a world beyond 'p < 0.05'." The American Statistician 73.sup1 (2019): 1-19.
Yuan, Ke‐Hai, and Kentaro Hayashi. “Bootstrap approach to inference and power analysis based on three test statistics for covariance structure models.” British Journal of Mathematical and Statistical Psychology 56.1 (2003): 93-110.
-