
Commit

Merge pull request #9 from Kaleidophon/sample-size
Sample size
  • Loading branch information
Kaleidophon authored Dec 3, 2021
2 parents 0a8e0dd + 43c46cd commit fb3856d
Showing 19 changed files with 629 additions and 26 deletions.
57 changes: 57 additions & 0 deletions README.md
@@ -18,6 +18,7 @@
* [Scenario 3: Comparing sample-level scores](#scenario-3---comparing-sample-level-scores)
* [Scenario 4: Comparing more than two models](#scenario-4---comparing-more-than-two-models)
* [How to report results](#newspaper-how-to-report-results)
* [Sample size](#control_knobs-sample-size)
* [Other features](#sparkles-other-features)
* [General Recommendations & other notes](#general-recommendations)
* [:mortar_board: Cite](#mortar_board-cite)
@@ -313,6 +314,60 @@ score. Below lists some example snippets reporting the results of scenarios 1 an
$\alpha = 0.05$ (before adjusting for all pair-wise comparisons using the Bonferroni correction). Almost stochastic
dominance ($\epsilon_\text{min} < 0.5$) is indicated in table X.

### :control_knobs: Sample size

It can be hard to determine whether the currently collected set of scores is large enough to allow for reliable
significance testing, or whether more scores are required. For this reason, `deep-significance` also implements
functions that help decide whether to collect more samples.

First of all, it contains *Bootstrap power analysis* (Yuan & Hayashi, 2003): Given a set of scores, all of them are
given a uniform lift to create an artificial second sample. The analysis then repeatedly compares bootstrapped
versions of both samples using a significance test. Ideally, these comparisons should come out significant: if the
difference between the re-sampled original and the lifted sample is non-significant, the original sample has too high
a variance. The analysis returns the *percentage of comparisons* that yielded significant results. If that number is
too low, more scores should be collected and added to the sample.
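
To make the procedure concrete, below is a minimal sketch of the idea. The size of the uniform lift and the number of
bootstrap iterations are illustrative assumptions and need not match the internals of `bootstrap_power_analysis()`.

```python
import numpy as np
from scipy import stats

def bootstrap_power_sketch(scores, lift=1.0, num_bootstrap=1000, alpha=0.05, seed=0):
    """Illustrative sketch of bootstrap power analysis; not the library implementation."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    lifted = scores + lift  # artificial second sample; the lift size here is an arbitrary choice

    num_significant = 0
    for _ in range(num_bootstrap):
        # Re-sample both samples with replacement
        sample_a = rng.choice(scores, size=len(scores), replace=True)
        sample_b = rng.choice(lifted, size=len(lifted), replace=True)
        # One-sided Welch's t-test: is the lifted sample significantly larger?
        _, p_value = stats.ttest_ind(sample_b, sample_a, equal_var=False, alternative="greater")
        if p_value < alpha:
            num_significant += 1

    return num_significant / num_bootstrap  # fraction of significant comparisons
```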

The result of the analysis is the *statistical power*: the higher the power, the smaller the risk of committing a
Type II error, i.e. of failing to reject the null hypothesis when it should in fact be rejected. Usually, a power of
around 0.8 is recommended (although that is sometimes hard to achieve in a machine learning setup).

The function can be used in the following way:

```python
import numpy as np
from deepsig import bootstrap_power_analysis

scores = np.random.normal(loc=0, scale=20, size=5) # A sample that is too small and has high variance
power = bootstrap_power_analysis(scores, show_progress=False) # 0.081, way too low

scores2 = np.random.normal(loc=0, scale=20, size=50) # Let's collect more samples
power2 = bootstrap_power_analysis(scores2, show_progress=False) # Better power with 0.2556
```

By default, `bootstrap_power_analysis()` uses a one-sided Welch's t-test. This can be changed by passing a different
function to the `significance_test` argument, which expects a callable that takes two sets of scores and returns a
p-value.
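
For instance, the package's own `permutation_test` takes two sets of scores and returns a p-value, so it can be
plugged in directly (a sketch):

```python
import numpy as np
from deepsig import bootstrap_power_analysis, permutation_test

scores = np.random.normal(loc=0, scale=20, size=10)

# Use a permutation test instead of the default one-sided Welch's t-test
power = bootstrap_power_analysis(scores, significance_test=permutation_test, show_progress=False)
```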

Secondly, if the Almost Stochastic Order (ASO) test is being used, a second function is available. ASO estimates the
violation ratio of two samples using bootstrapping. However, there is necessarily some uncertainty around that
estimate, given that we only possess a finite number of scores. Collecting more scores decreases the uncertainty and
makes the estimate tighter. The degree to which collecting more scores tightens the estimate can be computed using
the following function:

```python
import numpy as np
from deepsig import aso_uncertainty_reduction

scores1 = np.random.normal(loc=0, scale=0.3, size=5) # First sample with five scores
scores2 = np.random.normal(loc=0.2, scale=5, size=3) # Second sample with three scores

red1 = aso_uncertainty_reduction(m_old=len(scores1), n_old=len(scores2), m_new=5, n_new=5) # 1.1547005383792515
red2 = aso_uncertainty_reduction(m_old=len(scores1), n_old=len(scores2), m_new=7, n_new=3) # 1.0583005244258363

# Adding two runs to scores2 increases the tightness of the estimate by a factor of about 1.15,
# but adding two runs to scores1 only increases it by about 1.06. So spending two more runs on scores2 is better.
```
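
The reduction factors above are consistent with the uncertainty of the estimate scaling as $\sqrt{1/m + 1/n}$ for
sample sizes $m$ and $n$. This scaling is an assumption made here for illustration, but it reproduces the numbers
from the example:

```python
import numpy as np

def reduction_factor(m_old, n_old, m_new, n_new):
    # Assumed scaling of the estimate's uncertainty with the two sample sizes
    return np.sqrt((1 / m_old + 1 / n_old) / (1 / m_new + 1 / n_new))

print(reduction_factor(5, 3, 5, 5))  # 1.1547...
print(reduction_factor(5, 3, 7, 3))  # 1.0583...
```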

### :sparkles: Other features

#### :rocket: For the impatient: ASO with multi-threading
@@ -475,3 +530,5 @@ Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, Tom Goldstein. "Visualizing th
Narang, Sharan, et al. "Do Transformer Modifications Transfer Across Implementations and Applications?." arXiv preprint arXiv:2102.11972 (2021).

Noreen, Eric W. "Computer intensive methods for hypothesis testing: An introduction." Wiley, New York (1989).

Yuan, Ke‐Hai, and Kentaro Hayashi. "Bootstrap approach to inference and power analysis based on three test statistics for covariance structure models." British Journal of Mathematical and Statistical Psychology 56.1 (2003): 93-110.
57 changes: 57 additions & 0 deletions README_RAW.md
(The additions are identical to those made to README.md above.)
3 changes: 2 additions & 1 deletion deepsig/__init__.py
@@ -3,6 +3,7 @@
from deepsig.bootstrap import bootstrap_test
from deepsig.correction import bonferroni_correction
from deepsig.permutation import permutation_test
from deepsig.sample_size import aso_uncertainty_reduction, bootstrap_power_analysis

__version__ = "1.1.3"
__version__ = "1.2.0"
__author__ = "Dennis Ulmer"
4 changes: 2 additions & 2 deletions deepsig/aso.py
@@ -19,7 +19,7 @@
from deepsig.conversion import (
ArrayLike,
ScoreCollection,
score_conversion,
score_pair_conversion,
ALLOWED_TYPES,
CONVERSIONS,
)
@@ -28,7 +28,7 @@
set_loky_pickler("dill") # Avoid weird joblib error with multi_aso


@score_conversion
@score_pair_conversion
def aso(
scores_a: ArrayLike,
scores_b: ArrayLike,
4 changes: 2 additions & 2 deletions deepsig/bootstrap.py
@@ -7,10 +7,10 @@
import numpy as np

# PKG
from deepsig.conversion import ArrayLike, score_conversion
from deepsig.conversion import ArrayLike, score_pair_conversion


@score_conversion
@score_pair_conversion
def bootstrap_test(
scores_a: ArrayLike, scores_b: ArrayLike, num_samples: int = 1000
) -> float:
26 changes: 14 additions & 12 deletions deepsig/conversion.py
@@ -77,9 +77,10 @@ def extend_type(type_: type, new_type: type) -> type:
pass


def score_conversion(func: Callable) -> Callable:
def score_pair_conversion(func: Callable) -> Callable:
"""
Decorator that makes sure that any sort of array containing scores is internally being converted to a numpy array.
This decorator is specifically used for functions that take two sets of scores as their first two arguments.
Supports most common Python iterables, PyTorch and Tensorflow tensors as well as Jax arrays.
Parameters
@@ -94,7 +95,7 @@ def score_conversion(func: Callable) -> Callable:
"""

@wraps(func)
def with_score_conversion(
def with_score_pair_conversion(
scores_a: ArrayLike, scores_b: ArrayLike, *args, **kwargs
):

@@ -125,13 +126,14 @@ def _squeeze_or_exception(array: np.array, name: str) -> np.array:

return func(scores_a, scores_b, *args, **kwargs)

return with_score_conversion
return with_score_pair_conversion


def p_value_conversion(func: Callable) -> Callable:
def score_conversion(func: Callable) -> Callable:
"""
Decorator that makes sure that any sort of array containing p-values is internally being converted to a numpy array.
Supports most common Python iterables, PyTorch and Tensorflow tensors as well as Jax arrays.
Decorator that makes sure that any sort of array containing scores is internally being converted to a numpy array.
In comparison to score_pair_conversion, this decorator is used for functions only using a single set of scores
(or values). Supports most common Python iterables, PyTorch and Tensorflow tensors as well as Jax arrays.
Parameters
----------
@@ -145,18 +147,18 @@ def p_value_conversion(func: Callable) -> Callable:
"""

@wraps(func)
def with_p_value_conversion(p_values: ArrayLike, *args, **kwargs):
def with_score_conversion(scores: ArrayLike, *args, **kwargs):

# Select appropriate conversion functions
conversion_func = CONVERSIONS[type(p_values)]
conversion_func = CONVERSIONS[type(scores)]

# Convert to numpy arrays
p_values = conversion_func(p_values)
p_values = _squeeze_or_exception(p_values, "p_values")
scores = conversion_func(scores)
scores = _squeeze_or_exception(scores, "scores")

return func(p_values, *args, **kwargs)
return func(scores, *args, **kwargs)

return with_p_value_conversion
return with_score_conversion


def _squeeze_or_exception(array: np.array, name: str) -> np.array:
4 changes: 2 additions & 2 deletions deepsig/correction.py
@@ -9,10 +9,10 @@
import numpy as np

# PKG
from deepsig.conversion import p_value_conversion, ArrayLike
from deepsig.conversion import score_conversion, ArrayLike


@p_value_conversion
@score_conversion
def bonferroni_correction(p_values: ArrayLike) -> np.array:
"""
Correct for multiple comparisons based on Bonferroni's method.
4 changes: 2 additions & 2 deletions deepsig/permutation.py
@@ -6,10 +6,10 @@
import numpy as np

# PKG
from deepsig.conversion import ArrayLike, score_conversion
from deepsig.conversion import ArrayLike, score_pair_conversion


@score_conversion
@score_pair_conversion
def permutation_test(
scores_a: ArrayLike, scores_b: ArrayLike, num_samples: int = 1000
) -> float:
