detailled-docu.md


statanalysis

statanalysis.mdl_esti_md.prediction_metrics

compute_skew

def compute_skew(arr)

summary

Arguments:

Returns:

  • _type_ - description

compute_kurtosis

def compute_kurtosis(arr, residuals=None)

summary

Arguments:

Returns:

  • _type_ - description

compute_aic_bic

def compute_aic_bic(dfr: int, n: int, llh: float, method: str = "basic")

summary

Utils

Arguments:

  • dfr int - number of predictors (not including the intercept)

  • n int - number of observations

  • llh float - log-likelihood

    Question: what about mixed models?

Returns:

  • float - aic
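For reference, a minimal sketch of the textbook formulas (the "+1 for the intercept" is an assumption, since dfr excludes it, and the method argument is ignored here):

```python
import numpy as np

def compute_aic_bic_sketch(dfr: int, n: int, llh: float):
    """Textbook AIC/BIC from a maximized log-likelihood llh."""
    k = dfr + 1                    # assumption: count the intercept as a parameter
    aic = 2 * k - 2 * llh
    bic = k * np.log(n) - 2 * llh  # BIC penalizes parameters by log(n)
    return aic, bic
```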

PredictionMetrics Objects

class PredictionMetrics()

compute_log_likelihood

def compute_log_likelihood(std_eval: float = None,
                           debug=False,
                           min_tol: float = None)

summary

Arguments:

  • std_eval float, optional - (ignored if self.binary=True). Defaults to None.
  • debug bool, optional - description. Defaults to False.
  • min_tol float, optional - (ignored if self.binary=False). Defaults to None.

Returns:

  • _type_ - description
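As a reading aid, a sketch of what a Gaussian log-likelihood of this kind usually computes (assuming self.binary=False, residuals y_true - y_pred, and std_eval as the residual std; the class's actual attributes are not shown here):

```python
import numpy as np

def gaussian_log_likelihood_sketch(y_true, y_pred, std_eval=None):
    residuals = np.asarray(y_true, float) - np.asarray(y_pred, float)
    n = len(residuals)
    s = np.std(residuals) if std_eval is None else std_eval
    # log L = -n/2 * log(2*pi*s^2) - sum(r^2) / (2*s^2)
    return -0.5 * n * np.log(2 * np.pi * s**2) - np.sum(residuals**2) / (2 * s**2)
```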

statanalysis.mdl_esti_md

statanalysis.mdl_esti_md.prediction_results

HPE_REGRESSION_FISHER_TEST

def HPE_REGRESSION_FISHER_TEST(y: list,
                               y_hat: list,
                               nb_param: int,
                               alpha: float = None)

overall Fisher test for a regression: check whether the regression coefficients are jointly zero

Arguments:

  • y list - array-like of 1 dim
  • y_hat list - array-like of 1 dim
  • nb_param int - number of parameters in the regression (including the intercept). ex: for 6 independent variables, nb_param=7
  • alpha float, optional - description. Defaults to COMMON_ALPHA_FOR_HYPH_TEST.

Hypotheses

  • H0: β1 = β2 = ... = β(k-1) = 0, with k = nb_param
  • H1: βj ≠ 0, for at least one value of j

Fisher test

Returns:

  • data - (RegressionFisherTestData)
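A minimal sketch of the statistic under the stated hypotheses (SSE from the residuals, SSR = SST - SSE, F = MSR/MSE with (k-1, n-k) degrees of freedom):

```python
import numpy as np
from scipy import stats

def regression_fisher_test_sketch(y, y_hat, nb_param, alpha=0.05):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    n, k = len(y), nb_param
    sse = np.sum((y - y_hat) ** 2)             # residual sum of squares
    sst = np.sum((y - y.mean()) ** 2)          # total sum of squares
    ssr = sst - sse                            # explained sum of squares
    f_stat = (ssr / (k - 1)) / (sse / (n - k)) # F = MSR / MSE
    p_value = stats.f.sf(f_stat, dfn=k - 1, dfd=n - k)
    return f_stat, p_value, p_value < alpha    # reject H0 if p_value < alpha
```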

compute_logit_regression_results

def compute_logit_regression_results(crd: RegressionResultData,
                                     debug: bool = False)

summary

Arguments:

Returns:

  • _type_ - description

statanalysis.mdl_esti_md.log_reg_example

Author: Susan Li. Source: LogisticRegressionImplementation.ipynb (github.com/aihubprojects)

statanalysis.mdl_esti_md.hp_estimators_regression

todo

  • refactor output (last lines)
  • use "alternative" instead of "tail"
  • use kwargs format while calling functions
  • reorder fcts attributes
  • What does a negative R-squared mean?
    • according to my definition, it lies between 0 and 1 because of a sum, but is that wrong?? qastack.fr

ComputeRegression Objects

class ComputeRegression()

fit

def fit(X, y, nb_iter: float = None, learning_rate: float = None)

summary

Arguments:

  • X 2-dim array - list of columns (including the intercept column) (n, nb_params)
  • y 1-dim array - observations (n,)
  • nb_iter float, optional - description. Defaults to None.
  • learning_rate float, optional - description. Defaults to None.

Raises:

  • Exception - description

Returns:

  • _type_ - description

statanalysis.mdl_esti_md.model_estimator

We know why the Student t-distribution is useful. What about chi-square? Do we know Fisher? Yes, the F-test.

  • add a fct to predict
    • beware of extrapolation (unseen data) vs interpolation
  • another for the curve showing the std
    • the interval should be narrower when X approaches the sample mean
  • a good list of reminders about regression here - sites.ualberta.ca - pdf

ME_Normal_dist

def ME_Normal_dist(sample: list, alpha=None, debug=False)

estimate a normal distribution from a sample

visualisation:

  • check if normal:
    • sns.distplot(data.X)
    • check if the qq-plot is linear (en.wikipedia.org):
      from statsmodels.graphics.gofplots import qqplot
      from matplotlib import pyplot
      qqplot(sample, line='s')
      pyplot.show()

hypothesis

length

  • you may need more than 1000 samples

ME_Regression

def ME_Regression(x: list,
                  y: list,
                  degre: int,
                  logit=False,
                  fit_intercept=True,
                  debug=False,
                  alpha: float = 0.05,
                  nb_iter: int = 100000,
                  learning_rate: float = 0.1)

estimate a regression model from two samples

prediction

  • predict Y conditional on X assuming that Y = pr[0] + pr[1]*X + pr[2]*X^2 + pr[3]*X^3 + N(0,s**2)
  • Y is a dependent variable
  • X, s are independent ones => predictors of the dependent variable
  • If there is a time stamp of measures (or paired data), please add them as independent variables: pr[0] + pr[4]*T1 + pr[5]*T2 + ...

visualisation:

  • sns.scatterplot(X,Y)

hypothesis

  • Y = pr[0] + pr[1]*X + pr[2]*X^2 + pr[3]*X^3 + err
  • err ~ N(0,s**2)
  • variance(error)==s**2 is the same across the data
  • var(Y|X)==s**2 ; E(Y|X) = pr[0] + pr[1]*X + pr[2]*X^2 + pr[3]*X^3
  • pr[i] constant
  • pr[i] not null => a hypothesis test is added (to reject the null H0:coeff==0 against H1:coeff!=0), not a confidence interval (to check whether 0 is in it)

prediction

  • each pr[i] has a mean and a std based on a normal distribution
  • Y too =>
    • Mean(Y) = y_hat = pr_h[0] + pr_h[1]*X + pr_h[2]*X^2 + pr_h[3]*X^3
    • Some models can predict quantile(Y, 95%), but std(y_hat) will just be added later. (isn't that s?)

predictors

  • pr[i], s**2

length

  • you may need more than 1000 samples

Others: Don't forget about the errors! Predictions carry some uncertainty => [poorer fitted model => larger uncertainty]
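A usage-style sketch of the same cubic model, with np.polyfit standing in for the estimator (ME_Regression's actual return object is richer than this):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000                           # the docs suggest more than 1000 samples
x = rng.uniform(-2, 2, n)
pr = [1.0, -0.5, 0.3, 0.1]         # Y = pr[0] + pr[1]*X + pr[2]*X^2 + pr[3]*X^3 + err
s = 0.4
y = pr[0] + pr[1]*x + pr[2]*x**2 + pr[3]*x**3 + rng.normal(0, s, n)

coeffs = np.polyfit(x, y, deg=3)   # returns the highest degree first
pr_hat = coeffs[::-1]              # reversed to match pr[i]
y_hat = np.polyval(coeffs, x)
s_hat = np.std(y - y_hat, ddof=4)  # ddof = number of fitted parameters
```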

utils

ME_multiple_regression

def ME_multiple_regression(X: list,
                           y: list,
                           logit=False,
                           fit_intercept=True,
                           debug=False,
                           alpha: float = 0.05,
                           nb_iter: int = 100000,
                           learning_rate: float = 0.1)

summary

Arguments:

  • X list - description

  • y list - description

  • debug bool, optional - description. Defaults to False.

  • alpha type, optional - description. Defaults to COMMON_ALPHA_FOR_HYPH_TEST.

    estimate a regression model from two samples

    prediction

    • predict Y conditional on X, B, G, ... assuming that Y = pr[0] + pr[1]*X + pr[2]*B + pr[3]*G + N(0,s**2)
    • Y is a dependent variable
    • X, B, G, ..., s are independent ones => predictors of the dependent variable
    • If there is a time stamp of measures (or paired data), please add them as independent variables: pr[0] + pr[4]*T1 + pr[5]*T2 + ... => The correlation of the repeated measures needs to be taken into account, and time since administration needs to be added to the model as an independent variable.

    Questions of interest

    • Are you interested in establishing a relationship?
    • Are you interested in which predictors are driving that relationship?

    visualisation:

    • sns.scatterplot(X[i],y) for i in range(len(X))
    • check for: form (linear or not); direction (positive or negative); strength of the collinearity; outliers

    hypothesis

    • Y = pr[0] + pr[1]*X + pr[2]*B + pr[3]*G + err
    • err ~ N(0,s**2)
    • variance(error)==s**2 is the same across the data
    • var(Y|X)==s**2 ; E(Y|X) = pr[0] + pr[1]*X + pr[2]*B + pr[3]*G
    • pr[i] constant
    • pr[i] not null => a hypothesis test is added (to reject the null H0:coeff==0 against H1:coeff!=0), not a confidence interval (to check whether 0 is in it)
    • no collinearity (a.k.a. multicollinearity)
    • a correlation will be computed
    • anyway, it changes neither the predictive power nor the efficiency of the model
    • also, AIC selection should remove one of the correlated predictors, right?
    • but the coefficient estimates are unreliable because of the redundancy
    • Regression Trees can handle correlated data well

    prediction

    • each pr[i] has a mean and a std based on a normal distribution
    • Y too =>
    • Mean(Y) = y_hat = pr_h[0] + pr_h[1]*X + pr_h[2]*B + pr_h[3]*G
    • Some models can predict quantile(Y, 95%), but std(y_hat) will just be added later. (isn't that s?)

    predictors

    • pr[i], s**2

    length

    • you may need more than 1000 samples

    Others: Don't forget about the errors! Predictions carry some uncertainty => [poorer fitted model => larger uncertainty]

Raises:

  • Exception - description

Returns:

  • _type_ - description
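A least-squares sketch of the linear (logit=False) case, assuming X has one row per observation; the library's actual return object is richer than this:

```python
import numpy as np

def multiple_regression_sketch(X, y, fit_intercept=True):
    X = np.asarray(X, dtype=float)   # assumed shape (n, nb_features)
    y = np.asarray(y, dtype=float)
    if fit_intercept:
        X = np.column_stack([np.ones(len(X)), X])  # prepend the intercept column
    pr, *_ = np.linalg.lstsq(X, y, rcond=None)     # minimizes ||y - X @ pr||^2
    residuals = y - X @ pr
    s2 = residuals @ residuals / (len(y) - X.shape[1])  # unbiased estimate of s**2
    return pr, s2
```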

statanalysis.common

statanalysis.utils_md.preprocessing

clear_list

def clear_list(L: list) -> ndarray

remove nan from a list

Arguments:

  • L list - a 1-dim array (n,). In any case, data will be flattened

    What about handling missing values properly!?

    • weighting
    • Anyway, it would be good to know how removing missing values affects the distribution of L

Returns:

1-dim array: array of shape (n,)

Examples

A = np.array([ [1,3], [4,3], [5,3], [7,np.nan] ])
y = np.array([6,np.nan,3,2])
A1 = clear_list(A)
y1 = clear_list(y)
print("A1: ",A1)

  • A1 - array([1, 3, 4, 3, 5, 3])

    print("y: ",y1)

  • y - array([6. 3. 2.])

clear_list_pair

def clear_list_pair(L1, L2) -> Tuple[ndarray, ndarray]

remove nan values (remove observation data containing nan value in L1 or L2) from 2 lists

Arguments:

  • L1 list - a 1-dim array (n,). In any case, data will be flattened

  • L2 list - a 1-dim array (n,). In any case, data will be flattened

    What about handling missing values properly!?

    • weighting
    • Anyway, it would be good to know how removing missing values affects the distribution of L1 and L2

Raises:

L1 and L2 have different size: lists must be of the same size

Returns:

1-dim array: L1 of shape (n,)
1-dim array: L2 of shape (n,)

Examples

y1 = np.array([4, 8,np.nan,2])
y2 = np.array([6,np.nan,36,9])
y1,y2 = clear_list_pair(y1, y2)
print("y1: ",y1)

  • y1 - array([4, 2])

    print("y2: ",y2)

  • y2 - array([6. 9.])

clear_mat_vec

def clear_mat_vec(A, y) -> Tuple[ndarray, ndarray]

Remove nan values (remove observation data containing a nan value in A or y) from a matrix and a corresponding vector

Parameters

A : 2-dimensional array (n,p) y: 1-dimensional array (n,)

Others

What about handling missing values properly!? - weighting - Anyway, it would be good to know how removing missing values affects the distribution of the data

Raises

A and y have different sizes: they must have the same number of rows

Returns

2-dim array: A of shape (n,p)
1-dim array: y of shape (n,)

Examples

A = np.array([ [1,3], [4,3], [5,3], [7,np.nan] ])
y = np.array([6,np.nan,3,2])
A1,y1 = clear_mat_vec(A,y)
print("A1: ",A1)
A1: [[1. 3.] [5. 3.]]
print("y: ",y1)
y: [6. 3.]

statanalysis.utils_md

statanalysis.utils_md.estimate_std

estimate_std

def estimate_std(sample)

Unlike the plain std, this divides by (n-1), corresponding to the std estimator used in the t-test
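Equivalently, a one-line sketch using numpy's ddof argument:

```python
import numpy as np

sample = np.array([4.0, 8.0, 15.0, 16.0, 23.0, 42.0])
std_hat = np.std(sample, ddof=1)  # Bessel's correction: divide by (n-1)
```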

statanalysis.utils_md.compute_ppf_and_p_value

get_p_value_from_tail

def get_p_value_from_tail(prob, tail, debug=False)

get p value based on cdf and tail. If tail=Tails.middle, the distribution is assumed symmetric because we double F(Z). if tail
- right: return P(N > Z) = 1 - F(Z) = 1 - prob
- left: return P(N < Z) = F(Z) = prob
- middle: return P(N < -|Z|) + P(N > |Z|) => return 2*P(N > |Z|)

get_p_value_z_test

def get_p_value_z_test(Z: float, tail: str, debug=False)

get p value based on the normal distribution N(0, 1). if tail
- right: return P(N > Z)
- left: return P(N < Z)
- middle: return P(N < -|Z|) + P(N > |Z|) => return 2*P(N > |Z|)

get_p_value_t_test

def get_p_value_t_test(Z: float, ddl, tail: str, debug: bool = False)

get p value based on the Student distribution T(df=ddl) with ddl degrees of freedom. if tail
- right: return P(T > Z)
- left: return P(T < Z)
- middle: return P(T < -|Z|) + P(T > |Z|) => return 2*P(T > |Z|)

get_p_value_f_test

def get_p_value_f_test(Z: float, dfn: int, dfd: int, debug: bool = False)

get p value based on the Fisher distribution F(dfn, dfd) with (dfn, dfd) degrees of freedom

Utils

Arguments:

  • Z float - description
  • dfn int - description
  • dfd int - description
  • debug bool, optional - description. Defaults to False.

Raises:

  • Exception - description

Returns:

  • _type_ - description

get_p_value

def get_p_value(Z: float, tail: str, test: str, ddl: int = None, debug=False)

get p value based on - (if test=="t_test") the Student distribution T(df=ddl) with ddl degrees of freedom - (if test=="z_test") the normal distribution N(0, 1) - (if test=="f_test") the Fisher distribution F(ddl[0], ddl[1])

if tail
- right: return P(T > Z)
- left: return P(T < Z)
- middle: return P(T < -|Z|) + P(T > |Z|) => return 2*P(T > |Z|)
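A minimal sketch of the same dispatch logic with scipy (assuming the Tails values are the strings "left", "right", "middle"):

```python
from scipy import stats

def get_p_value_sketch(Z, tail, test="z_test", ddl=None):
    # pick the reference distribution
    if test == "z_test":
        dist = stats.norm()
    elif test == "t_test":
        dist = stats.t(df=ddl)
    elif test == "f_test":
        dist = stats.f(dfn=ddl[0], dfd=ddl[1])
    else:
        raise ValueError(test)
    if tail == "right":
        return dist.sf(Z)            # P(T > Z)
    if tail == "left":
        return dist.cdf(Z)           # P(T < Z)
    return 2 * dist.sf(abs(Z))       # middle: 2 * P(T > |Z|)
```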

statanalysis.utils_md.constants

statanalysis.utils_md.refactoring

Confidence_data Objects

@dataclass()
class Confidence_data()

sample_size

int or tuple

Hypothesis_data Objects

@dataclass()
class Hypothesis_data()

pnull

prior value to check against

tail

right, left, middle

sample_size

int or tuple

reject_null

if H0 is rejected

RegressionFisherTestData Objects

@dataclass()
class RegressionFisherTestData()

MSR

SSR/(k-1)

R_carre

1-SSR/SST

R_carre_adj

1-MSR/MST

F_stat

F=MSR/MSE

reject_null

If F is large

statanalysis.hyp_vali_md

statanalysis.hyp_vali_md.constraints

check_sample_normality

def check_sample_normality(residuals: list, debug=False, alpha=None)

check whether the residuals look like a normal distribution

  • test_implemented

Arguments:

  • residuals list - list of floats or array-like (will be flattened)
  • debug bool, optional - description. Defaults to False.

Returns:

  • bool - if all tests passed
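The specific tests implemented are not listed here; a sketch using two common normality tests from scipy:

```python
import numpy as np
from scipy import stats

def check_sample_normality_sketch(residuals, alpha=0.05):
    r = np.asarray(residuals, dtype=float).flatten()
    tests = [stats.shapiro(r), stats.jarque_bera(r)]
    # True only if no test rejects normality at level alpha
    return all(p_value > alpha for _, p_value in tests)
```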

check_equal_var

def check_equal_var(*samples, alpha=COMMON_ALPHA_FOR_HYPH_TEST)

summary

Arguments:

Returns:

  • _type_ - description

statanalysis.hyp_vali_md.hypothesis_validator

Validation tests (hypotheses to check before running another test) that depend on tests I wrote myself. Putting them in utils could create a circular import.

check_residuals_centered

def check_residuals_centered(residuals: list, alpha=None)

check if a list is centered (if the mean == 0 under a significance level of 0.05)

Arguments:

  • residuals list - list or array-like

Returns:

  • _type_ - description

check_coefficients_non_zero

def check_coefficients_non_zero(list_coeffs: list,
                                list_coeff_std: list,
                                nb_obs: int,
                                debug=False,
                                alpha=None)

compute a non-zero test for each coefficient

  • test
  • for each coefficient
  • H0: coeff==0
  • H1: coeff!=0
  • if the test passes (H0 is rejected), the coefficient is away from 0, return = True

Arguments:

  • list_coeffs list - lists of values
  • list_coeff_std list - list of std; the two lists should have the same length

Returns:

  • HypothesisValidationData(pass_non_zero_test_bool, pass_non_zero_test)
  • testPassed (bool)
  • obj (list) - list of booleans (for each value, True if H0 is rejected)
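A sketch of the per-coefficient test; the degrees of freedom below (nb_obs minus the number of coefficients) are an assumption:

```python
import numpy as np
from scipy import stats

def check_coefficients_non_zero_sketch(list_coeffs, list_coeff_std, nb_obs, alpha=0.05):
    coeffs = np.asarray(list_coeffs, dtype=float)
    stds = np.asarray(list_coeff_std, dtype=float)
    df = nb_obs - len(coeffs)                          # assumed degrees of freedom
    t_stats = coeffs / stds
    p_values = 2 * stats.t.sf(np.abs(t_stats), df=df)  # two-sided: H1 coeff != 0
    rejected = p_values < alpha                        # True => coeff away from 0
    return bool(rejected.all()), list(rejected)
```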

check_equal_mean

def check_equal_mean(*samples, alpha=None)

check if the mean is the same across samples

Hypotheses
H0: mean1 = mean2 = mean3 = ...
H1: at least one is different

Hypothesis

  • The samples are independent.
  • Each sample is from a normally distributed population.
  • The population standard deviations of the groups are all equal. This property is known as homoscedasticity.

Arguments:

  • *samples (list): one or many lists

Fisher test

  • The F Distribution is also called Snedecor's F, Fisher's F, or the Fisher–Snedecor distribution

Returns:

  • stat - (float) F
  • p_value - (float)
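Equivalently, a one-way ANOVA sketch with scipy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
s1, s2, s3 = (rng.normal(mu, 1.0, 50) for mu in (0.0, 0.0, 0.5))
stat, p_value = stats.f_oneway(s1, s2, s3)  # F statistic and its p-value
reject_null = p_value < 0.05                # True => at least one mean differs
```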

statanalysis.conf_inte_md

statanalysis.conf_inte_md.confidence_interval

Some defs

  • parameter: A quantifiable characteristic of a population
  • confidence interval: range of reasonable values for the parameter

IC_PROPORTION_ONE

def IC_PROPORTION_ONE(sample_size: int,
                      parameter: float,
                      confidence: float = None,
                      method: str = None)

Confidence_interval(ONE PROPORTION):Confidence interval calculus after a statistic test

  • input

  • sample_size: int: sample size (more than 10 to use this method)

  • parameter: float: the measurement on the sample

  • confidence: float: confidence level (between 0 and 1). The greater the confidence, the wider the interval

  • method: str: either "classic" (default) or "conservative".

  • Example:

  • how many men are in the entire population, with a given confidence?

  • a form filled by 300 people shows that there are only 120 men => p = (120/300); N=300

  • Hypothesis

  • the sample is over 10 for each of the categories in place => we use the "Law of Large Numbers"

  • the sample proportion comes from data that is considered a simple random sample

  • Idea

  • let P: the real proportion in the population

  • let S: Size of each sample == nb of observations per sample

  • For many samples, we calculate proportions per sample: ex: for N samples of size S => N proportions values

  • (p - P) / sqrt( p*(1-p)/S ) follows a normal distribution

  • Descriptions:

  • For a given population and a parameter P to find: if we repeated this study many times, each producing a new sample (of the same size {res.sample_size==S}) from which a {res.confidence} confidence interval is computed, then {res.confidence} of the resulting confidence intervals would be expected to contain the true value P

  • If the entire interval verifies a property, then it is reasonable to say that the parameter verifies that property

  • Result

  • with a {res.confidence} confidence, we estimate that the population proportion who are men is between {res.left_tail} and {res.right_tail}
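A sketch of both methods under these assumptions (z* from the normal distribution; the conservative method bounds p*(1-p) by 1/4):

```python
import numpy as np
from scipy import stats

def ic_proportion_one_sketch(sample_size, parameter, confidence=0.95, method="classic"):
    z_star = stats.norm.ppf(1 - (1 - confidence) / 2)  # e.g. 1.96 for 95%
    if method == "conservative":
        moe = z_star * 0.5 / np.sqrt(sample_size)      # p*(1-p) <= 1/4
    else:
        moe = z_star * np.sqrt(parameter * (1 - parameter) / sample_size)
    return parameter - moe, parameter + moe

# ex from above: 120 men out of 300 respondents
left, right = ic_proportion_one_sketch(300, 120 / 300)
```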

IC_MEAN_ONE

def IC_MEAN_ONE(sample: list, t_etoile=None, confidence: float = None)

Estimate_population_mean(ONE MEAN): We need the spread (std): We will use an estimation

Data - confidence:.. - sample: value...

Method

  • Use the t-distribution for the calculation

Hypothesis

  • Samples follow a normal distribution (or are large enough to bypass this assumption) => the means of these samples follow a t-dist
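A minimal sketch of the interval described above (t* with n-1 degrees of freedom):

```python
import numpy as np
from scipy import stats

def ic_mean_one_sketch(sample, confidence=0.95):
    sample = np.asarray(sample, dtype=float)
    n = len(sample)
    t_star = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)
    moe = t_star * np.std(sample, ddof=1) / np.sqrt(n)  # t* * s / sqrt(n)
    return sample.mean() - moe, sample.mean() + moe
```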

IC_PROPORTION_TWO

def IC_PROPORTION_TWO(p1, p2, N1, N2, confidence: float = None)

Difference_population_proportion(TWO PROPORTIONS): We have estimated a parameter p on two populations (1, 2). How do we estimate p1-p2?

Method - create joint confidence interval

Construction

  • Comparison

Hypotheses

  • two independent random samples
  • large enough sample sizes: 10 per category (ex: 10 yes, 10 no)

IC_MEAN_TWO_PAIR

def IC_MEAN_TWO_PAIR(sample1,
                     sample2,
                     t_etoile=None,
                     confidence: float = None)

Difference_population_means_for_paired_data(TWO MEANS FOR PAIRED DATA): We have estimated a parameter p on two populations (1, 2). How do we estimate p1-p2?

What is paired data ?

- measurements taken on individuals (people, homes, any object)
- technicality:
    - when in a dataset (len = n) there is a column df.a whose values only repeat twice (=> df.a.nunique = n/2)
    - we can do a plot(x=feature1, y=feature2)
- examples
    - each home needs a cabinet quote from two suppliers => we want to know if there is an average difference in quotes between these two suppliers
    - in a blind taste test to compare two new juice flavors, grape and apple, consumers were given a sample of each flavor and the results will be used to estimate the percentage of all such consumers who prefer the grape flavor to the apple flavor.
- Construction
    - It is like,
        - checking if a feature's magnitude changes when going from one category to another, each pair spanning the two categories
        - Example_contexte
            - having a dataframe df, with 4 col [name, score, equipe, role]
                - equipe: "1" or "2"
                - role: df.role.nunique = 11 => len(df)==22
            - Now there is a battle: for a "same role" fight, which team is the best?
        - Example_question
            - Is there a mean difference between the education levels of twins?
                - if education levels are generally equal -> mean difference is 0
                - if education levels are unequal -> mean difference is not 0
            - So, look for 0 in the range of reasonable values

We need the spread (std): We will use an estimation

Equivl

  • IC_MEAN_ONE(confidence, sample1 - sample2)

Data - confidence:.. - Sample1: list: values... - Sample2: list: (same len) values...

Method

  • Use the t-distribution for the calculation
  • create joint confidence interval

Hypothesis

  • a random sample of identical twin sets
  • Samples follow a normal distribution (or are large enough to bypass this assumption: ex 20 twins) => the means of these samples follow a t-dist

Notes

  • With {cf} confidence, the population mean difference of the (second_team - first_team) attribute is estimated to be between {data.interval[0]} and {data.interval[1]}
  • if all values are above 0, good: the difference is significant

IC_MEAN_TWO_NOTPAIR

def IC_MEAN_TWO_NOTPAIR(sample1,
                        sample2,
                        pool=False,
                        confidence: float = None)

Difference_population_means_for_nonpaired_data(TWO MEANS FOR NON-PAIRED DATA): We have estimated a parameter p on two populations (1, 2). How do we estimate p1-p2?

Construction

  • It is like,
    • checking if a feature's magnitude changes when going from one category to another
    • Example_contexte
      • having a dataframe df, with 4 col [name, score, equipe, role]
        • equipe: "1" or "2"
        • role: df.role.nunique = 11 => len(df)==22
      • Now there is a battle: for a "same role" fight, which team is the best?
    • Example_question
      • Is there a mean difference between the education level based on gender?
        • if education levels are generally equal -> mean difference is 0
        • if education levels are unequal -> mean difference is not 0
      • So, look for 0 in the range of reasonable values

We need the spread (std): We will use an estimation

Args - confidence:.. - Sample1: list: values... - Sample2: list: (same len) values... - pool: default False
  - True
    - if we assume that our populations' variances are equal
    - we use a t-distribution with (N1+N2-2) ddl
  - False
    - if we assume that our populations' variances are not equal
    - we use a t-distribution with min(N1, N2)-1 ddl

Method

  • Use the t-distribution for the calculation
  • create joint confidence interval

Hypothesis

  • a random sample
  • Samples follow a normal distribution (or are large enough to bypass this assumption: 10 per category) => the means of these samples follow a t-dist

Notes

  • With {cf} confidence, the population mean difference of the (second_team - first_team) attribute is estimated to be between {data.interval[0]} and {data.interval[1]}
  • if all values are above 0, good: the difference is significant
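A sketch of the unpooled case described above, with the conservative min(N1, N2)-1 degrees of freedom:

```python
import numpy as np
from scipy import stats

def ic_mean_two_notpair_sketch(sample1, sample2, confidence=0.95):
    s1, s2 = np.asarray(sample1, float), np.asarray(sample2, float)
    n1, n2 = len(s1), len(s2)
    diff = s1.mean() - s2.mean()
    se = np.sqrt(np.var(s1, ddof=1) / n1 + np.var(s2, ddof=1) / n2)
    t_star = stats.t.ppf(1 - (1 - confidence) / 2, df=min(n1, n2) - 1)
    return diff - t_star * se, diff + t_star * se
```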

statanalysis.conf_inte_md.ci_estimators

todo

  • refactor output (last lines)
  • use "alternative" instead of "tail"
  • use kwargs format while calling functions
  • reorder fcts attributes

get_min_sample

def get_min_sample(moe: float, p=None, method=None, cf: float = None)

Get_min_sample: get the minimum sample_size to use for a given margin of error

  • input
    • cf: confidence (or coverage_probability): between 0 and 1
    • moe: margin of error
    • method (optional): "conservative" (default)
    • p: not used if method=="conservative"

Hyp

  • better if the population follows a normal dist. Or use a large sample (>10)
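A sketch of the conservative formula, which bounds p*(1-p) by 1/4 so that n >= (z*)^2 / (4*moe^2):

```python
import math
from scipy import stats

def get_min_sample_sketch(moe, p=None, method="conservative", cf=0.95):
    z_star = stats.norm.ppf(1 - (1 - cf) / 2)
    p_term = 0.25 if method == "conservative" else p * (1 - p)
    return math.ceil(z_star**2 * p_term / moe**2)

# ex: 95% confidence with a 3% margin of error => 1068 respondents
print(get_min_sample_sketch(0.03))
```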

CIE_ONE_PROPORTION

def CIE_ONE_PROPORTION(proportion, n, method, cf: float = None)

Get_interval_simple: get a proportion of an attribute value (male gender, ...) in a population based on a sample (no sign pb)

  • cf: confidence_level (or coverage_probability)
  • proportion: measurement
  • n: number of observations == len(sample)
  • method: "classic" or "conservative"

Hyp

  • better if the population follows a normal dist. Or use a large sample (>10)

CIE_PROPORTION_TWO

def CIE_PROPORTION_TWO(p1, p2, n1, n2, cf: float = None)

Get_interval_diff: get the diff of proportions between 2 populations based on a sample from each population (p1-p2)

  • cf: confidence_level (or coverage_probability)
  • p1: mean of liste1
  • p2: mean of liste2
  • n1: len(liste1)
  • n2: len(liste2)

Hyp

  • better if the populations follow a normal dist. Or use large samples (>10)

CIE_MEAN_ONE

def CIE_MEAN_ONE(n, mean_dist, std_sample, t_etoile=None, cf: float = None)

Get_interval_mean:get the mean of a population from a sample (no sign pb)

  • cf: confidence level (or coverage_probability)
  • n: number of observations == len(sample)
  • mean_dist: the mean measured on the sample = mean(sample)
  • std_sample: std of the sample ==std(sample)
  • t_etoile: if set, cf is ignored.

Hyp

  • better if the population follows a normal dist. Or use a large sample (>10)
    • Alternative to normality: Wilcoxon Signed Rank Test

Theo

CIE_MEAN_TWO

def CIE_MEAN_TWO(N1,
                 N2,
                 diff_mean,
                 std_sample_1,
                 std_sample_2,
                 t_etoile=None,
                 pool=False,
                 cf: float = None)

Get_interval_diff_mean: get the diff in mean of two populations(taking their samples) (sign(diff_mean) => no sign pb)

  • cf: confidence level (or coverage_probability)
  • N1: number of observations == len(sample1)
  • N2: number of observations == len(sample2)
  • diff_mean: the difference of the sample means == mean(sample1) - mean(sample2)
  • std_sample_1: std of the sample ==std(sample1)
  • std_sample_2: std of the sample ==std(sample2)
  • t_etoile: if set, cf is ignored.
  • pool: default False
    • True
      • if we assume that our populations' variances are equal
      • we use a t-distribution with (N1+N2-2) ddl
    • False
      • if we assume that our populations' variances are not equal
      • we use a t-distribution with min(N1, N2)-1 ddl

Hyp

  • both populations follow a normal dist. Or use large samples (>10)
  • the populations are independent from each other
  • use simple random samples
  • for pool=True, variances are assumed to be the same

Eqvl

  • scipy.stats.ttest_ind(liste1,liste2, equal_var = False | True)

Eqvl_pointWise estimation

  • Assume diff_mean = 82
  • Result: diff_mean in CI = [77.33, 87.63]
  • If we test H0:p=80 vs H1:p>80, we would fail to reject the null because H1 is not valid here
  • As a matter of fact, there are some values in the CI below 80, which is not compatible with H1 => the test does not give enough evidence to reject H0

Theo

statanalysis.hyp_testi_md

statanalysis.hyp_testi_md.hp_estimators

utils

  • In a test, H0 is the pessimistic hypothesis
    • so enough evidence (p<0.05) is needed to reject it

HPE_FROM_P_VALUE

def HPE_FROM_P_VALUE(tail: str = None,
                     p_value=None,
                     t_stat=None,
                     p_hat=None,
                     p0=None,
                     std_stat_eval=None,
                     alpha=None,
                     test="z_test",
                     ddl=0,
                     onetail=False)

summary

Arguments:

  • tail str, optional - "middle" or "left" or "right"
  • p_value type, optional - description. Defaults to None.
  • t_stat type, optional - description. Defaults to None.
  • p_hat type, optional - description. Defaults to None.
  • p0 type, optional - description. Defaults to None.
  • std_stat_eval type, optional - description. Defaults to None.
  • alpha type, optional - description. Defaults to None.
  • test str, optional - description. Defaults to "z_test".
  • ddl int, optional - description. Defaults to 0.
  • onetail bool, optional - if tail=="middle", return the one-tail p-value instead of the two-tail p-value. Defaults to False.

Returns:

  • _type_ - description

HPE_PROPORTION_ONE

def HPE_PROPORTION_ONE(alpha, p0, proportion, n, tail=Tails.right)

check a proportion of an attribute value (male gender, ...) in a population based on a sample (no sign pb) using a Z-statistic

  • alpha: p_value_max: significance level
  • p0: proportion under the null
  • proportion: measurement
  • n: number of observations == len(sample)
  • tail:
    • right: check if p>p0
    • left: check if p<p0
    • middle: check if p==p0

Hyp

  • simple random sample
  • large sample (np>10)

Hypotheses

  • H0: proportion = p0
  • H1:
    • tail==right => proportion > p0
    • tail==left => proportion < p0
    • tail==middle => proportion != p0

Detail

  • use a normal distribution (Z-statistic)

Result (ex:tail=right)

  • if reject==True
    • There is sufficient evidence to conclude that the population proportion of {....} is greater than p0
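A sketch of the documented test (the standard error is evaluated under the null, as is usual for a one-proportion z-test):

```python
import numpy as np
from scipy import stats

def hpe_proportion_one_sketch(alpha, p0, proportion, n, tail="right"):
    se = np.sqrt(p0 * (1 - p0) / n)       # std of the proportion under H0
    z = (proportion - p0) / se
    if tail == "right":
        p_value = stats.norm.sf(z)        # P(N > z)
    elif tail == "left":
        p_value = stats.norm.cdf(z)       # P(N < z)
    else:                                 # middle
        p_value = 2 * stats.norm.sf(abs(z))
    return z, p_value, p_value < alpha    # reject H0 if p_value < alpha
```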

HPE_PROPORTION_TW0

def HPE_PROPORTION_TW0(alpha, p1, p2, n1, n2, tail=Tails.middle, evcpp=False)

check the diff of proportion between 2 populations based on a sample from each population (p1-p2) using a Z-statistic (always used for differences of estimates). There are also Fisher and chi-square tests.

  • alpha: level of significance
  • p1: proportion of liste1
  • p2: proportion of liste2
  • n1: len(liste1)
  • n2: len(liste2)
  • evcpp: bool (default=False) (True -> Estimate of the variance of the combined population proportion)

Hyp

  • two independent samples
  • two random samples
  • large enough data

Hypotheses

  • H0: p1 - p2 = 0
  • H1: p1 - p2 != 0

Detail

  • use a normal distribution (Z-statistic)

HPE_MEAN_ONE

def HPE_MEAN_ONE(alpha, p0, mean_dist, n, std_sample, tail=Tails.right)

get the mean of a population from a sample (no sign pb) using a T-statistic (always T for a mean!! unless you're comparing a sample vs a population of known std)

  • alpha:
  • n: number of observations == len(sample)
  • mean_dist: the mean measured on the sample = mean(sample)
  • std_sample: std of the sample == std(sample). You should use a real estimate (divide by n-1, i.e. ddof=1)

Hyp

  • simple random sample
  • better if the population follows a normal dist. Or use a large sample (>10)
    • Alternative to normality: Wilcoxon Signed Rank Test

Theo

HPE_MEAN_TWO_PAIRED

def HPE_MEAN_TWO_PAIRED(alpha,
                        mean_diff_sample,
                        n,
                        std_diff_sample,
                        tail=Tails.middle)

get the difference of mean between two paired lists (no sign pb) using a T-statistic (always T for a mean!! unless you're comparing a sample vs a population of known std)

  • alpha:
  • mean_diff_sample: the mean measured on the sample == mean(sample1 - sample2)
  • n: number of observations == len(sample) == n1 == n2
  • std_diff_sample: std of the sample == std(sample1 - sample2). You should use a real estimate (divide by n-1, i.e. ddof=1)
  • tail: default=Tails.middle to test the equality (mean_diff=0). But we can also to mean_diff>0 (right) or mean_diff<0 (left)

Hyp

  • simple random sample
  • better when the diff of the samples (sample1 - sample2) follows a normal dist. Or use a large sample (>10)
  • std_diff_sample should be a good data-based estimate [use (n-1) instead of n]. example: np.std(sample1 - sample2, ddof=1) is better than ddof=0 (the default)

Hypothesis

  • H0: p1 - p2 = 0
  • H1:
    • H1: p1 - p2 != 0 for(tail=middle)
    • H1: p1 - p2 > 0 for(tail=right)
    • H1: p1 - p2 < 0 for(tail=left)

HPE_MEAN_TWO_NOTPAIRED

def HPE_MEAN_TWO_NOTPAIRED(alpha,
                           diff_mean,
                           N1,
                           N2,
                           std_sample_1,
                           std_sample_2,
                           pool=False,
                           tail=Tails.middle)

check the diff in mean of two populations(taking their samples) (sign(diff_mean) => no sign pb)

  • alpha:
  • N1: number of observations == len(sample1)
  • N2: number of observations == len(sample2)
  • diff_mean: the difference of the sample means == mean(sample1) - mean(sample2)
  • std_sample_1: std of the sample ==std(sample1)
  • std_sample_2: std of the sample ==std(sample2)
  • pool: default False
    • True
      • if we assume that our populations' variances are equal
      • we use a t-distribution with (N1+N2-2) ddl
    • False
      • if we assume that our populations' variances are not equal
      • we use a t-distribution with min(N1, N2)-1 ddl

Hyp

  • both populations follow a normal dist. Or use large samples (>10)
  • the populations are independent from each other
  • use simple random samples
  • for pool=True, variances must be the same
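Equivalently (as the Eqvl note in the CI section says), scipy's ttest_ind with equal_var mirroring pool:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample1 = rng.normal(0.0, 1.0, 40)
sample2 = rng.normal(0.5, 1.3, 55)

# pool=True  <-> equal_var=True  (pooled variance)
# pool=False <-> equal_var=False (Welch's t-test)
stat, p_value = stats.ttest_ind(sample1, sample2, equal_var=False)
reject_null = p_value < 0.05
```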

Theo

HPE_MEAN_MANY

def HPE_MEAN_MANY(*samples, alpha=None)

check if the mean is equal across many samples

Hypotheses
H0: mean1 = mean2 = mean3 = ...
H1: at least one is different

Hypothesis

Fisher test

  • The F Distribution is also called Snedecor's F, Fisher's F, or the Fisher–Snedecor distribution

Returns:

  • stat - (float) F
  • p_value - (float)

statanalysis.hyp_testi_md.hypothesis_testing

utils - In a test, H0 is the pessimistic hypothesis - so enough evidence (p<0.05) is needed to reject it - we have then put a small upper bound on the type-1 error (rejecting H0 when it is true)

Some defs - parameter: A quantifiable characteristic of a population (baseline) - alpha: level of significance = type1_error = proba(reject_null;when null is True)

HP_PROPORTION_ONE

def HP_PROPORTION_ONE(sample_size: int,
                      parameter: float,
                      p0: float,
                      alpha: float,
                      symb=Tails.SUP_SYMB)

ONE PROPORTION:alpha calculus after a statistic test

  • input

  • sample_size: int: sample size (more than 10 to use this method)

  • parameter: float: the measurement on the sample

  • alpha: float: significance level (between 0 and 1)

  • method: str: either "classic" (default) or "conservative".

  • Example:

  • how many men are in the entire population, with a given confidence?

  • a form filled by 300 people shows that there are only 120 men => p = (120/300); N=300

  • Hypothesis

  • the sample is over 10 for each of the categories in place => we use the "Law of Large Numbers"

  • the sample proportion comes from data that is considered a simple random sample

  • Idea

  • let P: the real proportion in the population

  • let S: Size of each sample == nb of observations per sample

  • For many samples, we calculate proportions per sample: ex: for N samples of size S => N proportions values

  • (p - P) / sqrt( p*(1-p)/S ) follows a normal distribution

  • Descriptions:

  • For a given population and a parameter P to find: if we repeated this study many times, each producing a new sample (of the same size {res.sample_size==S}) on which a test at level {res.alpha} is run, then (1 - {res.alpha}) of the resulting confidence intervals would be expected to contain the true value P

  • If the entire interval verifies a property, then it is reasonable to say that the parameter verifies that property

  • Result

  • at significance level {res.alpha}, we estimate that the population proportion who are men is between {res.left_tail} and {res.right_tail}

HP_MEAN_ONE

def HP_MEAN_ONE(p0: float, alpha: float, sample: list, symb=Tails.SUP_SYMB)

ONE MEAN: We need the spread (std): We will use an estimation

Data - alpha:.. - sample: value...

Method

  • Use the t-distribution for the calculation

Hypothesis

  • Samples follow a normal distribution (or are large enough to bypass this assumption) => the means of these samples follow a t-dist

HP_PROPORTION_TWO

def HP_PROPORTION_TWO(alpha, p1, p2, N1, N2, symb=Tails.NEQ_SYMB, evcpp=False)

TWO PROPORTIONS: We have estimated a parameter p on two populations (1, 2). How do we estimate p1-p2?

Method - create a joint test at level alpha - evcpp: bool (default=False) (True -> Estimate of the variance of the combined population proportion)

Construction

  • Comparison

Hypotheses

  • two independent random samples
  • large enough sample sizes: 10 per category (ex: 10 yes, 10 no)

HP_MEAN_TWO_PAIR

def HP_MEAN_TWO_PAIR(alpha, sample1, sample2, symb=Tails.NEQ_SYMB)

TWO MEANS FOR PAIRED DATA: We have estimated a parameter p on two populations (1, 2). How do we estimate p1-p2?

What is paired data?

- measurements taken on individuals (people, homes, any object)
- technicality:
    - when in a dataset (len = n) there is a column df.a whose values only repeat twice (=> df.a.nunique = n/2)
    - we can do a plot(x=feature1, y=feature2)
- examples
    - each home needs a cabinet quote from two suppliers => we want to know if there is an average difference in quotes between these two suppliers
    - in a blind taste test to compare two new juice flavors, grape and apple, consumers were given a sample of each flavor and the results will be used to estimate the percentage of all such consumers who prefer the grape flavor to the apple flavor.
- Construction
    - It is like,
        - checking if a feature's magnitude changes when going from one category to another, each pair spanning the two categories
        - Example_contexte
            - having a dataframe df, with 4 col [name, score, equipe, role]
                - equipe: "1" or "2"
                - role: df.role.nunique = 11 => len(df)==22
            - Now there is a battle: for a "same role" fight, which team is the best?
        - Example_question
            - Is there a mean difference between the education levels of twins?
                - if education levels are generally equal -> mean difference is 0
                - if education levels are unequal -> mean difference is not 0
            - So, look for 0 in the range of reasonable values

We need the spread (std): We will use an estimation

Equivl

  • estimate_population_mean(alpha, sample1 - sample2)

Data - alpha:.. - Sample1: list: values... - Sample2: list: (same len) values...

Method

  • Use the t-distribution for the calculation
  • create joint alpha

Hypothesis

  • a random sample of identical twin sets

  • Samples follow a normal distribution (or are large enough to bypass this assumption: ex 20 twins) => the means of these samples follow a t-dist

  • description

  • With significance {alpha}, the population mean difference of the (second_team - first_team) attribute is estimated to be between {data.interval[0]} and {data.interval[1]}

  • if all values are above 0, good: the difference is significant

HP_MEAN_TWO_NOTPAIR

def HP_MEAN_TWO_NOTPAIR(alpha,
                        sample1,
                        sample2,
                        symb=Tails.NEQ_SYMB,
                        pool=False)

TWO MEANS FOR NON-PAIRED DATA: We have estimated a parameter p on two populations (1, 2). How do we check p1-p2 != 0?

Construction

  • It is like,
    • checking if a feature's magnitude changes when going from one category to another
    • Example_contexte
      • having a dataframe df, with 4 col [name, score, equipe, role]
        • equipe: "1" or "2"
        • role: df.role.nunique = 11 => len(df)==22
      • Now there is a battle: for a "same role" fight, which team is the best?
    • Example_question
      • Is there a mean difference between the education level based on gender?
        • if education levels are generally equal -> mean difference is 0
        • if education levels are unequal -> mean difference is not 0
      • So, look for 0 in the range of reasonable values

We need the spread (std): We will use an estimation

Data - alpha:.. - Sample1: list: values... - Sample2: list: (same len) values... - pool: default False
  - True
    - if we assume that our populations' variances are equal
    - we use a t-distribution with (N1+N2-2) ddl
  - False
    - if we assume that our populations' variances are not equal
    - we use a t-distribution with min(N1, N2)-1 ddl

Method

  • Use the t-distribution for the calculation
  • create joint alpha

Hypothesis

  • a random sample

  • Samples follow a normal distribution (or are large enough to bypass this assumption: 10 per category) => the means of these samples follow a t-dist

  • description

  • With significance {alpha}, the population mean difference of the (second_team - first_team) attribute is estimated to be between {data.interval[0]} and {data.interval[1]}

  • if all values are above 0, good: the difference is significant