def compute_skew(arr)
type - description
Utils
- description
def compute_kurtosis(arr, residuals=None)
list|array-like - description
Utils
- description
def compute_aic_bic(dfr: int, n: int, llh: float, method: str = "basic")
- It adds a penalty that increases the error when including additional terms. The lower the AIC, the better the model.
- aic and bic in python - medium.com/analytics-vidhya
dfr: int - nb_predictors (not including the intercept)
n: int - nb of observations
llh: float - log likelihood
Question: what about mixed models?
- aic
self, y_true, y_pred
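A minimal sketch of the usual AIC/BIC formulas computed from a log-likelihood. The function name `aic_bic_sketch` is hypothetical, and the parameter count `k = dfr + 1` (predictors plus intercept) is an assumption; the library's "basic" method may count parameters differently.

```python
import math


def aic_bic_sketch(dfr, n, llh):
    """Return (AIC, BIC) for a model with log-likelihood llh.

    Assumption: k = dfr + 1 estimated parameters (predictors + intercept).
    """
    k = dfr + 1
    aic = 2 * k - 2 * llh            # penalty grows with each extra term
    bic = k * math.log(n) - 2 * llh  # heavier penalty once n exceeds ~7
    return aic, bic


aic, bic = aic_bic_sketch(dfr=2, n=100, llh=-120.0)
print(aic, bic)
```

For both criteria, lower is better when comparing models fitted on the same data.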
class PredictionMetrics()
def compute_log_likelihood(std_eval: float = None,
min_tol: float = True)
std_eval: float, optional - (ignored if self.binary=True). Defaults to None.
debug: bool, optional - description. Defaults to False.
min_tol: float, optional - (ignored if self.binary=False). Defaults to None.
- description
y_hat: list,
nb_param: int,
alpha: float = None)
check if the mean is equal across many samples
Args
y (list): array-like of 1 dim
y_hat (list): array-like of 1 dim
nb_param (int): number of parameters in the regression (including the intercept). ex: for 6 independent variables, nb_param=7
alpha (float, optional): description. Defaults to COMMON_ALPHA_FOR_HYPH_TEST.
Hypothesis
H0: β1 = β2 = ... = βk-1 = 0; k=nb_param
H1: βj ≠ 0, for at least one value of j
- each sample is
- simple random
- normal
- independent from others
- same variance
- caution: use the Levene test (more robust than Fisher or Bartlett against non-normal data) (https://fr.wikipedia.org/wiki/Test_de_Bartlett)
Fisher test
- The F Distribution is also called the Snedecor’s F, Fisher’s F or the Fisher–Snedecor distribution
- f_oneway - docs.scipy.org/doc
- anova and f test - blog.minitab.com
- f-test-reg - facweb.cs.depaul.edu/sjost
- (RegressionFisherTestData)
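The overall regression F-test described above can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the name `regression_f_test_sketch` and the returned tuple are hypothetical, and the decomposition assumes an ordinary least squares fit with an intercept.

```python
import numpy as np
from scipy import stats


def regression_f_test_sketch(y, y_hat, nb_param, alpha=0.05):
    """Test H0: all slopes are 0 against H1: at least one slope is non-zero."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    n = y.size
    sse = np.sum((y - y_hat) ** 2)         # residual sum of squares
    ssr = np.sum((y_hat - y.mean()) ** 2)  # explained sum of squares
    dfn, dfd = nb_param - 1, n - nb_param  # nb_param includes the intercept
    f_stat = (ssr / dfn) / (sse / dfd)
    p_value = stats.f.sf(f_stat, dfn, dfd)  # right tail only: F is positive
    return f_stat, p_value, bool(p_value < alpha)


# Usage on a noisy straight line
x = np.arange(20.0)
y = 2.0 * x + 1.0 + 0.5 * (-1.0) ** np.arange(20)
slope, intercept = np.polyfit(x, y, deg=1)
f_stat, p_value, reject = regression_f_test_sketch(y, slope * x + intercept, nb_param=2)
print(f_stat, p_value, reject)
```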
def compute_logit_regression_results(crd: RegressionResultData,
debug: bool = False)
crd: RegressionResultData - description
debug: bool, optional - description. Defaults to False.
Info
- description
Author: Susan Li source: LogisticRegressionImplementation.ipynb - github.com/aihubprojects
- refactor output (last lines)
- use "alternative" instead of "tail"
- use kwargs format while calling functions
- reorder fcts attributes
- What does a negative R squared mean?:
- by my definition it lies between 0 and 1 because it is a sum, but is that wrong?? qastack.fr
class ComputeRegression()
def fit(X, y, nb_iter: float = None, learning_rate: float = None)
X: 2-dim array - list of columns (including slope) (n,nb_params)
y: 1-dim array - observations (n,)
alpha: type, optional - description. Defaults to None.
debug: bool, optional - description. Defaults to False.
- description
- description
We know why Student's t is useful. What about chi-squared? Do we know Fisher? Yes, F.
- add a fct to predict
- pay attention to extrapolation (unseen data) vs interpolation
- another for the curve showing the std
- the interval should be narrower when X reaches the sample mean
- a good list of intel/reminder about the regression here - sites.ualberta.ca - pdf
def ME_Normal_dist(sample: list, alpha=None, debug=False)
estimate a normal distribution from a sample
- check if normal:
- sns.distplot(data.X)
- check if qq-plot is linear en.wikipedia.org
::from statsmodels.graphics.gofplots import qqplot
::from matplotlib import pyplot
::qqplot(sample, line='s')
::pyplot.show()
X = m + N(0,s**2)
check normal hypothesis: machinelearningmastery
- you may need data over 1000 samples to get
def ME_Regression(x: list,
y: list,
degre: int,
alpha: float = 0.05,
nb_iter: int = 100000,
learning_rate: float = 0.1)
estimate a regression model from two samples
- predict Y conditional on X assuming that Y = pr[0] + pr[1]*X + pr[2]*X^2 + pr[3]*X^3 + N(0,s**2)
- Y is a dependent variable
- x, s are independent ones => predictors of the dependent variable
- If there is a time stamp of measures (or paired data), please add them as independent variables pr[0] + var_exp_1*G +
- sns.scatterplot(X,Y)
- Y = pr[0] + pr[1]*X + pr[2]*X^2 + pr[3]*X^3 + err
- err ~~> N(0,s**2)
- variance(error)==s**2 is the same across the data
- var(Y|X)==s**2 ; E(Y|X) = pr[0] + pr[1]*X + pr[2]*X^2 + pr[3]*X^3
- pr[i] constant
- pr[i] not null => I add a hypothesis test (to reject the null H0: coeff==0 against H1: coeff!=0), not a confidence interval (to check that 0 is not in it)
- each pr[i] has a mean and a std based on a normal distribution
- Y too =>
- Mean(Y) = y_hat = pr_h[0] + pr_h[1]*X + pr_h[2]*X^2 + pr_h[3]*X^3
- Some models can predict quantile(Y, 95%), but I will just add std(y_hat) later. Hmm, isn't that s?
- pr[i], s**2
- you may need data over 1000 samples to get
Others: Don't forget about the errors! Predictions carry uncertainty => [poorer fitted model => larger uncertainty]
def ME_multiple_regression(X: list,
y: list,
alpha: float = 0.05,
nb_iter: int = 100000,
learning_rate: float = 0.1)
list - description
list - description
bool, optional - description. Defaults to False.
type, optional - description. Defaults to COMMON_ALPHA_FOR_HYPH_TEST.
estimate a regression model from two samples
- predict Y conditional on X, B, G, ... assuming that Y = pr[0] + pr[1]*X + pr[2]*B + pr[3]*G + N(0,s**2)
- Y is a dependent variable
- x, B, G, ..., s are independent ones => predictors of the dependent variable
- If there is a time stamp of measures (or paired data), please add them as independent variables pr[0] + pr[4]*T1 +pr[5]*T2 + => The correlation of the repeated measures needs to be taken into account, and time since administration needs to be added to the model as an independent variable.
Questions of interest
- Are you interested in establishing a relationship?
- Are you interested in which predictors are driving that relationship?
- sns.scatterplot(X[i],y) for i in range(len(X))
- check for Form_linear_or_not;Direction_pos_or_neg;Strength_of_the_collinearity;Outliers
- Y = pr[0] + pr[1]*X + pr[2]*B + pr[3]*G + err
- err ~~> N(0,s**2)
- variance(error)==s**2 is the same across the data
- var(Y|X)==s**2 ; E(Y|X) = pr[0] + pr[1]*X + pr[2]*B + pr[3]*G
- pr[i] constant
- pr[i] not null => I add a hypothesis test (to reject the null H0: coeff==0 against H1: coeff!=0), not a confidence interval (to check that 0 is not in it)
- no collinearity (a.k.a. multicollinearity) between predictors
- a correlation will be computed
- Anyway, it changes neither the predictive power nor the efficiency of the model
- Also, I guess AIC selection removes one of them, right?
- But the information about the coefficients is unreliable because the predictors are redundant
- Regression trees can handle correlated data well
- each pr[i] has a mean and a std based on a normal distribution
- Y too =>
- Mean(Y) = y_hat = pr_h[0] + pr_h[1]*X + pr_h[2]*B + pr_h[3]*G
- Some models can predict quantile(Y, 95%), but I will just add std(y_hat) later. Hmm, isn't that s?
- pr[i], s**2
- you may need data over 1000 samples to get
Others: Don't forget about the errors! Predictions carry uncertainty => [poorer fitted model => larger uncertainty]
- description
- description
def clear_list(L: list) -> ndarray
remove nan from a list
L: list - a 1-dim array (n,). In any case, the data will be flattened.
What about handling missing values properly?!
- weight shit
- Anyway, it would be good to know how removing missing values affects the distribution of L
1-dim array: array of shape (n,)
A = np.array([ [1,3], [4,3], [5,3], [7,np.nan] ])
y = np.array([6,np.nan,3,2])
A1 = clear_list(A)
y1 = clear_list(y)
print("A1: ",A1)
- array([1, 3, 4, 3, 5, 3])
print("y: ",y1)
- array([6. 3. 2.])
def clear_list_pair(L1, L2) -> Tuple[ndarray, ndarray]
remove nan values (remove observation data containing nan value in L1 or L2) from 2 lists
L1: list - a 1-dim array (n,). In any case, the data will be flattened.
L2: list - a 1-dim array (n,). In any case, the data will be flattened.
What about handling missing values properly?!
- weight shit
- Anyway, it would be good to know how removing missing values affects the distribution of L
L1 and L2 have different size: lists must be of the same size
1-dim array: L1 of shape(n,) 1-dim array: L2 of shape(n,)
y1 = np.array([4, 8,np.nan,2])
y2 = np.array([6,np.nan,36,9])
y1,y2 = clear_list_pair(y1, y2)
print("y1: ",y1)
- array([4, 2])
print("y2: ",y2)
- array([6. 9.])
def clear_mat_vec(A, y) -> Tuple[ndarray, ndarray]
Remove nan values (remove every observation containing a nan value in A or y) from a matrix and a corresponding vector
A : 2-dimensional array (n,p) y: 1-dimensional array (n,)
What about handling missing values properly?!
- weight shit
- Anyway, it would be good to know how removing missing values affects the distribution of L
L1 and L2 have different size: lists must be of the same size
1-dim array: L1 of shape(n,)
1-dim array: L2 of shape(n,)
A = np.array([ [1,3], [4,3], [5,3], [7,np.nan] ])
y = np.array([6,np.nan,3,2])
A1,y1 = clear_mat_vec(A,y)
print("A1: ",A1)
A1: [[1. 3.] [5. 3.]]
print("y: ",y1)
y: [6. 3.]
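The row-wise removal described for `clear_mat_vec` can be sketched with a boolean mask. The name `clear_mat_vec_sketch` is hypothetical and this is only one plausible way to obtain the documented output, not necessarily how the library does it.

```python
import numpy as np


def clear_mat_vec_sketch(A, y):
    """Drop every observation (row) that has a nan in A or in y."""
    A = np.asarray(A, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    if A.shape[0] != y.shape[0]:
        raise ValueError("A and y must have the same number of observations")
    # Keep a row only if all of its predictors and its target are present
    keep = ~np.isnan(A).any(axis=1) & ~np.isnan(y)
    return A[keep], y[keep]


A = np.array([[1, 3], [4, 3], [5, 3], [7, np.nan]])
y = np.array([6, np.nan, 3, 2])
A1, y1 = clear_mat_vec_sketch(A, y)
print(A1, y1)
```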
def estimate_std(sample)
Unlike the plain std, this divides by (n-1), which corresponds to the std estimator used in the t-test
def get_p_value_from_tail(prob, tail, debug=False)
get p value based on cdf and tail. If tail=Tails.middle, the distribution is assumed symmetric because we double the tail probability
if tail
- right: return P(N > Z) = 1 - F(Z) = 1 - prob
- left: return P(N < Z) = F(Z) = prob
- middle: return P(N < -|Z|) + P(N > |Z|) => return 2*P(N > |Z|)
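A minimal sketch of the tail logic above, using plain strings instead of the library's `Tails` values (the string labels and the name `p_value_from_cdf_sketch` are assumptions made for illustration).

```python
from statistics import NormalDist


def p_value_from_cdf_sketch(prob, tail):
    """Turn a cdf value prob = F(Z) into a p-value for the given tail."""
    if tail == "right":
        return 1.0 - prob
    if tail == "left":
        return prob
    if tail == "middle":
        # Symmetric distribution: P(N > |Z|) is the smaller of the two tails
        return 2.0 * min(prob, 1.0 - prob)
    raise ValueError("tail must be 'right', 'left' or 'middle'")


prob = NormalDist().cdf(1.96)  # F(Z) for a standard normal
print(p_value_from_cdf_sketch(prob, "middle"))
```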
def get_p_value_z_test(Z: float, tail: str, debug=False)
get p value based on normal distribution N(0, 1)
if tail
- right: return P(N > Z)
- left: return P(N < Z)
- middle: return P(N < -|Z|) + P(N > |Z|) => return 2*P(N > |Z|)
def get_p_value_t_test(Z: float, ddl, tail: str, debug: bool = False)
get p value based on student distribution T(df=ddl) with ddl degrees of freedom
if tail
- right: return P(T > Z)
- left: return P(T < Z)
- middle: return P(T < -|Z|) + P(T > |Z|) => return 2*P(T > |Z|)
def get_p_value_f_test(Z: float, dfn: int, dfd: int, debug: bool = False)
get p value based on Fisher distribution F(dfn, dfd) with dfn and dfd degrees of freedom
- F-distribution - wiki
- tail is right because Fisher is positive
Z: float - description
dfn: int - description
dfd: int - description
debug: bool, optional - description. Defaults to False.
- description
- description
def get_p_value(Z: float, tail: str, test: str, ddl: int = None, debug=False)
get p value based on
- (if test=="t_test") student distribution T(df=ddl) with ddl degrees of freedom
- (if test=="z_test") normal distribution N(0, 1)
- (if test=="f_test") Fisher distribution F(ddl[0], ddl[1])
if tail
- right: return P(T > Z)
- left: return P(T < Z)
- middle: return P(T < -|Z|) + P(T > |Z|) => return 2*P(T > |Z|)
class Confidence_data()
int or tuple
class Hypothesis_data()
prior value to check against
right, left,middle
int or tuple
if H0 is rejected
class RegressionFisherTestData()
If F is large
def check_sample_normality(residuals: list, debug=False, alpha=None)
check if residuals is like a normal distribution
- test_implemented
residuals: list - list of float or array-like (will be flattened)
debug: bool, optional - description. Defaults to False.
- if all tests passed
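The source does not say which normality tests are implemented, so the following is only an illustration using the Shapiro-Wilk test from SciPy. The name `looks_normal_sketch` and the default `alpha=0.05` are assumptions.

```python
import numpy as np
from scipy import stats


def looks_normal_sketch(residuals, alpha=0.05):
    """Return (passed, p_value); H0 of Shapiro-Wilk is 'the data is normal'."""
    data = np.asarray(residuals, dtype=float).ravel()  # flatten, as documented
    data = data[~np.isnan(data)]
    _, p_value = stats.shapiro(data)
    # Failing to reject H0 is treated as "compatible with normality"
    return bool(p_value >= alpha), p_value


normal_like = stats.norm.ppf(np.linspace(0.01, 0.99, 50))
skewed = np.exp(np.linspace(0, 5, 50))
print(looks_normal_sketch(normal_like)[0], looks_normal_sketch(skewed)[0])
```

Passing such a test does not prove normality; it only means the sample gives no strong evidence against it.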
def check_equal_var(*samples, alpha=COMMON_ALPHA_FOR_HYPH_TEST)
alpha: type, optional - description. Defaults to COMMON_ALPHA_FOR_HYPH_TEST.
Utils
- description
Validation tests (assumption checks run before launching another test) that depend on tests I wrote myself: putting them in utils could create a circular import
def check_residuals_centered(residuals: list, alpha=None)
check if a list is centered (whether the mean == 0 at a significance level of 0.05)
list - list or array-like
- description
def check_coefficients_non_zero(list_coeffs: list,
list_coeff_std: list,
nb_obs: int,
compute non-zero tests for each coefficient
- test
- for each coefficient
- H0: coeff==0
- H1: coeff!=0
- if the test passes (H0 is rejected), the coefficient is away from 0, return = True
list_coeffs: list - list of values
list_coeff_std: list - list of std; the two lists should have the same length
- HypothesisValidationData(pass_non_zero_test_bool,pass_non_zero_test)
- testPassed (bool)
- obj (list) list of booleans (for each value, True if H0 is rejected)
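The per-coefficient test above can be sketched as a two-sided t-test on `coeff / std`. The name `coefficients_non_zero_sketch` is hypothetical, and the degrees of freedom (`nb_obs` minus the number of coefficients) are an assumption, since the full signature is not shown in the source.

```python
import numpy as np
from scipy import stats


def coefficients_non_zero_sketch(list_coeffs, list_coeff_std, nb_obs, alpha=0.05):
    """For each coefficient, True if H0: coeff == 0 is rejected."""
    coeffs = np.asarray(list_coeffs, dtype=float)
    stds = np.asarray(list_coeff_std, dtype=float)
    if coeffs.shape != stds.shape:
        raise ValueError("the two lists should have the same length")
    ddl = nb_obs - coeffs.size                # assumed degrees of freedom
    t_stats = coeffs / stds                   # distance from 0 in std units
    p_values = 2.0 * stats.t.sf(np.abs(t_stats), df=ddl)  # two-sided
    return [bool(p < alpha) for p in p_values]


flags = coefficients_non_zero_sketch([2.0, 0.01], [0.1, 0.5], nb_obs=50)
print(flags)
```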
def check_equal_mean(*samples, alpha=None)
check if the mean is the same across samples
Hypothesis
H0: mean1 = mean2 = mean3 = ....
H1: at least one is different
- The samples are independent.
- Each sample is from a normally distributed population.
- The population standard deviations of the groups are all equal. This property is known as homoscedasticity.
- *samples (list): one or many lists
Fisher test
- The F Distribution is also called the Snedecor’s F, Fisher’s F or the Fisher–Snedecor distribution
- F (float)
- p_value (float)
Some defs
- parameter: A quantifiable characteristic of a population
- confidence interval: range of reasonable values for the parameter
def IC_PROPORTION_ONE(sample_size: int,
parameter: float,
confidence: float = None,
method: str = None)
Confidence_interval(ONE PROPORTION):Confidence interval calculus after a statistic test
sample_size: int: sample size (more than 10 to use this method)
parameter: float: the measurement on the sample
confidence: float: confidence level (between 0 and 1). The greater the confidence, the wider the interval
method: str: either "classic" (default) or "conservative".
how many men in the entire population with a con ?
a form filled in by 300 people shows that there are only 120 men => p = (120/300); N=300
the sample has more than 10 observations in each category => the normal approximation applies (this rests on the Central Limit Theorem rather than the "Law of Large Numbers")
the sample proportion comes from data that is considered a simple random sample
let P: the real proportion in the population
let S: size of each sample == nb of observations per sample
For many samples, we calculate one proportion per sample: ex: for N samples of size S => N proportion values
(p - P) / sqrt( p*(1-p)/S ) approximately follows a standard normal distribution
For a given population and a parameter P to find, if we repeated this study many times, each producing a new sample (of the same size {res.sample_size==S}) from which a {res.confidence} confidence interval is computed, then {res.confidence} of the resulting confidence intervals would be expected to contain the true value P
If the entire interval verifies a property, then it is reasonable to say that the parameter verifies that property
with a {res.confidence} confidence, we estimate that the population proportion who are men is between {res.left_tail} and {res.right_tail}
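The worked example above (120 men out of 300) can be reproduced with a short sketch. The name `ci_one_proportion_sketch` is hypothetical; the "conservative" branch assumes the worst-case standard error at p = 0.5, which may differ from the library's exact rule.

```python
import math
from statistics import NormalDist


def ci_one_proportion_sketch(sample_size, parameter, confidence=0.95, method="classic"):
    """Normal-approximation confidence interval for one proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    if method == "conservative":
        se = 0.5 / math.sqrt(sample_size)           # assumed worst case p = 0.5
    else:
        se = math.sqrt(parameter * (1 - parameter) / sample_size)
    moe = z * se                                    # margin of error
    return parameter - moe, parameter + moe


left, right = ci_one_proportion_sketch(300, 120 / 300, confidence=0.95)
print(left, right)
```

With these inputs the interval is roughly 0.345 to 0.455, and the conservative variant is slightly wider.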
def IC_MEAN_ONE(sample: list, t_etoile=None, confidence: float = None)
Estimate_population_mean(ONE MEAN): We need the spread (std): We will use an estimation
Data - confidence:.. - sample: value...
- Use t-distribution to calculate few
- Samples follow a normal (or large enough to bypass this assumption) => means of these sample follow a t-dist
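A minimal sketch of the one-mean t-interval, assuming the standard error is estimated with `ddof=1` and `n - 1` degrees of freedom. The name `ci_one_mean_sketch` is hypothetical and the `t_etoile` override of the documented function is not modelled here.

```python
import math

import numpy as np
from scipy import stats


def ci_one_mean_sketch(sample, confidence=0.95):
    """t-based confidence interval for a population mean."""
    data = np.asarray(sample, dtype=float)
    n = data.size
    mean = data.mean()
    se = data.std(ddof=1) / math.sqrt(n)                 # estimated std error
    t_star = stats.t.ppf(0.5 + confidence / 2, df=n - 1)  # critical value
    return mean - t_star * se, mean + t_star * se


low, high = ci_one_mean_sketch([10, 12, 11, 13, 14])
print(low, high)
```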
def IC_PROPORTION_TWO(p1, p2, N1, N2, confidence: float = None)
Difference_population_proportion(TWO PROPORTIONS): We have estimated a parameter p on two populations (1, 2). How to estimate p1-p2?
Method - create joint confidence interval
- Comparison
- two independent random samples
- large enough sample sizes : 10 per category (ex 10 yes, 10 no)
def IC_MEAN_TWO_PAIR(sample1,
confidence: float = None)
Difference_population_means_for_paired_data(TWO MEANS FOR PAIRED DATA): We have estimated a parameter p on two populations (1, 2). How to estimate p1-p2?
What is paired data?
- measurements taken on individuals (people, homes, any object)
- technicality:
- When in a dataset (len = n) there is a column df.a whose values each repeat exactly twice (=> df.a.nunique = n/2)
- we can do a plot(x=feature1, y=feature2)
- examples
- Each home needs a cabinet quote from two suppliers => we want to know if there is an average difference in nb_quotes between these two suppliers
- In a blind taste test to compare two new juice flavors, grape and apple, consumers were given a sample of each flavor and the results will be used to estimate the percentage of all such consumers who prefer the grape flavor to the apple flavor.
- Construction
- It is like,
- checking if a feature's magnitude changes when going from one category to another, where each pair spans the two categories
- Example_contexte
- having a dataframe df, with 4 col [name, score, equipe, role]
- equipe: "1" or "2"
- role: df.role.nunique = 11 => len(df)==22
- Now there is a battle: for a "same role" fight, which team is the best?
- Example_question
- if education levels are generally equal -> mean difference is 0
- Is there a mean difference between the education levels of twins?
- if education levels are unequal -> mean difference is not 0
- So, look for 0 in the range of reasonable values
We need the spread (std): We will use an estimation
- IC_MEAN_ONE(confidence, sample1 - sample2)
Data - confidence:.. - Sample1: list: values... - Sample2: list: (same len) values...
- Use t-distribution to calculate few
- create joint confidence interval
- a random sample of identical twin sets
- Samples follow a normal distribution (or are large enough to bypass this assumption, ex: 20 twins) => the means of these samples follow a t-dist
- With {cf} confidence, the population mean difference of the (second_team - first_team) attribute is estimated to be between {data.interval[0]} and {data.interval[1]}
- if all values are above 0, the difference is significant
def IC_MEAN_TWO_NOTPAIR(sample1,
confidence: float = None)
Difference_population_means_for_nonpaired_data(TWO MEANS FOR NON-PAIRED DATA): We have estimated a parameter p on two populations (1, 2). How to estimate p1-p2?
- It is like,
- checking if a feature's magnitude changes when going from one category to another
- Example_contexte
- having a dataframe df, with 4 col [name, score, equipe, role]
- equipe: "1" or "2"
- role: df.role.nunique = 11 => len(df)==22
- Now there is a battle: for a "same role" fight, which team is the best?
- Example_question
- if education levels are generally equal -> mean difference is 0
- Is there a mean difference between education levels based on gender?
- if education levels are unequal -> mean difference is not 0
- So, look for 0 in the range of reasonable values
We need the spread (std): We will use an estimation
Args
- confidence:..
- Sample1: list: values...
- Sample2: list: (same len) values...
- pool: default False
- True
- if we assume that our populations' variances are equal
- we use a t-distribution with (N1+N2-1) ddl (note: the textbook pooled two-sample t-test uses N1+N2-2)
- False
- if we assume that our populations' variances are not equal
- we use a t-distribution with min(N1, N2)-1 ddl
- Use t-distribution to calculate few
- create joint confidence interval
- a random sample
- Samples follow a normal distribution (or are large enough to bypass this assumption: 10 per category) => the means of these samples follow a t-dist
- With {cf} confidence, the population mean difference of the (second_team - first_team) attribute is estimated to be between {data.interval[0]} and {data.interval[1]}
- if all values are above 0, the difference is significant
- refactor output (last lines)
- use "alternative" instead of "tail"
- use kwargs format while calling functions
- reorder fcts attributes
def get_min_sample(moe: float, p=None, method=None, cf: float = None)
Get_min_sample:get the minimum of sample_size to use for a
- input
- cf: confidence (or coverage_probability): between 0 and 1
- moe: margin of error
- method (optional): "conservative" (default)
- p: not used if method=="conservative"
Hyp
- preferably the population follows a normal distribution; otherwise use a large sample (>10)
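A sketch of the minimum sample size for a target margin of error on a proportion, solving `moe = z * sqrt(p*(1-p)/n)` for `n`. The name `min_sample_sketch` is hypothetical, and treating "conservative" as p = 0.5 is an assumption about the library's rule.

```python
import math
from statistics import NormalDist


def min_sample_sketch(moe, p=None, method="conservative", cf=0.95):
    """Smallest n whose margin of error does not exceed moe."""
    z = NormalDist().inv_cdf(0.5 + cf / 2)
    if method == "conservative" or p is None:
        p = 0.5  # worst case: p*(1-p) is maximal at 0.5
    return math.ceil(z ** 2 * p * (1 - p) / moe ** 2)


print(min_sample_sketch(moe=0.03))                          # conservative
print(min_sample_sketch(moe=0.03, p=0.2, method="classic"))  # uses p
```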
def CIE_ONE_PROPORTION(proportion, n, method, cf: float = None)
Get_interval_simple: get a proportion of an attribute value (male gender, ) in a population based on a sample (no sign pb)
- cf: confidence_level (or coverage_probability)
- proportion: measurement
- n: number of observations == len(sample)
- method: "classic" or "conservative"
- preferably the population follows a normal distribution; otherwise use a large sample (>10)
def CIE_PROPORTION_TWO(p1, p2, n1, n2, cf: float = None)
Get_interval_diff: get the diff of mean between 2 populations based on a sample from each population (p1-p2)
- cf: confidence_level (or coverage_probability)
- p1: mean of liste1
- p2: mean of liste2
- n1: len(liste1)
- n2: len(liste2)
- preferably the populations follow a normal distribution; otherwise use large samples (>10)
def CIE_MEAN_ONE(n, mean_dist, std_sample, t_etoile=None, cf: float = None)
Get_interval_mean:get the mean of a population from a sample (no sign pb)
- cf: confidence level (or coverage_probability)
- n: number of observations == len(sample)
- mean_dist: the mean measured on the sample = mean(sample)
- std_sample: std of the sample ==std(sample)
- t_etoile: if set, cf is ignored.
- preferably the population follows a normal distribution; otherwise use a large sample (>10)
- Alternative to normality: Wilcoxon Signed Rank Test
- read here
cf: float = None)
Get_interval_diff_mean: get the diff in mean of two populations(taking their samples) (sign(diff_mean) => no sign pb)
- cf: confidence level (or coverage_probability)
- N1: number of observations == len(sample1)
- N2: number of observations == len(sample2)
- mean_dist: the mean measured on the sample = mean(sample)
- std_sample_1: std of the sample ==std(sample1)
- std_sample_2: std of the sample ==std(sample2)
- t_etoile: if set, cf is ignored.
- pool: default False
- True
- if we assume that our populations' variances are equal
- we use a t-distribution with (N1+N2-1) ddl (note: the textbook pooled two-sample t-test uses N1+N2-2)
- False
- if we assume that our populations' variances are not equal
- we use a t-distribution with min(N1, N2)-1 ddl
- both populations follow a normal distribution, or use large samples (>10)
- the populations are independent from each other
- use simple random samples
- for pool=True, variances are assumed to be the same
- to test that, you can
- use the Levene test, which is more robust than Fisher or Bartlett against non-normal data
- H0: variances are equal; H1: they are not
- scipy.stats.levene(liste1,liste2, center='mean')
- solution = "no equality" if p-value<0.05 else "equality"
- or check if the IQRs are the same
- IQR = quantile(75%) - quantile(25%)
- scipy.stats.ttest_ind(liste1,liste2, equal_var = False | True)
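The two SciPy calls mentioned above can be chained: run the Levene test first, then let its outcome pick `equal_var` for the t-test. This is a sketch of that idea only; the helper name `welch_or_pooled_sketch` and the 0.05 threshold are assumptions.

```python
from scipy import stats


def welch_or_pooled_sketch(liste1, liste2, alpha=0.05):
    """Choose pooled vs Welch t-test from a Levene equal-variance check."""
    _, p_levene = stats.levene(liste1, liste2, center="mean")
    equal_var = bool(p_levene >= alpha)  # H0 (equal variances) not rejected
    t_stat, p_value = stats.ttest_ind(liste1, liste2, equal_var=equal_var)
    return equal_var, t_stat, p_value


a = list(range(1, 11))
b = [v + 5 for v in a]   # same spread, shifted mean
c = [v * 10 for v in a]  # much larger spread
print(welch_or_pooled_sketch(a, b)[0], welch_or_pooled_sketch(a, c)[0])
```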
Eqvl_pointWise estimation
- Assume diff_mean = 82
- Result: diff_mean in CI = [77.33, 87.63]
- If we test H0: p=80 vs H1: p>80, we would fail to reject the null because H1 is not supported here
- As a matter of fact, there are some values in the CI below 80, which is not compatible with H1 => the test does not give enough evidence to reject H0
- read here
- In a test, H0 is the pessimistic hypothesis
- so enough evidence (p<0.05) is needed to reject it
def HPE_FROM_P_VALUE(tail: str = None,
tail: str, optional - "middle" or "left" or "right"
p_value: type, optional - description. Defaults to None.
t_stat: type, optional - description. Defaults to None.
p_hat: type, optional - description. Defaults to None.
p0: type, optional - description. Defaults to None.
std_stat_eval: type, optional - description. Defaults to None.
alpha: type, optional - description. Defaults to None.
test: str, optional - description. Defaults to "z_test".
ddl: int, optional - description. Defaults to 0.
onetail: bool, optional - if tail="middle", return one_tail_cf_p_value instead of the 2tail_2cf_p_value. Defaults to False.
- description
def HPE_PROPORTION_ONE(alpha, p0, proportion, n, tail=Tails.right)
check a proportion of an attribute value (male gender, ) in a population based on a sample (no sign pb) using a Z-statistic
- alpha: p_value_max: significance level
- p0: proportion under the null
- proportion: measurement
- n: number of observations == len(sample)
- tail:
- right: check if p>p0
- left: check if p<p0
- middle: check if p==p0
Hyp
- simple random sample
- large sample (np>10)
- H0: proportion = p0
- H1:
- tail==right => proportion > p0
- tail==left => proportion < p0
- tail==middle => proportion != p0
- use a normal distribution (Z-statistic)
Result (ex:tail=right)
- if reject==True
- There is sufficient evidence to conclude that the population proportion of {....} is greater than p0
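A minimal sketch of the one-proportion Z-test described above. It uses plain strings for the tail instead of the library's `Tails` values, and the name `z_test_one_proportion_sketch` is hypothetical; the standard error is computed under H0 (with p0), which is the usual choice.

```python
import math
from statistics import NormalDist


def z_test_one_proportion_sketch(alpha, p0, proportion, n, tail="right"):
    """Return (z, p_value, reject) for H0: proportion = p0."""
    se = math.sqrt(p0 * (1 - p0) / n)  # standard error under the null
    z = (proportion - p0) / se
    cdf = NormalDist().cdf(z)
    if tail == "right":
        p_value = 1 - cdf
    elif tail == "left":
        p_value = cdf
    elif tail == "middle":
        p_value = 2 * min(cdf, 1 - cdf)
    else:
        raise ValueError("tail must be 'right', 'left' or 'middle'")
    return z, p_value, p_value < alpha


z, p_value, reject = z_test_one_proportion_sketch(0.05, p0=0.5, proportion=0.56, n=400)
print(z, p_value, reject)
```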
def HPE_PROPORTION_TW0(alpha, p1, p2, n1, n2, tail=Tails.middle, evcpp=False)
check the difference of proportions between 2 populations based on a sample from each population (p1-p2), using a Z-statistic (always used for difference of estimates).
there is also Fisher and chi-square
- alpha: level of significance
- p1: proportion of liste1
- p2: proportion of liste2
- n1: len(liste1)
- n2: len(liste2)
- evcpp: bool (default=False) (True -> Estimate of the variance of the combined population proportion)
- two independent samples
- two random samples
- large enough data
- H0: p1 - p2 = 0
- H1: p1 - p2 != 0
- use a normal distribution (Z-statistic)
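A sketch of the two-proportion Z-test for H0: p1 - p2 = 0. The `pooled` flag is a hypothetical stand-in for what `evcpp` appears to control (estimating the variance from the combined proportion); that mapping is an assumption, as is the function name.

```python
import math
from statistics import NormalDist


def z_test_two_proportions_sketch(alpha, p1, p2, n1, n2, pooled=True):
    """Two-sided test of H0: p1 - p2 = 0 against H1: p1 - p2 != 0."""
    if pooled:
        p = (p1 * n1 + p2 * n2) / (n1 + n2)  # combined proportion
        se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    else:
        se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha


z2, p_value2, reject2 = z_test_two_proportions_sketch(0.05, 0.6, 0.5, 200, 200)
print(z2, p_value2, reject2)
```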
def HPE_MEAN_ONE(alpha, p0, mean_dist, n, std_sample, tail=Tails.right)
get the mean of a population from a sample (no sign pb) using a T-statistic (always T for the mean, unless you are comparing a sample vs a population of known std)
- alpha:
- n: number of observations == len(sample)
- mean_dist: the mean measured on the sample = mean(sample)
- std_sample: std of the sample ==std(sample). You should use the unbiased estimate (ddof=1, i.e. divide by n-1)
- simple random sample
- preferably the population follows a normal distribution; otherwise use a large sample (>10)
- Alternative to normality: Wilcoxon Signed Rank Test
- read here
get the difference of mean between two paired lists (no sign pb) using a T-statistic (always T for the mean, unless you are comparing a sample vs a population of known std)
- alpha:
- mean_diff_sample: the mean measured on the sample = mean(sample)
- n: number of observations == len(sample) == n1 == n2
- std_diff_sample: std of the sample ==std(sample). You should use the unbiased estimate (ddof=1, i.e. divide by n-1)
- tail: default=Tails.middle to test the equality (mean_diff=0). But we can also test mean_diff>0 (right) or mean_diff<0 (left)
- simple random sample
- better when the difference of the samples (sample1 - sample2) follows a normal distribution; otherwise use a large sample (>10)
- std_diff_sample should be a good data-based estimate [use (n-1) instead of n]. example: np.std(sample1 - sample2, ddof=1) is better than ddof=0 (default)
- H0: p1 - p2 = 0
- H1:
- H1: p1 - p2 != 0 for(tail=middle)
- H1: p1 - p2 > 0 for(tail=right)
- H1: p1 - p2 < 0 for(tail=left)
check the diff in mean of two populations(taking their samples) (sign(diff_mean) => no sign pb)
- alpha:
- N1: number of observations == len(sample1)
- N2: number of observations == len(sample2)
- mean_dist: the mean measured on the sample = mean(sample)
- std_sample_1: std of the sample ==std(sample1)
- std_sample_2: std of the sample ==std(sample2)
- pool: default False
- True
- if we assume that our populations' variances are equal
- we use a t-distribution with (N1+N2-1) ddl (note: the textbook pooled two-sample t-test uses N1+N2-2)
- False
- if we assume that our populations' variances are not equal
- we use a t-distribution with min(N1, N2)-1 ddl
- both populations follow a normal distribution, or use large samples (>10)
- the populations are independent from each other
- use simple random samples
- for pool=True, variances must be the same
- to test that, you can
- use the Levene test, which is more robust than Fisher or Bartlett against non-normal data
::H0: variances are equal; H1: they are not
::scipy.stats.levene(liste1,liste2, center='mean')
::solution = "no equality" if p-value<0.05 else "equality"
- or check if the IQRs are the same
- IQR = quantile(75%) - quantile(25%)
- read here
def HPE_MEAN_MANY(*samples, alpha=None)
check if the mean is equal across many samples
Hypothesis H0: mean1 = mean2 = mean3 = .... H1: one is different
- each sample is
- simple random
- normal
- independent from others
- same variance
- if added, the "same variance test" should use the Levene test, which is apparently more robust than Fisher or Bartlett against non-normal data
Fisher test
- The F Distribution is also called the Snedecor’s F, Fisher’s F or the Fisher–Snedecor distribution
- F (float)
- p_value (float)
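The equal-means test above corresponds to a one-way ANOVA, which SciPy exposes as `scipy.stats.f_oneway`. The wrapper below is a sketch with a hypothetical name and return shape; whether the library wraps this exact call is not stated in the source.

```python
from scipy import stats


def equal_mean_sketch(*samples, alpha=0.05):
    """Return (F, p_value, reject) for H0: all sample means are equal."""
    f_stat, p_value = stats.f_oneway(*samples)
    return f_stat, p_value, bool(p_value < alpha)


g1 = [1, 2, 3, 4, 5]
g2 = [2, 3, 4, 5, 6]
g3 = [10, 11, 12, 13, 14]
print(equal_mean_sketch(g1, g2, g3))
```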
utils
- In a test, H0 is the pessimistic hypothesis
- so enough evidence (p<0.05) is needed to reject it
- this puts a small upper bound on the type 1 error (rejecting H0 when it is true)
Some defs
- parameter: A quantifiable characteristic of a population (baseline)
- alpha: level of significance = type1_error = proba(reject_null; when null is True)
def HP_PROPORTION_ONE(sample_size: int,
parameter: float,
p0: float,
alpha: float,
ONE PROPORTION:alpha calculus after a statistic test
sample_size: int: sample size (more than 10 to use this method)
parameter: float: the measurement on the sample
alpha: float: significance level (between 0 and 1). The greater the alpha, the narrower the matching (1 - alpha) confidence interval
method: str: either "classic" (default) or "conservative".
how many men in the entire population with a con ?
a form filled in by 300 people shows that there are only 120 men => p = (120/300); N=300
the sample has more than 10 observations in each category => the normal approximation applies (this rests on the Central Limit Theorem rather than the "Law of Large Numbers")
the sample proportion comes from data that is considered a simple random sample
let P: the real proportion in the population
let S: Size of each sample == nb of observations per sample
For many samples, we calculate proportions per sample: ex: for N samples of size S => N proportions values
(p - P) / sqrt( p*(1-p)/S ) approximately follows a standard normal distribution
For a given population and a parameter P to find, if we repeated this study many times, each producing a new sample (of the same size {res.sample_size==S}) from which a {res.alpha} alpha is computed, then {res.alpha} of the resulting alphas would be expected to contain the true value P
If the entire interval verifies a property, then it is reasonable to say that the parameter verifies that property
with a {res.alpha} alpha, we estimate that the population proportion who are men is between {res.left_tail} and {res.right_tail}
def HP_MEAN_ONE(p0: float, alpha: float, sample: list, symb=Tails.SUP_SYMB)
ONE MEAN: We need the spread (std): We will use an estimation
Data - alpha:.. - sample: value...
- Use t-distribution to calculate few
- Samples follow a normal (or large enough to bypass this assumption) => means of these sample follow a t-dist
def HP_PROPORTION_TWO(alpha, p1, p2, N1, N2, symb=Tails.NEQ_SYMB, evcpp=False)
TWO PROPORTIONS: We have estimated a parameter p on two populations (1, 2). How to estimate p1-p2?
Method
- create joint alpha
- evcpp: bool (default=False) (True -> Estimate of the variance of the combined population proportion)
- Comparison
- two independent random samples
- large enough sample sizes: 10 per category (ex 10 yes, 10 no)
def HP_MEAN_TWO_PAIR(alpha, sample1, sample2, symb=Tails.NEQ_SYMB)
TWO MEANS FOR PAIRED DATA: We have estimated a parameter p on two populations (1, 2). How to estimate p1-p2?
What is paired data:
- measurements taken on individuals (people, homes, any object)
- technicality:
- When in a dataset (len = n) there is a column df.a whose values each repeat exactly twice (=> df.a.nunique = n/2)
- we can do a plot(x=feature1, y=feature2)
- examples
- Each home needs a cabinet quote from two suppliers => we want to know if there is an average difference in nb_quotes between these two suppliers
- In a blind taste test to compare two new juice flavors, grape and apple, consumers were given a sample of each flavor and the results will be used to estimate the percentage of all such consumers who prefer the grape flavor to the apple flavor.
- Construction
- It is like,
- checking if a feature's magnitude changes when going from one category to another, where each pair spans the two categories
- Example_contexte
- having a dataframe df, with 4 col [name, score, equipe, role]
- equipe: "1" or "2"
- role: df.role.nunique = 11 => len(df)==22
- Now there is a battle: for a "same role" fight, which team is the best?
- Example_question
- if education levels are generally equal -> mean difference is 0
- Is there a mean difference between the education levels of twins?
- if education levels are unequal -> mean difference is not 0
- So, look for 0 in the range of reasonable values
We need the spread (std): We will use an estimation
- estimate_population_mean(alpha, sample1 - sample2)
Data - alpha:.. - Sample1: list: values... - Sample2: list: (same len) values...
- Use t-distribution to calculate few
- create joint alpha
a random sample of identical twin sets
Samples follow a normal distribution (or are large enough to bypass this assumption, ex: 20 twins) => the means of these samples follow a t-dist
With {alpha} alpha, the population mean difference of the (second_team - first_team) attribute is estimated to be between {data.interval[0]} and {data.interval[1]}
if all values are above 0, the difference is significant
TWO MEANS FOR NON-PAIRED DATA: We have estimated a parameter p on two populations (1, 2). How to check p1-p2 != 0?
- It is like,
- checking if a feature's magnitude changes when going from one category to another
- Example_contexte
- having a dataframe df, with 4 col [name, score, equipe, role]
- equipe: "1" or "2"
- role: df.role.nunique = 11 => len(df)==22
- Now there is a battle: for a "same role" fight, which team is the best?
- Example_question
- if education levels are generally equal -> mean difference is 0
- Is there a mean difference between education levels based on gender?
- if education levels are unequal -> mean difference is not 0
- So, look for 0 in the range of reasonable values
We need the spread (std): We will use an estimation
Data
- alpha:..
- Sample1: list: values...
- Sample2: list: (same len) values...
- pool: default False
- True
- if we assume that our populations' variances are equal
- we use a t-distribution with (N1+N2-1) ddl (note: the textbook pooled two-sample t-test uses N1+N2-2)
- False
- if we assume that our populations' variances are not equal
- we use a t-distribution with min(N1, N2)-1 ddl
- Use t-distribution to calculate few
- create joint alpha
a random sample
Samples follow a normal distribution (or are large enough to bypass this assumption: 10 per category) => the means of these samples follow a t-dist
With {alpha} alpha, the population mean difference of the (second_team - first_team) attribute is estimated to be between {data.interval[0]} and {data.interval[1]}
if all values are above 0, the difference is significant