Statistics.txt
Statistics is the reduction of data into numbers that express valuable information about the data-generating process.
Data Science is the transformation of data into knowledge with the help of computer science and the mathematical sciences.
Michael Jordan - "On Computational Thinking, Inferential Thinking and Data Science" - https://youtube.com/watch?v=bQ02K0kWKzg
"The rapid growth in the size and scope of datasets in science and technology has created a need for novel foundational perspectives on data analysis that blend the inferential and computational sciences. That classical perspectives from these fields are not adequate to address emerging problems in "Big Data" is apparent from their sharply divergent nature at an elementary level-in computer science, the growth of the number of data points is a source of "complexity" that must be tamed via algorithms or hardware, whereas in statistics, the growth of the number of data points is a source of "simplicity" in that inferences are generally stronger and asymptotic results can be invoked. On a formal level, the gap is made evident by the lack of a role for computational concepts such as "runtime" in core statistical theory and the lack of a role for statistical concepts such as "risk" in core computational theory."
Max Welling - "Are Machine Learning and Statistics Complementary" - https://www.ics.uci.edu/~welling/publications/papers/WhyMLneedsStatistics.pdf
selected papers and books on statistics - https://dropbox.com/sh/ff6xkunvb9emlc1/AAA3SCZx5kvdr1BlYq9ArEaka
selected papers and books on data science - https://dropbox.com/sh/ivbn5ldue427s5q/AABBOkS0aDRo0Optz3ZY96tta
[overview]
"What is statistics?" - http://blogs.sas.com/content/iml/2014/08/05/stiglers-seven-pillars-of-statistical-wisdom.html
"Aggregation: It sounds like an oxymoron that you can gain knowledge by discarding information, yet that is what happens when you replace a long list of numbers by a sum or mean. Every day the news media reports a summary of billions of stock market transactions by reporting a single a weighted average of stock prices: the Dow Jones Industrial Average. Statisticians aggregate, and policy makers and business leaders use these aggregated values to make complex decisions.
"The law of diminishing information: If 10 pieces of data are good, are 20 pieces twice as good? No, the value of additional information diminishes like the square root of the number of observations. The square root appears in formulas such as the standard error of the mean, which describes the probability that the mean of a sample will be close to the mean of a population."
"Likelihood: Some people say that statistics is "the science of uncertainty." One of the pillars of statistics is being able to confidently state how good a statistical estimate is. Hypothesis tests and p-values are examples of how statisticians use probability to carry out statistical inference."
"Intercomparisons: When analyzing data, statisticians usually make comparisons that are based on differences among the data. This is different than in some fields, where comparisons are made against some ideal "gold standard." Well-known analyses such as ANOVA and t-tests utilize this pillar."
"Regression and multivariate analysis: Children that are born to two extraordinarily tall parents tend to be shorter than their parents. Similarly, if both parents are shorter than average, the children tend to be taller than the parents. This is known as regression to the mean. Regression is the best known example of multivariate analysis, which also includes dimension-reduction techniques and latent factor models."
"Design: A pillar of statistics is the design of experiments, and—by extension—all data collection and planning that leads to good data. Included in this pillar is the idea that random assignment of subjects to design cells improves the analysis. This pillar is the basis for agricultural experiments and clinical trials, just to name two examples."
"Models and Residuals: This pillar enables you to examine shortcomings of a model by examining the difference between the observed data and the model. If the residuals have a systematic pattern, you can revise your model to explain the data better. You can continue this process until the residuals show no pattern. This pillar is used by statistical practitioners every time that they look at a diagnostic residual plot for a regression model."
[interesting papers]
Breiman - "Statistical Modeling: The Two Cultures"
Norvig - "Warning Signs in Experimental Design and Interpretation" [http://norvig.com/experiment-design.html]
Debrouwere, Goetghebeur - "The Statistical Crisis in Science" [http://lib.ugent.be/fulltxt/RUG01/002/304/385/RUG01-002304385_2016_0001_AC.pdf]
Ioannidis - "Why most published research findings are false" [http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124]
Goodman, Greenland - "Why Most Published Research Findings Are False: Problems in the Analysis"
Ioannidis - "Why Most Published Research Findings Are False: Author's Reply to Goodman and Greenland"
Moonesinghe, Khoury, Janssens - "Most published research findings are false - But a little replication goes a long way"
Leek, Jager - "Is most published research really false?"
Aitchison, Corradi, Latham - "Zipf’s Law Arises Naturally When There Are Underlying, Unobserved Variables" [http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005110]
[study]
course by Joe Blitzstein - https://youtube.com/playlist?list=PLCzY7wK5FzzPANgnZq5pIT3FOomCT1s36
course by Duke University - https://coursera.org/learn/bayesian/
course by Yandex (in Russian):
https://ru.coursera.org/specializations/machine-learning-data-analysis
course by Mail.ru (in Russian):
https://youtube.com/playlist?list=PLrCZzMib1e9p5F99rIOzugNgQP5KHHfK8
course by Computer Science Center (in Russian):
https://compscicenter.ru/courses/math-stat/2015-spring/
https://compscicenter.ru/courses/math-stat/2013-spring/
seminars in Yandex Academy (in Russian) - https://youtube.com/channel/UCeq6ZIlvC9SVsfhfKnSvM9w
"Myths of Data Science" by Alexey Natekin (in Russian) - https://youtube.com/watch?v=tEIkgAsYWb0
Roth, Hardt - "Rigorous Data Dredging - Theory and Tools for Adaptive Data Analysis"
[http://techtalks.tv/talks/rigorous-data-dredging-theory-and-tools-for-adaptive-data-analysis/62362/]
[theory]
4 views of statistics: Frequentist, Bayesian, Likelihood, Information-Theoretic - http://labstats.net/articles/overview.html
"In essence, Bayesian means probabilistic. The specific term exists because there are two approaches to probability. Bayesians think of it as a measure of belief, so that probability is subjective and refers to the future. Frequentists have a different view: they use probability to refer to past events - in this way it’s objective and doesn’t depend on one’s beliefs."
http://allendowney.blogspot.ru/2016/09/bayess-theorem-is-not-optional.html
difference between bayesian and frequentist expected loss - https://en.wikipedia.org/wiki/Loss_function#Expected_loss
example of frequentist (p-values) and bayesian (bayes factor) statistical hypothesis testing - https://en.wikipedia.org/wiki/Bayes_factor#Example
http://blog.efpsa.org/2014/11/17/bayesian-statistics-what-is-it-and-why-do-we-need-it-2/
http://blog.efpsa.org/2015/08/03/bayesian-statistics-why-and-how/
advantages of bayesian inference - http://bayesian-inference.com/advantagesbayesian
advantages of frequentist inference - http://bayesian-inference.com/advantagesfrequentist
frequentist vs bayesian vs machine learning - http://stats.stackexchange.com/a/73180
"Frequentism and Bayesianism: A Python-driven Primer" - http://arxiv.org/abs/1411.5018
http://nowozin.net/sebastian/blog/becoming-a-bayesian-part-1.html
http://nowozin.net/sebastian/blog/becoming-a-bayesian-part-2.html
http://nowozin.net/sebastian/blog/becoming-a-bayesian-part-3.html
http://jakevdp.github.io/blog/2014/03/11/frequentism-and-bayesianism-a-practical-intro/
http://jakevdp.github.io/blog/2014/06/06/frequentism-and-bayesianism-2-when-results-differ/
http://jakevdp.github.io/blog/2014/06/12/frequentism-and-bayesianism-3-confidence-credibility/
"Model-based statistics assumes that the observed data has been produced from a random distribution or probability model. The model usually involves some unknown parameters. Statistical inference aims to learn the parameters from the data. This might be an end in itself - if the parameters have interesting real world implications we wish to learn - or as part of a larger workflow such as prediction or decision making. Classical approaches to statistical inference are based on the probability (or probability density) of the observed data y0 given particular parameter values θ. This is known as the likelihood function, π(y0|θ). Since y0 is fixed this is a function of θ and so can be written L(θ). Approaches to inference involve optimising this, used in maximum likelihood methods, or exploring it, used in Bayesian methods.
A crucial implicit assumption of both approaches is that it’s possible and computationally inexpensive to numerically evaluate the likelihood function. As computing power has increased over the last few decades, there are an increasing number of interesting situations for which this assumption doesn’t hold. Instead models are available from which data can be simulated, but where the likelihood function is intractable, in that it cannot be numerically evaluated in a practical time.
In Bayesian approach to inference a probability distribution must be specified on the unknown parameters, usually through a density π(θ). This represents prior beliefs about the parameters before any data is observed. The aim is to learn the posterior beliefs resulting from updating the prior to incorporate the observations. Mathematically this is an application of conditional probability using Bayes theorem: the posterior is π(θ|y0)=kπ(θ)L(θ), where k is a constant of proportionality that is typically hard to calculate. A central aim of Bayesian inference is to produce methods which approximate useful properties of the posterior in a reasonable time."
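To make the quoted definitions concrete, here is a small sketch (assumed Python with numpy; the coin-flip model and uniform prior are illustrative assumptions, not taken from the source): it evaluates the likelihood L(θ) = π(y0|θ) for fixed Bernoulli data and approximates the posterior π(θ|y0) ∝ π(θ)L(θ) on a grid, so the constant of proportionality k is handled numerically rather than analytically.

    # Sketch: likelihood and grid-approximated posterior for a Bernoulli model.
    # Assumptions (not from the source): y0 is a fixed vector of coin flips,
    # and the prior on theta is uniform on [0, 1].
    import numpy as np

    y0 = np.array([1, 0, 1, 1, 0, 1, 1, 1])      # observed data, held fixed
    k_heads, n = y0.sum(), len(y0)

    theta = np.linspace(0.001, 0.999, 999)       # candidate parameter values
    d = theta[1] - theta[0]                      # grid spacing

    likelihood = theta**k_heads * (1 - theta)**(n - k_heads)   # L(theta) = P(y0 | theta)
    prior = np.ones_like(theta)                  # uniform prior density
    posterior = prior * likelihood
    posterior /= posterior.sum() * d             # numerical normalization (the constant k)

    print("maximum-likelihood estimate:", theta[np.argmax(likelihood)])
    print("posterior mean:", (theta * posterior).sum() * d)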
(E. T. Jaynes) "The traditional ‘frequentist’ methods which use only sampling distributions are usable and useful in many particularly simple, idealized problems; however, they represent the most proscribed special cases of probability theory, because they presuppose conditions (independent repetitions of a ‘random experiment’ but no relevant prior information) that are hardly ever met in real problems. This approach is quite inadequate for the current needs of science. In addition, frequentist methods provide no technical means to eliminate nuisance parameters or to take prior information into account, no way even to use all the information in the data when sufficient or ancillary statistics do not exist. Lacking the necessary theoretical principles, they force one to ‘choose a statistic’ from intuition rather than from probability theory, and then to invent ad hoc devices (such as unbiased estimators, confidence intervals, tail-area significance tests) not contained in the rules of probability theory. Each of these is usable within the small domain for which it was invented but, as Cox’s theorems guarantee, such arbitrary devices always generate inconsistencies or absurd results when applied to extreme cases.
All of these defects are corrected by use of Bayesian methods, which are adequate for what we might call ‘well-developed’ problems of inference. As Jeffreys demonstrated, they have a superb analytical apparatus, able to deal effortlessly with the technical problems on which frequentist methods fail. They determine the optimal estimators and algorithms automatically, while taking into account prior information and making proper allowance for nuisance parameters, and, being exact, they do not break down – but continue to yield reasonable results – in extreme cases. Therefore they enable us to solve problems of far greater complexity than can be discussed at all in frequentist terms. All this capability is contained already in the simple product and sum rules of probability theory interpreted as extended logic, with no need for – indeed, no room for – any ad hoc devices.
Before Bayesian methods can be used, a problem must be developed beyond the ‘exploratory phase’ to the point where it has enough structure to determine all the needed apparatus (a model, sample space, hypothesis space, prior probabilities, sampling distribution). Almost all scientific problems pass through an initial exploratory phase in which we have need for inference, but the frequentist assumptions are invalid and the Bayesian apparatus is not yet available. Indeed, some of them never evolve out of the exploratory phase. Problems at this level call for more primitive means of assigning probabilities directly out of our incomplete information. For this purpose, the Principle of maximum entropy has at present the clearest theoretical justification and is the most highly developed computationally, with an analytical apparatus as powerful and versatile as the Bayesian one. To apply it we must define a sample space, but do not need any model or sampling distribution. In effect, entropy maximization creates a model for us out of our data, which proves to be optimal by so many different criteria that it is hard to imagine circumstances where one would not want to use it in a problem where we have a sample space but no model."
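As a concrete illustration of the maximum-entropy principle Jaynes refers to, a short sketch (assumed Python with numpy and scipy; the die and the mean constraint are the standard textbook example, not from the source): given only a sample space {1,...,6} and a constraint on the mean, entropy maximization produces a distribution without any sampling model.

    # Sketch: maximum-entropy distribution over die faces {1,...,6}
    # subject only to a mean constraint (classic Jaynes-style example).
    import numpy as np
    from scipy.optimize import minimize

    faces = np.arange(1, 7)
    target_mean = 4.5                             # the only information we assume

    def neg_entropy(p):
        p = np.clip(p, 1e-12, 1.0)                # avoid log(0)
        return np.sum(p * np.log(p))              # minimizing this maximizes entropy

    constraints = [
        {"type": "eq", "fun": lambda p: p.sum() - 1.0},                    # probabilities sum to 1
        {"type": "eq", "fun": lambda p: (faces * p).sum() - target_mean},  # fixed mean
    ]
    result = minimize(neg_entropy, x0=np.full(6, 1 / 6), bounds=[(0, 1)] * 6,
                      constraints=constraints, method="SLSQP")
    print(np.round(result.x, 4))                  # probabilities tilt toward higher faces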
[frequentist statistics]
"Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (for example, the sample mean difference between two compared groups) would be equal to or more extreme than its observed value."
"P values are commonly used to test (and dismiss) a ‘null hypothesis’, which generally states that there is no difference between two groups, or that there is no correlation between a pair of characteristics. The smaller the P value, the less likely an observed set of values would occur by chance — assuming that the null hypothesis is true. A P value of 0.05 or less is generally taken to mean that a finding is statistically significant and warrants publication."
http://allendowney.blogspot.ru/2011/05/there-is-only-one-test.html
http://allendowney.blogspot.ru/2011/06/more-hypotheses-less-trivia.html
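The "only one test" framework in the posts above can be sketched directly (assumed Python with numpy; the two groups are made-up numbers): simulate the null hypothesis by permuting group labels, recompute the test statistic many times, and take the p-value as the fraction of simulated statistics at least as extreme as the observed one.

    # Sketch: a simulation-based (permutation) p-value, in the spirit of the
    # "there is only one test" posts linked above. The data are made up.
    import numpy as np

    rng = np.random.default_rng(0)
    group_a = np.array([12.1, 10.3, 11.8, 12.9, 10.7, 11.4])
    group_b = np.array([10.2,  9.8, 10.9, 10.1, 11.0,  9.6])

    observed = group_a.mean() - group_b.mean()    # test statistic

    pooled = np.concatenate([group_a, group_b])
    n_a = len(group_a)
    null_stats = np.empty(100_000)
    for i in range(len(null_stats)):
        rng.shuffle(pooled)                       # model of H0: group labels are exchangeable
        null_stats[i] = pooled[:n_a].mean() - pooled[n_a:].mean()

    # Two-sided p-value: how often is the simulated statistic as extreme as the observed one?
    p_value = np.mean(np.abs(null_stats) >= abs(observed))
    print(f"observed difference = {observed:.2f}, p ~ {p_value:.4f}")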
"The ASA's statement on p-values: context, process, and purpose" [http://dx.doi.org/10.1080/00031305.2016.1154108]:
- P-values can indicate how incompatible the data are with a specified statistical model.
"A p-value provides one approach to summarizing the incompatibility between a particular set of data and a proposed model for the data. The most common context is a model, constructed under a set of assumptions, together with a so-called “null hypothesis.” Often the null hypothesis postulates the absence of an effect, such as no difference between two groups, or the absence of a relationship between a factor and an outcome. The smaller the p-value, the greater the statistical incompatibility of the data with the null hypothesis, if the underlying assumptions used to calculate the p-value hold. This incompatibility can be interpreted as casting doubt on or providing evidence against the null hypothesis or the underlying assumptions."
- P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
"Researchers often wish to turn a p-value into a statement about the truth of a null hypothesis, or about the probability that random chance produced the observed data. The p-value is neither. It is a statement about data in relation to a specified hypothetical explanation, and is not a statement about the explanation itself."
- Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
"Practices that reduce data analysis or scientific inference to mechanical “bright-line” rules (such as “p < 0.05”) for justifying scientific claims or conclusions can lead to erroneous beliefs and poor decision-making. A conclusion does not immediately become “true” on one side of the divide and “false” on the other. Researchers should bring many contextual factors into play to derive scientific inferences, including the design of a study, the quality of the measurements, the external evidence for the phenomenon under study, and the validity of assumptions that underlie the data analysis. Pragmatic considerations often require binary, “yes-no” decisions, but this does not mean that p-values alone can ensure that a decision is correct or incorrect. The widespread use of “statistical significance” (generally interpreted as “p ≤ 0.05”) as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process."
- Proper inference requires full reporting and transparency.
"P-values and related analyses should not be reported selectively. Conducting multiple analyses of the data and reporting only those with certain p-values (typically those passing a significance threshold) renders the reported p-values essentially uninterpretable. Cherry-picking promising findings, also known by such terms as data dredging, significance chasing, significance questing, selective inference and “p-hacking”, leads to a spurious excess of statistically significant results in the published literature and should be vigorously avoided. One need not formally carry out multiple statistical tests for this problem to arise: Whenever a researcher chooses what to present based on statistical results, valid interpretation of those results is severely compromised if the reader is not informed of the choice and its basis. Researchers should disclose the number of hypotheses explored during the study, all data collection decisions, all statistical analyses conducted and all p-values computed. Valid scientific conclusions based on p-values and related statistics cannot be drawn without at least knowing how many and which analyses were conducted, and how those analyses (including p-values) were selected for reporting."
- A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
"Statistical significance is not equivalent to scientific, human, or economic significance. Smaller p-values do not necessarily imply the presence of larger or more important effects, and larger p-values do not imply a lack of importance or even lack of effect. Any effect, no matter how tiny, can produce a small p-value if the sample size or measurement precision is high enough, and large effects may produce unimpressive p-values if the sample size is small or measurements are imprecise. Similarly, identical estimated effects will have different p-values if the precision of the estimates differs."
- By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
"Researchers should recognize that a p-value without context or other evidence provides limited information. For example, a p-value near 0.05 taken by itself offers only weak evidence against the null hypothesis. Likewise, a relatively large p-value does not imply evidence in favor of the null hypothesis; many other hypotheses may be equally or more consistent with the observed data. For these reasons, data analysis should not end with the calculation of a p-value when other approaches are appropriate and feasible."
"
- The p-value doesn't tell scientists what they want (it is the probability of the data given that H0 is true, and scientists would like the probability of H0 or H1 given the data)
- H0 is often known to be false
- P-values are widely misunderstood
- Leads to binary yes/no thinking
- Prior information is never taken into account (Bayesian argument)
- A small p-value could reflect a very large sample size rather than a meaningful difference
- Leads to publication bias, because significant results (i.e. p < 0.05) are more likely to be published
"
"Null Hypothesis Significance Testing Never Worked" - http://www.fharrell.com/2017/01/null-hypothesis-significance-testing.html
http://quillette.com/2015/11/13/the-great-statistical-schism/
[bayesian statistics]
http://blog.efpsa.org/2014/11/17/bayesian-statistics-what-is-it-and-why-do-we-need-it-2/
http://blog.efpsa.org/2015/08/03/bayesian-statistics-why-and-how/
Kruschke - "Doing Bayesian Data Analysis" - http://www.users.csbsju.edu/~mgass/robert.pdf
"Bayesian Multi-armed Bandits vs A/B Tests" - https://habrahabr.ru/company/ods/blog/325416/ (in russian)
https://alexanderetz.com/understanding-bayes/ :
- a look at the likelihood - https://alexanderetz.com/2015/04/15/understanding-bayes-a-look-at-the-likelihood/
- updating priors via the likelihood
- evidence vs. conclusions
- what is the maximum Bayes factor for a given p value
- posterior probabilities vs. posterior odds
- objective vs. subjective Bayes
- prior probabilities for models vs. parameters
- strength of evidence vs. probability of obtaining that evidence
- the Jeffreys-Lindley paradox
- when do Bayesians and frequentists agree and why?
- bayesian model averaging
- bayesian bias mitigation
- bayesian updating over multiple studies
- does Bayes have error control?
"A likelihood (of data given model parameters) is similar to a probability, but the area under a likelihood curve does not add up to one like it does for a probability density. It treats the data as fixed (rather than as a random variable) and the likelihood of two different models can be compared by taking the ratio of their likelihoods, and a test of signficance can be performed."
"Likelihood is not a probability, but it is proportional to a probability. The likelihood of a hypothesis (H) given some data (D) is proportional to the probability of obtaining D given that H is true, multiplied by an arbitrary positive constant (K). In other words, L(H|D) = K · P(D|H). Since a likelihood isn’t actually a probability it doesn’t obey various rules of probability. For example, likelihood need not sum to 1.
A critical difference between probability and likelihood is in the interpretation of what is fixed and what can vary. In the case of a conditional probability, P(D|H), the hypothesis is fixed and the data are free to vary. Likelihood, however, is the opposite. The likelihood of a hypothesis, L(H|D), conditions on the data as if they are fixed while allowing the hypotheses to vary."
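A minimal sketch of the fixed-data, varying-hypothesis view described above (assumed Python with scipy; the two point hypotheses and the data are illustrative): compute L(H|D) ∝ P(D|H) for two candidate coin biases on the same fixed data and compare them via their ratio.

    # Sketch: likelihoods of two point hypotheses on the same fixed data,
    # compared via their ratio. Hypotheses and data are illustrative.
    from scipy.stats import binom

    k, n = 7, 10                       # data: 7 heads in 10 flips, held fixed
    L_fair   = binom.pmf(k, n, 0.5)    # L(H: theta = 0.5 | D), up to a constant K
    L_biased = binom.pmf(k, n, 0.7)    # L(H: theta = 0.7 | D)

    print("likelihood ratio (biased vs fair):", L_biased / L_fair)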
Likelihood:
- is central to almost all of statistics
- treats the data as fixed (once the experiment is complete, the data are fixed)
- allows one to compare hypotheses given the data
- captures the evidence in the data
- likelihoods can be easily combined, for example from two independent studies
- prior information can easily be included (Bayesian analysis)
- seems to be the way we normally think (Perneger & Courvoisier, 2010)
Bayes factors:
- not biased against H0
- allow us to state evidence for the absence of an effect
- condition only on the observed data
- allow one to stop an experiment once the data are informative enough
- subjective, just as p-values are
"Because complex models can capture many different observations, their prior on parameters p(θ) is spread out wider than those of simpler models. Thus there is little density at any specific point - because complex models can capture so many data points; taken individually, each data point is comparatively less likely. For the marginal likelihood, this means that the likelihood gets multiplied with these low density values of the prior, which decreases the overall marginal likelihood. Thus model comparison via Bayes factors incorporates an automatic Ockham’s razor, guarding us against overfitting. While classical approaches like the AIC naively add a penalty term (2 times the number of parameters) to incorporate model complexity, Bayes factors offer a more natural and principled approach to this problem."
pathologies of frequentist statistics according to E. T. Jaynes - https://youtube.com/watch?v=zZkwzvrO-pU
[interesting quotes]
(Claudia Perlich) https://quora.com/What-are-the-greatest-inefficiencies-data-scientists-face-today : "First off, let me state what I think is NOT the problem: the fact that data scientists spend 80% of their time with data preparation. That is their JOB! If you are not good at data preparation, you are NOT a good data scientist. It is not a janitor problem as Steve Lohr provoked. The validity of any analysis rests almost completely on the preparation. The algorithm you end up using is close to irrelevant. Complaining about data preparation is the same as being a farmer and complaining about having to do anything but harvesting and please have somebody else deal with the pesky watering, fertilizing, weeding, etc."
(Andrew Gelman) "In reality, null hypotheses are nearly always false. Is drug A identically effective as drug B? Certainly not. You know before doing an experiment that there must be some difference that would show up given enough data."
(Jim Berger) "A small p-value means the data were unlikely under the null hypothesis. Maybe the data were just as unlikely under the alternative hypothesis. Comparisons of hypotheses should be conditional on the data."
(Stephen Ziliak, Deirdre McCloskey) "Statistical significance is not the same as scientific significance. The most important question for science is the size of an effect, not whether the effect exists."
(William Gosset) "Statistical error is only one component of real error, maybe a small component. When you actually conduct multiple experiments rather than speculate about hypothetical experiments, the variability of your data goes up."
(John Ioannidis) "Small p-values do not mean small probability of being wrong. In one review, 74% of studies with p-value 0.05 were found to be wrong."
() "Empirical model comparisons based on real data compare (model1, estimator1, algorithm1) with (model2, estimator2, algorithm2). Saying my observations, xᵢ, are IID draws from a Bernoulli(p) random variable is a model for my data. Using the sample mean, p̂ = mean(xᵢ), to estimate the value of p is an estimator for that model. Computing the sample mean as sum(x)/length(x) is an algorithm if you assume sum and length are primitives. The distinctions matter because you always fit models to data using a triple of (model, estimator, algorithm)."
() "Data Science is a lot more than machine learning:
- understanding goals (sometimes requires background research)
- how to get the right data
- figuring out what to measure or optimize
- beware the lazy path of computing what's easy but wrong"
<brylevkirill (at) gmail.com>