- Culture 1 assumes data are generated by stochastic data models
- Culture 2 uses algorithmic models and treats the data-generating mechanism as unknown
- Should the field of statistics learn from Culture 2 instead of assuming Culture 1 by default?
- Usually there are two goals in analyzing data: prediction and information (understanding how nature relates the inputs to the response)
- Data Modeling Culture: linear regression, logistic regression; models are validated by goodness-of-fit tests and residual examination
- Algorithmic Modeling Culture: decision trees, neural nets; models are validated by predictive accuracy
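A minimal sketch (not from the paper; synthetic data and scikit-learn chosen here for illustration) of how the two cultures judge a model on the same data: one fits a stochastic data model and inspects fit and residuals, the other treats the mechanism as unknown and scores f(x) on held-out data.

```python
# Toy comparison of the two validation styles on the same synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
# The "true" mechanism is partly nonlinear, which neither culture gets to see.
y = X[:, 0] + np.sin(3 * X[:, 1]) + 0.1 * rng.normal(size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Culture 1: fit a stochastic data model, judge it by fit and residuals.
lin = LinearRegression().fit(X_tr, y_tr)
residuals = y_tr - lin.predict(X_tr)
print("linear model training R^2:", round(r2_score(y_tr, lin.predict(X_tr)), 3))
print("residual mean / std:", round(residuals.mean(), 3), round(residuals.std(), 3))

# Culture 2: treat the mechanism as unknown, judge f(x) by predictive accuracy.
tree = DecisionTreeRegressor(max_depth=6, random_state=0).fit(X_tr, y_tr)
print("linear test MSE:", round(mean_squared_error(y_te, lin.predict(X_te)), 3))
print("tree   test MSE:", round(mean_squared_error(y_te, tree.predict(X_te)), 3))
```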
- Ozone project: predict next-day ozone levels in the LA basin - FAILURE
- Major source was automobile tailpipe emissions
- Commuting patterns in LA are regular, varying less than 5% day to day
- Linear regression was used for the prediction, with around 450 variables including quadratic terms...
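A rough sketch of that style of attempt (synthetic stand-in data, not the actual LA measurements): expand many variables with quadratic terms, fit ordinary least squares, and check prediction on held-out "days".

```python
# Sketch of an ozone-style attempt: OLS over many variables plus quadratic
# terms. The data below are synthetic stand-ins, not the LA basin series.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
n_days, n_vars = 2000, 30            # small stand-in for the ~450 variables
X = rng.normal(size=(n_days, n_vars))
y = np.exp(0.5 * X[:, 0]) * (1 + X[:, 1] ** 2) + rng.normal(scale=0.5, size=n_days)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Expand to quadratic terms (~500 columns here), then fit ordinary least squares.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X_tr, y_tr)

print("held-out MSE:", round(mean_squared_error(y_te, model.predict(X_te)), 2))
print("variance of y:", round(y_te.var(), 2))    # baseline for judging the MSE
```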
- Chlorine project: determine potential toxicity of compounds through mass spectra - SUCCESS
- The cost and availability of chemists were a concern
- Dimensionality varied from 30 to 10k
- Goal to predict "contains chlorine or NOT"
- Linear discriminant analysis and quadratic discriminant analysis failed
- Decision trees built with domain knowledge, grown to a set of about 1,500 branches, gave 95% prediction accuracy
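The real spectra can't be reproduced here, but a toy sketch (synthetic data, scikit-learn) shows one way a mean-based linear discriminant can fail where a shallow axis-aligned tree succeeds: the two classes below differ only in the spread of one feature, so LDA has nothing to separate on.

```python
# Toy stand-in for the chlorine task: high-dimensional features, binary label.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(3000, 50))            # synthetic "spectral" features
y = (np.abs(X[:, 0]) > 1.0).astype(int)    # class differs in spread, not in mean

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

# LDA separates classes by their means; here the class means coincide.
lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
# A shallow tree can recover the two thresholds on the informative feature.
tree = DecisionTreeClassifier(max_depth=4, random_state=2).fit(X_tr, y_tr)

print("LDA  test accuracy:", round(lda.score(X_te, y_te), 3))
print("tree test accuracy:", round(tree.score(X_te, y_te), 3))
```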
- Statisticians in the data modeling culture typically approach problems by prefacing them with "assume that the data are generated by the following model"
- When a model is fit to data to draw quantitative conclusions, the conclusions are about the model's mechanism, not nature's mechanism; if the model is a poor emulation of nature, the conclusions may be wrong
- Besides computing R², nothing else was done to check whether the observational data could have been generated by model (R)
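A quick synthetic illustration (not from the paper) of why R² alone proves little: below, the true mechanism is quadratic, the fitted model is linear, R² still comes out around 0.93, and the systematic error only shows up when you look at the residuals by region.

```python
# The true mechanism is quadratic; the fitted data model is linear.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
x = rng.uniform(0, 3, size=500)
y = x ** 2 + rng.normal(scale=0.2, size=500)

lin = LinearRegression().fit(x.reshape(-1, 1), y)
print("R^2 of the misspecified linear model:", round(lin.score(x.reshape(-1, 1), y), 3))

# The residuals are systematically positive at the ends and negative in the
# middle, which R^2 alone never reveals.
residuals = y - lin.predict(x.reshape(-1, 1))
for lo, hi in [(0, 1), (1, 2), (2, 3)]:
    mask = (x >= lo) & (x < hi)
    print(f"mean residual for x in [{lo}, {hi}): {residuals[mask].mean():+.2f}")
```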
- During a study that assessed whether gender discrimination in faculty salaries was real:
- A linear regression fit the data, and the gender coefficient was significant at the 5% level; this was taken as strong evidence of discrimination
- Questions such as "Is inference justified if your sample is 100% of the population?" or "Can the data answer the question posed?" were never asked
- Standard goodness-of-fit tests and residual analysis are just not applicable if variables are deleted or nonlinear combinations of the variables are added
- Acceptable residual plots do not imply the residuals are well-behaved in all dimensions - they may only look good in the few dimensions that were examined
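A small synthetic sketch of that last point: the residuals below are uncorrelated with every individual predictor (the usual plots would look clean), while almost all of the leftover signal sits in an interaction direction nobody plotted.

```python
# The fitted model misses an interaction; one-dimensional residual checks
# against each predictor look clean anyway.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(2000, 3))
y = X[:, 0] + X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=2000)

lin = LinearRegression().fit(X, y)
resid = y - lin.predict(X)

for j in range(3):  # the usual marginal residual checks
    print(f"corr(residuals, x{j}) = {np.corrcoef(resid, X[:, j])[0, 1]:+.3f}")

# The direction nobody plotted carries nearly all of the leftover signal.
print(f"corr(residuals, x1*x2) = {np.corrcoef(resid, X[:, 1] * X[:, 2])[0, 1]:+.3f}")
```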
- Nobody really believes that multivariate data are multivariate normal, yet that data model is often the standard
- It's delusional to think that nature fits the parametric models selected by a statistician for all variables
- Key tools of the Algorithmic Modeling Culture: neural networks, decision trees, cross-validation
- The problem is to find a function f(x) such that, for future x in a test set, f(x) is close enough to y
- The one assumption is that the data are drawn from an unknown multivariate distribution
- A multitude of different f(x) might give the same minimum error rate
- Usually we pick the one with lowest RSS (residual sum of squares) or lowest test error
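A sketch of that selection step (synthetic data, arbitrary model choices): several quite different f(x) are scored by cross-validated error, and the one with the lowest error is kept.

```python
# Score several candidate f(x) by cross-validated error and keep the best.
from sklearn.datasets import make_friedman1
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

X, y = make_friedman1(n_samples=1000, noise=1.0, random_state=5)

candidates = {
    "linear regression": LinearRegression(),
    "tree, depth 6": DecisionTreeRegressor(max_depth=6, random_state=5),
    "tree, depth 12": DecisionTreeRegressor(max_depth=12, random_state=5),
    "random forest": RandomForestRegressor(n_estimators=200, random_state=5),
}
for name, model in candidates.items():
    # scikit-learn reports negated MSE for this scoring; flip the sign to print MSE
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"{name:>18}: CV MSE = {mse:.2f}")
```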
- If we perturb the dataset by removing a random 2-3% of the data, the DT will be dramatically different but with the same test error.
- Similar results are obtained with neural nets
- Applying the same perturbation to logistic regression gives an extremely different model, and the conclusions about which variables are important change as well
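One way to probe this claim (synthetic data; the exact output depends on the seed): drop a random ~3% of the training rows a few times, refit a tree and a logistic regression each time, and compare the tree's size, its test accuracy, and which variables the logistic model ranks as most important.

```python
# Refit after small random deletions and compare what each model "says".
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           random_state=6)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=6)

rng = np.random.default_rng(6)
for trial in range(3):
    keep = rng.random(len(X_tr)) > 0.03           # drop a random ~3% of the rows
    Xp, yp = X_tr[keep], y_tr[keep]

    tree = DecisionTreeClassifier(random_state=0).fit(Xp, yp)
    logit = LogisticRegression(max_iter=5000).fit(Xp, yp)

    top_coefs = np.argsort(np.abs(logit.coef_[0]))[::-1][:3]
    print(f"trial {trial}: tree nodes={tree.tree_.node_count}, "
          f"tree test acc={tree.score(X_te, y_te):.3f}, "
          f"top |coef| variables={top_coefs}")
```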
- The author makes the point that the simplest models don't always make the best predictions
- The criterion behind simplicity is interpretability; random forests are great predictors but hard to interpret
- When there were too many predictor variables, the recipe for decades was to find the few features that preserve most of the information
- However, more dimensions are often a blessing for NNs, DTs, and random forests
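A sketch of that contrast (synthetic data; the "5 components" choice is arbitrary): a random forest trained on all features versus one trained on a low-dimensional PCA summary of the same data.

```python
# Reduce-first recipe vs feeding the flexible model everything.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=200, n_informative=50,
                           random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=7)

# Algorithmic model on the full, high-dimensional feature set.
rf_full = RandomForestClassifier(n_estimators=300, random_state=7).fit(X_tr, y_tr)

# The classic recipe: compress to a few summary features first.
pca = PCA(n_components=5).fit(X_tr)
rf_small = RandomForestClassifier(n_estimators=300, random_state=7)
rf_small.fit(pca.transform(X_tr), y_tr)

print("RF on all 200 features:", round(rf_full.score(X_te, y_te), 3))
print("RF on 5 PCA components:", round(rf_small.score(pca.transform(X_te), y_te), 3))
```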