Statistical Modeling: The Two Cultures


Key ideas

  • Culture 1 assumes data are generated by stochastic data models
  • Culture 2 uses algorithmic models and treats the data-generating mechanism as unknown
  • Should the field of statistics learn from Culture 2 instead of assuming Culture 1 by default?

Introduction

  • There are usually two goals in analyzing data: prediction and information
  • Data Modeling Culture: linear regression, logistic regression; validated by goodness-of-fit tests & residual examination
  • Algorithmic Modeling Culture: decision trees, neural nets; validated by predictive accuracy
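
A minimal sketch of the contrast, with invented toy data; a 1-nearest-neighbour rule stands in for the algorithmic culture's black-box predictors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: the true relationship is nonlinear, unknown to the analyst.
x = rng.uniform(-2, 2, 200)
y = np.sin(2 * x) + rng.normal(0, 0.1, 200)

# Culture 1 (data modeling): posit y = a*x + b, judge the model by in-sample fit.
a, b = np.polyfit(x, y, 1)
r2 = 1 - np.var(y - (a * x + b)) / np.var(y)

# Culture 2 (algorithmic modeling): any predictor f(x), judged by accuracy
# on held-out data (here, a 1-nearest-neighbour rule).
train_x, train_y, test_x, test_y = x[:150], y[:150], x[150:], y[150:]
preds = np.array([train_y[np.argmin(np.abs(train_x - q))] for q in test_x])
test_mse = np.mean((preds - test_y) ** 2)
```

The linear data model reports a poor fit and stops there; the algorithmic predictor makes no distributional assumption and is scored purely on held-out error.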

Consulting Projects background

  • Ozone project: predict next-day ozone levels in LA basin - FAILURE
    • Major source was automobile tailpipe emissions
    • Commuting patterns in LA are regular, varying less than 5% from day to day
    • Linear regression was used to predict this, with 450 candidate variables, quadratic terms...
  • Chlorine project: determine potential toxicity of compounds through mass spectra - SUCCESS
    • Cost and availability of chemists was a concern
    • Dimensionality varies from 30 to 10k
    • Goal to predict "contains chlorine or NOT"
    • Linear discriminant analysis and quadratic discriminant analysis failed
    • Decision trees combined with domain knowledge, in a set of 1,500 branches, gave 95% accuracy on predictions
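
A toy sketch of the idea (synthetic spectra, invented function names): a single domain-knowledge branch encoding the well-known ~3:1 natural abundance of the 35Cl/37Cl isotopes, which makes chlorinated compounds show peak pairs two mass units apart. This is only an illustration of tree-style rules, not Breiman's actual 1,500-branch classifier:

```python
import numpy as np

rng = np.random.default_rng(1)

def has_chlorine_signature(spectrum):
    """One domain-knowledge branch: chlorine's 35Cl/37Cl isotopes
    (~3:1 abundance) produce strong peak pairs two mass units apart
    with intensity ratio near 3:1."""
    for m in range(len(spectrum) - 2):
        a, b = spectrum[m], spectrum[m + 2]
        if a > 1.0 and abs(a / (b + 1e-9) - 3.0) < 0.5:
            return True
    return False

def make_spectrum(chlorinated):
    s = rng.random(50) * 0.2          # low-intensity background peaks
    if chlorinated:
        m = rng.integers(5, 45)
        s[m], s[m + 2] = 3.0, 1.0     # planted 3:1 isotope peak pair
    return s

labels = [True, False] * 50
acc = np.mean([has_chlorine_signature(make_spectrum(c)) == c for c in labels])
```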

Return to university

  • All statisticians explored problems by prefacing them with "assume that the data are generated by the following model"
  • When a model is fit to data to draw quantitative conclusions, the conclusions are about the model's mechanism, not nature's mechanism; if the model is a poor emulation of nature, the conclusions are wrong.
  • Besides computing R2, nothing else was done to see if the observational data could have been generated by model (R)

Problems in current data modeling

  • During a study that assessed whether gender discrimination in faculty salaries was real:
    • A linear regression fit the data, and the gender coefficient was significant at the 5% level; this was taken as strong evidence of discrimination
    • Questions such as "Is inference justified if your sample is 100% of the population?" and "Can the data answer the question posed?" were never asked.
  • Standard goodness-of-fit tests and residual analysis are just not applicable if variables are deleted or nonlinear combinations of the variables are added.
  • Acceptable residual plots do not imply the residuals are well behaved in all dimensions; they may look fine in only a few dimensions.
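
A sketch of that last point, on invented data: an additive model fit to a surface with an interaction term leaves residuals that look clean against each single predictor but are strongly structured in the x1*x2 direction:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = x1 + x2 + x1 * x2 + rng.normal(0, 0.2, n)  # true surface has an interaction

# Fit the additive model y ~ 1 + x1 + x2, which misses the interaction.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Residual plots against each single predictor look clean...
corr_x1 = np.corrcoef(resid, x1)[0, 1]
corr_x2 = np.corrcoef(resid, x2)[0, 1]
# ...but the residuals are strongly correlated with the omitted interaction.
corr_int = np.corrcoef(resid, x1 * x2)[0, 1]
```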

Limitations of data models

  • Nobody really believes that multivariate data are multivariate normal, yet that data model is often the standard
  • It's delusional to think that nature fits the parametric models a statistician selects for all variables

Algorithmic modeling

  • Key: neural networks, decision trees, cross-validation
  • Problem is to find an algorithm f(x) such that for future x in a test set, f(x) is close enough to y
  • The one assumption is that the data are drawn from an unknown multivariate distribution
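
A minimal cross-validation sketch on toy data, again using a 1-nearest-neighbour rule as a stand-in for trees or nets:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 120
x = rng.uniform(-1, 1, n)
y = x ** 2 + rng.normal(0, 0.05, n)

def one_nn(train_x, train_y, q):
    """Predict with the nearest training point (stand-in for trees/nets)."""
    return train_y[np.argmin(np.abs(train_x - q))]

# 5-fold cross-validation estimate of prediction error for the rule.
folds = np.array_split(rng.permutation(n), 5)
fold_mse = []
for fold in folds:
    train = np.setdiff1d(np.arange(n), fold)
    preds = np.array([one_nn(x[train], y[train], q) for q in x[fold]])
    fold_mse.append(np.mean((preds - y[fold]) ** 2))
cv_mse = float(np.mean(fold_mse))
```

The estimate of predictive accuracy is the only scorecard; no claim is made about the form of the underlying distribution.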

Rashomon and the multiplicity of good models

  • A multitude of different f(x) might give the same minimum error rate
  • Usually we pick the one with lowest RSS (residual sum of squares) or lowest test error
  • If we perturb the dataset by removing a random 2-3% of the data, the resulting decision tree will be dramatically different yet have the same test error.
    • Similar results are obtained with neural nets
  • When a similar perturbation is applied with logistic regression, the resulting models differ drastically, and so do the conclusions about which variables are important
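
The instability is easy to reproduce on invented data with plain least squares and two nearly collinear predictors: dropping a random ~3% of the rows swings the coefficients (the "which variables matter" conclusion) while prediction error barely moves:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0, 0.01, n)       # x2 nearly collinear with x1
y = x1 + x2 + rng.normal(0, 0.5, n)
X = np.column_stack([x1, x2])

coefs, errs = [], []
for _ in range(20):
    keep = rng.random(n) > 0.03        # drop a random ~3% of the rows
    b, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    coefs.append(b)
    errs.append(np.mean((X @ b - y) ** 2))   # error on the full data

coef_spread = np.ptp(np.array(coefs), axis=0)  # coefficients swing widely...
err_spread = np.ptp(errs)                      # ...while the error barely moves
```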

Occam's razor

  • The author argues that the simplest model does not always make the best predictions
  • The real criterion behind simplicity is interpretability: random forests are great predictors but hard to interpret.

Bellman and the curse of dimensionality

  • When there were too many predictor variables, the standard recipe for decades was to select the few features that preserve most of the information.
  • However, more dimensions are often a blessing for neural nets, decision trees, and random forests, which can combine many weak predictors
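
A stylized illustration (invented data, not one of Breiman's examples) of why many weak features can beat a few selected ones when a method is able to pool them:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 500, 100
signal = rng.normal(size=n)
# Every column is a weak, noisy view of the same underlying signal.
X = signal[:, None] + rng.normal(0, 3.0, (n, p))

few = X[:, :3].mean(axis=1)     # classic recipe: keep only a few features
many = X.mean(axis=1)           # pool all 100 dimensions
mse_few = np.mean((few - signal) ** 2)
mse_many = np.mean((many - signal) ** 2)
```

Pooling all 100 noisy features averages the noise away, while any small subset retains most of it.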