- Culture 1 assumes data are generated by stochastic data models
- Culture 2 uses algorithmic models and treats the data-generating mechanism as unknown
- Should the field of statistics learn from Culture 2 instead of assuming Culture 1 by default?
- Usually there are two goals in analyzing data: prediction and information (understanding how nature relates the inputs to the response)
- Data Modeling Culture: linear regression, logistic regression; models are validated by goodness-of-fit tests and residual examination
- Algorithmic Modeling Culture: decision trees, neural nets; models are validated by predictive accuracy
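A minimal sketch (not from the paper; synthetic data and scikit-learn chosen here for illustration) of how the two cultures judge a model on the same data: one fits a stochastic data model and inspects fit and residuals, the other treats the mechanism as unknown and scores f(x) on held-out data.

```python
# Toy comparison of the two validation styles on the same synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
# The "true" mechanism is partly nonlinear, which neither culture gets to see.
y = X[:, 0] + np.sin(3 * X[:, 1]) + 0.1 * rng.normal(size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Culture 1: fit a stochastic data model, judge it by fit and residuals.
lin = LinearRegression().fit(X_tr, y_tr)
residuals = y_tr - lin.predict(X_tr)
print("linear model training R^2:", round(r2_score(y_tr, lin.predict(X_tr)), 3))
print("residual mean / std:", round(residuals.mean(), 3), round(residuals.std(), 3))

# Culture 2: treat the mechanism as unknown, judge f(x) by predictive accuracy.
tree = DecisionTreeRegressor(max_depth=6, random_state=0).fit(X_tr, y_tr)
print("linear test MSE:", round(mean_squared_error(y_te, lin.predict(X_te)), 3))
print("tree   test MSE:", round(mean_squared_error(y_te, tree.predict(X_te)), 3))
```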
- Ozone project: predict next-day ozone levels in the LA basin - FAILURE
- Major source was automobile tailpipe emissions
- Commuting patterns in LA are regular, varying less than 5% day to day
- Linear regression was used for the prediction, with around 450 variables including quadratic terms...
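A rough sketch of that style of attempt (synthetic stand-in data, not the actual LA measurements): expand many variables with quadratic terms, fit ordinary least squares, and check prediction on held-out "days".

```python
# Sketch of an ozone-style attempt: OLS over many variables plus quadratic
# terms. The data below are synthetic stand-ins, not the LA basin series.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
n_days, n_vars = 2000, 30            # small stand-in for the ~450 variables
X = rng.normal(size=(n_days, n_vars))
y = np.exp(0.5 * X[:, 0]) * (1 + X[:, 1] ** 2) + rng.normal(scale=0.5, size=n_days)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Expand to quadratic terms (~500 columns here), then fit ordinary least squares.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X_tr, y_tr)

print("held-out MSE:", round(mean_squared_error(y_te, model.predict(X_te)), 2))
print("variance of y:", round(y_te.var(), 2))    # baseline for judging the MSE
```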
- Chlorine project: determine potential toxicity of compounds through mass spectra - SUCCESS
- The cost and availability of chemists were a concern
- Dimensionality varied from 30 to 10k
- Goal to predict "contains chlorine or NOT"
- Linear discriminant analysis and quadratic discriminant analysis failed
- Decision trees built with domain knowledge, grown to a set of about 1,500 branches, gave 95% prediction accuracy
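The real spectra can't be reproduced here, but a toy sketch (synthetic data, scikit-learn) shows one way a mean-based linear discriminant can fail where a shallow axis-aligned tree succeeds: the two classes below differ only in the spread of one feature, so LDA has nothing to separate on.

```python
# Toy stand-in for the chlorine task: high-dimensional features, binary label.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(3000, 50))            # synthetic "spectral" features
y = (np.abs(X[:, 0]) > 1.0).astype(int)    # class differs in spread, not in mean

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

# LDA separates classes by their means; here the class means coincide.
lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
# A shallow tree can recover the two thresholds on the informative feature.
tree = DecisionTreeClassifier(max_depth=4, random_state=2).fit(X_tr, y_tr)

print("LDA  test accuracy:", round(lda.score(X_te, y_te), 3))
print("tree test accuracy:", round(tree.score(X_te, y_te), 3))
```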
- Statisticians in the data modeling culture typically approach problems by prefacing them with "assume that the data are generated by the following model"
- When a model is fit to data to draw quantitative conclusions, the conclusions are about the model's mechanism, not nature's mechanism; if the model is a poor emulation of nature, the conclusions may be wrong
- Besides computing R², nothing else was done to check whether the observational data could have been generated by model (R)
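A quick synthetic illustration (not from the paper) of why R² alone proves little: below, the true mechanism is quadratic, the fitted model is linear, R² still comes out around 0.93, and the systematic error only shows up when you look at the residuals by region.

```python
# The true mechanism is quadratic; the fitted data model is linear.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
x = rng.uniform(0, 3, size=500)
y = x ** 2 + rng.normal(scale=0.2, size=500)

lin = LinearRegression().fit(x.reshape(-1, 1), y)
print("R^2 of the misspecified linear model:", round(lin.score(x.reshape(-1, 1), y), 3))

# The residuals are systematically positive at the ends and negative in the
# middle, which R^2 alone never reveals.
residuals = y - lin.predict(x.reshape(-1, 1))
for lo, hi in [(0, 1), (1, 2), (2, 3)]:
    mask = (x >= lo) & (x < hi)
    print(f"mean residual for x in [{lo}, {hi}): {residuals[mask].mean():+.2f}")
```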
- During a study that assessed whether gender discrimination in faculty salaries was real:
- A linear regression fit the data, and the gender coefficient was significant at the 5% level; this was taken as strong evidence of discrimination
- Questions such as "Is inference justified if your sample is 100% of the population?" or "Can the data answer the question posed?" were never asked
- Standard goodness-of-fit tests and residual analysis are just not applicable if variables are deleted or nonlinear combinations of the variables are added
- Acceptable residual plots do not imply the residuals are well-behaved in all dimensions - they may only look good in the few dimensions that were examined
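A small synthetic sketch of that last point: the residuals below are uncorrelated with every individual predictor (the usual plots would look clean), while almost all of the leftover signal sits in an interaction direction nobody plotted.

```python
# The fitted model misses an interaction; one-dimensional residual checks
# against each predictor look clean anyway.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(2000, 3))
y = X[:, 0] + X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=2000)

lin = LinearRegression().fit(X, y)
resid = y - lin.predict(X)

for j in range(3):  # the usual marginal residual checks
    print(f"corr(residuals, x{j}) = {np.corrcoef(resid, X[:, j])[0, 1]:+.3f}")

# The direction nobody plotted carries nearly all of the leftover signal.
print(f"corr(residuals, x1*x2) = {np.corrcoef(resid, X[:, 1] * X[:, 2])[0, 1]:+.3f}")
```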
- Nobody really believes that multivariate data are multivariate normal, yet that data model is often the standard
- It's delusional to think that nature fits the parametric models selected by a statistician for all variables
- Key tools of the Algorithmic Modeling Culture: neural networks, decision trees, cross-validation
- The problem is to find a function f(x) such that, for future x in a test set, f(x) is close enough to y
- The one assumption is that the data are drawn from an unknown multivariate distribution
- A multitude of different f(x) might give the same minimum error rate
- Usually we pick the one with lowest RSS (residual sum of squares) or lowest test error
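A sketch of that selection step (synthetic data, arbitrary model choices): several quite different f(x) are scored by cross-validated error, and the one with the lowest error is kept.

```python
# Score several candidate f(x) by cross-validated error and keep the best.
from sklearn.datasets import make_friedman1
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

X, y = make_friedman1(n_samples=1000, noise=1.0, random_state=5)

candidates = {
    "linear regression": LinearRegression(),
    "tree, depth 6": DecisionTreeRegressor(max_depth=6, random_state=5),
    "tree, depth 12": DecisionTreeRegressor(max_depth=12, random_state=5),
    "random forest": RandomForestRegressor(n_estimators=200, random_state=5),
}
for name, model in candidates.items():
    # scikit-learn reports negated MSE for this scoring; flip the sign to print MSE
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"{name:>18}: CV MSE = {mse:.2f}")
```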
- If we perturb the dataset by removing a random 2-3% of the data, the DT will be dramatically different but with the same test error.
- Similar results are obtained with neural nets
- Applying the same perturbation to logistic regression gives an extremely different model, and the conclusions about which variables are important change as well
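One way to probe this claim (synthetic data; the exact output depends on the seed): drop a random ~3% of the training rows a few times, refit a tree and a logistic regression each time, and compare the tree's size, its test accuracy, and which variables the logistic model ranks as most important.

```python
# Refit after small random deletions and compare what each model "says".
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           random_state=6)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=6)

rng = np.random.default_rng(6)
for trial in range(3):
    keep = rng.random(len(X_tr)) > 0.03           # drop a random ~3% of the rows
    Xp, yp = X_tr[keep], y_tr[keep]

    tree = DecisionTreeClassifier(random_state=0).fit(Xp, yp)
    logit = LogisticRegression(max_iter=5000).fit(Xp, yp)

    top_coefs = np.argsort(np.abs(logit.coef_[0]))[::-1][:3]
    print(f"trial {trial}: tree nodes={tree.tree_.node_count}, "
          f"tree test acc={tree.score(X_te, y_te):.3f}, "
          f"top |coef| variables={top_coefs}")
```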
- The author makes the point that the simplest models don't always make the best predictions
- The criterion behind simplicity is interpretability; random forests are great predictors but hard to interpret
- When there were too many predictor variables, the recipe for decades was to find the few features that preserve most of the information
- However, more dimensions are often a blessing for NNs, DTs, and random forests
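A sketch of that contrast (synthetic data; the "5 components" choice is arbitrary): a random forest trained on all features versus one trained on a low-dimensional PCA summary of the same data.

```python
# Reduce-first recipe vs feeding the flexible model everything.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=200, n_informative=50,
                           random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=7)

# Algorithmic model on the full, high-dimensional feature set.
rf_full = RandomForestClassifier(n_estimators=300, random_state=7).fit(X_tr, y_tr)

# The classic recipe: compress to a few summary features first.
pca = PCA(n_components=5).fit(X_tr)
rf_small = RandomForestClassifier(n_estimators=300, random_state=7)
rf_small.fit(pca.transform(X_tr), y_tr)

print("RF on all 200 features:", round(rf_full.score(X_te, y_te), 3))
print("RF on 5 PCA components:", round(rf_small.score(pca.transform(X_te), y_te), 3))
```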