---
title: "Data Science for Economics"
subtitle: "Lecture 11: Shrinkage methods"
author: "Dawie van Lill"
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
  xaringan::moon_reader:
    css: xaringan-themer.css
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
```{r xaringan-themer, include=FALSE, warning=FALSE}
library(xaringanthemer)
style_duo_accent(primary_color = "#035AA6", secondary_color = "#03A696")
```
## Packages
I thought slides would be easier to present than notes.
The code and slides are on the GitHub page.
Here are the packages that we will be using for this session.
```{r, cache=F, message=F}
if (!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, ggplot2, rsample, caret, glmnet, vip, tidyverse, pdp, AmesHousing)
```
Here is a quick rundown of some of the potentially new packages.
**rsample**: infrastructure for data splitting and resampling
**caret**: a unified interface for training and tuning machine learning models
**glmnet**: fast fitting of lasso and elastic-net regularised (generalised) linear models
---
## Linear regression (one variable)
A quick example of linear regression using the machine learning workflow.
We want to model the linear relationship between the total above-ground living space of a home (`Gr_Liv_Area`) and its sale price (`Sale_Price`).
First, we construct our training sample.
```{r, cache=F, message=F}
set.seed(123)
ames <- AmesHousing::make_ames()
split <- initial_split(ames, prop = 0.7, strata = "Sale_Price")
ames_train <- training(split)
ames_test <- testing(split)
```
Second, we fit the model on the training data.
```{r, cache=F, message=F}
model1 <- lm(Sale_Price ~ Gr_Liv_Area, data = ames_train)
```
In the next slide we show the results from this regression.
---
```{r, regression_1, cache=F, message=F, echo=F, fig.retina = 2, fig.width = 10, fig.height = 8, fig.align = 'center'}
# Fitted regression line (full training data)
p1 <- model1 %>%
broom::augment() %>%
ggplot(aes(Gr_Liv_Area, Sale_Price)) +
geom_point(size = 1, alpha = 0.3) +
geom_smooth(se = FALSE, method = "lm") +
scale_y_continuous(labels = scales::dollar) +
ggtitle("Fitted regression line")
p1
```
---
```{r, regression_2, cache=F, message=F, echo=F, fig.retina = 2, fig.width = 10, fig.height = 8, fig.align = 'center'}
p2 <- model1 %>%
broom::augment() %>%
ggplot(aes(Gr_Liv_Area, Sale_Price)) +
geom_segment(aes(x = Gr_Liv_Area, y = Sale_Price,
xend = Gr_Liv_Area, yend = .fitted),
alpha = 0.3) +
geom_point(size = 1, alpha = 0.3) +
geom_smooth(se = FALSE, method = "lm") +
scale_y_continuous(labels = scales::dollar) +
ggtitle("Fitted regression line (with residuals)")
p2
```
---
## Evaluate results
```{r, cache=F, message=F}
summary(model1)
```
---
## Multiple linear regression
You can include more than one predictor, such as the above-ground square footage and the year the house was built.
```{r, cache=F, message=F}
model2 <- lm(Sale_Price ~ Gr_Liv_Area + Year_Built, data = ames_train)
```
You can add as many predictors as you like.
```{r, cache=F, message=F}
model3 <- lm(Sale_Price ~ ., data = ames_train)
```
Let us now compare the models' cross-validated results to see which one is the most accurate.
---
## Linear regression
```{r, cache=F, message=F}
set.seed(123)
(cv_model1 <- train(form = Sale_Price ~ Gr_Liv_Area,
data = ames_train,
method = "lm",
trControl = trainControl(method = "cv", number = 10)
))
```
---
## Multiple linear regression
```{r, cache=F, message=F}
set.seed(123)
(cv_model2 <- train(form = Sale_Price ~ Gr_Liv_Area + Year_Built,
data = ames_train,
method = "lm",
trControl = trainControl(method = "cv", number = 10)
))
```
---
### Out of sample performance
```{r, cache=F, message=F, echo=FALSE}
summary(resamples(list(
model1 = cv_model1,
model2 = cv_model2
)))
```
---
## Regularised regression
Datasets typically have a large number of features.
As the number of features grows, models tend to **overfit the data**.
Regularisation methods provide a means to constrain / regularise the estimated coefficients.
The easiest way to understand this is to see how and why it is applied to OLS.
The objective in OLS is to find the *hyperplane* that minimises the SSE between the observed and predicted response values.
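Concretely, writing $\hat{y}_i$ for the fitted value of observation $i$:
$$\text{SSE} = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$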
This means identifying the hyperplane that minimises the grey lines in the next slide.
---
```{r, regression_3, cache=F, message=F, echo=F, fig.retina = 2, fig.width = 10, fig.height = 8, fig.align = 'center'}
ames_sub <- ames_train %>%
filter(Gr_Liv_Area > 1000 & Gr_Liv_Area < 3000) %>%
sample_frac(.5)
model4 <- lm(Sale_Price ~ Gr_Liv_Area, data = ames_sub)
model4 %>%
broom::augment() %>%
ggplot(aes(Gr_Liv_Area, Sale_Price)) +
geom_segment(aes(x = Gr_Liv_Area, y = Sale_Price,
xend = Gr_Liv_Area, yend = .fitted),
alpha = 0.3) +
geom_point(size = 1, color = "red") +
geom_smooth(se = FALSE, method = "lm") +
scale_y_continuous(labels = scales::dollar)
```
---
## Regularised regression
OLS adheres to some specific assumptions:
- Linear relationship
- More observations than features, i.e. $n > p$
- No or little multicollinearity
However, many datasets, such as those involved in text mining, have more features than observations.
If $p > n$, there are many potential problems:
- The model is less interpretable
- There are infinitely many solutions to the OLS problem
We assume that a small subset of these features exhibits the strongest effects.
This is called the **sparsity principle**.
---
## Regularised regression
The objective function of a regularised regression model includes a penalty term $P$:
$$\min(\text{SSE} + P)$$
The penalty constrains the size of the coefficients.
A coefficient can only increase if it buys a corresponding decrease in the SSE (i.e. the loss function).
This generalises to other linear models as well (such as logistic regression).
Three common penalties:
1. Ridge
2. Lasso [**think cowboys!**]
3. Elastic net (combination of ridge and lasso)
---
## Ridge regression
Ridge regression controls the coefficients by adding $\lambda \sum_{j=1}^{p}\beta^{2}_{j}$ to the objective function:
$$\min(\text{SSE} + \lambda \sum_{j=1}^{p}\beta^{2}_{j})$$
The penalty is based on the (squared) $L^2$, or Euclidean, norm of the coefficient vector.
The penalty can take on different values depending on the tuning parameter $\lambda$.
If $\lambda = 0$, we recover OLS regression.
As $\lambda \rightarrow \infty$ the penalty becomes large and forces the coefficients toward zero.
**Important to note**: ridge shrinks coefficients toward zero but never exactly to zero, so it does not perform feature selection.
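As a quick sanity check, here is a minimal sketch (on the built-in `mtcars` data with an arbitrary $\lambda$, not part of the Ames example) showing that even a heavily penalised ridge fit keeps every coefficient nonzero:
```{r, cache=F, message=F}
# Minimal sketch (illustration only): ridge on mtcars with a large lambda
x_demo <- model.matrix(mpg ~ ., mtcars)[, -1]
y_demo <- mtcars$mpg
ridge_demo <- glmnet::glmnet(x_demo, y_demo, alpha = 0, lambda = 10)
# count coefficients that are exactly zero; for ridge we expect none
sum(as.matrix(coef(ridge_demo)) == 0)
```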
The following slide illustrates the dynamics of the ridge coefficient estimates as the tuning parameter approaches zero.
---
```{r, regression_5, cache=F, message=F, echo=F, fig.retina = 2, fig.width = 10, fig.height = 8, fig.align = 'center'}
boston_train_x <- model.matrix(cmedv ~ ., pdp::boston)[, -1]
boston_train_y <- pdp::boston$cmedv
# model
boston_ridge <- glmnet::glmnet(
x = boston_train_x,
y = boston_train_y,
alpha = 0
)
lam <- boston_ridge$lambda %>%
as.data.frame() %>%
mutate(penalty = boston_ridge$a0 %>% names()) %>%
rename(lambda = ".")
results <- boston_ridge$beta %>%
as.matrix() %>%
as.data.frame() %>%
rownames_to_column() %>%
gather(penalty, coefficients, -rowname) %>%
left_join(lam)
result_labels <- results %>%
group_by(rowname) %>%
filter(lambda == min(lambda)) %>%
ungroup() %>%
top_n(5, wt = abs(coefficients)) %>%
mutate(var = paste0("x", 1:5))
ggplot() +
geom_line(data = results, aes(lambda, coefficients, group = rowname, color = rowname), show.legend = FALSE) +
scale_x_log10() +
geom_text(data = result_labels, aes(lambda, coefficients, label = var, color = rowname), nudge_x = -.06, show.legend = FALSE)
```
---
## Lasso
The lasso is similar to the ridge penalty; we swap the $L^2$ norm for the $L^1$ norm: $\lambda \sum_{j=1}^{p}|\beta_{j}|$
$$\min(\text{SSE} + \lambda \sum_{j=1}^{p}|\beta_{j}|)$$
The ridge penalty pushes coefficients to approximately, but never exactly, zero.
The lasso pushes some coefficients all the way to zero.
Automated **feature selection**!
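A minimal sketch makes the contrast visible (again on `mtcars`, with an arbitrary illustrative $\lambda$): the lasso zeroes out some coefficients entirely.
```{r, cache=F, message=F}
# Minimal sketch (illustration only): lasso on mtcars
x_demo <- model.matrix(mpg ~ ., mtcars)[, -1]
y_demo <- mtcars$mpg
lasso_demo <- glmnet::glmnet(x_demo, y_demo, alpha = 1)
# coefficients at lambda = 1: some entries should be exactly zero
coef(lasso_demo, s = 1)
```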
Look at the figure on the following slide. Can you see the difference?
---
```{r, regression_6, cache=F, message=F, echo=F, fig.retina = 2, fig.width = 10, fig.height = 8, fig.align = 'center'}
boston_train_x <- model.matrix(cmedv ~ ., pdp::boston)[, -1]
boston_train_y <- pdp::boston$cmedv
# model
boston_lasso <- glmnet::glmnet(
x = boston_train_x,
y = boston_train_y,
alpha = 1
)
lam <- boston_lasso$lambda %>%
as.data.frame() %>%
mutate(penalty = boston_lasso$a0 %>% names()) %>%
rename(lambda = ".")
results <- boston_lasso$beta %>%
as.matrix() %>%
as.data.frame() %>%
rownames_to_column() %>%
gather(penalty, coefficients, -rowname) %>%
left_join(lam)
result_labels <- results %>%
group_by(rowname) %>%
filter(lambda == min(lambda)) %>%
ungroup() %>%
top_n(5, wt = abs(coefficients)) %>%
mutate(var = paste0("x", 1:5))
ggplot() +
geom_line(data = results, aes(lambda, coefficients, group = rowname, color = rowname), show.legend = FALSE) +
scale_x_log10() +
geom_text(data = result_labels, aes(lambda, coefficients, label = var, color = rowname), nudge_x = -.05, show.legend = FALSE)
```
---
## Elastic nets
Generalisation of ridge and lasso penalties:
$$\min(\text{SSE} + \lambda_1 \sum_{j=1}^{p}\beta^{2}_{j} + \lambda_2 \sum_{j=1}^{p}|\beta_{j}|)$$
The lasso performs feature selection, **but** when two features are strongly correlated, one may be pushed to zero while the other remains.
The ridge penalty is more effective at handling correlated features.
The elastic net combines aspects of both methods.
If you are interested in these topics, I suggest you read more on them in **ISLR**.
Penalised models are easy to implement, and the results are often surprisingly good.
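As a minimal sketch (illustrative only, with an arbitrary mixing value), an elastic net is just `glmnet` with `alpha` strictly between 0 and 1, and `cv.glmnet` can tune $\lambda$ for that mixture:
```{r, cache=F, message=F}
# Minimal sketch (illustration only): elastic net on mtcars with alpha = 0.5
x_demo <- model.matrix(mpg ~ ., mtcars)[, -1]
y_demo <- mtcars$mpg
enet_demo <- glmnet::cv.glmnet(x_demo, y_demo, alpha = 0.5)
# coefficients at the lambda with minimum CV error
coef(enet_demo, s = "lambda.min")
```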
---
```{r, regression_7, cache=F, message=F, echo=F, fig.retina = 2, fig.width = 10, fig.height = 8, fig.align = 'center'}
# model
boston_elastic <- glmnet::glmnet(
x = boston_train_x,
y = boston_train_y,
alpha = .2
)
lam <- boston_elastic$lambda %>%
as.data.frame() %>%
mutate(penalty = boston_elastic$a0 %>% names()) %>%
rename(lambda = ".")
results <- boston_elastic$beta %>%
as.matrix() %>%
as.data.frame() %>%
rownames_to_column() %>%
gather(penalty, coefficients, -rowname) %>%
left_join(lam)
result_labels <- results %>%
group_by(rowname) %>%
filter(lambda == min(lambda)) %>%
ungroup() %>%
top_n(5, wt = abs(coefficients)) %>%
mutate(var = paste0("x", 1:5))
ggplot() +
geom_line(data = results, aes(lambda, coefficients, group = rowname, color = rowname), show.legend = FALSE) +
scale_x_log10() +
geom_text(data = result_labels, aes(lambda, coefficients, label = var, color = rowname), nudge_x = -.05, show.legend = FALSE)
```
---
## glmnet
We will be using `glmnet` to conduct our regularised regression analysis.
This package is extremely efficient and fast. It utilises Fortran code.
For those who like Python, one can use `Scikit-Learn` modules.
There are also other R packages available (e.g. `h2o`, `elasticnet` and `penalized`).
We have to do some basic transformations to use this package.
```{r, cache=F, message=F}
# Create training feature matrices
# we use model.matrix(...)[, -1] to discard the intercept
X <- model.matrix(Sale_Price ~ ., ames_train)[, -1]
# transform y with log transformation
Y <- log(ames_train$Sale_Price)
```
---
## glmnet
Let us become more familiar with the different components of this package.
`alpha` tells **glmnet** which method to implement.
- `alpha` = 0: ridge penalty
- `alpha` = 1: lasso penalty
- 0 < `alpha` < 1: elastic net model
**glmnet** does two things automatically that one needs to be aware of:
1. It standardises the features
2. It fits models across a wide range of $\lambda$ values (as the sketch below shows)
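A quick check of the second point, using the `X` and `Y` objects we created earlier (a sketch, not part of the tuning workflow): a single call to `glmnet()` fits the entire regularisation path.
```{r, cache=F, message=F}
# One glmnet() call fits a whole sequence of models, one per lambda
path_demo <- glmnet::glmnet(X, Y, alpha = 0)
length(path_demo$lambda)  # 100 lambda values by default
dim(path_demo$beta)       # one column of coefficients per lambda
```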
---
## Parameter tuning
$\lambda$ is the tuning parameter, and it helps prevent overfitting the training data.
To identify the optimal $\lambda$ value, we can use $k$-fold cross-validation.
```{r, cache=F, message=F}
# Apply CV ridge regression to Ames data
ridge <- cv.glmnet(
x = X,
y = Y,
alpha = 0)
# Apply CV lasso regression to Ames data
lasso <- cv.glmnet(
x = X,
y = Y,
alpha = 1)
```
Now let's plot the result...
---
```{r, tuning, cache=F, message=F, echo=F, fig.retina = 2, fig.width = 10, fig.height = 8, fig.align = 'center'}
# plot results
par(mfrow = c(1, 2))
plot(ridge, main = "Ridge penalty\n\n")
plot(lasso, main = "Lasso penalty\n\n")
```
---
```{r, cache=F, message=F}
# Ridge model
min(ridge$cvm) # minimum MSE
ridge$lambda.min # lambda for this min MSE
# Lasso model
min(lasso$cvm) # minimum MSE
lasso$lambda.min # lambda for this min MSE
lasso$nzero[lasso$lambda == lasso$lambda.min] # number of nonzero coefficients at the minimum-MSE lambda
```
---
```{r, ridge_lasso, cache=F, message=F, echo=F, fig.retina = 2, fig.width = 10, fig.height = 8, fig.align = 'center'}
# Ridge model
ridge_min <- glmnet(
x = X,
y = Y,
alpha = 0
)
# Lasso model
lasso_min <- glmnet(
x = X,
y = Y,
alpha = 1
)
par(mfrow = c(1, 2))
# plot ridge model
plot(ridge_min, xvar = "lambda", main = "Ridge penalty\n\n")
abline(v = log(ridge$lambda.min), col = "red", lty = "dashed")
abline(v = log(ridge$lambda.1se), col = "blue", lty = "dashed")
# plot lasso model
plot(lasso_min, xvar = "lambda", main = "Lasso penalty\n\n")
abline(v = log(lasso$lambda.min), col = "red", lty = "dashed")
abline(v = log(lasso$lambda.1se), col = "blue", lty = "dashed")
```
---
```{r, cache=F, message=F, echo = FALSE}
set.seed(123)
cv_glmnet <- train(
x = X,
y = Y,
method = "glmnet",
preProc = c("zv", "center", "scale"),
trControl = trainControl(method = "cv", number = 10),
tuneLength = 10
)
```
```{r, cache=F, message=F}
# model with lowest RMSE
cv_glmnet$bestTune
# results for model with lowest RMSE
cv_glmnet$results %>%
filter(alpha == cv_glmnet$bestTune$alpha, lambda == cv_glmnet$bestTune$lambda)
# predict sales price on training data
pred <- predict(cv_glmnet, X)
# compute RMSE of transformed predicted
RMSE(exp(pred), exp(Y))
# RMSE of multiple linear regression was 26098.00
```
---
```{r, vip, cache=F, message=F, echo=F, fig.retina = 2, fig.width = 10, fig.height = 8, fig.align = 'center'}
vip(cv_glmnet, num_features = 20, geom = "point")
```