---
title: "Data Science for Economics"
subtitle: "Lecture 13: Decision Trees Contd."
author: "Dawie van Lill"
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
  xaringan::moon_reader:
    css: xaringan-themer.css
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
```{r xaringan-themer, include=FALSE, warning=FALSE}
library(xaringanthemer)
style_duo_accent(primary_color = "#035AA6", secondary_color = "#03A696")
```
## Packages and topics
Here are the packages that we will be using for this session.
```{r, cache=F, message=F}
if (!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, ggplot2, rpart, caret, rpart.plot,
               vip, pdp, doParallel, foreach, tidyr,
               ipred, ranger, gbm, xgboost, AmesHousing, ggrepel)
```
New topics for today
1. Random forests
2. Gradient boosting
---
## Random forests
Random forests are a modification of bagged decision trees.
Build large collection of de-correlated trees to improve predictive performance.
When bagging trees, a problem still exists: although the model building steps are independent, the trees themselves are not, because every tree considers the full set of features at each split.
This characteristic is known as **tree correlation**.
Random forests help to reduce tree correlation.
Injects more randomness into the tree-growing process.
Uses **split-variable randomization**
- Each time a split is performed, the search for the split variable is limited to a random subset of $m_{try}$ of the original $p$ features
- Default values are $m_{try} = \frac{p}{3}$ (regression) and $m_{try} = \sqrt{p}$ (classification), as in the sketch below
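A minimal sketch of these defaults, assuming a hypothetical data set with $p = 80$ features (purely illustrative, not part of the original slides):
```{r, eval=FALSE}
p <- 80           # hypothetical number of features
floor(p / 3)      # default m_try for regression: 26
floor(sqrt(p))    # default m_try for classification: 8
```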
---
## Random forest algorithm
```{r, eval=FALSE}
1. Given a training data set
2. Select number of trees to build (n_trees)
3. for i = 1 to n_trees do
4. | Generate a bootstrap sample of the original data
5. | Grow a regression/classification tree to the bootstrapped data
6. | for each split do
7. | | Select m_try variables at random from all p variables
8. | | Pick the best variable/split-point among the m_try
9. | | Split the node into two child nodes
10. | end
11. | Use typical tree model stopping criteria to determine when a
| tree is complete (but do not prune)
12. end
13. Output ensemble of trees
```
When $m_{try} = p$, we have bagging.
The algorithm randomly selects a bootstrap sample to train on **and** a random sample of features to use at each split.
This lessens tree correlation and increases predictive power.
---
## Ames housing example (OOB)
```{r, message=F}
# Create training (70%) set for the Ames housing data.
set.seed(123)
ames <- AmesHousing::make_ames()
split <- rsample::initial_split(ames, prop = 0.7,
                                strata = "Sale_Price")
ames_train <- rsample::training(split)
# number of features
n_features <- length(setdiff(names(ames_train), "Sale_Price"))
# train a default random forest model
ames_rf1 <- ranger(
  Sale_Price ~ .,
  data = ames_train,
  mtry = floor(n_features / 3),
  respect.unordered.factors = "order",
  seed = 123
)
# get OOB RMSE
(default_rmse <- sqrt(ames_rf1$prediction.error))
```
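As a quick sanity check, one could also compute the RMSE on the held-out 30%. This is a sketch rather than part of the original example; `ames_test` is created here from the same split.
```{r, message=F, eval=F}
# create the corresponding test set and compute test RMSE
ames_test <- rsample::testing(split)
pred <- predict(ames_rf1, data = ames_test)
sqrt(mean((pred$predictions - ames_test$Sale_Price)^2))
```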
---
## Hyperparameters
There are several tunable hyperparameters that we can consider in this model.
The main hyperparameters to consider include:
1. The number of trees in the forest
2. The number of features to consider at any given split: $m_{try}$
3. The complexity of each tree
4. The sampling scheme
5. The splitting rule to use during tree construction
We will quickly discuss each of these hyperparameters.
Grid search can also be used to find the best combination of hyperparameters.
We will briefly touch on grid search.
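As an illustration, here is a minimal sketch of a full Cartesian grid search over several of these hyperparameters, scored by OOB RMSE. The grid values are illustrative choices, not recommendations.
```{r, eval=FALSE}
# illustrative hyperparameter grid
hyper_grid <- expand.grid(
  mtry            = floor(n_features * c(.1, .25, .333, .5)),
  min.node.size   = c(1, 5, 10),
  replace         = c(TRUE, FALSE),
  sample.fraction = c(.5, .8, 1),
  rmse            = NA
)
# fit a random forest for each combination and record the OOB RMSE
for (i in seq_len(nrow(hyper_grid))) {
  fit <- ranger(
    formula         = Sale_Price ~ .,
    data            = ames_train,
    num.trees       = n_features * 10,
    mtry            = hyper_grid$mtry[i],
    min.node.size   = hyper_grid$min.node.size[i],
    replace         = hyper_grid$replace[i],
    sample.fraction = hyper_grid$sample.fraction[i],
    respect.unordered.factors = "order",
    verbose         = FALSE,
    seed            = 123
  )
  hyper_grid$rmse[i] <- sqrt(fit$prediction.error)
}
# inspect the best-performing combinations
head(hyper_grid[order(hyper_grid$rmse), ])
```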
---
## Number of trees
The number of trees needs to be sufficiently large to stabilize the error rate.
```{r, hyper_1, eval=T, cache=F, message=F, echo=F, fig.retina = 2, fig.width = 9, fig.height = 5, fig.align = 'center'}
# number of features
n_features <- ncol(ames_train) - 1
# tuning grid
tuning_grid <- expand.grid(
trees = seq(10, 1000, by = 20),
rmse = NA
)
for(i in seq_len(nrow(tuning_grid))) {
# Fit a random forest
fit <- ranger(
formula = Sale_Price ~ .,
data = ames_train,
num.trees = tuning_grid$trees[i],
mtry = floor(n_features / 3),
respect.unordered.factors = 'order',
verbose = FALSE,
seed = 123
)
# Extract OOB RMSE
tuning_grid$rmse[i] <- sqrt(fit$prediction.error)
}
ggplot(tuning_grid, aes(trees, rmse)) +
geom_line(size = 1) +
ylab("OOB Error (RMSE)") +
xlab("Number of trees")
```
A good rule of thumb is to start with 10 times the number of features.
---
## Split-variable randomization
Hyperparameter $m_{try}$ controls split-variable randomization.
```{r, hyper_2, eval = T, cache=F, message=F, echo=F, fig.retina = 2, fig.width = 9, fig.height = 5, fig.align = 'center'}
tuning_grid <- expand.grid(
trees = seq(10, 1000, by = 20),
mtry = floor(c(seq(2, 80, length.out = 5), 26)),
rmse = NA
)
for(i in seq_len(nrow(tuning_grid))) {
fit <- ranger(
formula = Sale_Price ~ .,
data = ames_train,
num.trees = tuning_grid$trees[i],
mtry = tuning_grid$mtry[i],
respect.unordered.factors = 'order',
verbose = FALSE,
seed = 123
)
tuning_grid$rmse[i] <- sqrt(fit$prediction.error)
}
labels <- tuning_grid %>%
filter(trees == 990) %>%
mutate(mtry = as.factor(mtry))
tuning_grid %>%
mutate(mtry = as.factor(mtry)) %>%
ggplot(aes(trees, rmse, color = mtry)) +
geom_line(size = 1, show.legend = FALSE) +
ggrepel::geom_text_repel(data = labels, aes(trees, rmse, label = mtry), nudge_x = 50, show.legend = FALSE) +
ylab("OOB Error (RMSE)") +
xlab("Number of trees")
```
With few relevant predictors, a higher $m_{try}$ tends to perform better; with many relevant predictors, a lower $m_{try}$ is often sufficient.
---
## Tree complexity
We can control the depth and complexity of individual trees; **node size** is the most common way to control complexity.
```{r, hyper_3, eval = T, cache=F, message=F, echo=F, fig.retina = 2, fig.width = 9, fig.height = 5, fig.align = 'center'}
tuning_grid <- expand.grid(
min.node.size = 1:20,
run_time = NA,
rmse = NA
)
for(i in seq_len(nrow(tuning_grid))) {
fit_time <- system.time({
fit <- ranger(
formula = Sale_Price ~ .,
data = ames_train,
num.trees = 1000,
mtry = 26,
min.node.size = tuning_grid$min.node.size[i],
respect.unordered.factors = 'order',
verbose = FALSE,
seed = 123
)
})
tuning_grid$run_time[i] <- fit_time[[3]]
tuning_grid$rmse[i] <- sqrt(fit$prediction.error)
}
min_node_size <- tuning_grid %>%
mutate(
error_first = first(rmse),
runtime_first = first(run_time),
`Error Growth` = (rmse / error_first) - 1,
`Run Time Reduction` = (run_time / runtime_first) - 1
)
p1 <- ggplot(min_node_size, aes(min.node.size, `Error Growth`)) +
geom_smooth(size = 1, se = FALSE, color = "black") +
scale_y_continuous("Percent growth in error estimate", labels = scales::percent) +
xlab("Minimum node size") +
ggtitle("A) Impact to error estimate")
p2 <- ggplot(min_node_size, aes(min.node.size, `Run Time Reduction`)) +
geom_smooth(size = 1, se = FALSE, color = "black") +
scale_y_continuous("Reduction in run time", labels = scales::percent) +
xlab("Minimum node size") +
ggtitle("B) Impact to run time")
gridExtra::grid.arrange(p1, p2, nrow = 1)
```
If computation time is a concern, try increasing the node size!
---
## Sampling scheme
We can also adjust the sample size and decide whether to sample with or without replacement.
```{r, hyper_4, eval = T, cache=F, message=F, echo=F, fig.retina = 2, fig.width = 9, fig.height = 5, fig.align = 'center'}
tuning_grid <- expand.grid(
sample.fraction = seq(.05, .95, by = .05),
replace = c(TRUE, FALSE),
rmse = NA
)
for(i in seq_len(nrow(tuning_grid))) {
fit <- ranger(
formula = Sale_Price ~ .,
data = ames_train,
num.trees = 1000,
mtry = 26,
sample.fraction = tuning_grid$sample.fraction[i],
replace = tuning_grid$replace[i],
respect.unordered.factors = 'order',
verbose = FALSE,
seed = 123
)
tuning_grid$rmse[i] <- sqrt(fit$prediction.error)
}
tuning_grid %>%
ggplot(aes(sample.fraction, rmse, color = replace)) +
geom_line(size = 1) +
scale_x_continuous("Sample Fraction", breaks = seq(.1, .9, by = .1), labels = scales::percent) +
ylab("OOB Error (RMSE)") +
scale_color_discrete("Sample with Replacement") +
theme(legend.position = c(0.8, 0.85),
legend.key = element_blank(),
legend.background = element_blank())
```
---
## Gradient boosting
GBMs build an ensemble of **shallow trees in sequence** with each tree learning and improving on the previous one.
The main idea of boosting is to add new models to the ensemble **sequentially**.
```{r terminology, echo=F, fig.align = 'center', out.width = "70%"}
knitr::include_graphics("https://bradleyboehmke.github.io/HOML/images/boosted-trees-process.png")
```
Boosting attacks the bias-variance tradeoff by starting with a *weak* model.
It then boosts performance by sequentially building new trees.
- Each new tree in the sequence tries to fix where the previous trees made the biggest mistakes.
- It focuses on the training rows with the largest prediction errors.
---
```{r, hyper_5, eval = T, cache=F, message=F, echo=F, fig.retina = 2, fig.width = 10, fig.height = 8, fig.align = 'center'}
# Simulate sine wave data
set.seed(1112) # for reproducibility
df <- tibble::tibble(
x = seq(from = 0, to = 2 * pi, length = 1000),
y = sin(x) + rnorm(length(x), sd = 0.5),
truth = sin(x)
)
# Function to boost `rpart::rpart()` trees
rpartBoost <- function(x, y, data, num_trees = 100, learn_rate = 0.1, tree_depth = 6) {
x <- data[[deparse(substitute(x))]]
y <- data[[deparse(substitute(y))]]
G_b_hat <- matrix(0, nrow = length(y), ncol = num_trees + 1)
r <- y
for(tree in seq_len(num_trees)) {
g_b_tilde <- rpart(r ~ x, control = list(cp = 0, maxdepth = tree_depth))
g_b_hat <- learn_rate * predict(g_b_tilde)
G_b_hat[, tree + 1] <- G_b_hat[, tree] + matrix(g_b_hat)
r <- r - g_b_hat
colnames(G_b_hat) <- paste0("tree_", c(0, seq_len(num_trees)))
}
cbind(df, as.data.frame(G_b_hat)) %>%
gather(tree, prediction, starts_with("tree")) %>%
mutate(tree = stringr::str_extract(tree, "\\d+") %>% as.numeric())
}
# Plot boosted tree sequence
rpartBoost(x, y, data = df, num_trees = 2^10, learn_rate = 0.05, tree_depth = 1) %>%
filter(tree %in% c(0, 2^c(0:10))) %>%
ggplot(aes(x, prediction)) +
ylab("y") +
geom_point(data = df, aes(x, y), alpha = .1) +
geom_line(data = df, aes(x, truth), color = "blue") +
geom_line(colour = "red", size = 1) +
facet_wrap(~ tree, nrow = 3)
```
---
## Gradient descent
The *gradient* in gradient boosting machine comes from the fact that this procedure can be generalized to loss functions other than SSE.
Gradient boosting is considered a **gradient descent** algorithm.
The idea is to tweak parameter(s) iteratively to minimize a cost function.
Gradient descent measures the local gradient of the loss (cost) function for a given set of parameters $\Theta$ and takes steps in the direction of the descending gradient.
Once the gradient is zero, we have reached a minimum.
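A minimal sketch of this idea for the simple quadratic loss $f(\theta) = \theta^2 + 3$ used in the figure on the next slide; the starting value and step size are arbitrary choices.
```{r, eval=FALSE}
# gradient descent on f(theta) = theta^2 + 3, whose gradient is 2 * theta
theta      <- -4     # arbitrary starting value
learn_rate <- 0.2    # step size (learning rate)
for (step in 1:25) {
  gradient <- 2 * theta                       # local gradient of the loss
  theta    <- theta - learn_rate * gradient   # step in the descending direction
}
theta          # close to the minimizer, theta = 0
theta^2 + 3    # close to the minimum loss value, 3
```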
---
```{r, grad_1, eval = T, warning = F, cache=F, message=F, echo=F, fig.retina = 2, fig.width = 10, fig.height = 8, fig.align = 'center'}
# create data to plot
x <- seq(-5, 5, by = .05)
y <- x^2 + 3
df <- data.frame(x, y)
step <- 5
step_size <- .2
for(i in seq_len(18)) {
next_step <- max(step) + round(diff(range(max(step), which.min(df$y))) * step_size, 0)
step <- c(step, next_step)
next
}
steps <- df[step, ] %>%
mutate(x2 = lag(x), y2 = lag(y)) %>%
dplyr::slice(1:18)
# plot
ggplot(df, aes(x, y)) +
geom_line(size = 1.5, alpha = .5) +
theme_classic() +
scale_y_continuous("Loss function", limits = c(0, 30)) +
xlab(expression(theta)) +
geom_segment(data = df[c(5, which.min(df$y)), ], aes(x = x, y = y, xend = x, yend = -Inf), lty = "dashed") +
geom_point(data = filter(df, y == min(y)), aes(x, y), size = 4, shape = 21, fill = "yellow") +
geom_point(data = steps, aes(x, y), size = 3, shape = 21, fill = "blue", alpha = .5) +
geom_curve(data = steps, aes(x = x, y = y, xend = x2, yend = y2), curvature = 1, lty = "dotted") +
theme(
axis.ticks = element_blank(),
axis.text = element_blank()
) +
annotate("text", x = df[5, "x"], y = 1, label = "Initial value", hjust = -0.1, vjust = .8) +
annotate("text", x = df[which.min(df$y), "x"], y = 1, label = "Minimium", hjust = -0.1, vjust = .8) +
annotate("text", x = df[5, "x"], y = df[5, "y"], label = "Learning step", hjust = -.8, vjust = 0)
```
---
## Gradient descent
Gradient descent can be performed on any loss function that is differentiable.
Allows GBMs to optimize different loss functions as desired.
An important parameter in gradient descent is the size of the steps.
This is controlled by the **learning rate**.
If the learning rate is too small, then the algorithm will take many iterations (steps) to find the minimum.
On the other hand, if the learning rate is too high, you might jump across the minimum and end up further away than when you started.
---
```{r, grad_2, eval = T, warning = F, cache=F, message=F, echo=F, fig.retina = 2, fig.width = 10, fig.height = 8, fig.align = 'center'}
# create too small of a learning rate
step <- 5
step_size <- .05
for(i in seq_len(10)) {
next_step <- max(step) + round(diff(range(max(step), which.min(df$y))) * step_size, 0)
step <- c(step, next_step)
next
}
too_small <- df[step, ] %>%
mutate(x2 = lag(x), y2 = lag(y))
# plot
p1 <- ggplot(df, aes(x, y)) +
geom_line(size = 1.5, alpha = .5) +
theme_classic() +
scale_y_continuous("Loss function", limits = c(0, 30)) +
xlab(expression(theta)) +
geom_segment(data = too_small[1, ], aes(x = x, y = y, xend = x, yend = -Inf), lty = "dashed") +
geom_point(data = too_small, aes(x, y), size = 3, shape = 21, fill = "blue", alpha = .5) +
geom_curve(data = too_small, aes(x = x, y = y, xend = x2, yend = y2), curvature = 1, lty = "dotted") +
theme(
axis.ticks = element_blank(),
axis.text = element_blank()
) +
annotate("text", x = df[5, "x"], y = 1, label = "Start", hjust = -0.1, vjust = .8) +
ggtitle("b) too small")
# create too large of a learning rate
too_large <- df[round(which.min(df$y) * (1 + c(-.9, -.6, -.2, .3)), 0), ] %>%
mutate(x2 = lag(x), y2 = lag(y))
# plot
p2 <- ggplot(df, aes(x, y)) +
geom_line(size = 1.5, alpha = .5) +
theme_classic() +
scale_y_continuous("Loss function", limits = c(0, 30)) +
xlab(expression(theta)) +
geom_segment(data = too_large[1, ], aes(x = x, y = y, xend = x, yend = -Inf), lty = "dashed") +
geom_point(data = too_large, aes(x, y), size = 3, shape = 21, fill = "blue", alpha = .5) +
geom_curve(data = too_large, aes(x = x, y = y, xend = x2, yend = y2), curvature = 1, lty = "dotted") +
theme(
axis.ticks = element_blank(),
axis.text = element_blank()
) +
annotate("text", x = too_large[1, "x"], y = 1, label = "Start", hjust = -0.1, vjust = .8) +
ggtitle("a) too big")
gridExtra::grid.arrange(p2, p1, nrow = 1)
```
---
## Gradient descent
Not all cost functions are convex (i.e., bowl shaped).
There may be local minima, plateaus, and other irregular terrain in the loss function that make finding the global minimum difficult.
**Stochastic gradient descent** can help us address this problem.
Sample a fraction of the training observations and grow the next tree using that subsample.
There are several hyperparameter tuning options in stochastic gradient boosting.
- Some control gradient descent, while others control tree growing process.
If properly tuned (e.g., with k-fold CV) GBMs can lead to some of the most flexible and accurate predictive models you can build!
---
```{r, grad_3, eval = T, warning = F, cache=F, message=F, echo=F, fig.retina = 2, fig.width = 10, fig.height = 8, fig.align = 'center'}
# create random walk data
set.seed(123)
x <- sample(seq(3, 5, by = .05), 10, replace = TRUE)
set.seed(123)
y <- seq(2, 28, length.out = 10)
random_walk <- data.frame(
x = x,
y = y[order(y, decreasing = TRUE)]
)
optimal <- data.frame(x = 0, y = 0)
# plot
ggplot(df, aes(x, y)) +
coord_polar() +
theme_minimal() +
theme(
axis.ticks = element_blank(),
axis.text = element_blank()
) +
xlab(expression(theta[1])) +
ylab(expression(theta[2])) +
geom_point(data = random_walk, aes(x, y), size = 3, shape = 21, fill = "blue", alpha = .5) +
geom_point(data = optimal, aes(x, y), size = 2, shape = 21, fill = "yellow") +
geom_path(data = random_walk, aes(x, y), lty = "dotted") +
annotate("text", x = random_walk[1, "x"], y = random_walk[1, "y"], label = "Start", hjust = 1, vjust = -1) +
annotate("text", x = optimal[1, "x"], y = optimal[1, "y"], label = "Minimum", hjust = -.2, vjust = 1) +
ylim(c(0, 28)) +
xlim(-5, 5)
```
---
## Basic GBM
A simple GBM model contains two categories of hyperparameters: *boosting hyperparameters* and *tree-specific hyperparameters*.
The two main boosting hyperparameters are:
1. Number of trees
2. Learning rate (shrinkage)
The two main tree hyperparameters are:
1. Tree depth
2. Minimum number of observations in terminal nodes
---
## Ames housing example (GBM)
```{r, message=F, eval = F}
# run a basic GBM model
set.seed(123) # for reproducibility
ames_gbm1 <- gbm(
  formula = Sale_Price ~ .,
  data = ames_train,
  distribution = "gaussian",  # SSE loss function
  n.trees = 5000,
  shrinkage = 0.1,
  interaction.depth = 3,
  n.minobsinnode = 10,
  cv.folds = 10
)
# find index for number trees with minimum CV error
best <- which.min(ames_gbm1$cv.error)
# get MSE and compute RMSE
sqrt(ames_gbm1$cv.error[best])
```
---
## General tuning strategy
Tuning a GBM can require much more strategy than tuning a random forest. A good approach is:
1. Choose a relatively high learning rate. Generally the default value of 0.1 works.
2. Determine the optimum number of trees for this learning rate.
3. Fix tree hyperparameters. Tune learning rate and assess speed vs. performance.
4. Tune tree-specific parameters for the chosen learning rate.
5. Lower the learning rate to assess for any improvements in accuracy.
6. Use final hyperparameter settings and increase CV procedures to get more robust estimates.
This process is worked through in the textbook; see there for a complete example. A sketch of step 3 follows.
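As an example, step 3 could look something like the following sketch, which loops over a few candidate learning rates and records the 10-fold CV RMSE for each. The grid values are illustrative, and the code is not run here.
```{r, message=F, eval = F}
# illustrative learning-rate grid (step 3 of the strategy)
hyper_grid <- expand.grid(
  learning_rate = c(0.3, 0.1, 0.05, 0.01),
  rmse = NA,
  trees = NA
)
for(i in seq_len(nrow(hyper_grid))) {
  set.seed(123)
  m <- gbm(
    formula = Sale_Price ~ .,
    data = ames_train,
    distribution = "gaussian",
    n.trees = 5000,
    shrinkage = hyper_grid$learning_rate[i],
    interaction.depth = 3,
    n.minobsinnode = 10,
    cv.folds = 10
  )
  # CV RMSE and the number of trees at which it occurs
  hyper_grid$rmse[i]  <- sqrt(min(m$cv.error))
  hyper_grid$trees[i] <- which.min(m$cv.error)
}
hyper_grid
```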
---
## Stochastic GBMs
There are a few variants of stochastic gradient boosting, all of which introduce additional hyperparameters:
- Subsample rows before creating each tree
- Subsample columns before creating each tree
- Subsample columns before considering each split in each tree
When adding in a stochastic procedure, you can either include it in step 4 or step 6 of our general tuning strategy from before.
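In **xgboost**, for instance, these three forms of subsampling map (roughly) onto the `subsample`, `colsample_bytree`, and `colsample_bynode` parameters. A hedged sketch, with purely illustrative values:
```{r, eval=FALSE}
params <- list(
  eta = 0.05,
  max_depth = 3,
  subsample = 0.8,         # subsample rows before each tree
  colsample_bytree = 0.5,  # subsample columns before each tree
  colsample_bynode = 0.5   # subsample columns before each split
)
```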
---
## XGBoost
**xgboost** provides the traditional boosting and tree-based hyperparameters.
However, **xgboost** also provides additional hyperparameters that can help reduce the chances of overfitting, leading to less prediction variability and, therefore, improved accuracy.
See sections on *regularisation* and *dropout* with **xgboost**.
Our tuning strategy with **xgboost** is the following.
1. Increase number of trees and tune learning rate with early stopping
2. Tune tree-specific hyperparameters
3. Explore stochastic GBM attributes
4. If substantial overfitting occurs explore regularization hyperparameters
5. If you find hyperparameter values that are substantially different from default settings, be sure to retune the learning rate
6. Obtain final “optimal” model
---
## XGBoost
**xgboost** requires some additional data preparation: we need to encode our categorical variables numerically.
```{r, message=F, eval = F}
library(recipes)
xgb_prep <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_integer(all_nominal()) %>%
  prep(training = ames_train, retain = TRUE) %>%
  juice()
X <- as.matrix(xgb_prep[setdiff(names(xgb_prep), "Sale_Price")])
Y <- xgb_prep$Sale_Price
```
Next we will go through a series of grid searches to find model hyperparameters.
---
## XGBoost
```{r, message=F, eval = F}
set.seed(123)
ames_xgb <- xgb.cv(
  data = X,
  label = Y,
  nrounds = 6000,
  objective = "reg:linear",
  early_stopping_rounds = 50,
  nfold = 10,
  params = list(
    eta = 0.1,
    max_depth = 3,
    min_child_weight = 3,
    subsample = 0.8,
    colsample_bytree = 1.0
  ),
  verbose = 0
)
# minimum test CV RMSE
min(ames_xgb$evaluation_log$test_rmse_mean)
```
Minimum test CV RMSE is about 20488.
---
## XGBoost
Next, we assess if overfitting is limiting our model’s performance by performing a grid search that examines various regularisation parameters.
```{r, message=F, eval = F}
# hyperparameter grid
hyper_grid <- expand.grid(
  eta = 0.01,
  max_depth = 3,
  min_child_weight = 3,
  subsample = 0.5,
  colsample_bytree = 0.5,
  gamma = c(0, 1, 10, 100, 1000),
  lambda = c(0, 1e-2, 0.1, 1, 100, 1000, 10000),
  alpha = c(0, 1e-2, 0.1, 1, 100, 1000, 10000),
  rmse = 0,   # a place to dump RMSE results
  trees = 0   # a place to dump required number of trees
)
```
---
```{r, message=F, eval = F}
# grid search
for(i in seq_len(nrow(hyper_grid))) {
  set.seed(123)
  m <- xgb.cv(
    data = X,
    label = Y,
    nrounds = 4000,
    objective = "reg:linear",
    early_stopping_rounds = 50,
    nfold = 10,
    verbose = 0,
    params = list(
      eta = hyper_grid$eta[i],
      max_depth = hyper_grid$max_depth[i],
      min_child_weight = hyper_grid$min_child_weight[i],
      subsample = hyper_grid$subsample[i],
      colsample_bytree = hyper_grid$colsample_bytree[i],
      gamma = hyper_grid$gamma[i],
      lambda = hyper_grid$lambda[i],
      alpha = hyper_grid$alpha[i]
    )
  )
  hyper_grid$rmse[i] <- min(m$evaluation_log$test_rmse_mean)
  hyper_grid$trees[i] <- m$best_iteration
}
```
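After the loop finishes, the results can be inspected by sorting the grid on RMSE; a short sketch (the column names follow the grid defined above):
```{r, message=F, eval = F}
# inspect the best-performing regularisation settings
hyper_grid %>%
  arrange(rmse) %>%
  head(10)
```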
---
## XGBoost final model
```{r, message=F, eval = F}
# optimal parameter list
params <- list(
  eta = 0.01,
  max_depth = 3,
  min_child_weight = 3,
  subsample = 0.5,
  colsample_bytree = 0.5
)
# train final model
xgb.fit.final <- xgboost(
  params = params,
  data = X,
  label = Y,
  nrounds = 3944,
  objective = "reg:linear",
  verbose = 0
)
```
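Since the **vip** package was loaded at the start of the session, one natural follow-up is to inspect which features drive the final model's predictions. A minimal sketch, not run here:
```{r, message=F, eval = F}
# variable importance for the final xgboost model
vip::vip(xgb.fit.final, num_features = 10)
```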