---
title: "Parallel processing in R"
author: "Jeff Oliver"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
html_document: default
pdf_document:
latex_engine: xelatex
urlcolor: blue
---
[INTRODUCTORY SENTENCE]

#### Learning objectives

1. Install packages for parallel processing in R
2. Write code to automate repetitive tasks
3. Translate iterative `for` loop code into parallel processing

## [DESCRIPTION OR MOTIVATION; 2-4 sentences that would be used for an announcement]
***
## Getting started

<making a new project>

***

## Repeating ourselves

<Something for each row in a data frame>

Do leave-one-out linear regression
```{r}
# Run linear regression on the iris data using leave-one-out error estimation;
# in this example, we use Petal.Length to predict Sepal.Length

# Vector to hold error values
errors <- numeric(nrow(iris))

##### for loop
# The iterative approach, using a for loop
for (i in 1:nrow(iris)) {
  # Estimate the model, leaving out the ith row of data
  one_model <- lm(Sepal.Length ~ Petal.Length, data = iris[-i, ])
  # Use the model to predict the value for that ith row of data
  predicted_fit <- predict(one_model, newdata = iris[i, ])
  # Calculate the difference between observed and predicted values
  errors[i] <- iris$Sepal.Length[i] - predicted_fit[1]
}
# Calculate the mean squared error
mse <- mean(errors^2)
##### lapply
# The lapply vectorized approach. Start by creating a function to do the work;
# the one argument to lm_mse (x) is the number of the row to leave out. Ignore,
# for the moment, the bad practice of referring to objects inside the function
# (the iris data frame and its columns) that are neither passed in nor created
# inside the function.
lm_mse <- function(x) {
  # Estimate the model, leaving out the xth row of data
  one_model <- lm(Sepal.Length ~ Petal.Length, data = iris[-x, ])
  # Use the model to predict the value for that xth row of data
  predicted_fit <- predict(one_model, newdata = iris[x, ])
  # Calculate the difference between observed and predicted values
  error <- iris$Sepal.Length[x] - predicted_fit[1]
  # Send back the error value
  return(error)
}

# Now use that function with lapply to do the work without a for loop
error_list <- lapply(X = 1:nrow(iris),
                     FUN = lm_mse)
# lapply returns a list, so we need to convert to a vector before squaring
mse <- mean(unlist(error_list)^2)
##### parallel processing
library(parallel)
# Use two fewer cores than are available, but always at least one
n <- max(1, detectCores() - 2)
# Set up the cluster
clust <- makeCluster(n)
# Use the function we defined above with parLapply
error_list <- parLapply(cl = clust,
                        X = 1:nrow(iris),
                        fun = lm_mse)
# Stop the cluster
stopCluster(cl = clust)
# parLapply also returns a list
mse <- mean(unlist(error_list)^2)
# Same as above, but keeping track of elapsed time
system.time({
  clust <- makeCluster(n)
  a <- parLapply(cl = clust,
                 X = 1:nrow(iris),
                 fun = function(x) {
                   m <- lm(Sepal.Length ~ Petal.Length, data = iris[-x, ])
                   p <- predict(m, newdata = iris[x, ])
                   e <- iris$Sepal.Length[x] - p[1]
                   return(e)
                 })
  stopCluster(cl = clust)
})
mean(unlist(a)^2)
```
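As a side note, `parSapply()` is the parallel analog of `sapply()`: it behaves like `parLapply()` but simplifies the result to a vector, so the `unlist()` step is unnecessary. A minimal sketch, repeating the leave-one-out function from above (the two-core cluster size is arbitrary):

```r
library(parallel)

# Leave-one-out prediction error for row x (same logic as in the lesson)
lm_mse <- function(x) {
  one_model <- lm(Sepal.Length ~ Petal.Length, data = iris[-x, ])
  iris$Sepal.Length[x] - predict(one_model, newdata = iris[x, ])[1]
}

clust <- makeCluster(2)
# parSapply simplifies the list of results to a numeric vector
errors <- parSapply(cl = clust, X = 1:nrow(iris), FUN = lm_mse)
stopCluster(cl = clust)
mse <- mean(errors^2)
```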
***
## Make it functional

<convert the stuff from the for loop into a function>
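One possible shape for that function is sketched below; the names `loo_mse` and `one_error` and their arguments are placeholders for illustration, not part of the lesson. The idea is to wrap the cluster setup, the `parLapply()` call, and the cleanup so the whole leave-one-out calculation runs from a single call, with `on.exit()` guaranteeing the cluster is stopped even if something fails partway through.

```r
library(parallel)

# Worker function: prediction error for one left-out row. The data are passed
# in explicitly rather than found in the global environment.
one_error <- function(x, data) {
  m <- lm(Sepal.Length ~ Petal.Length, data = data[-x, ])
  data$Sepal.Length[x] - predict(m, newdata = data[x, ])[1]
}

# Wrap setup, parallel work, and cleanup in a single function
loo_mse <- function(df, n_cores = 2) {
  clust <- makeCluster(n_cores)
  # Guarantee the cluster is stopped, even if an error occurs below
  on.exit(stopCluster(clust))
  # Extra arguments to parLapply are passed along to the worker function
  error_list <- parLapply(cl = clust,
                          X = 1:nrow(df),
                          fun = one_error,
                          data = df)
  mean(unlist(error_list)^2)
}

mse <- loo_mse(iris)
```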
***
## "Embarrassingly parallelizable"
```{r}
# Takes about a second
system.time(
  for (i in 1:300) {
    x <- rnorm(n = 10000)
    y <- x + rnorm(n = length(x), sd = 0.1)
    l <- lm(y ~ x)
    s <- summary(l)
  }
)

library(parallel)
n <- max(1, detectCores() - 2)
clust <- makeCluster(n)
# The same 300 independent tasks, farmed out to the cluster; note that X is
# the task index (1:300) and each task generates its own x inside the function
system.time({
  a <- parLapply(cl = clust,
                 X = 1:300,
                 fun = function(i) {
                   x <- rnorm(n = 10000)
                   y <- x + rnorm(n = length(x), sd = 0.1)
                   l <- lm(y ~ x)
                   s <- summary(l)
                 })
})
stopCluster(cl = clust)
```
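One gotcha worth noting: each worker in the cluster starts as a fresh R session, so objects and non-base packages from your own session are not automatically available there. `clusterExport()` copies objects to every worker, and `clusterEvalQ()` runs setup code (such as `library()` calls) on each one. A minimal sketch, where the object name `threshold` is purely for illustration:

```r
library(parallel)

clust <- makeCluster(2)

# An object from our session that the workers will need
threshold <- 0.5

# Copy the object into each worker's global environment
clusterExport(cl = clust, varlist = "threshold")

# Run setup code on every worker, e.g. loading a package there
clusterEvalQ(cl = clust, library(stats))

# The workers can now find threshold
above <- parSapply(cl = clust, X = c(0.2, 0.6, 0.9),
                   FUN = function(x) x > threshold)
stopCluster(cl = clust)
```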
***
## Additional resources

+ An [overview](https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/parallel.html#using-sockets-with-parlapply) of parallel processing in R, with comparisons of different approaches
+ Examples of parallel processing with the [parallel, doParallel, and foreach packages](https://www.r-bloggers.com/2018/09/simple-parallel-processing-in-r/)
+ A [PDF version](https://jcoliver.github.io/learn-r/017-intro-parallel.pdf) of this lesson
***
<a href="index.html">Back to learn-r main page</a>

Questions? e-mail me at <a href="mailto:[email protected]">[email protected]</a>.