---
output:
  pdf_document: default
  bookdown::gitbook:
    lib_dir: "book_assets"
    includes:
      in_header: google_analytics.html
  html_document: default
---
# Fitting simple models with R
In this chapter we will focus on how to compute the measures of central tendency and variability that were covered in the previous chapter. Most of these can be computed using a built-in R function, but we will show how to do them manually in order to give some intuition about how they work.
```{r echo=FALSE,warning=FALSE,message=FALSE}
library(tidyverse)
library(ggplot2)
library(cowplot)
library(knitr)
rm('NHANES')
library(NHANES)
set.seed(123456)
opts_chunk$set(tidy.opts=list(width.cutoff=80))
options(tibble.width = 60)
options(digits = 7) # reset from previous chapter

# drop duplicated IDs within the NHANES dataset
NHANES <- NHANES %>%
  dplyr::distinct(ID, .keep_all = TRUE)

NHANES_adult <- NHANES %>%
  drop_na(Height) %>%
  subset(subset = Age >= 18)

# set up a colorblind-friendly palette (with grey)
# from http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/#a-colorblind-friendly-palette
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
```
## Mean
The mean is defined as the sum of values divided by the number of values being summed:
$$
\bar{X} = \frac{\sum_{i=1}^{n}x_i}{n}
$$
Let's say that we want to obtain the mean height for adults in the NHANES database (contained in the variable `Height`). We would sum the individual heights (using the `sum()` function) and then divide by the number of values:
```{r}
sum(NHANES$Height)/length(NHANES$Height)
```
This returns the value NA, because there are missing values for some rows, and the `sum()` function doesn't automatically handle those. To address this, we could filter the data frame using `drop_na()` to drop rows with NA values for this variable:
```{r}
height_noNA <- NHANES %>%
drop_na(Height) %>%
pull(Height)
sum(height_noNA)/length(height_noNA)
```
There is, of course, a built-in function in R called `mean()` that will compute the mean. Like the `sum()` function, `mean()` will return NA if there are any NA values in the data:
```{r}
mean(NHANES$Height)
```
The `mean()` function includes an optional argument called `na.rm` that will remove NA values if it is set to TRUE:
```{r}
mean(NHANES$Height, na.rm=TRUE)
```
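Since this book works with the tidyverse throughout, it's also worth noting that the same computation can be expressed as a pipeline, using the `summarise()` function from `dplyr` (here `mean_height` is just a name we choose for the result):

```r
# compute the mean within a tidyverse pipeline
NHANES %>%
  summarise(mean_height = mean(Height, na.rm = TRUE))
```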
## Median
The median is the middle value after sorting the entire set of values. Let's use the cleaned-up `height_noNA` variable created above to determine this for the NHANES height data. First we sort the data in order of their values:
```{r}
height_sorted <- sort(height_noNA)
```
Next we find the median value. If there is an odd number of values in the list, then this is just the value in the middle, whereas if the number of values is even then we take the average of the two middle values. We can determine whether the number of items is odd or even by dividing the length by two and seeing whether there is a remainder; we do this using the `%%` operator, known as the *modulo* operator, which returns the remainder:
```{r}
5 %% 2
```
Here we will test whether the remainder is equal to one; if it is, then we will take the middle value, otherwise we will take the average of the two middle values. We can do this using an if/else structure, which executes different code depending on whether a logical condition is true:
```
if (logical value) {
functions to perform if logical value is true
} else {
functions to perform if logical value is false
}
```
Let's do this with our data. To find the middle value when the number of items is odd, we will divide the length by two and then round up, using the `ceiling()` function:
```{r}
if (length(height_sorted) %% 2 == 1) {
  # length of vector is odd, so take the middle value
  median_height <-
    height_sorted[ceiling(length(height_sorted) / 2)]
} else {
  # length is even, so average the two middle values
  median_height <-
    (height_sorted[length(height_sorted) / 2] +
       height_sorted[1 + length(height_sorted) / 2]) / 2
}

median_height
```
We can compare this to the result from the built-in median function:
```{r}
median(height_noNA)
```
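To convince ourselves that the even-length branch also works, we can try the same arithmetic by hand on a small toy vector (here `x` is just an example we made up, not part of the NHANES data):

```r
# an even-length example: the two middle values are 3 and 5,
# so the median should be their average, 4
x <- c(1, 3, 5, 7)
(x[length(x) / 2] + x[1 + length(x) / 2]) / 2
median(x)
```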
## Mode
The mode is the most frequent value that occurs in a variable. R has a function called `mode()`, but if you look at its help page you will see that it doesn't actually compute the mode. In fact, R doesn't have a built-in function to compute the mode, so we need to create one. Let's start with some toy data:
```{r}
mode_test = c('a', 'b', 'b', 'c', 'c', 'c')
mode_test
```
We can see by eye that the mode is "c", since it occurs more often than the other values. To find it computationally, we first create a table with the counts for each value, using the `table()` function:
```{r}
mode_table <- table(mode_test)
mode_table
```
Now we need to find the maximum value. We do this by comparing each value to the maximum of the table; this will work even if there are multiple values with the same frequency (i.e. a tie for the mode).
```{r}
table_max <- mode_table[mode_table == max(mode_table)]
table_max
```
This variable is a special kind of value called a *named vector*, and its names contain the value(s) that we are looking for. We can pull out the first name using the `names()` function:
```{r}
my_mode <- names(table_max)[1]
my_mode
```
Let's wrap this up into our own custom function:
```{r}
getmode <- function(v, print_table=FALSE) {
mode_table <- table(v)
if (print_table){
print(kable(mode_table))
}
table_max <- mode_table[mode_table == max(mode_table)]
return(names(table_max))
}
```
We can then apply this to real data. Let's apply this to the `MaritalStatus` variable in the NHANES dataset:
```{r}
getmode(NHANES$MaritalStatus)
```
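Because `getmode()` returns the names of *all* entries tied for the maximum count, it also handles ties. With the toy vector below (a made-up example, not part of the NHANES data), both "a" and "b" occur twice, so both are returned:

```r
# a tie for the mode: 'a' and 'b' each occur twice
getmode(c('a', 'a', 'b', 'b', 'c'))
```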
## Variability
Let's first compute the *variance*, which is the average squared difference between each value and the mean. Let's do this with our cleaned-up version of the height data, but instead of working with the entire dataset, let's take a random sample of 150 individuals:
```{r}
height_sample <- NHANES %>%
drop_na(Height) %>%
sample_n(150) %>%
pull(Height)
```
First we need to obtain the sum of squared errors from the mean. In R, we can square a vector using `**2`:
```{r}
SSE <- sum((height_sample - mean(height_sample))**2)
SSE
```
Then we divide by N - 1 to get the estimated variance:
```{r}
var_est <- SSE/(length(height_sample) - 1)
var_est
```
We can compare this to the built-in `var()` function:
```{r}
var(height_sample)
```
We can get the *standard deviation* by simply taking the square root of the variance:
```{r}
sqrt(var_est)
```
Which is the same value obtained using the built-in `sd()` function:
```{r}
sd(height_sample)
```
## Z-scores
A Z-score is obtained by first subtracting the mean and then dividing by the standard deviation of a distribution. Let's do this for the `height_sample` data.
```{r}
mean_height <- mean(height_sample)
sd_height <- sd(height_sample)
z_height <- (height_sample - mean_height)/sd_height
```
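R also has a built-in function, `scale()`, that performs this centering and standardization for us. It returns a matrix, so we convert back to a plain vector with `as.numeric()` to check that it matches our manual computation:

```r
# compare our manual Z-scores to the built-in scale() function
z_builtin <- as.numeric(scale(height_sample))
all.equal(z_height, z_builtin)
```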
Now let's plot the histogram of Z-scores alongside the histogram for the original values. We will use the `plot_grid()` function from the `cowplot` library to plot the two figures alongside one another. First we need to put the values into a data frame, since `ggplot()` requires the data to be contained in a data frame.
```{r fig.height=4, fig.width=8}
height_df <- data.frame(orig_height=height_sample,
z_height=z_height)
# create individual plots
plot_orig <- ggplot(height_df, aes(orig_height)) +
geom_histogram()
plot_z <- ggplot(height_df, aes(z_height)) +
geom_histogram()
# combine into a single figure
plot_grid(plot_orig, plot_z)
```
You will notice that the shapes of the histograms are similar but not exactly the same. This occurs because the binning is slightly different between the two sets of values. However, if we plot them against one another in a scatterplot, we will see that there is a direct linear relation between the two sets of values:
```{r, fig.width=4, fig.height=4}
ggplot(height_df, aes(orig_height, z_height)) +
geom_point()
```