---
title: "dplyr grouping"
output:
  html_document:
    df_print: paged
    code_download: TRUE
    toc: true
    toc_depth: 1
editor_options:
  chunk_output_type: inline
---
```{r, setup, include=FALSE}
knitr::opts_chunk$set(
  eval=FALSE, warning=FALSE, error=FALSE
)
```
*The output of most of the R chunks isn't included in the HTML version of the file to keep it to a more reasonable file size. You can run the code in R to see the output.*
This is an [R Markdown](https://rmarkdown.rstudio.com/) document. Follow the link to learn more about R Markdown and the notebook format used during the workshop.
# Setup
```{r, eval=TRUE, message=FALSE}
library(tidyverse)
```
For this session, we're learning three additional dplyr functions: `group_by()`, `summarize()`, and `arrange()`.
## Data
We're using the same data as session 1.
```{r, eval=TRUE}
police <- read_csv("https://raw.githubusercontent.com/nuitrcs/r-tidyverse/main/data/ev_police.csv",
                   col_types=c("location"="c"))
```
# Summarize
We'll start with `summarize()` (or `summarise()` -- the British spelling is accepted). We use `mutate()` when we want the output to have the same length as the input -- when we're operating on the individual elements in a vector and want a value for every row in the data. When we want to condense multiple values down to a single value (or a few values), such as taking the mean or standard deviation of a vector, we use `summarize()` instead.
```{r}
police %>%
  mutate(vehicle_age = 2017-vehicle_year) %>% # computing a new variable first
  summarize(mean_vehicle_age = mean(vehicle_age))
```
Note that even though there's just one value, we get a tibble returned. This is what to expect with the tidyverse.
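One quick way to verify this is to check the class of the result (a small sketch; `n()` simply counts rows):

```{r}
police %>%
  summarize(n_rows = n()) %>%
  class() # "tbl_df" "tbl" "data.frame" -- still a tibble, not a plain number
```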
As a side note, if we needed the single value (or a single vector), we could `pull()` it out of the tibble/data frame:
```{r}
police %>%
  mutate(vehicle_age = 2017-vehicle_year) %>% # computing a new variable first
  summarize(mean_vehicle_age = mean(vehicle_age)) %>%
  pull()
```
We can compute more than one summary measure at the same time:
```{r}
police %>%
  mutate(vehicle_age = 2017-vehicle_year) %>% # computing a new variable first
  summarize(mean_vehicle_age = mean(vehicle_age),
            sd_vehicle_age = sd(vehicle_age),
            min_date = min(date),
            max_date = max(date))
```
We get one column per summary variable we create. Once we group below, we'll see why we get the output in columns instead of rows.
### EXERCISE
Use `summarize()` to compute the `min()` and `max()` of `vehicle_year`.
```{r}
```
## Across
If we want to apply the same summary functions to multiple columns in our data frame, we can write out all of the summary commands explicitly, or we can use `across()` to select which variables to summarize with which functions.
Let's use the `n_distinct()` function to count the number of distinct values in each column (`n_distinct(x)` is the same as `length(unique(x))`). This will help us see which columns don't have useful information because every value is the same.
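As a quick check of that equivalence, here's a sketch using a single column:

```{r}
n_distinct(police$subject_race)
length(unique(police$subject_race)) # should return the same number
```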
`across()` selects columns using the helper functions you could give to `select()` directly. We'll use `everything()` here to select all columns.
```{r}
police %>%
  summarize(across(everything(), n_distinct))
```
If you wanted to select columns using their names, put them in a vector (so it's a single input argument):
```{r}
police %>%
  summarize(across(c(date, time, location, beat, subject_age), n_distinct))
```
If we want to apply multiple functions:
```{r}
police %>%
  summarize(across(!where(is.character), ## select columns that are not of type character
                   list(min, max)))      # take the min and max of each column
```
To fix the names in the output, explicitly name our summary functions in the list:
```{r}
police %>%
  summarize(across(!where(is.character), ## select columns that are not of type character
                   list(min_val=min, max_val=max))) # take the min and max of each column
```
There are other options for output naming patterns available too.
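For example, `across()` has a `.names` argument that takes a glue-style pattern, where `{.col}` stands for the column name and `{.fn}` for the function name. A sketch of how that might look:

```{r}
police %>%
  summarize(across(!where(is.character),
                   list(min = min, max = max),
                   .names = "{.fn}_{.col}")) # e.g. min_vehicle_year, max_vehicle_year
```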
### EXERCISE
What if we want to do something to each column that is more complicated than a single function? For example, counting missing values, which you can do with `sum(is.na(x))`. You can write your own function!
```{r, eval=TRUE}
na_count <- function(x) {
  sum(is.na(x))
}
# Example for a single vector:
na_count(police$subject_age)
```
Now use this new `na_count()` function to count the missing values in each column:
```{r}
police %>%
  summarize(across(___, ___))
```
This is similar to using the `apply()` function in base R:
```{r}
apply(police, 2, na_count)
```
#### Advanced Aside
Instead of creating the function ahead of time, we could define it as part of the `summarize()` call. We use `.x` to stand in for each column, and we put `~` in front of the expression to signal that we're creating a function in place (a lambda function).
```{r}
police %>%
  summarize(across(everything(),
                   ~sum(is.na(.x)))) # define a function here that operates on each column
```
This is an advanced technique outside of the scope of this workshop. But if you see a `~` inside `across()`, at least know that someone is defining their own function inline.
# Group By
With base R, when we want to compute summary measures or do other computation on groups in our data (as defined by some grouping variable), we use functions such as `tapply()` or `aggregate()`. With dplyr, we can explicitly group our tibble into subgroups. This isn't very useful by itself, but it is often combined with `summarize()` to compute summary measures by group.
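For comparison, a grouped mean in base R might look something like this (just a sketch for context; we won't use `tapply()` again in this workshop):

```{r}
# mean vehicle year for each subject_sex, base R style
tapply(police$vehicle_year, police$subject_sex, mean, na.rm = TRUE)
```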
First, what if we just group:
```{r}
police %>%
  group_by(outcome)
```
When we print this in the console,
```
# A tibble: 14,792 x 29
# Groups:   outcome [2]
   raw_row_number date       time  location beat  subject_age subject_race subject_sex department_id
            <dbl> <date>     <tim>    <dbl> <chr> <lgl>       <chr>        <chr>               <dbl>
 1       11249746 2017-01-01 00:56    60202 72    NA          white        male                13178
 2       11249747 2017-01-01 04:43    60643 71    NA          black        male                13178
 3       11249748 2017-01-01 09:21    60625 78    NA          black        female              13178
...
```
we see that the tibble (data frame) is grouped by outcome and that there are two groups. It doesn't rearrange the rows; it just keeps track of the groups for us.
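If you ever want to inspect the grouping structure itself, dplyr has a few helper functions (a brief aside; these aren't needed for the rest of the session):

```{r}
police %>%
  group_by(outcome) %>%
  group_vars() # which columns define the groups

police %>%
  group_by(outcome) %>%
  n_groups() # how many groups there are
```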
Now, combine with `summarize()`. But first, let's make the `vehicle_age` column we've been using actually part of the `police` dataset so that we don't have to keep creating it:
```{r, eval=TRUE}
police <- mutate(police, vehicle_age = 2017-vehicle_year)
```
Now, group and summarize:
```{r}
police %>%
  group_by(subject_sex) %>%
  summarize(mean_vehicle_age = mean(vehicle_age),
            sd_vehicle_age = sd(vehicle_age))
```
Now we get one row for each group, and one column for each summary measure.
We can group by multiple columns, and we'll get all of the combinations of values present across the columns:
```{r}
police %>%
  group_by(subject_sex, subject_race) %>%
  summarize(mean_vehicle_age = mean(vehicle_age),
            sd_vehicle_age = sd(vehicle_age))
```
Let's compute the ratio of warnings to citations by `subject_race`. Note that we can use the variables we create in later expressions within the same call to `summarize()`:
```{r}
police %>%
  group_by(subject_race) %>%
  summarize(warnings = sum(outcome == "warning"),
            citations = sum(outcome == "citation"),
            ratio = warnings/citations)
```
There's considerable variation here, from 1.1 to 2.2 warnings for every citation given.
### EXERCISE
Compute the `min()` and `max()` of `vehicle_year` for each `vehicle_make`.
```{r}
```
## Ungrouping
If you ever have a grouped data frame, you may need to ungroup it to get rid of the groups. To do so, use `ungroup()`:
```{r}
police %>%
  group_by(outcome) %>%
  ungroup()
```
## Slicing
One operation we could do with a grouped tibble is to select just certain rows from each group. For example, we could use the `slice()` function to select the first row from each group:
```{r}
police %>%
  select(outcome, everything()) %>% # to reorder columns for output
  group_by(outcome) %>%
  slice(1)
```
If you look at this output in the console, you'll see the resulting tibble still has groups in it. This is a case where you might want to ungroup:
```{r}
police %>%
  select(outcome, everything()) %>% # to reorder columns for output
  group_by(outcome) %>%
  slice(1) %>%
  ungroup()
```
# Arrange
Finally, we come to `arrange()`, which is how we sort the rows in our data. We mostly use this when viewing our data, but it's also useful when we need to compute lags and leads in a time series, when we want to select just a few rows from each group, or for any other order-sensitive transformation of our data.
```{r}
arrange(police, time)
```
To sort in reverse order, wrap the column name in `desc()`.
```{r}
arrange(police, desc(date))
```
Arrange by multiple columns, in order:
```{r}
arrange(police, date, desc(time))
```
An example where order matters: computing the time between consecutive stops in the dataset:
```{r}
police %>%
  arrange(date, time) %>%
  mutate(datetime = lubridate::ymd_hms(paste(date, time)), # combine to single value
         time_since_last = datetime - lag(datetime)) %>%   # current time - previous time
  select(datetime, time_since_last)
```
### EXERCISE
Sort the data by `vehicle_make` and then `vehicle_year`.
```{r}
```
# Count
A bonus function that I use frequently: `count()`. It's how you'd get output similar to `table()` in base R.
By itself, it counts the number of rows:
```{r}
police %>%
  count()
```
If you supply the name of a column, it counts the number of rows for each value of that column:
```{r}
police %>%
  count(subject_sex)
```
This gives the same counts as if you grouped the data first:
```{r}
police %>%
  group_by(subject_sex) %>%
  count()
```
You can group by multiple columns directly with `count()`:
```{r}
police %>%
  count(subject_sex, subject_race)
```
### EXERCISE
How many times does each type of violation appear in the dataset? Bonus: sort the results from most to least frequent. You can do this with `arrange()` or look at the documentation for `count()` to find another option.
```{r}
```
# Recap
We've now covered the core dplyr functions: six "verbs" of common actions we may take with a data frame. There are lots of helper functions, variations, and special cases that can come up when using these functions, but we've covered the essentials.
All of these functions were for working with a single data frame. Next session we'll talk about dplyr functions that will help us combine two different data frames together.
To learn more about some issues in computing police bias statistics, see https://fivethirtyeight.com/features/why-statistics-dont-capture-the-full-extent-of-the-systemic-bias-in-policing/ or from Northwestern researchers Prof. Redbird and Kat Albrecht, https://redbird.shinyapps.io/police-bias-map/.