-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathSolutions_3_dplyr_group.qmd
443 lines (316 loc) · 13.5 KB
/
Solutions_3_dplyr_group.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
---
output:
html_document:
df_print: paged
code_download: TRUE
toc: true
toc_depth: 1
editor_options:
chunk_output_type: console
---
```{r, setup, include=FALSE}
knitr::opts_chunk$set(
eval=FALSE, warning=FALSE, error=FALSE
)
```
*The output of most of the R chunks isn't included in the HTML version of the file to keep it to a more reasonable file size. You can run the code in R to see the output.*
This is an [R Markdown](https://rmarkdown.rstudio.com/) document. Follow the link to learn more about R Markdown and the notebook format used during the workshop.
# Setup
```{r, eval=TRUE, message=FALSE}
library(tidyverse)
```
For this session, we're learning three additional dplyr functions: group_by, summarize, and arrange.
## Data
We're using the same data as sessions 1 and 2.
```{r, eval=TRUE}
police <- read_csv("https://raw.githubusercontent.com/nuitrcs/r-tidyverse/main/data/ev_police.csv",
col_types=c("location"="c"))
```
We'll also use the built-in dataset mtcars:
```{r}
mtcars
```
# Summarize
We'll start with `summarize()` (or `summarise()` - British spelling is accepted). We use `mutate()` when we want the output to have the same length as the input -- when we're operating on the individual elements in a vector - we want a value for every row in the data. When we want to condense multiple values down to a single (or a few values), such as taking the mean or standard deviation of a vector, we use summarize instead.
```{r}
police %>%
mutate(vehicle_age = 2017-vehicle_year) %>% # computing a new variable first
summarize(mean_vehicle_age = mean(vehicle_age))
```
Note that even though there's just one value, we get a tibble returned. This is what to expect with the tidyverse.
As a side note, if we needed the single value (or a single vector), we could `pull()` it out of the tibble/data frame:
```{r}
police %>%
mutate(vehicle_age = 2017-vehicle_year) %>% # computing a new variable first
summarize(mean_vehicle_age = mean(vehicle_age)) %>%
pull()
```
We can compute more than one summary measure at the same time:
```{r}
police %>%
mutate(vehicle_age = 2017-vehicle_year) %>% # computing a new variable first
summarize(mean_vehicle_age = mean(vehicle_age),
sd_vehicle_age = sd(vehicle_age),
min_date = min(date),
max_date = max(date))
```
We get one column per summary variable we create. Once we group below, we'll see why we get the output in columns instead of rows.
### EXERCISE
Use summarize to compute the `min()` and `max()` `vehicle_year`
```{r}
police %>%
summarize(max_year = max(vehicle_year),
min_year = min(vehicle_year))
```
### n()
A useful helper function with summarize is n() which counts the number of rows. It can be useful in computing proportions.
```{r}
police %>%
summarize(total_obs = n())
```
```{r}
police %>%
summarize(warnings = sum(outcome == "warning"), # count the true values
prop_warnings = warnings/n())
```
### EXERCISE
What proportion of vechicles are Toyotas? (vehicle_make "TOYT")
```{r}
police %>%
summarize(toyotas = sum(vehicle_make == "TOYT"),
prop_toyt = toyotas/n())
police %>%
summarise(prop_toyt = sum(vehicle_make == "TOYT")/n())
```
## Across
If we want to apply the same summary functions to multiple columns in our data frame, we can write out all of the summary commands explicitly, or we can use `across()` to select which variables to summarize with which functions.
Let's use the `n_distinct()` function to count the number of distinct values in each column (`n_distinct(x)` is the same as `length(unique(x))`. This will help us see which columns don't have useful information because every value is the same.
`across()` selects columns using the helper functions you could give to `select()` directly. We'll use `everything()` here to select all columns.
```{r}
police %>%
summarize(across(everything(), n_distinct))
```
If you wanted to select columns using their names, put them in a vector (so it's a single input argument):
```{r}
police %>%
summarize(across(c(date, time, location, beat, subject_age), n_distinct))
```
If we want to apply multiple functions to each column, we put their names in a list() together:
```{r}
police %>%
summarize(across(!where(is.character), ## select columns that are not of type character
list(min, max)))# take the min and max of each column
```
To fix the names in the output, explicitly name our summary functions in the list:
```{r}
police %>%
summarize(across(!where(is.character), ## select columns that are not of type character
list(min_val=min, max_val=max))) # take the min and max of each column
```
There are other options for output naming patterns available too.
Side note: What if we want to do something to each column that is more complicated than a single function? For example, counting missing values, which you do with `sum(is.na(x))`. You can write your own function! You can read more about functions [here](https://r4ds.had.co.nz/functions.html). More on this in the Intermediate Tidyverse Workshop.
## EXERCISE
Using the built-in dataset mtcars, get the average of each column.
```{r}
mtcars %>%
summarize(across(everything(), mean))
```
Now get the min and max of just the first three columns:
```{r}
mtcars %>%
summarise(across(1:3, list(min, max)))
```
# Group By
With base R, when we want to compute summary measures or do other computation on groups in our data (as defined by some grouping variable), we use functions such as `tapply()` or `aggregate()`. With dplyr, we can explicitly group our tibble into subgroups. This isn't very useful by itself, but it is often combined with `summarize()` to compute summary measures by group.
First, what if we just group:
```{r}
police %>%
group_by(outcome)
```
When we print this in the console,
```
# A tibble: 14,792 x 29
# Groups: outcome [2]
raw_row_number date time location beat subject_age subject_race subject_sex department_id
<dbl> <date> <tim> <dbl> <chr> <lgl> <chr> <chr> <dbl>
1 11249746 2017-01-01 00:56 60202 72 NA white male 13178
2 11249747 2017-01-01 04:43 60643 71 NA black male 13178
3 11249748 2017-01-01 09:21 60625 78 NA black female 13178
...
```
we see that it tells us that the tibble (data frame) is grouped by outcome, and that there are two groups. It doesn't rearrange the rows, it just keeps track of the groups for us.
Now, combine with summarize. But first, let's make the `vehicle_age` column we've been using actually part of the `police` dataset so that we don't have to keep creating it:
```{r, eval=TRUE}
police <- mutate(police, vehicle_age = 2017-vehicle_year)
```
Now, group and summarize:
```{r}
police %>%
group_by(subject_sex) %>%
summarize(mean_vehicle_age = mean(vehicle_age),
sd_vehicle_age = sd(vehicle_age))
```
Now we get one row for each group, and one column for each summary measure.
We can group by multiple columns, and we'll get all of the combinations of values present across the columns:
```{r}
police %>%
group_by(subject_sex, subject_race) %>%
summarize(mean_vehicle_age = mean(vehicle_age),
sd_vehicle_age = sd(vehicle_age))
```
We are getting extra information alerting us to the fact that our output is grouped. If we don't want the groups in place anymore, we can ungroup(). Usually this doesn't matter though if we're just printing output to the screen.
Let's compute the ratio of warnings to citations by subject_race - note that we can use the variables we create in later expressions within the same call to `summarize()`:
```{r}
police %>%
group_by(subject_race) %>%
summarize(warnings = sum(outcome == "warning"), # count how many warnings
citations = sum(outcome == "citation"), # count how many citations
ratio = warnings/citations)
```
There's considerable variation here, from 1.1 warnings for every citation given to 2.2 warnings for every citation given. So, relatively more warnings to citations for Black drivers than Hispanic ones.
### EXERCISE
Compute the `min()` and `max()` `vehicle_year` for each `vehicle_make`.
```{r}
police %>%
group_by(vehicle_make) %>%
summarize(min_year = min(vehicle_year),
max_year = max(vehicle_year))
```
## EXERCISE
Using mtcars, compute the average horsepower (hp) for each \# of cylinders (cyl)
```{r}
mtcars %>%
group_by(cyl) %>%
summarise(avg_hp = mean(hp))
```
Now compute the average value of each variable by \# of cylinders
```{r}
mtcars %>%
group_by(cyl) %>%
summarise(across(everything(), list(avg=mean)))
```
## Ungrouping
If you ever have a grouped data frame, you may need to ungroup it to get rid of the groups. To do so, use `ungroup()`:
```{r}
police %>%
group_by(outcome) %>%
ungroup()
```
Usually this would come up after more complicated operations, and often after computing summary measures by group.
## Slicing
One operation we could do with a grouped tibble is to select just certain rows from each group. For example, we could use the `slice()` function to select the first row from each group:
```{r}
police %>%
select(outcome, everything()) %>% # to reorder columns for output
group_by(outcome) %>%
slice(1)
```
If you look at this output in the console, you'll see the resulting tibble still has groups in it. This is a case where you might want to ungroup:
```{r}
police %>%
select(outcome, everything()) %>% # to reorder columns for output
group_by(outcome) %>%
slice(1) %>%
ungroup()
```
Intead of using slice by itself, we're more likely to want to use slice_max or slice_min, or combine slice with sorting the rows in a particular order. Let's see how to do that.
# Arrange
Finally, we come to `arrange()`, which is how we sort the rows in our data. We would mostly use this when viewing our data, but it's also useful when we need to compute a time series (lags and leads in the data), when we want to select just a few rows from each group, or any other order-sensitive transformations on our data.
```{r}
arrange(police, time)
```
To sort in reverse order, wrap the column name in `desc()`.
```{r}
arrange(police, desc(date))
```
Arrange by multiple columns, in order:
```{r}
arrange(police, date, desc(time))
```
An example where it matters: compute time between stops in the dataset:
First, how do we get actual date and times we can work with together
```{r}
select(police, date, time)
# to combine these in a datetime object that understands dates and times, first paste together
police %>%
mutate(datetime = paste(date, time)) %>%
## check it
select(datetime, date, time)
# then convert to an actual datetime object (not just character)
police %>%
mutate(datetime = lubridate::ymd_hms(paste(date, time))) %>%
## check it
select(datetime, date, time)
```
Then, we want to make sure they're in chronological order
```{r}
police %>%
arrange(date, time) %>%
mutate(datetime = lubridate::ymd_hms(paste(date, time))) %>%
select(datetime)
```
Then compute the difference between each value and the previous value. The lag() function helps us do this:
```{r}
police %>%
arrange(date, time) %>%
mutate(datetime = lubridate::ymd_hms(paste(date, time)), # combine to single value
time_since_last = datetime - lag(datetime)) %>% # current time - previous time
select(datetime, time_since_last)
```
### EXERCISE
Sort the data by `vehicle_make` and then `vehicle_year`.
```{r}
police %>%
arrange(vehicle_make, vehicle_year)
```
Using a slice function, get the minimum vehicle year for each vehicle make.
```{r}
police %>%
group_by(vehicle_make) %>%
slice_min(vehicle_year) %>%
select(vehicle_make, vehicle_age, vehicle_year) # select columns for easy viewing
```
# Count
Count is how you'd get output similar to `table()` - it is used to count rows, either overall or in groups. It's a shortcut for summarizing and using n().
By itself, it counts the number of rows ("n" by default):
```{r}
police %>%
count()
```
If you supply the name of a column, it makes a table:
```{r}
police %>%
count(subject_sex)
#table(police$subject_sex)
```
This is the same result as if you grouped the data first:
```{r}
police %>%
group_by(subject_sex) %>%
count()
```
You can group by multiple columns directly with count:
```{r}
police %>%
count(subject_sex, subject_race)
#%>% arrange(n)
```
### EXERCISE
How many times does each type of violation appear in the dataset? Bonus: sort the results from most to least frequent. You can do this with `arrange()` or look at the documentation for `count()` to find another option.
```{r}
police %>%
group_by(violation) %>%
count()
#bonus
police %>%
group_by(violation) %>%
count() %>%
arrange(desc(n))
police %>%
group_by(violation) %>%
count(sort=T)
```
# Recap
We've now covered the core dplyr functions: six "verbs" of common actions we may take with a data frame. There are lots of helper functions, variations, and special cases that can come up when using these functions, but we've covered the essentials.
All of these functions were for working with a single data frame. Next session we'll talk about dplyr functions that will help us combine two different data frames together.
To learn more about some issues in computing police bias statistics, see <https://fivethirtyeight.com/features/why-statistics-dont-capture-the-full-extent-of-the-systemic-bias-in-policing/> or from Northwestern researchers Prof. Redbird and Kat Albrecht, <https://redbird.shinyapps.io/police-bias-map/>.