forked from nuitrcs/R-fundamentals-summer-workshop
-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathDay3_data_cleaning.qmd
842 lines (559 loc) · 28.4 KB
/
Day3_data_cleaning.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
---
output:
html_document:
df_print: paged
code_download: TRUE
toc: true
toc_depth: 1
editor_options:
chunk_output_type: console
---
```{r, setup, include=FALSE}
knitr::opts_chunk$set(
eval=FALSE, warning=FALSE, error=FALSE
)
```
# Data Cleaning
... or put differently, welcome to the real world. Real datasets are seldom as tidy and clean as those you have been working with so far. Real data is messy and often you'll need to clean and coerce it into the right format in order to actually analyze it. This session contains a number of examples that serve as cookbook recipes for common data wrangling tasks.
## Using the R Help
One of the most important features that you will need as you start on your path to becoming an R user is getting help. No matter how skilled of a programmer you are, you always encounter new functions and need help understanding how to use them. We will take some time to explore the help features in R.
To get online help within an R session we use the `help()` function. This will allow us to see the documentation on a function that we are interested in.
Let's say we want to import a csv file and we know we need to use the `read.csv` function:
```{r}
help(read.csv)
```
You can get help with the `?` function:
```{r}
?read.csv
```
Aside from the documentation, another method is to search the internet for further help. Here are some other resources:
- [The Comprehensive R Archive Network](https://cran.r-project.org/)
- [Google](https://www.google.com/)
- [StackOverflow](https://stackoverflow.com/)
- [Github](https://github.com/)
- [UCLA Statistical Computing Help Site](https://stats.oarc.ucla.edu/r/)
### EXERCISE
Look up the documentation for `glm` . What does this function do? Can you run some examples of this function? What inputs does it take? What outputs does it return?
```{r}
```
## Debugging: errors and warnings
A very useful skill in coding is learning to debug your code. R helpfully provides warning messages or error messages to help you understand where you might be making a mistake.
Let's see a few examples:
```{r}
# Error: object not found
my_vector
# Error: non-numeric argument to binary operator
x <- 1:10
x + "a"
# Error: object 'penguins' not found
penguins[penguins$speciess]
# Warning: longer object length is not a multiple of shorter object length
1:10 + 1:3
library(palmerpenguins)
# Warning: Unknown or uninitialised column
penguins[penguins$speciess]
```
When you get an error, read it carefully and try to understand what's going on. It can be frustrating, but it's important to stay calm and slowly read the error and take a look at your code.
When you get a warning, make sure you understand what's going on and whether you need to do something about it.
Often, Googling errors or warnings can be very helpful! It's likely that someone else has gotten that error and you'll find help in the Internet, often on Stack Overflow.
You can also debug your code yourself. To do this, break your code in the its smallest components. Make sure that you understand what each component is doing. Make sure that it's doing what you expected (e.g., correct output for a given input). Then, make sure you understand how the components are working with each other. Make sure it's what you expected.
[How to Debug Your Code for Beginners](https://www.freecodecamp.org/news/what-is-debugging-how-to-debug-code/#heading-how-to-debug-your-code) provides some helpful tips to get new coders started!
## Useful functions
This section will list several basic R functions that are very useful and should be part of your toolbox.
### ifelse
`ifelse()` allows you to change the values of a vector based on a conditional test. It's particularly useful for recoding data, but you can also use it in situations where you don't want to change the underlying data.
The format is:
```{r}
ifelse(test_condition, value_if_true, value_if_false)
```
where the test condition is usually some comparison or operation on a vector - anything that results in TRUE or FALSE values
```{r}
x <- c(-1, 2, 3, -5, 3, NA, -4, 6)
x >= 0
ifelse(x >= 0, x, 0) # keep positive (>= 0) values as they are, replace others with 0
ifelse(is.na(x), mean(x, na.rm = TRUE), x) # replace missing values with the mean, leave others alone
```
`ifelse()` looks at each value in the initial TRUE/FALSE vector, and it decides what the value at each position should then be.
There's also the useful `replace_na()` function in the tidyr package and the `na_if()` function in the dplyr package.
```{r}
library(tidyr)
library(dplyr)
x
# replace any NA / missing values with user input
replace_na(x, mean(x, na.rm = TRUE)) # replace NA with 0
# make an element into NA / missing if it is undesired value
y <- c("abc", "def", "", "ghi") # remove an empty string
y
na_if(y, "")
```
### EXERCISE
Where the value in the vector below is positive, take the square. Where it is negative, substitute in `NA`.
```{r}
y <- c(4, -1, 3, 6, -7, 10, 20)
```
### EXERCISE
Recode the vector below such that values at or above the average are replaced with "tall" and values below average are replaced with "short"
Remember: break this into parts. Start with computing the mean of x, then use that in an expression to determine which values of x are above that mean.
```{r}
y <- c(4, 6, 5, 7, 3, 6, 4, 5, 3, 6, 6, 5, 6, 7, 6, 4, 6, 5, 7, 5)
```
### %in%
`%in%` returns TRUE if the value on the left is in the vector on the right, FALSE otherwise. Unlike `==`, if the value on the left is `NA`, it will return FALSE unless `NA` is also in the vector on the right:
```{r}
x <- c(-1, 2, 3, -5, 3, NA, -4, 6)
x
x %in% c(1, 2, 3)
x == 1 | x == 2 | x == 3
```
```{r}
state.name # built-in vector in R
sum(is.na(state.name))
ifelse(state.name %in% c("Alaska", "Hawaii"), NA, state.name)
sum(is.na(ifelse(state.name %in% c("Alaska", "Hawaii"), NA, state.name)))
```
Let's say we want to create a new column called `evanston` in our `evp` data frame such that it takes the value TRUE if the `location` for that row is either 60201 or 60202 (zip codes for Evanston IL), but FALSE otherwise. This will help us to quickly subset the data only for stops made in Evanston.
First, read in the data frame:
```{r}
evp <- read.csv("data/ev_police_jan.csv")
```
We could subset data - get the rows where the location is an Evanston zip code
```{r}
evp[evp$location %in% c(60201, 60202), ]
```
Or make a new column called `evanston` if the location is one of the two main Evanston zip codes of 60201 or 60202
```{r}
evp$evanston <- evp$location %in% c(60201, 60202)
names(evp)
table(evp$evanston)
```
### EXERCISE
Select the rows from the `evp` data below where the vehicle make is one of FIAT, DATS, or GEO.
```{r}
```
### paste
The `paste()` function is used to join pieces of text:
```{r}
paste("John", "Oliver")
```
The default separator between the strings is a space, but you can change it:
```{r}
paste("John", "Oliver", sep="---")
```
You can also paste the elements of two or more vectors together:
```{r}
salutation <- c("Dear")
first_names <- c("John", "Kate", "Tyler", "Jordan")
last_names <- c("Lee", "Sharma", "Smith")
paste(first_names, last_names)
paste(salutation, first_names)
```
### EXERCISE
`state.abb` has two-letter abbreviations for US states. `state.region` has the region for each US state. Use `paste()` to join the info from these two vectors together, so the output has entries that look like:
```
"AL: South" "AK: West" "AZ: West" "AR: South" ...
```
```{r}
```
## Factors
Factors are variables with text labels, but the set of values (called levels) that are allowed for the variable is limited, and the values optionally can have a specific order to them.
You'll typically want to store categorical data in R as factors.
You may ask why a special data structure is needed for categorical data, when we could just use `character` variables to represent the categories.
Let's say that you've conducted a survey on students' smoking habits. The possible responses are *Never*, *Occasionally*, *Regularly*, and *Heavy*. From 10 students, you get the following responses:
```{r}
smoke <- c("Never", "Never", "Heavy", "Never", "Occasionally", "Never", "Never", "Regularly", "Regularly", "No")
```
Note that the last answer is invalid - `No` was not one of the four answers that were allowed for the question. You could use `table` to get a summary of how many answers of each type that you got:
```{r}
table(smoke)
```
But the categories are not presented in the correct order! There is a clear order between the different categories, *Never* \< *Occasionally* \< *Regularly* \< *Heavy*, but `table` doesn't present the results in that way. Moreover, R didn't recognize that `No` was an invalid answer, and treats it just the same as the other categories.
This is where `factor` variables come in. They allow you to specify which values your variable can take, and the ordering between them (if any).
### Creating Factors
When creating a `factor` variable, you typically start with a `character`, `numeric` or `logical` variable, the values of which are turned into categories.
To turn the `smoke` vector that you created in the previous section into a `factor`, you can use the `factor` function:
```{r}
smoke_factor <- factor(smoke)
smoke_factor
```
You can inspect the elements, and *levels*, i.e. values that the categorical variable takes, as follows:
```{r}
levels(smoke_factor)
```
We see two problems! First, we see that "No" is a level/category, but it was not listed on our survey. To fix this problem, we can decide the categories/levels a factor vector is allowed to take by specifying the `levels` parameter in the `factor()` function call:
```{r}
smoke_factor <- factor(
smoke,
# allowed levels
levels = c("Never", "Heavy", "Occasionally", "Regularly")
)
levels(smoke_factor)
smoke_factor
```
We see that "No" is no longer listed as a level, and has been replaced by `NA` in the factor vector `smoke_factor`.
However, we still have another problem - the categories are in the wrong order.
```{r}
barplot(table(smoke_factor))
smoke_factor <- factor(
smoke,
# allowed levels in some desired order
levels = c("Never", "Occasionally", "Regularly", "Heavy")
)
barplot(table(smoke_factor))
```
Now we see the categories are in an order that is more informative.
However, we still have one more problem to fix. What if we wanted to compare smokers such that: Never \< Occasionally \< Regularly \< Heavy. In other words, if we make conditional expressions such as smoke_factor \> Occasionally, then Never and Occasionally would evaluate to FALSE while Regularly and Heavy would evaluate to TRUE. This can be helpful to subset the data meaningfully or make comparisons when the order is meaningful in some manner.
Let's see how to do this with the `ordered` parameter within `factor()`
```{r}
smoke_factor <- factor(
smoke,
# create allowed levels in some desired order
levels = c("Never", "Occasionally", "Regularly", "Heavy"),
# logical flag to determine if the levels should be regarded as ordered (in the order given) for any comparisons
ordered = TRUE
)
smoke_factor
```
Notice how the level information is presented differently this time to indicate possible comparisons between categories.
```{r}
smoke_factor
smoke_factor > "Occasionally"
# how many survey respondents smoke more than Occasionally
sum(smoke_factor > "Occasionally", na.rm = TRUE)
```
To recap, you can control the order in which the levels are presented by choosing which order we write them in in the `levels` argument. The `ordered = TRUE` argument specifies that the order of the variables is *meaningful*. `ordered = TRUE` can be excluded in cases where you wish to specify the order in which the categories should be presented purely for presentation purposes (e.g., when specifying whether to use the order `Male/Female/Other` or `Female/Male/Other`).
Also, note that the `No` answer now became an `NA`, which in the case of `factor` variables represents both missing observations and invalid observations.
### Changing factor levels
When we created `smoke_factor`, one of the elements became an `NA`. `NA` was however not included as a level of the `factor`. Sometimes it is desirable to include `NA` as a level, for instance when you want to analyse rows with missing data. This is easily done using the `addNA` function:
```{r}
levels(smoke_factor)
smoke_factor <- addNA(smoke_factor)
levels(smoke_factor)
smoke_factor
```
If you wish to change the name of one or more of the `factor` levels, you can do it directly via the `levels` function. For instance, we can change the name of the `NA` category, which is the 5th level of `smoke_factor`, as follows:
```{r}
levels(smoke_factor)
levels(smoke_factor)[5]
levels(smoke_factor)[5] <- "Invalid answer"
levels(smoke_factor)
smoke_factor
```
Finally, if you've added more levels than what are actually used, these can be dropped using the `droplevels` function:
```{r}
smoke_factor <- factor(
smoke,
levels = c("Never","Occasionally","Regularly","Heavy","Constantly"),
ordered = TRUE
)
levels(smoke_factor)
table(smoke_factor)
smoke_factor <- droplevels(smoke_factor)
levels(smoke_factor)
table(smoke_factor)
```
### Combining levels
Finally, the `levels` function can be used to merge categories by replacing their separate names with a single name. For instance, we can combine the smoking categories *Occasionally*, *Regularly*, and *Heavy* to a single category named *Yes*. Assuming that these are first, second and third in the list of names (as will be the case if you've run the last code chunk above), here's how to do it:
```{r}
# create our factor vector as usual
smoke_factor <- factor(
smoke,
levels = c("Never","Occasionally","Regularly","Heavy"),
ordered = TRUE
)
smoke_factor
# recode levels that are not "Never" to "Yes"
levels(smoke_factor)
levels(smoke_factor)[2:4]
levels(smoke_factor)[2:4] <- "Yes"
levels(smoke_factor)
smoke_factor
table(smoke_factor)
```
### EXERCISE
Convert the vector below to a factor. Set the levels in an intentional order. Should ordered be set to TRUE here - why or why not?
Hint: use `table()` or `unique()` to find all the categories
```{r}
age_group <- c(
"35-50",
"25-34",
"50-65",
"65+",
"18-24",
"18-24",
"50-65",
"35-50",
"50-65",
"18-24",
"18-24",
"65+",
"65+"
)
# your code here
```
### EXERCISE
Using the `evp` dataset, convert the columns `subject_race` and `subject_sex` into factors.
```{r}
```
### EXERCISE
Using the `evp` dataset, convert the column `search_basis` to a factor. Then recode all `NA` values into "invalid".
```{r}
```
## Tidyverse
### Setup
```{r, eval=TRUE}
library(tidyverse)
```
This gives you info on which packages it actually loaded, because when you install tidyverse, it installs \~25 packages plus dependencies, but it only loads the ones listed.
Tidyverse packages tend to be verbose in warning you when there are functions with the same name in multiple packages.
### Why use Tidyverse?
Tidyverse packages do a few things:
- fix some of the annoying parts of using R, such as changing default options when importing data files and preventing large data frames from printing to the console
- are focused on working with data frames --or rather tibbles-- (and their columns), rather than individual vectors
- usually take a data frame/tibble as the first input to a function, and return a data frame/tibble as the output of a function, so that function calls can be more easily strung together in a sequence
- share some common naming conventions for functions and arguments that have a goal of making code more readable
- tend to be verbose, opinionated, and are actively working to provide more useful error messages
Tidyverse packages are particularly useful for:
- data exploration
- reshaping data sets
- computing summary measures over groups
- cleaning up different types of data
- reading and writing data
- predictive modeling
- reporting results
### Data
Let's import the data we'll be using. Similarly to `evp`, the data is from the [Stanford Open Policing Project](https://openpolicing.stanford.edu/data/) and includes vehicle stops by the Evanston police in 2017. In contrast to `evp`, the data are not limited to January.
We're going to use the `read_csv` function from the `readr` package, which is part of the tidyverse. The `read_csv` function works like `read.csv` except it has some different defaults, guesses data types a bit differently, and produces a tibble instead of a data frame (details coming).
```{r, eval=TRUE}
police <- readr::read_csv("data/ev_police.csv")
```
The output message that you get tells you what data type it guessed for each column based on the format of the information. "chr" is character or text data, "dbl" is numeric (stands for double, which is technical term for a type of number), "lgl" is logical/Boolean (TRUE/FALSE). Note that it also automatically read and identified date and time values and converted them to date and time objects -- not just string/character data.
We can also manually specify column types for cases where the assumption that `read_csv` makes is wrong. We use the `col_types` argument (similar to `colClasses` for `read.csv`). Let's make the location to be character data, since it is zip codes -- zip codes should not be treated as numbers.
```{r, eval=TRUE}
police <- readr::read_csv(
"data/ev_police.csv",
col_types = c("location"="c")
)
```
### dplyr
dplyr is the core package of the tidyverse. It includes functions for working with tibbles (or any data frames). While you can still use base R operations on tibbles/data frames, such as using `$` and `[]` subsetting like we did above, dplyr provides alternatives to all of the common data manipulation tasks.
It contains six main functions, each a verb, of actions you frequently take with a data frame: select, filter, mutate, group_by, summarize, and arrange. Here we're going to cover the first five. You can take a look at [arrange](https://dplyr.tidyverse.org/reference/arrange.html) on your own or take one of our Tidyverse workshops.
Before we start, let's remember what columns are in our data:
```{r}
names(police)
```
### Select: choose columns
The `select()` function lets us choose which columns (or variables) we want to keep in our data.
The data frame is the first input, and the name of the column is the second. We do not have to put quotes around the column name.
```{r}
select(police, subject_race)
```
If we want to select additional columns, we can just list the column names as additional inputs, each column name separated by commas:
```{r}
select(police, subject_race, outcome)
```
As with `[]` indexing, columns will be returned in the order specified:
```{r}
select(police, subject_sex, subject_race, date)
```
We could also use the column index number if we wanted to instead. We don't need to put the values in `c()` like we would with `[]` (but we could).
```{r}
select(police, 1, 4, 10)
```
### EXERCISE
Convert this base R expression: `police[ ,c("violation", "citation_issued", "warning_issued")]` to use `select()` instead to do the same thing.
Hint: The base R expression above keeps all rows but selects only the three columns named within `c()`.
```{r}
```
### Filter: choose rows
To choose which rows should remain in our data, we use `filter()`. As with `[]`, we write expressions that evaluate to TRUE or FALSE for each row. Like `select()`, we can use the column names without quotes.
```{r}
filter(police, location == "60202")
```
Note that we use `==` to test for equality and get TRUE/FALSE output. You can also write more complicated expressions -- anything that will evaluate to a vector of TRUE/FALSE values.
```{r}
filter(police, is.na(beat))
```
### EXERCISE
How many rows have subject_race as "white"?
Hint: Use `filter()` to choose the rows where subject_race is "white". The equivalent base R expression would be `police[police$subject_race == "white", ]`.
```{r}
```
### Pipe: Chaining Commands Together
So, we can choose rows and choose columns separately; how do we combine these operations?
`dplyr`, and other tidyverse commands, can be strung together in a series with a `%>%` (say/read: pipe) operator. This operator takes the output of the command on the left and makes that the first input to the command on the right.
The pipe works well with dplyr (and other tidyverse packages) because the functions almost all take a data frame as the first input, and they return a data frame as the output.
We can rewrite
```{r}
select(police, date, time)
```
as
```{r}
police %>% select(date, time)
```
and you'll often see code formatted, so `%>%` is at the end of each line, and the following line that are still part of the same expression are indented:
```{r}
police %>%
select(date, time)
```
The pipe comes from a package called `magrittr`, which has additional special operators in it that you can use. The keyboard shortcut for `%>%` is command-shift-M (Mac) or control-shift-M (Windows).
We can use the pipe to string together multiple commands operating on the same data frame:
```{r}
police %>%
select(subject_race, subject_sex) %>%
filter(subject_race == "white")
```
We would read the `%>%` in the command above as "then" if reading the code out loud: from police, select subject_race and subject_sex, then filter where subject_race is white.
This works because the dplyr functions take a tibble/data frame as the first argument (input) and return a tibble/data frame as the output. This makes it easy to pass a data frame through multiple operations, changing it one step at a time.
Order does matter, as the commands are executed in order. So this would give us an error, because we're trying to filter on a column that we didn't select:
```{r}
police %>%
select(subject_sex, outcome) %>%
filter(subject_race == "white")
```
We'd have to reverse the order:
```{r}
police %>%
filter(subject_race == "white") %>%
select(subject_sex, outcome)
```
You can use the pipe operator to string together commands outside of the tidyverse as well, and it works with any input and output, not just data frames:
```{r}
# sum(is.na(police$beat))
is.na(police$beat) %>% sum()
# or
police$beat %>% is.na() %>% sum()
```
### EXERCISE
Select the date, time, and outcome (columns) of stops that occur in beat "71" (rows). Make use of the `%>%` operator.
The equivalent base R expression would be: `police[police$beat == "71", c("date", "time", "outcome")]`
Hint: remember that a column needs to still be in the data frame if you're going to use the column to filter.
```{r}
```
Note that so far, we haven't actually changed the `police` data frame at all. We've written expressions to give us output, but we haven't saved it.
Sometimes we may still want to save the result of some expression, such as after performing a bunch of data cleaning steps. We can assign the output of piped commands as we would with any other expression:
```{r}
police60201 <- police %>%
filter(location == "60201") %>%
select(date, time, beat, type, outcome)
```
### BONUS -- CHALLENGE EXERCISE
Go back to the previous exercises. Is there a difference in output between base R expression and `dplyr::filter` function outputs? How many rows does each output data frame have? Are they different? Why might that be?
```{r}
```
### Mutate: add/modify columns
`mutate()` is used to both change the values of an existing column and make a new column.
We name the column we're mutating and set the value. If the name already exists, it will update the column. If the name doesn't exist, it will create a new variable (column is appended at the end of the existing columns).
```{r, eval=TRUE}
police %>%
mutate(vehicle_age = 2017 - vehicle_year) %>%
select(starts_with("vehicle")) %>% # just to pick a few columns to look at
names()
```
We can put multiple mutations in the same call to mutate, with the expressions separated by commas:
```{r, eval=TRUE}
mutate(
police,
vehicle_age = 2017 - vehicle_year,
old_car = vehicle_year < 2000
)
```
### EXERCISE
If the column beat in `police` is "/", set it to `NA` instead using `mutate()`.
Hint: remember the `na_if` function.
```{r}
```
### Summarize
We use `mutate()` when we want the output to have the same length as the input. In other words, we use `mutate()` when we're operating on the individual elements in a vector - we want a value for every row in the data.
When we want to condense multiple values down to a single (or a few values), such as taking the mean or standard deviation of a vector, we use summarize instead:
```{r}
police %>%
mutate(vehicle_age = 2017 - vehicle_year) %>% # computing a new variable first
summarize(mean_vehicle_age = mean(vehicle_age))
```
Note that even though there's just one value, we get a tibble returned. This is expected within the tidyverse.
We can compute more than one summary measure at the same time:
```{r}
police %>%
mutate(vehicle_age = 2017 - vehicle_year) %>% # computing a new variable first
summarize(
mean_vehicle_age = mean(vehicle_age),
sd_vehicle_age = sd(vehicle_age),
min_date = min(date),
max_date = max(date)
)
```
We get one column per summary variable we create. Once we group below, we'll see why we get the output in columns instead of rows.
### EXERCISE
Use summarize to compute the `min()` and `max()` `vehicle_year`
```{r}
```
### Group_by
When we want to compute summary measures or do other computation on groups in our data (as defined by some grouping variable), we can explicitly group our tibble into subgroups. This isn't very useful by itself, but it is often combined with `summarize()` to compute summary measures by group.
First, what if we just group our data by a specific column:
```{r}
police %>%
group_by(outcome)
```
When we print this in the console,
```
# A tibble: 14,792 x 29
# Groups: outcome [2]
raw_row_number date time location beat subject_age subject_race subject_sex department_id
<dbl> <date> <tim> <dbl> <chr> <lgl> <chr> <chr> <dbl>
1 11249746 2017-01-01 00:56 60202 72 NA white male 13178
2 11249747 2017-01-01 04:43 60643 71 NA black male 13178
3 11249748 2017-01-01 09:21 60625 78 NA black female 13178
...
```
we see that it tells us that the tibble (data frame) is grouped by outcome, and that there are two groups. It doesn't rearrange the rows, it just keeps track of the groups for us.
Now, let's combine it with summarize. But first, let's make the `vehicle_age` column we've been using actually part of the `police` dataset so that we don't have to keep creating it:
```{r, eval=TRUE}
police <- mutate(police, vehicle_age = 2017 - vehicle_year)
```
Now, group and summarize:
```{r}
police %>%
group_by(subject_sex) %>%
summarize(
mean_vehicle_age = mean(vehicle_age),
sd_vehicle_age = sd(vehicle_age)
)
```
Now we get one row for each group, and one column for each summary measure.
We can group by multiple columns, and we'll get all of the combinations of values present across the columns:
```{r}
police %>%
group_by(subject_sex, subject_race) %>%
summarize(
mean_vehicle_age = mean(vehicle_age),
sd_vehicle_age = sd(vehicle_age)
)
```
We are getting extra information alerting us to the fact that our output is grouped. If we don't want the groups in place anymore, we can `ungroup()`. Usually this doesn't matter though if we're just printing output to the screen.
Let's compute the ratio of warnings to citations by subject_race - note that we can use the variables we create in later expressions within the same call to `summarize()`:
```{r}
police %>%
group_by(subject_race) %>%
summarize(
warnings = sum(outcome == "warning"), # count how many warnings
citations = sum(outcome == "citation"), # count how many citations
ratio = warnings/citations
)
```
There's considerable variation here, from 1.1 warnings for every citation given to 2.2 warnings for every citation given. So, relatively more warnings to citations for Black drivers than white and Hispanic ones.
### EXERCISE
Compute the `min()` and `max()` `vehicle_year` for each `vehicle_make`.
```{r}
```
### EXERCISE
Using mtcars, compute the average horsepower (hp) for each \# of cylinders (cyl)
```{r}
```
## Ungrouping
If you ever have a grouped data frame, you may need to ungroup it to get rid of the groups. To do so, use `ungroup()`:
```{r}
police %>%
group_by(outcome) %>%
ungroup()
```
Usually this would come up after more complicated operations, and often after computing summary measures by group.