-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy path1_dplyr_join.qmd
399 lines (255 loc) · 17.1 KB
/
1_dplyr_join.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
---
output:
html_document:
df_print: paged
code_download: TRUE
toc: true
toc_depth: 1
editor_options:
chunk_output_type: console
---
```{r, setup, include=FALSE}
knitr::opts_chunk$set(eval=FALSE, warning=FALSE, error=FALSE, message=FALSE)
```
*The output of most of the R chunks isn't included in the HTML version of the file to keep it to a more reasonable file size. You can run the code in R to see the output.*
# Setup
```{r, eval=TRUE}
library(dplyr)
library(readr)
```
We know `{dplyr}` functions for working with a single data frame. `{dplyr}` also has functions for joining two data frames together.
# Data
We're going to use COVID-19 data by county and demographics by county for joining:
```{r, eval=TRUE}
# county level data
demo <- read_csv("https://raw.githubusercontent.com/nuitrcs/R-intermediate-tidyverse-2023/main/data/county_demographics.csv", show_col_types = FALSE)
cases <- read_csv("https://raw.githubusercontent.com/nuitrcs/R-intermediate-tidyverse-2023/main/data/county_cases.csv", show_col_types = FALSE)
deaths <- read_csv("https://raw.githubusercontent.com/nuitrcs/R-intermediate-tidyverse-2023/main/data/county_deaths.csv", show_col_types = FALSE)
# state level data
state_stats <- read_csv("https://raw.githubusercontent.com/nuitrcs/R-intermediate-tidyverse-2023/main/data/state_covid.csv", show_col_types = FALSE)
```
I've introduced some missing data for the purposes of this workshop. Get the full COVID-19 case data from <https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/>. COVID-19 case and death data included here is as of early July 2020. Demographic variables are census estimates for July 2019.\
To learn more about COVID-19 and race, see <https://covidtracking.com/race>; this is the source of the state-level COVID-19 statistics used here.
What's in the data sets:
```{r}
demo # demographics by county
cases # covid case counts by county
deaths # covid death counts by county
state_stats # covid case and death counts by state and demographics
```
County FIPS Codes are unique 5-digit codes that represent specific US counties. The first two digits represent the county's state and the last three digits represent the county. For example "17031" is the FIPS Code for Cook County, Illinois.
# Adding Rows and Columns
Before we get to joins, let's first cover adding rows or columns to a data frame.
`{dplyr}` includes `bind_rows` and `bind_cols`, which are roughly equivalent to base R `rbind` and `cbind`, but they are a bit easier to work with.
To demonstrate, let's create some subsets of the data to work with first:
```{r, eval=TRUE}
delaware <- filter(demo, state=="Delaware") %>%
select(1:6)
hawaii <- filter(demo, state=="Hawaii") %>%
select(1:4,7:9)
```
```{r}
delaware
hawaii
```
We have some overlap in the column names, but the set of columns doesn't match exactly. To combine these datasets, we want to bind the rows (each is a county) together:
```{r, eval=TRUE}
# just printing, not saving...
bind_rows(delaware, hawaii)
```
`bind_rows` includes all columns the exist in either dataset and fills in missing values with `NA`. It matches columns by full name. If we tried this with `rbind`, we'd get an error, because the set of columns doesn't match.
**When would you really do this?**
Maybe you have separate data files for each year of data, and you need to combine them. Or for each school, or state, or experiment iteration. Any case where the set of **variables** is **similar across multiple data sets** that all have the same unit of observation (rows are the same type of thing across data sets and you have similar variables).
For **adding columns**, we'd need to make sure that **both data frames have the same number of rows** and they appear in the same order. Generally, you don't want to `bind_cols`. If you have some column that identifies observations in each data frame, you'll want to join the datasets together instead, which is what we're going to do next. If you are just trying to add a few columns of data, use `mutate`.
**Why use join instead of `bind_cols`?**
While it does happen, it's dangerous to have multiple data sets with the same set of rows without some ID variable -- where they're just matched by the order of the rows; it's too easy for the matching between datasets to get messed up.
# Joins
The terminology for joining comes from SQL, which is used to interact with databases. And `{dplyr}` can work directly with SQL databases, translating `{dplyr}` commands into SQL, running them in the database, and then retrieving the results. That's for a separate workshop, but if you do want to work with a database, know that you do not necessarily have to export the data in order to work with it in R. And you can write `{dplyr}` commands instead of SQL commands to work with the data.
**Note**: the equivalent function for joining in base R is `merge()`.
## Background: Keys
Joins work when you have two data frames (called tables in databases, but same idea), and you want to merge them together. Each data set needs to have some type of identifier, a key, that tells you how a row in dataset 1 should be matched into dataset 2.
For example, here, we have three county datasets, all of which have different data (different variables). In our case, each dataset includes an ID variable for each county. These IDs are the same across datasets because all 3 use the FIPS code from the US Census. So even though our datasets have different numbers of rows, we have a key that will link rows in the different datasets together:
```{r, eval=TRUE}
names(cases)
names(deaths)
names(demo)
nrow(cases)
nrow(deaths)
nrow(demo)
```
For example, look at a county that is in all three data sets:
```{r, eval=TRUE}
filter(demo, fips == 1009)
filter(cases, countyFIPS == 1009)
filter(deaths, countyFIPS == 1009)
```
We want to combine the different variables from all 3 data frames together into a single one. We can't just bind the columns together though, because the set of counties in each data set is different (some counties are missing from some datasets).
Example (we'll talk through the code shortly):
```{r, eval=TRUE}
cases %>%
left_join(deaths, by="countyFIPS")
```
We also have a data set with state level data on COVID deaths. We might want to add this state level info to each county. In that case, you could use the state name to pull the correct state data from the data set for each row in the county data set.
```{r, eval=TRUE}
# We'll talk through the code more below
cases %>%
left_join(state_stats, by=c("state"="State"))
```
In this second example, each state in the state data set matches to multiple rows (multiple counties) in the county data set -- this is OK and expected. You can have 1:1, 1:many, or many:many matches across data sets (although this last one can get messy! so much so, that `{dplyr}` will warn you when it happens).
There can also be cases where the combination of two different variables is the "key" -- where both columns need to match for the data sets to be joined correctly.
There is also a more complicated case where you don't have clean "key" variables but you still want to join two datasets. This requires fuzzy or approximate matching using variables in the two data sets, which is beyond the scope of what we can cover today. A case where this might happen is when you have country names (not ISO codes), and variants such as DRC vs. Democratic Republic of the Congo vs. Congo, DR (yes, this happens regularly).
Today, we're assuming there is some clean key to match on where values match exactly.
### EXERCISE 1
Import data about degrees awarded at Northwestern. What column(s) can be used as a key to join the datasets together?
Note: this isn't a code answer - read in the data and take a look at the data frames to see what's common in them.
```{r}
bac <- read_csv("https://raw.githubusercontent.com/nuitrcs/R-intermediate-tidyverse-2023/main/data/bac_schools.csv")
masters <- read_csv("https://raw.githubusercontent.com/nuitrcs/R-intermediate-tidyverse-2023/main/data/masters_schools.csv")
doc <- read_csv("https://raw.githubusercontent.com/nuitrcs/R-intermediate-tidyverse-2023/main/data/doc_schools.csv")
```
## Background: Join Types
There are different ways to match up rows in two datasets. They vary based on what happens to rows that don't match to the other data set (where the key from one data set isn't present in the other).
<https://tavareshugo.github.io/r-intro-tidyverse-gapminder/fig/07-dplyr_joins.svg>
An **inner** join only includes in the result rows that matched in both datasets.
A **left** join includes all rows from the first (left) data set, filling in missing values where there isn't a match in the second data set (right).
A **right** join does the opposite: keeps all rows from the second (right) data set, and fills in missing values where there isn't a match in the fist.
An **outer** join keeps all rows from both data sets, filling in missing values where there aren't matches.
An **anti**-join helps you find rows in a dataset that do NOT match to another.
![](images/join_types.png)
Image Source: <https://cghlewis.com/blog/joins/>
## Joins with `{dplyr}`
Each of the join types is a different function in `{dplyr}`: `inner_join()`, `left_join()`, `right_join()`, `full_join()` (the last one is an outer join). You use them all in the same way.
Let's start with the first example above. I'm going to re-write it in a single line:
```{r}
left_join(cases, deaths, by="countyFIPS")
```
The first input, cases, is the "left" table. The second, deaths, is the "right." The `by="countyFIPS"` is telling it what key to use to join -- in this case, there's a column named "countyFIPS" in both data frames that should be used to match up rows.
In the output, it adds the suffix ".x" to any columns from the first data set where there's a column of the same name in the second data set, and uses the suffix ".y" for the reverse. "countyFIPS" doesn't get a suffix because it's the variable we told it to join on. "cases" and "deaths" don't get a suffix because there isn't a column with those names in the other data set.
Note above and also below, I'm just joining and printing the resulting data frame. You could save the output (the resulting data frame) in a new variable though:
```{r}
cases_deaths <- left_join(cases, deaths, by="countyFIPS")
```
### Variations on by
If I don't tell `{dplyr}` what the key variables are to use to join the tables, it will default to trying to join on any columns with the same name across data frames:
```{r}
left_join(cases, deaths)
```
And tell me what it joined by in the output.
**You don't want to do this!** It's best to be explicit about how you want to join the tables. Just because two columns have the same name, doesn't mean all of the data is the same! In the case of the `cases` and `deaths` data frames, they come from the same source. So all of these columns *SHOULD* match, but they don't!
```{r}
left_join(cases, deaths,
by="countyFIPS") %>%
filter(county.x != county.y) %>%
select(county.x, county.y)
```
If we matched by countyFIPS and county, we'd miss these matches.
**Something to always watch out for!** It's usually best to join using the minimal set of columns needed to identify the correct matches.
With the `cases` and `deaths` data frames, they came from the same data source, so the key column has the same name in both data sets. In the `demo` data frame, the name of the column with the county FIPS code is different though:
```{r}
names(demo)
```
To join with that, we use a named vector with `by`, with the form `c("Name in first data set"="Name in second data set")`:
```{r}
left_join(cases, demo, by=c("countyFIPS"="fips"))
```
Here, where the column names are different, it just keeps one copy of the joining key column, and uses the column name from the first (left) data set.
We can also join on multiple columns, where each pair of columns we specify needs to match. For example, in the cases and deaths data, there are some of the same columns. We can require multiple columns to match by supplying a vector of names:
```{r}
left_join(cases, deaths, by=c("countyFIPS", "state"))
```
This gives us a cleaner result without duplicate rows. But make sure that the values in these columns really should all match -- if not, you'll be missing matches.
### EXERCISE 2
Join (`left_join` is fine) the `cases` data frame to the `state_stats` data frame using the state column in each ("state" in `cases` and "State" in `state_stats`); case of the variable name matters! ("state" is not equal to "State" as column names).
```{r}
```
### EXERCISE 3
Join the bac, masters, and doc data frames imported above into a single data frame.
You can string together multiple join commands with %\>%
```{r}
```
## Unmatched Rows
A left join keeps all of the rows in the first data set, whether or not they match a row in the second. How do we find rows that didn't match?
You can use `anti_join()` to find rows in one (the first) data frame that don't match to a row in the second:
```{r}
anti_join(deaths, cases,
by=c("countyFIPS"))
```
This only returns the rows that don't match.
Note that we don't get any columns from the `cases` data frame because these are the rows from `deaths` that don't match to cases. To see what rows in `cases` don't match, switch the order:
```{r}
anti_join(cases, deaths,
by=c("countyFIPS"))
```
They all match! But remember, there are still rows in deaths that aren't represented in the result.
If, instead of seeing the non-matching rows separately, you want to see which ones didn't match in the context of the joined data frame, you can look for missing values in the joined columns.
```{r}
left_join(deaths, cases,
by=c("countyFIPS")) %>%
filter(is.na(cases))
```
Beware though, because if there are legitimately missing values in the column you check (here, cases), then you'll get mixed results here.
### EXERCISE 4
Find the rows in `deaths` that don't have a match in `demo` Hint: join with `deaths` (column "countyFIPS") and `demo` (column "fips").
```{r}
```
## Beware Missing Values
Missing values: by default, `{dplyr}` matches missing (`NA`) as if it is a value. So if we have two data frames, and there are missing values in our key id variables, then they will match, even if there are multiple `NA`s.
```{r}
x1 <- tibble(id=c(1, 2, NA), val=c(10, 20, 30))
x2 <- tibble(id=c(2, 3, NA, NA), val2 = c("a", "b", "c", "d"))
x1
x2
```
```{r}
left_join(x1, x2, by="id")
```
If we want this to not happen, we can set the `na_matches` argument to "never"
```{r}
left_join(x1, x2, by="id",
na_matches="never")
```
That's probably a good idea. In any case, you'd want to make sure you fully understand your keys before doing any joins.
Sometimes, R will warn you that there there might be a problem.
```{r}
x3 <- tibble(id=c(2, 3, NA, NA), val3 = c("w", "x", "y", "z"))
x3
left_join(x2, x3, by="id")
```
This isn't actually related to missing values, but it's one of the times when this is most likely to happen by mistake. What it means is that there's more than one row in the first data set that matches to the same row in the second data set AND there are rows in the first data set that each match to multiple rows in the second data set.
This can be legitimate, but it's something you'd always want to check to confirm that the behavior is correct.
## Cleanup
So you've joined two data sets, and now you have a bunch of columns with ".x" and ".y" (or other types of repeats). What do you do?
**Option 1**: select only the columns you really want in the result ahead of doing the join. NOTE: this is one of the very few times I might nest `{dplyr}` function calls
```{r}
cases %>%
select(countyFIPS, county, state, cases) %>%
left_join(select(demo, fips, total, white, black),
by=c("countyFIPS"="fips"))
```
Why nest the `select` call instead of saving a subset of the data first, and using that? Mostly for situations where the datasets are large, and I don't want extra copies of the data frame in my environment. In this case, the data sets are small enough that I could do this instead:
```{r}
demo_sub <- select(demo, fips, total, white, black)
cases %>%
select(countyFIPS, county, state, cases) %>%
left_join(demo_sub,
by=c("countyFIPS"="fips"))
```
But I have `demo_sub` hanging around in my environment then (I like to keep my environment clean when possible!).
**Option 2**: use `rename()` or `select()` on the result of the join:
```{r}
cases %>%
left_join(demo,
by=c("countyFIPS"="fips")) %>%
select(countyFIPS, county=county.x, state=state.x,
cases, total, white, black)
```
### EXERCISE 5
Create a data frame with only the following columns: countyFIPS, state, deaths, non_white population
Hint: use the `demo` and `deaths` data frames.
```{r}
```
# Resources to continue learning
See the Relational Data chapter in the book R for Data Science: <https://r4ds.had.co.nz/relational-data.html>
# Recap
You now can use different `{dplyr}` join commands, such as `inner_join`, `outer_join`, `left_join`, `right_join`, `anti_join`, to join together your data in a wide variety of ways.
Next: the `{tidyr}` functions for reshaping your data to follow the tidy format: `pivot_longer`, `pivot_wider`, `separate` and `separate_rows`.