-
Notifications
You must be signed in to change notification settings - Fork 1
/
2_tidyr_pivot.qmd
381 lines (267 loc) · 16.1 KB
/
2_tidyr_pivot.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
---
output:
html_document:
df_print: paged
code_download: TRUE
toc: true
toc_depth: 1
editor_options:
chunk_output_type: console
---
```{r, setup, include=FALSE}
knitr::opts_chunk$set(eval=TRUE)
```
*The output of most of the R chunks isn't included in the HTML version of the file to keep it to a more reasonable file size. You can run the code in R to see the output.*
# Setup
It's important that you have `{tidyr}` version 1.1.0 (at least) installed. Check this with:
```{r}
packageVersion("tidyr")
```
If needed, update it with:
```{r, eval=FALSE}
install.packages("tidyr")
```
Load it as part of tidyverse
```{r, warning=FALSE, error=FALSE, message=FALSE}
library(tidyverse)
```
# Tidy data
The goal of `{tidyr}` is to help you create **tidy data**. Tidy data is data where:
1. Each variable is a column; each column is a variable.
2. Each observation is a row; each row is an observation.
3. Each value is a cell; each cell is a single value.
Tidy data describes a standard way of storing data that is used wherever possible throughout the [tidyverse](https://www.tidyverse.org/). If you ensure that your data is tidy, you'll spend less time fighting with the tools and more time working on your analysis.
# Data
We're going to work with a dataset that's part of the `{tidyr}` package: `who`
The details: A subset of data from the World Health Organization Global Tuberculosis Report, and accompanying global populations. The data uses the original codes given by the World Health Organization.
The column names for columns 5 through 60 are made by combining "new\_" to a code for method of [diagnosis]{.underline} (***rel*** = relapse, ***sn*** = negative pulmonary smear, ***sp*** = positive pulmonary smear, ***ep*** = extrapulmonary) to a code for [gender]{.underline} (***f*** = female, ***m*** = male) to a code for [age group]{.underline} (***014*** = 0-14 yrs of age, ***1524*** = 15-24 years of age, ***2534*** = 25 to 34 years of age, ***3544*** = 35 to 44 years of age, ***4554*** = 45 to 54 years of age, ***5564*** = 55 to 64 years of age, ***65*** = 65 years of age or older).
Column names:
```{r}
names(who)
```
Let's look at a few random rows:
```{r}
sample_n(who, 10)
```
In this data set, each row (each observation) is a country-year. Each column is a measure-age value.
There are situations though where we might want each row to be a different unit of observation. We could make a really wide data set (one with lots of columns), where each row is a single country and there are columns for every measure in every year. Or we could make a data set that's really long, where each row is a country-year-measure-age observation. Or one that's in between, with each row being country-year-age, and there are columns for each measure.
These different configurations support different types of statistical models, data transformations, and data visualizations. No configuration is necessarily the right one or better than the others in a vacuum - it depends on what you want to do with the data!
I'm going to reduce this dataset to just a few countries that have at least some data filled in to make it easier for us to work with:
```{r}
who2 <- filter(who, country %in% c("Canada", "Czech Republic", "Morocco"))
```
# Pivot Longer
We make a data set longer by reducing the number of columns and increasing the number of rows. We'll stack some of the columns on top of each other, and duplicate the values in other columns to make this happen. This changes what the unit of observation is (what each row represents).
![](images/pivot_longer_new.png)
Image source: <https://epirhandbook.com/en/pivoting-data.html>
## A Mini Example
Before we work with our larger data set, let's look at a very small one so it's easier to see what's happening.
`tribble()` is a function that makes it easier to manually create a tibble (data frame)
```{r}
foods <- tribble(~id, ~a, ~b, ~c,
1, "apple", "banana", "cod",
2, "asparagus", "bacon", "chocolate")
foods
```
In `foods`, each row (observation) is a person -- each person's favorite food starting with each letter. But we want to get this data set to look like:
```
id category item
1 a apple
1 b banana
1 c cod
2 a asparagus
2 b bacon
2 c chocolate
```
Where each row (observation) is a person-food combination: each food is in a row by itself, and the column names have become encoded in a variable.
To do this, we use the function `pivot_longer()` - in previous versions of `{tidyr}`, this function was called `gather`. The syntax was similar, but the arguments had different names. In base R, you can do similar transformations with the `reshape` function.
Like other tidyverse functions, we'll start with the data frame name. Then we'll select which we're going to collapse -- meaning there won't be columns with those names anymore because we'll stack the names and values on top of each other to make more rows. Then we need to tell it what to name the new variable it will make that will have the column names in it (our new categorical variable), and a name as well for the new column that will have the values in it (the values that were in the columns originally).
We specify which columns we want using any of the select syntax or select helpers we've been talking about in previous sessions.
```{r}
pivot_longer(foods,
cols = -id, # collapse all columns except (-) id
names_to = "category", # name for the new column that will have the old column names as values
values_to = "item") # name for the new column that will store the data from the old columns
```
The `id` column has values duplicated, once for each of the original 3 columns that we collapsed, and each value of our new `category` column appears twice, because there were 2 rows in `food` when we started.
### EXERCISE 1
Pivot the pops data frame below to be in a longer format, with rows for each country-year instead of each country.
[Hint]{.underline}: Break down the problem into steps. Think about which columns of the original dataset need to be collapsed together to make the data "long".
```{r}
pops <- tribble(~country, ~year2011, ~year2012, ~year2013,
"Brazil", 196935134, 198656019, 200361925,
"Germany", 82892904, 82800121, 82726626,
"Kenya", 42027891, 43178141, 44353691)
```
## WHO data
Now, the WHO TB data is a bit more complicated than the simple example. We know how to take the data set into completely long format, by collapsing all of the columns except the first 4 that have country name info and year:
```{r}
names(who2)
```
Each of the measurement columns has the count of the number of people with the given diagnosis, gender, and age group.
```{r, eval=FALSE}
pivot_longer(who2,
-country:-year, # all except first 4 columns (this is the cols argument)
names_to = "measure", # make up a new name
values_to = "count") # make up a new name
```
There were 56 columns that we collapsed -- that weren't the ID variables. So the first row in the original data expanded to now be 56 rows in the longer data set.
But what if we don't want to collapse them completely? What if we want one column per age group? or per measure instead? Or what if all of the columns aren't of the same type? If some are numeric while others have character or factor data?
In really complicated cases, you may need to split your data into multiple data sets, reshape the parts, and then join the results back together.
In most cases though, you can pivot to a completely long format, and then spread columns back out wider after some additional transformations. If there are different data types, they get converted to the more general one, following: boolean -\> integer -\> numeric -\> character. While you need to be careful if you are working with precise numerical values (decimal values), if you're working with whole numbers or the precision (number of significant digits) in the data is small, converting from numeric to character and back again will not lose information.
## Separate variable names
Ok, back to our WHO data. One trick before we learn how to pivot wider to undo what we did. When we pivot longer, our new "measure" variable actually has multiple pieces of information in it. The variable names are of the format: "new\_" diagnosis \_ m/f age. Ideally, we'd like this information separated, so we have a column for the diagnosis method, a column for the gender, and a column for the age group.
We can tell pivot_longer to do this for us:
```{r, eval=FALSE}
names(who2)
pivot_longer(who2, -country:-year,
# we'll keep 3 columns: make up a name for each
names_to=c("diagnosis", "gender", "age"),
# this is a regular expression that captures the pattern
# new_? matches new with an optional _
# . matches any single character except a newline and + means one or more of the preceding
# _ matches _
# [mf] matches either m or f
# \\d+ matches one or more digits
# each () set is a value to keep and turn into a column;
# we are dropping "new"
names_pattern = "new_?(.+)_([mf])(\\d+)",
values_to="count")
```
If the variable names had a simpler pattern, for example: year_measure, then we could use the "names_sep" argument instead of the more complicated "names_pattern" to split up the column.
# Pivot Wider
Now, we've got our completely long data. How do we get it into a form where there's one row per country-year-gender-age, with one column per measure? Essentially, we want to split the diagnosis column into multiple columns for each type of diagnosis.
![](images/pivot_wider_new.png)
Image Source: <https://epirhandbook.com/en/pivoting-data.html>
```{r}
# same as above, just save it
long_data <- pivot_longer(who2, -country:-year,
names_to=c("diagnosis", "gender", "age"),
values_to="count",
names_pattern = "new_?(.+)_([mf])(\\d+)")
names(long_data)
```
```{r, eval=FALSE}
# spread it back out
pivot_wider(long_data,
names_from=diagnosis,
values_from=count)
```
What happens if our long data doesn't have every combination of values to fill into our new data set? For example, if the data started long instead of wide?
For example, if we had data that looked like:
```{r, eval=FALSE}
long_data %>%
filter(!is.na(count)) %>%
sample_n(20) %>%
arrange(country, year, diagnosis, age, gender)
```
It's ok - we can still pivot wider. By default, it will fill in `NA` where it doesn't have a value, or we can tell it to fill with another value (such as 0).
```{r, eval=FALSE}
long_data %>%
filter(!is.na(count)) %>%
sample_n(20) %>%
arrange(country, year, diagnosis, age, gender) %>%
pivot_wider(names_from=diagnosis,
values_from=count)
```
We could also pivot wider more than one column at once. Let's have one row per country-year-diagnosis, and one column for each age-gender combination:
```{r}
names(long_data)
long_data %>%
pivot_wider(names_from=c(age, gender),
# controls how new variable names are created
# uses syntax from the glue package
names_glue="{gender}_{age}",
values_from=count)
```
### EXERCISE 2
Pivot the `long_data` data frame wider so that each row is a country-diagnosis-gender-age observation (pivot "year" so you have one column for each year in the output).
[Hint]{.underline}: Break down the problem into steps. Think about which column(s) in your original dataset need to be spread out into multiple columns to make the data "wide".
```{r}
```
# Bonus 1: Separate
Much like we had `pivot_longer()` split up the components of the variable names for us into measure, age, and gender, we can use the `separate()` function to do that to columns outside of pivoting.
```{r}
nobels <- tribble(~name, ~category,
"Bob Dylan", "Literature",
"Anne L'Huillier", "Physics",
"Narges Mohammadi", "Peace",
"Roger Penrose", "Physics",
"Jennifer Doudna", "Chemistry",
"Abhijit Banerjee", "Economics",
"Jean-Pierre Sauvage", "Chemistry")
nobels
```
We could split the name column into first name and last name, using a space character as a delimiter.
```{r}
separate(nobels, # data frame name
name, # existing column name to split
into=c("first", "last"), # new column names
sep=" ") # what to split on
```
You need to know the maximum number of pieces that your data will be split into, or you'll lose some of the information. If Jean-Pierre Sauvage's name had a space instead of a hyphen:
```{r}
nobels <- tribble(~name, ~category,
"Bob Dylan", "Literature",
"Anne L'Huillier", "Physics",
"Narges Mohammadi", "Peace",
"Roger Penrose", "Physics",
"Jennifer Doudna", "Chemistry",
"Abhijit Banerjee", "Economics",
"Jean Pierre Sauvage", "Chemistry")
separate(nobels,
name,
into=c("first", "last"),
sep=" ")
```
We get a warning, and "Sauvage" was just dropped completely! (You can control this behavior to some extent with the `extra` argument.)
It's better to specify the maximum number of columns possible instead:
```{r}
separate(nobels,
name,
into=c("first1", "first2", "last"),
sep=" ")
```
Now the warning is telling us it filled in `NA` values where it didn't find 3 pieces after separating. Also, the new column names are not exactly suitable since other observations in the dataset all have last names (not `NA`).
But `separate` really works best when you're separating something that will separate into a fixed number of columns.
Side note: working with names is HARD! For some examples why, see <https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/>
### EXERCISE 3
Separate the "location" variable in `addresses` into separate "city" and "state" columns. Hint: the "sep" value can be more than one character in length.
```{r}
addresses <- tribble(~name, ~location,
"Katherine", "Chicago, IL",
"Bob", "Bend, OR",
"Louka", "Honolulu, HI",
"Alex", "San Jose, CA")
```
# Bonus 2: separate_rows()
In cases where you want to separate out a column, but you don't know how many values might be in it, another option is to make one row, instead of one column, per value. It's combining a separate with a pivot longer.
This can be useful if you have a column with a delimited list of values in it. For example, if you ask a multiple choice question that allows multiple answers in Qualtrics (survey platform), you'll get a single column with all of the answer that people checked:
```{r}
survey <- tribble(~id, ~q1, ~q2,
45, 4, "Time",
46, 3, "Quality;Cost",
47, 4, "Time;Cost;Compatibility;Warranty")
survey
```
How do we deal with q2?
```{r}
separate_rows(survey,
q2, # which column to separate
sep=";")
```
## Creating Indicator Variables
Then, we may want to create one column per answer choice -- create indicator variables (aka dummy variables or one-hot encoding).
```{r}
separate_rows(survey,
q2, # which column to separate
sep=";") %>%
mutate(exists=1) %>% # add a column to encode values later
pivot_wider(names_from=q2, # new column names, one for each value
values_from=exists, # so there will be a 1 where there was a value
values_fill=0) ## fills in missing vals with 0 instead of NA
```
# Resources to continue learning
The `{tidyr}` documentation has a tutorial on pivoting: <https://tidyr.tidyverse.org/articles/pivot.html>
# Recap
You now can use different `{tidyr}` commands, such as `pivot_longer`, `pivot_wider`, `separate`, `separate_rows`, to reshape your data in a wide variety of ways.