2_tidyr_pivot.qmd

---
output:
  html_document:
    df_print: paged
    code_download: TRUE
    toc: true
    toc_depth: 1
editor_options:
  chunk_output_type: console
---

```{r, setup, include=FALSE}
knitr::opts_chunk$set(eval=TRUE)
```

*The output of most of the R chunks isn't included in the HTML version of the file to keep it to a more reasonable file size. You can run the code in R to see the output.*

# Setup

It's important that you have `{tidyr}` version 1.1.0 (at least) installed. Check this with:

```{r}
packageVersion("tidyr")
```

If needed, update it with:

```{r, eval=FALSE}
install.packages("tidyr")
```

Load it as part of tidyverse

```{r, warning=FALSE, error=FALSE, message=FALSE}
library(tidyverse)
```

# Tidy data

The goal of `{tidyr}` is to help you create **tidy data**. Tidy data is data where:

1.  Each variable is a column; each column is a variable.

2.  Each observation is a row; each row is an observation.

3.  Each value is a cell; each cell is a single value.

Tidy data describes a standard way of storing data that is used wherever possible throughout the [tidyverse](https://www.tidyverse.org/). If you ensure that your data is tidy, you'll spend less time fighting with the tools and more time working on your analysis.

# Data

We're going to work with a dataset that's part of the `{tidyr}` package: `who`

The details: A subset of data from the World Health Organization Global Tuberculosis Report, and accompanying global populations. The data uses the original codes given by the World Health Organization.

The column names for columns 5 through 60 are made by combining "new\_" to a code for method of [diagnosis]{.underline} (***rel*** = relapse, ***sn*** = negative pulmonary smear, ***sp*** = positive pulmonary smear, ***ep*** = extrapulmonary) to a code for [gender]{.underline} (***f*** = female, ***m*** = male) to a code for [age group]{.underline} (***014*** = 0-14 yrs of age, ***1524*** = 15-24 years of age, ***2534*** = 25 to 34 years of age, ***3544*** = 35 to 44 years of age, ***4554*** = 45 to 54 years of age, ***5564*** = 55 to 64 years of age, ***65*** = 65 years of age or older).

Column names:

```{r}
names(who)
```

Let's look at a few random rows:

```{r}
sample_n(who, 10)
```

In this data set, each row (each observation) is a country-year. Each column is a measure-age value.

There are situations though where we might want each row to be a different unit of observation. We could make a really wide data set (one with lots of columns), where each row is a single country and there are columns for every measure in every year. Or we could make a data set that's really long, where each row is a country-year-measure-age observation. Or one that's in between, with each row being country-year-age, and there are columns for each measure.

These different configurations support different types of statistical models, data transformations, and data visualizations. No configuration is necessarily the right one or better than the others in a vacuum - it depends on what you want to do with the data!

I'm going to reduce this dataset to just a few countries that have at least some data filled in to make it easier for us to work with:

```{r}
who2 <- filter(who, country %in% c("Canada", "Czech Republic", "Morocco"))
```

# Pivot Longer

We make a data set longer by reducing the number of columns and increasing the number of rows. We'll stack some of the columns on top of each other, and duplicate the values in other columns to make this happen. This changes what the unit of observation is (what each row represents).

![](images/pivot_longer_new.png)

Image source: <https://epirhandbook.com/en/pivoting-data.html>

## A Mini Example

Before we work with our larger data set, let's look at a very small one so it's easier to see what's happening.

`tribble()` is a function that makes it easier to manually create a tibble (data frame)

```{r}
foods <- tribble(~id, ~a, ~b, ~c,
                 1, "apple", "banana", "cod", 
                 2, "asparagus", "bacon", "chocolate")
foods
```

In `foods`, each row (observation) is a person -- each person's favorite food starting with each letter. But we want to get this data set to look like:

```         
    id category item     
     1 a        apple    
     1 b        banana   
     1 c        cod      
     2 a        asparagus
     2 b        bacon    
     2 c        chocolate
```

Where each row (observation) is a person-food combination: each food is in a row by itself, and the column names have become encoded in a variable.

To do this, we use the function `pivot_longer()` - in previous versions of `{tidyr}`, this function was called `gather`. The syntax was similar, but the arguments had different names. In base R, you can do similar transformations with the `reshape` function.

Like other tidyverse functions, we'll start with the data frame name. Then we'll select which we're going to collapse -- meaning there won't be columns with those names anymore because we'll stack the names and values on top of each other to make more rows. Then we need to tell it what to name the new variable it will make that will have the column names in it (our new categorical variable), and a name as well for the new column that will have the values in it (the values that were in the columns originally).

We specify which columns we want using any of the select syntax or select helpers we've been talking about in previous sessions.

```{r}
pivot_longer(foods, 
             cols = -id, # collapse all columns except (-) id
             names_to = "category",  # name for the new column that will have the old column names as values
             values_to = "item")  # name for the new column that will store the data from the old columns
```

The `id` column has values duplicated, once for each of the original 3 columns that we collapsed, and each value of our new `category` column appears twice, because there were 2 rows in `food` when we started.

### EXERCISE 1

Pivot the pops data frame below to be in a longer format, with rows for each country-year instead of each country.

[Hint]{.underline}: Break down the problem into steps. Think about which columns of the original dataset need to be collapsed together to make the data "long".

```{r}
pops <- tribble(~country,  ~year2011,  ~year2012,  ~year2013,
                "Brazil",  196935134, 198656019, 200361925,
                "Germany",  82892904,  82800121,  82726626,
                "Kenya",    42027891,  43178141,  44353691)

```

## WHO data

Now, the WHO TB data is a bit more complicated than the simple example. We know how to take the data set into completely long format, by collapsing all of the columns except the first 4 that have country name info and year:

```{r}
names(who2)
```

Each of the measurement columns has the count of the number of people with the given diagnosis, gender, and age group.

```{r, eval=FALSE}
pivot_longer(who2, 
             -country:-year,  # all except first 4 columns (this is the cols argument)
             names_to = "measure",  # make up a new name
             values_to = "count")  # make up a new name
```

There were 56 columns that we collapsed -- that weren't the ID variables. So the first row in the original data expanded to now be 56 rows in the longer data set.

But what if we don't want to collapse them completely? What if we want one column per age group? or per measure instead? Or what if all of the columns aren't of the same type? If some are numeric while others have character or factor data?

In really complicated cases, you may need to split your data into multiple data sets, reshape the parts, and then join the results back together.

In most cases though, you can pivot to a completely long format, and then spread columns back out wider after some additional transformations. If there are different data types, they get converted to the more general one, following: boolean -\> integer -\> numeric -\> character. While you need to be careful if you are working with precise numerical values (decimal values), if you're working with whole numbers or the precision (number of significant digits) in the data is small, converting from numeric to character and back again will not lose information.

## Separate variable names

Ok, back to our WHO data. One trick before we learn how to pivot wider to undo what we did. When we pivot longer, our new "measure" variable actually has multiple pieces of information in it. The variable names are of the format: "new\_" diagnosis \_ m/f age. Ideally, we'd like this information separated, so we have a column for the diagnosis method, a column for the gender, and a column for the age group.

We can tell pivot_longer to do this for us:

```{r, eval=FALSE}
names(who2)

pivot_longer(who2, -country:-year,
             # we'll keep 3 columns: make up a name for each
             names_to=c("diagnosis", "gender", "age"),  
             # this is a regular expression that captures the pattern
             # new_? matches new with an optional _
             # . matches any single character except a newline and + means one or more of the preceding
             # _ matches _
             # [mf] matches either m or f
             # \\d+ matches one or more digits
             # each () set is a value to keep and turn into a column;
             # we are dropping "new"
             names_pattern = "new_?(.+)_([mf])(\\d+)",
             values_to="count")
```

If the variable names had a simpler pattern, for example: year_measure, then we could use the "names_sep" argument instead of the more complicated "names_pattern" to split up the column.

# Pivot Wider

Now, we've got our completely long data. How do we get it into a form where there's one row per country-year-gender-age, with one column per measure? Essentially, we want to split the diagnosis column into multiple columns for each type of diagnosis.

![](images/pivot_wider_new.png)

Image Source: <https://epirhandbook.com/en/pivoting-data.html>

```{r}
# same as above, just save it
long_data <- pivot_longer(who2, -country:-year,
             names_to=c("diagnosis", "gender", "age"), 
             values_to="count",
             names_pattern = "new_?(.+)_([mf])(\\d+)") 

names(long_data)
```

```{r, eval=FALSE}
# spread it back out
pivot_wider(long_data, 
            names_from=diagnosis,
            values_from=count)
```

What happens if our long data doesn't have every combination of values to fill into our new data set? For example, if the data started long instead of wide?

For example, if we had data that looked like:

```{r, eval=FALSE}
long_data %>%
  filter(!is.na(count)) %>%
  sample_n(20) %>%
  arrange(country, year, diagnosis, age, gender)
```

It's ok - we can still pivot wider. By default, it will fill in `NA` where it doesn't have a value, or we can tell it to fill with another value (such as 0).

```{r, eval=FALSE}
long_data %>%
  filter(!is.na(count)) %>%
  sample_n(20) %>%
  arrange(country, year, diagnosis, age, gender) %>%
  pivot_wider(names_from=diagnosis, 
              values_from=count)
```

We could also pivot wider more than one column at once. Let's have one row per country-year-diagnosis, and one column for each age-gender combination:

```{r}
names(long_data)

long_data %>%
  pivot_wider(names_from=c(age, gender),
              # controls how new variable names are created
              # uses syntax from the glue package
              names_glue="{gender}_{age}", 
              values_from=count)
```

### EXERCISE 2

Pivot the `long_data` data frame wider so that each row is a country-diagnosis-gender-age observation (pivot "year" so you have one column for each year in the output).

[Hint]{.underline}: Break down the problem into steps. Think about which column(s) in your original dataset need to be spread out into multiple columns to make the data "wide".

```{r}

```

# Bonus 1: Separate

Much like we had `pivot_longer()` split up the components of the variable names for us into measure, age, and gender, we can use the `separate()` function to do that to columns outside of pivoting.

```{r}
nobels <- tribble(~name, ~category,
                    "Bob Dylan", "Literature",
                    "Anne L'Huillier", "Physics",
                    "Narges Mohammadi", "Peace",
                    "Roger Penrose", "Physics",
                    "Jennifer Doudna", "Chemistry",
                    "Abhijit Banerjee", "Economics",
                    "Jean-Pierre Sauvage", "Chemistry")
nobels
```

We could split the name column into first name and last name, using a space character as a delimiter.

```{r}
separate(nobels,  # data frame name
         name,  # existing column name to split
         into=c("first", "last"),  # new column names 
         sep=" ")  # what to split on
```

You need to know the maximum number of pieces that your data will be split into, or you'll lose some of the information. If Jean-Pierre Sauvage's name had a space instead of a hyphen:

```{r}
nobels <- tribble(~name, ~category,
                    "Bob Dylan", "Literature",
                    "Anne L'Huillier", "Physics",
                    "Narges Mohammadi", "Peace",
                    "Roger Penrose", "Physics",
                    "Jennifer Doudna", "Chemistry",
                    "Abhijit Banerjee", "Economics",
                    "Jean Pierre Sauvage", "Chemistry")

separate(nobels,
         name,
         into=c("first", "last"),
         sep=" ")
```

We get a warning, and "Sauvage" was just dropped completely! (You can control this behavior to some extent with the `extra` argument.)

It's better to specify the maximum number of columns possible instead:

```{r}
separate(nobels,
         name,
         into=c("first1", "first2", "last"),
         sep=" ")
```

Now the warning is telling us it filled in `NA` values where it didn't find 3 pieces after separating. Also, the new column names are not exactly suitable since other observations in the dataset all have last names (not `NA`).

But `separate` really works best when you're separating something that will separate into a fixed number of columns.

Side note: working with names is HARD! For some examples why, see <https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/>

### EXERCISE 3

Separate the "location" variable in `addresses` into separate "city" and "state" columns. Hint: the "sep" value can be more than one character in length.

```{r}
addresses <- tribble(~name, ~location,
                     "Katherine", "Chicago, IL",
                     "Bob", "Bend, OR",
                     "Louka", "Honolulu, HI",
                     "Alex", "San Jose, CA")
```

# Bonus 2: separate_rows()

In cases where you want to separate out a column, but you don't know how many values might be in it, another option is to make one row, instead of one column, per value. It's combining a separate with a pivot longer.

This can be useful if you have a column with a delimited list of values in it. For example, if you ask a multiple choice question that allows multiple answers in Qualtrics (survey platform), you'll get a single column with all of the answer that people checked:

```{r}
survey <- tribble(~id, ~q1, ~q2,
                  45, 4, "Time",
                  46, 3, "Quality;Cost",
                  47, 4, "Time;Cost;Compatibility;Warranty")
survey
```

How do we deal with q2?

```{r}
separate_rows(survey, 
              q2,  # which column to separate
              sep=";")
```

## Creating Indicator Variables

Then, we may want to create one column per answer choice -- create indicator variables (aka dummy variables or one-hot encoding).

```{r}
separate_rows(survey, 
              q2,  # which column to separate
              sep=";") %>%
  mutate(exists=1) %>%  # add a column to encode values later
  pivot_wider(names_from=q2,  # new column names, one for each value
              values_from=exists,  # so there will be a 1 where there was a value
              values_fill=0) ## fills in missing vals with 0 instead of NA
```

# Resources to continue learning

The `{tidyr}` documentation has a tutorial on pivoting: <https://tidyr.tidyverse.org/articles/pivot.html>

# Recap

You now can use different `{tidyr}` commands, such as `pivot_longer`, `pivot_wider`, `separate`, `separate_rows`, to reshape your data in a wide variety of ways.