---
output:
  pdf_document: default
  bookdown::gitbook:
    lib_dir: "book_assets"
    includes:
      in_header: google_analytics.html
  html_document: default
---
# Summarizing data with R (with Lucy King)
This chapter will show you how to summarize data using R, and will also introduce a popular set of R tools known as the "Tidyverse."
Before doing anything else we need to load the libraries that we will use in this notebook.
```{r loadLibraries}
library(tidyverse)
library(cowplot)
library(knitr)
set.seed(123456)
opts_chunk$set(tidy.opts=list(width.cutoff=80))
options(tibble.width = 60)
```
We will use the NHANES dataset for several of our examples, so let's load the library that contains the data.
```{r}
# load the NHANES data library
# first remove any existing copy of the NHANES data frame
# from the workspace, to make sure we have a clean version
rm('NHANES')
library(NHANES)
dim(NHANES)
```
## Introduction to the Tidyverse
In this chapter we will introduce a way of working with data in R that is often referred to as the "Tidyverse." This comprises a set of packages that provide various tools for working with data, as well as a few special ways of using those functions.
### Making a data frame using tibble()
The tidyverse provides its own version of a data frame, which is known as a *tibble*. A tibble is a data frame but with some smart tweaks that make it easier to work with, especially when using functions from the tidyverse. See here for more information on the function `tibble()`: https://r4ds.had.co.nz/tibbles.html
```{r}
# first create the individual variables
n <- c("russ", "lucy", "jaclyn", "tyler")
x <- c(1, 2, 3, 4)
y <- c(4, 5, 6, 7)
z <- c(7, 8, 9, 10)
# create the data frame
myDataFrame <-
tibble(
n, #list each of your columns in the order you want them
x,
y,
z
)
myDataFrame
```
Take a quick look at the properties of the data frame using `glimpse()`:
```{r}
glimpse(myDataFrame)
```
### Selecting an element
There are various ways to access the contents within a data frame.
#### Selecting a row or column by name
```{r}
myDataFrame$x
```
We can also select an individual element using its numeric indices in square brackets; the first index refers to the row, the second to the column.
```{r}
myDataFrame[1, 2]
myDataFrame[2, 3]
```
#### Selecting a row or column by index
```{r}
myDataFrame[1, ]
myDataFrame[, 1]
```
#### Select a set of rows
```{r}
myDataFrame %>%
slice(1:2)
```
`slice()` is a function that selects out rows based on their row number.
You will also notice something we haven't discussed before: `%>%`. This is called a "pipe", which is commonly used within the tidyverse; you can read more [here](http://magrittr.tidyverse.org/). A pipe takes the output from one command and feeds it as input to the next command. In this case, simply writing the name of the data frame (myDataFrame) causes it to be input to the `slice()` command following the pipe. The benefit of pipes will become especially apparent when we want to start stringing together multiple data processing operations into a single command.
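For example, the piped version of the `slice()` command above is equivalent to calling `slice()` with the data frame as its first argument:
```{r}
# these two commands produce exactly the same output
slice(myDataFrame, 1:2)
myDataFrame %>%
  slice(1:2)
```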
In the previous example, no new variable was created - the output was simply printed to the screen, just like it would be if you typed the name of the variable. If you wanted to save it to a new variable, you would use the `<-` assignment operator, like this:
```{r}
myDataFrameSlice <- myDataFrame %>%
slice(1:2)
myDataFrameSlice
```
#### Select a set of rows based on specific value(s)
```{r}
myDataFrame %>%
filter(n == "russ")
```
`filter()` is a function that retains only those rows that meet your stated criteria. We can also filter for multiple criteria at once --- in this example, the `|` symbol indicates "or":
```{r}
myDataFrame %>%
filter(n == "russ" | n == "lucy")
```
#### Select a set of columns
```{r}
myDataFrame %>%
select(x:y)
```
`select()` is a function that keeps only those columns you specify, using their names.
You can also specify a vector of columns to select.
```{r}
myDataFrame %>%
select(c(x,z))
```
### Adding a row or column
To add a new row, first create a single-row data frame and then bind it to the existing data frame:
```{r}
tiffanyDataFrame <-
tibble(
n = "tiffany",
x = 13,
y = 14,
z = 15
)
myDataFrame %>%
bind_rows(tiffanyDataFrame)
```
`bind_rows()` is a function that appends the rows of another data frame to the current data frame.
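To add a column, you can similarly use the `dplyr` function `bind_cols()`, which appends the columns of another data frame. Here is a minimal sketch (the data frame `wDataFrame` and its column `w` are just for illustration):
```{r}
# create a single-column data frame with one value per existing row
wDataFrame <- tibble(w = c(10, 11, 12, 13))
# append it to myDataFrame as a new column
myDataFrame %>%
  bind_cols(wDataFrame)
```
Another common way to add a column is `mutate()`, which we turn to in the next section.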
## Creating or modifying variables using `mutate()`
Often we will want to either create a new variable based on an existing variable, or modify the value of an existing variable. Within the tidyverse, we do this using a function called ```mutate()```. Let's start with a toy example by creating a data frame containing a single variable.
```{r}
toy_df <- data.frame(x = c(1,2,3,4))
glimpse(toy_df)
```
Let's say that we wanted to create a new variable called `y` that would contain the value of x multiplied by 10. We could do this using ```mutate()``` and then assign the result back to the same data frame:
```{r}
toy_df <- toy_df %>%
# create a new variable called y that contains x*10
mutate(y = x*10)
glimpse(toy_df)
```
We could also overwrite a variable with a new value:
```{r}
toy_df2 <- toy_df %>%
# overwrite y with its current value plus one
mutate(y = y + 1)
glimpse(toy_df2)
```
We will use `mutate()` often so it's an important function to understand.
Here we can use it with our example data frame to create a new variable that is the sum of several other variables.
```{r}
myDataFrame <-
myDataFrame %>%
mutate(total = x + y + z)
kable(myDataFrame)
```
`mutate()` is a function that creates a new variable in a data frame using the existing variables. In this case, it creates a variable called `total` that is the sum of the existing variables `x`, `y`, and `z`.
### Remove a column using the select() function
Adding a minus sign to the name of a variable within the `select()` command will remove that variable, leaving all of the others.
```{r}
myDataFrame <-
myDataFrame %>%
dplyr::select(-total)
kable(myDataFrame)
```
## Tidyverse in action
To see the tidyverse in action, let's clean up the NHANES dataset. Each individual in the NHANES dataset has a unique identifier stored in the variable ```ID```. First let's look at the number of rows in the dataset:
```{r}
nrow(NHANES)
```
Now let's see how many unique IDs there are. The ```unique()``` function returns a vector containing all of the unique values for a particular variable, and the ```length()``` function returns the length of the resulting vector.
```{r}
length(unique(NHANES$ID))
```
This shows us that while there are 10,000 observations in the data frame, there are only `r I(length(unique(NHANES$ID)))` unique IDs. This means that if we were to use the entire dataset, we would be reusing data from some individuals, which could give us incorrect results. For this reason, we would like to discard any observations that are duplicated.
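One quick way to check this is to count the rows whose ID already appeared in an earlier row, for example using base R's `duplicated()` function:
```{r}
# count rows whose ID duplicates an earlier row's ID
sum(duplicated(NHANES$ID))
```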
Let's create a new variable called ```NHANES_unique``` that will contain only the distinct observations, with no individuals appearing more than once. The `dplyr` library provides a function called ```distinct()``` that will do this for us. You may notice that we didn't explicitly load the `dplyr` library above; however, if you look at the messages that appeared when we loaded the `tidyverse` library, you will see that it loaded `dplyr` for us. To create the new data frame with unique observations, we will pipe the NHANES data frame into the ```distinct()``` function and then save the output to our new variable.
```{r dropDupes, warning=FALSE}
NHANES_unique <-
NHANES %>%
distinct(ID, .keep_all = TRUE)
```
If we count the number of rows in the new data frame, it should be the same as the number of unique IDs (`r I(length(unique(NHANES$ID)))`):
```{r}
nrow(NHANES_unique)
```
In the next example you will see the power of pipes come to life, when we start tying together multiple functions into a single operation (or "pipeline").
## Looking at individual variables using pull() and head()
The NHANES data frame contains a large number of variables, but usually we are only interested in a particular variable. We can extract a particular variable from a data frame using the ```pull()``` function. Let's say that we want to extract the variable `PhysActive`. We could do this by piping the data frame into the pull command, which will result in a vector of many thousands of values. Instead of printing out this entire vector, we will pipe the result into the ```head()``` function, which just shows us the first few values contained in a variable. In this case we are not assigning the value back to a variable, so it will simply be printed to the screen.
```{r}
NHANES %>%
# extract the PhysActive variable
pull(PhysActive) %>%
# extract the first 10 values
head(10) %>%
kable()
```
There are two important things to notice here. The first is that there are three different values apparent in the answers: "Yes", "No", and `<NA>`, which means that the value is missing for this person (perhaps they didn't want to answer that question on the survey). When we are working with data we generally need to remove missing values, as we will see below.
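If you just want a quick count of how many values are missing for a variable, one simple check is to combine `is.na()` with `sum()`:
```{r}
# count the missing values of PhysActive in the full dataset
sum(is.na(NHANES$PhysActive))
```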
The second thing to notice is that R prints out a list of "Levels" of the variable. This is because this variable is defined as a particular kind of variable in R known as a *factor*. You can think of a factor variable as a categorical variable with a specific set of levels. The missing data are not treated as a level, so it can be useful to make the missing values explicit, which can be done using a function called ```fct_explicit_na()``` in the `forcats` package. Let's add a line to do that:
```{r}
NHANES %>%
mutate(PhysActive = fct_explicit_na(PhysActive)) %>%
# extract the PhysActive variable
pull(PhysActive) %>%
# extract the first 10 values
head(10) %>%
kable()
```
This new line overwrote the old value of `PhysActive` with a version that has been processed by the ```fct_explicit_na()``` function to convert `<NA>` values to an explicit level. Now you can see that missing values are treated as an explicit level, which will be useful later.
Now we are ready to start summarizing data!
## Computing a frequency distribution (Section \@ref(frequency-distributions))
We would like to compute a frequency distribution showing how many people report being either active or inactive. The following statement is fairly complex so we will step through it one bit at a time.
```{r makePhysActiveTable}
PhysActive_table <- NHANES_unique %>%
# convert the implicit missing values to explicit
mutate(PhysActive = fct_explicit_na(PhysActive)) %>%
# select the variable of interest
dplyr::select(PhysActive) %>%
# group by values of the variable
group_by(PhysActive) %>%
# count the values
summarize(AbsoluteFrequency = n())
# kable() prints out the table in a prettier way.
kable(PhysActive_table)
```
This data frame still contains all of the original variables. Since we are only interested in the `PhysActive` variable, let's extract that one and get rid of the rest. We can do this using the ```select()``` command from the `dplyr` package. Because there is also another select command available in R, we need to explicitly refer to the one from the `dplyr` package, which we do by including the package name followed by two colons: ```dplyr::select()```.
```{r}
NHANES_unique %>%
# convert the implicit missing values to explicit
mutate(PhysActive = fct_explicit_na(PhysActive)) %>%
# select the variable of interest
dplyr::select(PhysActive) %>%
head(10) %>%
kable()
```
The next function, ```group_by()```, tells R that we are going to want to analyze the data separately according to the different levels of the `PhysActive` variable:
```{r}
NHANES_unique %>%
# convert the implicit missing values to explicit
mutate(PhysActive = fct_explicit_na(PhysActive)) %>%
# select the variable of interest
dplyr::select(PhysActive) %>%
group_by(PhysActive) %>%
head(10) %>%
kable()
```
The final command tells R to create a new data frame by summarizing the data that we are passing in (which in this case is the PhysActive variable, grouped by its different levels). We tell the ```summarize()``` function to create a new variable (called `AbsoluteFrequency`) that will contain a count of the number of observations for each group, which is generated by the ```n()``` function.
```{r}
NHANES_unique %>%
# convert the implicit missing values to explicit
mutate(PhysActive = fct_explicit_na(PhysActive)) %>%
# select the variable of interest
dplyr::select(PhysActive) %>%
group_by(PhysActive) %>%
summarize(AbsoluteFrequency = n()) %>%
kable()
```
Now let's say we want to add another column with percentage of observations in each group. We compute the percentage by dividing the absolute frequency for each group by the total number. We can use the table that we already generated, and add a new variable, again using ```mutate()```:
```{r}
PhysActive_table <- PhysActive_table %>%
mutate(
Percentage = AbsoluteFrequency /
sum(AbsoluteFrequency) * 100
)
kable(PhysActive_table, digits=2)
```
## Computing a cumulative distribution (Section \@ref(cumulative-distributions))
Let's compute a cumulative distribution for the `SleepHrsNight` variable in NHANES. This looks very similar to what we saw in the previous section.
```{r}
# create summary table for relative frequency of different
# values of SleepHrsNight
SleepHrsNight_cumulative <-
NHANES_unique %>%
# drop NA values for SleepHrsNight variable
drop_na(SleepHrsNight) %>%
# remove other variables
dplyr::select(SleepHrsNight) %>%
# group by values
group_by(SleepHrsNight) %>%
# create summary table
summarize(AbsoluteFrequency = n()) %>%
# create relative and cumulative frequencies
mutate(
RelativeFrequency = AbsoluteFrequency /
sum(AbsoluteFrequency),
CumulativeDensity = cumsum(RelativeFrequency)
)
kable(SleepHrsNight_cumulative)
```
## Data cleaning and tidying with R
Now that you know a bit about the tidyverse, let's look at the various tools that it provides for working with data. As an example, we will analyze whether attitudes about statistics differ between the student year groups in the class.
### Statistics attitude data from course survey
These data were collected using the Attitudes Towards Statistics (ATS) scale (from https://www.stat.auckland.ac.nz/~iase/cblumberg/wise2.pdf).
The 29-item ATS has two subscales. The Attitudes Toward Field subscale consists of the following 20 items, with reverse-keyed items indicated by an “(R)”:
1, 3, 5, 6(R), 9, 10(R), 11, 13, 14(R), 16(R), 17, 19, 20(R), 21, 22, 23, 24, 26, 28(R), 29
The Attitudes Toward Course subscale consists of the following 9 items:
2(R), 4(R), 7(R), 8, 12(R), 15(R), 18(R), 25(R), 27(R)
For our purposes, we will just combine all 29 items together, rather than separating them into these subscales.
Note: I have removed the data from the graduate students and 5+ year students, since those would be too easily identifiable given how few there are.
Let's first save the file path to the data.
```{r}
attitudeData_file <- 'data/statsAttitude.txt'
```
Next, let's load the data from the file using the tidyverse function `read_tsv()`. There are several functions available for reading in different file formats as part of the `readr` tidyverse package.
```{r, echo=FALSE, message=FALSE}
attitudeData <- read_tsv(attitudeData_file)
```
Right now the variable names are unwieldy, since they include the entire name of the item; this is how Google Forms stores the data. Let's change the variable names to be somewhat more readable. We will change the names to `ats<X>`, where `<X>` is replaced with the question number and "ats" indicates the Attitudes Toward Statistics scale. We can create these names using the `rename()` and `paste0()` functions. `rename()` is pretty self-explanatory: a new name is assigned to an old name or a column position. The `paste0()` function takes a string along with a set of numbers, and creates a vector that combines the string with each of the numbers.
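For example, here is what `paste0()` does on its own:
```{r}
# paste0() combines the string with each of the numbers
paste0('ats', 1:3)
```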
```{r}
nQuestions <- 29 # other than the first two columns,
# the rest of the columns are for the 29 questions in the statistics
# attitude survey; we'll use this below to rename these columns
# based on their question number
# use rename to change the first two column names
# rename can refer to columns either by their number or their name
attitudeData <-
attitudeData %>%
rename( # rename using column numbers
# The first column is the year
Year = 1,
# The second column indicates
# whether the person took stats before
StatsBefore = 2
) %>%
rename_at(
# rename all the columns except Year and StatsBefore
vars(-Year, -StatsBefore),
# rename by pasting the word "ats" and the question number
list(name = ~paste0('ats', 1:nQuestions))
)
# print out the column names
names(attitudeData)
#check out the data again
glimpse(attitudeData)
```
The next thing we need to do is to create an ID for each individual. To do this, we will use the `rownames_to_column()` function from the tidyverse. This creates a new variable (which we name "ID") that contains the row names from the data frame; these are simply the numbers 1 to N.
```{r}
# let's add a participant ID so that we will be able to
# identify them later
attitudeData <-
attitudeData %>%
rownames_to_column(var = 'ID')
head(attitudeData)
```
If you look closely at the data, you can see that some of the participants have some missing responses. We can count them up for each individual and store this count in a new variable called `numNA` using `mutate()`.
We can also create a table showing how many participants have a particular number of NA values. Here we return to two commands we used earlier. The `group_by()` function tells other functions to do their analyses while breaking the data into groups based on one of the variables; here we want to summarize the number of people with each possible number of NAs, so we will group responses by the `numNA` variable that we create in the first command below.
The `summarize()` function creates a summary of the data, with new variables based on the data being fed in. In this case, we just want to count up the number of participants in each group, which we can do using the special `n()` function from `dplyr`. The code below uses `count()`, which is a convenient shorthand for `group_by()` followed by `summarize(n = n())`, as the short sketch below shows.
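A minimal sketch of that equivalence on a toy data frame (the names `toy_counts` and `g` are just for illustration):
```{r}
# a toy data frame with a single grouping variable
toy_counts <- tibble(g = c("a", "a", "b"))
# count() ...
toy_counts %>%
  count(g)
# ... gives the same table as group_by() followed by summarize(n = n())
toy_counts %>%
  group_by(g) %>%
  summarize(n = n())
```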
```{r}
# compute the number of NAs for each participant
attitudeData <-
attitudeData %>%
mutate(
# we use the . symbol to tell the is.na function
# to look at the entire data frame
numNA = rowSums(is.na(.))
)
# present a table with counts of the number of missing responses
attitudeData %>%
count(numNA)
```
We can see from the table that there are only a few participants with missing data; six people are missing one answer, and one is missing two answers. Let's find those individuals, using the `filter()` command from `dplyr`. `filter()` returns the subset of rows from a data frame that match a particular test - in this case, whether `numNA` is greater than 0.
```{r}
attitudeData %>%
filter(numNA > 0)
```
There are fancy techniques for trying to guess the value of missing data (known as "imputation"), but since the number of participants with missing values is small, let's just drop those participants from the list. We can do this using the `drop_na()` function from the `tidyr` package, another tidyverse package that provides tools for cleaning data. We will also remove the `numNA` variable, since we won't need it anymore after removing the participants with missing answers. We do this using the `select()` function from the `dplyr` tidyverse package, which selects or removes columns from a data frame. By putting a minus sign in front of `numNA`, we are telling it to remove that column.
`select()` and `filter()` are similar - `select()` works on columns (i.e. variables) and `filter()` works on rows (i.e. observations).
```{r}
# this is equivalent to drop_na(attitudeData)
attitudeDataNoNA <-
attitudeData %>%
drop_na() %>%
select(-numNA)
```
Try the following on your own: using the `attitudeData` data frame, drop the NA values and create a new variable called `mystery` that contains a value of 1 for anyone who answered 7 to question `ats4` ("Statistics seems very mysterious to me"). Then create a summary that includes the number of people reporting 7 on this question, and the proportion of people who reported 7.
#### Tidy data
These data are in a format that meets the principles of "tidy data", which state the following:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
In our case, each column represents a variable: `ID` identifies which student responded, `Year` contains their year at Stanford, `StatsBefore` contains whether or not they have taken statistics before, and `ats1` through `ats29` contain their responses to each item on the ATS scale. Each observation (row) is a response from one individual student. Each value has its own cell (e.g., the values for `Year` and `StatsBefore` are stored in separate cells in separate columns).
For an example of data that are NOT tidy, take a look at these data [Belief in Hell](http://www.pewforum.org/religious-landscape-study/belief-in-hell/#generational-cohort) - click on the "Table" tab to see the data.
- What are the variables?
- Why aren't these data tidy?
#### Recoding data
We now have tidy data; however, some of the ATS items require recoding. Specifically, some of the items need to be "reverse coded"; these items include: ats2, ats4, ats6, ats7, ats10, ats12, ats14, ats15, ats16, ats18, ats20, ats25, ats27 and ats28. The raw responses for each item are on the 1-7 scale; therefore, for the reverse coded items, we need to reverse them by subtracting the raw score from 8 (such that 7 becomes 1 and 1 becomes 7). To recode these items, we will use the tidyverse `mutate()` function. It's a good idea when recoding to preserve the raw original variables and create new recoded variables with different names.
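As a quick check of the reversal arithmetic:
```{r}
# subtracting from 8 flips the 1-7 scale: 1 becomes 7 and 7 becomes 1
8 - c(1, 4, 7)
```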
There are two ways we can use the `mutate()` function to recode these variables. The first way is easier to understand if you are new to coding, but it is less efficient and more prone to error: we simply repeat the same code for every variable we want to reverse code, as follows:
```{r}
attitudeDataNoNA %>%
mutate(
ats2_re = 8 - ats2,
ats4_re = 8 - ats4,
ats6_re = 8 - ats6,
ats7_re = 8 - ats7,
ats10_re = 8 - ats10,
ats12_re = 8 - ats12,
ats14_re = 8 - ats14,
ats15_re = 8 - ats15,
ats16_re = 8 - ats16,
ats18_re = 8 - ats18,
ats20_re = 8 - ats20,
ats25_re = 8 - ats25,
ats27_re = 8 - ats27,
ats28_re = 8 - ats28
)
```
The second way is more efficient and takes advantage of "scoped verbs" (https://dplyr.tidyverse.org/reference/scoped.html), which allow you to apply the same code to several variables at once. Because you don't have to keep repeating the same code, you're less likely to make an error:
```{r}
#create a vector of the names of the variables to recode
ats_recode <-
c(
"ats2",
"ats4",
"ats6",
"ats7",
"ats10",
"ats12",
"ats14",
"ats15",
"ats16",
"ats18",
"ats20",
"ats25",
"ats27",
"ats28"
)
attitudeDataNoNA <-
attitudeDataNoNA %>%
mutate_at(
vars(ats_recode), # the variables you want to recode
funs(re = 8 - .) # the function to apply to each variable
)
```
Whenever we do an operation like this, it's good to check that it actually worked correctly. It's easy to make mistakes in coding, which is why it's important to check your work as well as you can.
We can quickly select a couple of the raw and recoded columns from our data and make sure things appear to have gone according to plan:
```{r}
attitudeDataNoNA %>%
select(
ats2,
ats2_re,
ats4,
ats4_re
)
```
Let's also make sure that there are no responses outside of the 1-7 scale that we expect, and make sure that no one specified a year outside of the 1-4 range.
```{r}
attitudeDataNoNA %>%
summarise_at(
vars(ats1:ats28_re),
funs(min, max)
)
attitudeDataNoNA %>%
summarise_at(
vars(Year),
funs(min, max)
)
```
#### Different data formats
Sometimes we need to reformat our data in order to analyze it or visualize it in a specific way. Two tidyverse functions, `gather()` and `spread()`, help us to do this.
For example, say we want to examine the distribution of the raw responses to each of the ATS items (i.e., a histogram). In this case, we would need our x-axis to be a single column of the responses across all the ATS items. However, currently the responses for each item are stored in 29 different columns.
This means that we need to create a new version of this dataset. It will have four columns:
- ID
- Year
- Question (for each of the ATS items)
- ResponseRaw (for the raw response to each of the ATS items)
Thus, we want to change the format of the dataset from "wide" to "long".
We change the format to "long" using the `gather()` function.
`gather()` takes a number of variables and reformats them into two variables: one that contains the variable values, and another called the "key" that tells us which variable each value came from. In this case, we want it to reformat the data so that each response to an ATS question is in a separate row and the key column tells us which ATS question it corresponds to. It is much better to see this in practice than to explain it in words!
```{r}
attitudeData_long <-
attitudeDataNoNA %>%
#remove the raw variables that you recoded
select(-ats_recode) %>%
gather(
# key refers to the new variable containing the question number
key = question,
# value refers to the new response variable
value = response,
#the only variables we DON'T want to gather
-ID, -Year, -StatsBefore
)
attitudeData_long %>%
slice(1:20)
glimpse(attitudeData_long)
```
Say we now wanted to undo the `gather()` and return our dataset to wide format. For this, we would use the function `spread()`.
```{r}
attitudeData_wide <-
attitudeData_long %>%
spread(
#key refers to the variable indicating which question
# each response belongs to
key = question,
value = response
)
attitudeData_wide %>%
slice(1:20)
```
Now that we have created a "long" version of our data, they are in the right format to create the plot. We will use the tidyverse function `ggplot()` to create our histogram with `geom_histogram`.
```{r}
attitudeData_long %>%
ggplot(aes(x = response)) +
geom_histogram(binwidth = 0.5) +
scale_x_continuous(breaks = seq.int(1, 7, 1))
```
It looks like responses were fairly positive overall.
We can also aggregate each participant's responses to each question during each year of their study at Stanford to examine the distribution of mean ATS responses across people by year.
We will use the `group_by()` and `summarize()` functions to aggregate the responses.
```{r}
attitudeData_agg <-
attitudeData_long %>%
group_by(ID, Year) %>%
summarize(
mean_response = mean(response)
)
attitudeData_agg
```
First let's use `geom_density()` in `ggplot()` to look at mean responses across people, ignoring year of response. The density plot is like a histogram but smooths things over a bit.
```{r}
attitudeData_agg %>%
ggplot(aes(mean_response)) +
geom_density()
```
Now we can also look at the distribution for each year.
```{r}
attitudeData_agg %>%
ggplot(aes(mean_response, color = factor(Year))) +
geom_density()
```
Or look at trends in responses across years.
```{r}
attitudeData_agg %>%
group_by(Year) %>%
summarise(
mean_response = mean(mean_response)
) %>%
ggplot(aes(Year, mean_response)) +
geom_line()
```
This looks like a precipitous drop - but how might that be misleading?