---
title: "dplyr grouping"
output:
  html_document:
    df_print: paged
    code_download: TRUE
    toc: true
    toc_depth: 1
editor_options:
  chunk_output_type: inline
---
```{r, setup, include=FALSE}
knitr::opts_chunk$set(
  eval=FALSE, warning=FALSE, error=FALSE
)
```
*The output of most of the R chunks isn't included in the HTML version of the file to keep it to a more reasonable file size. You can run the code in R to see the output.*
This is an [R Markdown](https://rmarkdown.rstudio.com/) document. Follow the link to learn more about R Markdown and the notebook format used during the workshop.
# Setup
```{r, eval=TRUE, message=FALSE}
library(tidyverse)
```
For this session, we're learning three additional dplyr functions: `group_by()`, `summarize()`, and `arrange()`.
## Data
We're using the same data as session 1.
```{r, eval=TRUE}
police <- read_csv("https://raw.githubusercontent.com/nuitrcs/r-tidyverse/main/data/ev_police.csv",
                   col_types=c("location"="c"))
```
# Summarize
We'll start with `summarize()` (or `summarise()` -- the British spelling is accepted). We use `mutate()` when we want the output to have the same length as the input -- when we're operating on the individual elements in a vector and want a value for every row in the data. When we want to condense multiple values down to a single value (or a few values), such as taking the mean or standard deviation of a vector, we use `summarize()` instead.
```{r}
police %>%
  mutate(vehicle_age = 2017-vehicle_year) %>% # computing a new variable first
  summarize(mean_vehicle_age = mean(vehicle_age))
```
Note that even though there's just one value, we get a tibble returned. This is what to expect with the tidyverse.
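One quick way to verify this is to check the class of the result (a small sketch; `n()` simply counts rows):

```{r}
police %>%
  summarize(n_rows = n()) %>%
  class() # "tbl_df" "tbl" "data.frame" -- still a tibble, not a plain number
```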
As a side note, if we needed the single value (or a single vector), we could `pull()` it out of the tibble/data frame:
```{r}
police %>%
  mutate(vehicle_age = 2017-vehicle_year) %>% # computing a new variable first
  summarize(mean_vehicle_age = mean(vehicle_age)) %>%
  pull()
```
We can compute more than one summary measure at the same time:
```{r}
police %>%
  mutate(vehicle_age = 2017-vehicle_year) %>% # computing a new variable first
  summarize(mean_vehicle_age = mean(vehicle_age),
            sd_vehicle_age = sd(vehicle_age),
            min_date = min(date),
            max_date = max(date))
```
We get one column per summary variable we create. Once we group below, we'll see why we get the output in columns instead of rows.
### EXERCISE
Use `summarize()` to compute the `min()` and `max()` of `vehicle_year`.
```{r}
```
## Across
If we want to apply the same summary functions to multiple columns in our data frame, we can write out all of the summary commands explicitly, or we can use `across()` to select which variables to summarize with which functions.
Let's use the `n_distinct()` function to count the number of distinct values in each column (`n_distinct(x)` is the same as `length(unique(x))`). This will help us see which columns don't have useful information because every value is the same.
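As a quick check of that equivalence, here's a sketch using a single column:

```{r}
n_distinct(police$subject_race)
length(unique(police$subject_race)) # should return the same number
```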
`across()` selects columns using the helper functions you could give to `select()` directly. We'll use `everything()` here to select all columns.
```{r}
police %>%
  summarize(across(everything(), n_distinct))
```
If you wanted to select columns using their names, put them in a vector (so it's a single input argument):
```{r}
police %>%
  summarize(across(c(date, time, location, beat, subject_age), n_distinct))
```
If we want to apply multiple functions:
```{r}
police %>%
  summarize(across(!where(is.character), ## select columns that are not of type character
                   list(min, max)))      # take the min and max of each column
```
To fix the names in the output, explicitly name our summary functions in the list:
```{r}
police %>%
  summarize(across(!where(is.character), ## select columns that are not of type character
                   list(min_val=min, max_val=max))) # take the min and max of each column
```
There are other options for output naming patterns available too.
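For example, `across()` has a `.names` argument that takes a glue-style pattern, where `{.col}` stands for the column name and `{.fn}` for the function name. A sketch of how that might look:

```{r}
police %>%
  summarize(across(!where(is.character),
                   list(min = min, max = max),
                   .names = "{.fn}_{.col}")) # e.g. min_vehicle_year, max_vehicle_year
```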
### EXERCISE
What if we want to do something to each column that is more complicated than a single function? For example, counting missing values, which you can do with `sum(is.na(x))`. You can write your own function!
```{r, eval=TRUE}
na_count <- function(x) {
  sum(is.na(x))
}
# Example for a single vector:
na_count(police$subject_age)
```
Now use this new `na_count()` function to count the missing values in each column:
```{r}
police %>%
  summarize(across(___, ___))
```
This is similar to using the `apply()` function in base R:
```{r}
apply(police, 2, na_count)
```
#### Advanced Aside
Instead of creating the function ahead of time, we could define it as part of the `summarize()` call. We use `.x` to stand in for each column, and we put `~` in front of the expression to signal that we're creating a function in place (a lambda function).
```{r}
police %>%
  summarize(across(everything(),
                   ~sum(is.na(.x)))) # define a function here that operates on each column
```
This is an advanced technique outside of the scope of this workshop. But if you see a `~` inside `across()`, at least know that someone is defining their own function inline.
# Group By
With base R, when we want to compute summary measures or do other computation on groups in our data (as defined by some grouping variable), we use functions such as `tapply()` or `aggregate()`. With dplyr, we can explicitly group our tibble into subgroups. This isn't very useful by itself, but it is often combined with `summarize()` to compute summary measures by group.
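For comparison, a grouped mean in base R might look something like this (just a sketch for context; we won't use `tapply()` again in this workshop):

```{r}
# mean vehicle year for each subject_sex, base R style
tapply(police$vehicle_year, police$subject_sex, mean, na.rm = TRUE)
```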
First, what if we just group:
```{r}
police %>%
  group_by(outcome)
```
When we print this in the console,
```
# A tibble: 14,792 x 29
# Groups:   outcome [2]
   raw_row_number date       time  location beat  subject_age subject_race subject_sex department_id
            <dbl> <date>     <tim>    <dbl> <chr> <lgl>       <chr>        <chr>               <dbl>
 1       11249746 2017-01-01 00:56    60202 72    NA          white        male                13178
 2       11249747 2017-01-01 04:43    60643 71    NA          black        male                13178
 3       11249748 2017-01-01 09:21    60625 78    NA          black        female              13178
...
```
we see that the tibble (data frame) is grouped by outcome and that there are two groups. It doesn't rearrange the rows; it just keeps track of the groups for us.
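If you ever want to inspect the grouping structure itself, dplyr has a few helper functions (a brief aside; these aren't needed for the rest of the session):

```{r}
police %>%
  group_by(outcome) %>%
  group_vars() # which columns define the groups

police %>%
  group_by(outcome) %>%
  n_groups() # how many groups there are
```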
Now, combine with `summarize()`. But first, let's make the `vehicle_age` column we've been using actually part of the `police` dataset so that we don't have to keep creating it:
```{r, eval=TRUE}
police <- mutate(police, vehicle_age = 2017-vehicle_year)
```
Now, group and summarize:
```{r}
police %>%
  group_by(subject_sex) %>%
  summarize(mean_vehicle_age = mean(vehicle_age),
            sd_vehicle_age = sd(vehicle_age))
```
Now we get one row for each group, and one column for each summary measure.
We can group by multiple columns, and we'll get all of the combinations of values present across the columns:
```{r}
police %>%
  group_by(subject_sex, subject_race) %>%
  summarize(mean_vehicle_age = mean(vehicle_age),
            sd_vehicle_age = sd(vehicle_age))
```
Let's compute the ratio of warnings to citations by `subject_race`. Note that we can use the variables we create in later expressions within the same call to `summarize()`:
```{r}
police %>%
  group_by(subject_race) %>%
  summarize(warnings = sum(outcome == "warning"),
            citations = sum(outcome == "citation"),
            ratio = warnings/citations)
```
There's considerable variation here, from 1.1 to 2.2 warnings for every citation given.
### EXERCISE
Compute the `min()` and `max()` of `vehicle_year` for each `vehicle_make`.
```{r}
```
## Ungrouping
If you ever have a grouped data frame, you may need to ungroup it to get rid of the groups. To do so, use `ungroup()`:
```{r}
police %>%
  group_by(outcome) %>%
  ungroup()
```
## Slicing
One operation we could do with a grouped tibble is to select just certain rows from each group. For example, we could use the `slice()` function to select the first row from each group:
```{r}
police %>%
  select(outcome, everything()) %>% # to reorder columns for output
  group_by(outcome) %>%
  slice(1)
```
If you look at this output in the console, you'll see the resulting tibble still has groups in it. This is a case where you might want to ungroup:
```{r}
police %>%
  select(outcome, everything()) %>% # to reorder columns for output
  group_by(outcome) %>%
  slice(1) %>%
  ungroup()
```
# Arrange
Finally, we come to `arrange()`, which is how we sort the rows in our data. We mostly use this when viewing our data, but it's also useful when we need to compute lags and leads in a time series, when we want to select just a few rows from each group, or for any other order-sensitive transformation of our data.
```{r}
arrange(police, time)
```
To sort in reverse order, wrap the column name in `desc()`.
```{r}
arrange(police, desc(date))
```
Arrange by multiple columns, in order:
```{r}
arrange(police, date, desc(time))
```
An example where order matters: computing the time between consecutive stops in the dataset:
```{r}
police %>%
  arrange(date, time) %>%
  mutate(datetime = lubridate::ymd_hms(paste(date, time)), # combine to single value
         time_since_last = datetime - lag(datetime)) %>%   # current time - previous time
  select(datetime, time_since_last)
```
### EXERCISE
Sort the data by `vehicle_make` and then `vehicle_year`.
```{r}
```
# Count
A bonus function that I use frequently: `count()`. It's how you'd get output similar to `table()` in base R.
By itself, it counts the number of rows:
```{r}
police %>%
  count()
```
If you supply the name of a column, it counts the number of rows for each value of that column:
```{r}
police %>%
  count(subject_sex)
```
This gives the same counts as if you grouped the data first:
```{r}
police %>%
  group_by(subject_sex) %>%
  count()
```
You can group by multiple columns directly with `count()`:
```{r}
police %>%
  count(subject_sex, subject_race)
```
### EXERCISE
How many times does each type of violation appear in the dataset? Bonus: sort the results from most to least frequent. You can do this with `arrange()` or look at the documentation for `count()` to find another option.
```{r}
```
# Recap
We've now covered the core dplyr functions: six "verbs" of common actions we may take with a data frame. There are lots of helper functions, variations, and special cases that can come up when using these functions, but we've covered the essentials.
All of these functions were for working with a single data frame. Next session we'll talk about dplyr functions that will help us combine two different data frames together.
To learn more about some issues in computing police bias statistics, see https://fivethirtyeight.com/features/why-statistics-dont-capture-the-full-extent-of-the-systemic-bias-in-policing/ or from Northwestern researchers Prof. Redbird and Kat Albrecht, https://redbird.shinyapps.io/police-bias-map/.