Solutions_Day3.qmd

---
output:
  html_document:
    df_print: paged
    code_download: TRUE
    toc: true
    toc_depth: 1
editor_options:
  chunk_output_type: console
---

```{r, setup, include=FALSE}
knitr::opts_chunk$set(
  eval=FALSE, warning=FALSE, error=FALSE
)
```

# Data Cleaning

... or put differently, welcome to the real world. Real datasets are seldom as tidy and clean as those you have been working with so far. Real data is messy and often you'll need to clean and coerce it into the right format in order to actually analyze it. This session contains a number of examples that serve as cookbook recipes for common data wrangling tasks.

## Using the R Help

One of the most important features that you will need as you start on your path to becoming and R user is getting help. No matter how skilled of a programmer you are, you always encounter new functions and need help understanding how to use them. We will take some time to explore the help features in R.

To get online help within an R session we use the `help()` function. This will allow us to see the documentation on a function that we are interested in.

Let's say we want to import a csv file and we know we need to use the `read.csv` function:

```{r}
help(read.csv)
```

You can get help with the `?` function:

```{r}
?read.csv
```

Aside from the documentation, another method is to search the internet for further help. Here are some other resources:

-   [The Comprehensive R Archive Network](https://cran.r-project.org/)

-   [UCLA Statistical Computing Help Site](https://stats.oarc.ucla.edu/r/)

-   [Google Search Site](http://statseducation.com/Introduction-to-R/modules/getting%20started/help/www.google.com)

-   [Github](http://statseducation.com/Introduction-to-R/modules/getting%20started/help/www.github.com)

-   [StackExchange](http://stackexchange.com/)

### EXERCISE

Look up the documentation for `lm` . What does this function do? Can you run some examples of this function? What inputs does it take? What outputs does it return?

```{r}
?lm
```

## Debugging: errors and warnings

A very useful skill in coding is learning to debug your code. R helpfully provides warning messages or error messages to help you understand where you might be making a mistake.

Let's see a few examples:

```{r}
# Error: object not found
my_vector

# Error: non-numeric argument to binary operator
x <- 1:10
x + "a" 

# Warning: longer object length is not a multiple of shorter object length
1:10 + 1:3

# Warning: Unknown or uninitialised column
penguins[penguins$speciess]
```

Once you identify an error, you can then debug your code by fixing the source of the error and testing the code line by line. [How to Debug Your Code for Beginners](https://www.freecodecamp.org/news/what-is-debugging-how-to-debug-code/#heading-how-to-debug-your-code) provides some helpful tips to get new coders started!

## Useful functions

This section will list several basic R functions that are very useful and should be part of your toolbox.

### ifelse

`ifelse()` allows you to change the values of a vector based on a conditional test. It's useful for recoding data, and you can also use it in situations where you don't want to change the underlying data.

The format is:

```{r}
ifelse(test_condition, value_if_true, value_if_false)
```

where the test condition is usually some comparison or operation on a vector - anything that results in TRUE or FALSE values

```{r}
x <- c(-1, 2, 3, -5, 3, NA, -4, 6)
x >= 0

ifelse(x >= 0, x, 0)  # keep positive (>= 0) values as they are, replace others with 0

ifelse(is.na(x), mean(x, na.rm = TRUE), x)  # replace missing values with the mean, leave others alone
```

`ifelse()` looks at each value in the initial TRUE/FALSE vector, and it decides what the value at each position should then be.

%% is an operator that returns the value left after dividing by an integer -- the remainder

```{r}
4 %% 2
7 %% 2

x %% 2

ifelse(x %% 2 == 0, "even", "odd")  ## remainder of dividing x by 2 is 0 
```

There's also the useful `replace_na()` function in the tidyr package and `na_if()` in dplyr.

```{r}
library(tidyr)
library(dplyr)

# replace any NA / missing values with user input
replace_na(x, mean(x, na.rm = TRUE)) # replace NA with 0

# make an element into NA / missing if it is undesired value
y <- c("abc", "def", "", "ghi") # remove an empty string
na_if(y, "")
```

### EXERCISE

Where the value in the vector below is positive, take the square. Where it is negative, substitute in `NA`.

```{r}
y <- c(4, -1, 3, 6, -7, 10, 20)

ifelse(y > 0, y^2, NA)
```

### EXERCISE

Recode the vector below such that values at or above the average are replaced with "tall" and values below average are replaced with "short"

Remember: break this into parts. Start with computing the mean of x, then use that in an expression to determine which values of x are above that mean

```{r}
y <- c(4, 6, 5, 7, 3, 6, 4, 5, 3, 6, 6, 5, 6, 7, 6, 4, 6, 5, 7, 5)

ifelse(y >= mean(y), "tall", "short")
```

### %in%

`%in%` returns TRUE if the value on the left is in the vector on the right, FALSE otherwise. Unlike `==`, if the value on the left is `NA`, it will return FALSE unless `NA` is also in the vector on the right:

```{r}
x <- c(-1, 2, 3, -5, 3, NA, -4, 6)
x %in% c(1, 2, 3)
x == 1 | x == 2 | x == 3
```

```{r}
state.name  # built-in vector in R

ifelse(state.name %in% c("Alaska", "Hawaii"), NA, state.name)
```

Let's say we want to create a new column called `evanston` in our `evp` data frame such that it takes the value TRUE if the `location` for that row is either 60201 or 60202 (zip codes for Evanston IL), but FALSE otherwise. This will help us to quickly subset the data only for stops made in Evanston.

First, read in the data frame:

```{r}
evp <- read.csv("data/ev_police_jan.csv")
```

We could subset data - get the rows where the location is an Evanston zip code

```{r}
evp[evp$location %in% c(60201, 60202), ]
```

Or make a new column called `evanston` if the location is one of the two main Evanston zip codes of 60201 or 60202

```{r}
evp$evanston <- evp$location %in% c(60201, 60202)
table(evp$evanston)
```

### EXERCISE

Select the rows from the `evp` data below where the vehicle make is one of FIAT, DATS, GEO, JAGU, or PEUG

```{r}
evp <- read.csv("data/ev_police_jan.csv")
View(evp)

evp[evp$vehicle_make %in% c("FIAT", "DATS", "GEO", "JAGU", "PEUG"),]
```

### paste

The `paste()` function is used to join pieces of text:

```{r}
paste("John", "Oliver")
```

The default separator between the strings is a space, but you can change it:

```{r}
paste("John", "Oliver", sep="---")
```

You can also paste the elements of two or more vectors together:

```{r}
salutation <- c("Dear")
first_names <- c("John", "Kate", "Tyler", "Jordan")
last_names <- c("Lee", "Sharma", "Smith")

paste(first_names, last_names, sep=" ")
paste(salutation, first_names, sep=" ")
```

### EXERCISE

`state.abb` has two-letter abbreviations for US states. `state.region` has the region for each US state. Use `paste()` to join the info from these two vectors together, so the output has entries that look like:

```         
"AL: South"         "AK: West"          "AZ: West"          "AR: South"  ...
```

```{r}
paste(state.abb, state.region, sep=": ")
```

## Factors

Factors are variables with text labels, but the set of values (called levels) that are allowed for the variable is limited, and the values optionally can have a specific order to them. Categorical data is stored in R as `factor` variables. You may ask why a special data structure is needed for categorical data, when we could just use `character` variables to represent the categories.

Let's say that you've conducted a survey on students' smoking habits. The possible responses are *Never*, *Occasionally*, *Regularly* and *Heavy*. From 10 students, you get the following responses:

```{r}
smoke <- c("Never", "Never", "Heavy", "Never", "Occasionally",
           "Never", "Never", "Regularly", "Regularly", "No")
```

Note that the last answer is invalid - `No` was not one of the four answers that were allowed for the question. You could use `table` to get a summary of how many answers of each type that you got:

```{r}
table(smoke)
```

But the categories are not presented in the correct order! There is a clear order between the different categories, *Never* \< *Occasionally* \< *Regularly* \< *Heavy*, but `table` doesn't present the results in that way. Moreover, R didn't recognise that `No` was an invalid answer, and treats it just the same as the other categories. This is where `factor` variables come in. They allow you to specify which values your variable can take, and the ordering between them (if any).

### Creating Factors

When creating a `factor` variable, you typically start with a `character`, `numeric` or `logical` variable, the values of which are turned into categories. To turn the `smoke` vector that you created in the previous section into a `factor`, you can use the `factor` function:

```{r}
smoke2 <- factor(smoke)
smoke2
```

You can inspect the elements, and *levels*, i.e. values that the categorical variable takes, as follows:

```{r}
levels(smoke2)
```

We see two problems! First, we see that "No" is a level/category, but it was not listed on our survey. To fix this problem, we can decide the categories/levels a factor vector is allowed to take by specifying the `levels` parameter in the `factor()` function call:

```{r}
smoke2 <- factor(
  smoke,
  # allowed levels
  levels = c("Never", "Heavy", "Occasionally", "Regularly") 
)

levels(smoke2)
smoke2
```

We see that "No" is no longer listed as a level, and has been replaced by `NA` in the factor vector `smoke2`.

However, we still have another problem - the categories are in the wrong order.

```{r}
barplot(table(smoke2))

smoke2 <- factor(
  smoke,
  # allowed levels in some desired order
  levels = c("Never", "Occasionally", "Regularly", "Heavy") 
)

barplot(table(smoke2))
```

Now we see the categories are in an order that is more informative. However, we still have one more problem to fix. What if we wanted to compare smokers such that:

Never \< Occasionally \< Regularly \< Heavy

i.e. if we make conditional expressions such as smoke2 \> Occasionally, then Never and Occasionally would evaluate to FALSE while Regularly and Heavy would evaluate to TRUE. This can be helpful to subset the data meaningfully or make comparisons when the order is meaningful in some manner.

Let's see how to do this with the `ordered` parameter within `factor()`

```{r}
smoke2 <- factor(
  smoke,
  # create allowed levels in some desired order
  levels = c("Never", "Occasionally", "Regularly", "Heavy"),
  # logical flag to determine if the levels should be regarded as ordered (in the order given) for any comparisons
  ordered = TRUE
  )

smoke2
```

Notice how the level information is presented differently this time to indicate possible comparisons between categories.

```{r}
smoke2
smoke2 > "Occasionally"

# how many survey respondents smoke more than Occasionally
sum(smoke2 > "Occasionally", na.rm = TRUE)
```

To recap, you can control the order in which the levels are presented by choosing which order we write them in in the `levels` argument. The `ordered = TRUE` argument specifies that the order of the variables is *meaningful*.

It can be excluded in cases where you wish to specify the order in which the categories should be presented purely for presentation purposes (e.g., when specifying whether to use the order `Male/Female/Other` or `Female/Male/Other`).

Also note that the `No` answer now became an `NA`, which in the case of `factor` variables represents both missing observations and invalid observations. To find the values of `smoke` that became `NA` in `smoke2` you can use `which` and `is.na`:

```{r}
smoke[which(is.na(smoke2))]
```

By checking the original values of the `NA` elements, you can see if they should be excluded from the analysis or recoded into a proper category (`No` could for instance be recoded into `Never`).

### Changing factor levels

When we created `smoke2`, one of the elements became an `NA`. `NA` was however not included as a level of the `factor`. Sometimes it is desirable to include `NA` as a level, for instance when you want to analyse rows with missing data. This is easily done using the `addNA` function:

```{r}
smoke2 <- addNA(smoke2)
levels(smoke2)
smoke2
```

If you wish to change the name of one or more of the `factor` levels, you can do it directly via the `levels` function. For instance, we can change the name of the `NA` category, which is the 5th level of `smoke2`, as follows:

```{r}
levels(smoke2)
levels(smoke2)[5]
levels(smoke2)[5] <- "Invalid answer"
levels(smoke2)
smoke2
```

Finally, if you've added more levels than what are actually used, these can be dropped using the `droplevels` function:

```{r}
smoke2 <- factor(
  smoke,
  levels = c("Never","Occasionally","Regularly","Heavy","Constantly"),
  ordered = TRUE
  )
levels(smoke2)
table(smoke2)
smoke2 <- droplevels(smoke2)
levels(smoke2)
table(smoke2)
```

### Combining levels

Finally, `levels` can be used to merge categories by replacing their separate names with a single name. For instance, we can combine the smoking categories *Occasionally*, *Regularly*, and *Heavy* to a single category named *Yes*. Assuming that these are first, second and third in the list of names (as will be the case if you've run the last code chunk above), here's how to do it:

```{r}
# create our factor vector as usual
smoke2 <- factor(
  smoke,
  levels = c("Never","Occasionally","Regularly","Heavy"),
  ordered = TRUE)
smoke2

# recode levels that are not "Never" to "Yes"
levels(smoke2)
levels(smoke2)[2:4] 
levels(smoke2)[2:4] <- "Yes"
levels(smoke2)
smoke
table(smoke2)
```

### EXERCISE

Convert the vector below to a factor. Set the levels in an intentional order. Should ordered be set to TRUE here - why or why not?

Hint: use `table()` or `unique()` to find all the categories

```{r}
age_group <- c("35-50",
               "25-34",
               "50-65",
               "65+",
               "18-24",
               "18-24",
               "50-65",
               "35-50",
               "50-65",
               "18-24",
               "18-24",
               "65+",
               "65+")

# your code here 
table(age_group)

age_group2 <- factor(
  age_group, 
  levels = c("18-24", "25-34", "35-50", "50-65", "65+"),
  ordered = TRUE)
age_group2
```

### EXERCISE

Using the `evp` dataset, and convert the columns `subject_race` and `subject_sex` into factors.

```{r}
evp$subject_race <- factor(evp$subject_race)
evp$subject_sex <- factor(evp$subject_sex)

str(evp)
```

### EXERCISE

Using the `evp` dataset, convert the column `search_basis` to a factor. Then recode all `NA` values into "invalid".

```{r}
evp$search_basis <- factor(evp$search_basis)
evp$search_basis <- addNA(evp$search_basis)
levels(evp$search_basis)[3] <- "invalid"

evp$search_basis
```

## Tidyverse

### Setup

```{r, eval=TRUE}
library(tidyverse)
```

This gives you info on which packages it actually loaded, because when you install tidyverse, it installs \~25 packages plus dependencies, but it only loads the ones listed.

Tidyverse packages tend to be verbose in warning you when there are functions with the same name in multiple packages.

### Why use Tidyverse?

Tidyverse packages do a few things:

-   fix some of the annoying parts of using R, such as changing default options when importing data files and preventing large data frames from printing to the console
-   are focused on working with data frames --or rather tibbles-- (and their columns), rather than individual vectors
-   usually take a data frame/tibble as the first input to a function, and return a data frame/tibble as the output of a function, so that function calls can be more easily strung together in a sequence
-   share some common naming conventions for functions and arguments that have a goal of making code more readable
-   tend to be verbose, opinionated, and are actively working to provide more useful error messages

Tidyverse packages are particularly useful for:

-   data exploration
-   reshaping data sets
-   computing summary measures over groups
-   cleaning up different types of data
-   reading and writing data
-   predictive modeling
-   reporting results

### Data

Let's import the data we'll be using. The data is from the [Stanford Open Policing Project](https://openpolicing.stanford.edu/data/) and includes vehicle stops by the Evanston police in 2017. It's similar, but not exactly the same as `evp` (e.g., it has more rows).

We're going to use the `read_csv` function from the `readr` package, which is part of the tidyverse. The `read_csv` function works like `read.csv` except it has some different defaults, guesses data types a bit differently, and produces a tibble instead of a data frame (details coming).

```{r, eval=TRUE}
police <- readr::read_csv("data/ev_police.csv")
```

The output message that you get tells you what data type it guessed for each column based on the format of the information. "chr" is character or text data, "dbl" is numeric (stands for double, which is technical term for a type of number), "lgl" is logical/Boolean (TRUE/FALSE). Note that it also automatically read and identified date and time values and converted them to date and time objects -- not just string/character data.

We can also manually specify column types for cases where the assumption that `read_csv` makes is wrong. We use the `col_types` argument (similar to colClasses for `read.csv`). Let's make the location to be character data, since it is zip codes -- zip codes should not be treated as numbers.

```{r, eval=TRUE}
police <- readr::read_csv(
  "data/ev_police.csv",
  col_types=c("location"="c")
  )
```

### dplyr

dplyr is the core package of the tidyverse. It includes functions for working with tibbles (or any data frames). While you can still use base R operations on tibbles/data frames, such as using `$` and `[]` subsetting like we did above, dplyr provides alternatives to all of the common data manipulation tasks.

It contains six main functions, each a verb, of actions you frequently take with a data frame: select, filter, mutate, group_by, summarize, and arrange

Before we start, let's remember what columns are in our data:

```{r}
names(police)
```

### Select: choose columns

The `select()` function lets us choose which columns (or variables) we want to keep in our data.

The data frame is the first input, and the name of the column is the second. We do not have to put quotes around the column name.

```{r}
select(police, subject_race)
```

If we want to select additional columns, we can just list the column names as additional inputs, each column name separated by commas:

```{r}
select(police, subject_race, outcome)
```

As with `[]` indexing, columns will be returned in the order specified:

```{r}
select(police, subject_sex, subject_race, date)
```

We could also use the column index number if we wanted to instead. We don't need to put the values in `c()` like we would with `[]` (but we could).

```{r}
select(police, 1, 4, 10)
```

### EXERCISE

Convert this base R expression: `police[,c("violation", "citation_issued", "warning_issued")]` to use `select()` instead to do the same thing:

```{r}
select(police, violation, citation_issued, warning_issued)
```

Hint: The base R expression above keeps all rows but selects only the three columns named within `c()`.

### Filter: choose rows

To choose which rows should remain in our data, we use `filter()`. As with `[]`, we write expressions that evaluate to TRUE or FALSE for each row. Like `select()`, we can use the column names without quotes.

```{r}
filter(police, location == "60202")
```

Note that we use `==` to test for equality and get TRUE/FALSE output. You can also write more complicated expressions -- anything that will evaluate to a vector of TRUE/FALSE values.

```{r}
filter(police, is.na(beat))
```

### EXERCISE

How many rows have subject_race as "white"?

Hint: Use `filter()` to choose the rows where subject_race is "white". The equivalent base R expression would be `police[police$subject_race == "white",]`.

```{r}
filter(police, subject_race == "white")
nrow(filter(police, subject_race == "white"))
```

### Pipe: Chaining Commands Together

So, we can choose rows and choose columns separately; how do we combine these operations? `dplyr`, and other tidyverse commands, can be strung together in a series with a `%>%` (say/read: pipe) operator. (If you are familiar with working in a terminal/at the command line, it works like a bash pipe character `|`.) It takes the output of the command on the left and makes that the first input to the command on the right.

The pipe works well with dplyr (and other tidyverse packages) because the functions almost all take a data frame as the first input, and they return a data frame as the output.

We can rewrite

```{r}
select(police, date, time)
```

as

```{r}
police %>% select(date, time)
```

and you'll often see code formatted, so `%>%` is at the end of each line, and the following line that are still part of the same expression are indented:

```{r}
police %>%
  select(date, time)
```

The pipe comes from a package called `magrittr`, which has additional special operators in it that you can use. The keyboard shortcut for `%>%` is command-shift-M (Mac) or control-shift-M (Windows).

We can use the pipe to string together multiple commands operating on the same data frame:

```{r}
police %>%
  select(subject_race, subject_sex) %>%
  filter(subject_race == "white")
```

We would read the `%>%` in the command above as "then" if reading the code outloud: from police, select subject_race and subject_sex, then filter where subject_race is white.

This works because the dplyr functions take a tibble/data frame as the first argument (input) and return a tibble/data frame as the output. This makes it easy to pass a data frame through multiple operations, changing it one step at a time.

Order does matter, as the commands are executed in order. So this would give us an error:

```{r}
police %>%
  select(subject_sex, outcome) %>%
  filter(subject_race == "white")
```

Because `subject_race` is no longer in the data frame once we try to filter with it. We'd have to reverse the order:

```{r}
police %>%
  filter(subject_race == "white") %>%
  select(subject_sex, outcome)
```

You can use the pipe operator to string together commands outside of the tidyverse as well, and it works with any input and output, not just data frames:

```{r}
# sum(is.na(police$beat))
is.na(police$beat) %>% sum()

# or
police$beat %>% is.na() %>% sum()
```

Advanced aside: it is possible to select parts of a data frame within a piped set of commands. A `.` represents whatever the result of the left of the %\>% is:

```{r}
police %>% 
  filter(subject_race == "white") %>%
  View() # notice the name of the df in the Viewer is .

# example of getting a part of the dataframe piped before
police %>% 
  .$beat %>% 
  is.na() %>% 
  sum()
```

### EXERCISE

Select the date, time, and outcome (columns) of stops that occur in beat "71" (rows). Make use of the `%>%` operator.

The equivalent base R expression would be: `police[police$beat == "71", c("date", "time", "outcome")]`

Hint: remember that a column needs to still be in the data frame if you're going to use the column to filter.

```{r}
police %>% 
  filter(beat =="71") %>% 
  select(date, time, outcome)
```

Note that so far, we haven't actually changed the `police` data frame at all. We've written expressions to give us output, but we haven't saved it.

Sometimes we may still want to save the result of some expression, such as after performing a bunch of data cleaning steps. We can assign the output of piped commands as we would with any other expression:

```{r}
police60201 <- police %>%
  filter(location == "60201") %>%
  select(date, time, beat, type, outcome) 
```

### Mutate: add/modify columns

`mutate()` is used to both change the values of an existing column and make a new column.

We name the column we're mutating and set the value. If the name already exists, it will update the column. If the name doesn't exist, it will create a new variable (column is appended at the end of the existing columns).

```{r, eval=TRUE}
police %>% 
  mutate(vehicle_age = 2017 - vehicle_year) %>%
  select(starts_with("vehicle")) %>%   # just to pick a few columns to look at
  names()
```

We can put multiple mutations in the same call to mutate, with the expressions separated by commas:

```{r,  eval=TRUE}
mutate(police, 
       vehicle_age = 2017 - vehicle_year,
       old_car = vehicle_year < 2000)
```

### EXERCISE

If the column beat in `police` is "/", set it to `NA` instead using `mutate()`.

Hint: look up the `na_if` function to do this.

```{r}
police %>% 
  mutate(beat = na_if(beat, "/")) 
```

### Summarize

We use `mutate()` when we want the output to have the same length as the input. In other words, we use `mutate()` when we're operating on the individual elements in a vector - we want a value for every row in the data.

When we want to condense multiple values down to a single (or a few values), such as taking the mean or standard deviation of a vector, we use summarize instead:

```{r}
police %>% 
  mutate(vehicle_age = 2017-vehicle_year) %>% # computing a new variable first
  summarize(mean_vehicle_age = mean(vehicle_age))
```

Note that even though there's just one value, we get a tibble returned. This is what to expect with the tidyverse.

We can compute more than one summary measure at the same time:

```{r}
police %>% 
  mutate(vehicle_age = 2017-vehicle_year) %>% # computing a new variable first
  summarize(mean_vehicle_age = mean(vehicle_age),
            sd_vehicle_age = sd(vehicle_age),
            min_date = min(date),
            max_date = max(date))
```

We get one column per summary variable we create. Once we group below, we'll see why we get the output in columns instead of rows.

### EXERCISE

Use summarize to compute the `min()` and `max()` `vehicle_year`

```{r}
police %>% 
  summarize(max_year = max(vehicle_year),
            min_year = min(vehicle_year))
```

### Group_by

When we want to compute summary measures or do other computation on groups in our data (as defined by some grouping variable), we can explicitly group our tibble into subgroups. This isn't very useful by itself, but it is often combined with `summarize()` to compute summary measures by group.

First, what if we just group our data by a specific column:

```{r}
police %>%
  group_by(outcome)
```

When we print this in the console,

```         
# A tibble: 14,792 x 29
# Groups:   outcome [2]
   raw_row_number date       time  location beat  subject_age subject_race subject_sex department_id
            <dbl> <date>     <tim>    <dbl> <chr> <lgl>       <chr>        <chr>               <dbl>
 1       11249746 2017-01-01 00:56    60202 72    NA          white        male                13178
 2       11249747 2017-01-01 04:43    60643 71    NA          black        male                13178
 3       11249748 2017-01-01 09:21    60625 78    NA          black        female              13178
...
```

we see that it tells us that the tibble (data frame) is grouped by outcome, and that there are two groups. It doesn't rearrange the rows, it just keeps track of the groups for us.

Now, let's combine it with summarize. But first, let's make the `vehicle_age` column we've been using actually part of the `police` dataset so that we don't have to keep creating it:

```{r, eval=TRUE}
police <- mutate(police, vehicle_age = 2017-vehicle_year)
```

Now, group and summarize:

```{r}
police %>% 
  group_by(subject_sex) %>%
  summarize(mean_vehicle_age = mean(vehicle_age),
            sd_vehicle_age = sd(vehicle_age))
```

Now we get one row for each group, and one column for each summary measure.

We can group by multiple columns, and we'll get all of the combinations of values present across the columns:

```{r}
police %>% 
  group_by(subject_sex, subject_race) %>%
  summarize(mean_vehicle_age = mean(vehicle_age),
            sd_vehicle_age = sd(vehicle_age))
```

We are getting extra information alerting us to the fact that our output is grouped. If we don't want the groups in place anymore, we can ungroup(). Usually this doesn't matter though if we're just printing output to the screen.

Let's compute the ratio of warnings to citations by subject_race - note that we can use the variables we create in later expressions within the same call to `summarize()`:

```{r}
police %>%
  group_by(subject_race) %>%
  summarize(warnings = sum(outcome == "warning"),  # count how many warnings
            citations = sum(outcome == "citation"), # count how many citations
            ratio = warnings/citations)
```

There's considerable variation here, from 1.1 warnings for every citation given to 2.2 warnings for every citation given. So, relatively more warnings to citations for Black drivers than white and Hispanic ones.

### EXERCISE

Compute the `min()` and `max()` `vehicle_year` for each `vehicle_make`.

```{r}
police %>% 
  group_by(vehicle_make) %>% 
  summarize(min_year = min(vehicle_year),
            max_year = max(vehicle_year))
```

### EXERCISE

Using mtcars, compute the average horsepower (hp) for each \# of cylinders (cyl)

```{r}
mtcars %>% 
  group_by(cyl) %>% 
  summarise(avg_hp = mean(hp))
```

## Ungrouping

If you ever have a grouped data frame, you may need to ungroup it to get rid of the groups. To do so, use `ungroup()`:

```{r}
police %>% 
  group_by(outcome) %>%
  ungroup()
```

Usually this would come up after more complicated operations, and often after computing summary measures by group.