Day2_intro_to_R.qmd

---
output:
  html_document:
    df_print: paged
    code_download: TRUE
    toc: true
    toc_depth: 1
editor_options:
  chunk_output_type: console
editor: 
  markdown: 
    wrap: sentence
---

```{r, setup, include=FALSE}
knitr::opts_chunk$set(
  eval=FALSE, warning=FALSE, error=FALSE
)
```

# R: Tools of the Trade

In this section, we're going to cover (or review) some fundamental R tools such as:

-   File paths
-   Working directory
-   RStudio Projects
-   Code scripts and R Documents

## File paths

-   You can use the Files tab to access the files on your computer
-   **Absolute paths** always start with the root directory and provide the full path to the file or directory. For example, the absolute path to the home directory of a user named "john" would be "/home/john".
-   On the other hand, a **relative path** is a path to a file or directory that is relative to the current directory. For example, from the directory "/home/john", the relative path to a data set located in the Desktop would be "Desktop/data.csv".

## Working directory

Working directory: where R starts looking for files on your computer.
In other words, where you're "working from." This is also where R will save any output you save (e.g., processed data or plots).

You can set it with functions.

```{r}
getwd() # function to get working directory
setwd("<insert valid path here>") # function to set working directory
```

### EXERCISE

Set the working directory for your RStudio session to `R-fundamentals-summer-workshop` or `R-fundamentals-summer-workshop-main` (or whatever the correct path is for your project folder).

```{r}
# run this code to change your working directory only for this exercise
setwd("~")

# type code below to change the working directory back to your R workshop folder

```

## RStudio Projects

R experts keep all the files associated with a project together -- input data, R scripts, analytical results, figures.

This is such a wise and common practice that RStudio has built-in support for this via **projects**.

There are multiple ways to create a project. Here we cover two of them.

**Option 1 -- Create a new directory:**

On the top-right corner of RStudio, click on the button that has a blue cube with an R inside and says "Project: (None):. ![](images/project1.jpeg)

Then, click on "New Project..." in the drop-down menu.
![](images/project2.jpeg)

Then, click on "New Directory".
![](images/project3.jpeg)

Then, click on "New Project".
![](images/project4a.jpeg)

Then, populate a directory name (whatever name you want to be associated with the folder), select a location, and click on "Create Project". In this case, the folder will be called "test_project" and will be located in the Desktop.
![](images/project5a.jpeg)

**Option 2 -- Project in an existing directory:**

On the top-right corner of RStudio, click on the button that has a blue cube with an R inside and says "Project: (None):. ![](images/project1.jpeg)

Then, click on "New Project..." in the drop-down menu.
![](images/project2.jpeg)

Then, click on "Existing Directory".
![](images/project3.jpeg)

Then, select the location of your existing directory and click on "Create Project".
![](images/project4b.jpeg)

Regardless of which option you use, you'll end up with a directory containing a file ending in **".Rproj"**.
Double clicking on that file automatically opens up the project and associated documents in the correct working directory. **THIS IS HOW WE WANT TO OPEN RSTUDIO FOR THIS WORKSHOP.**

### EXERCISE

Check that you are currently working within the `R-fundamentals-summer-workshop` or `R-fundamentals-summer-workshop-main` Project. If not, save any changes you've made to this file (or others in this folder), and then open the relevant .RProj file to open the project (which will restart your R session as well).

```{r}

```

## Code scripts and R Document files

This is an R Quarto file. It is very similar to an R Markdown file. There are two view modes for this file - Source and Visual. The code in the gray chunks (surrounded by \`\`\` in Source view) is R code that you can run. There's a green triangle you can click to run all of the code in a chunk, or put your cursor in a line and press Command+Enter (Mac) or Control+Enter (PC) to execute that command (which can span multiple lines).

We're using an R Quarto Document instead of an R script so that we can include additional text and explanation in the file. In an .R script, everything has to be an R command or comment. Here, between chunks of R code, we can include text and other information.

# Packages

R comes with lots of built-in basic functions (commands). But we almost always need to use functions from additional packages as well. Packages can also include data sets sometimes.

## Installing

-   RStudio often prompts you to install packages you're missing: if you open a file that uses a package that you don't have installed
-   Do NOT put installation commands in R scripts or R Markdown files (included below only because we're teaching)

```{r, installpackages, eval=FALSE}
# install 1
install.packages("praise")

# install a few at once (put names in a vector)
install.packages(c("palmerpenguins", "tidyverse", "janitor"))
```

Installing a package is similar to going to the store to buy cold medication of a particular brand. Once you buy it, you have access to the medication, but you haven't used it yet.

Packages are installed from a repository. CRAN (Comprehensive R Archive Network) is the most common. People also install packages from Bioconductor and GitHub (process differs for these).

If you're asked a question about whether you want to install packages from source, say NO. This is because installing from source requires compiling into computer-readable documents which take time. If you say NO then R uses a pre-compiled version.

You only need to install packages once (for each major version of R).

To update a package, re-install it.

### EXERCISE

Let's install all the packages that we're going to use throughout the workshop!

Write code to install these packages:`praise`, `palmerpenguins`, `janitor`, `tidyverse`, `RColorBrewer`, `ggpubr`, `gridExtra`, and `grid`.

```{r, installpackagesworkshop, eval=FALSE}

```

## What if you can't install packages in your computer? Posit Cloud to the rescue!

If for some reason you can't install packages in your computer, you can use R and RStudio within Posit Cloud.
Posit Cloud allows you to run R on the cloud (in other words, other people's computers).

When using cloud services, keep in mind that you're sending data to other people's computers. Always consider the type of data that you're using and keep Northwestern's [research data](https://researchintegrity.northwestern.edu/resources/research-data-policy.pdf) and [data classification](https://policies.northwestern.edu/docs/data-classification-policy.pdf) policies in mind. You can always consult with [Research Computing and Data Services' Data Management team](https://www.it.northwestern.edu/departments/it-services-support/research/data-storage/). For this workshop, it's okay to use Posit Cloud.

To use Posit Cloud, go to <https://posit.cloud/>.
![](images/posit_cloud_main.jpeg)

The first time you use it, follow these steps:

1.  [Sign up for a free acount](https://posit.cloud/plans/free) ![](images/posit_cloud_sign_up.jpeg)
2.  Once you are logged in, click on "New Project" on the top right. ![](images/posit_cloud_new_project.jpeg)
3.  Then click on New RStudio Project. It'll take a couple of moments to get the project ready. ![](images/posit_cloud_rstudio_project.jpeg)
4.  Once you see RStudio's interface, you're good to go! ![](images/posit_cloud_rstudio.jpeg)

## Using Packages

To use a package in your current session, use the `library()` function to load all of the functions (and other objects) in the package into your environment. You do not need quotes around the package names.

Loading a package is similar to opening the cold medication after you come home from the store. Not only you already have the medication, but you also opened it, so that it's ready for you to use.

You need a separate library command for each package.

Put these at the top of your script or R Markdown file so people (collaborators or your future self) can see which packages are needed for the code.

Only load the packages you need. Packages can conflict with each other, so don't load ones you aren't using.

```{r}
library(praise)
library(palmerpenguins)  # we're going to use this later
```

Now that we loaded `praise`, we can use functions within it!

```{r}
praise()
```

Note that, as opposed to installing, you need to load packages in each R session where you want to use them.

# Import Data

## Reading in a Data File

Most of the time, you'll read in data from a file. For example:

```{r}
evp <- read.csv("data/ev_police_jan.csv")
```

Most data in R will be stored in a data frame (more below).

As with other variables, when we type the name of the data frame, it will print the contents of the variable to the console.

```{r, eval=FALSE}
evp
```

Note: Evanston police data comes from the [Stanford Open Policing Project](https://openpolicing.stanford.edu/data/). This is a small subset of data: just January 2017.

TIPS: if you get an error message that a file can't be found when you're trying to import it:

1)  Check the spelling of the filename for typos

2)  Check your working directory (`getwd()`) and make sure the path to the file is correct and completely specified given what your working directory is.

3)  Make sure you extracted the contents of the .zip file you downloaded. We've seen some problems on Windows computers especially where a .zip file isn't really unzipped - it's just letting you see inside without actually expanding the contents and creating real files. Right click on the file and look for an Extract All or similar option.

## From a Package

Today, we're going to work with data on penguins from <https://github.com/allisonhorst/palmerpenguins>. It's been included in an R package: `palmerpenguins`.

There's a data frame in the package called penguins. Like with functions in a package, we can use it once we've loaded the package with the library command above.

```{r}
penguins
```

## Manual Creation

It's also possible to create a data frame from scratch. You will not do this often.

```{r, manualdf}
month_info <- data.frame(
  month = month.name,
  index = 1:12,                 
  days = c(31,28,31,30,31,30,31,31,30,31,30,31)
  )

month_info
```

# Data Frame Basics

-   Rectangles of data
-   Rows are observations
-   Columns are variables
-   Columns (variables) have names
-   Multiple vectors stuck together - each column is a vector
-   Each column has 1 type of data (numeric, character, logical, etc.)
-   Data frames can have columns with different types of data

## What is a data frame?

It has columns and rows. Let's look at it:

```{r, eval=FALSE}
View(evp)
```

Or we can look at a few rows in the console:

```{r}
head(evp)
```

What size is it?

```{r}
dim(evp)
nrow(evp)
ncol(evp)
```

What are the variables?

```{r}
names(evp)
```

What type of data is in each variable?

```{r}
str(evp)
```

Get some basic info about the variables:

```{r}
summary(evp)
```

## EXERCISE

Using the `penguins` data frame, write commands to figure out:

-   How many rows and columns it has?
-   What are the names of the variables?

```{r}
# load the package
library(palmerpenguins)


```

## Working with Variables

Each column (variable) is a vector. We can access them individually with the `$` notation using the name of each variable:

```{r}
evp$location
penguins$species
```

We can do things like count values, summarize numeric values, etc.

```{r}
table(evp$location)
median(month_info$days)
```

We can also use the vectors in expressions:

```{r}
month_info$days/7
```

## EXERCISE

Import data from `"data/nu_degrees.csv"` and save it in a variable (choose an appropriate name).

View the data.

What is the average (the function is `mean()`) number of bachelor's degrees issued in a year?

What is the standard deviation in the number of doctoral degrees issued each year? The function is `sd()`.

How many total masters degrees were issued for the years in the data? Hint: `sum()`.

```{r}

```

Bonus -- challenge: How many total degrees were issued each year? How many total degrees were issued across all years and types?

```{r}

```

## Indexing by Position: Vectors

We can access the individual values in vectors or data frames by their index position. We put the number of the element we want in `[]`.

Let's start with vectors:

```{r}
month_info$month
month_info$month[1]
```

We can also select a range of values:

```{r}
1:5  # shortcut for making a vector with values 1, 2, 3, 4, 5
month_info$month[1:5]
```

Or non-adjacent values by using a vector of values.

```{r}
# We can make our own vector with the `c()` function:
c(1, 3, 5, 7, 9, 11)

month_info$month[c(1, 3, 5, 7, 9, 11)]
```

Negative indices mean to exclude the value at the given position:

```{r}
month_info$month[-1]
month_info$month[-1:-6]

table(month_info$days)
table(month_info$days[-2])
```

## EXERCISE

Using the data on Northwestern degrees from the last exercise:

What is the mean number of bachelors degrees issued in the first 5 years of the data?

What is the mean number of bachelors degrees issued in the last 5 years of the data?

What is the mean number of doctoral degrees per year excluding the most recent year of 2021-22?

```{r}

```

## Detour: Tibbles vs. Data Frames

`class()` tells us the type of the object -- what's stored in the variable.

```{r}
class(evp)
```

What is `penguins`?

```{r}
class(penguins)
```

"tbl_df" is a tibble data frame. These behave a little bit differently from normal data frames. You'll see tibbles instead of data frames within the tidyverse set of packages (and those packages that work within that framework).

The biggest difference is that tibbles give you a tibble back when subsetting with \[\], while data frames sometimes give you a vector.

For now: treat tibbles as data frames; be aware that there may be a few differences (one is coming below!). Mostly tibbles are more consistent in their behavior than data frames -- they like to stay as tibbles.

## Indexing by Position: Data Frames

With data frames, there are two dimensions (rows and columns), so we need two indices:

```{r}
month_info
month_info[1, 1]
month_info[2, 1]
month_info[1, 2]
```

There is a shortcut if we want all rows or all columns -- we can leave one position blank:

Select first row

```{r}
month_info[1, ]
```

Select first two rows

```{r}
month_info[1:2, ]
```

Select first column

```{r}
month_info[ ,1]
```

This gave us a vector back, but with `penguins`:

```{r}
penguins[ ,1]
```

we get a data frame/tibble back. This is one difference between tibbles and regular data frames.

We can select rows and columns at the same time:

```{r}
month_info[6:7, 1:2]
```

If we want rows or columns that aren't next to each other, you can use a vector.

```{r}
month_info[c(1, 3), ]
```

## EXERCISE

Using the Northwestern degree data set from above: select rows 2 through 5, and columns 1 through 2

```{r}

```

Bonus -- challenge: look up the `seq()` function in the Help. Use it to select every other row from the degree data.

```{r}

```

## Indexing by Names

We've seen that we can reference columns by name with `$` notation (no quotes on names)

```{r}
names(month_info)

month_info$days
```

Note that the `$` notation got us a vector back.

```{r}
month_info$days
```

Use names in `[]`: put them in quotes

```{r}
month_info[ ,"days"]
```

With penguins (tibble), it stays as a tibble:

```{r}
penguins[ ,"bill_length_mm"]
```

Multiple columns by name, need to use a vector of names:

```{r}
month_info[ ,c("month", "days")]
```

## Boolean/Conditional Selection

Suppose we want to select rows from a data frame that might meet a specific condition. For example, select all data for patients aged greater than 65, or select all sales items that were sold in the month of December.

If we have a Boolean vector (`TRUE` and `FALSE` values) that is the same length as the number of rows, we can use it to select data.

For example, suppose your data set is as below:

```         
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
```

You want to choose rows where the **condition** `mpg` ≥ 15 evaluates to TRUE

```         
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb condition
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4     FALSE
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4     FALSE
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4     FALSE
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4     FALSE
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4     FALSE
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8      TRUE
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3      TRUE
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2      TRUE
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2      TRUE
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4      TRUE
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3      TRUE
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3      TRUE
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4      TRUE
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1      TRUE
```

Therefore, the selected rows will be the ones where the condition evaluated to TRUE:

```         
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb condition
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8      TRUE
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3      TRUE
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2      TRUE
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2      TRUE
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4      TRUE
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3      TRUE
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3      TRUE
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4      TRUE
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1      TRUE
```

And R will return the following subset of the original data frame to you:

```         
                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb 
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8      
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3      
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2      
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2      
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4      
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3      
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3      
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4      
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1      
```

Let's see another example...

Say, you want to find the names of the months which have exactly 31 days.

```{r}
month_info
month_info$days # vector containing number of days in each month
month_info$days == 31 # boolean vector of T,F values based on a condition

month_info$month[month_info$days == 31] # works!
```

The command above works because R selects the values which evaluated to TRUE and discards the values that evaluated to FALSE.

Another example:

```{r}
month_info$month
length(month_info$month)

# create a Boolean vector of length equal to number of rows in month_info
selection_vector <- c(T, T, T, T, T, T, F, F, F, F, F, F)
length(selection_vector)

month_info$month[selection_vector] 

month_info$month[!selection_vector] # inverted selection
```

Let's see an example with the penguins data frame:

Start with the criteria we want:

```{r}
penguins$bill_length_mm < 34
```

Hmmm - there's some missing values in there (NA) - we'll come back to that in a minute.

Select the **rows** (all columns) where bill length is less than 34:

```{r}
penguins[penguins$bill_length_mm < 34, ]
```

If I forget the `,` in the `[]`:

```{r, eval=FALSE}
penguins[penguins$bill_length_mm < 34]
```

It tries to index the columns instead, and our vector is too long.

Multiple conditions

```{r}
penguins[penguins$bill_length_mm < 34 & penguins$bill_depth_mm < 16, ]
dim(penguins[penguins$bill_length_mm < 34 & penguins$bill_depth_mm < 16, ])
```

```{r}
penguins[penguins$bill_length_mm < 34 | penguins$bill_length_mm > 58, ]
dim(penguins[penguins$bill_length_mm < 34 | penguins$bill_length_mm > 58, ])
```

## EXERCISE

Which years were more than 600 doctoral degrees awarded according to the `nu_degrees.csv` data?

You might want to break this down into two steps: 1) write an expression for determining whether there were more than 600 doctoral degrees, then 2) use that expression to index the correct column of your degrees data.

```{r}

```

# Missing Values

`NA` is the symbol for missing data; it can be used with any data type

```{r}
NA  
y <- c("dog", "cat", NA, NA, "bird")
y
x <- c(NA, 2, NA, 4, NA, 6, NA, 6, 6, 4)
x

is.na(y)
is.na(x)  
```

A common operation is to count how many missing values there are in a vector. We use the function `is.na()` to get TRUE when the value is `NA` and FALSE otherwise. Then we sum that result; this works because TRUE converts to 1 and FALSE converts to 0. So it counts the number of missing values (the number of TRUEs).

```{r}
sum(c(TRUE, FALSE))
x
sum(is.na(x))  
```

Different functions handle missing values in different ways. Most commonly, you'll get an answer of `NA` unless you tell the function to remove or exclude the missing values in the calculation.

```{r}
mean(x)
mean(x, na.rm=TRUE)
```

If we don't want the missing rows included in our results above when we were selecting rows of penguins:

```{r}
is.na(penguins$bill_length_mm)
!is.na(penguins$bill_length_mm)  # ! is not

dim(penguins)

# not missing bill length AND bill length < 34
penguins[!is.na(penguins$bill_length_mm) & penguins$bill_length_mm < 34, ]
dim(penguins[!is.na(penguins$bill_length_mm) & penguins$bill_length_mm < 34, ])

penguins[penguins$bill_length_mm < 34, ]
dim(penguins[penguins$bill_length_mm < 34, ])
```

The function `complete.cases()` can be used to find which rows have no missing values at all. The function `na.omit()` directly returns the subset of data with complete cases. You may want to use these functions at some point.

## EXERCISE

How many missing values in `penguins$bill_depth_mm`?

What is the mean of the non-missing values of `penguins$bill_depth_mm`?

```{r}

```

Bonus -- challenge: Select the rows from penguins where `penguins$bill_depth_mm` is missing.

```{r}

```

Bonus -- challenge: What is the mean of non-missing values of `penguins$bill_depth_mm` for penguins of species Gentoo?

```{r}

```

# Column Names

## Fixing Bad Names on Import

What if the variable/column names in your dataset have spaces, start with numbers, or are otherwise not names that you can use with R?

First, take a look at `annoying.csv` in Excel.

Then, run this code:

```{r}
bad_data <- read.csv("data/annoying.csv")
bad_data
names(bad_data)
```

The package `{janitor}` expedites the initial data exploration and cleaning that comes with any new data set.

```{r}
library(janitor)
bad_data <- clean_names(bad_data)
bad_data
names(bad_data)
```

## Renaming Columns

Sometimes, we want to change the column names of a dataframe after importing into R. For example, in `bad_data` above, perhaps we want the columns `x2022` and `x2023` to be called `year_2022` and `year_2023` respectively.

Just like we use `names()` to get the names, use the same function to assign the names. This is different syntax than we see other places in R; `names()` is a special type of function called a replacement function.

We can change the name of a data frame column as follows:

```{r}
names(bad_data)
names(bad_data)[2]
names(bad_data)[2] <- "year_2022"
names(bad_data) # inspect new names
bad_data 
```

You can also change all or several column names at once.

```{r}
names(bad_data)
names(bad_data) <- c("fruit", "year_2022", "year_2023", "year_2024")
names(bad_data) # inspect new names
bad_data 
```

# Making new columns

We can add a new column to the data frame by naming it with the `$` notation, and assigning a value to it.

For example, make a column that has bill length in **centimeters (cm)** instead of **millimeters (mm)**

```{r}
names(penguins)
penguins$bill_length_cm <- penguins$bill_length_mm / 10  # make new variable: note CM instead of MM in the name
names(penguins)  # check to see that it was added
```

Columns will be added to the end (right).

```{r}
View(penguins)
```

## EXERCISE

Using `month_info`: make a new variable as part of `month_info`, called `weeks`, that is the number of days divided by 7.

```{r}

```

Make a new column in the degrees data with the total number of degrees awarded each year

```{r}

```

Bonus -- challenge: Make a new logical column called `long_month` that is TRUE if the month has 31 days.

```{r}

```

# Dropping Columns

To remove a column from a data frame, we select the data we want to keep, and then save that back to the same column:

```{r}
month_info

month_info[ , 1:3]

month_info <- month_info[ , -c(2,3)]
month_info
```

Of course, you can also do it with names instead of indices.

# Replacing Values

How do we change our data? A common task might be replacing missing values -- either actual NA values or other values that indicate missing. One way to do this is by replacing missing values by the mean of the non-missing values in that column.

Let's go through an example to replace missing values in the `bill_depth_mm` column of the `penguins` data frame.

First, we identify and select the values we want to change:

```{r}
# find NA / missing values in the column bill_depth_mm
penguins$bill_depth_mm[is.na(penguins$bill_depth_mm)]
```

Next, compute the replacement value, which is the mean of non-missing values of `bill_depth_mm` in this case:

```{r}
bill_depth_mean <- mean(penguins$bill_depth_mm, na.rm=TRUE)
bill_depth_mean
```

Now, assign the replacement value `bill_depth_mean` to the missing observations:

```{r}
penguins$bill_depth_mm[is.na(penguins$bill_depth_mm)] <- bill_depth_mean

sum(is.na(penguins$bill_depth_mm))
```

You can replace any values you need to replace (it doesn't have to be missing), and often-times it is common to use functions to compute the replacement values.

## EXERCISE

Here is code to make a new data frame. Run the code and take a look at the data frame

```{r}
set.seed(12924)
some_numbers <- data.frame(v1 = rnorm(10), v2 = runif(10))
some_numbers
```

You decide that you don't want any negative numbers in your data frame. Write code to replace all values of column v1 in `some_numbers` that are below 0 with the value 0.

```{r}
# Identify the values to be replaced/changed in some_numbers

# What is the new/replacement value?

# Assign the new/replacement value to the elements identified in the first step

# check some_numbers to see if the replacement worked!
some_numbers
```

# Bonus: Building Up Expressions

What if we want to select observations with more complicated criteria?

Which penguins are heavier than average?

Let's build this up together...

```{r}

```

What proportion of Adelie penguins have above average flipper length?

```{r}

```

# PRACTICE SET 1

Set 1 exercises use the policing data here:

```{r}
evp <- read.csv("data/ev_police_jan.csv")
```

If you mess up the data frame, you can always read it back in from the file.

## EXERCISE 1

What are the variable names, and how many observations are there?

```{r}

```

## EXERCISE 2

Are there any missing values in the data?

There are a few ways you might approach this

```{r}

```

## EXERCISE 3

What is the most common location? You may want to use the `table()` function.

Hint: you can use the `sort()` function in combination with the `table()` function: `sort(table(x))`

```{r}

```

## EXERCISE 4

How many stops found contraband?

Hint (for one approach to this): if you `sum()` a boolean variable, TRUE = 1, FALSE = 0, so it counts the TRUE values. Remember to deal with missing values with `sum()`: `sum(x, na.rm=TRUE)`. You could also make a table.

```{r}

```

What proportion of all stops in the data found contraband?

```{r}

```

## EXERCISE 5

Rename the column "raw_row_number" to "id"

```{r}

```

## EXERCISE 6

Subset the data frame to just have rows where subject_sex is "female" and save the result in a new variable.

```{r}

```

## EXERCISE 7

Make a new column in the data frame called evanston that is TRUE if the location is 60201 or 60202 and FALSE otherwise.

```{r}

```

## EXERCISE 8

This exercise is more challenging than the others.

What is the *make* of the oldest vehicles in the data set.

Hint: there are two observations in the data that both have the minimum vehicle_year.

```{r}

```

## EXERCISE 9

This exercise is more challenging than the others.

What percentage of men in the dataset drive a Toyota ("TOYT")? How does this compare to the percentage of women who drive a Toyota?

What percentage of Toyota drivers are male? What percentage of drivers in the data set are male?

```{r}

```

# PRACTICE SET 2

These exercises are more challenging than the ones above and most require multiple steps to complete. Try to break each problem down into smaller steps.

These exercises use this data set, which is data on Northwestern doctoral degrees by program/department and year:

```{r}
phds <- read.csv("data/doctorates_clean.csv")
```

This data is in what we call long format -- there is one row (observation) per department-year

## EXERCISE 1

View() the data, and look at the column names and column types. How many rows and columns?

```{r}

```

## EXERCISE 2

Select programs/degrees where it's not a PhD (use the phd indicator column). These are other types of doctorate degrees.

Challenge: use the unique() function to get a vector that has each program in the list only once.

```{r}

```

## EXERCISE 3

Select data just for 2021-22. Which program awarded the most degrees?

```{r}

```

## EXERCISE 4

How many different programs awarded at least one degree in 2021-22?

```{r}

```

## EXERCISE 5

Are there any programs who awarded no degrees during the time period covered by the data set?

```{r}

```