-
Notifications
You must be signed in to change notification settings - Fork 7
/
topics1.Rmd
435 lines (277 loc) · 13.4 KB
/
topics1.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
---
title: 'R: Dates, Factors, Lists, Help'
author: "Christina Maimone"
output:
html_document:
df_print: paged
code_download: TRUE
toc: TRUE
toc_float: TRUE
editor_options:
chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r}
library(tidyverse)
```
# Dates
In most cases, when you read data with dates/times in it into R, it will be read in as character (text) data:
```{r}
evp <- read.csv("data/ev_police_jan.csv")
str(evp)
```
Note: Evanston police data comes from the [Stanford Open Policing Project](https://openpolicing.stanford.edu/data/). This is a small subset of data: just January 2017.
Note: if you use the readr package to import your data, it will try to convert things that look like dates to Date objects if it can; you can also specify that a column contains dates/times when you import the data with the readr package.
If you want to use this information in a visualization, or aggregate data by month, or use the date information in other ways, you'll need to convert this information into an object that knows about dates and times.
The easiest way to do this is with the lubridate package (part of tidyverse).
```{r}
library(lubridate)
```
It has a series of functions that are named like:
```{r, eval=FALSE}
ymd()
ymd_hms()
dmy()
dmy_h()
mdy()
```
And so on, where y=year, m (in the first part)=month, d=day, h=hour, m (in the second part)=minute, s=second. With the function name, you are specifying which parts of a date or time appear in the text you're converting and what order the different parts appear in. You can ignore all of the delimiters and other components like dashes, slashes, or even no delimiter at all. lubridate can handle most cases:
```{r}
my_dates <- c("1/13/2020","1/13/20", "01132020", "1-13-2020", "Jan 13 2020", "Jan. 13, 2020")
class(my_dates)
my_dates <- mdy(my_dates)
class(my_dates)
```
Without lubridate, you have to specify the format of the datetime object with POSIX standards (see the help page for strftime).
## EXERCISE
Make two new columns in the evp data frame:
* Date (capitalized) that converts the date column to a Date type
* datetime: paste (concatenate) the date and time columns together, and then convert the joined string to a datetime object (it will be of type "POSIXct")
The command to paste the two columns together is provided to get you started.
```{r, eval=FALSE}
paste(evp$date, evp$time)
```
## Example
Now that we have the data information in a date-type format, we can ask questions such as: what day of the week had the most traffic stops?
```{r, eval=FALSE}
wday(evp$Date, label=TRUE) # day of the week of the stop (named)
table(wday(evp$Date, label=TRUE)) # tally these
```
The data we have is just for January, which means there are more instances of some days of the week than others. So we want to normalize by the number of Mondays, Tuesdays, etc.
```{r, eval=FALSE}
seq.Date(ymd("2017-01-01"), ymd("2017-01-31"), 1) # days in January 2017
wday(seq.Date(ymd("2017-01-01"), ymd("2017-01-31"), 1), label=TRUE) ## as days of the week
table(wday(seq.Date(ymd("2017-01-01"), ymd("2017-01-31"), 1), label=TRUE)) # tallied
table(wday(evp$Date, label=TRUE)) / table(wday(seq.Date(ymd("2017-01-01"), ymd("2017-01-31"), 1), label=TRUE)) # together
```
# Factors
Looking at the days of the week above, the output included this text at the bottom:
`Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat`
This is because the days of the week were returned as a ordered factor -- a categorical variable.
Factors are variables with text labels, but the set of values (called levels) that are allowed for the variable is limited, and the values optionally can have a specific order to them.
## Why Do I Need Factors?
Let's look at what happens with the days of the week if they are NOT a factor -- just character (text) data instead:
```{r}
evp$Date <- ymd(evp$date) # did this in the exercise above
evp$dow <- as.character(wday(evp$Date, label=TRUE))
table(evp$dow)
```
The days are in alphabetical order, not day of the week order!
Or, let's make a plot of the reported race of of the people stopped:
```{r, fig.width=5, fig.height=3}
ggplot(evp, aes(x=subject_race)) +
geom_bar()
```
This is OK, but ideally the categories of a bar chart would either be in a meaningful order, or ordered by their frequency.
## Make a Factor
In both cases above, the character data is converted into a factor in the process of making a table or the plot. If we don't like the default ordering, we can control it by explicitly making the variable a factor ourselves and setting the levels.
The default order is alphabetical order, which is dependent on the locale of your computer (language and location settings).
```{r}
evp$subject_race <- factor(evp$subject_race,
levels=c("white", "black", "hispanic", "asian/pacific islander", "other"))
head(evp$subject_race)
table(evp$subject_race)
```
Here, we specified the levels in order, so they will appear that way. But we didn't say that any category is greater than or less than another -- there isn't an inherit order to the categories. In cases that there is an order, we can provide that information as well. I have found very few cases where I need to do this in practice, however.
```{r}
answers <- c("Very unhappy", "Somewhat unhappy", "Somewhat happy", "Very happy",
"Somewhat unhappy", "Somewhat happy", "Somewhat unhappy", "Somewhat happy",
"Very happy", "Very unhappy", "Very happy", "Very happy")
answers <- factor(answers,
levels = c("Very unhappy", "Somewhat unhappy", "Somewhat happy", "Very happy"),
ordered = TRUE)
answers
```
You can see here that the levels are printed with less than signs `<` indicating a formal ordering that can be used in boolean comparisons.
## EXERCISE
Convert the vector below to a factor. Set the levels in an intentional order.
```{r}
directions <- c("east", "west", "east", "south", "north", "north", "west", "north", "east")
```
## forcats Package
The forcats package has a number of functions to help you work with factors -- for example, reordering the levels of a factor based on the frequency of the levels, combining low-frequency categories into an "other" category, and dealing with missing values.
With our bar plot above, we could have used the forcats package to re-order our categories in frequency order:
```{r}
evp$subject_race <- forcats::fct_infreq(evp$subject_race)
ggplot(evp, aes(x=subject_race)) +
geom_bar()
```
## Caution
If you try to change the value of a variable that is a factor to something that isn't one of the factor levels, you will introduce missing values into your data:
```{r}
my_colors <- factor(c("red", "blue", "green", "red", "red", "blue", "green"))
my_colors
my_colors[1] <- "pink"
my_colors
```
To avoid this issue, usually the easiest thing to do is to convert the variable to be of character type before changing the values, and then re-make the factor if you need:
```{r}
my_colors <- as.character(my_colors)
my_colors[1] <- "pink"
my_colors <- factor(my_colors)
my_colors
```
Also, the default order for factors can be different in different places in the world. To avoid this issue, either explicitly set the levels for any factor you create or use the forcats package to create a factor instead of the default `factor()` function.
# Lists
Lists allow us to store many different types of data to be stored together. The most common uses of lists I see in R are for storing
* multiple data frames together in one object (so you can do the same thing to many different data frames with one command)
* vectors of varying lengths
* hierarchical data from JSON files or similar files (although there are special packages to work with this type of data)
* result of a function call that can have output of varying lengths (i.e. `strsplit()`)
Our goals are to:
1. Recognize a list
2. Access elements in a list
Normally, we'd have a list as a result of some other operation. Here, I'm making a simple one manually to work with that uses some of the built-in vectors and data sets in R:
```{r}
state_data <- list(date_created="2021-12-08", # single value
names = state.name, # vector
abb = state.abb, # vector
vars = as.data.frame(state.x77), # data frame
center = state.center, # list
"From R built-in datasets") # single value, unnamed element
state_data
```
When it prints, you can see that it prints the names of the elements after a `$`. For elements that don't have a name, it prints the index position in double brackets: `[[]]`
How many elements are in the list (at the top level)
```{r}
length(state_data)
```
We can access the elements of the list by name with the `$` notation:
```{r}
state_data$names
```
When this prints, there is no name in the output -- we've extracted the element from the list.
We can also access the elements of the list by index position. If we use single brackets:
```{r}
state_data[1]
```
what we get back is still a list:
```{r}
class(state_data[1])
```
If we want what's in that position instead, we need to use double brackets:
```{r}
state_data[[1]]
```
When the element we want to access also has it's own parts, we can keep adding on brackets (or `$`):
```{r}
state_data[[5]]
state_data[[5]][1]
state_data[[5]][[1]]
state_data[[5]]$x
```
We can iterate over lists to do the same thing (apply the same function) to every element:
```{r}
sapply(state_data, length)
```
If you want to do more with lists, look at the purrr package. There are multiple tutorials available online.
## EXERCISE
What is the median Area of US states? Hint: There's an Area variable in the data frame that's in the "vars" element of the state_data list. There's also a `median()` function.
```{r}
```
## Objects
Objects are used in R to store complex information. One type of object you may have seen is the result of running a regression model.
```{r}
reg1 <- lm(mpg ~ hp + wt, data=mtcars) # mtcars is a built-in data set
reg1
```
You can access the components of these objects as you would with a list:
```{r}
names(reg1)
reg1$coefficients
reg1[[1]]
```
The objects have other properties too beyond a standard list -- many know how to print a summary of themselves, they may know how to plot themselves, etc.
# Using the R Help
See the `help.pdf` file in the materials.
We're going to talk through the page for `read.csv()`.
## EXERCISE
Do the first part below where you're unfamiliar with the function
### Part A: table
Look at the help page for `table()` to make a table that also includes the count of any missing values in the vector x.
Bonus: after you do that, look up the help for the `sample()` function to see what's going on there :)
```{r}
x <- sample(c(letters[1:10], NA, NA, NA), 100, TRUE)
```
### Part B: quantile
Use the `quantile()` function to find the 50th (0.5) and 95th (.95) quantiles of the x vector below. Look up the help for the function to see how to use it.
```{r}
x <- rnorm(100)
```
### Part C: list.files
Use the `list.files()` function to list all files in your working directory that end with ".R". Look up the help for the function to see how to use it.
Hint: ".R" will include files where the extension starts with ".R" but may have more, such as ".Rmd". To get just the ".R" files, use the pattern ".R$"
Didn't find any ".R" files? See what other files are in your working directory instead. You can also check what your working directory is with `getwd()`
```{r}
```
# Question Time!
What are things you've wondered about R?
What are things you find difficult to do?
What's confusing?
# Bonus: Useful functions
A few useful functions that I use frequently that weren't covered elsewhere.
## ifelse
`ifelse()` allows you to change the values of a vector based on a conditional test. It's useful for recoding data, and you can use it in situations where you don't want to change the underlying data.
```{r}
x <- c(-1, 2, 3, -5, 3, NA, -4, 6)
ifelse(x >= 0, x, 0) # replace negative values with 0, leave others alone
ifelse(is.na(x), 0, x) # replace missing values with 0, leave others alone
ifelse(x %% 2 == 0, "even", "odd") ## remainder of dividing x by 2 is 0
```
There's also the useful `replace_na()` function in the tidyr package and `na_if()` in dplyr.
## %in%
`%in%` returns TRUE if the value on the left is in the vector on the right, FALSE otherwise. Unlike `==`, if the value on the left is `NA`, it will return FALSE unless `NA` is also in the vector on the right:
```{r}
x <- c(-1, 2, 3, -5, 3, NA, -4, 6)
x %in% c(1, 2, 3)
x == 1 | x == 2 | x == 3
```
```{r}
state.name
ifelse(state.name %in% c("Alaska", "Hawaii"), NA, state.name)
```
## paste0
The `paste()` function is used to join pieces of text:
```{r}
paste("Christina", "Maimone")
```
The default separator between the strings is a space, but you can change it:
```{r}
paste("Christina", "Maimone", sep="---")
```
But, I frequently want to join strings with no space between them:
```{r}
paste("Christina", "Maimone", sep="")
```
There's a shortcut for this:
```{r}
paste0("Christina", "Maimone")
```
## cat
If you want to print output to the console to copy and paste elsewhere, you may not want quotation marks around strings or the index markers in brackets at the start of the lines. Use `cat()` to print:
```{r}
cat(1:10, sep=", ")
cat(letters, sep="\n")
```