---
output:
  pdf_document: default
  bookdown::gitbook:
    lib_dir: "book_assets"
    includes:
      in_header: google_analytics.html
  html_document: default
---
# Summarizing data with R (with Lucy King)
This chapter will show you how to summarize data using R, and will also introduce a popular set of R tools known as the "Tidyverse."
Before doing anything else we need to load the libraries that we will use in this notebook.
```{r loadLibraries}
library(tidyverse)
library(cowplot)
library(knitr)
set.seed(123456)
opts_chunk$set(tidy.opts=list(width.cutoff=80))
options(tibble.width = 60)
```
We will use the NHANES dataset for several of our examples, so let's load the library that contains the data.
```{r}
# load the NHANES data library
# first remove any existing copy of the NHANES data frame
# from the workspace, to make sure we have a clean version
rm('NHANES')
library(NHANES)
dim(NHANES)
```
## Introduction to the Tidyverse
In this chapter we will introduce a way of working with data in R that is often referred to as the "Tidyverse." This comprises a set of packages that provide various tools for working with data, as well as a few special ways of using those functions.
### Making a data frame using tibble()
The tidyverse provides its own version of a data frame, which is known as a *tibble*. A tibble is a data frame but with some smart tweaks that make it easier to work with, especially when using functions from the tidyverse. See here for more information on the function `tibble()`: https://r4ds.had.co.nz/tibbles.html
```{r}
# first create the individual variables
n <- c("russ", "lucy", "jaclyn", "tyler")
x <- c(1, 2, 3, 4)
y <- c(4, 5, 6, 7)
z <- c(7, 8, 9, 10)
# create the data frame
myDataFrame <-
tibble(
n, #list each of your columns in the order you want them
x,
y,
z
)
myDataFrame
```
Take a quick look at the properties of the data frame using `glimpse()`:
```{r}
glimpse(myDataFrame)
```
### Selecting an element
There are various ways to access the contents within a data frame.
#### Selecting a row or column by name
```{r}
myDataFrame$x
```
We can also select an individual element using its numeric indices in square brackets; the first index refers to the row, the second to the column.
```{r}
myDataFrame[1, 2]
myDataFrame[2, 3]
```
#### Selecting a row or column by index
```{r}
myDataFrame[1, ]
myDataFrame[, 1]
```
#### Select a set of rows
```{r}
myDataFrame %>%
slice(1:2)
```
`slice()` is a function that selects out rows based on their row number.
You will also notice something we haven't discussed before: `%>%`. This is called a "pipe", which is commonly used within the tidyverse; you can read more [here](http://magrittr.tidyverse.org/). A pipe takes the output from one command and feeds it as input to the next command. In this case, simply writing the name of the data frame (myDataFrame) causes it to be input to the `slice()` command following the pipe. The benefit of pipes will become especially apparent when we want to start stringing together multiple data processing operations into a single command.
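For example, the piped version of the `slice()` command above is equivalent to calling `slice()` with the data frame as its first argument:
```{r}
# these two commands produce exactly the same output
slice(myDataFrame, 1:2)
myDataFrame %>%
  slice(1:2)
```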
In the previous example, no new variable was created - the output was simply printed to the screen, just like it would be if you typed the name of the variable. If you wanted to save it to a new variable, you would use the `<-` assignment operator, like this:
```{r}
myDataFrameSlice <- myDataFrame %>%
slice(1:2)
myDataFrameSlice
```
#### Select a set of rows based on specific value(s)
```{r}
myDataFrame %>%
filter(n == "russ")
```
`filter()` is a function that retains only those rows that meet your stated criteria. We can also filter for multiple criteria at once --- in this example, the `|` symbol indicates "or":
```{r}
myDataFrame %>%
filter(n == "russ" | n == "lucy")
```
#### Select a set of columns
```{r}
myDataFrame %>%
select(x:y)
```
`select()` is a function that keeps only those columns you specify, using their names.
You can also specify a vector of columns to select.
```{r}
myDataFrame %>%
select(c(x,z))
```
### Adding a row or column
To add a new row, first create a single-row data frame and then bind it to the existing data frame:
```{r}
tiffanyDataFrame <-
tibble(
n = "tiffany",
x = 13,
y = 14,
z = 15
)
myDataFrame %>%
bind_rows(tiffanyDataFrame)
```
`bind_rows()` is a function that appends the rows of another data frame to the current data frame.
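To add a column, you can similarly use the `dplyr` function `bind_cols()`, which appends the columns of another data frame. Here is a minimal sketch (the data frame `wDataFrame` and its column `w` are just for illustration):
```{r}
# create a single-column data frame with one value per existing row
wDataFrame <- tibble(w = c(10, 11, 12, 13))
# append it to myDataFrame as a new column
myDataFrame %>%
  bind_cols(wDataFrame)
```
Another common way to add a column is `mutate()`, which we turn to in the next section.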
## Creating or modifying variables using `mutate()`
Often we will want to either create a new variable based on an existing variable, or modify the value of an existing variable. Within the tidyverse, we do this using a function called ```mutate()```. Let's start with a toy example by creating a data frame containing a single variable.
```{r}
toy_df <- data.frame(x = c(1,2,3,4))
glimpse(toy_df)
```
Let's say that we wanted to create a new variable called `y` that would contain the value of x multiplied by 10. We could do this using ```mutate()``` and then assign the result back to the same data frame:
```{r}
toy_df <- toy_df %>%
# create a new variable called y that contains x*10
mutate(y = x*10)
glimpse(toy_df)
```
We could also overwrite a variable with a new value:
```{r}
toy_df2 <- toy_df %>%
# overwrite y with its current value plus one
mutate(y = y + 1)
glimpse(toy_df2)
```
We will use `mutate()` often so it's an important function to understand.
Here we can use it with our example data frame to create a new variable that is the sum of several other variables.
```{r}
myDataFrame <-
myDataFrame %>%
mutate(total = x + y + z)
kable(myDataFrame)
```
`mutate()` is a function that creates a new variable in a data frame using the existing variables. In this case, it creates a variable called `total` that is the sum of the existing variables `x`, `y`, and `z`.
### Remove a column using the select() function
Adding a minus sign to the name of a variable within the `select()` command will remove that variable, leaving all of the others.
```{r}
myDataFrame <-
myDataFrame %>%
dplyr::select(-total)
kable(myDataFrame)
```
## Tidyverse in action
To see the tidyverse in action, let's clean up the NHANES dataset. Each individual in the NHANES dataset has a unique identifier stored in the variable ```ID```. First let's look at the number of rows in the dataset:
```{r}
nrow(NHANES)
```
Now let's see how many unique IDs there are. The ```unique()``` function returns a vector containing all of the unique values for a particular variable, and the ```length()``` function returns the length of the resulting vector.
```{r}
length(unique(NHANES$ID))
```
This shows us that while there are 10,000 observations in the data frame, there are only `r I(length(unique(NHANES$ID)))` unique IDs. This means that if we were to use the entire dataset, we would be reusing data from some individuals, which could give us incorrect results. For this reason, we would like to discard any observations that are duplicated.
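One quick way to check this is to count the rows whose ID already appeared in an earlier row, for example using base R's `duplicated()` function:
```{r}
# count rows whose ID duplicates an earlier row's ID
sum(duplicated(NHANES$ID))
```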
Let's create a new variable called ```NHANES_unique``` that will contain only the distinct observations, with no individuals appearing more than once. The `dplyr` library provides a function called ```distinct()``` that will do this for us. You may notice that we didn't explicitly load the `dplyr` library above; however, if you look at the messages that appeared when we loaded the `tidyverse` library, you will see that it loaded `dplyr` for us. To create the new data frame with unique observations, we will pipe the NHANES data frame into the ```distinct()``` function and then save the output to our new variable.
```{r dropDupes, warning=FALSE}
NHANES_unique <-
NHANES %>%
distinct(ID, .keep_all = TRUE)
```
If we count the number of rows in the new data frame, it should be the same as the number of unique IDs (`r I(length(unique(NHANES$ID)))`):
```{r}
nrow(NHANES_unique)
```
In the next example you will see the power of pipes come to life, when we start tying together multiple functions into a single operation (or "pipeline").
## Looking at individual variables using pull() and head()
The NHANES data frame contains a large number of variables, but usually we are only interested in a particular variable. We can extract a particular variable from a data frame using the ```pull()``` function. Let's say that we want to extract the variable `PhysActive`. We could do this by piping the data frame into the pull command, which will result in a vector of many thousands of values. Instead of printing out this entire vector, we will pipe the result into the ```head()``` function, which just shows us the first few values contained in a variable. In this case we are not assigning the value back to a variable, so it will simply be printed to the screen.
```{r}
NHANES %>%
# extract the PhysActive variable
pull(PhysActive) %>%
# extract the first 10 values
head(10) %>%
kable()
```
There are two important things to notice here. The first is that there are three different values apparent in the answers: "Yes", "No", and `<NA>`, which means that the value is missing for this person (perhaps they didn't want to answer that question on the survey). When we are working with data we generally need to remove missing values, as we will see below.
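If you just want a quick count of how many values are missing for a variable, one simple check is to combine `is.na()` with `sum()`:
```{r}
# count the missing values of PhysActive in the full dataset
sum(is.na(NHANES$PhysActive))
```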
The second thing to notice is that R prints out a list of "Levels" of the variable. This is because this variable is defined as a particular kind of variable in R known as a *factor*. You can think of a factor variable as a categorical variable with a specific set of levels. The missing data are not treated as a level, so it can be useful to make the missing values explicit, which can be done using a function called ```fct_explicit_na()``` in the `forcats` package. Let's add a line to do that:
```{r}
NHANES %>%
mutate(PhysActive = fct_explicit_na(PhysActive)) %>%
# extract the PhysActive variable
pull(PhysActive) %>%
# extract the first 10 values
head(10) %>%
kable()
```
This new line overwrote the old value of `PhysActive` with a version that has been processed by the ```fct_explicit_na()``` function to convert `<NA>` values to an explicit level. Now you can see that missing values are treated as an explicit level, which will be useful later.
Now we are ready to start summarizing data!
## Computing a frequency distribution (Section \@ref(frequency-distributions))
We would like to compute a frequency distribution showing how many people report being either active or inactive. The following statement is fairly complex so we will step through it one bit at a time.
```{r makePhysActiveTable}
PhysActive_table <- NHANES_unique %>%
# convert the implicit missing values to explicit
mutate(PhysActive = fct_explicit_na(PhysActive)) %>%
# select the variable of interest
dplyr::select(PhysActive) %>%
# group by values of the variable
group_by(PhysActive) %>%
# count the values
summarize(AbsoluteFrequency = n())
# kable() prints out the table in a prettier way.
kable(PhysActive_table)
```
This data frame still contains all of the original variables. Since we are only interested in the `PhysActive` variable, let's extract that one and get rid of the rest. We can do this using the ```select()``` command from the `dplyr` package. Because there is also another select command available in R, we need to explicitly refer to the one from the `dplyr` package, which we do by including the package name followed by two colons: ```dplyr::select()```.
```{r}
NHANES_unique %>%
# convert the implicit missing values to explicit
mutate(PhysActive = fct_explicit_na(PhysActive)) %>%
# select the variable of interest
dplyr::select(PhysActive) %>%
head(10) %>%
kable()
```
The next function, ```group_by()```, tells R that we are going to want to analyze the data separately according to the different levels of the `PhysActive` variable:
```{r}
NHANES_unique %>%
# convert the implicit missing values to explicit
mutate(PhysActive = fct_explicit_na(PhysActive)) %>%
# select the variable of interest
dplyr::select(PhysActive) %>%
group_by(PhysActive) %>%
head(10) %>%
kable()
```
The final command tells R to create a new data frame by summarizing the data that we are passing in (which in this case is the PhysActive variable, grouped by its different levels). We tell the ```summarize()``` function to create a new variable (called `AbsoluteFrequency`) that will contain a count of the number of observations for each group, which is generated by the ```n()``` function.
```{r}
NHANES_unique %>%
# convert the implicit missing values to explicit
mutate(PhysActive = fct_explicit_na(PhysActive)) %>%
# select the variable of interest
dplyr::select(PhysActive) %>%
group_by(PhysActive) %>%
summarize(AbsoluteFrequency = n()) %>%
kable()
```
Now let's say we want to add another column with percentage of observations in each group. We compute the percentage by dividing the absolute frequency for each group by the total number. We can use the table that we already generated, and add a new variable, again using ```mutate()```:
```{r}
PhysActive_table <- PhysActive_table %>%
mutate(
Percentage = AbsoluteFrequency /
sum(AbsoluteFrequency) * 100
)
kable(PhysActive_table, digits=2)
```
## Computing a cumulative distribution (Section \@ref(cumulative-distributions))
Let's compute a cumulative distribution for the `SleepHrsNight` variable in NHANES. This looks very similar to what we saw in the previous section.
```{r}
# create summary table for relative frequency of different
# values of SleepHrsNight
SleepHrsNight_cumulative <-
NHANES_unique %>%
# drop NA values for SleepHrsNight variable
drop_na(SleepHrsNight) %>%
# remove other variables
dplyr::select(SleepHrsNight) %>%
# group by values
group_by(SleepHrsNight) %>%
# create summary table
summarize(AbsoluteFrequency = n()) %>%
# create relative and cumulative frequencies
mutate(
RelativeFrequency = AbsoluteFrequency /
sum(AbsoluteFrequency),
CumulativeDensity = cumsum(RelativeFrequency)
)
kable(SleepHrsNight_cumulative)
```
## Data cleaning and tidying with R
Now that you know a bit about the tidyverse, let's look at the various tools that it provides for working with data. As an example, we will analyze whether attitudes about statistics differ between the student year groups in the class.
### Statistics attitude data from course survey
These data were collected using the Attitudes Towards Statistics (ATS) scale (from https://www.stat.auckland.ac.nz/~iase/cblumberg/wise2.pdf).
The 29-item ATS has two subscales. The Attitudes Toward Field subscale consists of the following 20 items, with reverse-keyed items indicated by an “(R)”:
1, 3, 5, 6(R), 9, 10(R), 11, 13, 14(R), 16(R), 17, 19, 20(R), 21, 22, 23, 24, 26, 28(R), 29
The Attitudes Toward Course subscale consists of the following 9 items:
2(R), 4(R), 7(R), 8, 12(R), 15(R), 18(R), 25(R), 27(R)
For our purposes, we will just combine all 29 items together, rather than separating them into these subscales.
Note: I have removed the data from the graduate students and 5+ year students, since those would be too easily identifiable given how few there are.
Let's first save the file path to the data.
```{r}
attitudeData_file <- 'data/statsAttitude.txt'
```
Next, let's load the data from the file using the tidyverse function `read_tsv()`. There are several functions available for reading in different file formats as part of the `readr` tidyverse package.
```{r, echo=FALSE, message=FALSE}
attitudeData <- read_tsv(attitudeData_file)
```
Right now the variable names are unwieldy, since they include the entire name of the item; this is how Google Forms stores the data. Let's change the variable names to be somewhat more readable. We will change the names to `ats<X>`, where `<X>` is replaced with the question number and "ats" indicates the Attitudes Toward Statistics scale. We can create these names using the `rename()` and `paste0()` functions. `rename()` is pretty self-explanatory: a new name is assigned to an old name or a column position. The `paste0()` function takes a string along with a set of numbers, and creates a vector that combines the string with each of the numbers.
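For example, here is what `paste0()` does on its own:
```{r}
# paste0() combines the string with each of the numbers
paste0('ats', 1:3)
```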
```{r}
nQuestions <- 29 # other than the first two columns,
# the rest of the columns are for the 29 questions in the statistics
# attitude survey; we'll use this below to rename these columns
# based on their question number
# use rename to change the first two column names
# rename can refer to columns either by their number or their name
attitudeData <-
attitudeData %>%
rename( # rename using column numbers
# The first column is the year
Year = 1,
# The second column indicates
# whether the person took stats before
StatsBefore = 2
) %>%
rename_at(
# rename all the columns except Year and StatsBefore
vars(-Year, -StatsBefore),
# rename by pasting the word "ats" and the question number
list(name = ~paste0('ats', 1:nQuestions))
)
# print out the column names
names(attitudeData)
#check out the data again
glimpse(attitudeData)
```
The next thing we need to do is to create an ID for each individual. To do this, we will use the `rownames_to_column()` function from the tidyverse. This creates a new variable (which we name "ID") that contains the row names from the data frame; these are simply the numbers 1 to N.
```{r}
# let's add a participant ID so that we will be able to
# identify them later
attitudeData <-
attitudeData %>%
rownames_to_column(var = 'ID')
head(attitudeData)
```
If you look closely at the data, you can see that some of the participants have some missing responses. We can count them up for each individual and store this count in a new variable called `numNA` using `mutate()`.
We can also create a table showing how many participants have a particular number of NA values. Here we return to two commands we used earlier. The `group_by()` function tells other functions to do their analyses while breaking the data into groups based on one of the variables; here we want to summarize the number of people with each possible number of NAs, so we will group responses by the `numNA` variable that we create in the first command below.
The `summarize()` function creates a summary of the data, with new variables based on the data being fed in. In this case, we just want to count up the number of participants in each group, which we can do using the special `n()` function from `dplyr`. The code below uses `count()`, which is a convenient shorthand for `group_by()` followed by `summarize(n = n())`, as the short sketch below shows.
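A minimal sketch of that equivalence on a toy data frame (the names `toy_counts` and `g` are just for illustration):
```{r}
# a toy data frame with a single grouping variable
toy_counts <- tibble(g = c("a", "a", "b"))
# count() ...
toy_counts %>%
  count(g)
# ... gives the same table as group_by() followed by summarize(n = n())
toy_counts %>%
  group_by(g) %>%
  summarize(n = n())
```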
```{r}
# compute the number of NAs for each participant
attitudeData <-
attitudeData %>%
mutate(
# we use the . symbol to tell the is.na function
# to look at the entire data frame
numNA = rowSums(is.na(.))
)
# present a table with counts of the number of missing responses
attitudeData %>%
count(numNA)
```
We can see from the table that there are only a few participants with missing data; six people are missing one answer, and one is missing two answers. Let's find those individuals, using the `filter()` command from `dplyr`. `filter()` returns the subset of rows from a data frame that match a particular test - in this case, whether `numNA` is greater than 0.
```{r}
attitudeData %>%
filter(numNA > 0)
```
There are fancy techniques for trying to guess the value of missing data (known as "imputation"), but since the number of participants with missing values is small, let's just drop those participants from the list. We can do this using the `drop_na()` function from the `tidyr` package, another tidyverse package that provides tools for cleaning data. We will also remove the `numNA` variable, since we won't need it anymore after removing the participants with missing answers. We do this using the `select()` function from the `dplyr` tidyverse package, which selects or removes columns from a data frame. By putting a minus sign in front of `numNA`, we are telling it to remove that column.
`select()` and `filter()` are similar - `select()` works on columns (i.e. variables) and `filter()` works on rows (i.e. observations).
```{r}
# this is equivalent to drop_na(attitudeData)
attitudeDataNoNA <-
attitudeData %>%
drop_na() %>%
select(-numNA)
```
Try the following on your own: using the `attitudeData` data frame, drop the NA values and create a new variable called `mystery` that contains a value of 1 for anyone who answered 7 to question `ats4` ("Statistics seems very mysterious to me"). Then create a summary that includes the number of people reporting 7 on this question, and the proportion of people who reported 7.
#### Tidy data
These data are in a format that meets the principles of "tidy data", which state the following:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
In our case, each column represents a variable: `ID` identifies which student responded, `Year` contains their year at Stanford, `StatsBefore` contains whether or not they have taken statistics before, and `ats1` through `ats29` contain their responses to each item on the ATS scale. Each observation (row) is a response from one individual student. Each value has its own cell (e.g., the values for `Year` and `StatsBefore` are stored in separate cells in separate columns).
For an example of data that are NOT tidy, take a look at these data [Belief in Hell](http://www.pewforum.org/religious-landscape-study/belief-in-hell/#generational-cohort) - click on the "Table" tab to see the data.
- What are the variables?
- Why aren't these data tidy?
#### Recoding data
We now have tidy data; however, some of the ATS items require recoding. Specifically, some of the items need to be "reverse coded"; these items include: ats2, ats4, ats6, ats7, ats10, ats12, ats14, ats15, ats16, ats18, ats20, ats25, ats27 and ats28. The raw responses for each item are on the 1-7 scale; therefore, for the reverse coded items, we need to reverse them by subtracting the raw score from 8 (such that 7 becomes 1 and 1 becomes 7). To recode these items, we will use the tidyverse `mutate()` function. It's a good idea when recoding to preserve the raw original variables and create new recoded variables with different names.
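As a quick check of the reversal arithmetic:
```{r}
# subtracting from 8 flips the 1-7 scale: 1 becomes 7 and 7 becomes 1
8 - c(1, 4, 7)
```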
There are two ways we can use the `mutate()` function to recode these variables. The first way is easier to understand if you are new to coding, but it is less efficient and more prone to error: we simply repeat the same code for every variable we want to reverse code, as follows:
```{r}
attitudeDataNoNA %>%
mutate(
ats2_re = 8 - ats2,
ats4_re = 8 - ats4,
ats6_re = 8 - ats6,
ats7_re = 8 - ats7,
ats10_re = 8 - ats10,
ats12_re = 8 - ats12,
ats14_re = 8 - ats14,
ats15_re = 8 - ats15,
ats16_re = 8 - ats16,
ats18_re = 8 - ats18,
ats20_re = 8 - ats20,
ats25_re = 8 - ats25,
ats27_re = 8 - ats27,
ats28_re = 8 - ats28
)
```
The second way is more efficient and takes advantage of "scoped verbs" (https://dplyr.tidyverse.org/reference/scoped.html), which allow you to apply the same code to several variables at once. Because you don't have to keep repeating the same code, you're less likely to make an error:
```{r}
#create a vector of the names of the variables to recode
ats_recode <-
c(
"ats2",
"ats4",
"ats6",
"ats7",
"ats10",
"ats12",
"ats14",
"ats15",
"ats16",
"ats18",
"ats20",
"ats25",
"ats27",
"ats28"
)
attitudeDataNoNA <-
attitudeDataNoNA %>%
mutate_at(
vars(ats_recode), # the variables you want to recode
funs(re = 8 - .) # the function to apply to each variable
)
```
Whenever we do an operation like this, it's good to check that it actually worked correctly. It's easy to make mistakes in coding, which is why it's important to check your work as well as you can.
We can quickly select a couple of the raw and recoded columns from our data and make sure things appear to have gone according to plan:
```{r}
attitudeDataNoNA %>%
select(
ats2,
ats2_re,
ats4,
ats4_re
)
```
Let's also make sure that there are no responses outside of the 1-7 scale that we expect, and make sure that no one specified a year outside of the 1-4 range.
```{r}
attitudeDataNoNA %>%
summarise_at(
vars(ats1:ats28_re),
funs(min, max)
)
attitudeDataNoNA %>%
summarise_at(
vars(Year),
funs(min, max)
)
```
#### Different data formats
Sometimes we need to reformat our data in order to analyze it or visualize it in a specific way. Two tidyverse functions, `gather()` and `spread()`, help us to do this.
For example, say we want to examine the distribution of the raw responses to each of the ATS items (i.e., a histogram). In this case, we would need our x-axis to be a single column of the responses across all the ATS items. However, currently the responses for each item are stored in 29 different columns.
This means that we need to create a new version of this dataset. It will have four columns:
- ID
- Year
- Question (for each of the ATS items)
- ResponseRaw (for the raw response to each of the ATS items)
Thus, we want to change the format of the dataset from "wide" to "long".
We change the format to "long" using the `gather()` function.
`gather()` takes a number of variables and reformats them into two variables: one that contains the variable values, and another called the "key" that tells us which variable each value came from. In this case, we want it to reformat the data so that each response to an ATS question is in a separate row and the key column tells us which ATS question it corresponds to. It is much better to see this in practice than to explain it in words!
```{r}
attitudeData_long <-
attitudeDataNoNA %>%
#remove the raw variables that you recoded
select(-ats_recode) %>%
gather(
# key refers to the new variable containing the question number
key = question,
# value refers to the new response variable
value = response,
#the only variables we DON'T want to gather
-ID, -Year, -StatsBefore
)
attitudeData_long %>%
slice(1:20)
glimpse(attitudeData_long)
```
Say we now wanted to undo the `gather()` and return our dataset to wide format. For this, we would use the function `spread()`.
```{r}
attitudeData_wide <-
attitudeData_long %>%
spread(
#key refers to the variable indicating which question
# each response belongs to
key = question,
value = response
)
attitudeData_wide %>%
slice(1:20)
```
Now that we have created a "long" version of our data, they are in the right format to create the plot. We will use the tidyverse function `ggplot()` to create our histogram with `geom_histogram`.
```{r}
attitudeData_long %>%
ggplot(aes(x = response)) +
geom_histogram(binwidth = 0.5) +
scale_x_continuous(breaks = seq.int(1, 7, 1))
```
It looks like responses were fairly positive overall.
We can also aggregate each participant's responses to each question during each year of their study at Stanford to examine the distribution of mean ATS responses across people by year.
We will use the `group_by()` and `summarize()` functions to aggregate the responses.
```{r}
attitudeData_agg <-
attitudeData_long %>%
group_by(ID, Year) %>%
summarize(
mean_response = mean(response)
)
attitudeData_agg
```
First let's use `geom_density()` in `ggplot()` to look at mean responses across people, ignoring year of response. The density plot is like a histogram but smooths things over a bit.
```{r}
attitudeData_agg %>%
ggplot(aes(mean_response)) +
geom_density()
```
Now we can also look at the distribution for each year.
```{r}
attitudeData_agg %>%
ggplot(aes(mean_response, color = factor(Year))) +
geom_density()
```
Or look at trends in responses across years.
```{r}
attitudeData_agg %>%
group_by(Year) %>%
summarise(
mean_response = mean(mean_response)
) %>%
ggplot(aes(Year, mean_response)) +
geom_line()
```
This looks like a precipitous drop - but how might that be misleading?