forked from poldrack/psych10-book
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path04b-VisualizingDataWithR.Rmd
339 lines (239 loc) · 12.5 KB
/
04b-VisualizingDataWithR.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
---
output:
pdf_document: default
bookdown::gitbook:
lib_dir: "book_assets"
includes:
in_header: google_analytics.html
html_document: default
---
# Data visualization with R (with Anna Khazenzon)
```{r echo=FALSE,warning=FALSE,message=FALSE}
library(tidyverse)
library(ggplot2)
library(cowplot)
library(knitr)
library(NHANES)
set.seed(123456)
opts_chunk$set(tidy.opts=list(width.cutoff=80))
options(tibble.width = 60)
# drop duplicated IDs within the NHANES dataset
NHANES <- NHANES %>%
dplyr::distinct(ID,.keep_all=TRUE)
NHANES_adult <- NHANES %>%
drop_na(Height) %>%
subset(subset=Age>=18)
# setup colorblind palette
# from http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/#a-colorblind-friendly-palette
# The palette with grey:
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
```
There are many different tools for plotting data in R, but we will focus on the `ggplot()` function provided by a package called `ggplot2`. ggplot is very powerful, but using it requires getting one's head around how it works.
## The grammar of graphics
or, the "gg" in ggplot
Each language has a grammar consisting of types of words and the rules with which to string them together into sentences. If a sentence is grammatically correct, we're able to parse it, even though that doesn't ensure that it's interesting, beautiful, or even meaningful.
Similarly, plots can be divided up into their core components, which come together via a set of rules.
Some of the major components are :
- data
- aesthetics
- geometries
- themes
The data are the actual variables we're plotting, which we pass to ggplot through the data argument. As you've learned, ggplot takes a **dataframe** in which each column is a variable.
Now we need to tell ggplot *how* to plot those variables, by mapping each variable to an *axis* of the plot. You've seen that when we plot histograms, our variable goes on the x axis. Hence, we set `x=<variable>` in a call to `aes()` within `ggplot()`. This sets **aesthetics**, which are mappings of data to certain scales, like axes or things like color or shape. The plot still had two axes -- x and y -- but we didn't need to *specify* what went on the y axis because ggplot knew by *default* that it should make a count variable.
How was ggplot able to figure that out? Because of **geometries**, which are *shapes* we use to represent our data. You've seen `geom_histogram`, which basically gives our graph a bar plot shape, except that it also sets the default y axis variable to be `count`. Other shapes include points and lines, among many others.
We'll go over other aspects of the grammar of graphics (such as facets, statistics, and coordinates) as they come up. Let's start visualizing some data by first choosing a **theme**, which describes all of the non-data ink in our plot, like grid lines and text.
## Getting started
Load ggplot and choose a theme you like (see [here](https://bookdown.org/asmundhreinn/r4ds-master/graphics-for-communication.html#themes) for examples).
```{r}
library(tidyverse)
theme_set(theme_bw()) # I like this fairly minimal one
```
## Let's think through a visualization
Principles we want to keep in mind:
* Show the data without distortion
* Use color, shape, and location to encourage comparisons
* Minimize visual clutter (maximize your information to ink ratio)
The two questions you want to ask yourself before getting started are:
* What type of variable(s) am I plotting?
* What comparison do I want to make salient for the viewer (possibly myself)?
Figuring out *how* to highlight a comparison and include relevant variables usually benefits from sketching the plot out first.
## Plotting the distribution of a single variable
How do you choose which **geometry** to use? ggplot allows you to choose from a number of geometries. This choice will determine what sort of plot you create. We will use the built-in mpg dataset, which contains fuel efficiency data for a number of different cars.
### Histogram
The histogram shows the overall distribution of the data. Here we use the nclass.FD function to compute the optimal bin size.
```{r fig.width=4, fig.height=4, out.width="50%"}
ggplot(mpg, aes(hwy)) +
geom_histogram(bins = nclass.FD(mpg$hwy)) +
xlab('Highway mileage')
```
Instead of creating discrete bins, we can look at relative density continuously.
### Density plot
```{r fig.width=4, fig.height=4, out.width="50%"}
ggplot(mpg, aes(hwy)) +
geom_density() +
xlab('Highway mileage')
```
A note on defaults: The default statistic (or "stat") underlying `geom_density` is called "density" -- not surprising. The default stat for `geom_histogram` is "count". What do you think would happen if you overrode the default and set `stat="count"`?
```{rfig.width=4, fig.height=4, out.width="50%"}
ggplot(mpg, aes(hwy)) +
geom_density(stat = "count")
```
What we discover is that the *geometric* difference between `geom_histogram` and `geom_density` can actually be generalized. `geom_histogram` is a shortcut for working with `geom_bar`, and `geom_density` is a shortcut for working with `geom_line`.
### Bar vs. line plots
```{r fig.width=4, fig.height=4, out.width="50%"}
ggplot(mpg, aes(hwy)) +
geom_bar(stat = "count")
```
Note that the geometry tells ggplot what kind of plot to use, and the statistic (*stat*) tells it what kind of summary to present.
```{r fig.width=4, fig.height=4, out.width="50%"}
ggplot(mpg, aes(hwy)) +
geom_line(stat = "density")
```
## Plots with two variables
Let's check out mileage by car manufacturer. We'll plot one *continuous* variable by one *nominal* one.
First, let's make a bar plot by choosing the stat "summary" and picking the "mean" function to summarize the data.
```{r fig.width=8, fig.height=4, out.width="80%"}
ggplot(mpg, aes(manufacturer, hwy)) +
geom_bar(stat = "summary", fun.y = "mean") +
ylab('Highway mileage')
```
One problem with this plot is that it's hard to read some of the labels because they overlap. How could we fix that? Hint: search the web for "ggplot rotate x axis labels" and add the appropriate command.
*TBD: fix*
```{r fig.width=8, fig.height=4, out.width="80%"}
ggplot(mpg, aes(manufacturer, hwy)) +
geom_bar(stat = "summary", fun.y = "mean") +
ylab('Highway mileage')
```
### Adding on variables
What if we wanted to add another variable into the mix? Maybe the *year* of the car is also important to consider. We have a few options here. First, you could map the variable to another **aesthetic**.
```{r fig.width=8, fig.height=4, out.width="80%"}
# first, year needs to be converted to a factor
mpg$year <- factor(mpg$year)
ggplot(mpg, aes(manufacturer, hwy, fill = year)) +
geom_bar(stat = "summary", fun.y = "mean")
```
By default, the bars are *stacked* on top of one another. If you want to separate them, you can change the `position` argument form its default to "dodge".
```{r fig.width=8, fig.height=4, out.width="80%"}
ggplot(mpg, aes(manufacturer, hwy, fill=year)) +
geom_bar(stat = "summary",
fun.y = "mean",
position = "dodge")
```
```{r fig.width=4, fig.height=4, out.width="50%"}
ggplot(mpg, aes(year, hwy,
group=manufacturer,
color=manufacturer)) +
geom_line(stat = "summary", fun.y = "mean")
```
For a less visually cluttered plot, let's try **facetting**. This creates *subplots* for each value of the `year` variable.
```{r fig.width=8, fig.height=4, out.width="80%"}
ggplot(mpg, aes(manufacturer, hwy)) +
# split up the bar plot into two by year
facet_grid(year ~ .) +
geom_bar(stat = "summary",
fun.y = "mean")
```
### Plotting dispersion
Instead of looking at just the means, we can get a sense of the entire distribution of mileage values for each manufacturer.
#### Box plot
```{r fig.width=8, fig.height=4, out.width="80%"}
ggplot(mpg, aes(manufacturer, hwy)) +
geom_boxplot()
```
A **box plot** (or box and whiskers plot) uses quartiles to give us a sense of spread. The thickest line, somewhere inside the box, represents the *median*. The upper and lower bounds of the box (the *hinges*) are the first and third quartiles (can you use them to approximate the interquartile range?). The lines extending from the hinges are the remaining data points, excluding **outliers**, which are plotted as individual points.
#### Error bars
Now, let's do something a bit more complex, but much more useful -- let's create our own summary of the data, so we can choose which summary statistic to plot and also compute a measure of dispersion of our choosing.
```{r fig.width=8, fig.height=4, out.width="80%"}
# summarise data
mpg_summary <- mpg %>%
group_by(manufacturer) %>%
summarise(n = n(),
mean_hwy = mean(hwy),
sd_hwy = sd(hwy))
# compute confidence intervals for the error bars
# (we'll talk about this later in the course!)
limits <- aes(
# compute the lower limit of the error bar
ymin = mean_hwy - 1.96 * sd_hwy / sqrt(n),
# compute the upper limit
ymax = mean_hwy + 1.96 * sd_hwy / sqrt(n))
# now we're giving ggplot the mean for each group,
# instead of the datapoints themselves
ggplot(mpg_summary, aes(manufacturer, mean_hwy)) +
# we set stat = "identity" on the summary data
geom_bar(stat = "identity") +
# we create error bars using the limits we computed above
geom_errorbar(limits, width=0.5)
```
Error bars don't always mean the same thing -- it's important to determine whether you're looking at e.g. standard error or confidence intervals (which we'll talk more about later in the course).
##### Minimizing non-data ink
The plot we just created is nice and all, but it's tough to look at. The bar plots add a lot of ink that doesn't help us compare engine sizes across manufacturers. Similarly, the width of the error bars doesn't add any information. Let's tweak which *geometry* we use, and tweak the appearance of the error bars.
```{r fig.width=8, fig.height=4, out.width="80%"}
ggplot(mpg_summary, aes(manufacturer, mean_hwy)) +
# switch to point instead of bar to minimize ink used
geom_point() +
# remove the horizontal parts of the error bars
geom_errorbar(limits, width = 0)
```
Looks a lot cleaner, but our points are all over the place. Let's make a final tweak to make *learning something* from this plot a bit easier.
```{r fig.width=8, fig.height=4, out.width="80%"}
mpg_summary_ordered <- mpg_summary %>%
mutate(
# we sort manufacturers by mean engine size
manufacturer = reorder(manufacturer, -mean_hwy)
)
ggplot(mpg_summary_ordered, aes(manufacturer, mean_hwy)) +
geom_point() +
geom_errorbar(limits, width = 0)
```
### Scatter plot
When we have multiple *continuous* variables, we can use points to plot each variable on an axis. This is known as a **scatter plot**. You've seen this example in your reading.
```{r fig.width=4, fig.height=4, out.width="50%"}
ggplot(mpg, aes(displ, hwy)) +
geom_point()
```
#### Layers of data
We can add layers of data onto this graph, like a *line of best fit*. We use a geometry known as a **smooth** to accomplish this.
```{r fig.width=4, fig.height=4, out.width="50%"}
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(color = "black")
```
We can add on points and a smooth line for another set of data as well (efficiency in the city instead of on the highway).
```{r fig.width=4, fig.height=4, out.width="50%"}
ggplot(mpg) +
geom_point(aes(displ, hwy), color = "grey") +
geom_smooth(aes(displ, hwy), color = "grey") +
geom_point(aes(displ, cty), color = "limegreen") +
geom_smooth(aes(displ, cty), color = "limegreen")
```
## Creating a more complex plot
In this section we will recreate Figure \@ref(fig:challengerTemps) from Chapter \@ref{data-visualization}. Here is the code to generate the figure; we will go through each of its sections below.
```{r fig.width=8,fig.height=4,out.height='50%'}
oringDf <- read.table("data/orings.csv", sep = ",",
header = TRUE)
oringDf %>%
ggplot(aes(x = Temperature, y = DamageIndex)) +
geom_point() +
geom_smooth(method = "loess",
se = FALSE, span = 1) +
ylim(0, 12) +
geom_vline(xintercept = 27.5, size =8,
alpha = 0.3, color = "red") +
labs(
y = "Damage Index",
x = "Temperature at time of launch"
) +
scale_x_continuous(breaks = seq.int(25, 85, 5)) +
annotate(
"text",
angle=90,
x = 27.5,
y = 6,
label = "Forecasted temperature on Jan 28",
size = 5
)
```
## Additional reading and resources
* [ggplot theme reference](http://ggplot2.tidyverse.org/reference/ggtheme.html)
* [knockoff tech themes](http://www.ggplot2-exts.org/ggtech.html)