-
Notifications
You must be signed in to change notification settings - Fork 12
/
002-intro-stats.qmd
606 lines (457 loc) · 20 KB
/
002-intro-stats.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
---
title: "Introduction to R Statistics"
author: "Jeff Oliver"
date: "`r format(Sys.time(), '%d %B, %Y')`"
---
An introduction to using the R statistics package and the RStudio interface.
#### Learning objectives
1. Read data from files and output results to files
2. Extract relevant portions of datasets
3. Run standard statistical tests in R, including Student's _t_, analysis of
variance (ANOVA), and simple linear regression.
## Statistics in R
R was _designed_ for statistical analyses. This lesson provides an overview of
reading data and writing output, as well as running standard statistical tests
in R, including t-tests, linear regression, and analysis of variance.
## Setup
First we need to setup our development environment. Open RStudio and create a
new project via:
+ File > New Project...
+ Select 'New Directory'
+ For the Project Type select 'New Project'
+ For Directory name, call it something like "r-stats" (without the quotes)
+ For the subdirectory, select somewhere you will remember (like "My Documents"
or "Desktop")
We need to create two folders: 'data' will store the data we will be analyzing,
and 'output' will store the results of our analyses.
```{r setup, eval = FALSE}
dir.create(path = "data")
dir.create(path = "output")
```
## Data interrogation
For our first set of analyses, we'll use a dataset that comes pre-loaded in R.
The `iris` data were collected by botanist Edgar Anderson and used in the early
statistical work of R.A. Fisher. Start by looking at the data with the `head`
command:
```{r investigate-data}
head(x = iris)
```
`iris` is a `data.frame`, which is probably the most commonly used data
structure in R. It is basically a table where each column is a variable and
each row has one set of values for each of those variables (much like a single
sheet in a program like LibreOffice Calc or Microsoft Excel). In the `iris`
data, there are five columns: `Sepal.Length`, `Sepal.Width`, `Petal.Length`,
`Petal.Width`, and `Species.` Each row corresponds to the measurements for an
individual flower. Note that all the values in a column of a `data.frame` must
be of the same type - if you try to mix numbers and words in the same column, R
will "coerce" the data to a single type, which may cause problems for
downstream analyses.
An investigation of our call to the `head` command illustrates two fundamental
concepts in R: variables and functions.
```{r head-explanation, eval = FALSE}
head(x = iris)
```
+ `iris` is a variable. That is, it is a name we use to refer to some
information in our computer's memory. In this case, the information is a table
of flower measurements.
+ `head` is the name of the function that prints out the first six rows of a
`data.frame`. Most functions require some form of input; in this example, we
provided one piece of input to `head`: the name of the variable for which we
want the first six lines.
Another great idea when investigating data is to plot it out to see if there
are any odd values. Here we use `boxplot` to show the data for each species.
```{r boxplot}
boxplot(formula = Petal.Length ~ Species, data = iris)
```
`boxplot` uses the syntax `y ~ group`, where the reference to the left of the
tilde (~) is the value to plot on the y-axis (here we are plotting the values
of `Petal.Length`) and the reference to the right indicates how to group the
data (here we group by the value in the `Species` column of `iris`). Find out
more about the plot by typing `?boxplot` into the console.
Also note that R is *case sensitive*, so if we refer to objects without using
the correct case, we will often encounter errors. For example, if I forgot to
capitalize `Species` in the `boxplot` call, R cannot find `species` (note the
lower-case "s") and throws an error:
```{r error-example, error = TRUE}
boxplot(formula = Petal.Length ~ species, data = iris)
```
To keep track of what we do, we will switch from running commands directly in
the console to writing R scripts that we can execute. These scripts are simple
text files with R commands.
## Student's _t_
We are going to start by doing a single comparison, looking at the petal
lengths of two species. We use a _t_-test to ask whether or not the values for
two species were likely drawn from two separate populations. Just looking at
the data for two species of irises, it looks like the petal lengths are
different, but are they _significantly_ different?
<style type="text/css">
.table {
width: 30%;
}
</style>
| _I. setosa_ | _I. versicolor_ |
|:-----------:|:---------------:|
| 1.4 | 4.7 |
| 1.4 | 4.5 |
| 1.3 | 4.9 |
| 1.5 | 4.0 |
| 1.4 | 4.6 |
| ... | ... |
Start by making a new R script file (File > New File > R Script) and save it as
"iris-t-test.R". We start by adding some key information to the top of the
script, using the comment character, `#`, so R will know to ignore these lines.
Commenting your code is critical in understanding why and how you did analyses
when you return to the code two years from now.
```{r t-test-header}
# T-test on iris petal lengths
# Jeff Oliver
# 2016-09-09
# Compare setosa and versicolor
```
We'll start by comparing the data of _Iris setosa_ and _Iris versicolor_, so we
need to create two new data objects, one corresponding to the _I. setosa_ data
and one for the _I. versicolor_ data.
```{r subset}
setosa <- iris[iris$Species == "setosa", ]
versicolor <- iris[iris$Species == "versicolor", ]
```
OK, a lot happened with those two lines. Let's take a look:
+ `iris` is the `data.frame` we worked with before.
+ `iris$Species` refers to one column in `iris`, that is, the column with the
name of the species (setosa, versicolor, or virginica).
+ The square brackets `[<position 1>, <position 2>]` are used to indicate a
subset of the `iris` data. A `data.frame` is effectively a two-dimensional
structure - it has some number of rows (the first dimension) and some number of
columns (the second dimension). We can see how many rows and columns are in a
`data.frame` with the `dim` command. `dim(iris)` prints out the number of rows (`r nrow(iris)`)
and the number of columns (`r ncol(iris)`):
```{r iris-dimensions}
dim(iris)
```
We use the square brackets to essentially give an address for the data we are
interested in. We tell R which rows we want in the first position and which
columns we want in the second position. If a dimension is left blank, then all
rows/columns are returned. For example, this returns all columns for the third
row of data in `iris`:
```{r subset-rows}
iris[3, ]
```
So the code
```{r subset-setosa, eval = FALSE}
setosa <- iris[iris$Species == "setosa", ]
```
will extract all columns (because there is nothing after the comma) in the
`iris` data for those rows where the value in the `Species` column is "setosa"
_and_ assign that information to a variable called `setosa`.
Comparing the `iris` data and the `setosa` data, we see that there are indeed
fewer rows in the `setosa` data:
```{r count-rows}
nrow(iris)
nrow(setosa)
```
Now to compare the two species, we call the `t.test` function in R, passing
each set of data to `x` and `y`.
```{r t-test}
# Compare Petal.Length of these two species
t.test(x = setosa$Petal.Length, y = versicolor$Petal.Length)
```
The output of a _t_-test is a little different than what we will see later in
the ANOVA. The results include:
+ Test statistic, degrees of freedom, and p-value
+ The confidence interval for the difference in means between the two data sets
+ The means of each data set
So we reject the hypothesis that these species have the same petal lengths.
The final script should be:
```{r t-test-script, eval = FALSE}
# T-test on iris petal lengths
# Jeff Oliver
# 2016-09-09
# Compare setosa and versicolor
# Subset data
setosa <- iris[iris$Species == "setosa", ]
versicolor <- iris[iris$Species == "versicolor", ]
# Run t-test
t.test(x = setosa$Petal.Length, y = versicolor$Petal.Length)
```
> ### Challenge 1
Test for significant differences in petal lengths between _I. setosa_ and
_I. virginica_ and between _I. versicolor_ and _I. virginica_.
([Solution](#solution-to-challenge-1))
***
## Analysis of Variance (ANOVA)
ANOVA allows us to simultaneously compare multiple groups, to test whether
group membership has a significant effect on a variable of interest. Create a
new script file called 'iris-anova.R' and the header information.
```{r anova-header}
# ANOVA on iris data set
# Jeff Oliver
# 2016-09-09
```
The question we will address is: _are there differences in petal length among
the three species?_ We start by building an analysis of variance model with
the `aov` function:
```{r aov}
aov(formula = Petal.Length ~ Species, data = iris)
```
In this case, we pass _two_ arguments to the `aov` function:
1. For the `formula` parameter, we pass `Petal.Length ~ Species`. This format
is used throughout R for describing relationships we are testing. The format is
`y ~ x`, where the response variables (e.g. `y`) are to the left of the tilde
(~) and the predictor variables (e.g. `x`) are to the right of the tilde. In
this example, we are asking if petal length is significantly different among
the three species.
2. We also need to tell R where to find the `Petal.Length` and `Species` data,
so we pass the variable name of the `iris data.frame` to the `data` parameter.
But we want to store the model, not just print it to the screen, so we use the
assignment operator `<-` to store the product of the `aov` function in a
variable of our choice:
```{r aov-assignment}
petal_length_aov <- aov(formula = Petal.Length ~ Species, data = iris)
```
Notice how when we execute this command, nothing printed in the console. This
is because we instead sent the output of the `aov` call to a variable. If you
just type the variable name,
```{r aov-results-code, eval = FALSE}
petal_length_aov
```
you will see the familiar output from the `aov` function:
```{r aov-results-print, echo = FALSE}
petal_length_aov
```
To see the results of the ANOVA, we call the `summary` function:
```{r aov-summary}
summary(object = petal_length_aov)
```
The species _do_ have significantly different petal lengths (P < 0.001). If one
wanted to run a _post hoc_ test to assess _how_ the species are different, a
Tukey test comparing means would likely be the most appropriate option. A link
to an example of how to do this is in the [Additional resources](#additional-resources)
section at the end of this lesson.
If we want to save these results to a file, we put the call to `summary`
between a pair of calls to `sink`:
```{r aov-write, eval = FALSE}
sink(file = "output/petal-length-anova.txt")
summary(object = petal_length_aov)
sink()
```
Our script should look like this:
```{r aov-script, eval = FALSE}
# ANOVA on iris data set
# Jeff Oliver
# 2016-09-09
# Run ANOVA on petal length
petal_length_aov <- aov(formula = Petal.Length ~ Species, data = iris)
# Save results to file
sink(file = "output/petal-length-anova.txt")
summary(object = petal_length_aov)
sink()
```
> ### Challenge 2
Use ANOVA to test for differences in sepal width among the three species. What
is the value of the _F_-statistic?
([Solution](#solution-to-challenge-2))
***
## Linear regression
For this final section, we will test for a relationship between life expectancy
and per capita [gross domestic product](https://en.wikipedia.org/wiki/Gross_domestic_product)
(GDP). Start by downloading the data from [https://tinyurl.com/gapminder-five-year-csv](https://tinyurl.com/gapminder-five-year-csv)
(right-click or Ctrl-click on link and Save As...). Save this to the 'data'
directory you created in the Setup section. Now create another R script and
call it gapminder-reg.R. The file has comma-separated values for 142 countries
at twelve different years; the data can be loaded in R with the `read.csv`
function:
```{r read-data}
# Test relationship between life expectancy and GDP
# Jeff Oliver
# 2016-07-29
all_gapminder <- read.csv(file = "data/gapminder-FiveYearData.csv",
stringsAsFactors = TRUE)
```
This reads the file into memory and stores the data in a data frame called
`all_gapminder`.
Recall you can see the first few rows with the `head` function.
```{r gapminder-head}
head(all_gapminder)
```
Another useful quality assurance tool is `summary`, which provides a basic
description for each column in the data frame.
```{r gapminder-summary}
summary(all_gapminder)
```
For the four numeric columns (`year`, `pop`, `lifeExp`, and `gdpPercap`), some
descriptive statistics are shown. For the `country` and `continent` columns the
first few values and frequencies of each value are shown (i.e. there are 12
records for Afghanistan and 624 records for Africa).
For this analysis, we only want the data from 2007, so we start by subsetting
those data. This creates a new variable and stores only those rows in the
original data frame where the value in the `year` column is 2007.
```{r subset-gapminder}
# Subset 2007 data
gapminder <- all_gapminder[all_gapminder$year == 2007, ]
```
As we did for the ANOVA analyses, it is usually a good idea to visually inspect
the data when possible. Here we can use the `plot` function to create a
scatterplot of the two columns of interest, `lifeExp` and `gdpPercap`.
```{r plot-gapminder}
# Plot to look at data
plot(x = gapminder$gdpPercap, y = gapminder$lifeExp)
```
We can see immediately that this is unlikely a linear relationship. For our
purposes, we will need to log-transform the GDP data. Create a new column in
the `gapminder` data frame with the log~10~-transformed GDP and plot this
transformed data.
```{r log-transform}
# Create log-transformed GDP
gapminder$logGDP <- log10(gapminder$gdpPercap)
# Plot again, with log-transformed GDP on the x-axis
plot(x = gapminder$logGDP,
y = gapminder$lifeExp,
xlab = "log10(GDP)",
ylab = "Life Expectancy")
```
Notice also that we passed two additional arguments to the `plot` command:
`xlab` and `ylab`. These are used to label the x- and y-axis, respectively (try
the `plot` function without passing `xlab` and `ylab` arguments to see what
happens without them).
Now that the data are properly transformed, we can create the linear model for
the predictability of life expectancy based on gross domestic product.
```{r lm}
# Run a linear model
lifeExp_v_gdp <- lm(formula = lifeExp ~ logGDP, data = gapminder)
# Investigate results of the model
summary(lifeExp_v_gdp)
```
For our question, the relationship between life expectancy and GDP, focus on
the *coefficients* section, specifically the line for *logGDP*:
>`## logGDP 16.585 1.019 16.283 < 2e-16 ***`
```{r lm-summary, echo = FALSE}
lifeExp_sum <- summary(lifeExp_v_gdp)
```
First of all, there *is* a significant relationship between these two variables
(p < 2 x 10^-16^, or, as R reports in the `Pr>(|t|)` column, p < 2e-16). The
`Estimate` column of the results lists a value of
`r round(x = lifeExp_sum$coefficients['logGDP', 'Estimate'], digits = 3)`,
which means that for every 10-fold increase in per capita GDP (remember we
log~10~-transformed GDP), life expectancy increases by almost 17 years.
As before, if we want to instead save the results to a file instead of printing
them to the screen, we use the `sink` function.
```{r lm-write, eval = FALSE}
sink(file = "output/lifeExp-gdp-regression.txt")
summary(lifeExp_v_gdp)
sink()
```
The final script should be:
```{r lm-script, eval = FALSE}
# Test relationship between life expectancy and GDP
# Jeff Oliver
# 2016-07-29
# Read data from comma-separated values file
all_gapminder <- read.csv(file = "data/gapminder-FiveYearData.csv",
stringsAsFactors = TRUE)
# Subset 2007 data
gapminder <- all_gapminder[all_gapminder$year == 2007, ]
# Plot to look at data
plot(x = gapminder$gdpPercap, y = gapminder$lifeExp)
# Create log-transformed GDP
gapminder$logGDP <- log10(gapminder$gdpPercap)
# Plot new variable
plot(x = gapminder$logGDP,
y = gapminder$lifeExp,
xlab = "log10(GDP)",
ylab = "Life Expectancy")
# Run linear model
lifeExp_v_gdp <- lm(formula = lifeExp ~ logGDP, data = gapminder)
# Save results to file
sink(file = "output/lifeExp-gdp-regression.txt")
summary(lifeExp_v_gdp)
sink()
```
> ### Challenge 3
Test for a relationship between life expectancy and log base 2 of GDP for the
1982 data. How does life expectancy change with a four-fold increase in GDP?
([Solution](#solution-to-challenge-3))
***
## Solutions to Challenges
### Solution to Challenge 1
> Test for significant differences in petal lengths between _I. setosa_ and
_I. virginica_ and between _I. versicolor_ and _I. virginica_.
#### First comparison: _I. setosa_ vs. _I. virginica_
```{r solution-1-1}
# Subset setosa data
setosa <- iris[iris$Species == "setosa", ]
# Subset virginica data
virginica <- iris[iris$Species == "virginica", ]
# Run t-test
t.test(x = setosa$Petal.Length, y = virginica$Petal.Length)
```
_I. setosa_ and _I. virginica_ have significantly different petal lengths.
***
#### Second comparison: _I. versicolor_ and _I. virginica_
```{r solution-1-2}
# Subset versicolor data
versicolor <- iris[iris$Species == "versicolor", ]
# Subset virginica data
virginica <- iris[iris$Species == "virginica", ]
# Run t-test
t.test(x = versicolor$Petal.Length, y = virginica$Petal.Length)
```
_I. versicolor_ and _I. virginica_ also have different significantly different
petal lengths.
***
### Solution to Challenge 2
> Use ANOVA to test for differences in sepal width among the three species.
What is the value of the _F_-statistic?
```{r solution-2}
sepal_width_aov <- aov(formula = Sepal.Width ~ Species, data = iris)
summary(object = sepal_width_aov)
```
```{r solution-2-f, echo = FALSE}
aov_summary <- summary(object = sepal_width_aov)
f_stat <- aov_summary[[1]][['F value']][1]
```
The _F_-statistic = `r round(x = f_stat, digits = 2)`, and the p-value is quite
small, so there are significant sepal width differences among species.
***
### Solution to Challenge 3
> Test for a relationship between life expectancy and log base 2 of GDP for the
1982 data. How does life expectancy change with a four-fold increase in GDP?
```{r solution-3}
# Read data from comma-separated values file
gapminder <- read.csv(file = "data/gapminder-FiveYearData.csv",
stringsAsFactors = TRUE)
# Subset 1982 data
gapminder_1982 <- gapminder[gapminder$year == 1982, ]
# Create log2-transformed GDP
gapminder_1982$log2GDP <- log2(gapminder_1982$gdpPercap)
# Run linear model
lifeExp_v_gdp <- lm(lifeExp ~ log2GDP, data = gapminder_1982)
summary(lifeExp_v_gdp)
```
```{r solution-3-summary, echo = FALSE}
lifeExp_sum <- summary(lifeExp_v_gdp)
```
The line to focus on is the `log2GPD` line in the `coefficients` section:
>`## log2GDP 5.1942 0.2766 18.780 <2e-16 ***`
The coefficient for log~2~ GDP in the model is positive, with increases in GDP
correlating with increased life expectancy. The estimated coefficient for the
relationship is
`r round(x = lifeExp_v_gdp$coefficients['log2GDP'], digits = 2)`. Remember that
we log~2~-tranformed the GDP data, so this coefficient indicates the change in
life expectancy for every two-fold increase in per capita GDP. For a four-fold
increase in GDP, we multiply this coefficient by two (because four is two
two-fold changes) to conclude that a four-fold increase in GDP results in an
increase of
`r round(x = (2 * lifeExp_v_gdp$coefficients['log2GDP']), digits = 2)` years in
life expectancy.
***
## Additional resources
+ Early work by R.A. Fisher: [doi: 10.1111%2Fj.1469-1809.1936.tb02137.x](https://dx.doi.org/10.1111%2Fj.1469-1809.1936.tb02137.x)
+ A [PDF version](https://jcoliver.github.io/learn-r/002-intro-stats.pdf) of this lesson
+ An [Example of Tukey's test](https://www.r-bloggers.com/anova-and-tukeys-test-on-r/)
for _post hoc_ pairwise comparisons from ANOVA results.