-
Notifications
You must be signed in to change notification settings - Fork 7
/
practice_stats.R
133 lines (77 loc) · 4.2 KB
/
practice_stats.R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
# t-test ----
# If you don't have the penguins data package installed from previous exercises,
# uncomment the line below and run it to install
#install.packages("palmerpenguins")
# load the data
library(palmerpenguins)
# Remember the variable names by printing the names
# Information on the variables: https://github.com/allisonhorst/palmerpenguins
names(penguins)
# Do different types of penguin species have different average bill depths?
# Compute the means for each group
# Test if the differences
# for Chinstrap and Gentoo are statistically different with a t-test:
t.test(penguins$___[penguins$___ == "Chinstrap"],
penguins$___[penguins$___ == "Gentoo"])
# What do you conclude from the test?
# Bonus question: Does Adelie penguin body mass differ by island?
# Make a subset of the data with rows just for Adelie penguins:
adelie <- penguins[___, ]
# Using adelie, compute the mean body_mass_g per island
# Does the mass of penguins on Biscoe and Torgersen islands differ?
# Regression ----
# If you don't have the penguins data package installed from previous exercises,
# uncomment the line below and run it to install
#install.packages("palmerpenguins")
# load the data
library(palmerpenguins)
# Remember the variable names by printing the names
# Information on the variables: https://github.com/allisonhorst/palmerpenguins
names(penguins)
# We want to know: Can we predict/explain penguin body mass with the variables we have?
# Run a linear regression with body_mass_g as the dependent (y, outcome) variable and
# species, bill_length_mm, bill_depth_mm, flipper_length_mm, and sex as the independent (x, predictor) variables
reg1 <- lm(___ ~ ___, data=penguins)
# Look at the results of the regression with the summary function
# Notice in the summary output that observations (rows) with missing data were automatically
# removed from the analysis
# Plot the regression object to see a series of diagnostic plots
plot(reg1)
# Fit overall is pretty good. The model above made Adelie penguins the base case,
# and generated other indicator variables for the other species and gender.
# To make a separate intercept for each species instead, add a -1 to the
# formula you ran before; summarize the results again
reg2 <- lm(___ ~ ___, data=penguins)
# If we look at the coefficients for the two models, you can see the coefficients
# for species were the only ones to change
reg1$coefficients
reg2$coefficients
# Now, maybe the combination of bill length and depth also matters for body mass.
# So let's include an interaction term between the two bill variables.
# Hint: multiple two variables together to include them independently in the model and
# have an interaction term automatically generated
reg3 <- lm(___ ~ ___, data=penguins)
# Looks like that is important.
# Bonus: prediction ----
# Let's use our model to predict the body mass for a new penguin.
# To predict, we need a data frame with the same column names as our original data.
newdata <- data.frame(species="Adelie",
sex="female",
# set other variables to mean for Adelie penguins
flipper_length_mm=mean(penguins$flipper_length_mm[
penguins$species=="Adelie"], na.rm=TRUE),
bill_length_mm=mean(penguins$bill_length_mm[
penguins$species=="Adelie"], na.rm=TRUE),
bill_depth_mm=mean(penguins$bill_depth_mm[
penguins$species=="Adelie"], na.rm=TRUE))
# The predict function will automatically make the indicator and interaction variables for us.
# Hint: look at the help for predict.lm (what gets called since we have an lm model)
# to see the order of the arguments it expects
predict(___, ___)
# Is the predicted body mass in the range we might expect for Adelie penguins?
mean(penguins$body_mass_g[penguins$species=="Adelie"], na.rm=TRUE)
# What's the confidence interval on the prediction?
# We use the prediction interval type because we want to know the expected
# range for the mass of individual penguins, not the variability in the
# expected value of the mean mass of penguins with these characteristics.
predict(reg3, newdata, interval="prediction")