---
output:
  pdf_document: default
  bookdown::gitbook:
    lib_dir: "book_assets"
    includes:
      in_header: google_analytics.html
  html_document: default
---
# Probability in R (with Lucy King)
In this chapter we will go over probability computations in R.
```{r echo=FALSE, warning=FALSE, message=FALSE}
library(NHANES)
library(tidyverse)
set.seed(123456)
knitr::opts_chunk$set(tidy.opts=list(width.cutoff=80))
options(tibble.width = 60)
```
## Basic probability calculations
Let's create a vector of outcomes from 1 to 6, using the `seq()` function to generate the sequence:
```{r}
outcomes <- seq(1, 6)
outcomes
```
Now let's create a vector of logical values based on whether the outcome in each position is equal to 1. Remember that `==` tests for equality of each element in a vector:
```{r}
outcome1isTrue <- outcomes == 1
outcome1isTrue
```
Remember that the simple probability of an outcome is the number of occurrences of the outcome divided by the total number of events. To compute a probability, we can take advantage of the fact that R treats `TRUE`/`FALSE` as 1/0 in arithmetic. The formula for the mean (sum of values divided by the number of values) is thus exactly the same as the formula for the simple probability! So we can compute the probability of the event by simply taking the mean of the logical vector.
```{r}
p1isTrue <- mean(outcome1isTrue)
p1isTrue
```
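The same idea carries over from a list of possible outcomes to observed data. As a minimal sketch, assuming the outcomes above stand for the faces of a fair six-sided die (an assumption made only for illustration; `nRolls`, `rolls`, and `pRolled1` are hypothetical names), we can simulate many rolls with `sample()` and take the mean of the resulting logical vector to estimate the probability of rolling a 1:

```{r}
# number of simulated rolls (hypothetical value chosen for illustration)
nRolls <- 100000
# draw outcomes from 1 to 6 with replacement; each value is equally likely
rolls <- sample(seq(1, 6), size = nRolls, replace = TRUE)
# proportion of rolls equal to 1
pRolled1 <- mean(rolls == 1)
pRolled1
# the mean of a logical vector is the count of TRUE values divided by its length
sum(rolls == 1) / length(rolls)
```

Both expressions give the same value, which should be close to 1/6 but will vary slightly between runs unless a random seed is set.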
## Empirical frequency (Section \@ref(empirical-frequency))
Let's walk through how we computed the empirical frequency of rain in San Francisco.
First we load the data:
```{r message=FALSE}
# we will remove the STATION and NAME variables
# since they are identical for all rows
SFrain <- read_csv("data/SanFranciscoRain/1329219.csv") %>%
  dplyr::select(-STATION, -NAME)
glimpse(SFrain)
```
We see that the data frame contains a variable called `PRCP` which denotes the amount of rain each day. Let's create a new variable called `rainToday` that denotes whether the amount of precipitation was above zero:
```{r}
SFrain <-
  SFrain %>%
  mutate(rainToday = as.integer(PRCP > 0))
glimpse(SFrain)
```
Now we will summarize the data to compute the probability of rain:
```{r}
pRainInSF <-
  SFrain %>%
  summarize(
    pRainInSF = mean(rainToday)
  ) %>%
  pull()
pRainInSF
```
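One caveat when using `mean()` this way: if any values are missing, the result is `NA` unless we pass `na.rm = TRUE`. The sketch below uses a small made-up vector of daily precipitation values (`toyPRCP` and the other names are hypothetical and are not taken from the San Francisco data) to show the difference:

```{r}
# made-up daily precipitation amounts, including one missing day
toyPRCP <- c(0, 0.2, 0, 0, 1.1, NA, 0, 0.4)
# 1 if it rained, 0 if not, NA if the amount is missing
toyRainToday <- as.integer(toyPRCP > 0)
toyRainToday
# with a missing value present, the mean is NA
mean(toyRainToday)
# dropping the missing day gives the proportion of observed days with rain
pRainToy <- mean(toyRainToday, na.rm = TRUE)
pRainToy
```

Whether it is appropriate to drop the missing days depends on why they are missing.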
## Conditional probability (Section \@ref(conditional-probability))
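Let's determine the conditional probability of someone being unhealthy, given that they are over 70 years of age, using the NHANES dataset. First, let's create a new data frame with two logical variables: `Over70`, which denotes whether each person is over 70 years old, and `Unhealthy`, which denotes whether they reported any days of poor physical health. We then drop any rows with missing values.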
```{r}
healthDataFrame <-
  NHANES %>%
  mutate(
    Over70 = Age > 70,
    Unhealthy = DaysPhysHlthBad > 0
  ) %>%
  dplyr::select(Unhealthy, Over70) %>%
  drop_na()
glimpse(healthDataFrame)
```
First, what's the probability of being over 70?
```{r}
# summarise() returns a one-row data frame, so we use pull()
# to extract the specific value from it
pOver70 <-
  healthDataFrame %>%
  summarise(pOver70 = mean(Over70)) %>%
  pull()
pOver70
```
Second, what's the probability of being unhealthy?
```{r}
pUnhealthy <-
  healthDataFrame %>%
  summarise(pUnhealthy = mean(Unhealthy)) %>%
  pull()
pUnhealthy
```
What's the joint probability of being both unhealthy and over 70? We can find it by creating a new variable that multiplies the two individual binary variables together; since anything times zero is zero, this variable will have the value 1 only for cases where both are true. Its mean is the joint probability.
```{r}
pBoth <- healthDataFrame %>%
  mutate(
    both = Unhealthy * Over70
  ) %>%
  summarise(
    pBoth = mean(both)
  ) %>%
  pull()
pBoth
```
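An equivalent way to express "both are true" is the logical AND operator `&`. The following sketch uses two short made-up logical vectors (hypothetical names, not the NHANES data) to show that the two approaches agree:

```{r}
# made-up indicator vectors for six people
toyUnhealthy <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)
toyOver70 <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
# multiplication coerces TRUE/FALSE to 1/0, so the product is 1 only when both are TRUE
pBothProduct <- mean(toyUnhealthy * toyOver70)
# & returns TRUE only where both vectors are TRUE
pBothAnd <- mean(toyUnhealthy & toyOver70)
c(pBothProduct, pBothAnd)
```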
Finally, what's the probability of someone being unhealthy, given that they are over 70 years of age?
```{r}
pUnhealthyGivenOver70 <-
  healthDataFrame %>%
  filter(Over70 == TRUE) %>% # limit to Over70
  summarise(pUnhealthy = mean(Unhealthy)) %>%
  pull()
pUnhealthyGivenOver70
```
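Restricting the data to the over-70 group and taking the mean is one way to get this value. It should agree with the definition of conditional probability, $P(A|B) = P(A \cap B) / P(B)$. A minimal sketch with a made-up data frame (`toyHealth` and the other names are hypothetical, not the NHANES data) checks this:

```{r}
# made-up indicators for eight people
toyHealth <- data.frame(
  Unhealthy = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE),
  Over70 = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)
)
# approach 1: restrict to the over-70 rows, then take the mean of Unhealthy
pCondByFilter <- mean(toyHealth$Unhealthy[toyHealth$Over70])
# approach 2: joint probability divided by the probability of the condition
pJoint <- mean(toyHealth$Unhealthy & toyHealth$Over70)
pCondByFormula <- pJoint / mean(toyHealth$Over70)
c(pCondByFilter, pCondByFormula)
```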
```{r}
# compute the reverse conditional probability:
# what is the probability of being over 70 given that
# one is unhealthy?
pOver70givenUnhealthy <-
  healthDataFrame %>%
  filter(Unhealthy == TRUE) %>% # limit to Unhealthy
  summarise(pOver70 = mean(Over70)) %>%
  pull()
pOver70givenUnhealthy
```
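The two conditional probabilities are generally different, but they are linked by Bayes' rule: $P(B|A) = P(A|B) P(B) / P(A)$. As a sketch on made-up data (`toyData` and the other names here are hypothetical, not the NHANES data), we can confirm that the reversed probability obtained by filtering matches the one obtained from Bayes' rule:

```{r}
# made-up indicators for ten people
toyData <- data.frame(
  Unhealthy = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE),
  Over70 = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
)
# marginal probabilities: A = unhealthy, B = over 70
pA <- mean(toyData$Unhealthy)
pB <- mean(toyData$Over70)
# P(A|B): proportion unhealthy among the over-70 rows
pAgivenB <- mean(toyData$Unhealthy[toyData$Over70])
# P(B|A) computed directly: proportion over 70 among the unhealthy rows
pBgivenAByFilter <- mean(toyData$Over70[toyData$Unhealthy])
# P(B|A) computed from Bayes' rule
pBgivenAByBayes <- pAgivenB * pB / pA
c(pBgivenAByFilter, pBgivenAByBayes)
```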