---
output:
  bookdown::gitbook:
    lib_dir: "book_assets"
    includes:
      in_header: google_analytics.html
  pdf_document: default
  html_document: default
---
# Fitting models to data {#fitting-models}
```{r echo=FALSE,warning=FALSE,message=FALSE}
library(tidyverse)
library(NHANES)
library(cowplot)
library(mapproj)
library(pander)
library(knitr)
library(modelr)
panderOptions('round',2)
panderOptions('digits',7)
options(digits = 2)
set.seed(123456) # set random seed to exactly replicate results
```
One of the fundamental activities in statistics is creating models that can summarize data using a small set of numbers, thus providing a compact description of the data. In this chapter we will discuss the concept of a statistical model and how it can be used to describe data.
## What is a model?
In the physical world, "models" are generally simplifications of things in the real world that nonetheless convey the essence of the thing being modeled. A model of a building conveys the structure of the building while being small and light enough to pick up with one's hands; a model of a cell in biology is much larger than the actual thing, but again conveys the major parts of the cell and their relationships.
In statistics, a model is meant to provide a similarly condensed description, but for data rather than for a physical structure. Like physical models, a statistical model is generally much simpler than the data being described; it is meant to capture the structure of the data as simply as possible. In both cases, we realize that the model is a convenient fiction that necessarily glosses over some of the details of the actual thing being modeled. As the statistician George Box famously said: "All models are wrong but some are useful."
The basic structure of a statistical model is:
$$
data = model + error
$$
This expresses the idea that the data can be described by a statistical model, which describes what we expect to occur in the data, along with the difference between the model and the data, which we refer to as the *error*.
## Statistical modeling: An example
Let's look at an example of fitting a model to data, using the data from NHANES. In particular, we will try to build a model of the height of children in the NHANES sample. First let's load the data and plot them (see Figure \@ref(fig:childHeight)).
```{r childHeight,echo=FALSE,fig.cap="Histogram of height of children in NHANES.",fig.width=4,fig.height=4,out.height='50%'}
# drop duplicated IDs within the NHANES dataset
NHANES <-
NHANES %>%
dplyr::distinct(ID, .keep_all = TRUE)
# select the appropriate children with good height measurements
NHANES_child <-
NHANES %>%
drop_na(Height) %>%
subset(Age < 18)
NHANES_child %>%
ggplot(aes(Height)) +
geom_histogram(bins = 100)
```
Remember that we want to describe the data as simply as possible while still capturing their important features. What is the simplest model we can imagine that might still capture the essence of the data? How about the most common value in the dataset (which we call the *mode*)?
```{r echo=FALSE}
# create function to compute mode and apply to child height data from NHANES
# R doesn't have a built-in mode function
getmode <- function(v) {
  uniqv <- unique(v)
  return(uniqv[which.max(tabulate(match(v, uniqv)))])
}
height_mode <- getmode(NHANES_child$Height)
```
This redescribes the entire set of `r I(dim(NHANES_child)[1])` children in terms of a single number. If we wanted to predict the height of any new children, then our guess would be the same number: `r I(height_mode)` centimeters.
$$
\hat{height_i} = 166.5
$$
We put the hat symbol over the name of the variable to show that this is our *predicted* value. The error for this individual would then be the difference between the predicted value ($\hat{height_i}$) and their actual height ($height_i$):
$$
error_i = height_i - \hat{height_i}
$$
How good a model is this? In general, we define the goodness of a model in terms of the error, which represents the difference between the model and the data; all other things being equal, the model that produces the lower error is the better model.
```{r errorMode, echo=FALSE}
# compute error compared to the mode and plot histogram
error_mode <- NHANES_child$Height - height_mode
```
What we find is that the average individual has a fairly large error of `r I(mean(error_mode))` centimeters. We would like to have a model where the average error is zero, and it turns out that if we use the arithmetic mean (commonly known as the *average*) as our model then this will be the case.
The mean (often denoted by a bar over the variable, such as $\bar{X}$) is the sum of all of the values, divided by the number of values. Mathematically, we express this as:
$$
\bar{X} = \frac{\sum_{i=1}^{n}x_i}{n}
$$
We can prove mathematically that the sum of errors from the mean (and thus the average error) is zero (see the proof at the end of the chapter if you are interested). Given that the average error is zero, this seems like a better model.
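As a quick numerical check (not a substitute for the proof), you can sum the errors from the mean directly in R; the result is zero, up to floating-point rounding:
```{r eval=FALSE}
# the errors from the mean sum to (essentially) zero
sum(NHANES_child$Height - mean(NHANES_child$Height))
```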
```{r meanError, echo=FALSE,fig.cap="Distribution of errors from the mean.",fig.width=4,fig.height=4,out.height='50%'}
# compute error compared to the mean and plot histogram
error_mean <- NHANES_child$Height - mean(NHANES_child$Height)
ggplot(NULL, aes(error_mean)) +
geom_histogram(bins = 100) +
xlim(-60, 60) +
labs(
x = "Error when predicting height with mean"
)
```
Even though the average of errors from the mean is zero, we can see from the histogram in Figure \@ref(fig:meanError) that each individual still has some degree of error; some are positive and some are negative, and those cancel each other out. For this reason, we generally summarize errors in terms of some kind of measure that counts both positive and negative errors as bad. We could use the absolute value of each error value, but it's more common to use the squared errors, for reasons that we will see later in the course.
There are several common ways to summarize the squared error that you will encounter at various points in this book, so it's important to understand how they relate to one another. First, we could simply add them up; this is referred to as the *sum of squared errors*. The reason we don't usually use this is that its magnitude depends on the number of data points, so it can be difficult to interpret unless we are looking at the same number of observations. Second, we could take the mean of the squared error values, which is referred to as the *mean squared error (MSE)*. However, because we squared the values before averaging, they are not on the same scale as the original data; they are in $centimeters^2$. For this reason, it's also common to take the square root of the MSE, which we refer to as the *root mean squared error (RMSE)*, so that the error is measured in the same units as the original values (in this example, centimeters).
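To make the relationship between these three quantities concrete, here is a small sketch that computes each of them from the errors around the mean (the names `errors`, `sse`, `mse`, and `rmse` are just illustrative):
```{r eval=FALSE}
# illustrative sketch: the three error summaries, computed from the same errors
errors <- NHANES_child$Height - mean(NHANES_child$Height)
sse <- sum(errors**2)   # sum of squared errors: grows with the number of observations
mse <- mean(errors**2)  # mean squared error: in squared units (cm^2)
rmse <- sqrt(mse)       # root mean squared error: back in the original units (cm)
```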
```{r echo=FALSE}
# compute and print RMSE for mean and mode
rmse_mean <- sqrt(mean(error_mean**2))
rmse_mode <- sqrt(mean(error_mode**2))
```
The mean has a pretty substantial amount of error -- any individual data point will be about `r I(sprintf('%.0f',rmse_mean))` cm from the mean on average -- but it's still much better than the mode, which has an average error of about `r I(sprintf('%.0f',rmse_mode))` cm.
### Improving our model
Can we imagine a better model? Remember that these data are from all children in the NHANES sample, who vary from `r I(min(NHANES_child$Age))` to `r I(max(NHANES_child$Age))` years of age. Given this wide age range, we might expect that our model of height should also include age. Let's plot the data for height against age, to see if this relationship really exists.
```{r childHeightLine,echo=FALSE,fig.cap="Height of children in NHANES, plotted without a model (A), with a linear model including only age (B) or age and a constant (C), and with a linear model that fits separate effects of age for males and females (D)."}
p1 <- NHANES_child %>%
ggplot(aes(x = Age, y = Height)) +
geom_point(position = "jitter",size=0.05) +
scale_x_continuous(breaks = seq.int(0, 20, 2)) +
ggtitle('A: original data')
lmResultHeightOnly <- lm(Height ~ Age + 0, data=NHANES_child)
rmse_heightOnly <- sqrt(mean(lmResultHeightOnly$residuals**2))
p2 <- NHANES_child %>%
ggplot(aes(x = Age, y = Height)) +
geom_point(position = "jitter",size=0.05) +
scale_x_continuous(breaks = seq.int(0, 20, 2)) +
annotate('segment',x=0,xend=max(NHANES_child$Age),
y=0,yend=max(lmResultHeightOnly$fitted.values),
color='blue',lwd=1) +
ggtitle('B: age')
p3 <- NHANES_child %>%
ggplot(aes(x = Age, y = Height)) +
geom_point(position = "jitter",size=0.05) +
scale_x_continuous(breaks = seq.int(0, 20, 2)) +
geom_smooth(method='lm',se=FALSE) +
ggtitle('C: age + constant')
p4 <- NHANES_child %>%
ggplot(aes(x = Age, y = Height)) +
geom_point(aes(colour = factor(Gender)),
position = "jitter",
alpha = 0.8,
size=0.05) +
geom_smooth(method='lm',aes(group = factor(Gender),
colour = factor(Gender))) +
theme(legend.position = c(0.25,0.8)) +
ggtitle('D: age + constant + gender')
plot_grid(p1,p2,p3,p4,ncol=2)
```
The black points in Panel A of Figure \@ref(fig:childHeightLine) show individuals in the dataset, and there seems to be a strong relationship between height and age, as we would expect. Thus, we might build a model that relates height to age:
$$
\hat{height_i} = \beta * age_i
$$
where $\beta$ is a *parameter* that we multiply by age to get the smallest error.
You may remember from algebra that a line is defined as follows:
$$
y = slope*x + intercept
$$
If age is the $X$ variable, then that means that our prediction of height from age will be a line with a slope of $\beta$ and an intercept of zero - to see this, let's plot the best fitting line in blue on top of the data (Panel B in Figure \@ref(fig:childHeightLine)). Something is clearly wrong with this model, as the line doesn't seem to follow the data very well. In fact, the RMSE for this model (`r I(rmse_heightOnly)`) is actually higher than the model that only includes the mean! The problem comes from the fact that our model only includes age, which means that the predicted value of height from the model must take on a value of zero when age is zero. Even though the data do not include any children with an age of zero, the line is mathematically required to have a y-value of zero when x is zero, which explains why the line is pulled down below the younger datapoints. We can fix this by including a constant value in our model, which basically represents the estimated value of height when age is equal to zero; even though an age of zero is not plausible in this dataset, this is a mathematical trick that will allow the model to account for the overall magnitude of the data. The model is:
$$
\hat{height_i} = constant + \beta * age_i
$$
where *constant* is a constant value added to the prediction for each individual; we also call it the *intercept*, since it maps onto the intercept in the equation for a line. We will learn later how it is that we actually compute these values for a particular dataset; for now, we will use the `lm()` function in R to compute the values of the constant and $\beta$ that give us the smallest error for these particular data. Panel C in Figure \@ref(fig:childHeightLine) shows this model applied to the NHANES data, where we see that the line matches the data much better than the one without a constant.
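If you want to try this yourself, a minimal sketch of fitting this model with `lm()` and inspecting the two estimated values (the constant and $\beta$) looks like this; the object name `model_with_constant` is just illustrative:
```{r eval=FALSE}
# fit height as a function of age; lm() includes a constant (intercept) by default
model_with_constant <- lm(Height ~ Age, data = NHANES_child)
coef(model_with_constant)  # the estimated intercept (constant) and Age slope (beta)
```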
```{r ageHeightError,echo=FALSE,fig.cap="Distribution of errors from model including constant and age.",fig.width=4,fig.height=4,out.height='50%'}
# find the best fitting model to predict height given age
model_age <- lm(Height ~ Age, data = NHANES_child)
# the add_predictions() function uses the fitted model to add the predicted values for each person to our dataset
NHANES_child <-
NHANES_child %>%
add_predictions(model_age, var = "predicted_age") %>%
mutate(
error_age = Height - predicted_age #calculate each individual's difference from the predicted value
)
rmse_age <-
NHANES_child %>%
summarise(
sqrt(mean((error_age)**2)) #calculate the root mean squared error
) %>%
pull()
```
```{r echo=FALSE}
# compute model fit for modeling with age and gender
model_age_gender <- lm(Height ~ Age + Gender, data = NHANES_child)
rmse_age_gender <-
NHANES_child %>%
add_predictions(model_age_gender, var = "predicted_age_gender") %>%
summarise(
sqrt(mean((Height - predicted_age_gender)**2))
) %>%
pull()
```
Our error is much smaller using this model -- only `r I(rmse_age)` centimeters on average. Can you think of other variables that might also be related to height? What about gender? In Panel D of Figure \@ref(fig:childHeightLine) we plot the data with lines fitted separately for males and females. From the plot, it seems that there is a difference between males and females, but it is relatively small and only emerges after the age of puberty. Let's estimate this model and see how the errors look. In Figure \@ref(fig:msePlot) we plot the root mean squared error values across the different models. From this we see that the model got a little bit better going from mode to mean, much better going from mean to mean+age, and only very slightly better by including gender as well.
```{r msePlot,echo=FALSE, fig.cap="Root mean squared error (RMSE) plotted for each of the models tested above.",fig.width=4,fig.height=4,out.height='50%'}
error_df <- #build a dataframe using the function tribble()
tribble(
~model, ~error,
"mode", rmse_mode,
"mean", rmse_mean,
"mean + age", rmse_age,
"mean + age + gender", rmse_age_gender
) %>%
mutate(
RMSE = error
)
error_df %>%
ggplot(aes(x = model, y = RMSE)) +
geom_col() +
scale_x_discrete(limits = c("mode", "mean", "mean + age", "mean + age + gender")) +
labs(
y = "root mean squared error"
) +
coord_flip()
```
## What makes a model "good"?
There are generally two different things that we want from our statistical model. First, we want it to describe our data well; that is, we want it to have the lowest possible error when modeling our data. Second, we want it to generalize well to new datasets; that is, we want its error to be as low as possible when we apply it to a new dataset. It turns out that these two features can often be in conflict.
To understand this, let's think about where error comes from. First, it can occur if our model is wrong; for example, if we inaccurately said that height goes down with age instead of going up, then our error will be higher than it would be for the correct model. Similarly, if there is an important factor that is missing from our model, that will also increase our error (as it did when we left age out of the model for height). However, error can also occur even when the model is correct, due to random variation in the data, which we often refer to as "measurement error" or "noise". Sometimes this really is due to error in our measurement -- for example, when the measurements rely on a human, such as using a stopwatch to measure elapsed time in a footrace. In other cases, our measurement device is highly accurate (like a digital scale to measure body weight), but the thing being measured is affected by many different factors that cause it to be variable. If we knew all of these factors then we could build a more accurate model, but in reality that's rarely possible.
Let's use an example to show this. Rather than using real data, we will generate some data for the example using a computer simulation (about which we will have more to say in a few chapters). Let's say that we want to understand the relationship between a person's blood alcohol content (BAC) and their reaction time on a simulated driving test. We can generate some simulated data and plot the relationship (see Panel A of Figure \@ref(fig:BACrt)).
```{r, BACrt,echo=FALSE,fig.cap="Simulated relationship between blood alcohol content and reaction time on a driving test, with best-fitting linear model represented by the line. A: linear relationship with low measurement error. B: linear relationship with higher measurement error. C: Nonlinear relationship with low measurement error and (incorrect) linear model"}
dataDf <-
tibble(
BAC = runif(100) * 0.3,
ReactionTime = BAC * 1 + 1 + rnorm(100) * 0.01
)
p1 <- dataDf %>%
ggplot(aes(x = BAC, y = ReactionTime)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
ggtitle('A: linear, low noise')
# noisy version
dataDf <-
tibble(
BAC = runif(100) * 0.3,
ReactionTime = BAC * 2 + 1 + rnorm(100) * 0.2
)
p2 <- dataDf %>%
ggplot(aes(x = BAC, y = ReactionTime)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
ggtitle('B: linear, high noise')
# nonlinear (inverted-U) function
dataDf <-
dataDf %>%
mutate(
caffeineLevel = runif(100) * 10,
caffeineLevelInvertedU = (caffeineLevel - mean(caffeineLevel))**2,
testPerformance = -1 * caffeineLevelInvertedU + rnorm(100) * 0.5
)
p3 <- dataDf %>%
ggplot(aes(x = caffeineLevel, y = testPerformance)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
ggtitle('C: nonlinear')
plot_grid(p1,p2,p3)
```
In this example, reaction time goes up systematically with blood alcohol content -- the line shows the best fitting model, and we can see that there is very little error, which is evident in the fact that all of the points are very close to the line.
We could also imagine data that show the same linear relationship, but have much more error, as in Panel B of Figure \@ref(fig:BACrt). Here we see that there is still a systematic increase of reaction time with BAC, but it's much more variable across individuals.
These were both examples where the *linear model* seems appropriate, and the error reflects noise in our measurement. The linear model specifies that the relationship between two variables follows a straight line. For example, in a linear model, a given change in BAC is always associated with the same change in ReactionTime, regardless of the level of BAC.
On the other hand, there are other situations where the linear model is incorrect, and error will be increased because the model is not properly specified. Let's say that we are interested in the relationship between caffeine intake and performance on a test. The relation between stimulants like caffeine and test performance is often *nonlinear* - that is, it doesn't follow a straight line. This is because performance goes up with smaller amounts of caffeine (as the person becomes more alert), but then starts to decline with larger amounts (as the person becomes nervous and jittery). We can simulate data of this form, and then fit a linear model to the data (see Panel C of Figure \@ref(fig:BACrt)). The blue line shows the straight line that best fits these data; clearly, there is a high degree of error. Although there is a very lawful relation between test performance and caffeine intake, it follows a curve rather than a straight line. The linear model has high error because it's the wrong model for these data.
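To see what a better-specified model would look like here, one option is to allow a curved (quadratic) relationship; a small sketch using the simulated data frame from the chunk above (the name `quadraticModel` is just illustrative):
```{r eval=FALSE}
# a quadratic term lets the model capture the inverted-U shape
quadraticModel <- lm(testPerformance ~ poly(caffeineLevel, 2), data = dataDf)
sqrt(mean(quadraticModel$residuals**2))  # RMSE should be far lower than for the straight-line fit
```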
## Can a model be too good? {#overfitting}
Error sounds like a bad thing, and usually we will prefer a model that has lower error over one that has higher error. However, we mentioned above that there is a tension between the ability of a model to accurately fit the current dataset and its ability to generalize to new datasets, and it turns out that the model with the lowest error often is much worse at generalizing to new datasets!
To see this, let's once again generate some data so that we know the true relation between the variables. We will create two simulated datasets, which are generated in exactly the same way -- they just have different random noise added to them.
```{r Overfitting,echo=FALSE,fig.cap='An example of overfitting. Both datasets were generated using the same model, with different random noise added to generate each set. The left panel shows the data used to fit the model, with a simple linear fit in blue and a complex (8th order polynomial) fit in red. The root mean square error (RMSE) values for each model are shown in the figure; in this case, the complex model has a lower RMSE than the simple model. The right panel shows the second dataset, with the same model overlaid on it and the RMSE values computed using the model obtained from the first dataset. Here we see that the simpler model actually fits the new dataset better than the more complex model, which was overfitted to the first dataset.',fig.width=8,fig.height=4,out.height='50%'}
#parameters for simulation
set.seed(1122)
sampleSize <- 16
#build a dataframe of simulated data
simData <-
tibble(
X = rnorm(sampleSize),
Y = X + rnorm(sampleSize, sd = 1),
Ynew = X + rnorm(sampleSize, sd = 1)
)
#fit models to these data
simpleModel <- lm(Y ~ X, data = simData)
complexModel <- lm(Y ~ poly(X, 8), data = simData)
#calculate root mean squared error for "current" dataset
rmse_simple <- sqrt(mean(simpleModel$residuals**2))
rmse_complex <- sqrt(mean(complexModel$residuals**2))
#calculate root mean squared error for "new" dataset
rmse_prediction_simple <- sqrt(mean((simpleModel$fitted.values - simData$Ynew)**2))
rmse_prediction_complex <- sqrt(mean((complexModel$fitted.values - simData$Ynew)**2))
#visualize
plot_original_data <-
simData %>%
ggplot(aes(X, Y)) +
geom_point() +
geom_smooth(
method = "lm",
formula = y ~ poly(x, 8),
color = "red",
se = FALSE
) +
geom_smooth(
method = "lm",
color = "blue",
se = FALSE
) +
ylim(-3, 3) +
annotate(
"text",
x = -1.25,
y = 2.5,
label = sprintf("RMSE=%0.1f", rmse_simple),
color = "blue",
hjust = 0,
cex = 4
) +
annotate(
"text",
x = -1.25,
y = 2,
label = sprintf("RMSE=%0.1f", rmse_complex),
color = "red",
hjust = 0,
cex = 4
) +
ggtitle("original data")
plot_new_data <-
simData %>%
ggplot(aes(X, Ynew)) +
geom_point() +
geom_smooth(
aes(X, Y),
method = "lm",
formula = y ~ poly(x, 8),
color = "red",
se = FALSE
) +
geom_smooth(
aes(X, Y),
method = "lm",
color = "blue",
se = FALSE
) +
ylim(-3, 3) +
annotate(
"text",
x = -1.25,
y = 2.5,
label = sprintf("RMSE=%0.1f", rmse_prediction_simple),
color = "blue",
hjust = 0,
cex = 4
) +
annotate(
"text",
x = -1.25,
y = 2,
label = sprintf("RMSE=%0.1f", rmse_prediction_complex),
color = "red",
hjust = 0,
cex = 4
) +
ggtitle("new data")
plot_grid(plot_original_data, plot_new_data)
```
The left panel in Figure \@ref(fig:Overfitting) shows that the more complex model (in red) fits the data better than the simpler model (in blue). However, we see the opposite when the same model is applied to a new dataset generated in the same way -- here we see that the simpler model fits the new data better than the more complex model. Intuitively, we can see that the more complex model is influenced heavily by the specific data points in the first dataset; since the exact position of these data points was driven by random noise, this leads the more complex model to fit badly on the new dataset. This is a phenomenon that we call *overfitting*. For now, it's important to keep in mind that our model fit needs to be good, but not too good. As Albert Einstein (1933) said: "It can scarcely be denied that the supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience." This is often paraphrased as: "Everything should be as simple as it can be, but not simpler."
## The simplest model: The mean
We have already encountered the mean (or average), and in fact most people know about the average even if they have never taken a statistics class. It is commonly used to describe what we call the "central tendency" of a dataset -- that is, what value are the data centered around? Most people don't think of computing a mean as fitting a model to data. However, that's exactly what we are doing when we compute the mean.
We have already seen the formula for computing the mean of a sample of data:
$$
\bar{X} = \frac{\sum_{i=1}^{n}x_i}{n}
$$
Note that I said that this formula was specifically for a *sample* of data, which is a set of data points selected from a larger population. Using a sample, we wish to characterize a larger population -- the full set of individuals that we are interested in. For example, if we are a political pollster our population of interest might be all registered voters, whereas our sample might just include a few thousand people sampled from this population. Later in the course we will talk in more detail about sampling, but for now the important point is that statisticians generally like to use different symbols to differentiate *statistics* that describe values for a sample from *parameters* that describe the true values for a population; in this case, the formula for the population mean (denoted as $\mu$) is:
$$
\mu = \frac{\sum_{i=1}^{N}x_i}{N}
$$
where N is the size of the entire population.
We have already seen that the mean is the summary statistic that is guaranteed to give us a mean error of zero. The mean also has another characteristic: It is the summary statistic that has the lowest possible value for the sum of squared errors (SSE). In statistics, we refer to this as being the "best" estimator. We could prove this mathematically, but instead we will demonstrate it graphically in Figure \@ref(fig:MinSSE).
```{r MinSSE, echo=FALSE,fig.cap="A demonstration of the mean as the statistic that minimizes the sum of squared errors. Using the NHANES child height data, we compute the mean (denoted by the blue bar). Then, we test a range of other values, and for each one we compute the sum of squared errors for each data point from that value, which are denoted by the black curve. We see that the mean falls at the minimum of the squared error plot.",fig.width=4,fig.height=4,out.height='50%'}
df_error <-
tibble(
val = seq(100, 175, 0.05),
sse = NA
)
for (i in 1:dim(df_error)[1]) {
  err <- NHANES_child$Height - df_error$val[i]
  df_error$sse[i] <- sum(err**2)
}
df_error %>%
ggplot(aes(val, sse)) +
geom_vline(xintercept = mean(NHANES_child$Height), color = "blue") +
geom_point(size = 0.1) +
annotate(
"text",
x = mean(NHANES_child$Height) + 3,
y = max(df_error$sse),
label = "mean",
color = "blue"
) +
labs(
x = "test value",
y = "Sum of squared errors"
)
```
This minimization of SSE is a good feature, and it's why the mean is the most commonly used statistic to summarize data. However, the mean also has a dark side. Let's say that five people are in a bar, and we examine each person's income:
```{r echo=FALSE}
# create income data frame
incomeDf <-
tibble(
income = c(48000, 64000, 58000, 72000, 66000),
person = c("Joe", "Karen", "Mark", "Andrea", "Pat")
)
```
```{r, echo=FALSE}
kable(incomeDf, caption='Income for our five bar patrons')
```
The mean (`r I(sprintf("%0.2f", mean(incomeDf$income)))`) seems to be a pretty good summary of the income of those five people. Now let's look at what happens if Beyoncé Knowles walks into the bar:
```{r echo=FALSE}
# add Beyonce to income data frame
incomeDf <-
incomeDf %>%
rbind(c(54000000, "Beyonce")) %>%
mutate(income = as.double(income))
```
```{r echo=FALSE}
kable(incomeDf, caption='Income for our five bar patrons plus Beyoncé Knowles.')
```
The mean is now almost 10 million dollars, which is not really representative of any of the people in the bar -- in particular, it is heavily driven by the outlying value of Beyoncé. In general, the mean is highly sensitive to extreme values, which is why it's always important to check for extreme values before using the mean to summarize data.
### The median
If we want to summarize the data in a way that is less sensitive to outliers, we can use another statistic called the *median*. If we sort all of the values in order of their magnitude, then the median is the value in the middle. If there is an even number of values, then there are two middle values, in which case we take the mean (i.e., the halfway point) of those two numbers.
Let's look at an example. Say we want to summarize the following values:
```
8 6 3 14 12 7 6 4 9
```
If we sort those values:
```
3 4 6 6 7 8 9 12 14
```
Then the median is the middle value -- in this case, the 5th of the 9 values.
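A quick check in R (a small sketch using the values above) confirms that the built-in `median()` function gives the same answer as sorting by hand:
```{r eval=FALSE}
x <- c(8, 6, 3, 14, 12, 7, 6, 4, 9)
sort(x)    # 3 4 6 6 7 8 9 12 14
median(x)  # 7 -- the 5th of the 9 sorted values
```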
Whereas the mean minimizes the sum of squared errors, the median minimizes a slightly different quantity: the sum of *absolute* errors. This explains why it is less sensitive to outliers -- squaring exacerbates the effect of large errors compared to taking the absolute value. We can see this in the income example: the median is much more representative of the group as a whole, and much less sensitive to the one large outlier.
```{r echo=FALSE}
income_summary_df <- data.frame(Statistic=c('Mean','Median'),
Value=c(mean(incomeDf$income),
median(incomeDf$income)))
kable(income_summary_df, caption='Summary statistics for income after arrival of Beyoncé Knowles.')
```
Given this, why would we ever use the mean? As we will see in a later chapter, the mean is the "best" estimator in the sense that it will vary less from sample to sample compared to other estimators. It's up to us to decide whether that is worth the sensitivity to potential outliers -- statistics is all about tradeoffs.
## The mode
```{r echo=FALSE}
# compute mean of iPhone model numbers
iphoneDf <-
tribble(
~iPhoneModel, ~count,
8, 325,
9, 450,
10, 700,
11, 250
)
meanPhoneNumber <-
iphoneDf %>%
summarize(
sum(iPhoneModel * count) / sum(count)
) %>%
pull()
```
Sometimes we wish to describe the central tendency of a dataset that is not numeric. For example, let's say that we want to know which models of iPhone are most commonly used. Let's say we ask a large group of iPhone users which model each one owns. If we were to take the average of these values, we would see that the mean iPhone model is `r I(meanPhoneNumber)`, which is clearly nonsensical, since the iPhone model numbers are not meant to be quantitative measurements. In this case, a more appropriate measure of central tendency is the mode, which is the most common value in the dataset, as we discussed above.
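A minimal sketch of computing the mode from the iPhone counts tabulated in the chunk above (since the counts are already tallied, the mode is simply the model with the largest count):
```{r eval=FALSE}
# the mode is the model with the highest count
iphoneDf$iPhoneModel[which.max(iphoneDf$count)]
# for a raw vector of responses, the getmode() function defined earlier would do the same job
```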
## Variability: How well does the mean fit the data?
Once we have described the central tendency of the data, we often also want to describe how variable the data are -- this is sometimes also referred to as "dispersion", reflecting the fact that it describes how widely dispersed the data are.
We have already encountered the sum of squared errors above, which is the basis for the most commonly used measures of variability: the *variance* and the *standard deviation*. The variance for a population (referred to as $\sigma^2$) is simply the sum of squared errors divided by the number of observations - that is, it is exactly the same as the *mean squared error* that you encountered earlier:
$$
\sigma^2 = \frac{SSE}{N} = \frac{\sum_{i=1}^N (x_i - \mu)^2}{N}
$$
where $\mu$ is the population mean. The standard deviation is simply the square root of this -- that is, the *root mean squared error* that we saw before. The standard deviation is useful because it is expressed in the same units as the original data (undoing the squaring that we applied to the errors).
We usually don't have access to the entire population, so we have to compute the variance using a sample, which we refer to as $\hat{\sigma}^2$, with the "hat" representing the fact that this is an estimate based on a sample. The equation for $\hat{\sigma}^2$ is similar to the one for $\sigma^2$:
$$
\hat{\sigma}^2 = \frac{\sum_{i=1}^n (x_i - \bar{X})^2}{n-1}
$$
The only difference between the two equations is that we divide by n - 1 instead of N. This relates to a fundamental statistical concept: *degrees of freedom*. Remember that in order to compute the sample variance, we first had to estimate the sample mean $\bar{X}$. Having estimated this, one value in the data is no longer free to vary. For example, let's say we have the following data points for a variable $x$: [3, 5, 7, 9, 11], the mean of which is 7. Because we know that the mean of this dataset is 7, we can compute what any specific value would be if it were missing. For example, let's say we were to obscure the first value (3). Having done this, we still know that its value must be 3, because the mean of 7 implies that the sum of all of the values is $7 * n = 35$ and $35 - (5 + 7 + 9 + 11) = 3$.
So when we say that we have "lost" a degree of freedom, it means that there is a value that is not free to vary after fitting the model. In the context of the sample variance, if we don't account for the lost degree of freedom, then our estimate of the sample variance will be *biased*, causing us to underestimate the uncertainty of our estimate of the mean.
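As a small sketch, you can check that R's built-in `var()` and `sd()` functions use the $n - 1$ denominator, using the toy dataset from above:
```{r eval=FALSE}
x <- c(3, 5, 7, 9, 11)
sse <- sum((x - mean(x))**2)  # 40
sse / length(x)               # 8: divide by n
sse / (length(x) - 1)         # 10: divide by n - 1
var(x)                        # 10 -- R uses n - 1
sd(x)                         # sqrt(10), the sample standard deviation
```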
## Using simulations to understand statistics
I am a strong believer in the use of computer simulations to understand statistical concepts, and in later sessions we will dig deeply into their use. Here we will introduce the idea by asking whether we can confirm the need to subtract 1 from the sample size in computing the sample variance.
Let's treat the entire sample of children from the NHANES data as our "population", and see how well the calculations of the sample variance using either $n$ or $n-1$ in the denominator estimate the variance of this population, across a large number of simulated random samples from the data. We will return to the details of how to do this in a later chapter.
```{r echo=FALSE}
# compare variance estimates using N or N-1 in denominator
population_variance <-
NHANES_child %>%
summarize(
var(Height)
) %>%
pull()
# take 1000 samples and estimate the sample variance using both n and n-1 in the denominator
sampsize <- 50
nsamp <- 1000
varhat_n <- array(data = NA, dim = nsamp)
varhat_nm1 <- array(data = NA, dim = nsamp)
for (i in 1:nsamp) {
  samp <- sample_n(NHANES_child, 1000)[1:sampsize, ]
  sampmean <- mean(samp$Height)
  sse <- sum((samp$Height - sampmean)**2)
  varhat_n[i] <- sse / sampsize
  varhat_nm1[i] <- sse / (sampsize - 1)
}
summary_df <- data.frame(Estimate=c("Population variance",
"Variance estimate using n",
"Variance estimate using n-1"),
Value=c(population_variance,
mean(varhat_n),
mean(varhat_nm1)))
kable(summary_df, caption='Variance estimates using n versus n-1; the estimate using n-1 is closer to the population value')
```
This shows us that the theory outlined above was correct: the variance estimate using $n - 1$ as the denominator is very close to the variance computed on the full data (i.e., the population), whereas the variance computed using $n$ as the denominator is biased (smaller) relative to the true value.
## Z-scores
```{r setupCrimeData, echo=FALSE}
crimeData <-
read.table(
"data/CrimeOneYearofData_clean.csv",
header = TRUE,
sep = ","
)
# let's drop DC since it is so small
crimeData <-
crimeData %>%
dplyr::filter(State != "District of Columbia")
caCrimeData <-
crimeData %>%
dplyr::filter(State == "California")
```
Having characterized a distribution in terms of its central tendency and variability, it is often useful to express the individual scores in terms of where they sit with respect to the overall distribution. Let's say that we are interested in characterizing the relative level of crimes across different states, in order to determine whether California is a particularly dangerous place. We can ask this question using data for 2014 from the [FBI's Uniform Crime Reporting site](https://www.ucrdatatool.gov/Search/Crime/State/RunCrimeOneYearofData.cfm). The left panel of Figure \@ref(fig:crimeHist) shows a histogram of the number of violent crimes per state, highlighting the value for California. Looking at these data, it seems like California is terribly dangerous, with `r I(caCrimeData$Violent.crime.total)` crimes in that year.
```{r crimeHist,echo=FALSE,fig.cap="Left: Histogram of the number of violent crimes. The value for CA is plotted in blue. Right: A map of the same data, with number of crimes plotted for each state in color.", fig.width=8,fig.height=4,out.height='50%'}
p1 <- crimeData %>%
ggplot(aes(Violent.crime.total)) +
geom_histogram(bins = 25) +
geom_vline(xintercept = caCrimeData$Violent.crime.total, color = "blue") +
xlab("Number of violent crimes in 2014")
library(mapproj)
library(fiftystater)
data("fifty_states") # this line is optional due to lazy data loading
crimeData <-
crimeData %>%
mutate(StateLower = tolower(State))
# map_id creates the aesthetic mapping to the state name column in your data
plot_map <-
ggplot(crimeData, aes(map_id = StateLower)) +
# map points to the fifty_states shape data
geom_map(aes(fill = Violent.crime.total), map = fifty_states) +
scale_x_continuous(breaks = NULL) +
scale_y_continuous(breaks = NULL) +
theme(
legend.position = "bottom",
panel.background = element_blank()
) +
coord_map() +
expand_limits(x = fifty_states$long, y = fifty_states$lat) +
labs(
x = "",
y = ""
)
# add border boxes to AK/HI
p2 <- plot_map + fifty_states_inset_boxes()
plot_grid(p1,p2)
```
With R it's also easy to generate a map showing the distribution of a variable across states, which is presented in the right panel of Figure \@ref(fig:crimeHist).
It may have occurred to you, however, that CA also has the largest population of any state in the US, so it's reasonable that it will also have a larger number of crimes. If we plot the two against one another (see left panel of Figure \@ref(fig:popVsCrime)), we see that there is a direct relationship between population and the number of crimes.
```{r popVsCrime,echo=FALSE,fig.cap="Left: A plot of number of crimes versus population by state. Right: A histogram of per capita crime rates, expressed as crimes per 100,000 people.",fig.width=8,fig.height=4,out.height='50%'}
p1 <- crimeData %>%
ggplot(aes(Population, Violent.crime.total)) +
geom_point() +
annotate(
"point",
x = caCrimeData$Population,
y = caCrimeData$Violent.crime.total,
color = "blue"
) +
annotate(
"text",
x = caCrimeData$Population - 1000000,
y = caCrimeData$Violent.crime.total + 8000,
label = "CA",
color = "blue"
) +
ylab("Number of violent crimes in 2014")
p2 <- crimeData %>%
ggplot(aes(Violent.Crime.rate)) +
geom_histogram(binwidth = 80) +
geom_vline(xintercept = caCrimeData$Violent.Crime.rate, color = "blue") +
annotate(
"text",
x = caCrimeData$Violent.Crime.rate+25,
y = 12,
label = "CA",
color = "blue"
) +
scale_x_continuous(breaks = seq.int(0, 700, 100)) +
scale_y_continuous(breaks = seq.int(0, 13, 2)) +
xlab("Rate of violent crimes per 100,000 people")
plot_grid(p1,p2)
```
Instead of using the raw numbers of crimes, we should instead use the per-capita violent crime rate, which we obtain by dividing the number of crimes by the population of the state. The dataset from the FBI already includes this value (expressed as rate per 100,000 people). Looking at the right panel of Figure \@ref(fig:popVsCrime), we see that California is not so dangerous after all -- its crime rate of `r I(sprintf("%.2f", caCrimeData$Violent.Crime.rate))` per 100,000 people is a bit above the mean across states of `r I(sprintf("%.2f", mean(crimeData$Violent.Crime.rate)))`, but well within the range of many other states. But what if we want to get a clearer view of how far it is from the rest of the distribution?
The *Z-score* allows us to express data in a way that provides more insight into each data point's relationship to the overall distribution. The formula to compute a Z-score for a data point given that we know the value of the population mean $\mu$ and standard deviation $\sigma$ is:
$$
Z(x) = \frac{x - \mu}{\sigma}
$$
Intuitively, you can think of a Z-score as telling you how far away from the mean any data point is, in units of standard deviation. We can compute this for the crime rate data, as shown in Figure \@ref(fig:crimeZplot).
```{r crimeZplot,echo=FALSE,fig.cap="Scatterplot of original crime rate data against Z-scored data.",fig.width=4,fig.height=4,out.height='50%'}
crimeData <-
crimeData %>%
mutate(
ViolentCrimeRateZscore =
(Violent.Crime.rate - mean(Violent.Crime.rate)) /
sd(crimeData$Violent.Crime.rate)
)
caCrimeData <-
crimeData %>%
dplyr::filter(State == "California")
print(paste("mean of Z-scored data:", mean(crimeData$ViolentCrimeRateZscore)))
print(paste("std deviation of Z-scored data:", sd(crimeData$ViolentCrimeRateZscore)))
crimeData %>%
ggplot(aes(Violent.Crime.rate, ViolentCrimeRateZscore)) +
geom_point() +
labs(
x = "Rate of violent crimes",
y = "Z-scored rate of violent crimes"
)
```
The scatterplot shows us that the process of Z-scoring doesn't change the relative distribution of the data points (visible in the fact that the original data and Z-scored data fall on a straight line when plotted against each other) -- it just shifts them to have a mean of zero and a standard deviation of one. However, if you look closely, you will see that the mean isn't exactly zero -- it's just very small. What is going on here is that the computer represents numbers with a certain amount of *numerical precision* -- which means that there are numbers that are not exactly zero, but are small enough that, for practical purposes, they can be treated as zero.
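If you want to convince yourself that the nonzero mean is just rounding error, here is a small sketch (using the `ViolentCrimeRateZscore` column created above):
```{r eval=FALSE}
# the mean of the Z-scored values is numerically tiny rather than exactly zero
mean(crimeData$ViolentCrimeRateZscore)  # on the order of 1e-16 or smaller: floating-point rounding
# within floating-point tolerance it is equal to zero
isTRUE(all.equal(mean(crimeData$ViolentCrimeRateZscore), 0))
```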
Figure \@ref(fig:crimeZmap) shows the Z-scored crime data using the geographical view.
```{r crimeZmap,echo=FALSE,fig.cap="Crime data rendered onto a US map, presented as Z-scores."}
plot_map_z <-
ggplot(crimeData, aes(map_id = StateLower)) +
# map points to the fifty_states shape data
geom_map(aes(fill = ViolentCrimeRateZscore), map = fifty_states) +
expand_limits(x = fifty_states$long, y = fifty_states$lat) +
scale_x_continuous(breaks = NULL) +
scale_y_continuous(breaks = NULL) +
theme(
legend.position = "bottom",
panel.background = element_blank()
) +
coord_map() +
expand_limits(x = fifty_states$long, y = fifty_states$lat) +
labs(x = "", y = "")
# add border boxes to AK/HI
plot_map_z + fifty_states_inset_boxes()
```
This provides us with a slightly more interpretable view of the data. For example, we can see that Nevada, Tennessee, and New Mexico all have crime rates that are roughly two standard deviations above the mean.
### Interpreting Z-scores
The "Z" in "Z-score" comes from the fact that the standard normal distribution (that is, a normal distribution with a mean of zero and a standard deviation of 1) is often referred to as the "Z" distribution. We can use the standard normal distribution to help us understand what specific Z scores tell us about where a data point sits with respect to the rest of the distribution.
```{r zDensityCDF,echo=FALSE,fig.cap="Density (top) and cumulative distribution (bottom) of a standard normal distribution, with cutoffs at one standard deviation above/below the mean."}
# First, create a function to generate plots of the density and CDF
dnormfun <- function(x) {
return(dnorm(x, 248))
}
plot_density_and_cdf <-
function(zcut, zmin = -4, zmax = 4, plot_cdf = TRUE, zmean = 0, zsd = 1) {
zmin <- zmin * zsd + zmean
zmax <- zmax * zsd + zmean
x <- seq(zmin, zmax, 0.1 * zsd)
zdist <- dnorm(x, mean = zmean, sd = zsd)
area <- pnorm(zcut) - pnorm(-zcut)
p2 <-
tibble(
zdist = zdist,
x = x
) %>%
ggplot(aes(x, zdist)) +
geom_line(
aes(x, zdist),
color = "red",
size = 2
) +
stat_function(
fun = dnorm, args = list(mean = zmean, sd = zsd),
xlim = c(zmean - zcut * zsd, zmean + zsd * zcut),
geom = "area", fill = "orange"
) +
stat_function(
fun = dnorm, args = list(mean = zmean, sd = zsd),
xlim = c(zmin, zmean - zcut * zsd),
geom = "area", fill = "green"
) +
stat_function(
fun = dnorm, args = list(mean = zmean, sd = zsd),
xlim = c(zmean + zcut * zsd, zmax),
geom = "area", fill = "green"
) +
annotate(
"text",
x = zmean,
y = dnorm(zmean, mean = zmean, sd = zsd) / 2,
label = sprintf("%0.1f%%", area * 100)
) +
annotate(
"text",
x = zmean - zsd * zcut - 0.5 * zsd,
y = dnorm(zmean - zcut * zsd, mean = zmean, sd = zsd) + 0.01 / zsd,
label = sprintf("%0.1f%%", pnorm(zmean - zsd * zcut, mean = zmean, sd = zsd) * 100)
) +
annotate(
"text",
x = zmean + zsd * zcut + 0.5 * zsd,
y = dnorm(zmean - zcut * zsd, mean = zmean, sd = zsd) + 0.01 / zsd,
label = sprintf("%0.1f%%", (1 - pnorm(zmean + zsd * zcut, mean = zmean, sd = zsd)) * 100)
) +
xlim(zmin, zmax) +
labs(
x = "Z score",
y = "density"
)
if (plot_cdf) {
cdf2 <-
tibble(
zdist = zdist,
x = x,
zcdf = pnorm(x, mean = zmean, sd = zsd)
) %>%
ggplot(aes(x, zcdf)) +
geom_line() +
annotate(
"segment",
x = zmin,
xend = zmean + zsd * zcut,
y = pnorm(zmean + zsd * zcut, mean = zmean, sd = zsd),
yend = pnorm(zmean + zsd * zcut, mean = zmean, sd = zsd),
color = "red",
linetype = "dashed"
) +
annotate(
"segment",
x = zmean + zsd * zcut,
xend = zmean + zsd * zcut,
y = 0, yend = pnorm(zmean + zsd * zcut, mean = zmean, sd = zsd),
color = "red",
linetype = "dashed"
) +
annotate(
"segment",
x = zmin,
xend = zmean - zcut * zsd,
y = pnorm(zmean - zcut * zsd, mean = zmean, sd = zsd),
yend = pnorm(zmean - zcut * zsd, mean = zmean, sd = zsd),
color = "blue",
linetype = "dashed"
) +
annotate(
"segment",
x = zmean - zcut * zsd,
xend = zmean - zcut * zsd,
y = 0,
yend = pnorm(zmean - zcut * zsd, mean = zmean, sd = zsd),
color = "blue",
linetype = "dashed"
) +
ylab("Cumulative density")
plot_grid(p2, cdf2, nrow = 2)
} else {
print(p2)
}
}
plot_density_and_cdf(1)
```
The upper panel in Figure \@ref(fig:zDensityCDF) shows that we expect about 16% of values to fall in $Z\ge 1$, and the same proportion to fall in $Z\le -1$.
```{r zDensity2SD,echo=FALSE,fig.cap="Density (top) and cumulative distribution (bottom) of a standard normal distribution, with cutoffs at two standard deviations above/below the mean"}
plot_density_and_cdf(2)
```
Figure \@ref(fig:zDensity2SD) shows the same plot for two standard deviations. Here we see that only about 2.3% of values fall in $Z \le -2$ and the same in $Z \ge 2$. Thus, if we know the Z-score for a particular data point, we can estimate how likely or unlikely we would be to find a value at least as extreme as that value, which lets us put values into better context.
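These tail proportions come directly from the cumulative distribution function of the standard normal distribution, which R provides as `pnorm()`; a quick sketch:
```{r eval=FALSE}
pnorm(-1)             # ~0.16: proportion of values with Z <= -1
1 - pnorm(2)          # ~0.023: proportion of values with Z >= 2
pnorm(1) - pnorm(-1)  # ~0.68: proportion falling within one standard deviation of the mean
```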
### Standardized scores
Let's say that instead of Z-scores, we wanted to generate standardized crime scores with a mean of 100 and standard deviation of 10. This is similar to the standardization that is done with scores from intelligence tests to generate the intelligence quotient (IQ). We can do this by simply multiplying the Z-scores by 10 and then adding 100.
```{r stdScores,echo=FALSE,fig.cap="Crime data presented as standardized scores with mean of 100 and standard deviation of 10.",fig.width=4,fig.height=4,out.height='50%'}
crimeData <-
crimeData %>%
mutate(
ViolentCrimeRateStdScore = (ViolentCrimeRateZscore) * 10 + 100
)
caCrimeData <-
crimeData %>%
filter(State == "California")
crimeData %>%
ggplot(aes(ViolentCrimeRateStdScore)) +
geom_histogram(binwidth = 5) +
geom_vline(xintercept = caCrimeData$ViolentCrimeRateStdScore, color = "blue") +
scale_y_continuous(breaks = seq.int(0, 13, 2)) +
annotate(
"text",
x = caCrimeData$ViolentCrimeRateStdScore,
y = 12,
label = "California",
color = "blue"
) +
labs(
x = "Standardized rate of violent crimes"
)
```
#### Using Z-scores to compare distributions
One useful application of Z-scores is to compare distributions of different variables. Let's say that we want to compare the distributions of violent crimes and property crimes across states. In the left panel of Figure \@ref(fig:crimeTypePlot) we plot those against one another, with CA plotted in blue. As you can see, the raw rates of property crimes are far higher than the raw rates of violent crimes, so we can't just compare the numbers directly. However, we can plot the Z-scores for these data against one another (right panel of Figure \@ref(fig:crimeTypePlot)) -- here again we see that the distribution of the data does not change. Having put the data into Z-scores for each variable makes them comparable, and lets us see that California is actually right in the middle of the distribution in terms of both violent crime and property crime.
```{r crimeTypePlot,echo=FALSE,fig.cap="Plot of violent vs. property crime rates (left) and Z-scored rates (right).",fig.width=8,fig.height=4,out.height='50%'}
p1 <- crimeData %>%
ggplot(aes(Violent.Crime.rate, Property.crime.rate)) +