diff --git a/00-preface.Rmd b/00-preface.Rmd index 35554226b..dc34af81f 100644 --- a/00-preface.Rmd +++ b/00-preface.Rmd @@ -3,7 +3,7 @@ \vspace{0.1in}
-`r include_image("images/logos/Rlogo.png", html_opts = "height=100px", latex_opts = "height=10%")` \hfill       `r include_image("images/logos/RStudio-Logo-Blue-Gradient.png", html_opts = "height=100px", latex_opts = "height=10%")` +`r include_image("images/logos/Rlogo.png", html_opts = "height=100px", latex_opts = "height=12%")` \hfill       `r include_image("images/logos/RStudio-Logo-Blue-Gradient.png", html_opts = "height=100px", latex_opts = "height=12%")`
**Help! I'm completely new to coding and I need to learn R and RStudio! What do I do?** @@ -18,7 +18,7 @@ If you're asking yourself this question, then you've come to the right place! St ```{r results="asis", echo=FALSE, purl=FALSE} if (is_html_output()) { - cat(glue::glue('This is version {version} of *ModernDive* published on {date}. For previous versions of *ModernDive*, see the "About this book" section below.')) + cat(glue::glue('This is version {version} of *ModernDive* published on {date}. For previous versions of *ModernDive*, see the "About this book" section.')) } ``` @@ -34,7 +34,7 @@ We present a map of your upcoming journey in Figure \@ref(fig:moderndive-figure) (ref:flowchart) *ModernDive* flowchart. -```{r moderndive-figure, fig.align="center", fig.cap="(ref:flowchart)", echo=FALSE, purl=FALSE, out.width=if(!knitr::is_latex_output()) "100%"} +```{r moderndive-figure, fig.align="center", fig.cap="(ref:flowchart)", echo=FALSE, purl=FALSE, out.width="100%"} include_graphics("images/flowcharts/flowchart/flowchart.002.png") ``` @@ -101,7 +101,7 @@ You'll commonly hear the phrase "statistically significant" thrown around in the These sub-fields are summarized in what Garrett Grolemund \index{Grolemund, Garrett} and Hadley Wickham \index{Wickham, Hadley} have previously termed the ["data/science pipeline"](http://r4ds.had.co.nz/explore-intro.html) in Figure \@ref(fig:pipeline-figure). -```{r pipeline-figure, fig.align="center", fig.cap="Data/science pipeline.", echo=FALSE, purl=FALSE, out.height="100%", out.width="100%"} +```{r pipeline-figure, fig.align="center", fig.cap="Data/science pipeline.", echo=FALSE, purl=FALSE, out.width="100%"} include_graphics("images/r4ds/data_science_pipeline.png") ``` @@ -188,7 +188,7 @@ Here are some principles and beliefs we kept in mind while writing this text. If + Computing skills are essential to working with data in the 21st century. 
Given this fact, we feel that to shield students from computing is to ultimately do them a disservice. + We are not teaching a course on coding/programming per se, but rather just enough of the computational and algorithmic thinking necessary for data analysis. 1. **Complete reproducibility and customizability** - + We are frustrated when textbooks give examples, but not the source code and the data itself. We give you the source code for all examples as well as the whole book! While we have made choices to occasionally hide the code that produces more complicated figures, reviewing the book's GitHub repository will provide you with all the code (see below). + + We are frustrated when textbooks give examples, but not the source code and the data itself. We give you the source code for all examples as well as the whole book! While we have made choices to occasionally hide the code that produces more complicated figures, reviewing the book's GitHub repository will provide you with all the code (see "About this book"). + Ultimately the best textbook is one you've written yourself. You know best your audience, their background, and their priorities. You know best your own style and the types of examples and problems you like best. Customization is the ultimate end. We encourage you to take what we've provided and make it work for your own needs. For more about how to make this book your own, see "About this book" later in this Preface. @@ -221,7 +221,7 @@ Lastly, a special shout out to any student who has ever taken a class with us at ## About this book {#about-the-book .unnumbered} -This book was written using RStudio's [bookdown](https://bookdown.org/) package\index{R packages!bookdown} by \index{Xie, Yihui}Yihui Xie[@R-bookdown]. This package simplifies the publishing of books by having all content written in \index{R Markdown} [R Markdown](http://rmarkdown.rstudio.com/html_document_format.html).
The bookdown/R Markdown source code for all versions of ModernDive is available on GitHub: +This book was written using the [bookdown](https://bookdown.org/) package\index{R packages!bookdown} by \index{Xie, Yihui}Yihui Xie[@R-bookdown]. This package simplifies book publishing by having all content written in \index{R Markdown} [R Markdown](http://rmarkdown.rstudio.com/html_document_format.html). The R Markdown source code of ModernDive is available on GitHub: * **Latest online version** The most up-to-date release: + Version `r latest_release_version` released on `r latest_release_date` ([source code](https://github.com/moderndive/moderndive_book/releases/tag/v`r latest_release_version`)) @@ -230,7 +230,7 @@ This book was written using RStudio's [bookdown](https://bookdown.org/) package\ - **Updated Datasets and Code:** Replaced datasets (`promotions`, `evals`, and `pennies`) with new ones (`un_member_states_2024`, `spotify_sample`, and `almonds_bowl`). Adopted the `nycflights23` package instead of `nycflights13` and introduced the base R pipe (`|>`) instead of the tidyverse pipe (`%>%`). Also incorporated `envoy_flights` and `early_january_2023_weather` in the `moderndive` package. - **Content Reorganization:** Restructured sections in Chapters 7 and 10 for improved readability. Moved "Model Selection" from Chapter 6 to Chapter 10 and split it into two new subsections as per suggestions. - **Enhanced Theoretical Discussions:** Improved theory-based discussions in Chapters 7, 8, 10, and 11, and added sections to better connect statistical inference based on reviewer feedback. - - **New Examples and Functions:** Introduced `coffee_quality` and `old_faithful_2024` datasets with examples in Chapter 10, added use of the `fit()` function from the `infer` package for simulation-based inference with multiple linear regression, and incorporated the `infer` package into Chapter 11. 
+ - **New Examples and Functions:** Introduced `coffee_quality` and `old_faithful_2024` datasets with examples in Chapter 10, added use of the `fit()` function from the `infer` package for simulation-based inference with multiple linear regression, and added `infer` coverage to Chapter 11. - **Code Enhancements and Clarifications:** Standardized code to use `|>`, addressed warnings for `group_by()`, and added `relocate()` to Chapter 3. - **Revamped Learning Checks:** Updated and designed new Learning checks throughout the book to better assess student understanding. * **Print first edition** The CRC Press [print edition](https://www.routledge.com/Statistical-Inference-via-Data-Science-A-ModernDive-into-R-and-the-Tidyverse/Ismay-Kim/p/book/9780367409821) of *ModernDive* corresponds to Version 1.1.0 (with some typos fixed). Available at . @@ -253,7 +253,7 @@ Finally, since this book is under a [Creative Commons Attribution - NonCommercia ## Versions of R packages used {-} -If you'd like your output on your computer to match up exactly with the output presented throughout the book, you may want to use the exact versions of the packages that we used. You can find a full listing of these packages and their versions below. This likely won't be relevant for novices, but we included it for reproducibility. +If you'd like your output on your computer to match up exactly with the output presented throughout the book, you may want to use the exact versions of the packages that we used. You can find a full listing of these packages and their versions here. This likely won't be relevant for novices, but we included it for reproducibility. ```{r message=FALSE, echo=FALSE, purl=FALSE} # Packages needed internally, but not in text. @@ -285,7 +285,7 @@ readr::read_rds("rds/package_versions.rds") |> Chester Ismay | Albert Y.
Kim | Arturo Valdivia :-------------------------:|:-------------------------:|:-------------------------: -`r include_image(path = "images/ismay2024.jpg", html_opts = "height=220px", latex_opts = "width=33%")` | `r include_image(path = "images/kim.png", html_opts = "height=220px", latex_opts = "width=33%")` | `r include_image(path = "images/valdivia.png", html_opts = "height=220px", latex_opts = "width=33%")` +`r include_image(path = "images/ismay2024.jpg", html_opts = "height=220px", latex_opts = "width=100%")` | `r include_image(path = "images/kim.png", html_opts = "height=220px", latex_opts = "width=100%")` | `r include_image(path = "images/valdivia.png", html_opts = "height=220px", latex_opts = "width=100%")` **Chester Ismay** is Vice President of Data and Automation at MATE Seminars and is a freelance data science consultant and instructor. He also teaches in the Center for Executive and Professional Education at Portland State University. He completed his PhD in statistics from Arizona State University in 2013. He has previously worked in a variety of roles including as an actuary at Scottsdale Insurance Company (now Nationwide E&S/Specialty) and at Ripon College, Reed College, and Pacific University. He has experience working in online education and was previously a Data Science Evangelist at DataRobot, where he led data science, machine learning, and data engineering in-person and virtual workshops for DataRobot University. In addition to his work for *ModernDive*, he also contributed as initial developer of the [`infer`](https://cran.r-project.org/package=infer) R package and is author and maintainer of the [`thesisdown`](https://github.com/ismayc/thesisdown) R package. diff --git a/01-getting-started.Rmd b/01-getting-started.Rmd index d3d6ebdbb..eaca40f76 100644 --- a/01-getting-started.Rmd +++ b/01-getting-started.Rmd @@ -34,7 +34,7 @@ Before we can start exploring data in R, there are some key concepts to understa 2. How do I code in R? 3. 
What are R packages? -We'll introduce these concepts in the upcoming Sections \@ref(r-rstudio)-\@ref(packages). If you are already somewhat familiar with these concepts, feel free to skip to Section \@ref(nycflights) where we'll introduce our first dataset: all domestic flights departing one of the three main New York City (NYC) airports in 2023. This is a dataset we will explore in depth for much of the rest of this book. +We'll introduce these concepts in the upcoming Sections \@ref(r-rstudio)--\@ref(packages). If you are already somewhat familiar with these concepts, feel free to skip to Section \@ref(nycflights) where we'll introduce our first dataset: all domestic flights departing one of the three main New York City (NYC) airports in 2023. We will explore this dataset in depth for much of the rest of this book. @@ -52,21 +52,29 @@ R: Engine | RStudio: Dashboard include_graphics("images/shutterstock/R_vs_RStudio_1.png") ``` -More precisely, R is a programming language that runs computations, while RStudio is an *integrated development environment (IDE)* that provides an interface by adding many convenient features and tools. So just as the way of having access to a speedometer, rear-view mirrors, and a navigation system makes driving much easier, using RStudio's interface makes using R much easier as well. +More precisely, R is a programming language that runs computations, whereas +```{r results="asis", echo=FALSE, purl=FALSE} +if(knitr::is_latex_output()) { + cat('\\mbox{RStudio}') +} else { + cat('RStudio') +} +``` +is an *integrated development environment (IDE)* that provides an interface by adding many convenient features and tools. So just as having access to a speedometer, rear-view mirrors, and a navigation system makes driving much easier, using RStudio's interface makes using R much easier as well.
### Installing R and RStudio {#installing} -> **Note about RStudio Server or RStudio Cloud**: If your instructor has provided you with a link and access to RStudio Server or RStudio Cloud, then you can skip this section. We do recommend after a few months of working on RStudio Server/Cloud that you return to these instructions to install this software on your own computer though. +> **Note about RStudio Server or Posit (formerly RStudio) Cloud**: If your instructor has provided you with a link and access to RStudio Server or Posit Cloud, then you can skip this section. We do recommend after a few months of working on RStudio Server/Posit Cloud that you return to these instructions to install this software on your own computer though. You will first need to download and install both R and RStudio (Desktop version) on your computer. It is important that you install R first and then install RStudio. 1. **You must do this first:** Download and install R by going to . \index{R!installation} - + If you are a Windows user: Click on "Download R for Windows", then click on "base", then click on the Download link. - + If you are macOS user: Click on "Download R for macOS", then under "Latest release:" click on R-X.X.X.pkg, where R-X.X.X is the version number. For example, the latest version of R as of May 24, 2024 was R-4.4.0. + + If you are a Windows user: Click on "Download R for Windows," then click on "base," then click on the Download link. + + If you are a macOS user: Click on "Download R for macOS," then under "Latest release:" click on R-X.X.X.pkg, where R-X.X.X is the version number. For example, the latest version of R as of December 4, 2024 was R-4.4.2. + If you are a Linux user: Click on "Download R for Linux" and choose your distribution for more information on installing R for your setup. -1. **You must do this second:** Download and install RStudio at . - + Scroll down to "Installers for Supported Platforms" near the bottom of the page. +1.
**You must do this second:** Download and install RStudio at . + + Scroll down to "All Installers and Tarballs" near the bottom of the page. + Click on the download link corresponding to your computer's operating system. \index{RStudio!installation} @@ -98,7 +106,7 @@ Note the three *panes* which are three panels dividing the screen: the *console ## How do I code in R? {#code} -Now that you're set up with R and RStudio, you are probably asking yourself, "OK. Now how do I use R?". The first thing to note is that unlike other statistical software programs like Excel, SPSS, or Minitab that provide [point-and-click](https://en.wikipedia.org/wiki/Point_and_click) interfaces, R is an [interpreted language](https://en.wikipedia.org/wiki/Interpreted_language). This means you have to type in commands written in *R code*. In other words, you have to code/program in R. Note that we'll use the terms "coding" and "programming" interchangeably in this book. +Now that you're set up with R and RStudio, you are probably asking yourself, "OK. Now how do I use R?". The first thing to note is that unlike other statistical software programs like Excel, SPSS, or Minitab that provide [point-and-click](https://en.wikipedia.org/wiki/Point_and_click) interfaces, R is an [interpreted language](https://en.wikipedia.org/wiki/Interpreted_language). This means you have to type in commands written in *R code*. In other words, you have to code/program in R. Note that we'll use the terms "coding" and "programming" interchangeably in this book. While it is not required to be a seasoned coder/computer programmer to use R, there is still a set of basic programming concepts that new R users need to understand. Consequently, while this book is not a book on programming, you will still learn just enough of these basic programming concepts needed to explore and analyze data effectively. @@ -111,7 +119,7 @@ We now introduce some basic programming concepts and terminology.
Instead of ask + *Console pane*: where you enter in commands. \index{console} + *Running code*: the act of telling R to perform an act by giving it commands in the console. + *Objects*: where values are saved in R. We'll show you how to *assign* values to objects and how to display the contents of objects. \index{objects} - + *Data types*: integers, doubles/numerics, logicals, and characters. \index{data types} Integers are values like -1, 0, 2, 4092. Doubles or numerics are a larger set of values containing both the integers but also fractions and decimal values like -24.932 and 0.8. Logicals are either `TRUE` or `FALSE` while characters are text such as "cabbage", "Hamilton", "The Wire is the greatest TV show ever", and "This ramen is delicious." Note that characters are often denoted with the quotation marks around them. + + *Data types*: integers, doubles/numerics, logicals, and characters. \index{data types} Integers are values like -1, 0, 2, 4092. Doubles or numerics are a larger set of values containing both the integers but also fractions and decimal values like -24.932 and 0.8. Logicals are either `TRUE` or `FALSE` while characters are text such as "cabbage," "Hamilton," "The Wire is the greatest TV show ever," and "This ramen is delicious." Note that characters are often denoted with the quotation marks around them. * *Vectors*: a series of values. These are created using the `c()` function, where `c()` stands for "combine" or "concatenate." For example, `c(6, 11, 13, 31, 90, 92)` creates a six element series of positive integer values \index{vectors}. * *Factors*: *categorical data* are commonly represented in R as factors. \index{factors} Categorical data can also be represented as *strings*. We'll study this difference as we progress through the book. * *Data frames*: rectangular spreadsheets. They are representations of datasets in R where the rows correspond to *observations* and the columns correspond to *variables* that describe the observations. 
\index{data frame} We'll cover data frames later in Section \@ref(nycflights). @@ -132,14 +140,14 @@ One thing that intimidates new R and RStudio users is how it reports *errors*, * R will show red text in the console pane in three different situations: -* **Errors**: \index{R!errors} When the red text is a legitimate error, it will be prefaced with "Error in…" and will try to explain what went wrong. Generally when there's an error, the code will not run. For example, we'll see in Subsection \@ref(package-use) if you see `Error in ggplot(...) : could not find function "ggplot"`, it means that the `ggplot()` function is not accessible because the package that contains the function (`ggplot2`) was not loaded with `library(ggplot2)`. Thus you cannot use the `ggplot()` function without the `ggplot2` package being loaded first. -* **Warnings**: \index{R!warnings} When the red text is a warning, it will be prefaced with "Warning:" and R will try to explain why there's a warning. Generally your code will still work, but with some caveats. For example, you will see in Chapter \@ref(viz) if you create a scatterplot based on a dataset where two of the rows of data have missing entries that would be needed to create points in the scatterplot, you will see this warning: `Warning: Removed 2 rows containing missing values (geom_point)`. R will still produce the scatterplot with all the remaining non-missing values, but it is warning you that two of the points aren't there. -* **Messages**: \index{R!messages} When the red text doesn't start with either "Error" or "Warning", it's *just a friendly message*. You'll see these messages when you load *R packages* in the upcoming Subsection \@ref(package-loading) or when you read data saved in spreadsheet files with the `read_csv()` function as you'll see in Chapter \@ref(tidy). These are helpful diagnostic messages and they don't stop your code from working. 
Additionally, you'll see these messages when you install packages too using `install.packages()` as discussed in Subsection \@ref(package-installation). +* **Errors**: \index{R!errors} when the red text is a legitimate error, it will be prefaced with "Error in…" and will try to explain what went wrong. Generally when there's an error, the code will not run. For example, we'll see in Subsection \@ref(package-use) if you see `Error in ggplot(...) : could not find function "ggplot"`, it means that the `ggplot()` function is not accessible because the package that contains the function (`ggplot2`) was not loaded with `library(ggplot2)`. Thus you cannot use the `ggplot()` function without the `ggplot2` package being loaded first. +* **Warnings**: \index{R!warnings} when the red text is a warning, it will be prefaced with "Warning:" and R will try to explain why there's a warning. Generally your code will still work, but with some caveats. For example, you will see in Chapter \@ref(viz) if you create a scatterplot based on a dataset where two of the rows of data have missing entries that would be needed to create points in the scatterplot, you will see this warning: `Warning: Removed 2 rows containing missing values (geom_point)`. R will still produce the scatterplot with all the remaining non-missing values, but it is warning you that two of the points aren't there. +* **Messages**: \index{R!messages} when the red text doesn't start with either "Error" or "Warning," it's *just a friendly message*. You'll see these messages when you load *R packages* in the upcoming Subsection \@ref(package-loading) or when you read data saved in spreadsheet files with the `read_csv()` function as you'll see in Chapter \@ref(tidy). These are helpful diagnostic messages and they don't stop your code from working. Additionally, you'll see these messages when you install packages too using `install.packages()` as discussed in Subsection \@ref(package-installation). 
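To make the three categories concrete, here is a small, hypothetical console session (the function calls are illustrative, and the exact wording of each red-text message varies across R and package versions):

```r
# Error: red traffic light -- the code does not run.
# Calling ggplot() before loading the ggplot2 package fails:
ggplot(data = mtcars, mapping = aes(x = wt, y = mpg))
#> Error in ggplot(...) : could not find function "ggplot"

# Warning: yellow traffic light -- the code still runs, but with a caveat:
log(-1)
#> Warning message: NaNs produced
#> [1] NaN

# Message: green traffic light -- just a friendly diagnostic, nothing is wrong:
library(dplyr)
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats': filter, lag
```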
Remember, when you see red text in the console, *don't panic*. It doesn't necessarily mean anything is wrong. Rather: -* If the text starts with "Error", figure out what's causing it. Think of errors as a red traffic light: something is wrong! -* If the text starts with "Warning", figure out if it's something to worry about. For instance, if you get a warning about missing values in a scatterplot and you know there are missing values, you're fine. If that's surprising, look at your data and see what's missing. Think of warnings as a yellow traffic light: everything is working fine, but watch out/pay attention. +* If the text starts with "Error," figure out what's causing it. Think of errors as a red traffic light: something is wrong! +* If the text starts with "Warning," figure out if it's something to worry about. For instance, if you get a warning about missing values in a scatterplot and you know there are missing values, you're fine. If that's surprising, look at your data and see what's missing. Think of warnings as a yellow traffic light: everything is working fine, but watch out/pay attention. * Otherwise, the text is just a message. Read it, wave back at R, and thank it for talking to you. Think of messages as a green traffic light: everything is working fine and keep on going! @@ -164,7 +172,7 @@ Another point of confusion with many new R users is the idea of an R package. R For example, among the many packages we will use in this book are the `ggplot2` package [@R-ggplot2] for data visualization in Chapter \@ref(viz)\index{R packages!ggplot2}, the `dplyr` package [@R-dplyr] for data wrangling in Chapter \@ref(wrangling)\index{R packages!dplyr}, the `moderndive` package [@R-moderndive] that accompanies this book\index{R packages!moderndive}, and the `infer` package [@R-infer] for "tidy" and transparent statistical inference in Chapters \@ref(confidence-intervals), \@ref(hypothesis-testing), and \@ref(inference-for-regression)\index{R packages!infer}. 
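As a minimal sketch of the typical workflow with packages like these (covered in detail in the subsections that follow): a package is installed once per computer, but loaded in every R session that uses it.

```r
# Install once per computer (downloads the package from CRAN):
install.packages("ggplot2")
install.packages("dplyr")

# Load at the start of every R session in which you want to use the package:
library(ggplot2)
library(dplyr)
```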
-A good analogy for R packages \index{R packages} is they are like apps you can download onto a mobile phone: +A good analogy for R packages \index{R packages} is they are like apps you can download onto a mobile phone, as shown in Figure \@ref(fig:R-vs-R-packages): + +``` +Warning: Removed 3 rows containing missing values or values outside the scale range +(`geom_point()`). +``` + Let's first unpack the graphic in Figure \@ref(fig:noalpha). Observe that a *positive relationship* exists between `dep_delay` and `arr_delay`: as departure delays increase, arrival delays tend to also increase. Observe also the large mass of points clustered near (0, 0), the point indicating flights that neither departed nor arrived late. Let's turn our attention to the warning message. R is alerting us to the fact that three rows were ignored due to missing values. For these three rows, either the value for `dep_delay` or `arr_delay` or both were missing (recorded in R as `NA`), and thus these rows were ignored in our plot. @@ -314,7 +321,7 @@ The large mass of points near (0, 0) in Figure \@ref(fig:noalpha) can cause some confusion since it is hard to tell the true number of points that are plotted. This is the result of a phenomenon called \index{overplotting} *overplotting*. As one may guess, this corresponds to points being plotted on top of each other over and over again. When overplotting occurs, it is difficult to know the number of points being plotted. There are two methods to address the issue of overplotting. Either by 1. Adjusting the transparency of the points or -1. Adding a little random "jitter", or random "nudges", to each of the points. +1. Adding a little random "jitter" (or random "nudges") to each of the points.
**Method 1: Changing the transparency** @@ -327,7 +334,7 @@ ggplot(data = envoy_flights, mapping = aes(x = dep_delay, y = arr_delay)) + geom_point(alpha = 0.2) ``` -The key feature to note in Figure \@ref(fig:alpha) is that the transparency \index{R packages!ggplot2!alpha}\index{adding transparency to plots}of the points is cumulative: areas with a high-degree of overplotting are darker, whereas areas with a lower degree are less dark. Note furthermore that there is no `aes()` surrounding `alpha = 0.2`. This is because we are not mapping a variable to an aesthetic attribute, but rather merely changing the default setting of `alpha`. In fact, you'll receive an error if you try to change the second line to read `geom_point(aes(alpha = 0.2))`. +The key feature to note in Figure \@ref(fig:alpha) is that the transparency \index{R packages!ggplot2!alpha}\index{adding transparency to plots} of the points is cumulative: areas with a high degree of overplotting are darker, whereas areas with a lower degree are less dark. Note, furthermore, that there is no `aes()` surrounding `alpha = 0.2`. This is because we are not mapping a variable to an aesthetic attribute, but rather merely changing the default setting of `alpha`. In fact, you'll receive an error if you try to change the second line to read `geom_point(aes(alpha = 0.2))`. **Method 2: Jittering the points** @@ -396,7 +403,7 @@ The next of the five named graphs are linegraphs. Linegraphs \index{linegraphs} The most common examples of linegraphs have some notion of time on the x-axis: hours, days, weeks, years, etc. Since time is sequential, we connect consecutive observations of the variable on the y-axis with a line. Linegraphs that have some notion of time on the x-axis are also called *time series* plots\index{time series plots}. Let's illustrate linegraphs using another dataset in the `nycflights23` \index{R packages!nycflights23} package: the `weather` data frame.
-Let's explore the `weather` data frame from the `nycflights23` package by running `View(weather)` and `glimpse(weather)`. Furthermore let's read the associated help file by running `?weather` to bring up the help file. +Let's explore the `weather` data frame from the `nycflights23` package by running `View(weather)` and `glimpse(weather)`. Furthermore, let's read the associated help file by running `?weather`. Observe that there is a variable called `wind_speed` containing hourly wind speed recordings in miles per hour at weather stations near all three major airports in New York City: Newark (`origin` code `EWR`), John F. Kennedy International (`JFK`), and LaGuardia (`LGA`). @@ -420,7 +427,7 @@ However, instead of considering hourly wind speeds for all days in 2023 for all ### Linegraphs via `geom_line` {#geomline} -Let's create a time series plot of the hourly wind speeds saved in the `early_january_2023_weather` data frame by using `geom_line()` to create a linegraph\index{R packages!ggplot2!geom\_line()}, instead of using `geom_point()` like we used previously to create scatterplots: +Let's create a time series plot (as seen in Figure \@ref(fig:hourlytemp)) of the hourly wind speeds saved in the `early_january_2023_weather` data frame by using `geom_line()` to create a linegraph\index{R packages!ggplot2!geom\_line()}, instead of using `geom_point()` like we used previously to create scatterplots: ```{r hourlytemp, fig.cap="Hourly wind speed in Newark for January 1-15, 2023."} ggplot(data = early_january_2023_weather,
Furthermore, the geometric object layer is now a `geom_histogram()`. \index{R packages!ggplot2!geom\_histogram()} After running the following code, you'll see the histogram in Figure \@ref(fig:weather-histogram) as well as warning messages. We'll discuss the warning messages first. -```{r weather-histogram, warning=TRUE, fig.cap="Histogram of hourly wind speeds at three NYC airports.", fig.height=ifelse(knitr::is_latex_output(), 2.3, 4)} +```{r weather-histogram, fig.cap="Histogram of hourly wind speeds at three NYC airports.", fig.height=ifelse(knitr::is_latex_output(), 2.3, 4), warning=FALSE, message=FALSE} ggplot(data = weather, mapping = aes(x = wind_speed)) + geom_histogram() ``` +``` +`stat_bin()` using `bins = 30`. Pick better value with `binwidth`. +``` + +``` +Warning: Removed 1033 rows containing non-finite outside the scale range (`stat_bin()`). +``` + The first message is telling us that the histogram was constructed using `bins = 30` for 30 equally spaced bins. This is known in computer programming as a default value; unless you override this default number of bins with a number you specify, R will choose 30 by default. We'll see in the next section how to change the number of bins to a value other than the default. -The second message is telling us something similar to the warning message we received when we ran the code to create a scatterplot of departure and arrival delays for Envoy Air flights in Figure \@ref(fig:noalpha): that because some rows have missing `NA` value for `wind_speed`, they were omitted from the histogram. R is just giving us a friendly heads-up that this was the case. +The second warning message is telling us something similar to the warning message we received when we ran the code to create a scatterplot of departure and arrival delays for Envoy Air flights in Figure \@ref(fig:noalpha): that because some rows have a missing `NA` value for `wind_speed`, they were omitted from the histogram.
R is just giving us a friendly heads-up that this was the case. Now let's unpack the resulting histogram in Figure \@ref(fig:weather-histogram). Observe that values above 30 mph are rather rare. However, because of the large number of bins, it's hard to get a sense for which range of wind speeds is spanned by each bin; everything is one giant amorphous blob. So let's add white vertical borders demarcating the bins by adding a `color = "white"` argument to `geom_histogram()` and ignore the warning about setting the number of bins to a better value: @@ -744,7 +759,7 @@ if (!file.exists("images/patchwork_boxplot.png")) { } ``` -```{r apr2, echo=FALSE, fig.cap="Building up a boxplot of April wind speeds.", out.width="70%", out.height="70%", purl=FALSE} +```{r apr2, echo=FALSE, fig.cap="Building up a boxplot of April wind speeds.", out.width="80%", purl=FALSE} # Display the three plots side-by-side knitr::include_graphics("images/patchwork_boxplot.png") ``` @@ -782,7 +797,7 @@ Warning message: 1: Continuous x aesthetic -- did you forget aes(group=...)? ``` -Observe in Figure \@ref(fig:badbox) that this plot does not provide information about wind speed separated by month. The first warning message tells us why. It says that we have a "continuous", or numerical variable, on the x-position aesthetic. Boxplots, however, require a categorical variable to be mapped to the x-position aesthetic. +Observe in Figure \@ref(fig:badbox) that this plot does not provide information about wind speed separated by month. The first warning message tells us why. It says that we have a "continuous" (or numerical) variable on the x-position aesthetic. Boxplots, however, require a categorical variable to be mapped to the x-position aesthetic. We can convert the numerical variable `month` into a `factor` categorical variable by using the `factor()` \index{factors} function.
After applying `factor(month)`, `month` goes from having just the numerical values 1, 2, ..., and 12 to having an associated ordering. With this ordering, `ggplot()` now knows how to work with this variable to produce the plot.

@@ -795,7 +810,7 @@ The resulting Figure \@ref(fig:monthtempbox) shows 12 separate "box and whiskers

* The "box" portions of the visualization represent the 1st quartile, the median (the 2nd quartile), and the 3rd quartile.
* The height of each box (the value of the 3rd quartile minus the value of the 1st quartile) is the interquartile range (IQR). It is a measure of the spread of the middle 50% of values, with longer boxes indicating more variability.
-* The "whisker" portions of these plots extend out from the bottoms and tops of the boxes and represent points less than the 25th percentile and greater than the 75th percentiles, respectively. They're set to extend out no more than $1.5 \times IQR$ units away from either end of the boxes. We say "no more than" because the ends of the whiskers have to correspond to observed wind speeds. The length of these whiskers show how the data outside the middle 50% of values vary, with longer whiskers indicating more variability.
+* The "whisker" portions of these plots extend out from the bottoms and tops of the boxes and represent points less than the 25th percentile and greater than the 75th percentile, respectively. They're set to extend out no more than $1.5 \times IQR$ units away from either end of the boxes. We say "no more than" because the ends of the whiskers have to correspond to observed wind speeds. The length of these whiskers shows how the data outside the middle 50% of values vary, with longer whiskers indicating more variability.
* The dots representing values falling outside the whiskers are called \index{outliers} *outliers*. These can be thought of as anomalous ("out-of-the-ordinary") values.
It is important to keep in mind that the definition of an outlier is somewhat arbitrary and not absolute. In this case, outliers are defined by the length of the whiskers, which are no more than $1.5 \times IQR$ units long for each boxplot.

Looking at this side-by-side plot, we can see that the months of February and March have higher median wind speeds as evidenced by the higher solid lines in the middle of the boxes. We can easily compare wind speeds across months by drawing imaginary horizontal lines across the plot. Furthermore, the heights of the 12 boxes as quantified by the interquartile ranges are informative too; they tell us about variability, or spread, of wind speeds recorded in a given month.

@@ -862,14 +877,14 @@ Depending on how your categorical data is represented, you'll need to add a diff

Let's generate barplots using these two different representations of the same basket of fruit: 3 apples and 2 oranges. Using the `fruits` data frame where all 5 fruits are listed individually in 5 rows, we map the `fruit` variable to the x-position aesthetic and add a \index{R packages!ggplot2!geom\_bar()} `geom_bar()` layer:

-```{r geombar, fig.cap="Barplot when counts are not pre-counted.", fig.height=ifelse(knitr::is_latex_output(), 1.8, 4)}
+```{r geombar, fig.cap="Barplot when counts are not pre-counted.", fig.height=ifelse(knitr::is_latex_output(), 1.3, 4)}
ggplot(data = fruits, mapping = aes(x = fruit)) +
  geom_bar()
```

-However, using the `fruits_counted` data frame where the fruits have been "pre-counted", we once again map the `fruit` variable to the x-position aesthetic, but here we also map the `count` variable to the y-position aesthetic, and add a \index{R packages!ggplot2!geom\_col()} `geom_col()` layer instead.
+However, using the `fruits_counted` data frame where the fruits have been "pre-counted," we once again map the `fruit` variable to the x-position aesthetic.
Here, we also map the `number` variable to the y-position aesthetic, and add a \index{R packages!ggplot2!geom\_col()} `geom_col()` layer instead.

-```{r, geomcol, fig.cap="Barplot when counts are pre-counted.", fig.height=ifelse(knitr::is_latex_output(), 1.8, 4)}
+```{r, geomcol, fig.cap="Barplot when counts are pre-counted.", fig.height=ifelse(knitr::is_latex_output(), 1.3, 4)}
ggplot(data = fruits_counted, mapping = aes(x = fruit, y = number)) +
  geom_col()
```

@@ -883,7 +898,7 @@ Let's now go back to the `flights` data frame in the `nycflights23` package and

(ref:geombar) Number of flights departing NYC in 2023 by airline using `geom_bar()`.

-```{r flightsbar, fig.cap="(ref:geombar)", fig.height=ifelse(knitr::is_latex_output(), 1.4, 4)}
+```{r flightsbar, fig.cap="(ref:geombar)", fig.height=ifelse(knitr::is_latex_output(), 3, 4)}
ggplot(data = flights, mapping = aes(x = carrier)) +
  geom_bar()
```

@@ -941,7 +956,7 @@ Let's examine the same data used in our previous barplot of the number of flight

* What is the third largest carrier in terms of departing flights?
* How many carriers have fewer flights than Delta Air Lines Inc. (`DL`)?

-```{r carrierpie, echo=FALSE, fig.cap="The dreaded pie chart.", fig.height=ifelse(knitr::is_latex_output(), 4.8, 5), purl=FALSE}
+```{r carrierpie, echo=FALSE, fig.cap="The dreaded pie chart.", fig.height=ifelse(knitr::is_latex_output(), 6, 5), purl=FALSE}
if (is_html_output()) {
  ggplot(flights, mapping = aes(x = factor(1), fill = carrier)) +
    geom_bar(width = 1) +

@@ -995,7 +1010,7 @@ While it is quite difficult to answer these questions when looking at the pie ch

### Two categorical variables {#two-categ-barplot}

-Barplots are a very common way to visualize the frequency of different categories, or levels, of a single categorical variable. Another use of barplots is to visualize the *joint* distribution of two categorical variables at the same time.
Let's examine the *joint* distribution of outgoing domestic flights from NYC by `carrier` as well as `origin`, in other words, the number of flights for each `carrier` and `origin` combination. This corresponds to the number of American Airlines flights from `JFK`, the number of American Airlines flights from `LGA`, the number of American Airlines flights from `EWR`, the number of Endeavor Air flights from `JFK`, and so on. Recall the `ggplot()` code that created the barplot of `carrier` frequency in Figure \@ref(fig:flightsbar): +Barplots are a very common way to visualize the frequency of different categories, or levels, of a single categorical variable. Another use of barplots is to visualize the *joint* distribution of two categorical variables at the same time. + +Let's examine the *joint* distribution of outgoing domestic flights from NYC by `carrier` as well as `origin`, in other words, the number of flights for each `carrier` and `origin` combination. This corresponds to the number of American Airlines flights from `JFK`, the number of American Airlines flights from `LGA`, the number of American Airlines flights from `EWR`, the number of Endeavor Air flights from `JFK`, and so on. 
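One quick way to see these joint counts as numbers, before plotting them, is a sketch using base R's `table()` to cross-tabulate the two variables (assuming `flights` from the `nycflights23` package is loaded):

```{r eval=FALSE}
# Cross-tabulate the two categorical variables:
# one row per carrier, one column per origin airport, each cell a flight count
table(flights$carrier, flights$origin)
```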
Recall the `ggplot()` code that created the barplot of `carrier` frequency in Figure \@ref(fig:flightsbar): ```{r, eval=FALSE} ggplot(data = flights, mapping = aes(x = carrier)) + @@ -1009,7 +1026,7 @@ ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) + geom_bar() ``` -```{r flights-stacked-bar, echo=FALSE, fig.cap="Stacked barplot of flight amount by carrier and origin.", fig.height=ifelse(knitr::is_latex_output(), 2.8, 4), purl=FALSE} +```{r flights-stacked-bar, echo=FALSE, fig.cap="Stacked barplot of flight amount by carrier and origin.", fig.height=ifelse(knitr::is_latex_output(), 2.2, 4), purl=FALSE} if (is_html_output()) { ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) + geom_bar() @@ -1065,7 +1082,7 @@ if (is_html_output()) { } ``` -Lastly, another type of barplot is a \index{barplot!faceted} *faceted barplot*. Recall in Section \@ref(facets) we visualized the distribution of hourly wind speeds at the 3 NYC airports *split* by month using facets. We apply the same principle to our barplot visualizing the frequency of `carrier` split by `origin`: instead of mapping `origin` to `fill` we include it as the variable to create small multiples of the plot across the levels of `origin`. +Lastly, another type of barplot is a \index{barplot!faceted} *faceted barplot*. Recall in Section \@ref(facets) we visualized the distribution of hourly wind speeds at the 3 NYC airports *split* by month using facets. We apply the same principle to our barplot visualizing the frequency of `carrier` split by `origin`. Instead of mapping `origin` to `fill` we include it as the variable to create small multiples of the plot across the levels of `origin` in Figure \@ref(fig:facet-bar-vert). 
```{r eval=FALSE} ggplot(data = flights, mapping = aes(x = carrier)) + @@ -1097,7 +1114,7 @@ if (is_latex_output()) { **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What kinds of questions are not easily answered by looking at Figure \@ref(fig:flights-stacked-bar)? -**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What can you say, if anything, about the relationship between airline and airport in NYC in 2023 in regards to the number of departing flights? +**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What can you say, if anything, about the relationship between airline and airport in NYC in 2023 in regard to the number of departing flights? **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Why might the side-by-side barplot be preferable to a stacked barplot in this case? @@ -1198,7 +1215,7 @@ if (is_latex_output()) { generate_r_file_link("02-visualization.R") ``` -If you want to further unlock the power of the `ggplot2` package for data visualization, we suggest that you check out RStudio's "Data Visualization with ggplot2" cheatsheet. This cheatsheet summarizes much more than what we've discussed in this chapter. In particular, it presents many more than the 5 `geom`etric objects we covered in this chapter while providing quick and easy to read visual descriptions. For all the `geom`etric objects, it also lists all the possible aesthetic attributes one can tweak. In the current version of RStudio in mid 2024, you can access this cheatsheet by going to the RStudio Menu Bar -> Help -> Cheatsheets -> "Data Visualization with ggplot2." `r if(is_html_output()) "You can see a preview in the figure below."` Alternatively, you can preview the cheat sheet by going to the ggplot2 Github page with this [link](https://github.com/rstudio/cheatsheets/blob/main/data-visualization-2.1.pdf). 
+If you want to further unlock the power of the `ggplot2` package for data visualization, we suggest that you check out RStudio's "Data Visualization with ggplot2" cheatsheet. This cheatsheet summarizes much more than what we've discussed in this chapter. In particular, it presents many more than the 5 `geom`etric objects we covered in this chapter while providing quick and easy-to-read visual descriptions. For all the `geom`etric objects, it also lists all the possible aesthetic attributes one can tweak. In the current version of RStudio in mid 2024, you can access this cheatsheet by going to the RStudio Menu Bar -> Help -> Cheatsheets -> "Data Visualization with ggplot2." `r if(is_html_output()) "You can see a preview in the figure below."` Alternatively, you can preview the cheatsheet by going to the ggplot2 GitHub page with this [link](https://github.com/rstudio/cheatsheets/blob/main/data-visualization-2.1.pdf).

```{r ggplot-cheatsheet, echo=FALSE, fig.cap="Data Visualization with ggplot2 cheatsheet.", purl=FALSE, out.width=if(!knitr::is_latex_output()) "100%"}
if (is_html_output()) {

@@ -1250,4 +1267,4 @@ ggplot(data = early_january_2023_weather, mapping = aes(x = time_hour, y = temp)

This code pares down the `weather` data frame to a new data frame `early_january_2023_weather` consisting of hourly wind speed recordings only for `origin == "EWR"`, `month == 1`, and day less than or equal to `15`.

-These two code segments are a preview of Chapter \@ref(wrangling) on data wrangling using the `dplyr` package. Data wrangling is the process of transforming and modifying existing data with the intent of making it more appropriate for analysis purposes. For example, these two code segments used the `filter()` function to create new data frames (`envoy_flights` and `early_january_2023_weather`) by choosing only a subset of rows of existing data frames (`flights` and `weather`).
In the next chapter, we'll formally introduce the `filter()` and other data wrangling functions as well as the *pipe operator* ` |>` which allows you to combine multiple data wrangling actions into a single sequential *chain* of actions. On to Chapter \@ref(wrangling) on data wrangling!
+These two code segments are a preview of Chapter \@ref(wrangling) on data wrangling using the `dplyr` package. Data wrangling is the process of transforming and modifying existing data with the intent of making it more appropriate for analysis purposes. For example, these two code segments used the `filter()` function to create new data frames (`envoy_flights` and `early_january_2023_weather`) by choosing only a subset of rows of existing data frames (`flights` and `weather`). In the next chapter, we'll formally introduce the `filter()` and other data-wrangling functions as well as the *pipe operator* `|>`, which allows you to combine multiple data-wrangling actions into a single sequential *chain* of actions. On to Chapter \@ref(wrangling) on data wrangling!

diff --git a/03-wrangling.Rmd b/03-wrangling.Rmd
index 5b98773f9..4343bfe5f 100755
--- a/03-wrangling.Rmd
+++ b/03-wrangling.Rmd
@@ -28,7 +28,7 @@ options(knitr.kable.NA = "")
set.seed(76)
```

-So far in our journey, we've seen how to look at data saved in data frames using the `glimpse()` and `View()` functions in Chapter \@ref(getting-started), and how to create data visualizations using the `ggplot2` package in Chapter \@ref(viz). In particular we studied what we term the "five named graphs" (5NG):
+So far in our journey, we've seen how to look at data saved in data frames using the `glimpse()` and `View()` functions in Chapter \@ref(getting-started), and how to create data visualizations using the `ggplot2` package in Chapter \@ref(viz). In particular, we studied what we term the "five named graphs" (5NG):

1. scatterplots via `geom_point()`
1.
linegraphs via `geom_line()` @@ -49,7 +49,7 @@ In this chapter, we'll introduce a series of functions from the `dplyr` package Notice how we used `computer_code` font to describe the actions we want to take on our data frames. This is because the `dplyr` package for data wrangling has intuitively verb-named functions that are easy to remember. -There is a further benefit to learning to use the `dplyr` package for data wrangling: its similarity to the database querying language [SQL](https://en.wikipedia.org/wiki/SQL) (pronounced "sequel" or spelled out as "S", "Q", "L"). SQL (which stands for "Structured Query Language") is used to manage large databases quickly and efficiently and is widely used by many institutions with a lot of data. While SQL is a topic left for a book or a course on database management, keep in mind that once you learn `dplyr`, you can learn SQL easily. We'll talk more about their similarities in Subsection \@ref(normal-forms). +There is a further benefit to learning to use the `dplyr` package for data wrangling: its similarity to the database querying language [SQL](https://en.wikipedia.org/wiki/SQL) (pronounced "sequel" or spelled out as "S-Q-L"). SQL (which stands for "Structured Query Language") is used to manage large databases quickly and efficiently and is widely used by many institutions with a lot of data. While SQL is a topic left for a book or a course on database management, keep in mind that once you learn `dplyr`, you can learn SQL easily. We'll talk more about their similarities in Subsection \@ref(normal-forms). ### Needed packages {-#wrangling-packages} @@ -112,7 +112,7 @@ You would read this sequence as: So while both approaches achieve the same goal, the latter is much more human-readable because you can clearly read the sequence of operations line-by-line. But what are the hypothetical `x`, `f()`, `g()`, and `h()`? Throughout this chapter on data wrangling: 1. The starting value `x` will be a data frame. 
For example, the \index{R packages!nycflights23} `flights` data frame we explored in Section \@ref(nycflights).
-1. The sequence of functions, here `f()`, `g()`, and `h()`, will mostly be a sequence of any number of the six data wrangling verb-named functions we listed in the introduction to this chapter. For example, the `filter(carrier == "MQ")` function and argument specified we previewed earlier.
+1. The sequence of functions, here `f()`, `g()`, and `h()`, will mostly be a sequence of any number of the six data-wrangling verb-named functions we listed in the introduction to this chapter. For example, the `filter(carrier == "MQ")` function call we previewed earlier.
1. The result will be the transformed/modified data frame that you want. In our example, we'll save the result in a new data frame by using the `<-` assignment operator with the name `envoy_flights` via `envoy_flights <-`.

```{r, eval=FALSE}
envoy_flights <- flights |>

@@ -122,7 +122,7 @@

Much like when adding layers to a `ggplot()` using the `+` sign, you form a single *chain* of data wrangling operations by combining verb-named functions into a single sequence using the pipe operator `|>`. Furthermore, much like how the `+` sign has to come at the end of lines when constructing plots, the pipe operator `|>` has to come at the end of lines as well.

-Keep in mind, there are many more advanced data wrangling functions than just the six listed in the introduction to this chapter; you'll see some examples of these in Section \@ref(other-verbs). However, just with these six verb-named functions you'll be able to perform a broad array of data wrangling tasks for the rest of this book.
+Keep in mind, there are many more advanced data-wrangling functions than just the six listed in the introduction to this chapter; you'll see some examples of these in Section \@ref(other-verbs).
However, just with these six verb-named functions you'll be able to perform a broad array of data-wrangling tasks for the rest of this book.

@@ -130,7 +130,7 @@ Keep in mind, there are many more advanced data wrangling functions than just th

## `filter` rows {#filter}

-```{r filter, fig.cap="Diagram of filter() rows operation.", echo=FALSE, purl=FALSE, out.width=if(!knitr::is_latex_output()) "100%"}
+```{r filter, fig.cap="Diagram of filter() rows operation.", echo=FALSE, purl=FALSE, out.width="80%"}
include_graphics("images/cheatsheets/filter.png")
```

@@ -238,15 +238,15 @@ if(!is_latex_output())
  cat("See [Appendix A online](https://moderndive.com/v2/appendixa) for a glossary of such summary statistics.")
```

-Let's calculate two summary statistics of the `wind_speed` temperature variable in the `weather` data frame: the mean and standard deviation (recall from Section \@ref(nycflights) that the `weather` data frame is included in the `nycflights23` package). To compute these summary statistics, we need the `mean()` and `sd()` *summary functions* in R. Summary functions in R take in many values and return a single value, as illustrated in Figure \@ref(fig:summary-function).
+Let's calculate two summary statistics of the `wind_speed` variable in the `weather` data frame: the mean and standard deviation (recall from Section \@ref(nycflights) that the `weather` data frame is included in the `nycflights23` package). To compute these summary statistics, we need the `mean()` and `sd()` *summary functions* in R. Summary functions in R take in many values and return a single value, as shown in Figure \@ref(fig:summary-function).
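For instance, each of these summary functions collapses a vector of many values into a single value (the numbers below are made up purely for illustration):

```{r eval=FALSE}
wind_sample <- c(10.4, 8.1, 12.7, 6.9)  # hypothetical wind speed values

mean(wind_sample)  # returns one number: the average
sd(wind_sample)    # returns one number: the standard deviation
```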
-```{r summary-function, fig.cap="Diagram illustrating a summary function in R.", echo=FALSE, purl=FALSE, out.height="80%", out.width="80%"}
+```{r summary-function, fig.cap="Diagram illustrating a summary function in R.", echo=FALSE, purl=FALSE, out.width="80%"}
include_graphics("images/cheatsheets/summary.png")
```

More precisely, we'll use the `mean()` and `sd()` summary functions within the `summarize()` \index{R packages!dplyr!summarize()} function from the `dplyr` package. Note you can also use the British English spelling of `summarise()`. As shown in Figure \@ref(fig:sum1), the `summarize()` function takes in a data frame and returns a data frame with only one row corresponding to the summary statistics.

-```{r sum1, fig.cap="Diagram of summarize() rows.", echo=FALSE, purl=FALSE, out.height="80%", out.width="80%"}
+```{r sum1, fig.cap="Diagram of summarize() rows.", echo=FALSE, purl=FALSE, out.width="80%"}
include_graphics("images/cheatsheets/summarize1.png")
```

@@ -258,7 +258,7 @@ summary_windspeed <- weather |>
summary_windspeed
```

-Why are the values returned `NA`? `NA` is how R encodes *missing values* \index{missing values} where `NA` indicates "not available" or "not applicable." If a value for a particular row and a particular column does not exist, `NA` is stored instead. Values can be missing for many reasons. Perhaps the data was collected but someone forgot to enter it? Perhaps the data was not collected at all because it was too difficult to do so? Perhaps there was an erroneous value that someone entered that has been corrected to read as missing? You'll often encounter issues with missing values when working with real data.
+Why are the values returned `NA`? `NA` is how R encodes *missing values* \index{missing values} where `NA` indicates "not available" or "not applicable." If a value for a particular row and a particular column does not exist, `NA` is stored instead. Values can be missing for many reasons.
Perhaps the data was collected but someone forgot to enter it. Perhaps the data was not collected at all because it was too difficult to do so. Perhaps there was an erroneous value that someone entered that has been corrected to read as missing. You'll often encounter issues with missing values when working with real data.

Going back to our `summary_windspeed` output, by default any time you try to calculate a summary statistic of a variable that has one or more `NA` missing values in R, `NA` is returned. To work around this fact, you can set the `na.rm` argument to `TRUE`, where `rm` is short for "remove"; this will ignore any `NA` missing values and only return the summary value for all non-missing values.

@@ -315,7 +315,7 @@ summary_windspeed <- weather |>

(ref:groupby) Diagram of group_by() and summarize().

-```{r groupsummarize, fig.cap="(ref:groupby)", echo=FALSE, purl=FALSE, fig.height=ifelse(knitr::is_latex_output(), 2.5, 4)}
+```{r groupsummarize, fig.cap="(ref:groupby)", echo=FALSE, purl=FALSE, out.width="90%"}
# To get `_` to work in caption title. Found at https://github.com/rstudio/bookdown/issues/209
include_graphics("images/cheatsheets/group_summary.png")
```

@@ -329,7 +329,7 @@ summary_temp <- weather |>
summary_temp
```

-Say instead of a single mean wind speed for the whole year, we would like 12 mean temperatures, one for each of the 12 months separately. In other words, we would like to compute the mean wind speed split by month. We can do this by "grouping" temperature observations by the values of another variable, in this case by the 12 values of the variable `month`:
+Say instead of a single mean wind speed for the whole year, we would like 12 mean wind speeds, one for each of the 12 months separately. In other words, we would like to compute the mean wind speed split by month, as shown in the generic diagram in Figure \@ref(fig:groupsummarize).
We can do this by "grouping" wind speed observations by the values of another variable, in this case by the 12 values of the variable `month`:

```{r}
summary_monthly_windspeed <- weather |>

@@ -363,7 +363,7 @@ cut_levels <- diamonds |>
  n_distinct()
```

-Observe that now there is additional meta-data: ``# Groups: cut [`r cut_levels`]`` indicating that the grouping structure meta-data has been set based on the `r cut_levels` possible levels of the categorical variable `cut`: `"Fair"`, `"Good"`, `"Very Good"`, `"Premium"`, and `"Ideal"`. On the other hand, observe that the data has not changed: it is still a table of `r diamonds |> nrow() |> comma()` $\times$ `r diamonds |> ncol()` values. Only by combining a `group_by()` with another data wrangling operation, in this case `summarize()`, will the data actually be transformed.
+Observe that now there is additional meta-data: ``# Groups: cut [`r cut_levels`]`` indicating that the grouping structure meta-data has been set based on the `r cut_levels` possible levels of the categorical variable `cut`: `"Fair"`, `"Good"`, `"Very Good"`, `"Premium"`, and `"Ideal"`. On the other hand, observe that the data has not changed: it is still a table of `r diamonds |> nrow() |> comma()` $\times$ `r diamonds |> ncol()` values. Only by combining a `group_by()` with another data-wrangling operation, in this case `summarize()`, will the data actually be transformed.

```{r}
diamonds |>

@@ -456,11 +456,11 @@ What happened here is that the second `group_by(month)` overwrote the grouping s

## `mutate` existing variables {#mutate}

-```{r select, fig.cap="Diagram of mutate() columns.", echo=FALSE, purl=FALSE, out.height='80%', out.width='80%'}
+```{r mutatevars, fig.cap="Diagram of mutate() columns.", echo=FALSE, purl=FALSE, out.width="40%"}
include_graphics("images/cheatsheets/mutate.png")
```

-Another common transformation of data is to create/compute new variables based on existing ones.
For example, say you are more comfortable thinking of temperature in degrees Celsius (°C) instead of degrees Fahrenheit (°F). The formula to convert temperatures from °F to °C is +Another common transformation of data is to create/compute new variables based on existing ones as shown in Figure \@ref(fig:mutatevars). For example, say you are more comfortable thinking of temperature in degrees Celsius (°C) instead of degrees Fahrenheit (°F). The formula to convert temperatures from °F to °C is $$ \text{temp in C} = \frac{\text{temp in F} - 32}{1.8} @@ -586,7 +586,7 @@ flights <- flights |> ## `arrange` and sort rows {#arrange} -One of the most commonly performed data wrangling tasks is to sort a data frame's rows in the alphanumeric order of one of the variables. The `dplyr` package's `arrange()` function \index{R packages!dplyr!arrange()} allows us to sort/reorder a data frame's rows according to the values of the specified variable. +One of the most commonly performed data-wrangling tasks is to sort a data frame's rows in the alphanumeric order of one of the variables. The `dplyr` package's `arrange()` function \index{R packages!dplyr!arrange()} allows us to sort/reorder a data frame's rows according to the values of the specified variable. Suppose we are interested in determining the most frequent destination airports for all domestic flights departing from New York City in 2023: @@ -629,7 +629,7 @@ The values in the variable `carrier` in the `flights` data frame match the value (ref:relationships-nycflights) Data relationships in nycflights from *R for Data Science*. 
-```{r reldiagram, fig.cap="(ref:relationships-nycflights)", echo=FALSE, purl=FALSE, out.height="65%", out.width="65%"} +```{r reldiagram, fig.cap="(ref:relationships-nycflights)", echo=FALSE, purl=FALSE, out.width="65%"} include_graphics("images/r4ds/relational-nycflights.png") ``` @@ -651,7 +651,7 @@ A visual representation of the `inner_join()` is shown in Figure \@ref(fig:ijdia (ref:inner-join-r4ds) Diagram of inner join from *R for Data Science*. -```{r ijdiagram, fig.cap="(ref:inner-join-r4ds)", echo=FALSE, purl=FALSE, out.height="120%"} +```{r ijdiagram, fig.cap="(ref:inner-join-r4ds)", echo=FALSE, purl=FALSE, out.width="80%"} include_graphics("images/r4ds/join-inner.png") ``` @@ -666,7 +666,7 @@ The `airports` data frame contains the airport codes for each airport: View(airports) ``` -However, if you look at both the `airports` and `flights` data frames, you'll find that the airport codes are in variables that have different names. In `airports` the airport code is in `faa`, whereas in `flights` the airport codes are in `origin` and `dest`. This fact is further highlighted in the visual representation of the relationships between these data frames in Figure \@ref(fig:reldiagram). +However, if you look at both the `airports` and `flights` data frames, you'll find that the airport codes are in variables that have different names. In `airports`, the airport code is in `faa`, whereas in `flights` the airport codes are in `origin` and `dest`. This fact is further highlighted in the visual representation of the relationships between these data frames in Figure \@ref(fig:reldiagram). 
In order to join these two data frames by airport code, our `inner_join()` operation will use the `by = c("dest" = "faa")` \index{R packages!dplyr!inner\_join()!by} argument with modified code syntax allowing us to join two data frames where the key variable has a different name:

@@ -721,7 +721,7 @@ View(flights_weather_joined)

### Normal forms {#normal-forms}

-The data frames included in the `nycflights23` package are in a form that minimizes redundancy of data. For example, the `flights` data frame only saves the `carrier` code of the airline company; it does not include the actual name of the airline. For example, you'll see that the first row of `flights` has `carrier` equal to `UA`, but it does not include the airline name of "United Air Lines Inc."
+The data frames included in the `nycflights23` package are in a form that minimizes redundancy of data. For example, the `flights` data frame only saves the `carrier` code of the airline company; it does not include the actual name of the airline. For instance, you'll see that the first row of `flights` has `carrier` equal to `UA`, but it does not include the airline name "United Air Lines Inc."

The names of the airline companies are included in the `name` variable of the `airlines` data frame. In order to have the airline company name included in `flights`, we could join these two data frames as follows:

```{r}
joined_flights <- flights |>

View(joined_flights)
```

-We are capable of performing this join because each of the data frames have _keys_ in common to relate one to another: the `carrier` variable in both the `flights` and `airlines` data frames. The *key* variable(s) that we base our joins on are often *identification variables* as we mentioned previously.
+We are capable of performing this join because each of the data frames has _keys_ in common to relate one to another: the `carrier` variable in both the `flights` and `airlines` data frames.
The *key* variable(s) that we base our joins on are often *identification variables* as we mentioned previously. This is an important property of what's known as *normal forms* of data. The process of decomposing data frames into less redundant tables without losing information is called *normalization*. More information is available on [Wikipedia](https://en.wikipedia.org/wiki/Database_normalization). @@ -756,7 +756,7 @@ Both `dplyr` and [SQL](https://en.wikipedia.org/wiki/SQL) we mentioned in the in ## Other verbs {#other-verbs} -Here are some other useful data wrangling verbs: +Here are some other useful data-wrangling verbs: * `select()` only a subset of variables/columns. * `relocate()` variables/columns to a new position. @@ -766,7 +766,7 @@ Here are some other useful data wrangling verbs: ### `select` variables {#select} -```{r selectfig, fig.cap="Diagram of select() columns.", echo=FALSE, purl=FALSE, out.width=if(!knitr::is_latex_output()) "100%"} +```{r selectfig, fig.cap="Diagram of select() columns.", echo=FALSE, purl=FALSE, out.width="70%"} include_graphics("images/cheatsheets/select.png") ``` @@ -891,7 +891,7 @@ named_dests |> ### Summary table -Let's recap our data wrangling verbs in Table \@ref(tab:wrangle-summary-table). Using these verbs and the pipe `|>` operator from Section \@ref(piping), you'll be able to write easily legible code to perform almost all the data wrangling and data transformation necessary for the rest of this book. +Let's recap our data-wrangling verbs in Table \@ref(tab:wrangle-summary-table). Using these verbs and the pipe `|>` operator from Section \@ref(piping), you'll be able to write easily legible code to perform almost all the data wrangling and data transformation necessary for the rest of this book. 
```{r wrangle-summary-table, message=FALSE, echo=FALSE, purl=FALSE} # The following Google Doc is published to CSV and loaded using read_csv(): @@ -915,7 +915,7 @@ if (is_latex_output()) { `Data wrangling operation` = str_replace_all(`Data wrangling operation`, "`", ""), ) |> kbl( - caption = "Summary of data wrangling verbs", + caption = "Summary of data-wrangling verbs", booktabs = TRUE, linesep = "", format = "latex" @@ -929,7 +929,7 @@ if (is_latex_output()) { } else { ch4_scenarios |> kable( - caption = "Summary of data wrangling verbs", + caption = "Summary of data-wrangling verbs", booktabs = TRUE, format = "html" ) @@ -942,25 +942,25 @@ if (is_latex_output()) { \vspace{-0.1in} ``` -**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Let's now put your newly acquired data wrangling skills to the test! +**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Let's now put your newly acquired data-wrangling skills to the test! An airline industry measure of a passenger airline's capacity is the [available seat miles](https://en.wikipedia.org/wiki/Available_seat_miles), which is equal to the number of seats available multiplied by the number of miles or kilometers flown summed over all flights. For example, let's consider the scenario in Figure \@ref(fig:available-seat-miles). Since the airplane has 4 seats and it travels 200 miles, the available seat miles are $4 \times 200 = 800$. 
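In R, this toy calculation is simply:

```{r, eval=FALSE}
# Available seat miles for one flight: 4 seats flying 200 miles
4 * 200 # 800 available seat miles
```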
-```{r available-seat-miles, fig.cap="Example of available seat miles for one flight.", echo=FALSE, purl=FALSE, out.height=ifelse(knitr::is_latex_output(), "35%", "70%"), out.width=if(!knitr::is_latex_output()) "70%"}
+```{r available-seat-miles, fig.cap="Example of available seat miles for one flight.", echo=FALSE, purl=FALSE, out.width="60%"}
include_graphics("images/flowcharts/flowchart/flowchart.012.png")
```

Extending this idea, let's say an airline had 2 flights using a plane with 10 seats that flew 500 miles and 3 flights using a plane with 20 seats that flew 1000 miles; then the available seat miles would be $2 \times 10 \times 500 + 3 \times 20 \times 1000 = 70,000$ seat miles.

-Using the datasets included in the `nycflights23` package, compute the available seat miles for each airline sorted in descending order. After completing all the necessary data wrangling steps, the resulting data frame should have 16 rows (one for each airline) and 2 columns (airline name and available seat miles). Here are some hints:
+Using the datasets included in the `nycflights23` package, compute the available seat miles for each airline sorted in descending order. After completing all the necessary data-wrangling steps, the resulting data frame should have 16 rows (one for each airline) and 2 columns (airline name and available seat miles). Here are some hints:

1. **Crucial**: Unless you are very confident in what you are doing, it is worthwhile not to start coding right away. Rather, first sketch out on paper all the necessary data wrangling steps, not using exact code but rather high-level *pseudocode* that is informal yet detailed enough to articulate what you are doing. This way you won't confuse *what* you are trying to do (the algorithm) with *how* you are going to do it (writing `dplyr` code).
1. 
Take a close look at all the datasets using the `View()` function: `flights`, `weather`, `planes`, `airports`, and `airlines` to identify which variables are necessary to compute available seat miles.
1. Figure \@ref(fig:reldiagram) showing how the various datasets can be joined will also be useful.
-1. Consider the data wrangling verbs in Table \@ref(tab:wrangle-summary-table) as your toolbox!
+1. Consider the data-wrangling verbs in Table \@ref(tab:wrangle-summary-table) as your toolbox!

```{block, type="learncheck", purl=FALSE}
\vspace{-0.25in}
@@ -994,17 +994,17 @@ if(!is_latex_output())
However, to provide a tips and tricks page covering all possible data wrangling questions would be too long to be useful!")
```

-If you want to further unlock the power of the `dplyr` package for data wrangling, we suggest that you check out RStudio's "Data Transformation with dplyr" cheatsheet. This cheatsheet summarizes much more than what we've discussed in this chapter, in particular more intermediate level and advanced data wrangling functions, while providing quick and easy-to-read visual descriptions. In fact, many of the diagrams illustrating data wrangling operations in this chapter, such as Figure \@ref(fig:filter) on `filter()`, originate from this cheatsheet.
+If you want to further unlock the power of the `dplyr` package for data wrangling, we suggest that you check out RStudio's "Data Transformation with dplyr" cheatsheet. This cheatsheet summarizes much more than what we've discussed in this chapter, in particular more intermediate-level and advanced data-wrangling functions, while providing quick and easy-to-read visual descriptions. In fact, many of the diagrams illustrating data wrangling operations in this chapter, such as Figure \@ref(fig:filter) on `filter()`, originate from this cheatsheet.

In the current version of RStudio in 2024, you can access this cheatsheet by going to the RStudio Menu Bar -> Help -> Cheatsheets -> "Data Transformation with dplyr." 
`r if(is_html_output()) "You can see a preview in the figure below."` -```{r dplyr-cheatsheet, fig.cap="Data Transformation with dplyr cheatsheet.", echo=FALSE, purl=FALSE, out.width=if(!knitr::is_latex_output()) "100%"} +```{r dplyr-cheatsheet, fig.cap="Data Transformation with dplyr cheatsheet.", echo=FALSE, purl=FALSE, out.width="100%"} if (is_html_output()) { include_graphics("images/cheatsheets/dplyr_cheatsheet-1.png") } ``` -On top of the data wrangling verbs and examples we presented in this section, if you'd like to see more examples of using the `dplyr` package for data wrangling, check out [Chapter 5](http://r4ds.had.co.nz/transform.html) of *R for Data Science* [@rds2016]. +On top of the data-wrangling verbs and examples we presented in this section, if you'd like to see more examples of using the `dplyr` package for data wrangling, check out [Chapter 5](http://r4ds.had.co.nz/transform.html) of *R for Data Science* [@rds2016]. ### What's to come? diff --git a/04-tidy.Rmd b/04-tidy.Rmd index 6ad549640..726bc0217 100755 --- a/04-tidy.Rmd +++ b/04-tidy.Rmd @@ -112,7 +112,7 @@ Let's read in the exact same data, but this time from an Excel file saved on you At this point, you should see a screen pop-up like in Figure \@ref(fig:read-excel). After clicking on the "Import" \index{RStudio!import data} button on the bottom right of Figure \@ref(fig:read-excel), RStudio will save this spreadsheet's data in a data frame called `dem_score` and display its contents in the spreadsheet viewer. 
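The same import can also be done in code rather than through the menus; a minimal sketch using the `readxl` package (the file path here is hypothetical — use wherever you saved the spreadsheet):

```{r, eval=FALSE}
library(readxl)

# Read the first sheet of the Excel file into a data frame:
dem_score <- read_excel("dem_score.xlsx")
```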
-```{r read-excel, echo=FALSE, fig.cap="Importing an Excel file to R.", purl=FALSE, out.width=if(!knitr::is_latex_output()) "100%"}
+```{r read-excel, echo=FALSE, fig.cap="Importing an Excel file to R.", purl=FALSE, out.width="100%"}
include_graphics("images/rstudio_screenshots/read_excel.png")
```
@@ -135,7 +135,7 @@ drinks |>

After reading the help file by running `?drinks`, you'll see that `drinks` is a data frame containing results from a survey of the average number of servings of beer, spirits, and wine consumed in `r drinks |> nrow()` countries. This data was originally reported on FiveThirtyEight.com in Mona Chalabi's article: ["Dear Mona Followup: Where Do People Drink The Most Beer, Wine And Spirits?"](https://fivethirtyeight.com/features/dear-mona-followup-where-do-people-drink-the-most-beer-wine-and-spirits/).

-Let's apply some of the data wrangling verbs we learned in Chapter \@ref(wrangling) on the `drinks` data frame:
+Let's apply some of the data-wrangling verbs we learned in Chapter \@ref(wrangling) to the `drinks` data frame:

1. `filter()` to only consider 4 countries: the United States, China, Italy, and Saudi Arabia, *then*
1. `select()` all columns except `total_litres_of_pure_alcohol` by using the `-` sign, *then*
@@ -207,7 +207,7 @@ You have surely heard the word "tidy" in your life:



-What does it mean for your data to be "tidy"? While "tidy" has a clear English meaning of "organized," the word "tidy" in data science using R means that your data follows a standardized format. We will follow Hadley Wickham's \index{Wickham, Hadley} definition of *"tidy" data* \index{tidy data} [@tidy] shown also in Figure \@ref(fig:tidyfig):
+What does it mean for your data to be "tidy"? While "tidy" has a clear English meaning of "organized," the word "tidy" in data science using R means that your data follows a standardized format. 
We will follow Hadley Wickham's \index{Wickham, Hadley} definition of *"tidy" data* \index{tidy data} [@tidy], also shown in Figure \@ref(fig:tidyfig):

> A *dataset* is a collection of values, usually either numbers (if quantitative) or strings AKA text data (if qualitative/categorical). Values are organised in two ways. Every value belongs to a variable and an observation. A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a city) across attributes.
>
@@ -219,7 +219,7 @@ What does it mean for your data to be "tidy"? While "tidy" has a clear English m

(ref:tidy-r4ds) Tidy data graphic from *R for Data Science*.

-```{r tidyfig, echo=FALSE, fig.cap="(ref:tidy-r4ds)", out.height="80%", out.width="80%", purl=FALSE}
+```{r tidyfig, echo=FALSE, fig.cap="(ref:tidy-r4ds)", out.width="80%", purl=FALSE}
include_graphics("images/r4ds/tidy-1.png")
```
@@ -579,8 +579,8 @@ Congratulations! You've completed the "Data Science with `tidyverse`" portion of

However, we're going to leave Chapter \@ref(inference-for-regression) on "Inference for Regression" until after we've covered statistical inference in Chapters \@ref(sampling), \@ref(confidence-intervals), and \@ref(hypothesis-testing). Onwards and upwards into Statistical/Data Modeling as shown in Figure \@ref(fig:part2)!

-(ref:flowchart-partii) *ModernDive* flowchart - on to Part II!
+(ref:flowchart-partii) *ModernDive* flowchart – on to Part II! 
-```{r part2, echo=FALSE, fig.cap="(ref:flowchart-partii)", out.height="25%", purl=FALSE} +```{r part2, echo=FALSE, fig.cap="(ref:flowchart-partii)", out.width="50%", purl=FALSE} include_graphics("images/flowcharts/flowchart/flowchart.005.png") ``` diff --git a/05-regression.Rmd b/05-regression.Rmd index 261e2734a..56cfdd1cf 100755 --- a/05-regression.Rmd +++ b/05-regression.Rmd @@ -38,10 +38,10 @@ options(knitr.kable.NA = "") set.seed(76) ``` -We have introduced data visualization in Chapter \@ref(viz), data wrangling in Chapter \@ref(wrangling), and data importing and "tidy" data in Chapter \@ref(tidy). In this chapter we work with **regression**, a method that helps us study the relationship between an *outcome variable* or *response* and one or more *explanatory variables* or *regressors*. +We have introduced data visualization in Chapter \@ref(viz), data wrangling in Chapter \@ref(wrangling), and data importing and "tidy" data in Chapter \@ref(tidy). In this chapter, we work with **regression**, a method that helps us study the relationship between an *outcome variable* or *response* and one or more *explanatory variables* or *regressors*. The method starts by proposing a *statistical model*. Data is then collected and used to estimate the coefficients or parameters for the model, and these results are typically used for two purposes: -1. For **explanation** when we want to describe how changes in one or more of the regressors are associated to changes in the response, quantify those changes, establish which of the regressors truly have an association with the response, or determine whether the model used to describe the relationship between the response and the explanatory variables seems appropriate. +1. 
For **explanation** when we want to describe how changes in one or more of the regressors are associated with changes in the response, quantify those changes, establish which of the regressors truly have an association with the response, or determine whether the model used to describe the relationship between the response and the explanatory variables seems appropriate.
1. For **prediction** when we want to determine, based on the observed values of the regressors, what the value of the response will be. We are not concerned about how all the regressors relate and interact with one another or with the response; we simply want predictions that are as good as possible.

As an illustration, assume that we want to study the relationship between blood pressure and potential risk factors such as daily salt intake, age, and physical activity levels.
@@ -54,7 +54,7 @@ The most basic and commonly-used type of regression is *linear regression*\index

Linear regression involves a *numerical* response and one or more regressors that can be *numerical* or *categorical*. It is called linear regression because the **statistical model** that describes the relationship between the expected response and the regressors is assumed to be linear. In particular, when the model has a single regressor, the linear regression model is the equation of a line. Linear regression is the foundation for almost any other type of regression or related method.

-In Chapter \@ref(regression) we introduce linear regression with only one regressor. In Section \@ref(model1), the explanatory variable is numerical. This scenario is known as *simple linear regression*. In Section \@ref(model2), the explanatory variable is categorical.
+In Chapter \@ref(regression), we introduce linear regression with only one regressor. In Section \@ref(model1), the explanatory variable is numerical. This scenario is known as *simple linear regression*. In Section \@ref(model2), the explanatory variable is categorical. 
In Chapter \@ref(multiple-regression) on multiple regression, we extend these ideas and work with models with two explanatory variables. In Section \@ref(model4), we work with two numerical explanatory variables. In Section \@ref(model3), we work with one numerical and one categorical explanatory variable and study the model with and without interactions. @@ -108,7 +108,7 @@ These are all questions that are of interest to demographers and policy makers, n_countries <- un_member_states_2024 |> nrow() ``` -In this section, we aim to explain differences in fertility rates as a function of one numerical variable: life expectancy. Could it be that countries with higher life expectancy also have lower fertility rates? Could it be instead that countries with higher life expectancy tend to have higher fertility rates? Or could it be that there is no relationship between life expectancy and fertility rates? We answer these questions by modeling the relationship between fertility rates and life expectancy using *simple linear regression* \index{regression!simple linear}, where we have: +In this section, we aim to explain differences in fertility rates as a function of one numerical variable: life expectancy. Could it be that countries with higher life expectancy also have lower fertility rates? Could it be instead that countries with higher life expectancy tend to have higher fertility rates? Or could it be that there is no relationship between life expectancy and fertility rates? We answer these questions by modeling the relationship between fertility rates and life expectancy using *simple linear regression* \index{regression!simple linear} where we have: 1. A numerical outcome variable $y$ (the country's fertility rate) and 2. A single numerical explanatory variable $x$ (the country's life expectancy). 
@@ -207,9 +207,7 @@ UN_data_ch5 |> However, what if we want other summary statistics as well, such as the standard deviation (a measure of spread), the minimum and maximum values, and various percentiles? -Typing out all these summary statistic functions in `summarize()` would be long and tedious. Instead, we use the convenient [`tidy_summary()`](https://moderndive.github.io/moderndive/reference/tidy_summary.html) function from the `moderndive` \index{R packages!moderndive!tidy\_summary()} package. - -This function takes in a data frame, summarizes it, and returns commonly used summary statistics in tidy format. We take our `UN_data_ch5` data frame, `select()` only the outcome and explanatory variables `fert_rate` and `life_exp`, and pipe them into the `tidy_summary` function: +Typing all these summary statistic functions in `summarize()` would be long and tedious. Instead, we use the convenient [`tidy_summary()`](https://moderndive.github.io/moderndive/reference/tidy_summary.html) function in the `moderndive` \index{R packages!moderndive!tidy\_summary()} package. This function takes in a data frame, summarizes it, and returns commonly used summary statistics in tidy format. We take our `UN_data_ch5` data frame, `select()` only the outcome and explanatory variables `fert_rate` and `life_exp`, and pipe them into the `tidy_summary` function: ```{r eval=FALSE} UN_data_ch5 |> @@ -228,7 +226,7 @@ UN_data_ch5 |> ) ``` -We can also do this more directly by providing which `columns` we'd like a summary of inside the [`tidy_summary()`](https://moderndive.github.io/moderndive/reference/tidy_summary.html) function: +We can also do this more directly by providing which `columns` we'd like a summary of inside the `tidy_summary()` function: ```{r eval=FALSE} UN_data_ch5 |> @@ -274,9 +272,9 @@ life_summary_df <- summary_df |> Looking at this output, we can see how the values of both variables distribute. 
For example, the median fertility rate was `r fert_summary_df$median[1]`, whereas the median life expectancy was `r life_summary_df$median[1]` years. The middle 50% of fertility rates was between `r fert_summary_df$Q1[[1]]` and `r fert_summary_df$Q3[[1]]` (the first and third quartiles), and the middle 50% of life expectancies was from `r life_summary_df$Q1[[1]]` to `r life_summary_df$Q3[[1]]`. -The [`tidy_summary()`](https://moderndive.github.io/moderndive/reference/tidy_summary.html) function only returns what are known as *univariate* \index{univariate} summary statistics: functions that take a single variable and return some numerical summary of that variable. However, there also exist *bivariate* \index{bivariate} summary statistics: functions that take in two variables and return some summary of those two variables. +The `tidy_summary()` function only returns what are known as *univariate* \index{univariate} summary statistics: functions that take a single variable and return some numerical summary of that variable. However, there also exist *bivariate* \index{bivariate} summary statistics: functions that take in two variables and return some summary of those two variables. -In particular, when the two variables are numerical, we can compute the \index{correlation (coefficient)} *correlation coefficient*. Generally speaking, *coefficients* are quantitative expressions of a specific phenomenon. A *correlation coefficient* is a quantitative expression of the *strength of the linear relationship between two numerical variables*. Its value goes from -1 and 1 where: +In particular, when the two variables are numerical, we can compute the \index{correlation (coefficient)} *correlation coefficient*. Generally speaking, *coefficients* are quantitative expressions of a specific phenomenon. A *correlation coefficient* measures the *strength of the linear relationship between two numerical variables*. 
Its value goes from -1 to 1, where:


* -1 indicates a perfect *negative relationship*: As one variable increases, the value of the other variable tends to go down, following a straight line.
@@ -287,7 +285,7 @@ In particular, when the two variables are numerical, we can compute the \index{c

Figure \@ref(fig:correlation1) gives examples of nine different correlation coefficient values for hypothetical numerical variables $x$ and $y$.

-```{r correlation1, echo=FALSE, fig.cap="Nine different correlation coefficients.", fig.height=ifelse(knitr::is_latex_output(), 3.2, 4), purl=FALSE}
+```{r correlation1, echo=FALSE, fig.cap="Nine different correlation coefficients.", fig.height=ifelse(knitr::is_latex_output(), 3, 4), purl=FALSE}
correlation <- c(-0.9999, -0.9, -0.75, -0.3, 0, 0.3, 0.75, 0.9, 0.9999)
n_sim <- 100

values <- NULL
@@ -420,7 +418,7 @@ What is the main purpose of performing an exploratory data analysis (EDA) before

You may recall from secondary/high school algebra that the equation of a line is $y = a + b\cdot x$. (Note that the $\cdot$ symbol is equivalent to the $\times$ "multiply by" mathematical symbol. We'll use the $\cdot$ symbol in the rest of this book as it is more succinct.) It is defined by two coefficients $a$ and $b$. The intercept coefficient $a$ is the value of $y$ when $x = 0$. The slope coefficient $b$ for $x$ is the increase in $y$ for every increase of one in $x$. This is also called the "rise over run."

-However, when defining a regression line like the regression line in Figure \@ref(fig:numxplot3), we use slightly different notation: the equation of the regression line is $\widehat{y} = b_0 + b_1 \cdot x$ \index{regression!equation of a line}. The intercept coefficient is $b_0$, so $b_0$ is the value of $\widehat{y}$ when $x = 0$. The slope coefficient for $x$ is $b_1$, i.e., the increase in $\widehat{y}$ for every increase of one in $x$. Why do we put a "hat" on top of the $y$? 
It's a form of notation commonly used in regression to indicate that we have a \index{regression!fitted value} "fitted value," or the value of $y$ on the regression line for a given $x$ value. We discuss this more in the upcoming Subsection \@ref(model1points). +However, when defining a regression line like the one in Figure \@ref(fig:numxplot3), we use slightly different notation: the equation of the regression line is $\widehat{y} = b_0 + b_1 \cdot x$ \index{regression!equation of a line}. The intercept coefficient is $b_0$, so $b_0$ is the value of $\widehat{y}$ when $x = 0$. The slope coefficient for $x$ is $b_1$, i.e., the increase in $\widehat{y}$ for every increase of one in $x$. Why do we put a "hat" on top of the $y$? It's a form of notation commonly used in regression to indicate that we have a \index{regression!fitted value} "fitted value," or the value of $y$ on the regression line for a given $x$ value as discussed further in Subsection \@ref(model1points). We know that the regression line in Figure \@ref(fig:numxplot3) has a negative slope $b_1$ corresponding to our explanatory $x$ variable `life_exp`. Why? Because as countries tend to have higher `life_exp` values, they tend to have lower `fert_rate` values. However, what is the numerical value of the slope $b_1$? What about the intercept $b_0$? We do not compute these two values by hand, but rather we use a computer! @@ -431,8 +429,7 @@ We can obtain the values of the intercept $b_0$ and the slope for `life_exp` $b_ ```{r, eval=FALSE} # Fit regression model: -demographics_model <- lm(fert_rate ~ life_exp, - data = UN_data_ch5) +demographics_model <- lm(fert_rate ~ life_exp, data = UN_data_ch5) # Get regression coefficients coef(demographics_model) ``` @@ -456,11 +453,11 @@ $$ The intercept $b_0$ = `r demo_line[1]` is the average fertility rate $\widehat{y}$ = $\widehat{\text{fertility}\_\text{rate}}$ for those countries that had a `life_exp` of 0. 
Or in graphical terms, where the line intersects the $y$ axis for $x$ = 0. Note, however, that while the \index{regression!equation of a line!intercept} intercept of the regression line has a mathematical interpretation, it has no *practical* interpretation here, since observing a `life_exp` of 0 is impossible. Furthermore, looking at the scatterplot with the regression line in Figure \@ref(fig:numxplot3), no countries had a life expectancy anywhere near 0. -Of greater interest is the \index{regression!equation of a line!slope} slope $b_1$ = $b_{\text{life}\_\text{expectancy}}$ for `life_exp` of `r demo_line[2]`. This summarizes the relationship between the fertility rate and life expectancy variables. Note that the sign is negative, suggesting a negative relationship between these two variables. This means countries with higher life expectancies tend to have lower fertility rates. Recall from earlier that the correlation coefficient is `r cor_ch5`. They both have the same negative sign, but have a different value. Recall further that the correlation's interpretation is the "strength of linear association". The \index{regression!interpretation of the slope} slope's interpretation is a little different: +Of greater interest is the \index{regression!equation of a line!slope} slope $b_{\text{life}\_\text{expectancy}}$ for `life_exp` of `r demo_line[2]`. This summarizes the relationship between the fertility rate and life expectancy variables. Note that the sign is negative, suggesting a negative relationship between these two variables. This means countries with higher life expectancies tend to have lower fertility rates. Recall that the correlation coefficient is `r cor_ch5`. They both have the same negative sign, but have a different value. Recall also that the correlation's interpretation is the "strength of linear association." 
The \index{regression!interpretation of the slope} slope's interpretation is a little different:

> For every increase of 1 unit in `life_exp`, there is an *associated* decrease of, *on average*, `r abs(demo_line[2])` units of `fert_rate`.

-We only state that there is an *associated* increase and not necessarily a *causal* increase. For example, perhaps it may not be that higher life expectancies directly cause lower fertility rates. Instead, the following could hold true: wealthier countries tend to have stronger educational backgrounds, improved health, a higher standard of living, and have lower fertility rates, while at the same time these wealthy countries also tend to have higher life expectancies. In other words, just because two variables are strongly associated, it does not necessarily mean that one causes the other. This is summed up in the often quoted phrase, "correlation is not necessarily causation." We discuss this idea further in Subsection \@ref(correlation-is-not-causation).
+We only state that there is an *associated* decrease and not necessarily a *causal* decrease. Perhaps it may not be that higher life expectancies directly cause lower fertility rates. Instead, wealthier countries could tend to have stronger educational backgrounds, improved health, a higher standard of living, and have lower fertility rates, while at the same time these wealthy countries also tend to have higher life expectancies. Just because two variables are strongly associated, it does not necessarily mean that one causes the other. This is summed up in the often-quoted phrase, "correlation is not necessarily causation." We discuss this idea further in Subsection \@ref(correlation-is-not-causation).

Furthermore, we say that this associated decrease is *on average* `r abs(demo_line[2])` units of `fert_rate`, because you might have two countries whose `life_exp` values differ by 1 unit, but their difference in fertility rates may not be exactly $`r demo_line[2]`$. 
What the slope of $`r demo_line[2]`$ is saying is that across all possible countries, the *average* difference in fertility rate between two countries whose life expectancies differ by one is $`r demo_line[2]`$. @@ -468,18 +465,17 @@ Now that we have learned how to compute the equation for the regression line in ```{r, eval=FALSE} # Fit regression model: -demographics_model <- lm(fert_rate ~ life_exp, - data = UN_data_ch5) +demographics_model <- lm(fert_rate ~ life_exp, data = UN_data_ch5) # Get regression coefficients: coef(demographics_model) ``` -First, we "fit" the linear regression model to the `data` using the `lm()` \index{lm()} function and save this as `demographics_model`. When we say "fit", we mean "find the best fitting line to this data." `lm()` stands for "linear model" and is used as follows: `lm(y ~ x, data = data_frame_name)` where: +First, we "fit" the linear regression model to the `data` using the `lm()` \index{lm()} function and save this as `demographics_model`. When we say "fit," we mean "find the best fitting line to this data." `lm()` stands for "linear model" and is used as `lm(y ~ x, data = df_name)` where: * `y` is the outcome variable, followed by a tilde `~`. In our case, `y` is set to `fert_rate`. * `x` is the explanatory variable. In our case, `x` is set to `life_exp`. -* The combination of `y ~ x` is called a *model formula*. (Note the order of `y` and `x`.) In our case, the model formula is `fert_rate ~ life_exp`. We saw such model formulas earlier when we computed the correlation coefficient using the [`get_correlation()`](https://moderndive.github.io/moderndive/reference/get_correlation.html) function in Subsection \@ref(model1EDA). -* `data_frame_name` is the name of the data frame that contains the variables `y` and `x`. In our case, `data` is the `UN_data_ch5` data frame. +* The combination of `y ~ x` is called a *model formula*. (Note the order of `y` and `x`.) In our case, the model formula is `fert_rate ~ life_exp`. 
We saw such model formulas earlier with the [`get_correlation()`](https://moderndive.github.io/moderndive/reference/get_correlation.html) function in Subsection \@ref(model1EDA). +* `df_name` is the name of the data frame that contains the variables `y` and `x`. In our case, `data` is the `UN_data_ch5` data frame. Second, we take the saved model in `demographics_model` and apply the `coef()` function to it to obtain the regression coefficients. This gives us the components of the regression equation line: the intercept $b_0$ and the slope $b_1$. @@ -489,20 +485,20 @@ Second, we take the saved model in `demographics_model` and apply the `coef()` f \vspace{-0.1in} ``` -**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Fit a new simple linear regression using `lm(fert_rate ~ obes_rate, data = UN_data_ch5)` where `obes_rate` is the new explanatory variable $x$. Get information about the "best-fitting" line from the regression coefficients by applying the `coef()` function. How do the regression results match up with the results from your earlier exploratory data analysis? +**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Fit a simple linear regression using `lm(fert_rate ~ obes_rate, data = UN_data_ch5)` where `obes_rate` is the new explanatory variable $x$. Learn about the "best-fitting" line from the regression coefficients by applying the `coef()` function. How do the regression results match up with your earlier exploratory data analysis? -**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What does the intercept term $b_0$ represent in a simple linear regression model? +**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What does the intercept term $b_0$ represent in simple linear regression? -- A. The change in the outcome variable for a one-unit change in the explanatory variable. -- B. The predicted value of the outcome variable when the explanatory variable is zero. +- A. 
The change in the outcome for a one-unit change in the explanatory variable. +- B. The predicted value of the outcome when the explanatory variable is zero. - C. The standard error of the regression. - D. The correlation between the outcome and explanatory variables. -**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Which of the following best describes the "slope" of a simple linear regression line? +**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What best describes the "slope" of a simple linear regression line? -- A. The increase in the explanatory variable for a one-unit increase in the outcome variable. +- A. The increase in the explanatory variable for a one-unit increase in the outcome. - B. The average of the explanatory variable. -- C. The change in the outcome variable for a one-unit increase in the explanatory variable. +- C. The change in the outcome for a one-unit increase in the explanatory variable. - D. The minimum value of the outcome variable. **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What does a negative slope in a simple linear regression indicate? @@ -572,7 +568,7 @@ best_fit_plot Now say we want to compute both the fitted value $\widehat{y} = b_0 + b_1 \cdot x$ and the residual $y - \widehat{y}$ for *all* `r n_demo_ch5` UN member states with complete data as of 2024. Recall that each country corresponds to one of the `r n_demo_ch5` rows in the `UN_data_ch5` data frame and also one of the `r n_demo_ch5` points in the regression plot in Figure \@ref(fig:numxplot4). -We could repeat the previous calculations we performed by hand `r n_demo_ch5` times, but that would be tedious and time consuming. Instead, we do this using a computer with the [`get_regression_points()`](https://moderndive.github.io/moderndive/reference/get_regression_points.html) function. 
We apply the [`get_regression_points()`](https://moderndive.github.io/moderndive/reference/get_regression_points.html) function to `demographics_model`, which is where we saved our `lm()` model in the previous section. In Table \@ref(tab:regression-points-1) we present the results of only the 21st through 24th courses for brevity's sake.
+We could repeat the previous calculations we performed by hand `r n_demo_ch5` times, but that would be tedious and time consuming. Instead, we use a computer with the `get_regression_points()` function. We apply the `get_regression_points()` function to `demographics_model`, which is where we saved our `lm()` model in the previous section. In Table \@ref(tab:regression-points-1) we present the results for only the 21st through 24th countries for brevity.

```{r, eval=FALSE}
regression_points <- get_regression_points(demographics_model)
@@ -601,11 +597,11 @@ residual_24 <- regression_points$residual[24] |> round(3) |> format(nsmall = 3)

This function is an example of what is known in computer programming as a *wrapper function*. \index{functions!wrapper} It takes other pre-existing functions and "wraps" them into a single function that hides its inner workings. This concept is illustrated in Figure \@ref(fig:moderndive-figure-wrapper).

-```{r moderndive-figure-wrapper, echo=FALSE, fig.cap="The concept of a wrapper function.", out.height="60%", out.width="60%", purl=FALSE}
+```{r moderndive-figure-wrapper, echo=FALSE, fig.cap="The concept of a wrapper function.", out.width="50%", purl=FALSE}
include_graphics("images/shutterstock/wrapper_function.png")
```

-So all you need to worry about is what the inputs look like and what the outputs look like; you leave all the other details "under the hood of the car." 
In our regression modeling example, the [`get_regression_points()`](https://moderndive.github.io/moderndive/reference/get_regression_points.html) function takes a saved `lm()` linear regression model as input and returns a data frame of the regression predictions as output. If you are interested in learning more about the [`get_regression_points()`](https://moderndive.github.io/moderndive/reference/get_regression_points.html) function's inner workings, check out Subsection \@ref(underthehood). +So all you need to worry about is what the inputs look like and what the outputs look like; you leave all the other details "under the hood of the car." In our regression modeling example, the `get_regression_points()` function takes a saved `lm()` linear regression model as input and returns a data frame of the regression predictions as output. If you are interested in learning more about the `get_regression_points()` function's inner workings, check out Subsection \@ref(underthehood). We inspect the individual columns and match them with the elements of Figure \@ref(fig:numxplot4): @@ -621,7 +617,7 @@ Just as we did for the 21st country in the `UN_data_ch5` dataset (in the first r * `fert_rate_hat` $= `r fert_rate_hat_24` = `r demo_line[1]` + (`r demo_line[2]`) \cdot `r life_exp_24`$ is the fitted value $\widehat{y}$ on the regression line for this country. * `residual` $= `r residual_24` = `r fert_rate_24` - `r fert_rate_hat_24`$ is the value of the residual for this country. In other words, the model's fitted value was off by `r residual_24` fertility rate units for Brunei. -At this point, you can skip ahead if you like to Subsection \@ref(leastsquares) to learn about the processes behind what makes "best-fitting" regression lines. As a primer, a "best-fitting" line refers to the line that minimizes the *sum of squared residuals* out of all possible lines we can draw through the points. 
In Section \@ref(model2), we'll discuss another common scenario of having a categorical explanatory variable and a numerical outcome variable. +If you like, you can skip ahead to Subsection \@ref(leastsquares) to learn about the processes behind what makes "best-fitting" regression lines. As a primer, a "best-fitting" line refers to the line that minimizes the *sum of squared residuals* out of all possible lines we can draw through the points. In Section \@ref(model2), we'll discuss another common scenario of having a categorical explanatory variable and a numerical outcome variable. ```{block, type="learncheck", purl=FALSE} \vspace{-0.15in} @@ -638,8 +634,6 @@ At this point, you can skip ahead if you like to Subsection \@ref(leastsquares) **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Generate a data frame of the residuals of the *Learning check* model where you used `obes_rate` as the explanatory $x$ variable. -\newpage - **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Which of the following statements is true about the regression line in a simple linear regression model? - A. The regression line represents the average of the outcome variable. @@ -659,9 +653,7 @@ It is an unfortunate truth that life expectancy is not the same across all count 1. Differences between continents: Are there significant differences in average life expectancy between the six populated continents of the world: Africa, North America, South America, Asia, Europe, and Oceania? 1. Differences within continents: How does life expectancy vary within the world's five continents? For example, is the spread of life expectancy among the countries of Africa larger than the spread of life expectancy among the countries of Asia? -To answer such questions, we use an updated version of the `gapminder` data frame we visualized in Figure \@ref(fig:gapminder) in Subsection \@ref(gapminder) on the grammar of graphics. 
This updated data `un_member_states_2024` data we worked with earlier in this chapter, and it is included in the `moderndive` \index{R packages!moderndive} package. This dataset has international development statistics such as life expectancy, GDP per capita, and population for `r n_countries` countries for years near 2024.
-
-We use this data for basic regression again, but now using an explanatory variable $x$ that is categorical, as opposed to the numerical explanatory variable model we used in the previous Section \@ref(model1).
+To answer such questions, we use an updated version of the `gapminder` data frame we visualized in Figure \@ref(fig:gapminder) in Subsection \@ref(gapminder) on the grammar of graphics. This is the updated `un_member_states_2024` data we saw earlier in this chapter. It is included in the `moderndive` \index{R packages!moderndive} package and has international development statistics such as life expectancy, GDP per capita, and population for `r n_countries` countries for years near 2024. We use this data for basic regression again, but now using an explanatory variable $x$ that is categorical, as opposed to the numerical explanatory variable model we used in the previous Section \@ref(model1):

1. A numerical outcome variable $y$ (a country's life expectancy) and
1. A single categorical explanatory variable $x$ (the continent that the country is a part of).
@@ -715,10 +707,10 @@ gapminder2022 |> sample_n(size = 3)
```
```{r model2-data-preview, echo=FALSE, purl=FALSE}
gapminder2022 |>
-  sample_n(5) |>
+  sample_n(3) |>
  kbl(
    digits = 3,
-    caption = "Random sample of 5 out of 193 countries",
+    caption = "Random sample of 3 out of 188 countries",
    booktabs = TRUE,
    linesep = ""
  ) |>
@@ -728,13 +720,13 @@ gapminder2022 |>
  )
```

-Random sampling will likely produce a different subset of 3 rows for you than what's shown. 
Now that we have looked at the raw values in our `gapminder2022` data frame and got a sense of the data, we compute summary statistics. We again apply [`tidy_summary()`](https://moderndive.github.io/moderndive/reference/tidy_summary.html) from the `moderndive` package. Recall that this function takes in a data frame, summarizes it, and returns commonly used summary statistics. We take our `gapminder2022` data frame, `select()` only the outcome and explanatory variables `life_exp` and `continent`, and pipe them into [`tidy_summary()`](https://moderndive.github.io/moderndive/reference/tidy_summary.html): +Random sampling will likely produce a different subset of 3 rows for you than what's shown. Now that we have looked at the raw values in our `gapminder2022` data frame and got a sense of the data, we compute summary statistics. We again apply `tidy_summary()` from the `moderndive` package. Recall that this function takes in a data frame, summarizes it, and returns commonly used summary statistics. We take our `gapminder2022` data frame, `select()` only the outcome and explanatory variables `life_exp` and `continent`, and pipe them into `tidy_summary()` in Table \@ref(tab:lifeexp-cont): ```{r eval=FALSE} gapminder2022 |> select(life_exp, continent) |> tidy_summary() ``` -```{r echo=FALSE} +```{r lifeexp-cont, echo=FALSE} gapminder2022 |> select(life_exp, continent) |> tidy_summary() |> @@ -754,7 +746,7 @@ gapminder2022 |> gapminder2022 |> count(continent) ``` -The [`tidy_summary()`](https://moderndive.github.io/moderndive/reference/tidy_summary.html) output now reports summaries for categorical variables and for the numerical variables we reviewed before. Let's focus just on discussing the results for the categorical `factor` variable `continent`: +The `tidy_summary()` output now reports summaries for categorical variables and for the numerical variables we reviewed before. 
Let's focus just on discussing the results for the categorical `factor` variable `continent`: - `n`: The number of non-missing entries for each group - `group`: Breaks down a categorical variable into its unique levels. For this variable, it is corresponding to Africa, Asia, North and South America, Europe, and Oceania. @@ -771,7 +763,7 @@ Turning our attention to the summary statistics of the numerical variable `life_ We can answer this question by performing the last of the three common steps in an exploratory data analysis: creating data visualizations. We visualize the distribution of our outcome variable $y$ = `life_exp` in Figure \@ref(fig:lifeexp2022hist). -```{r lifeexp2022hist, echo=TRUE, fig.cap="Histogram of life expectancy in 2022.", fig.height=ifelse(knitr::is_latex_output(), 3.4, 4)} +```{r lifeexp2022hist, echo=TRUE, fig.cap="Histogram of life expectancy in 2022.", fig.height=ifelse(knitr::is_latex_output(), 3, 4)} ggplot(gapminder2022, aes(x = life_exp)) + geom_histogram(binwidth = 5, color = "white") + labs(x = "Life expectancy", @@ -817,11 +809,10 @@ Observe that unfortunately the distribution of African life expectancies is much Recall that an alternative method to visualize the distribution of a numerical variable split by a categorical variable is by using a side-by-side boxplot. We map the categorical variable `continent` to the $x$-axis and the different life expectancies within each continent on the $y$-axis in Figure \@ref(fig:catxplot1). 
-```{r catxplot1, fig.cap="Life expectancy in 2022 by continent (boxplot).", fig.height=ifelse(knitr::is_latex_output(), 3, 4)} +```{r catxplot1, fig.cap="Life expectancy in 2022 by continent (boxplot).", fig.height=ifelse(knitr::is_latex_output(), 2.5, 4)} ggplot(gapminder2022, aes(x = continent, y = life_exp)) + geom_boxplot() + - labs(x = "Continent", - y = "Life expectancy", + labs(x = "Continent", y = "Life expectancy", title = "Life expectancy by continent") ``` @@ -933,6 +924,7 @@ gapminder2022 |> **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Conduct a new exploratory data analysis with the same explanatory variable $x$ being `continent` but with `gdp_per_capita` as the new outcome variable $y$. What can you say about the differences in GDP per capita between continents based on this exploration? **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** When using a categorical explanatory variable in regression, what does the baseline group represent? + - A. The group with the highest mean - B. The group chosen for comparison with all other groups - C. The group with the most data points @@ -950,7 +942,7 @@ In Subsection \@ref(model1table) we introduced simple linear regression, which i As we did in Subsection \@ref(model1table) when studying the relationship between fertility rates and life expectancy, we output the regression coefficients for this model. Recall that this is done in two steps: -1. We first "fit" the linear regression model using the `lm(y ~ x, data)` function and save it in `life_exp_model`. +1. We "fit" the linear regression model using `lm(y ~ x, data)` and save it in `life_exp_model`. 1. We get the regression coefficients by applying the `coef()` function to `life_exp_model`. 
```{r} @@ -1074,7 +1066,7 @@ Understanding a regression table output when you are using a categorical explana \vspace{-0.1in} ``` -**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Fit a new linear regression using `lm(gdp_per_capita ~ continent, data = gapminder2022)` where `gdp_per_capita` is the new outcome variable $y$. Get information about the "best-fitting" line from the regression coefficients. How do the regression results match up with the results from your previous exploratory data analysis? +**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Fit a linear regression using `lm(gdp_per_capita ~ continent, data = gapminder2022)` where `gdp_per_capita` is the new outcome variable. Get information about the "best-fitting" line from the regression coefficients. How do the regression results match up with the results from your previous exploratory data analysis? **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** How many "offsets" or differences from the baseline will a regression model output for a categorical variable with 4 levels? @@ -1097,19 +1089,18 @@ Recall in Subsection \@ref(model1points), we defined the following three concept 1. Fitted values $\widehat{y}$, or the value on the regression line for a given $x$ value 1. Residuals $y - \widehat{y}$, or the error between the observed value and the fitted value -We obtained these values and other values using the [`get_regression_points()`](https://moderndive.github.io/moderndive/reference/get_regression_points.html) function from the `moderndive` package. This time, however, we add an argument setting `ID = "country"`: this is telling the function to use the variable `country` in `gapminder2022` as an *identification variable* in the output. This will help contextualize our analysis by matching values to countries. +We obtained these values and other values using the `get_regression_points()` function from the `moderndive` package. 
This time, however, we add an argument setting `ID = "country"`, which uses the variable `country` in `gapminder2022` as an *identification variable* in the output. This will help contextualize our analysis by matching values to countries. ```{r, eval=FALSE} -regression_points <- get_regression_points(life_exp_model, ID = "country") -regression_points +get_regression_points(life_exp_model, ID = "country") ``` ```{r model2-residuals, echo=FALSE, purl=FALSE} regression_points <- get_regression_points(life_exp_model, ID = "country") regression_points |> - slice(1:10) |> + filter(country %in% c("Afghanistan", "Albania", "Algeria", "Angola", "Argentina", "Barbados")) |> kbl( digits = 3, - caption = "Regression points (First 10 out of 142 countries)", + caption = "Regression points (Sample of 6 out of 142 countries)", booktabs = TRUE, linesep = "" ) |> @@ -1171,18 +1162,15 @@ The `residual` column is simply $y - \widehat{y}$ = `life_exp - life_exp_hat`. T Throughout this chapter we have been cautious when interpreting regression slope coefficients. We always discussed the "associated" effect of an explanatory variable $x$ on an outcome variable $y$. For example, our statement from Subsection \@ref(model1table) that "for every increase of 1 unit in `life_exp`, there is an *associated* decrease of on average `r abs(demo_line[2])` units of `fert_rate`." We include the term "associated" to be extra careful not to suggest we are making a *causal* statement. So while `life_exp` is negatively correlated with `fert_rate`, we can't necessarily make any statements about life expectancy's direct causal effect on fertility rates without more information. -Here is another example: a not-so-great medical doctor goes through medical records and finds that patients who slept with their shoes on tended to wake up more with headaches. So this doctor declares, "Sleeping with shoes on causes headaches!" 

+Here is another example depicted in Figure \@ref(fig:moderndive-figure-causal-graph-2): a not-so-great medical doctor goes through medical records and finds that patients who slept with their shoes on tended to wake up more with headaches. So this doctor declares, "Sleeping with shoes on causes headaches!"

-```{r moderndive-figure-causal-graph-2, echo=FALSE, fig.cap="Does sleeping with shoes on cause headaches?", out.width="60%", out.height="60%", purl=FALSE}
+```{r moderndive-figure-causal-graph-2, echo=FALSE, fig.cap="Does sleeping with shoes on cause headaches?", out.width="60%", purl=FALSE}
include_graphics("images/shutterstock/shoes_headache.png")
```

-However, there is a good chance that if someone is sleeping with their shoes on, it is potentially because they are intoxicated from alcohol. Furthermore, higher levels of drinking leads to more hangovers, and hence more headaches. The amount of alcohol consumption here is what's known as a *confounding/lurking* variable\index{confounding variable}. It "lurks" behind the scenes, confounding the causal relationship (if any) of "sleeping with shoes on" with "waking up with a headache." We can summarize this in Figure \@ref(fig:moderndive-figure-causal-graph) with a *causal graph* where:
-
-* Y is a *response* variable; here it is "waking up with a headache." \index{variables!response / outcome / dependent}
-* X is a *treatment* variable whose causal effect we are interested in; here it is "sleeping with shoes on."\index{variables!treatment}
+However, there is a good chance that if someone is sleeping with their shoes on, it may be because they are intoxicated from alcohol. Furthermore, higher levels of drinking lead to more hangovers, and hence more headaches. The amount of alcohol consumption here is what's known as a *confounding/lurking* variable\index{confounding variable}. 
It "lurks" behind the scenes, confounding the causal relationship (if any) of "sleeping with shoes on" with "waking up with a headache." We can summarize this in Figure \@ref(fig:moderndive-figure-causal-graph) with a *causal graph* where Y is a *response* variable and X is a *treatment* variable whose causal effect we are interested in. \index{variables!treatment} -```{r moderndive-figure-causal-graph, echo=FALSE, out.width="50%", fig.cap="Causal graph.", purl=FALSE} +```{r moderndive-figure-causal-graph, echo=FALSE, out.width="30%", fig.cap="Causal graph.", purl=FALSE} include_graphics("images/flowcharts/flowchart.009-cropped.png") ``` @@ -1194,18 +1182,18 @@ Alcohol will cause people to be both more likely to sleep with their shoes on as Establishing causation is a tricky problem and frequently takes either carefully designed experiments or methods to control for the effects of confounding variables. Both these approaches attempt, as best they can, either to take all possible confounding variables into account or negate their impact. This allows researchers to focus only on the relationship of interest: the relationship between the outcome variable Y and the treatment variable X. -As you read news stories, be careful not to fall into the trap of thinking that correlation necessarily implies causation. Check out the [Spurious Correlations](http://www.tylervigen.com/spurious-correlations) website for some rather comical examples of variables that are correlated, but are definitely not causally related. +As you read news stories, be careful not to fall into the trap of thinking that correlation necessarily implies causation. Check out the [Spurious Correlations](http://www.tylervigen.com/spurious-correlations) website for some comical examples of variables that are correlated, but are not causally related. Coming back to our UN member states data, a confounding variable could be the level of economic development of a country. 
This could affect both life expectancy and fertility rates. A proxy for looking at the level of economic development could be the Human Development Index (HDI). It measures a country's average achievements in three basic aspects of human development: health (life expectancy), education (mean and expected years of schooling), and standard of living (gross national income per capita). This is stored in the `hdi_2022` column of `un_member_states_2024`. We explore its relationship with life expectancy and fertility rates: -```{r} +```{r fig.height=1.5} ggplot(data = un_member_states_2024, aes(x = hdi_2022, y = life_expectancy_2022)) + geom_point() + labs(x = "Human Development Index (HDI)", y = "Life Expectancy") ``` -```{r} +```{r fig.height=1.5} ggplot(data = un_member_states_2024, aes(x = hdi_2022, y = fertility_rate_2022)) + geom_point() + @@ -1228,7 +1216,7 @@ Looking at both of the scatterplots above, we see a strong positive linear relat Regression lines are also known as "best-fitting" lines. But what do we mean by "best"? We unpack the criteria that is used in regression to determine "best." Recall Figure \@ref(fig:numxplot4), where for Bosnia and Herzegovina we marked the *observed value* $y$ with a circle, the *fitted value* $\widehat{y}$ with a square, and the *residual* $y - \widehat{y}$ with an arrow. 
We re-display Figure \@ref(fig:numxplot4) in the top-left plot of Figure \@ref(fig:best-fitting-line) in addition to three more arbitrarily chosen countries: -```{r best-fitting-line, fig.height=ifelse(knitr::is_latex_output(), 5.7, 7), echo=FALSE, fig.cap="Example of observed value, fitted value, and residual.", purl=FALSE, message=FALSE} +```{r best-fitting-line, fig.height=ifelse(knitr::is_latex_output(), 5.4, 7), echo=FALSE, fig.cap="Example of observed value, fitted value, and residual.", purl=FALSE, message=FALSE} # First residual demographics_model <- lm(fert_rate ~ life_exp, data = UN_data_ch5) index <- with(UN_data_ch5, which(iso == "BIH")) @@ -1344,7 +1332,7 @@ The four plots refer to: 1. The addition of India, which had a life expectancy $x = `r india |> pull(life_exp)`$ and fertility rate $y = `r india |> pull(fert_rate)`$. The residual in this case is $`r india |> pull(fert_rate)` - `r india |> pull(fert_rate_hat)` = `r india |> pull(residual)`$, which we mark with a new `r if_else(!is_latex_output(), "blue", "")` arrow in the bottom-left plot. 1. The addition of the Solomon Islands, which had a life expectancy $x = `r solomon |> pull(life_exp)`$ and fertility rate $y = `r solomon |> pull(fert_rate)`$. The residual in this case is $`r solomon |> pull(fert_rate)` - `r solomon |> pull(fert_rate_hat)` = `r solomon |> pull(residual)`$, which we mark with a new `r if_else(!is_latex_output(), "blue", "")` arrow in the bottom-right plot. -Now say we repeated this process of computing residuals for all `r n_demo_ch5` countries with complete information, then we squared all the residuals, and then we summed them. We call this quantity the *sum of squared residuals*\index{sum of squared residuals}; it is a measure of the _lack of fit_ of a model. Larger values of the sum of squared residuals indicate a bigger lack of fit. This corresponds to a worse fitting model. 
+Now say we repeated this process of computing residuals for all `r n_demo_ch5` countries with complete information, then we squared all the residuals, and then we summed them. We call this quantity the *sum of squared residuals*\index{sum of squared residuals}; it is a measure of the _lack of fit_ of a model. Larger values of the sum of squared residuals indicate a bigger lack of fit. This corresponds to a worse-fitting model. If the regression line fits all the points perfectly, then the sum of squared residuals is 0. This is because if the regression line fits all the points perfectly, then the fitted value $\widehat{y}$ equals the observed value $y$ in all cases, and hence the residual $y - \widehat{y}$ = 0 in all cases, and the sum of even a large number of 0's is still 0. @@ -1354,7 +1342,7 @@ $$ \sum_{i=1}^{n}(y_i - \widehat{y}_i)^2 $$ -We use our data wrangling tools to compute the sum of squared residuals exactly: +We use our data-wrangling tools to compute the sum of squared residuals exactly: ```{r eval=FALSE} # Fit regression model and regression points @@ -1415,7 +1403,7 @@ ggplot(example, aes(x = x, y = y)) + geom_point(size = 4) ``` -Compute the sum of squared residuals by hand for each line and show that of these three lines, the regression line `r if_else(is_latex_output(), "", "in blue")` has the smallest value. +Compute the sum of squared residuals by hand for each line. Show that the regression line `r if_else(is_latex_output(), "", "in blue")` has the smallest value of these three lines. ```{block, type="learncheck", purl=FALSE} \vspace{-0.25in} @@ -1427,11 +1415,11 @@ Compute the sum of squared residuals by hand for each line and show that of thes Recall in this chapter we introduced a wrapper function from the `moderndive` package: -- [`get_regression_points()`](https://moderndive.github.io/moderndive/reference/get_regression_points.html) that returns point-by-point information from a regression model in Subsection \@ref(model1points). 

+- `get_regression_points()` that returns point-by-point information from a regression model in Subsection \@ref(model1points).

-What is going on behind the scenes with the [`get_regression_points()`](https://moderndive.github.io/moderndive/reference/get_regression_points.html) function? We mentioned in Subsection \@ref(model1table) that this was an example of a *wrapper function*. Such functions take other pre-existing functions and "wrap" them into single functions that hide the user from their inner workings. This way all the user needs to worry about is what the inputs look like and what the outputs look like. In this subsection, we'll "get under the hood" of these functions and see how the "engine" of these wrapper functions works.
+What is going on behind the scenes with the `get_regression_points()` function? We mentioned in Subsection \@ref(model1table) that this was an example of a *wrapper function*. Such functions take other pre-existing functions and "wrap" them into single functions that hide their inner workings from the user. This way, all the user needs to worry about is what the inputs look like and what the outputs look like. In this subsection, we'll "get under the hood" of these functions and see how the "engine" of these wrapper functions works.

-The [`get_regression_points()`](https://moderndive.github.io/moderndive/reference/get_regression_points.html) function is a wrapper function, returning information about the individual points involved in a regression model like the fitted values, observed values, and the residuals. [`get_regression_points()`](https://moderndive.github.io/moderndive/reference/get_regression_points.html) \index{R packages!moderndive!get\_regression\_points()} uses the `augment()` \index{R packages!broom!augment()} function in the [`broom` package](https://broom.tidyverse.org/) to produce the data shown in Table \@ref(tab:regpoints-augment). 
Additionally, it uses `clean_names()` \index{R packages!janitor!clean\_names()} from the [`janitor` package](https://github.com/sfirke/janitor) [@R-janitor] to clean up the variable names.
+The `get_regression_points()` function is a wrapper function, returning information about the individual points involved in a regression model like the fitted values, observed values, and the residuals. `get_regression_points()` \index{R packages!moderndive!get\_regression\_points()} uses the `augment()` \index{R packages!broom!augment()} function in the [`broom` package](https://broom.tidyverse.org/) to produce the data shown in Table \@ref(tab:regpoints-augment). Additionally, it uses `clean_names()` \index{R packages!janitor!clean\_names()} from the [`janitor` package](https://github.com/sfirke/janitor) [@R-janitor] to clean up the variable names.

```{r, eval=FALSE}
@@ -1494,7 +1482,7 @@ generate_r_file_link <- function(file) {
generate_r_file_link("05-regression.R")
```

-As we suggested in Subsection \@ref(model1EDA), interpreting coefficients that are not close to the extreme values of -1, 0, and 1 can be somewhat subjective. To help develop your sense of correlation coefficients, we suggest you play the 80s-style video game called, "Guess the Correlation", at .
+As we suggested in Subsection \@ref(model1EDA), interpreting coefficients that are not close to the extreme values of -1, 0, and 1 can be somewhat subjective. To help develop your sense of correlation coefficients, we suggest you play the 80s-style video game called "Guess the Correlation," previewed in Figure \@ref(fig:guess-the-correlation).

(ref:guess-corr) Preview of "Guess the Correlation" game. 
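Editorial note on the chapter-5 hunks above: the text describes `get_regression_points()` as wrapping `broom::augment()` and `janitor::clean_names()`. A minimal sketch of that pipeline follows; the `mtcars` model `mpg ~ wt` is an illustrative stand-in (the chapter's own data setup is not part of this diff), and `moderndive`'s actual column selection and rounding may differ.

```r
# Sketch of what get_regression_points() roughly does, per the text:
# broom::augment() adds fitted values (.fitted) and residuals (.resid)
# to the model's data, and janitor::clean_names() tidies those names.
# NOTE: mtcars and the mpg ~ wt model are stand-ins for illustration.
library(broom)
library(janitor)
library(dplyr)

model <- lm(mpg ~ wt, data = mtcars)

regression_points <- model |>
  augment() |>       # adds .fitted, .resid, and other diagnostics
  clean_names() |>   # .fitted -> fitted, .resid -> resid
  select(mpg, wt, fitted, residual = resid)

head(regression_points, 3)
```

By construction, each `residual` equals `mpg - fitted`, mirroring the $y - \widehat{y}$ definition used throughout the chapter.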
diff --git a/06-multiple-regression.Rmd b/06-multiple-regression.Rmd index 84ca7721c..0381d4c7c 100644 --- a/06-multiple-regression.Rmd +++ b/06-multiple-regression.Rmd @@ -28,14 +28,14 @@ options(knitr.kable.NA = "") set.seed(76) ``` -In Chapter \@ref(regression) we studied simple linear regression as a +In Chapter \@ref(regression), we studied simple linear regression as a model that represents the relationship between two variables: an outcome variable or response $y$ and an explanatory variable or regressor $x$. Furthermore, to keep things simple, we only considered models with one explanatory $x$ variable that was either numerical in Section \@ref(model1) or categorical in Section \@ref(model2). -In this chapter we introduce multiple linear regression, the direct +In this chapter, we introduce multiple linear regression, the direct extension to simple linear regression when more than one explanatory variable is taken into account to explain changes in the outcome variable. As we show in the next few sections, much of the material @@ -206,10 +206,10 @@ life expectancy of 25% of the UN member states is less than 69.4 years. When a variable in the dataset is categorical, also called a `factor`, the summary shows all the categories or factor levels and the number of observations for each level. For example, income group (`income`) is a -factor with four factor levels: `Low Income`, `Lower middle income`, +factor with four levels: `Low Income`, `Lower middle income`, `Upper middle income`, and `High income`. The summary also provides the number of states for each factor level; observe, for example, that the -dataset has 56 UN members states that are considered `High Income` +dataset has 56 UN member states that are considered `High Income` states. Furthermore, we can compute the correlation coefficient between our two @@ -266,7 +266,7 @@ income groups are negative. 
Third, the slope for the `High income` group is clearly less steep than the slopes for all other three groups. So, the changes in fertility rate due to changes in life expectancy are dependent on the level of income of a given country. Fourth, observe -that high income countries have, in general, high life expectancy and +that high-income countries have, in general, high life expectancy and low fertility rates. ### Model with interactions {#model4interactiontable} @@ -421,13 +421,13 @@ by the categorical variable, the numerical variable, and the interaction effects. There are eight coefficients in our model and we have separated their coefficients into three lines to highlight their different roles. The first line shows the intercept and the effects of the categorical -explanatory variables. Recall that $D_2$, $D_3$ and $D_4$ are the dummy +explanatory variables. Recall that $D_2$, $D_3$, and $D_4$ are the dummy variables in the model and each is equal to one or zero depending on the category of the country at hand; correspondingly, the coefficients $b_{02}$, $b_{03}$, and $b_{04}$ are the offsets with respect to the baseline level of the intercept, $b_0$. Recall that the first dummy variable has been dropped and the intercept captures this effect. The -second line in the equation represent the effect of the numerical +second line in the equation represents the effect of the numerical variable, $x$. In our example $x$ is the value of life expectancy. The coefficient $b_1$ is the slope of the line and represents the change in fertility rate due to one unit change in life expectancy. The third line @@ -489,7 +489,7 @@ lm_data |> ) ``` -We can match the coefficients with the values computed: the fitted +We can match the coefficients with the values computed in Table \@ref(tab:regtable-interaction): the fitted fertility rate $\widehat{y} = \widehat{\text{fert rate}}$ for `Low income` countries is @@ -528,8 +528,8 @@ similarly. 
Since the life expectancy for `Low income` countries has a steeper slope than `High income` countries, one additional year of life expectancy -will decrease fertility rates more for the low income group than for the -high income group. This is consistent with our observation from Figure +will decrease fertility rates more for the low-income group than for the +high-income group. This is consistent with our observation from Figure \@ref(fig:numxcatxplot1). When the associated effect of one variable *depends on the value of another variable*, we say that there is an interaction effect. This is the reason why the regression slopes are @@ -665,7 +665,8 @@ lm_data1 |> ) ``` -In this model without interactions, the slope is the same for all the +In this model without interactions, presented in Table +\@ref(tab:regtable-parallel1), the slope is the same for all the regression lines, $b_1 = `r b_no_int[[5]]`$. Assuming that this model is correct, for any UN member state, every additional year of life expectancy reduces the average fertility rate by `r abs(b_no_int[[5]])` @@ -729,7 +730,7 @@ the interaction model seems more appropriate. In this subsection, we work with the regression model with interactions. The coefficients for this model were found earlier, saved in -`model_int`, and are shown below: +`model_int`, and are displayed in Table \@ref(tab:regtable-interaction1): ```{r regtable-interaction1, echo=FALSE, purl=FALSE} # Fit regression model: @@ -926,13 +927,13 @@ credit_ch6 |> Note that income is in thousands of dollars while debt and credit limit are in dollars. We can also compute summary statistics using the `tidy_summary()` function.
We only `select()` the columns of interest -for our model: +for our model as shown in Table \@ref(tab:credittable): ```{r eval=FALSE} credit_ch6 |> select(debt, credit_limit, income) |> tidy_summary() ``` -```{r echo=FALSE} +```{r credittable, echo=FALSE} credit_ch6 |> select(debt, credit_limit, income) |> tidy_summary() |> @@ -957,7 +958,7 @@ and median credit card limit, `credit_limit`, are around \$4,736 and 57.5; so 75% of card holders had incomes below \$57,500. We visualize the relationship of the response variable with each of the -two explanatory variables using the R code below. These plots are shown +two explanatory variables using this R code. These plots are shown in Figure \@ref(fig:2numxplot1). ```{r, eval=FALSE} @@ -1059,7 +1060,7 @@ Let's look at some findings presented in the correlation matrix: credit limit is, the larger is the credit card debt, on average. 3. The correlation between `debt` and `income` is 0.464. The linear relationship is positive albeit somewhat weak. In other words, - higher income is only weakly associated to higher debt. + higher income is only weakly associated with higher debt. 4. Observe also that the correlation coefficient between the two explanatory variables, `credit_limit` and `income`, is 0.792. @@ -1240,7 +1241,7 @@ changes in the response that were not already accounted for by the first regressor. For example, the slope for `credit_limit` is \$0.264. Keeping `income` fixed at some value, for each additional dollar of credit limit, the credit card debt increases, on average, by \$0.264. -Similarly the slope of `income` is -\$7.663. Keeping `credit_limit` +Similarly, the slope of `income` is -\$7.663. Keeping `credit_limit` fixed at some level, for a one-unit increase in `income` (\$1000 in actual income), there is an associated decrease of \$7.66 in credit card debt, on average. @@ -1326,8 +1327,8 @@ your previous exploratory data analysis?
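The correlation matrix and the partial slopes discussed above can be reproduced in miniature on simulated data; the names mirror `credit_ch6`, but the observations, and the true coefficients chosen near the chapter's estimates, are invented for illustration:

```{r}
# Sketch on simulated data: partial slopes with two correlated regressors.
set.seed(76)
n <- 400
income       <- runif(n, 10, 190)                       # thousands of dollars
credit_limit <- 500 + 40 * income + rnorm(n, sd = 800)  # correlated with income
debt         <- -400 + 0.26 * credit_limit - 7.5 * income + rnorm(n, sd = 100)

round(cor(cbind(debt, credit_limit, income)), 3)  # correlation matrix
sim_model <- lm(debt ~ credit_limit + income)
coef(sim_model)
# Each slope is a partial effect: the associated change in debt for a
# one-unit increase in one regressor, holding the other regressor fixed.
```

Note that, as in the chapter, the two regressors are themselves strongly correlated, yet `lm()` still separates their partial effects.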
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Which of the following statements best describes the interpretation of a regression coefficient in a multiple regression model? - - A. It represents the additional effect of a regressor on the response when other regressors have already been taken into account. - - B. It represents the average response variable value when all explanatory variables are zero. + - A. It is the additional effect of a regressor on the response when other regressors have already been taken into account. + - B. It is the average response variable value when all explanatory variables are zero. - C. It is always positive if the correlation is strong. - D. It cannot be interpreted if there are more than two explanatory variables. @@ -1335,7 +1336,7 @@ your previous exploratory data analysis? - A. It represents the line of best fit for each explanatory variable separately. - B. It minimizes the product of residuals. - - C. It minimizes the sum of squared residuals for all combinations of explanatory variables. + - C. It minimizes the sum of squared residuals for all combinations of regressors. - D. It shows the exact predictions for every data point. **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What does the intercept represent in a multiple regression model with two explanatory variables? @@ -1343,7 +1344,7 @@ your previous exploratory data analysis? - A. The effect of one explanatory variable, keeping the other constant. - B. The change in the response variable per unit change in the explanatory variable. - C. The correlation between the two explanatory variables. - - D. The expected value of the response variable when all explanatory variables are zero. + - D. The expected value of the response variable when all regressors are zero. **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What does the term "partial slope" refer to in a multiple regression model? @@ -1427,25 +1428,25 @@ portion of this book. 
We are ready to proceed to Part III: "Statistical Inference with `infer`." Statistical inference is the science of drawing conclusions about some unknown quantity using sampling. So far, we have only studied the regression coefficients and their interpretation. In -future chapters we learn how we can use information from a +later chapters, we learn how to use information from a sample to make inferences about the entire population. Once we have covered Chapters \@ref(sampling) on sampling, \@ref(confidence-intervals) on confidence intervals, and \@ref(hypothesis-testing) on hypothesis testing, we revisit the regression models in Chapter \@ref(inference-for-regression) on -inference for regression. This will complete the topics we want to cover -in this book, as shown in Figure \@ref(fig:part3)! +inference for regression. This will complete the topics in this book, as shown +in Figure \@ref(fig:part3)! -Furthermore in Chapter \@ref(inference-for-regression), we revisit the +Also in Chapter \@ref(inference-for-regression), we revisit the concept of residuals $y - \widehat{y}$ and discuss their importance when interpreting the results of a regression model. We perform what is known as a *residual analysis* of the `residual` variable of all -`get_regression_points()` outputs. Residual analyses allow you to verify +`get_regression_points()` outputs. Residual analyses enable us to verify what are known as the *conditions for inference for regression*. -(ref:flowchart-partiii) *ModernDive* flowchart - on to Part III! +(ref:flowchart-partiii) *ModernDive* flowchart – on to Part III!
-```{r part3, echo=FALSE, fig.cap="(ref:flowchart-partiii)", purl=FALSE, out.width=if(!knitr::is_latex_output()) "100%"} +```{r part3, echo=FALSE, fig.cap="(ref:flowchart-partiii)", purl=FALSE, out.width="65%"} include_graphics("images/flowcharts/flowchart/flowchart.006.png") ``` diff --git a/07-sampling.Rmd b/07-sampling.Rmd index 629cc640d..0d46045f2 100644 --- a/07-sampling.Rmd +++ b/07-sampling.Rmd @@ -35,7 +35,7 @@ options(scipen = 99, digits = 3) set.seed(76) ``` -The third portion of this book introduces statistical inference. This chapter is about *sampling*. Sampling involves drawing repeated random samples from a population. In Section \@ref(sampling-activity) we illustrate sampling by working with samples of white and red balls and the proportion of red balls in these samples. In Section \@ref(sampling-framework) we present a theoretical framework and define what is the sampling distribution. We introduce one of the fundamental theoretical results in Statistics: the *Central Limit Theorem* in Section \@ref(central-limit-theorem). In Section \@ref(sampling-activity-mean) we present a second sampling activity, this time working with samples of chocolate-covered almonds and the average weight of these samples. In Section \@ref(sampling-other-scenarios) we present the sampling distribution in other scenarios. The concepts behind *sampling* form the basis of inferential methods, in particular confidence intervals and hypothesis tests; methods that are studied in Chapters \@ref(confidence-intervals) and \@ref(hypothesis-testing). +The third portion of this book introduces statistical inference. This chapter is about *sampling*. Sampling involves drawing repeated random samples from a population. In Section \@ref(sampling-activity), we illustrate sampling by working with samples of white and red balls and the proportion of red balls in these samples. 
In Section \@ref(sampling-framework), we present a theoretical framework and define the sampling distribution. In Section \@ref(central-limit-theorem), we introduce one of the fundamental theoretical results in statistics: the *Central Limit Theorem*. In Section \@ref(sampling-activity-mean), we present a second sampling activity, this time working with samples of chocolate-covered almonds and the average weight of these samples. In Section \@ref(sampling-other-scenarios), we present the sampling distribution in other scenarios. The concepts behind *sampling* form the basis of inferential methods, in particular confidence intervals and hypothesis tests, methods that are studied in Chapters \@ref(confidence-intervals) and \@ref(hypothesis-testing). ## Needed packages {-#sampling-packages} @@ -84,7 +84,7 @@ bowl The `bowl` has `r num_balls` rows representing the `r num_balls` balls in the bowl shown in Figure \@ref(fig:sampling-exercise-1). You can view and scroll through the entire contents of the `bowl` in RStudio's data viewer by running `View(bowl)`. The first variable `ball_ID` is used as an *identification variable* as discussed in Subsection \@ref(identification-vs-measurement-variables); none of the balls in the actual bowl are marked with numbers. -The second variable `color` indicates whether a particular virtual ball is red or white. We compute the proportion of red balls in the bowl using the `dplyr` data wrangling verbs presented in Chapter \@ref(wrangling). A few steps are needed in order to determine this proportion. We present these steps separately to remind you how they work but later introduce all the steps together and simplify some of the code. First, for each of the balls, we identify if it is red or not using a test for equality with the logical operator `==`. We do this by using the `mutate()` function from Section \@ref(mutate) that allows us to create a new Boolean variable called `is_red`.
+The second variable `color` indicates whether a particular virtual ball is red or white. We compute the proportion of red balls in the bowl using the `dplyr` data-wrangling verbs presented in Chapter \@ref(wrangling). A few steps are needed in order to determine this proportion. We present these steps separately to remind you how they work but later introduce all the steps together and simplify some of the code. First, for each of the balls, we identify whether it is red or not using a test for equality with the logical operator `==`. We do this by using the `mutate()` function from Section \@ref(mutate) that allows us to create a new Boolean variable called `is_red`. ```{r} bowl |> @@ -167,7 +167,7 @@ include_graphics("images/sampling/balls/tactile_3_c-2e.png") By looking at the histogram, we observe that the lowest proportion of red balls was between 0.20 and 0.25 while the highest was between 0.45 and 0.5. More importantly, the most frequently occurring proportions were between 0.30 and 0.35. -This activity performed by 33 students has the results stored in the `tactile_prop_red` data frame included in the `moderndive` package. The first 10 rows are printed below: +The results of this activity, performed by 33 students, are stored in the `tactile_prop_red` data frame included in the `moderndive` package. The first 10 rows are given: ```{r} tactile_prop_red @@ -255,7 +255,7 @@ Based on this random sample, `r prop_red_sample1*100`% of the `virtual_shovel`'s #### Thirty-three virtual samples {-} -In Section \@ref(sampling-activity), students got 33 samples and sample proportions. They repeated/replicated the sampling process 33 times. We do this virtually by again using the function `rep_slice_sample()` and this time adding the `reps = 33` argument as we want to retrieve 33 random samples. We save these samples in the data frame `virtual_samples`, as shown below, and then provide a preview of its first 10 rows.
If you want to inspect the entire `virtual_samples` data frame, use RStudio's data viewer by running `View(virtual_samples)`. +In Section \@ref(sampling-activity), students got 33 samples and sample proportions. They repeated/replicated the sampling process 33 times. We do this virtually by again using the function `rep_slice_sample()` and this time adding the `reps = 33` argument as we want to retrieve 33 random samples. We save these samples in the data frame `virtual_samples`, as shown, and then provide a preview of its first 10 rows. If you want to inspect the entire `virtual_samples` data frame, use RStudio's data viewer by running `View(virtual_samples)`. ```{r echo=-1} set.seed(76) @@ -266,7 +266,7 @@ virtual_samples Observe in the data viewer that the first 50 rows of `replicate` are equal to `1`, the next 50 rows of `replicate` are equal to `2`, and so on. The first 50 rows correspond to the first sample of 50 balls while the next 50 rows correspond to the second sample of 50 balls. This pattern continues for all `reps = 33` replicates, and thus `virtual_samples` has 33 $\cdot$ 50 = 1650 rows. -Using `virtual_samples` we find the proportion of red balls for each replicate. We use the same `dplyr` verbs as before. In particular, we add `group_by()` of the `replicate` variable. Recall from Section \@ref(groupby) that by assigning the grouping variable "meta-data" before `summarize()`, we perform the calculations needed for each replicate separately. The other line of code, as explained in the case of one sample, calculates the sample proportion of red balls. A preview of the first 10 rows is presented below: +Using `virtual_samples` we find the proportion of red balls for each replicate. We use the same `dplyr` verbs as before. In particular, we add `group_by()` of the `replicate` variable. 
Recall from Section \@ref(groupby) that by assigning the grouping variable "meta-data" before `summarize()`, we perform the calculations needed for each replicate separately. The other line of code, as explained in the case of one sample, calculates the sample proportion of red balls. A preview of the first 10 rows is presented: ```{r} virtual_prop_red <- virtual_samples |> @@ -359,7 +359,7 @@ if (is_latex_output()) { It was helpful to observe how sampling variation affects sample proportions in 33 samples. It was also interesting to note that while the 33 virtual samples provide different sample proportions than the 33 physical samples, the overall patterns were fairly similar. Because the samples were taken at random in both cases, any other set of 33 samples, virtual or physical, would provide a different set of sample proportions due to sampling variation, but the overall patterns would still be similar. Still, 33 samples are not enough to fully understand these patterns. -This is why we now study the sampling distribution and the effects of sampling variation with 1000 random samples. Trying to do this manually could be impractical but getting virtual samples can be done quickly and efficiently. Additionally, we have already developed the tools for this. We repeat the steps performed earlier using the `rep_slice_sample()` function with a sample `size` set to be 50. This time however, we set the number of replicates `reps` to `1000`, and use `summarize()` and `mean()` again on the Boolean values to calculate the sample proportions. We compute `virtual_prop_red` with the count of red balls and the corresponding sample proportion for all 1000 random samples. The proportions for the first 10 samples are shown below: +This is why we now study the sampling distribution and the effects of sampling variation with 1000 random samples. Trying to do this manually could be impractical but getting virtual samples can be done quickly and efficiently. 
Additionally, we have already developed the tools for this. We repeat the steps performed earlier using the `rep_slice_sample()` function with a sample `size` set to be 50. This time, however, we set the number of replicates (`reps`) to `1000`, and use `summarize()` and `mean()` again on the Boolean values to calculate the sample proportions. We compute `virtual_prop_red` with the count of red balls and the corresponding sample proportion for all 1000 random samples. The proportions for the first 10 samples are shown: ```{r echo=-1} set.seed(76) @@ -532,12 +532,7 @@ Observe that all three histograms are: - are somewhat bell-shaped, and - exhibit *sampling variation* that is different for each sample size. In particular, as the sample size increases from 25 to 50 to 100, the sample proportions do not vary as much and they seem to get closer to the middle value. -These are important characteristic of the *sampling distribution* of the sample proportion: the first observation relates to the shape of the distribution, the second to the center of the distribution, and the last one to the *sampling variation* and how it is affected by the sample size. These results are not coincidental or isolated to the example of sample proportions of red balls in a bowl. In the next subsection, a theoretical framework is introduced that helps explain with precise mathematical equations the behavior of sample proportions coming from random samples. - - - +These are important characteristics of the *sampling distribution* of the sample proportion: the first observation relates to the shape of the distribution, the second to the center of the distribution, and the last one to the *sampling variation* and how it is affected by the sample size. These results are not coincidental or isolated to the example of sample proportions of red balls in a bowl. 
In the next subsection, a theoretical framework is introduced that helps explain, with precise mathematical equations, the behavior of sample proportions coming from random samples. ```{block, type="learncheck", purl=FALSE} @@ -582,7 +577,7 @@ Learning checks need to be updated. AV 8/17/23 ## Sampling framework {#sampling-framework} -In Section \@ref(sampling-activity) we gained some intuition about sampling and its characteristics. In this section we introduce some statistical definitions and terminology related to sampling. We conclude by introducing key characteristics that will be formally studied in the rest of the chapter. +In Section \@ref(sampling-activity), we gained some intuition about sampling and its characteristics. In this section, we introduce some statistical definitions and terminology related to sampling. We conclude by introducing key characteristics that will be formally studied in the rest of the chapter. ### Population, sample, and the sampling distribution {#terminology-and-notation} @@ -640,7 +635,7 @@ This process is called a trial or a Bernoulli trial in honor of Jacob Bernoulli, Getting a sample of 25 balls is running 25 trials and getting 25 numbers, ones or zeros, representing whether or not we have observed red balls on each trial. The average of these 25 numbers (zeros or ones) represents precisely the proportion of red balls in a sample of 25 balls. -It is useful to represent a trial as a random variable. We use the uppercase letter $X$ and the subscript $1$ as $X_1$ to denote the random variable for the first trial. After the first trial is completed, so the color of the first ball is observed, the value of $X_1$ is realized as 1 if the ball is red or 0 if the ball is white. For example, if the first ball is red, we write $X_1 = 1$. Similarly we use $X_2$ to represent the second trial. For example, if the second ball is white, $X_2$ is realized as $X_2=0$, and so on.
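The link between trials and sample proportions is just arithmetic, as this tiny sketch with invented 0/1 outcomes shows:

```{r}
# Sketch: a sample proportion is the mean of 0/1 Bernoulli outcomes.
trials <- c(1, 0, 0, 1, 1, 0, 1, 0, 0, 0)  # 1 = red ball, 0 = white ball
sum(trials) / length(trials)               # proportion of red balls: 0.4
mean(trials)                               # the same value: 0.4
```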
$X_1$, $X_2$, $\dots$ are random variables only before the trials have been performed. After the trials, they are just the *ones* or *zeros* representing red or white balls, respectively. +It is useful to represent a trial as a random variable. We use the uppercase letter $X$ and the subscript $1$ as $X_1$ to denote the random variable for the first trial. After the first trial is completed, that is, once the color of the first ball is observed, the value of $X_1$ is realized as 1 if the ball is red or 0 if the ball is white. For example, if the first ball is red, we write $X_1 = 1$. Similarly, we use $X_2$ to represent the second trial. For example, if the second ball is white, $X_2$ is realized as $X_2=0$, and so on. $X_1$, $X_2$, $\dots$ are random variables only before the trials have been performed. After the trials, they are just the *ones* or *zeros* representing red or white balls, respectively. Moreover, since our experiment is to perform 25 trials and then find the average of them, this average or mean, before the trials are carried out, can also be expressed as a random variable: @@ -727,7 +722,7 @@ virtual_prop_red_25 |> summarize(E_Xbar_25 = mean(prop_red)) ``` -The variable `prop_red` in data frame `virtual_prop_red_25` contains the sample proportions for each of the 1000 samples taken. The average of these sample proportion is presented as object `E_Xbar_25` which represents the estimated expected value of $\overline X$, by using the average of the 1000 sample proportions. Each of the sample proportions is calculated from random samples of 25 balls from the bowl. This average happens to be precisely the same as the population proportion. +The variable `prop_red` in data frame `virtual_prop_red_25` contains the sample proportions for each of the 1000 samples taken. The average of these sample proportions is presented as object `E_Xbar_25` which represents the estimated expected value of $\overline X$, by using the average of the 1000 sample proportions.
Each of the sample proportions is calculated from random samples of 25 balls from the bowl. This average happens to be precisely the same as the population proportion. It is worth spending a moment understanding this result. If we take one random sample of a given size, we know that the sample proportion from this sample would be somewhat different than the population proportion due to sampling variation; however, if we take many random samples of the same size, the average of the sample proportions is expected to be about the same as the population proportion. @@ -843,7 +838,7 @@ bowl |> summarize(p = mean(prop_red), SE_Xbar = sd(prop_red)) ``` -Observe that `p` is the estimated expected value and `SE_Xbar` is the estimated standard error based on the simulation of taking sample proportions for random samples of size $n=100$. Compare this value with the standard deviation for the entire bowl, discovered earlier. It is one tenth the size! This is not a coincidence: the standard error of $\overline X$ is equal to the standard deviation of the population (the bowl) divided by the square root of the sample size. In the case of sample proportions, the standard error of $\overline X$ can also be determined using the formula: +Observe that `p` is the estimated expected value and `SE_Xbar` is the estimated standard error based on the simulation of taking sample proportions for random samples of size $n=100$. Compare this value with the standard deviation for the entire bowl, discovered earlier. It is one-tenth the size! This is not a coincidence: the standard error of $\overline X$ is equal to the standard deviation of the population (the bowl) divided by the square root of the sample size. In the case of sample proportions, the standard error of $\overline X$ can also be determined using the formula: $$SE(\overline X) = \sqrt{\frac{p(1-p)}{n}}$$ where $p$ is the population proportion and $n$ is the size of our sample.
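As a quick simulation sketch of this formula, using base R's `rbinom()` with an illustrative population proportion rather than the `bowl` data frame itself:

```{r}
# Sketch: compare the simulated standard error of 1000 sample proportions
# with the formula sqrt(p * (1 - p) / n).
set.seed(76)
p <- 0.375  # illustrative population proportion of red balls
n <- 100    # sample size

prop_red <- replicate(1000, mean(rbinom(n, size = 1, prob = p)))

mean(prop_red)         # close to p
sd(prop_red)           # simulated standard error
sqrt(p * (1 - p) / n)  # theoretical standard error, about 0.048
```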
This formula shows that the standard error is inversely proportional to the square root of the sample size: as the sample size increases, the standard error decreases. In our example, the standard error is @@ -864,7 +859,7 @@ virtual_prop_red_50 |> summarize(SE_Xbar_100 = sd(prop_red)) ``` -The standard errors for these examples, based on the proportion of red balls in the bowl and the sample sizes, are given below: +The standard errors for these examples, based on the proportion of red balls in the bowl and the sample sizes, are given: ```{r} sqrt(p * (1 - p) / 25) @@ -1032,7 +1027,7 @@ almonds_bowl num_pop_almonds <- length(almonds_bowl$weight) ``` -The first variable `ID` represents a virtual ID number given to each almond, and the variable `weight` contains the weight in grams for each almond in the bowl. The **population mean** weight of almonds, a population parameter, can be calculated in R using again the `dplyr` data wrangling verbs presented in Chapter \@ref(wrangling). Observe, in particular, inside the function `summarize()` the use of the `mean()`, `sd()`, and `n()` functions for the mean weight, the weight's standard deviation[^2], and the number of almonds in the bowl: +The first variable `ID` represents a virtual ID number given to each almond, and the variable `weight` contains the weight in grams for each almond in the bowl. The **population mean** weight of almonds, a population parameter, can be calculated in R using, once again, the `dplyr` data-wrangling verbs presented in Chapter \@ref(wrangling). Observe, in particular, inside the function `summarize()` the use of the `mean()`, `sd()`, and `n()` functions for the mean weight, the weight's standard deviation[^2], and the number of almonds in the bowl: [^2]: As explained earlier, this function produces the sample standard deviation, which divides the sum of squared deviations by $n-1$ instead of $n$, but for a list of 5000 observations the difference is not relevant, for practical purposes.
@@ -1076,7 +1071,7 @@ Let's now take a random sample of 25 almonds, as shown in Figure \@ref(fig:twent include_graphics("images/sampling/almonds/twenty-five-almonds.png") ``` -Since the total weight is 88.6 grams, as shown in the Figure \@ref(fig:twenty-five-almonds), the sample mean weight will be $88.6/25 = 3.544$. The `moderndive` \index{R packages!moderndive!almonds\_sample} package contains the information of this sample in the `almonds_sample` data frame. Here, we present the weight of the first 10 almonds in the sample: +Since the total weight is 88.6 grams, as shown in Figure \@ref(fig:twenty-five-almonds), the sample mean weight will be $88.6/25 = 3.544$. The `moderndive` \index{R packages!moderndive!almonds\_sample} package contains the information of this sample in the `almonds_sample` data frame. Here, we present the weight of the first 10 almonds in the sample: ```{r} almonds_sample @@ -1094,7 +1089,7 @@ ggplot(almonds_sample, aes(x = weight)) + geom_histogram(binwidth = 0.1, color = "white") ``` -The weights of almonds in this sample range from `r min(almonds_sample$weight)` to `r max(almonds_sample$weight)` grams. There is not an obvious pattern in the distribution of this sample. We now compute the sample mean using our data wrangling tools from Chapter \@ref(wrangling). +The weights of almonds in this sample range from `r min(almonds_sample$weight)` to `r max(almonds_sample$weight)` grams. There is not an obvious pattern in the distribution of this sample. We now compute the sample mean using our data-wrangling tools from Chapter \@ref(wrangling). ```{r} almonds_sample |> summarize(sample_mean_weight = mean(weight)) @@ -1155,7 +1150,7 @@ Once again, we use random variable to formalize our understanding of the sample Getting a sample of 25 almonds is performing these trials 25 times to get 25 weights. Then, the average of these 25 numbers is the average weight or mean weight of a sample of 25 almonds. This is what we call a sample mean. 
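A simulation sketch of this experiment, using base R's `rnorm()` with illustrative values $\mu = 3.64$ and $\sigma = 0.392$ grams in place of drawing from the actual `almonds_bowl`:

```{r}
# Sketch: the distribution of 1000 sample means of n = 25 almond weights.
set.seed(76)
mu <- 3.64      # illustrative population mean weight (grams)
sigma <- 0.392  # illustrative population standard deviation (grams)

xbar <- replicate(1000, mean(rnorm(25, mean = mu, sd = sigma)))

mean(xbar)  # close to mu
sd(xbar)    # close to sigma / sqrt(25) = 0.0784
```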
Using random variables, we now let uppercase $X_1$ be the random variable that represents the weight of the first almond before it has been selected, $X_2$ the weight of the second almond, and so on. These are random variables because they can take any possible almond weight value from the bowl. -After the first trial is completed, the value of $X_1$ is realized as the weight in grams of the first almond selected. We can represent this value by the lowercase $x_1$ as it is no longer a random variable but a number and we can write $X_1 = x_1$. After the second trial is completed, $X_2 = x_2$, where lowercase $x_2$ is the observe almond weight, and so on. +After the first trial is completed, the value of $X_1$ is realized as the weight in grams of the first almond selected. We can represent this value by the lowercase $x_1$ as it is no longer a random variable but a number and we can write $X_1 = x_1$. After the second trial is completed, $X_2 = x_2$, where lowercase $x_2$ is the observed almond weight, and so on. Since our experiment is to perform 25 trials and then find the average of them, this average or mean, before the trials are carried out, can also be expressed as a random variable: $$\overline X = \frac{X_1+X_2+\dots+X_{25}}{25}.$$ @@ -1344,7 +1339,7 @@ comparing_n_table |> In summary: 1. The estimated expected value was either 3.65 or 3.64. This is either near or equal to $\mu = 3.64$, the population mean weight of almonds in the entire bowl. -2. The standard error decreases when the sample size increases. If we focus on the result for $n=100$, the standard error was `r pull(comparing_n_table[3, 3])`. When compared with the population standard deviation $\sigma = 0.392$ this standard error is about one tenth the value of $\sigma$. Similarly when $n=25$ the standard error `r pull(comparing_n_table[1, 3])` is about one fifth the value of $\sigma$ and that pattern also holds when $n=50$.
As was the case for the sample proportion, the standard error is inversely proportional to the squared sample size used to construct the distribution of $\overline X$. This is also a theoretical result that can be expressed as: +2. The standard error decreases when the sample size increases. If we focus on the result for $n=100$, the standard error was `r pull(comparing_n_table[3, 3])`. When compared with the population standard deviation $\sigma = 0.392$ this standard error is about one-tenth the value of $\sigma$. Similarly, when $n=25$ the standard error `r pull(comparing_n_table[1, 3])` is about one-fifth the value of $\sigma$ and that pattern also holds when $n=50$. As was the case for the sample proportion, the standard error is inversely proportional to the square root of the sample size used to construct the distribution of $\overline X$. This is also a theoretical result that can be expressed as: $$SE_{\overline X} = \frac{\sigma}{\sqrt {n}}$$ where $n$ is the sample size and $\sigma$ is the population standard deviation. @@ -1451,7 +1446,7 @@ In Sections \@ref(sampling-activity), \@ref(central-limit-theorem), and \@ref(sa ### Sampling distribution for two samples -Assume that we would like to compare the parameters of two populations, for example the means or proportion of those populations. To do this a random sample is taken from the first population and another random sample, independent from the first, is retrieved from the second population. Then we can use a statistic from each sample, such as the sample mean or sample proportion, and use them to produce sampling distributions that depend on two independent samples. We provide two examples to illustrate how the sampling distributions are affected. +Assume that we would like to compare the parameters of two populations, for example, the means or proportions of those populations.
To do this, a random sample is taken from the first population and another random sample, independent from the first, is retrieved from the second population. Then we can use a statistic from each sample, such as the sample mean or sample proportion, and use them to produce sampling distributions that depend on two independent samples. We provide two examples to illustrate how the sampling distributions are affected.

#### Difference in sample means {-}

@@ -1510,7 +1505,7 @@ We can use R to produce the required virtual samples and differences. We then us

Our sampling exercise has again two components. First, we take a random sample of $n_1 = 50$ balls from the bowl of red balls and calculate the sample proportion of red balls. As we did before, we let $\overline X_1$ represent the possible values that the sample proportion can take for each possible sample. Recall that $\overline X_1$ is the sample proportion in this context, which is also the sample mean of Bernoulli trials. Second, we let $n_2 = 60$ represent the sample size used for samples from the almonds' bowl and the random variable $\overline X_2$ represent the possible values that the sample proportion of almonds greater than 3.8 grams can take for each possible sample.

-To compare these two sample proportions, we find the difference, $\overline X_1 - \overline X_2$. The distribution of $\overline X_1 - \overline X_2$ is the sampling distribution of the difference in sample proportions. We use virtual sampling to approximate this distribution. We use the `rep_slice_sample` and `summarize` functions to produce the random samples and the necessary sample proportions, respectively. A total of 1000 random samples and sample proportions are acquired from each bowl with the appropriate sample sizes, 50 and 60, respectively. 
Moreover, the `inner_join` function introduce in Section \@ref(joins) is used here to merge the sample proportions into a single data frame and the difference in these sample proportions is calculated for each replication. +To compare these two sample proportions, we find the difference, $\overline X_1 - \overline X_2$. The distribution of $\overline X_1 - \overline X_2$ is the sampling distribution of the difference in sample proportions. We use virtual sampling to approximate this distribution. We use the `rep_slice_sample` and `summarize` functions to produce the random samples and the necessary sample proportions, respectively. A total of 1000 random samples and sample proportions are acquired from each bowl with the appropriate sample sizes, 50 and 60, respectively. Moreover, the `inner_join` function introduced in Section \@ref(joins) is used here to merge the sample proportions into a single data frame and the difference in these sample proportions is calculated for each replication. 
```{r echo=-(1:3)} set.seed(76) @@ -1541,7 +1536,7 @@ ggplot(prop_joined, aes(x = prop_diff)) + labs(x = "Difference in sample proportions", title = "Histogram of 1000 differences in sample proportions") ``` -```{r samplingdistribution-virtual-diff-1000, echo=FALSE, fig.cap="The distribution of 1000 differences in sample proportions based on 1000 random samples of size 50 from the first bowl and 1000 random sample of size 60 from the second bowl.", purl=FALSE, fig.height=ifelse(knitr::is_latex_output(), 5.5, 7),} +```{r samplingdistribution-virtual-diff-1000, echo=FALSE, fig.cap="The distribution of 1000 differences in sample proportions based on 1000 random samples of size 50 from the first bowl and 1000 random samples of size 60 from the second bowl.", purl=FALSE, fig.height=ifelse(knitr::is_latex_output(), 5.5, 7),} virtual_histogram_diff <- ggplot(prop_joined, aes(x = prop_diff)) + geom_histogram(binwidth = 0.04, boundary = 0, color = "white") virtual_histogram_diff + @@ -1641,7 +1636,7 @@ if(!is_latex_output()) { ```{r echo=FALSE, results='asis', purl=FALSE} if (is_latex_output()) - cat("Check the online version of the book for a table that also includes the sampling distribution of each of these statistics using the Central Limit Theorem.") + cat("Check the online version of the book for a table that also includes the sampling distribution of each of these statistics as shown in Table \\@ref(tab:table-ch7) using the Central Limit Theorem.") ``` diff --git a/08-confidence-intervals.Rmd b/08-confidence-intervals.Rmd index b8a3e46d5..bd958f292 100755 --- a/08-confidence-intervals.Rmd +++ b/08-confidence-intervals.Rmd @@ -59,7 +59,7 @@ Similarly, when sampling chocolate-covered almonds and getting the sample mean w - the expected value of the sample mean is the population mean, and - the standard error of the sample mean is the standard deviation of the population divided by the square root of the sample size. 
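The two bullet points above can be checked with a quick simulation. The sketch below is ours, not from the text: it uses a hypothetical normal population as a stand-in for the bowl (not the book's `almonds_bowl` data) and only base R.

```{r}
# A sketch, not from the text: draw many samples from a hypothetical
# population and compare the simulated mean and SD of the sample means
# with the theoretical values mu and sigma / sqrt(n).
set.seed(1)
population <- rnorm(5000, mean = 3.64, sd = 0.392)  # stand-in for the bowl
n <- 100
xbars <- replicate(1000, mean(sample(population, size = n)))
mean(xbars)                # close to mean(population)
sd(xbars)                  # close to sd(population) / sqrt(n)
sd(population) / sqrt(n)
```

The seed, population size, and number of replications are arbitrary choices for illustration; any sufficiently large values show the same pattern.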
-Moreover, these characteristics also apply to sampling distributions for the difference of sample means, the difference of sample proportions, and others. Recall that the sampling distribution is not restricted by the distribution of the population. As long as the samples taken are fairly large and we use the appropriate standard error, we can generalize these results appropriately. +Moreover, these characteristics also apply to sampling distributions for the difference in sample means, the difference in sample proportions, and others. Recall that the sampling distribution is not restricted by the distribution of the population. As long as the samples taken are fairly large and we use the appropriate standard error, we can generalize these results appropriately. The study of the sampling distribution is motivated by another question we have not yet answered: how can we determine the average weight of all the almonds if we do not have access to the entire bowl? We have seen by using simulations in Chapter \@ref(sampling) that the average of the sample means, derived from many random samples, will be fairly close to the expected value of the sample mean, which is precisely the population mean weight. @@ -192,7 +192,7 @@ sigma <- pop_sd(almonds_bowl$weight) ### Revisiting the almond activity for estimation {#revisit-almond} -In Chapter \@ref(sampling) one of the activities was to take many random samples of size 100 from a bowl of 5,000 chocolate-covered almonds. Since we have access to the contents of the entire bowl, we can compute the population parameters: +In Chapter \@ref(sampling), one of the activities was to take many random samples of size 100 from a bowl of 5,000 chocolate-covered almonds. 
Since we have access to the contents of the entire bowl, we can compute the population parameters: ```{r} almonds_bowl |> @@ -221,7 +221,7 @@ almonds_sample_100 |> In one of the activities performed in Chapter \@ref(sampling) we took many random samples, calculated their sample means, constructed a histogram using these sample means, and showed how the histogram is a good approximation of the sampling distribution of the sample mean. We redraw Figure \@ref(fig:sample-mean-100-with-normal) here as Figure \@ref(fig:sample-mean-100-with-normal-redraw). -```{r sample-mean-100-with-normal-redraw, echo=FALSE, fig.height=ifelse(knitr::is_latex_output(), 2.2, 4), fig.cap="The distribution of the sample mean.", purl=FALSE} +```{r sample-mean-100-with-normal-redraw, echo=FALSE, fig.height=ifelse(knitr::is_latex_output(), 1.9, 4), fig.cap="The distribution of the sample mean.", purl=FALSE} num_almonds_sample <- length(almonds_sample_100$weight) xbar <- mean(almonds_sample_100$weight) if (!file.exists("rds/virtual_mean_weight_100.rds")) { @@ -310,7 +310,7 @@ The density curve drawn with a solid line has the same mean as the one drawn wit #### The standard normal distribution {-} -A special normal distribution has mean $\mu$ = 0 and standard deviation $\sigma$ = 1. It is called the *standard normal distribution*, and it is represented by a density curve called the *$z$-curve*\index{distribution!standard normal}. If a random variable $Z$ follows the standard normal distribution, a realization of this random variable is called a standard value or $z$-value. The $z$-value also represents the number of standard deviations above the mean, if positive, or below the mean, if negative. For example, if $z=5$, the value observed represents a realization of the random variable $Z$ that is five standard deviation above the mean, $\mu = 0$. +A special normal distribution has mean $\mu$ = 0 and standard deviation $\sigma$ = 1. 
It is called the *standard normal distribution*, and it is represented by a density curve called the *$z$-curve*\index{distribution!standard normal}. If a random variable $Z$ follows the standard normal distribution, a realization of this random variable is called a standard value or $z$-value. The $z$-value also represents the number of standard deviations above the mean, if positive, or below the mean, if negative. For example, if $z=5$, the value observed represents a realization of the random variable $Z$ that is five standard deviations above the mean, $\mu = 0$. #### Linear transformations of random variables that follow the normal distribution {-} @@ -322,9 +322,9 @@ A property of the normal distribution is that any linear transformation of a ran $$z = \frac{x - \mu}{\sigma} = \frac{11 - 5}{2} = 3$$ is the corresponding value in a standard normal curve. Moreover, we have determined that $x = 11$ for this example is precisely $3$ standard deviations above the mean. -#### Finding probabilies under a density curve {-} +#### Finding probabilities under a density curve {-} -When a random variable can be represented by a density curve, the probability that the random variable takes a value in any given interval (on the X-axis) is equal to the area under the density curve for that interval. If we know the equation that represents the density curve, we could use the mathematical technique from calculus known as integration to determine this area. In the case of the normal curve, the integral for any interval does not have a close form solution, and the solution is calculated using numerical approximations. +When a random variable can be represented by a density curve, the probability that the random variable takes a value in any given interval (on the X-axis) is equal to the area under the density curve for that interval. 
If we know the equation that represents the density curve, we could use the mathematical technique from calculus known as integration to determine this area. In the case of the normal curve, the integral for any interval does not have a closed-form solution, and the solution is calculated using numerical approximations.

```{r echo=FALSE, results="asis", purl=FALSE}
if(!is_latex_output())

@@ -333,7 +333,7 @@

We assume that a random variable $Z$ follows a standard normal distribution. We would like to know how likely it is for this random variable to take a value that is within one standard deviation from the mean. Equivalently, what is the probability that the observed value $z$ (on the X-axis) is between -1 and 1 as shown in Figure \@ref(fig:normal-curve-shaded-1a)?

-```{r normal-curve-shaded-1a, echo=FALSE, fig.height=ifelse(knitr::is_latex_output(), 1.5, 4), fig.width=3, fig.cap="Normal area within one standard deviation."}
+```{r normal-curve-shaded-1a, echo=FALSE, fig.height=ifelse(knitr::is_latex_output(), .9, 4), fig.width=3, fig.cap="Normal area within one standard deviation."}
ggplot(data = data.frame(x = c(-4, 4)), aes(x)) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1)) +
  geom_area(stat = "function", fun = dnorm, fill = "grey100", xlim = c(-4, -1)) +
@@ -344,9 +344,9 @@ ggplot(data = data.frame(x = c(-4, 4)), aes(x)) +
  scale_x_continuous(breaks = c(-1,1))
```

-Calculations show that this area is 0.6827 or about 68.27% of the total area under the curve. This is equivalent to saying that the probability of getting a value between $-1$ and 1 on a standard normal is 68.27%. This also means that if a random variable representing an experiment follows a normal distribution, the probability that the outcome of this experiment is within one standard deviation from the mean is 68.27%. Similarly, the area under the standard normal density curve between -2 and 2 is shown in Figure \@ref(fig:normal-curve-shaded-2a). 
+Calculations show that this area is 0.6827 (68.27%) of the total area under the curve. This is equivalent to saying that the probability of getting a value between $-1$ and 1 on a standard normal is 68.27%. This also means that if a random variable representing an experiment follows a normal distribution, the probability that the outcome of this experiment is within one standard deviation from the mean is 68.27%. Similarly, the area under the standard normal density curve between -2 and 2 is shown in Figure \@ref(fig:normal-curve-shaded-2a).

-```{r normal-curve-shaded-2a, echo=FALSE, fig.height=ifelse(knitr::is_latex_output(), 1.5, 4), fig.width=3, fig.cap="Normal area within two standard deviations."}
+```{r normal-curve-shaded-2a, echo=FALSE, fig.height=ifelse(knitr::is_latex_output(), 0.9, 4), fig.width=3, fig.cap="Normal area within two standard deviations."}
ggplot(data = data.frame(x = c(-4, 4)), aes(x)) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1)) +
  geom_area(stat = "function", fun = dnorm, fill = "grey100", xlim = c(-4, -2)) +
@@ -362,10 +362,10 @@ Calculations show that this area is equal to 0.9545 or 95.45%. If a random varia
  if(!is_latex_output())
    cat('Please see [Appendix A online](https://moderndive.com/v2/appendixa) to produce these or other calculations in R. ')
```
-The result is that the area under the density curve around the mean that is exactly equal to 0.95, or 95%, is the area within 1.96 standard deviation from the mean. Remember this number as it will be used a few times in future sections.
+The result is that the area under the density curve around the mean that is exactly equal to 0.95, or 95%, is the area within 1.96 standard deviations from the mean. Remember this number as it will be used a few times in later sections. 
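These areas can be reproduced with `pnorm()`, which returns the area under the standard normal curve to the left of a given value (a quick sketch of our own; the book points to Appendix A online for these and other calculations):

```{r}
# The area between -a and a under the standard normal curve is
# pnorm(a) - pnorm(-a).
pnorm(1) - pnorm(-1)        # about 0.6827, within one standard deviation
pnorm(2) - pnorm(-2)        # about 0.9545, within two standard deviations
pnorm(1.96) - pnorm(-1.96)  # about 0.95, within 1.96 standard deviations
```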
In summary, if the possible outcomes of an experiment can be expressed as a random variable that follows the normal distribution, the probability of getting a result that is within one standard deviation from the mean is about 68.27%, within 2 standard deviations from the mean is 95.45%, within 1.96 standard deviations from the mean is 95%, and within 3 standard deviations from the mean is about 99.73%, to name a few.

-Spend a few moments grasping this idea; observe, for example, that it is almost impossible to observe an outcome represented by a number that is five standard deviations above the mean as the chances of that happening are near zero. We are now ready to return to the our main goal: how to find an interval estimate of the population mean based on a single sample.
+Spend a few moments grasping this idea; observe, for example, that it is almost impossible to observe an outcome represented by a number that is five standard deviations above the mean as the chances of that happening are near zero. We are now ready to return to our main goal: how to find an interval estimate of the population mean based on a single sample.

```{block, type="learncheck", purl=FALSE}
\vspace{-0.15in}

@@ -437,7 +437,7 @@

deviance <- sample_mean - mu
z_almond <- deviance / se_xbar
```

-The horizontal axis (X-axis) represents the sample means that we can determine from all the possible random samples of 100 almonds. The `r ifelse(is_html_output(), "red", "left")` dot represents the expected value of the sampling distribution, $\mu = 3.64$, located on the X-axis at the center of the distribution. The density curve's height can be thought of as how likely those sample means are to be observed. For example, it is more likely to get a random sample with a sample mean around $`r mu`$ grams (which corresponds to the highest point of the curve) than it is to get a sample with a sample mean at around $3.5$ grams, since the curve's height is almost zero at that value. 
The `r ifelse(is_html_output(), "blue", "right")` dot is the sample mean from our sample of 100 almonds, $\overline{x} = `r sample_mean`$ grams. It is located `r deviance` grams above the population mean weight. How far is `r deviance` grams? It is helpful to express this distance in standardized values:
+The horizontal axis (X-axis) represents the sample means that we can determine from all the possible random samples of 100 almonds. The `r ifelse(is_html_output(), "red", "left")` dot represents the expected value of the sampling distribution, $\mu = 3.64$, located on the X-axis at the center of the distribution. The density curve's height can be thought of as how likely those sample means are to be observed. For example, it is more likely to get a random sample with a sample mean around $`r mu`$ grams (which corresponds to the highest point of the curve) than it is to get a sample with a sample mean of around $3.5$ grams, since the curve's height is almost zero at that value. The `r ifelse(is_html_output(), "blue", "right")` dot is the sample mean from our sample of 100 almonds, $\overline{x} = `r sample_mean`$ grams. It is located `r deviance` grams above the population mean weight. How far is `r deviance` grams? It is helpful to express this distance in standardized values:

$$\frac{`r sample_mean` - `r mu`}{`r se_xbar`} = `r z_almond`$$

@@ -493,7 +493,7 @@

Since 1.96 standard errors were used in the construction of this interval, we call this a 95% confidence interval. A confidence interval can be viewed as an interval estimator of the population mean. Compare an interval estimator with the sample mean, which is a point estimator. The latter estimates the parameter with a single number; the former provides an entire interval to account for the location of the parameter. An apt analogy involves fishing. Imagine that there is a single fish swimming in murky water. 
The fish is not visible but its movement produces ripples on the surface that can provide some limited information about the fish's location. To capture the fish, one could use a spear or a net. Because the information is limited, throwing the spear at the ripples may capture the fish but likely will miss it. -Throwing a net around the ripples, on the other hand, may give a much higher likelihood of capturing the fish. Using the sample mean only to estimate the population mean is like throwing a spear at the ripples in the hopes of capturing the fish. Constructing a confidence interval that may include the population mean is like throwing a net to surround the ripples. Keep this analogy in mind, as we will revisit it in future sections. +Throwing a net around the ripples, on the other hand, may give a much higher likelihood of capturing the fish. Using the sample mean only to estimate the population mean is like throwing a spear at the ripples in the hopes of capturing the fish. Constructing a confidence interval that may include the population mean is like throwing a net to surround the ripples. Keep this analogy in mind, as we will revisit it in later sections. ```{block, type="learncheck", purl=FALSE} \vspace{-0.15in} @@ -623,14 +623,13 @@ More importantly, the confidence interval constructed here contains the populati We can summarize the results so far: - If the size used for your random sample is large enough, the sampling distribution of the sample mean follows, approximately, the normal distribution. -- Using the sample mean observed and the standard error of the sampling distribution, we can construct 95% confidence intervals for the population mean. The formula for these intervals is given by +- Using the sample mean observed and the standard error of the sampling distribution, we can construct 95% confidence intervals for the population mean. 
The formula for these intervals (where $n$ is the sample size used) is given by

-$$\left(\overline{x} - 1.96 \frac{\sigma}{\sqrt{n}},\quad \overline{x} + 1.96 \frac{\sigma}{\sqrt{n}}\right)$$
-where $n$ is the sample size used.
+$$\left(\overline{x} - 1.96 \frac{\sigma}{\sqrt{n}},\quad \overline{x} + 1.96 \frac{\sigma}{\sqrt{n}}\right).$$

- When the population standard deviation is unknown (which is almost always the case), the sample standard deviation is used to estimate the standard error. This produces additional variability and the standardized values follow a $t$ distribution with $n-1$ degrees of freedom. The formula for 95% confidence intervals when the sample size is $n=100$ is given by

-$$\left(\overline{x} - 1.98 \frac{s}{\sqrt{100}},\quad \overline{x} + 1.98 \frac{s}{\sqrt{100}}\right)$$
+$$\left(\overline{x} - 1.98 \frac{s}{\sqrt{100}},\quad \overline{x} + 1.98 \frac{s}{\sqrt{100}}\right).$$

- The method to construct 95% confidence intervals guarantees that in the long run, for 95% of the possible samples, the intervals determined will include the population mean. It also guarantees that 5% of the possible samples will lead to intervals that do not include the population mean.
- As we have constructed intervals with a 95% level of confidence, we can construct intervals with any level of confidence. The only change in the equations will be the number of standard errors needed.

@@ -642,7 +641,7 @@ We have used the sample `almonds_sample_100`, constructed a 95% confidence inter

(ref:almond-mean-ci) One hundred 95% confidence intervals and whether the population mean is captured in each. 
-```{r almond-mean-cis, fig.cap="(ref:almond-mean-ci)", echo=FALSE, fig.height=ifelse(knitr::is_latex_output(), 4.2, 5), purl=FALSE}
+```{r almond-mean-cis, fig.cap="(ref:almond-mean-ci)", echo=FALSE, fig.height=ifelse(knitr::is_latex_output(), 5, 5), purl=FALSE}
set.seed(202)
# Compute data frame with sampled data, sample means, and ci
almond_mean_cis <- almonds_bowl |>
@@ -677,14 +676,14 @@ Note that each interval was built using a different random sample. The `r ifelse

This result motivates the meaning of a 95% confidence interval: If you could construct intervals using the procedure described earlier for every possible random sample, then 95% of these intervals will include the population mean and 5% of them will not.

-Of course, in most situations it would be impractical or impossible to take every possible random sample. Still, for a large number of random samples, this result is approximately correct. In Figure \@ref(fig:almond-mean-cis), for example, 5 out of 100 confidence intervals do not include the population mean, and 95% do. It won't always match up perfectly like this, but the proportions should match pretty close to the confidence level chosen.
+Of course, in most situations, it would be impractical or impossible to take every possible random sample. Still, for a large number of random samples, this result is approximately correct. In Figure \@ref(fig:almond-mean-cis), for example, 5 out of 100 confidence intervals do not include the population mean, and 95% do. It won't always match up perfectly like this, but the proportions should be pretty close to the confidence level chosen.

The term "95% confidence" invites us to think we are talking about probabilities or chances. Indeed we are, but in a subtle way. Before a random sample has been procured, there is a 95% chance that when a confidence interval is constructed using the prospective random sample, this interval will contain the population mean. 
The moment a random sample has been attained, the interval constructed either contains the population mean or it does not; this is now certain, and there is no longer a chance involved. This is true even if we do not know what the population mean is. So the 95% confidence refers to the method or process to be used on a prospective sample. We are confident that if we follow the process to construct the interval, 95% of the time the random sample attained will lead us to produce an interval that contains the population mean.

On the other hand, it would be improper to say that... "there is a 95% chance that the confidence interval contains the population mean." Looking at Figure \@ref(fig:almond-mean-cis), each of the confidence intervals either does or does not contain $\mu$. Once the confidence interval is determined, either the population mean is included or not.

-In the literature, this explanation has been encapsulated in a short-hand version: we are 95% confident that the interval contains the population parameter. For example, in Subsection \@ref(t-distribution-CI) the 95% confidence interval for the population mean weight of almonds was (`r round(lower_bound_t, 3)`, `r round(upper_bound_t, 3)`), and we would say: "We are 95% confident that the population mean weight of almonds is between `r round(lower_bound_t, 3)` and `r round(upper_bound_t, 3)` grams".
+In the literature, this explanation has been encapsulated in a short-hand version: we are 95% confident that the interval contains the population parameter. For example, in Subsection \@ref(t-distribution-CI) the 95% confidence interval for the population mean weight of almonds was (`r round(lower_bound_t, 3)`, `r round(upper_bound_t, 3)`), and we would say: "We are 95% confident that the population mean weight of almonds is between `r round(lower_bound_t, 3)` and `r round(upper_bound_t, 3)` grams." 
It is perfectly acceptable to use the short-hand statement, but always remember that the 95% confidence refers to the process, or method, and can be thought of as a chance or probability only before the random sample has been acquired. To further ensure that the probability-type of language is not misused, quotation marks are sometimes put around "95% confident" to emphasize that it is a short-hand version of the more accurate explanation. @@ -701,10 +700,12 @@ It is perfectly acceptable to use the short-hand statement, but always remember - C. Because it assumes that the sample size is always smaller when applying the $t$ distribution. - D. Because it accounts for the extra uncertainty that comes from using the sample standard deviation instead of the population standard deviation. +\newpage + **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What is the effect of increasing the degrees of freedom on the $t$ distribution? - A. The tails of the distribution become thicker. -- B. The tails of the distribution becomes thinner. +- B. The tails of the distribution become thinner. - C. The distribution does not change with degrees of freedom. - D. The distribution becomes skewed to the right. @@ -729,7 +730,7 @@ We mentioned earlier that the number 1.96 relates to a 95% confidence process bu - Since the normal distribution is symmetric, the area on each tail is $\alpha/2 = 0.05/2 = 0.025$. - We need the exact number of standard deviations that produces the shaded area. Since the center of a standard normal density curve is zero, as shown in Figure \@ref(fig:normal-curve-shaded-3a), and the normal curve is symmetric, the number of standard deviations can be represented by $-q$ and $q$, the same magnitude but one positive and the other negative. 
-```{r normal-curve-shaded-3a, echo=FALSE, fig.cap="Normal curve with the shaded middle area being 0.95", fig.height=ifelse(knitr::is_latex_output(), 2, 4), fig.width=3}
+```{r normal-curve-shaded-3a, echo=FALSE, fig.cap="Normal curve with the shaded middle area being 0.95", fig.height=ifelse(knitr::is_latex_output(), 1.5, 4), fig.width=3}
ggplot(data = data.frame(x = c(-4, 4)), aes(x)) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1)) +
  geom_area(stat = "function", fun = dnorm, fill = "grey100", xlim = c(-4, -1.96)) +
@@ -754,7 +755,7 @@

In R, the function `qnorm()` finds the value of $q$ when the area under this cur

qnorm(0.025)
```

-or 1.96 standard deviation below the mean. Similarly, the total area under the curve to the left of $q$ is the total shaded area, 0.95, plus the the small white area on the left tail, $0.025$, and $0.95 + 0.025 = 0.975$, so
+or 1.96 standard deviations below the mean. Similarly, the total area under the curve to the left of $q$ is the total shaded area, 0.95, plus the small white area on the left tail, $0.025$, and $0.95 + 0.025 = 0.975$, so

```{r}
qnorm(0.975)
```

@@ -805,7 +806,7 @@

$$1.96\cdot \frac{\sigma}{\sqrt{n}}.$$

If the sample size increases, the margin of error decreases in inverse proportion to the square root of the sample size. For example, if we secure a random sample of size 25, $1/\sqrt{25} = 0.2$, and if we draw a sample of size 100, $1/\sqrt{100} = 0.1$. By choosing a larger sample size, four times larger, we produce a confidence interval that is half the width. This result is worth considering.

-A confidence interval is an estimator of the parameter of interest, such as the population mean weight of almonds. Ideally we would like to build a confidence interval with a high level of confidence, for example 95% confidence. But we also want an interval that is narrow enough to provide useful information. 
For example, assume we get the following 95% confidence intervals for the population mean weight of almonds:
+A confidence interval is an estimator of the parameter of interest, such as the population mean weight of almonds. Ideally, we would like to build a confidence interval with a high level of confidence, for example, 95% confidence. But we also want an interval that is narrow enough to provide useful information. For example, assume we get the following 95% confidence intervals for the population mean weight of almonds:

- between 2 and 4 grams, or
- between 3.51 and 3.64 grams, or

@@ -817,7 +818,7 @@ We have concluded the theory-based approach to construct confidence intervals. I

## Estimation with the bootstrap {#simulation-based-CI}

-In 1979, Brad Efron published an article introducing a method called the bootstrap\index{bootstrap!statistical reference}[@Efron1979] that is next summarized. A random sample of size $n$ is taken from the population.
+In 1979, Brad Efron published an article introducing a method called the bootstrap \index{bootstrap!statistical reference}[@Efron1979] that is summarized next. A random sample of size $n$ is taken from the population.

This sample is used to find another sample, with replacement, also of size $n$. This is called *resampling with replacement*\index{resampling} and the resulting sample is called a *bootstrap sample*\index{bootstrap}. For example, if the original sample is $\{4,2,5,4,1,3,7,4,6,1\}$, one particular bootstrap sample could be $\{6, 4, 7, 4, 2, 7, 2, 5, 4, 1\}.$ Observe that the number 7 appears once in the original sample, but twice in the bootstrap sample; similarly, the number 3 in the original sample does not appear in the bootstrap sample. This is not uncommon for a bootstrap sample: some of the numbers in the original sample are repeated and others are not included. 
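The resampling step just described is a single call to base R's `sample()` with `replace = TRUE`. A minimal sketch of our own using the 10-number sample above (the seed is a hypothetical choice, only for reproducibility):

```{r}
# Resample the original sample with replacement to get one bootstrap sample.
original_sample <- c(4, 2, 5, 4, 1, 3, 7, 4, 6, 1)
set.seed(7)  # hypothetical seed, not from the text
one_boot <- sample(original_sample, size = length(original_sample),
                   replace = TRUE)
one_boot  # same size as the original; values may repeat or be left out
```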
@@ -828,7 +829,7 @@ This method takes advantage of the large number of bootstrap samples that can be

How many different bootstrap samples could we get from a single sample? A very large number, actually. If the original sample has 10 numbers, like the one shown above, each possible bootstrap sample of size 10 is determined by sampling 10 times with replacement, so the total number of bootstrap samples is $10^{10}$ or 10 billion different bootstrap samples. If the original sample has 20 numbers, the number of bootstrap samples is $20^{20}$, a number greater than the total number of stars in the universe. Even with modern powerful computers, it would be an onerous task to calculate every possible bootstrap sample. Instead, a thousand or so bootstrap samples are retrieved, similar to the simulations performed in Chapter \@ref(sampling), and this number is often large enough to provide useful results.

-Since Efron [@Efron1979] proposed the bootstrap, the statistical community embraced this method. During the 1980s and 1990s, many theoretical and empirical results were presented showing the strength of bootstrap methods. As an illustration, Efron [@Efron1979], Hall [@Hall1986], Efron and Tibshirani[@EfronTibshi1986], and Hall [@Hall1988] showed that bootstrapping was at least as good if not better than existent methods, when the goal was to estimate the standard error of an estimator or find the confidence intervals of a parameter. Modifications were proposed to improve the algorithm in situations where the basic method was not producing accurate results. With the continuous improvement of computing power and speed, and the advantages of having ready to use statistical software for its implementation, the use of the bootstrap has become more and more popular in many fields.
+Since Efron [@Efron1979] proposed the bootstrap, the statistical community embraced this method. 
During the 1980s and 1990s, many theoretical and empirical results were presented showing the strength of bootstrap methods. As an illustration, Efron [@Efron1979], Hall [@Hall1986], Efron and Tibshirani [@EfronTibshi1986], and Hall [@Hall1988] showed that bootstrapping was at least as good as, if not better than, existing methods when the goal was to estimate the standard error of an estimator or construct confidence intervals for a parameter. Modifications were proposed to improve the algorithm in situations where the basic method was not producing accurate results. With the continuous improvement of computing power and speed, and the advantages of having ready-to-use statistical software for its implementation, the use of the bootstrap has become more and more popular in many fields. As an illustration, if we are interested in the mean of the population, $\mu$, and we have collected one random sample, we can obtain a large number of bootstrap samples from this original sample, use them to calculate sample means, order the sample means from smallest to largest, and choose the interval that contains the middle 95% of these sample means. This is the simplest way to find a confidence interval based on the bootstrap. In the next few subsections, we explore how to incorporate this and similar methods to construct confidence intervals. @@ -837,7 +838,7 @@ As an illustration, if we are interested in the mean of the population, $\mu$, a To study and understand the behavior of bootstrap samples, we return to our example of the chocolate-covered almonds in a bowl. Recall that the bowl is considered the population of almonds, and we are interested in estimating the population mean weight of almonds. As we did before, we only have access to a single random sample. In this section, we use the data frame `almonds_sample_100`, a random sample of 100 almonds taken earlier. We call this the original sample, and it is used in this section to create the bootstrap samples.
-The first 10 rows are shown below: +The first 10 rows are shown: ```{r} almonds_sample_100 @@ -918,7 +919,7 @@ ggplot(almonds_sample_100, aes(x = weight)) + (ref:compare-plots) Comparing `weight` in the resampled `boot_sample` with the original sample `almonds_sample_100`. -```{r origandresample, echo=FALSE, fig.cap="(ref:compare-plots)", purl=FALSE, fig.height=ifelse(knitr::is_latex_output(), 3, 4)} +```{r origandresample, echo=FALSE, fig.cap="(ref:compare-plots)", purl=FALSE, fig.height=ifelse(knitr::is_latex_output(), 4.2, 4)} p1 <- ggplot(boot_sample, aes(x = weight)) + geom_histogram(binwidth = 0.1, color = "white") + labs(title = "Resample of 100 almonds' weights") + @@ -939,7 +940,7 @@ This is the typical behavior of bootstrap samples. They are samples that have be #### Many bootstrap samples: resampling multiple times {#replicates .unnumbered} -In this subsection we take full advantage of resampling with replacement by taking many bootstrap samples and study relevant information, such as the variability of their sample means. We can start by using the R syntax we used before, this time for `r n_manual_rep` replications. +In this subsection, we take full advantage of resampling with replacement by taking many bootstrap samples and study relevant information, such as the variability of their sample means. We can start by using the R syntax we used before, this time for `r n_manual_rep` replications. ```{r echo= -1} set.seed(20) @@ -959,7 +960,7 @@ boot_means Observe that `boot_means` has `r n_manual_rep` rows, corresponding to the `r n_manual_rep` bootstrap sample means. Furthermore, observe that the values of `mean_weight` vary as shown in Figure \@ref(fig:resampling-35). 
-```{r resampling-35, fig.cap="Distribution of 35 sample means from 35 bootrap samples."} +```{r resampling-35, fig.cap="Distribution of 35 sample means from 35 bootstrap samples."} ggplot(boot_means, aes(x = mean_weight)) + geom_histogram(binwidth = 0.01, color = "white") + labs(x = "sample mean weight in grams") @@ -990,7 +991,7 @@ boot_means The data frame `boot_means` contains `r n_virtual_resample` sample mean weights. Each is calculated from a different bootstrap sample and visualized in Figure \@ref(fig:one-thousand-sample-means). -```{r one-thousand-sample-means, message=FALSE, fig.cap="Histogram of 1000 bootstrap sample mean weights of almonds.", fig.height=ifelse(knitr::is_latex_output(), 3.85, 4)} +```{r one-thousand-sample-means, message=FALSE, fig.cap="Histogram of 1000 bootstrap sample mean weights of almonds.", fig.height=ifelse(knitr::is_latex_output(), 4, 4)} ggplot(boot_means, aes(x = mean_weight)) + geom_histogram(binwidth = 0.01, color = "white") + labs(x = "sample mean weight in grams") @@ -1087,7 +1088,7 @@ almonds_sample_100 |> summarize(stat = mean(weight)) ``` -If we want to use `infer` instead, we use the functions `specify()` and `calculate()`\index{R packages!infer!observed statistic shortcut} as shown below: +If we want to use `infer` instead, we use the functions `specify()` and `calculate()`\index{R packages!infer!observed statistic shortcut} as shown: ```{r, results='hide'} almonds_sample_100 |> @@ -1107,7 +1108,7 @@ We now illustrate the sequence of verbs necessary to construct a confidence inte #### 1. 
`specify` variables {-} -```{r infer-specify, out.width="40%", out.height="40%", echo=FALSE, fig.cap="Diagram of the specify() verb.", purl=FALSE} +```{r infer-specify, out.width="40%", echo=FALSE, fig.cap="Diagram of the specify() verb.", purl=FALSE} knitr::include_graphics("images/flowcharts/infer/specify.png") ``` @@ -1129,12 +1130,12 @@ almonds_sample_100 |> In the case of almonds, we only have a response variable and no explanatory variable of interest. Thus, we set the `x` on the right-hand side of the `~` to be `NULL`. -In cases where inference is focused on a single sample, as it is the almonds' weights example, either specification works. In examples we present in future sections, the `formula` specification is simpler and more flexible. In particular, this comes up in the upcoming Section \@ref(case-study-two-prop-ci) on comparing two proportions and Section \@ref(infer-regression) on inference for regression. +In cases where inference is focused on a single sample, as in the almonds' weights example, either specification works. In the examples we present in later sections, the `formula` specification is simpler and more flexible. In particular, this comes up in the upcoming Section \@ref(case-study-two-prop-ci) on comparing two proportions and Section \@ref(infer-regression) on inference for regression. #### 2. `generate` replicates {-} -```{r infer-generate, out.width="40%", out.height="40%", echo=FALSE, fig.cap="Diagram of generate() replicates.", purl=FALSE} +```{r infer-generate, out.width="40%", echo=FALSE, fig.cap="Diagram of generate() replicates.", purl=FALSE} knitr::include_graphics("images/flowcharts/infer/generate.png") ``` @@ -1179,11 +1180,11 @@ almonds_sample_100 |> almonds_sample_100 |> #### 3.
`calculate` summary statistics {-} -```{r infer-calculate, out.width="50%", out.height="50%", echo=FALSE, fig.cap="Diagram of calculate() summary statistics.", purl=FALSE} +```{r infer-calculate, out.width="50%", echo=FALSE, fig.cap="Diagram of calculate() summary statistics.", purl=FALSE} knitr::include_graphics("images/flowcharts/infer/calculate.png") ``` -After we `generate()` `r n_virtual_resample` bootstrap samples, we want to summarize each of them, for example, by calculating the sample mean of each one of them. As the diagram shows, the `calculate()` \index{R packages!infer!calculate()} function does this. +After we `generate()` `r n_virtual_resample` bootstrap samples, we want to summarize each of them, for example, by calculating its sample mean. As Figure \@ref(fig:infer-calculate) shows, the `calculate()` \index{R packages!infer!calculate()} function does this. In our example, we calculate the mean `weight` for each bootstrap sample by setting the `stat` argument equal to `"mean"` inside the `calculate()` function. The `stat` argument can be used for other common summary statistics such as `"median"`, `"sum"`, `"sd"` (standard deviation), and `"prop"` (proportion). To see a list of other possible summary statistics you can use, type `?calculate` and read the help file. @@ -1229,7 +1230,7 @@ almonds_sample_100 |> almonds_sample_100 |> knitr::include_graphics("images/flowcharts/infer/visualize.png") ``` -The `visualize()` \index{R packages!infer!visualize()} verb provides a quick way to visualize the bootstrap distribution as a histogram of the numerical `stat` variable's values. The pipeline of the main `infer` verbs used for exploring bootstrap distribution results is shown in Figure \@ref(fig:infer-visualize).
+The `visualize()` \index{R packages!infer!visualize()} verb provides a quick way to visualize the bootstrap distribution as a histogram of the numerical `stat` variable's values as shown in Figure \@ref(fig:bootstrap-distribution-infer). The pipeline of the main `infer` verbs used for exploring bootstrap distribution results is shown in Figure \@ref(fig:infer-visualize). ```{r bootstrap-distribution-infer, fig.height=ifelse(is_latex_output(), 1.7, 4) , fig.cap="Bootstrap distribution.", purl=FALSE} visualize(bootstrap_means) @@ -1243,8 +1244,8 @@ visualize(bootstrap_means) ggplot(bootstrap_means, aes(x = stat)) + geom_histogram() ``` -The `visualize()` function can take many other arguments to customize the plot further. In future sections we take advantage of this flexibility. In addition, it works with helper functions to add shading of the histogram values corresponding to the confidence interval values. -We have introduced the different elements on the `infer` workflow for constructing a bootstrap distribution and visualizing it. A summary of these steps is presented in Figure \@ref(fig:infer-workflow-ci). +The `visualize()` function can take many other arguments to customize the plot further. In later sections, we take advantage of this flexibility. In addition, it works with helper functions to add shading of the histogram values corresponding to the confidence interval values. +We have introduced the different elements of the `infer` workflow for constructing a bootstrap distribution and visualizing it. A summary of these steps is presented in Figure \@ref(fig:infer-workflow-ci). 
```{r infer-workflow-ci, out.width="80%", echo=FALSE, fig.cap="infer package workflow for confidence intervals.", purl=FALSE} knitr::include_graphics("images/flowcharts/infer/ci_diagram.png") ``` @@ -1411,18 +1412,18 @@ Disadvantage: Only works for estimators where the sample standard deviation can \vspace{-0.1in} ``` -**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Construct a 95% confidence interval for the *median* weight of *all* almonds. Use the percentile method. Would it be appropriate to also use the standard-error method? +**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Construct a 95% confidence interval for the *median* weight of *all* almonds with the percentile method. Is it appropriate to also use the standard-error method? -**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What are the advantages of using the `infer` package for constructing confidence intervals? +**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What are the advantages of using `infer` for building confidence intervals? **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What is the main purpose of bootstrapping in statistical inference? - A. To visualize data distributions and identify outliers. -- B. To generate multiple samples from the original data for estimating population parameters and constructing confidence intervals. +- B. To generate multiple samples from the original data for estimating parameters. - C. To replace missing data points with the mean of the dataset. - D. To validate the assumptions of a regression model. -**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Which function in the `infer` package is primarily used to denote the variables of interest for statistical inference? +**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Which function denotes the variables of interest for inference? - A. `rep_sample_n()` - B.
`calculate()` @@ -1431,10 +1432,10 @@ Disadvantage: Only works for estimators where the sample standard deviation can **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What is a key difference between the percentile method and the standard error method for constructing confidence intervals using bootstrap samples? -- A. The percentile method requires the population standard deviation, while the standard error method does not. -- B. The percentile method uses the middle 95% of bootstrap sample statistics, while the standard error method uses an estimated standard error from bootstrap samples. -- C. The standard error method always results in a narrower confidence interval than the percentile method. -- D. The percentile method requires more bootstrap samples than the standard error method. +- A. The percentile method requires the population standard deviation. +- B. The percentile method uses the middle 95% of bootstrap statistics, while the standard error method relies on the estimated standard error. +- C. The standard error method always results in a narrower confidence interval. +- D. The percentile method requires more bootstrap samples. ```{block, type="learncheck", purl=FALSE} \vspace{-0.25in} @@ -1473,7 +1474,7 @@ small sample sizes. The use of the bootstrap and bootstrap-related methods has become a cornerstone in modern statistical and data science practices. -In this section we introduce additional details about the bootstrap, and +In this section, we introduce additional details about the bootstrap, and explain the advantages and limitations of using the bootstrap. ### Confidence intervals and rate of convergence {#rate-convergence} @@ -1574,31 +1575,30 @@ This means that if you were to take a large number of random samples and construct the sample mean of these samples, their average will be clearly different than $\mu$, breaking down the theory we developed in Subsection \@ref(theory-based-CI). 
This problem may be fixed if the -sample size is large, but, depending of how extreme the population distribution +sample size is large, but, depending on how extreme the population distribution is, the sample size may need to be extremely large; perhaps on the order of thousands, or tens of thousands, or even more. Getting samples of those sizes may not be doable in real-life situations. -Second, in this chapter we have study confidence intervals for the population -mean $\mu$, because it is a fundamental quantity and it is the foundation for other cases. -However, building confidence intervals for other parameters using the -theory-based approach (for example, for the median, the first quartile, +Second, in this chapter, we studied confidence intervals for the population +mean $\mu$, because it is a fundamental quantity and it is the foundation for +other cases. However, building confidence intervals for other parameters using +the theory-based approach (for example, for the median, the first quartile, the standard deviation, etc.) becomes more complicated or even unfeasible. Third, when working with estimators more complicated than $\overline X$, it is often not possible to derive the standard error estimator with a formula as clean as $\sigma/\sqrt{n}$. Sometimes, there -is no formula for the standard error and alternative methods have to be used to estimate it. -This can create an additional source of bias. When bias is present, the confidence -intervals created using the -theory-based approach in Subsection \@ref(CI-general) could be suspect, even -completely useless. - -The bootstrap percentile method is not affected directly by the second and third limitations. -It can be implemented for any other parameter beyond the population mean, -as long as all the information needed can be extracted from each bootstrap sample. -On the other hand, the first limitation listed above can also affect the accuracy of -this method.
+is no formula for the standard error and alternative methods have to be used to +estimate it. This can create an additional source of bias. When bias is present, +the confidence intervals created using the theory-based approach in Subsection +\@ref(CI-general) could be suspect, even completely useless. + +The bootstrap percentile method is not affected directly by the second and third +limitations. It can be implemented for any other parameter beyond the population +mean, as long as all the information needed can be extracted from each bootstrap +sample. On the other hand, the first limitation listed above can also affect the +accuracy of this method. Fortunately, since the inception of the bootstrap, many improvements have been made to the percentile method. Bootstrap methods have @@ -1671,11 +1671,11 @@ Let's now focus on the `"seed"` group participants who were exposed to yawning w ### Sampling scenario -Let's review the terminology and notation related to sampling we studied in Subsection \@ref(terminology-and-notation). In Chapter \@ref(sampling) our *study population* was the bowl of $N$ = `r nrow(bowl)` balls. Our *population parameter* of interest was the *population proportion* of these balls that were red, denoted mathematically by $p$. In order to estimate $p$, we extracted a sample of 50 balls using the shovel and computed the relevant *point estimate*: the *sample proportion* that were red, denoted mathematically by $\widehat{p}$. +Let's review the terminology and notation related to sampling we studied in Subsection \@ref(terminology-and-notation). In Chapter \@ref(sampling), our *study population* was the bowl of $N$ = `r nrow(bowl)` balls. Our *population parameter* of interest was the *population proportion* of these balls that were red, denoted mathematically by $p$. 
In order to estimate $p$, we extracted a sample of 50 balls using the shovel and computed the relevant *point estimate*: the *sample proportion* that were red, denoted mathematically by $\widehat{p}$. Who is the study population here? All humans? All the people who watch the show *Mythbusters*? It's hard to say! This question can only be answered if we know how the show's hosts recruited participants! In other words, what was the *sampling methodology*\index{sampling methodology} used by the *Mythbusters* to recruit participants? We alas are not provided with this information. Only for the purposes of this case study, however, we'll *assume* that the 50 participants are a representative sample of all Americans given the popularity of this show. Thus, we'll be assuming that any results of this experiment will generalize to all $N$ = 346 million Americans (2024 population estimate). -Just like with our sampling bowl, the population parameter here will involve proportions. However, in this case it will be the *difference in population proportions* $p_{seed} - p_{control}$, where $p_{seed}$ is the proportion of *all* Americans who if exposed to yawning will yawn themselves, and $p_{control}$ is the proportion of *all* Americans who if not exposed to yawning still yawn themselves. Correspondingly, the point estimate/sample statistic based the *Mythbusters*' sample of participants will be the *difference in sample proportions* $\widehat{p}_{seed} - \widehat{p}_{control}$. Let's extend Table \@ref(tab:table-ch8-c) of scenarios of sampling for inference to include our latest scenario. +Just like with our sampling bowl, the population parameter here will involve proportions. 
However, in this case, it will be the *difference in population proportions* $p_{seed} - p_{control}$, where $p_{seed}$ is the proportion of *all* Americans who if exposed to yawning will yawn themselves, and $p_{control}$ is the proportion of *all* Americans who if not exposed to yawning still yawn themselves. Correspondingly, the point estimate/sample statistic based on the *Mythbusters*' sample of participants will be the *difference in sample proportions* $\widehat{p}_{seed} - \widehat{p}_{control}$. Let's extend Table \@ref(tab:table-ch8-c) of scenarios of sampling for inference to include our latest scenario. ```{r table-ch8-c, echo=FALSE, message=FALSE, purl=FALSE} # The following Google Sheet is published to CSV and loaded using read_csv(): @@ -1892,13 +1892,14 @@ We thus plug this value in as the `point_estimate` argument. ```{r} myth_ci_se <- bootstrap_distribution_yawning |> - get_confidence_interval(type = "se", point_estimate = obs_diff_in_props) + get_confidence_interval(type = "se", point_estimate = obs_diff_in_props, + level = 0.95) myth_ci_se ``` Let's visualize both confidence intervals in Figure \@ref(fig:bootstrap-distribution-mythbusters-CI), with the percentile method interval marked with black lines and the standard-error method marked with grey lines. Observe that they are both similar to each other. 
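For reference, the interval the standard-error method reports is the point estimate plus or minus the 97.5th standard normal quantile times the bootstrap standard error. A minimal sketch with illustrative placeholder numbers (not the values computed from the *Mythbusters* data):

```r
# Standard-error method sketch: point estimate +/- z* times the bootstrap SE.
# Both inputs below are hypothetical placeholders for illustration only.
point_estimate <- 0.04  # stand-in for the observed difference in proportions
boot_se <- 0.13         # stand-in for the sd of the bootstrap distribution
ci <- point_estimate + c(-1, 1) * qnorm(0.975) * boot_se
round(ci, 3)
```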
-```{r bootstrap-distribution-mythbusters-CI, echo=FALSE, fig.cap="Two 95\\% confidence intervals: percentile method (black) and standard error method (grey).", purl=FALSE} +```{r bootstrap-distribution-mythbusters-CI, echo=FALSE, fig.cap="Two 95\\% confidence intervals: percentile method (black) and standard error method (grey).", fig.height=2.5, purl=FALSE} visualize(bootstrap_distribution_yawning) + ggtitle("") + shade_confidence_interval( diff --git a/09-hypothesis-testing.Rmd b/09-hypothesis-testing.Rmd index 40ad59dd5..d4a7e61cd 100755 --- a/09-hypothesis-testing.Rmd +++ b/09-hypothesis-testing.Rmd @@ -77,40 +77,38 @@ library(gt) ## Tying confidence intervals to hypothesis testing {#tying-CI-hypo} -In Chapter \@ref(confidence-intervals) we used a random sample to construct an +In Chapter \@ref(confidence-intervals), we used a random sample to construct an interval estimate of the population mean. -When using the theory-based approach, we relied in the Central Limit Theorem +When using the theory-based approach, we relied on the Central Limit Theorem to form these intervals and when using the simulation-based approach we do it, for example, using the bootstrap percentile method. -Hypothesis testing takes advantages of similar tools but the nature and the goal of -the problem are different. Still, there is a direct link between confidence +Hypothesis testing takes advantage of similar tools but the nature and the goal +of the problem are different. Still, there is a direct link between confidence intervals and hypothesis testing. -In this section we first describe the one-sample hypothesis test for the -population mean. -We then establish the connection between confidence intervals and hypothesis -test. -This connection is direct when using the theory-based approach but requires -careful consideration when using the simulation-based approach. +In this section, we first describe the one-sample hypothesis test for the +population mean.
We then establish the connection between confidence intervals +and hypothesis tests. This connection is direct when using the theory-based +approach but requires careful consideration when using the simulation-based +approach. We proceed by describing hypothesis testing in the case of two-sample problems. ### The one-sample hypothesis test for the population mean {#one-sample-hyp} Let's continue working with the population mean, $\mu$. -In Chapter \@ref(confidence-intervals) we used a random sample to construct a 95% -confidence interval for $\mu$. +In Chapter \@ref(confidence-intervals), we used a random sample to construct a +95% confidence interval for $\mu$. When performing hypothesis testing, we test a claim about $\mu$ by collecting a -random sample and using it to determine if the sample obtained is consistent to -the claim made. To illustrate this idea we return to the chocolate covered almonds -activity. Assume that the almonds' company has stated on its -website that the average weight of a chocolate-covered almond is exactly 3.6 grams. -We are not so sure about this claim and as researchers believe that it is +random sample and using it to determine if the sample obtained is consistent +with the claim made. To illustrate this idea, we return to the chocolate-covered +almonds activity. Assume that the almonds' company has stated on its +website that the average weight of a chocolate-covered almond is exactly 3.6 +grams. We are not so sure about this claim and, as researchers, believe that it is different than 3.6 grams on average. To test these competing claims, -we again use the random sample -`almonds_sample_100` from the `moderndive` package, and -the first 10 lines are shown below: +we again use the random sample `almonds_sample_100` from the `moderndive` +package, and the first 10 lines are shown: ```{r} almonds_sample_100 @@ -148,18 +146,20 @@ researcher bears the _burden of proof_.
The hypothesis shown above represents a *two-sided test*\index{hypothesis testing!two-sided test} because evidence against the null hypothesis could come from either direction (greater or less). -Sometimes it is convenient to work with a *left-sided test*\index{hypothesis testing!left-sided test}. In our example, the claim under the null hypothesis becomes: "the -average weight is _at least_ 3.6 grams" and the -researcher's goal is to find evidence against this claim in favor of the -competing claim "the weight is _less_ than 3.6 grams". The competing -hypotheses can now be written as $H_0: \mu \ge 3.6$ versus $H_A: \mu < 3.6$ or even -as $H_0: \mu = 3.6$ versus $H_A: \mu < 3.6$. +Sometimes it is convenient to work with a +*left-sided test*\index{hypothesis testing!left-sided test}. In our example, the +claim under the null hypothesis becomes: "the average weight is _at least_ +3.6 grams" and the researcher's goal is to find evidence against this claim in +favor of the competing claim "the weight is _less_ than 3.6 grams." The +competing hypotheses can now be written as $H_0: \mu \ge 3.6$ versus $H_A: \mu < 3.6$ or +even as $H_0: \mu = 3.6$ versus $H_A: \mu < 3.6$. Notice that we can drop the inequality part from the null hypothesis. We find this simplification convenient as we focus on the equal part of the null -hypothesis only and becomes clearer that evidence against the null hypothesis may come only with values to the left of 3.6, hence a left-sided test. +hypothesis only, and it becomes clearer that evidence against the null hypothesis +may come only with values to the left of 3.6, hence a left-sided test. Similarly, a *right-sided test*\index{hypothesis testing!right-sided test} can be stated $H_0: \mu = 3.6$ versus -$H_A: \mu > 3.6$. Claims under the null hypothesis for this type of test could be stated as "the average weight is *at most* 3.6 grams" or even "the average weight is *less* than 3.6 grams".
But observe that _less_ does not contain the _equal_ part, how can the null then be +$H_A: \mu > 3.6$. Claims under the null hypothesis for this type of test could be stated as "the average weight is *at most* 3.6 grams" or even "the average weight is *less* than 3.6 grams." But observe that _less_ does not contain the _equal_ part; how can the null then be $H_0: \mu = 3.6$? The reason is related to the method used more than the semantics of the statement. Let's break this down: If, under the null hypothesis, the average weight is less than 3.6 grams, we can only find evidence against the null if we find sample means that are much greater than 3.6 grams, hence the alternative hypothesis is $H_A: \mu > 3.6$. Now, to find the evidence we are looking for, the methods we use require a point of reference; "less than 3.6" is not a fixed number, since 2 is less than 3.6 but so too is 3.59. @@ -193,9 +193,9 @@ statistic $$t = \frac{\overline{x} - \mu}{s/\sqrt{n}}$$ follows a $t$ distribution with $n-1$ degrees of freedom. -Of course we do not know what $\mu$ is. But since we assume that the null +Of course, we do not know what $\mu$ is. But since we assume that the null hypothesis is true, we can use this value to obtain the test statistic as shown -in the code below. Table \@ref(tab:hypo-test-1) presents these values. +in the code that follows. Table \@ref(tab:hypo-test-1) presents these values. ```{r eval=FALSE} almonds_sample_100 |> @@ -216,7 +216,7 @@ hypo_test <- almonds_sample_100 |> hypo_test |> kbl( digits = 3, - caption = "Sample mean, standard deviation, size, and the t test statistic", + caption = "Sample mean, standard deviation, size, and the t-test statistic", booktabs = TRUE, linesep = "" ) |> @@ -245,7 +245,7 @@ Because this is a two-sided test, we care about extreme values that are The shaded regions on both tails of the $t$ distribution in Figure \@ref(fig:t-curve-hypo) represent the probability of these extreme values.
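The two-tail probability just described can be computed directly from the $t$ distribution with `pt()`. A sketch using an illustrative test statistic, not the value computed in the chapter:

```r
# Two-sided p-value: twice the area in one tail of a t distribution with
# n - 1 = 99 degrees of freedom. t_stat is an illustrative value only.
t_stat <- 2.1
p_value <- 2 * pt(-abs(t_stat), df = 99)
p_value
```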
-```{r t-curve-hypo, echo=FALSE, fig.height=ifelse(is_latex_output(), 2, 7), fig.width=3, purl=FALSE, fig.cap="The tails of a t curve for this hypothesis test."} +```{r t-curve-hypo, echo=FALSE, fig.height=ifelse(is_latex_output(), 1.5, 4), fig.width=3, purl=FALSE, fig.cap="The tails of a t curve for this hypothesis test."} t_value <- hypo_test[1,4][[1]] ggplot(data = data.frame(x = c(-4, 4)), aes(x)) + stat_function(fun = dt, args = list(df = 99))+ @@ -322,17 +322,19 @@ following game: the instructor rolls a six-sided fair die. If the top face shows a "one" you score zero on your test, otherwise you score 100. There is a 1 in 6 chance that you get zero. Would you play this game? -If you would not play the game, let's change it. Now the instructor rolls the die -twice, you score 0 in the test only if both rolls are "one". Otherwise, you -score 100. There is only a 1 in 36 chance to get zero. Would you now play this game? +If you would not play the game, let's change it. Now the instructor rolls the +die twice; you score 0 in the test only if both rolls are "one." Otherwise, you +score 100. There is only a 1 in 36 chance to get zero. Would you now play this +game? What if the instructor rolls the die five times and you score zero in the test -only if each roll is a "one", and you score 100 otherwise. The is only a 1 in 7776 -chance to get zero. Would you play this game? +only if each roll is a "one," and you score 100 otherwise? There is only a 1 in +7776 chance to get zero. Would you play this game? -Converted to probabilities, the three games shown above give you the probabilities of -getting a zero equal to $1/6 = 0.167$, $1/36 = 0.028$, and $1/7776 = 0.0001$, respectively. -Think of these as $p$-values and getting a zero in the test as committing a Type I Error. +Converted to probabilities, the three games shown above give you the +probabilities of getting a zero equal to $1/6 = 0.167$, $1/36 = 0.028$, and +$1/7776 = 0.0001$, respectively.
Think of these as $p$-values and getting a zero +in the test as committing a Type I Error. In the context of a hypothesis test, when the random sample collected is extreme, the $p$-value is really small, and we reject the null hypothesis, there is always a chance that the null hypothesis was true: the random sample collected was very atypical, and these results led us to commit a Type I Error. @@ -345,18 +347,18 @@ the population mean $\mu$ is not equal to 3.6 grams. When the null hypothesis is Let's summarize the steps for hypothesis testing: 1. Based on the claim we are planning to test, state the null and alternative hypothesis in terms of $\mu$. -- Remember that the equal sign should go under the null hypothesis as this is needed for the method. -- The statement under the null hypothesis is assumed to be true during the process. -- Typically, researchers want to conclude in favor of the alternative hypothesis; that is, they try to see if the data provides evidence against the null hypothesis. -2. Set a significance level $\alpha$, based on your tolerance for committing a Type I Error, always before working with the sample. -3. Obtain the sample mean, sample standard deviation, $t$-test statistic, and $p$-value. -- When working with a two-sided test, as in the almond example above, the $p$-value is the area on both tails. -- For a left-sided test, find the area under the $t$ distribution to the left of the observed $t$ test statistic. -- For a right-sided test, find the area under the $t$ distribution to the right of the observed $t$ test statistic. + - Remember that the equal sign should go under the null hypothesis as this is needed for the method. + - The statement under the null hypothesis is assumed to be true during the process. + - Typically, researchers want to conclude in favor of the alternative hypothesis; that is, they try to see if the data provides evidence against the null hypothesis. +2.
Set a significance level $\alpha$, based on your tolerance for committing a Type I Error, always before working with the sample. +3. Obtain the sample mean, sample standard deviation, $t$-test statistic, and $p$-value. + - When working with a two-sided test, as in the almond example above, the $p$-value is the area on both tails. + - For a left-sided test, find the area under the $t$ distribution to the left of the observed $t$ test statistic. + - For a right-sided test, find the area under the $t$ distribution to the right of the observed $t$ test statistic. 4. Determine whether the result of the test is statistically significant (if the null is rejected) or non-significant (the null is not rejected). -#### The simulation-based approach +#### The simulation-based approach {-} When using a simulation-based approach such as the bootstrap percentile method, we repeat the first two steps of the theory-based approach: @@ -502,7 +504,7 @@ n_songs_genre <- n_songs / 2 To begin the analysis, `r n_songs` tracks were selected at random from Spotify's song archive. We will use "song" and "track" interchangeably going forward. There were `r n_songs_genre` metal tracks and `r n_songs_genre` deep house tracks selected. -The `moderndive` package contains the data on the songs by genre in the `spotify_by_genre` data frame. There are six genres selected in that data (`country`, `deep-house`, `dubstep`, `hip-hop`, `metal`, and `rock`). You will have the opportunity to explore relationships with the other genres and popularity in the Learning checks. Let's explore this data by focusing on just `metal` and `deep-house` by looking at twelve randomly selected rows and our columns of interest. Note here that we also group our selection so that three songs of each of the four possible groupings of `track_genre` and `popular_or_not` are selected. +The `moderndive` package contains the data on the songs by genre in the `spotify_by_genre` data frame. 
There are six genres selected in that data (`country`, `deep-house`, `dubstep`, `hip-hop`, `metal`, and `rock`). You will have the opportunity to explore relationships with the other genres and popularity in the *Learning checks*. Let's explore this data by focusing on just `metal` and `deep-house` by looking at 12 randomly selected rows and our columns of interest in Table \@ref(tab:twelve-spotify). Note that we also group our selection so that three songs from each of the four possible groupings of `track_genre` and `popular_or_not` are selected. ```{r echo=FALSE} set.seed(2) @@ -517,7 +519,7 @@ spotify_metal_deephouse |> sample_n(size = 3) ``` -```{r echo=FALSE} +```{r twelve-spotify, echo=FALSE} spotify_metal_deephouse <- spotify_by_genre |> filter(track_genre %in% c("metal", "deep-house")) |> select(track_id, track_genre, artists, track_name, popularity, popular_or_not) @@ -537,7 +539,7 @@ sampled_spotify_metal_deephouse |> font_size = ifelse(is_latex_output(), 8, 16), latex_options = c("HOLD_position") ) |> - column_spec(3, width = "1.5in") # Adjust the column number and width as needed + column_spec(3, width = "1.5in") # Adjust the column number/width as needed ``` The `track_genre` variable indicates what genre the song is classified under, the `artists` and `track_name` columns provide additional information about the track by providing the artist and the name of the song, `popularity` is the metric mentioned earlier given by Spotify, and `popular_or_not` is a categorical representation of the `popularity` column with any value above 50 (the 75th percentile of `popularity`) referring to `popular` and anything at or below 50 as `not_popular`. The decision made by the authors to call a song "popular" if it is above the 75th percentile (3rd quartile) of `popularity` is arbitrary and could be changed to any other value.
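To make that recoding concrete, here is a hedged sketch of how such a `popular_or_not` variable could be derived with `dplyr`; it mirrors the description above, not necessarily the exact code used to build the `moderndive` data:

```{r eval=FALSE}
# A sketch only: recode popularity into a two-level categorical variable.
# Assumes dplyr is loaded and spotify_by_genre is available from moderndive.
spotify_by_genre |>
  mutate(popular_or_not = ifelse(popularity > 50, "popular", "not_popular"))
```

Any cutoff other than 50 would work the same way; only the `ifelse()` threshold changes.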
@@ -590,7 +592,7 @@ p_deephouse_popular <- (n_deephouse_popular_original / n_songs_genre) |> round(3) ``` -So of the `r n_songs_genre` metal songs, `r n_metal_popular_original` were popular, for a proportion of `r n_metal_popular_original`/`r n_songs_genre` = `r p_metal_popular` = `r p_metal_popular*100`%. On the other hand, of the `r n_songs_genre` deep house songs, `r n_deephouse_popular_original` were popular, for a proportion of `r n_deephouse_popular_original`/`r n_songs_genre` = `r p_deephouse_popular` = `r p_deephouse_popular*100`%. Comparing these two rates of popularity, it appears that metal songs were popular at a rate `r p_metal_popular` - `r p_deephouse_popular` = `r observed_test_statistic` = `r observed_test_statistic*100`% higher than deep house songs. This is suggestive of an advantage for metal songs in terms of popularity. +So of the `r n_songs_genre` metal songs, `r n_metal_popular_original` were popular, for a proportion of `r n_metal_popular_original`/`r n_songs_genre` = `r p_metal_popular` = `r p_metal_popular*100`%. On the other hand, of the `r n_songs_genre` deep house songs, `r n_deephouse_popular_original` were popular, for a proportion of `r n_deephouse_popular_original`/`r n_songs_genre` = `r p_deephouse_popular` = `r p_deephouse_popular*100`%. Comparing these two rates of popularity, it appears that metal songs were popular at a rate `r p_metal_popular` $-$ `r p_deephouse_popular` = `r observed_test_statistic` = `r observed_test_statistic*100`% higher than deep house songs. This is suggestive of an advantage for metal songs in terms of popularity. The question is, however, does this provide *conclusive* evidence that there is greater popularity for metal songs compared to deep house songs? Could a difference in popularity rates of `r observed_test_statistic * 100`% still occur by chance, even in a hypothetical world where no difference in popularity existed between the two genres? 
In other words, what is the role of *sampling variation* in this hypothesized world? To answer this question, we will again rely on a computer to run *simulations*. @@ -599,7 +601,7 @@ The question is, however, does this provide *conclusive* evidence that there is First, try to imagine a hypothetical universe where there was no difference in the popularity of metal and deep house. In such a hypothetical universe, the genre of a song would have no bearing on its chances of popularity. Bringing things back to our `spotify_metal_deephouse` data frame, the `popular_or_not` variable would thus be an irrelevant label. If these `popular_or_not` labels were irrelevant, then we could randomly reassign them by "shuffling" them to no consequence! -To illustrate this idea, let's narrow our focus to 52 chosen songs of the `r n_songs` that you saw earlier. The `track_genre` column shows what the original genre of the song was. Note that to keep this smaller dataset of 52 rows to be a representative sample of the 2000 rows, we have sampled such that the popularity rate for each of `metal` and `deep-house` is close to the original rates of `r p_metal_popular` and `r p_deephouse_popular`, respectively, prior to shuffling. This data is available in the `spotify_52_original` data frame in the `moderndive` package. We also remove the `track_id` column for simplicity. It is an identification variable that is not relevant for our analysis. +To illustrate this idea, let's narrow our focus to 52 of the `r n_songs` songs that you saw earlier. The `track_genre` column shows what the original genre of the song was. Note that to keep this smaller dataset of 52 rows representative of the 2000 rows, we have sampled such that the popularity rate for each of `metal` and `deep-house` is close to the original rates of `r p_metal_popular` and `r p_deephouse_popular`, respectively, prior to shuffling. 
This data is available in the `spotify_52_original` data frame in the `moderndive` package. We also remove the `track_id` column for simplicity. It is an identification variable that is not relevant for our analysis. A sample of this data is shown in Table \@ref(tab:spotify-52-sample). ```{r eval=FALSE} spotify_52_original |> @@ -607,7 +609,7 @@ spotify_52_original |> head(10) ``` -```{r echo=FALSE} +```{r spotify-52-sample, echo=FALSE} spotify_52_original |> select(-track_id) |> head(10) |> @@ -631,9 +633,9 @@ spotify_52_shuffled |> head(10) ``` -(ref:spotify-shuffled-52) Shuffled version of `popular_or_not` in representative sample of metal and deep-house songs +(ref:spotify-shuffled-52) Shuffled version of `popular_or_not` in sample -```{r echo=FALSE} +```{r spotify-shuffled-52-sample, echo=FALSE} spotify_52_shuffled |> select(-track_id) |> head(10) |> @@ -645,11 +647,11 @@ spotify_52_shuffled |> font_size = ifelse(is_latex_output(), 8, 16), latex_options = c("HOLD_position") ) |> - column_spec(3, width = "1.5in") # Adjust the column number and width as needed + column_spec(3, width = "1.5in") ``` -Observe in the `popular_or_not` column how the `popular` and `not popular` results are now listed in a different order. Some of the original `popular` now are `not popular`, some of the `not popular` are `popular`, and others are the same as the original. +Observe in Table \@ref(tab:spotify-shuffled-52-sample) how the `popular` and `not popular` results in the `popular_or_not` column are now listed in a different order. Some of the originally `popular` songs are now `not popular`, some of the `not popular` are now `popular`, and others are the same as the original. Again, such random shuffling of the `popular_or_not` label only makes sense in our hypothesized universe of no difference in popularity between genres. Is there a tactile way for us to understand what is going on with this shuffling? 
One way would be by using a standard deck of 52 playing cards, which we display in Figure \@ref(fig:deck-of-cards). @@ -659,7 +661,7 @@ include_graphics("images/shutterstock/shutterstock_670789453.jpg") Since we started with equal sample sizes of 1000 songs for each genre, we can think about splitting the deck in half to have 26 cards in two piles (one for `metal` and another for `deep-house`). After shuffling these 52 cards as seen in Figure \@ref(fig:shuffling), we split the deck equally into the two piles of 26 cards each. Then, we can flip the cards over one-by-one, assigning "popular" for each red card and "not popular" for each black card keeping a tally of how many of each genre are popular. -```{r shuffling, echo=FALSE, fig.cap="Shuffling a deck of cards.", purl=FALSE, out.width="50%", out.height="50%"} +```{r shuffling, echo=FALSE, fig.cap="Shuffling a deck of cards.", purl=FALSE, out.width="50%"} include_graphics("images/shutterstock/shutterstock_128283971.jpg") ``` @@ -749,7 +751,7 @@ Observe how this difference in rates is not the same as the difference in rates What we just demonstrated in this activity is the statistical procedure known as *hypothesis testing* using a *permutation test*. The term "permutation" \index{permutation} is the mathematical term for "shuffling": taking a series of values and reordering them randomly, as you did with the playing cards. In fact, permutations are another form of *resampling*, like the bootstrap method you performed in Chapter \@ref(confidence-intervals). While the bootstrap method involves resampling *with* replacement, permutation methods involve resampling *without* replacement. -We do not need restrict our analysis to a dataset of 52 rows only. It is useful to manually shuffle the deck of cards and assign values of popular or not popular to different songs, but the same ideas can be applied to each of the 2000 tracks in our `spotify_metal_deephouse` data. 
We can think for this data about an inference about an unknown difference of population proportions with the 2000 tracks being our sample. We denote this as $p_{m} - p_{d}$, where $p_{m}$ is the population proportion of songs with metal names being popular and $p_{d}$ is the equivalent for deep house songs. Recall that this is one of the scenarios for inference we have seen so far in Table \@ref(tab:table-diff-prop). +We do not need to restrict our analysis to a dataset of 52 rows only. It is useful to manually shuffle the deck of cards and assign values of popular or not popular to different songs, but the same ideas can be applied to each of the 2000 tracks in our `spotify_metal_deephouse` data. With this data, we can think about an inference for an unknown difference in population proportions with the 2000 tracks being our sample. We denote this as $p_{m} - p_{d}$, where $p_{m}$ is the population proportion of metal songs that are popular and $p_{d}$ is the equivalent for deep house songs. Recall that this is one of the scenarios for inference we have seen so far in Table \@ref(tab:table-diff-prop). ```{r table-diff-prop, echo=FALSE, message=FALSE, purl=FALSE} # The following Google Doc is published to CSV and loaded using read_csv(): @@ -813,7 +815,7 @@ H_0 &: \text{metal and deep house have the same popularity rate}\\ \end{aligned} $$ -Note some of the choices we have made. First, we set the null hypothesis $H_0$ to be that there is no difference in popularity rate and the "challenger" alternative hypothesis $H_A$ to be that there is a difference in favor of metal. As discussed earlier, the null hypothesis is set to reflect a situation of "no change." As we discussed earlier, in this case, $H_0$ corresponds to there being no difference in popularity. Furthermore, we set $H_A$ to be that metal is popular at a *higher* rate, a subjective choice reflecting a prior suspicion we have that this is the case. 
As discussed earlier this is a *one-sided test* \index{hypothesis testing!one-sided test}. It can be left- or right-sided, and this becomes clear once we express it in terms of proportions. If someone else however does not share such suspicions and only wants to investigate that there is a difference, whether higher or lower, they would construct a \index{hypothesis testing!two-sided test} *two-sided test*. +Note some of the choices we have made. First, we set the null hypothesis $H_0$ to be that there is no difference in popularity rate and the "challenger" alternative hypothesis $H_A$ to be that there is a difference in favor of metal. As discussed earlier, the null hypothesis is set to reflect a situation of "no change"; in this case, $H_0$ corresponds to there being no difference in popularity. Furthermore, we set $H_A$ to be that metal is popular at a *higher* rate, a subjective choice reflecting a prior suspicion we have that this is the case. As discussed earlier, this is a \index{hypothesis testing!one-sided test} *one-sided test*. It can be left- or right-sided, and this becomes clear once we express it in terms of proportions. If someone else, however, does not share such suspicions and only wants to investigate whether there is a difference, higher or lower, they would construct a \index{hypothesis testing!two-sided test} *two-sided test*. We can re-express the formulation of our hypothesis test using the mathematical notation for our population parameter of interest, the difference in population proportions $p_{m} - p_{d}$: @@ -830,12 +832,12 @@ Third, a **test statistic** \index{hypothesis testing!test statistic} is a *poin Fourth, the **observed test statistic** \index{hypothesis testing!observed test statistic} is the value of the test statistic that we observed in real life. In our case, we computed this value using the data saved in the `spotify_metal_deephouse` data frame. 
It was the observed difference of $\widehat{p}_{m} -\widehat{p}_{d} = `r p_metal_popular` - `r p_deephouse_popular` = `r observed_test_statistic` = `r observed_test_statistic*100`\%$ in favor of metal songs. -Fifth, the **null distribution** \index{hypothesis testing!null distribution} is the sampling distribution of the test statistic *assuming the null hypothesis $H_0$ is true*. Let's unpack these ideas slowly. The key to understanding the null distribution is that the null hypothesis $H_0$ is *assumed* to be true. We are not saying that $H_0$ is true at this point, we are only assuming it to be true for hypothesis testing purposes. In our case, this corresponds to our hypothesized universe of no difference in popularity rates. Assuming the null hypothesis $H_0$, also stated as "Under $H_0$," how does the test statistic vary due to sampling variation? In our case, how will the difference in sample proportions $\widehat{p}_{m} - \widehat{p}_{f}$ vary due to sampling under $H_0$? Recall from Subsection \@ref(sampling-variation) that distributions displaying how point estimates vary due to sampling variation are called *sampling distributions*. The only additional thing to keep in mind about null distributions is that they are sampling distributions *assuming the null hypothesis $H_0$ is true*. +Fifth, the **null distribution** \index{hypothesis testing!null distribution} is the sampling distribution of the test statistic *assuming the null hypothesis $H_0$ is true*. Let's unpack these ideas slowly. The key to understanding the null distribution is that the null hypothesis $H_0$ is *assumed* to be true. We are not saying that $H_0$ is true at this point, we are only assuming it to be true for hypothesis-testing purposes. In our case, this corresponds to our hypothesized universe of no difference in popularity rates. Assuming the null hypothesis $H_0$, also stated as "Under $H_0$," how does the test statistic vary due to sampling variation? 
In our case, how will the difference in sample proportions $\widehat{p}_{m} - \widehat{p}_{d}$ vary due to sampling under $H_0$? Recall from Subsection \@ref(sampling-variation) that distributions displaying how point estimates vary due to sampling variation are called *sampling distributions*. The only additional thing to keep in mind about null distributions is that they are sampling distributions *assuming the null hypothesis $H_0$ is true*. Sixth, the **$p$-value** \index{hypothesis testing!p-value} is the probability of obtaining a test statistic just as extreme as or more extreme than the observed test statistic *assuming the null hypothesis $H_0$ is true*. You can think of the $p$-value as a quantification of "surprise": assuming $H_0$ is true, how surprised are we with what we observed? Or in our case, in our hypothesized universe of no difference in genre popularity, how surprised are we that we observed a difference in popularity rates of `r observed_test_statistic` from our collected samples if no difference in genre popularity exists? Very surprised? Somewhat surprised? -The $p$-value quantifies this probability, or what proportion had a more "extreme" result? Here, extreme is defined in terms of the alternative hypothesis $H_A$ that metal popularity is a higher rate than deep house. In other words, how often was the popularity of metal _even more_ pronounced than $`r p_metal_popular` - `r p_deephouse_popular` = `r observed_test_statistic` = `r observed_test_statistic*100`\%$? +The $p$-value quantifies this probability: what proportion of results would be as or more "extreme"? Here, extreme is defined in terms of the alternative hypothesis $H_A$ that metal popularity is at a higher rate than deep house. In other words, how often was the popularity of metal _even more_ pronounced than $`r p_metal_popular` - `r p_deephouse_popular` = `r observed_test_statistic` = `r observed_test_statistic*100`\%$? 
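The "as or more extreme" idea can be sketched in a few lines of R. The vector `null_diffs` below is a hypothetical stand-in for a null distribution of differences in proportions, not the actual null distribution generated later with `infer`:

```{r eval=FALSE}
# Sketch: for a right-sided test, the p-value is the proportion of the
# null distribution at or above the observed test statistic.
set.seed(76)
null_diffs <- rnorm(1000, mean = 0, sd = 0.02)  # stand-in null distribution
observed <- 0.11                                # stand-in observed difference
mean(null_diffs >= observed)                    # simulation-based p-value
```

For a left-sided test the comparison would flip to `<=`, and for a two-sided test we would count both tails.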
Seventh and lastly, in many hypothesis testing procedures, it is commonly recommended to set the **significance level** \index{hypothesis testing!significance level} of the test beforehand. It is denoted by $\alpha$. Please review our discussion of $\alpha$ in Section \@ref(one-sample-hyp) when we discussed the theory-based approach. For now, it is sufficient to recall that if the $p$-value is less than or equal to $\alpha$, we reject the null hypothesis $H_0$. @@ -854,13 +856,13 @@ We then showed you how to perform the same task using the `infer` package workfl 1. `calculate()` the summary statistic of interest. 1. `visualize()` the resulting bootstrap distribution and confidence interval. -```{r infer-ci, echo=FALSE, fig.cap="Confidence intervals with the infer package.", purl=FALSE, out.width="90%", out.height="90%"} +```{r infer-ci, echo=FALSE, fig.cap="Confidence intervals with the infer package.", purl=FALSE, out.width="90%"} include_graphics("images/flowcharts/infer/visualize.png") ``` In this section, we will now show you how to seamlessly modify the previously seen `infer` code for constructing confidence intervals to conduct hypothesis tests. You will notice that the basic outline of the workflow is almost identical, except for an additional `hypothesize()` step between the `specify()` and `generate()` steps, as can be seen in Figure \@ref(fig:inferht). -```{r inferht, echo=FALSE, fig.cap="Hypothesis testing with the infer package.", purl=FALSE, out.width="90%", out.height="90%"} +```{r inferht, echo=FALSE, fig.cap="Hypothesis testing with the infer package.", purl=FALSE, out.width="90%"} include_graphics("images/flowcharts/infer/ht.png") ``` @@ -868,7 +870,7 @@ include_graphics("images/flowcharts/infer/ht.png") alpha <- 0.1 ``` -Furthermore, we will use a pre-specified significance level $\alpha$ = `r alpha` for this hypothesis test. 
Please read the discussion about $\alpha$ in Subsection \@(one-sample-hyp) or until later on in Section \@ref(ht-interpretation). +Furthermore, we will use a pre-specified significance level $\alpha$ = `r alpha` for this hypothesis test. Please read the discussion about $\alpha$ in Subsection \@ref(one-sample-hyp) or later on in Section \@ref(ht-interpretation). ### `infer` package workflow {#infer-workflow-ht} @@ -1194,11 +1196,11 @@ Let's recap the steps necessary to conduct a hypothesis test using the terminolo While this is a lot to digest, especially the first time you encounter hypothesis testing, the nice thing is that once you understand this general framework, then you can understand *any* hypothesis test. In a famous blog post, computer scientist Allen Downey called this the ["There is only one test"](http://allendowney.blogspot.com/2016/06/there-is-still-only-one-test.html) framework, for which he created the flowchart displayed in Figure \@ref(fig:htdowney). -```{r htdowney, echo=FALSE, fig.cap="Allen Downey's hypothesis testing framework.", purl=FALSE, out.width=ifelse(knitr::is_latex_output(), "110%", "100%")} +```{r htdowney, echo=FALSE, fig.cap="Allen Downey's hypothesis testing framework.", purl=FALSE, out.width=ifelse(knitr::is_latex_output(), "85%", "100%")} include_graphics("images/copyright/there_is_only_one_test.png") ``` -Notice its similarity with the "hypothesis testing with `infer`" diagram you saw in Figure \@ref(fig:inferht). That is because the `infer` package was explicitly designed to match the "There is only one test" framework. So if you can understand the framework, you can easily generalize these ideas for all hypothesis testing scenarios. 
Whether for population proportions $p$, population means $\mu$, differences in population proportions $p_1 - p_2$, differences in population means $\mu_1 - \mu_2$, and as you will see in Chapter \@ref(inference-for-regression) on inference for regression, population regression slopes $\beta_1$ as well. In fact, it applies more generally even than just these examples to more complicated hypothesis tests and test statistics as well. +Notice its similarity with the "hypothesis testing with `infer`" diagram you saw in Figure \@ref(fig:inferht). That is because the `infer` package was explicitly designed to match the "There is only one test" framework. So if you can understand the framework, you can easily generalize these ideas for all hypothesis-testing scenarios, whether for population proportions $p$, population means $\mu$, differences in population proportions $p_1 - p_2$, differences in population means $\mu_1 - \mu_2$, or, as you will see in Chapter \@ref(inference-for-regression) on inference for regression, population regression slopes $\beta_1$. In fact, it applies even more generally than these examples, to more complicated hypothesis tests and test statistics as well. ```{block lc-downey, type="learncheck", purl=FALSE} \vspace{-0.15in} @@ -1245,14 +1247,14 @@ In other words, _not guilty_ verdicts are not suggesting the defendant is _innoc 1. We reject the null hypothesis $H_0$ in favor of $H_A$ only if the evidence found in the sample suggests that $H_A$ is true. The significance level $\alpha$ is used as a guideline to set the threshold on just how strong the evidence must be. 1. We ultimately decide to either "fail to reject $H_0$" or "reject $H_0$." -So while gut instinct may suggest "failing to reject $H_0$" and "accepting $H_0$" are equivalent statements, they are not. "Accepting $H_0$" is equivalent to finding a defendant innocent. However, courts do not find defendants "innocent," but rather they find them "not guilty." 
Putting things differently, defense attorneys do not need to prove that their clients are innocent, rather they only need to prove that clients are not "guilty beyond a reasonable doubt". +So while gut instinct may suggest "failing to reject $H_0$" and "accepting $H_0$" are equivalent statements, they are not. "Accepting $H_0$" is equivalent to finding a defendant innocent. However, courts do not find defendants "innocent," but rather they find them "not guilty." Putting things differently, defense attorneys do not need to prove that their clients are innocent; rather, they only need to show that their clients have not been proven "guilty beyond a reasonable doubt." So going back to our songs activity in Section \@ref(ht-infer), recall that our hypothesis test was $H_0: p_{m} - p_{d} = 0$ versus $H_A: p_{m} - p_{d} > 0$ and that we used a pre-specified significance level of $\alpha$ = `r alpha`. We found a $p$-value of `r pull(p_value)`. Since the $p$-value was smaller than $\alpha$ = `r alpha`, we rejected $H_0$. In other words, we found the needed level of evidence in this particular sample to say that $H_0$ is false at the $\alpha$ = `r alpha` significance level. We also state this conclusion using non-statistical language: we found enough evidence in this data to suggest that there was a difference in the popularity of our two genres of music. ### Types of errors -Unfortunately, there is some chance a jury or a judge can make an incorrect decision in a criminal trial by reaching the wrong verdict. For example, finding a truly innocent defendant "guilty". Or on the other hand, finding a truly guilty defendant "not guilty." This can often stem from the fact that prosecutors do not have access to all the relevant evidence, but instead are limited to whatever evidence the police can find. +Unfortunately, there is some chance a jury or a judge can make an incorrect decision in a criminal trial by reaching the wrong verdict. For example, finding a truly innocent defendant "guilty." 
Or on the other hand, finding a truly guilty defendant "not guilty." This can often stem from the fact that prosecutors do not have access to all the relevant evidence, but instead are limited to whatever evidence the police can find. The same holds for hypothesis tests. We can make incorrect decisions about a population parameter because we only have a sample of data from the population and thus sampling variation can lead us to incorrect conclusions. @@ -1281,7 +1283,7 @@ trial_table <- tibble( gtsave(trial_table, "images/gt_error_table_v2.png") ``` -```{r trial-errors-table, echo=FALSE, fig.cap="Type I and Type II errors in criminal trials.", purl=FALSE, out.width=if(!knitr::is_latex_output()) "100%"} +```{r trial-errors-table, echo=FALSE, fig.cap="Type I and Type II errors in criminal trials.", purl=FALSE, out.width="95%"} include_graphics("images/gt_error_table_v2.png") ``` @@ -1306,7 +1308,7 @@ hypo_table <- tibble( gtsave(hypo_table, "images/gt_error_table_ht_v2.png") ``` -```{r hypo-test-errors, echo=FALSE, fig.cap="Type I and Type II errors in hypothesis tests.", purl=FALSE, out.width=if(!knitr::is_latex_output()) "100%"} +```{r hypo-test-errors, echo=FALSE, fig.cap="Type I and Type II errors in hypothesis tests.", purl=FALSE, out.width="95%"} include_graphics("images/gt_error_table_ht_v2.png") ``` @@ -1314,7 +1316,7 @@ include_graphics("images/gt_error_table_ht_v2.png") ### How do we choose alpha? {#choosing-alpha} If we are using a sample to make inferences about a population, we are operating under uncertainty and run the risk of making statistical errors. These are not errors in calculations or in the procedure used, but errors in the sense that the sample used may lead us to construct a confidence interval that does not contain the true value of the population parameter, for example. 
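One way to make this "risk of making statistical errors" concrete is a short simulation, a sketch we add here rather than code from the book's analysis: even when the null hypothesis is true, a test run at significance level $\alpha$ = 0.05 will still reject about 5% of the time.

```{r eval=FALSE}
# Sketch: when H0 (mu = 0) is true, a t-test at alpha = 0.05 reaches a
# wrong "reject" decision whenever its p-value falls at or below alpha.
# Across many repeated samples, that rejection rate should be close to alpha.
set.seed(76)
alpha <- 0.05
p_values <- replicate(1000, t.test(rnorm(30, mean = 0), mu = 0)$p.value)
mean(p_values <= alpha)  # long-run rate of wrongly rejecting a true H0
```

This uses only base R (`rnorm()`, `t.test()`, `replicate()`); the sample size of 30 and the 1000 repetitions are arbitrary choices for illustration.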
-In the case of hypothesis testing, there are two well defined errors: a Type I and a Type II error: +In the case of hypothesis testing, there are two well-defined errors: a Type I and a Type II error: - A Type I Error\index{hypothesis testing!Type I Error} is rejecting the null hypothesis when it is true. The probability of a Type I Error occurring is $\alpha$, the *significance level*, which we defined in Subsection \@ref(one-sample-hyp) and in Section \@ref(understanding-ht). - A Type II Error\index{hypothesis testing!Type II Error} is failing to reject the null hypothesis when it is false. The probability of a Type II Error is denoted by $\beta$. The value of $1-\beta$ is known as the *power* of the test. @@ -1327,7 +1329,7 @@ So for example if we used $\alpha$ = 0.01, we would be using a hypothesis testin So what value should you use for $\alpha$? \index{hypothesis testing!tradeoff between alpha and beta} Different fields of study have adopted different conventions, and although $\alpha = 0.05$ is perhaps the most popular threshold, there is nothing special about this or any other number. Please review Subsection \@ref(one-sample-hyp) and our discussion about $\alpha$ and our tolerance for uncertainty. In addition, observe that choosing a relatively small value of $\alpha$ reduces our chances of rejecting the null hypothesis, and also of committing a Type I Error; but increases the probability of committing a Type II Error. -On the other hand, choosing a relatively large value of $\alpha$ increases the chances of failing to reject the null hypothesis, and also of committing a Type I Error; but reduces the probability of committing a Type II Error. Depending of the problem at hand, we may be willing to have a larger significance level in certain scenarios and a smaller significance level in others. 
+On the other hand, choosing a relatively large value of $\alpha$ increases the chances of rejecting the null hypothesis, and also of committing a Type I Error; but reduces the probability of committing a Type II Error. Depending on the problem at hand, we may be willing to have a larger significance level in certain scenarios and a smaller significance level in others. ```{block lc7-0, type="learncheck", purl=FALSE} @@ -1371,7 +1373,7 @@ movies n_movies_sample <- movies_sample |> nrow() ``` -We will focus on a random sample of `r n_movies_sample` movies that are classified as either "action" or "romance" movies but not both. We disregard movies that are classified as both so that we can assign all `r n_movies_sample` movies into either category. Furthermore, since the original `movies` dataset was a little messy, we provide a pre-wrangled version of our data in the `movies_sample` data frame included in the `moderndive` package. If you are curious, you can look at the necessary data wrangling code to do this on [GitHub](https://github.com/moderndive/moderndive/blob/master/data-raw/process_data_sets.R). +We will focus on a random sample of `r n_movies_sample` movies that are classified as either "action" or "romance" movies but not both. We disregard movies that are classified as both so that we can assign all `r n_movies_sample` movies into either category. Furthermore, since the original `movies` dataset was a little messy, we provide a pre-wrangled version of our data in the `movies_sample` data frame included in the `moderndive` package. If you are curious, you can look at the necessary data-wrangling code to do this on [GitHub](https://github.com/moderndive/moderndive/blob/master/data-raw/process_data_sets.R). 
```{r} movies_sample @@ -1506,7 +1508,7 @@ movies_sample |> After we have set the null hypothesis, we generate "shuffled" replicates assuming the null hypothesis is true by repeating the shuffling/permutation exercise you performed in Section \@ref(ht-activity). -We will repeat this resampling without replacement of `type = "permute"` a total of ``reps = `r n_reps` `` times. Feel free to run the code below to check out what the `generate()` step produces. +We will repeat this resampling without replacement of `type = "permute"` a total of ``reps = `r n_reps` `` times. Feel free to run the code to check out what the `generate()` step produces. ```{r eval=FALSE} movies_sample |> @@ -1623,7 +1625,7 @@ But this $p$-value is larger than our (even smaller) pre-specified $\alpha$ sign **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What is the value of the $p$-value for the two-sided hypothesis test comparing the mean rating of romance to action movies? -**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Test your data wrangling knowledge and EDA skills: +**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Test your data-wrangling knowledge and EDA skills: - Use `dplyr` and `tidyr` to create the necessary data frame focused on only action and romance movies (but not both) from the `movies` data frame in the `ggplot2movies` package. - Make a boxplot and a faceted histogram of this population data comparing ratings of action and romance movies from IMDb. @@ -1714,10 +1716,9 @@ $$ `r t_stat` $$ -How can we compute the $p$-value using this theory-based test statistic? We could do it by calculating the degrees of freedom and using the function `pt()` as we did earlier. Instead, we use the function `t_test()` from the package `infer` with the appropriate `formula` and `order`, as we did for the simulation based approach. - -The results are shown below: +How can we compute the $p$-value using this theory-based test statistic? 
We could do it by calculating the degrees of freedom and using the function `pt()` as we did earlier. Instead, we use the function `t_test()` from the package `infer` with the appropriate `formula` and `order`, as we did for the simulation-based approach. +The results are shown next: ```{r} movies_sample |> @@ -1775,7 +1776,7 @@ ggplot(data = flights_sample, mapping = aes(x = carrier, y = air_time)) + This is what we like to call "no PhD in Statistics needed" moments. You do not have to be an expert in statistics to know that Alaska Airlines and Hawaiian Airlines have *notably* different air times. The two boxplots do not even overlap! Constructing a confidence interval or conducting a hypothesis test would frankly not provide much more insight than Figure \@ref(fig:ha-as-flights-boxplot). -Let's investigate why we observe such a clear cut difference between these two airlines using data wrangling. Let's first group the rows of `flights_sample` not only by `carrier` but also by destination `dest`. Subsequently, we will compute two summary statistics: the number of observations using `n()` and the mean airtime: +Let's investigate why we observe such a clear-cut difference between these two airlines using data wrangling. Let's first group the rows of `flights_sample` not only by `carrier` but also by destination `dest`. Subsequently, we will compute two summary statistics: the number of observations using `n()` and the mean airtime: ```{r} flights_sample |> @@ -1793,7 +1794,7 @@ This is a clear example of not needing to do anything more than a simple explora On top of the many common misunderstandings about hypothesis testing and $p$-values we listed in Section \@ref(ht-interpretation), another unfortunate consequence of the expanded use of $p$-values and hypothesis testing is a phenomenon known as "p-hacking." 
\index{p-hacking} p-hacking is the act of "cherry-picking" only results that are "statistically significant" while dismissing those that are not, even at the expense of the scientific ideas. Many articles have been written recently about misunderstandings and problems with $p$-values. We encourage you to check some of them out:

1. [Misuse of $p$-values](https://en.wikipedia.org/wiki/Misuse_of_p-values)
-2. [What a nerdy debate about $p$-values shows about science - and how to fix it](https://www.vox.com/science-and-health/2017/7/31/16021654/p-values-statistical-significance-redefine-0005)
+2. [What a nerdy debate about $p$-values shows about science – and how to fix it](https://www.vox.com/science-and-health/2017/7/31/16021654/p-values-statistical-significance-redefine-0005)
3. [Statisticians issue warning over misuse of $P$ values](https://www.nature.com/articles/nature.2016.19503)
4. [You Cannot Trust What You Read About Nutrition](https://fivethirtyeight.com/features/you-cant-trust-what-you-read-about-nutrition/)
5. [A Litany of Problems with p-values](http://www.fharrell.com/post/pval-litany/)

@@ -1822,7 +1823,7 @@ If you want more examples of the `infer` workflow for conducting hypothesis test

We conclude with the `infer` pipeline for hypothesis testing in Figure \@ref(fig:infer-workflow-ht).
-```{r infer-workflow-ht, out.width="100%", out.height="100%", echo=FALSE, fig.cap="infer package workflow for hypothesis testing.", purl=FALSE}
+```{r infer-workflow-ht, out.width="100%", echo=FALSE, fig.cap="infer package workflow for hypothesis testing.", purl=FALSE}
include_graphics("images/flowcharts/infer/ht_diagram_trimmed.png")
```

diff --git a/10-inference-for-regression.Rmd b/10-inference-for-regression.Rmd
index 78590578d..3c5bc5362 100644
--- a/10-inference-for-regression.Rmd
+++ b/10-inference-for-regression.Rmd
@@ -150,7 +150,7 @@ ggplot(UN_data_ch10, aes(x = life_exp, y = fert_rate)) +

Finally, we review how to determine the fitted values and residuals for observations in the dataset. France is one of the UN member states, and suppose we want to determine the fitted fertility rate for France based on the linear regression.
-We start by determining what is the location of France in the `UN_data_ch10` data frame, using `rowid_to_column()` and `filter()` with the variable country equal to "France".
+We start by determining the location of France in the `UN_data_ch10` data frame, using `rowid_to_column()` and `filter()` with the variable country equal to "France."
The `pull()` function converts the row number from a data frame to a single value:

```{r eval=FALSE}

@@ -202,7 +202,7 @@ Based on our regression line we would expect France's fertility rate to be `r lm

The observed fertility rate for France was `r france_data$fert_rate[1]`, so the residual for France is $y_{`r france_id`} - \widehat{y}_{`r france_id`} = `r actual_france` - `r fitted_france` = `r resid_france`$. Using R we are not required to manually calculate the fitted values and residual for each UN member state.
-We do this directly using the regression model `simple_model` and the [`get_regression_points()`](https://moderndive.github.io/moderndive/reference/get_regression_points.html) function.
+We do this directly using the regression model `simple_model` and the `get_regression_points()` function.
To do this only for France, we `filter()` the `r france_id`th observation in the data frame.

```{r eval=FALSE}

@@ -355,7 +355,8 @@ Furthermore, an *estimator* for the standard deviation of $\epsilon_i$ is given

$$s = \sqrt{\frac{\sum_{i=1}^n \left[y_i - (b_0 + b_1 \cdot x_i)\right]^2}{n-2}} = \sqrt{\frac{\sum_{i=1}^n \left(y_i - \widehat{y}_i\right)^2}{n-2}}.$$

These or equivalent calculations are done in R when using the `lm()` function.
-For `old_faithful_2024` we get:
+For `old_faithful_2024` we get the results shown in Table
+\@ref(tab:regtable-ch10-1):

```{r, eval=FALSE}
# Fit regression model:

@@ -409,18 +410,18 @@ These properties will be useful in the next subsection, once we perform theory-b

To wrap up this section, we'll be investigating how regression relates to two different statistical techniques. One of them, the difference in sample means, was covered already in this book; the other, ANOVA, is new to the text but related. We'll see how both can be represented in the regression framework.

-#### Two sample difference in means {-}
+#### Two-sample difference in means {-}

The two-sample difference in means is a common statistical technique used to compare the means of two groups as seen in Section \@ref(ht-case-study). It is often used to determine if there is a significant difference in the mean response between two groups, such as a treatment group and a control group. The two-sample difference in means can be represented in the regression framework by using a dummy variable to represent the two groups.

-Let's again consider the `movies_sample` data frame in the `moderndive` package. We'll compare once more the average rating for the genres of "Action" versus "Romance".
We can use the `lm()` function to fit a linear model with a dummy variable for the genre and then use [`get_regression_table()`](https://moderndive.github.io/moderndive/reference/get_regression_table.html): +Let's again consider the `movies_sample` data frame in the `moderndive` package. We'll compare once more the average rating for the genres of "Action" versus "Romance." We can use the `lm()` function to fit a linear model with a dummy variable for the genre and then use [`get_regression_table()`](https://moderndive.github.io/moderndive/reference/get_regression_table.html): ```{r eval=FALSE} mod_diff_means <- lm(rating ~ genre, data = movies_sample) get_regression_table(mod_diff_means) ``` -```{r echo=FALSE} +```{r diff-means-reg, echo=FALSE} mod_diff_means <- lm(rating ~ genre, data = movies_sample) get_regression_table(mod_diff_means) |> kbl(caption = "Regression table for two-sample difference in means example") |> @@ -431,7 +432,7 @@ get_regression_table(mod_diff_means) |> ``` -Note that `p_value` for the `genre: Romance` row is the $p$-value for the hypothesis test of +Note from Table \@ref(tab:diff-means-reg) that `p_value` for the `genre: Romance` row is the $p$-value for the hypothesis test of $$ H_0: \text{action and romance have the same mean rating} @@ -444,7 +445,7 @@ This $p$-value result matches closely with what was found in Section \@ref(ht-ca #### ANOVA {-} -ANOVA, or analysis of variance, is a statistical technique used to compare the means of three or more groups by seeing if there is a statistically significant difference between the means of multiple groups. ANOVA can be represented in the regression framework by using dummy variables to represent the groups. Let's say we wanted to compare the `popularity` (numeric) values in the `spotify_by_genre` data frame from the `moderndive` package across the genres of "country", "hip-hop", and "rock". 
We use the `slice_sample()` function after narrowing in on our selected columns and filtered rows of interest to see what a few rows of this data frame look like.
+ANOVA, or analysis of variance, is a statistical technique used to compare the means of three or more groups by seeing if there is a statistically significant difference among their means. ANOVA can be represented in the regression framework by using dummy variables to represent the groups. Let's say we wanted to compare the `popularity` (numeric) values in the `spotify_by_genre` data frame from the `moderndive` package across the genres of `country`, `hip-hop`, and `rock`. We use the `slice_sample()` function after narrowing in on our selected columns and filtered rows of interest to see what a few rows of this data frame look like in Table \@ref(tab:spotify-for-anova-slice-five).

```{r echo=-1}
set.seed(6)

@@ -460,7 +461,7 @@ spotify_for_anova |>

(ref:spotify-for-anova-slice) Five randomly selected rows from `spotify_for_anova`

-```{r echo=FALSE}
+```{r spotify-for-anova-slice-five, echo=FALSE}
spotify_for_anova |>
  slice_sample(n = 5) |>
  kbl(caption = "(ref:spotify-for-anova-slice)") |>

@@ -471,9 +472,9 @@ spotify_for_anova |>

-Before we fit a linear model, let's take a look at the boxplot of `track_genre` versus `popularity` to see if there are any differences in the distributions of the three genres.
+Before we fit a linear model, let's take a look at the boxplot of `track_genre` versus `popularity` in Figure \@ref(fig:pop-by-genre-plot) to see if there are any differences in the distributions of the three genres.
-```{r fig.cap="Boxplot of popularity by genre.", fig.height=ifelse(knitr::is_latex_output(), 3.2, 4)} +```{r pop-by-genre-plot, fig.cap="Boxplot of popularity by genre.", fig.height=ifelse(knitr::is_latex_output(), 3.2, 4)} ggplot(spotify_for_anova, aes(x = track_genre, y = popularity)) + geom_boxplot() + labs(x = "Genre", y = "Popularity") @@ -488,14 +489,14 @@ mean_popularities_by_genre <- spotify_for_anova |> mean_popularities_by_genre ``` -We can use the `lm()` function to fit a linear model with dummy variables for the genres. We'll then use the [`get_regression_table()`](https://moderndive.github.io/moderndive/reference/get_regression_table.html) function to get the regression table. +We can use the `lm()` function to fit a linear model with dummy variables for the genres. We'll then use the [`get_regression_table()`](https://moderndive.github.io/moderndive/reference/get_regression_table.html) function to get the regression table in Table \@ref(tab:anova-reg-table). ```{r eval=FALSE} mod_anova <- lm(popularity ~ track_genre, data = spotify_for_anova) get_regression_table(mod_anova) ``` -```{r echo=FALSE} +```{r anova-reg-table, echo=FALSE} mod_anova <- lm(popularity ~ track_genre, data = spotify_for_anova) get_regression_table(mod_anova) |> kbl(caption = "Regression table for ANOVA example") |> @@ -666,7 +667,7 @@ $$ The formula for a 95% confidence interval for $\beta_1$ is given by $b_1 \pm q \cdot SE(b_1)$ where the critical value $q$ is determined by the level of confidence required, the sample size used, and the corresponding degrees of freedom needed for the $t$-distribution. We now illustrate how to find the 95% confidence interval for the slope in the Old Faithful example manually, but we show later how to do this directly in R using the function [`get_regression_table()`](https://moderndive.github.io/moderndive/reference/get_regression_table.html). -First, observe that $n = `r n_old_faithful`$, so the degrees of freedom are $n-2 = 112$. 
The critical value for a 95% confidence interval on a $t$-distribution with 112 degrees of freedom is $q = 1.981$. Second, the estimates $b_0$, $b_1$, and $s$ were found earlier and are shown here again:
+First, observe that $n = `r n_old_faithful`$, so the degrees of freedom are $n-2 = 112$. The critical value for a 95% confidence interval on a $t$-distribution with 112 degrees of freedom is $q = 1.981$. Second, the estimates $b_0$, $b_1$, and $s$ were found earlier and are shown here again in Table \@ref(tab:regtable-ch10-2):

```{r regtable-ch10-2, echo=FALSE, purl=FALSE}
# Fit regression model:

@@ -708,7 +709,7 @@ lb0 <- round(b0 - q*se_b0,3)

ub0 <- round(b0 + q*se_b0,3)
# t
t_stat <- round(b1/se_b1,3)
-p_value <- round(2*(1 - pt(abs(t_stat), n_old_faithful-2)),3)
+p_value <- round(2*(1 - pt(abs(t_stat), n_old_faithful-2)), 3)
```

$$SE(b_1) = \frac{s}{\sqrt{\sum_{i=1}^n(x_i - \bar x)^2}} = \frac{`r s`}{`r denom_se_b1`} = `r se_b1`.$$

Finally, the 95% confidence interval for $\beta_1$ is given by:

$$\begin{aligned}
b_1 &\pm q \cdot SE(b_1)\\
-= `r b1` &\pm `r q`\cdot `r se_b1`\\
-= (`r lb1` &, `r ub1`)
+&= `r b1` \pm `r q`\cdot `r se_b1`\\
+&= (`r lb1`, `r ub1`)
\end{aligned}$$

We are 95% confident that the population slope $\beta_1$ is a number between `r lb1` and `r ub1`.

@@ -727,8 +728,8 @@ The construction of a 95% confidence interval for $\beta_0$ follows exactly the

$$\begin{aligned}
b_0 &\pm q \cdot SE(b_0)\\
-= `r b0` &\pm `r q`\cdot `r se_b0`\\
-= (`r lb0` &, `r ub0`)
+&= `r b0` \pm `r q`\cdot `r se_b0`\\
+&= (`r lb0`, `r ub0`)
\end{aligned}$$

The results of the confidence intervals are valid only if the linear model assumptions are satisfied.
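The critical value and interval arithmetic above can be reproduced in a few lines of base R. This is a minimal sketch: `qt()` supplies the quantile, while the values assigned to `b1` and `se_b1` are placeholders standing in for the estimates read off the regression table, not the actual Old Faithful numbers.

```{r}
# Critical value for a 95% confidence interval, n - 2 = 112 degrees of freedom
q <- qt(0.975, df = 112)  # approximately 1.981

# Placeholder estimates standing in for b1 and SE(b1) from the regression table
b1 <- 0.5
se_b1 <- 0.05

# 95% confidence interval: b1 -/+ q * SE(b1)
c(lower = b1 - q * se_b1, upper = b1 + q * se_b1)
```

Replacing the placeholders with the fitted slope and its standard error reproduces the interval computed manually above.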
@@ -780,7 +781,7 @@ We can intuitively think of the $p$-value as quantifying how "extreme" the estim

For a two-sided test, if the test statistic is $t = 2$, for example, the $p$-value is calculated as the area under the $t$-curve to the left of $-2$ and to the right of $2$, as shown in Figure \@ref(fig:pvalue1).

-```{r pvalue1, echo=FALSE, fig.height=ifelse(knitr::is_latex_output(), 3, 4), fig.cap="Illustration of a two-sided p-value for a t-test."}
+```{r pvalue1, echo=FALSE, fig.height=ifelse(knitr::is_latex_output(), 2, 4), fig.cap="Illustration of a two-sided p-value for a t-test."}
n <- n_old_faithful
shade <- function(t, a,b) {
  z = dt(t, df = n-2)

@@ -815,7 +816,7 @@ We have enough statistical evidence to conclude that there is a linear relations

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** In the context of a linear regression model, what does the null hypothesis $H_0: \beta_1 = 0$ represent?

-- A. There is no linear association between the response and the the explanatory variable.
+- A. There is no linear association between the response and the explanatory variable.
- B. The difference between the observed and predicted values is zero.
- C. The linear association between response and explanatory variable crosses the origin.
- D. The probability of committing a Type II Error is zero.

@@ -978,7 +979,7 @@ best_fit_plot <- ggplot(old_faithful_2024, aes(x = duration, y = waiting)) +

best_fit_plot
```

-We can calculate all the $n = `r n_old_faithful`$ residuals by applying the [`get_regression_points()`](https://moderndive.github.io/moderndive/reference/get_regression_points.html) function to the regression model `model_1`.
+We can calculate all the $n = `r n_old_faithful`$ residuals by applying the `get_regression_points()` function to the regression model `model_1`.
Observe how the resulting values of `residual` are roughly equal to `waiting - waiting_hat` (there may be a slight difference due to rounding error).
```{r} @@ -992,15 +993,12 @@ fitted_and_residuals #### Residual diagnostics {-} *Residual diagnostics* are used to verify conditions **L**, **N**, and **E**. -While there are more sophisticated statistical approaches that can be used, we focus on data visualization. - -One of the most useful plots is a *residual plot*, which is a scatterplot of the residuals against the fitted values. +While there are more sophisticated statistical approaches that can be used, we focus on data visualization. One of the most useful plots is a *residual plot*, which is a scatterplot of the residuals against the fitted values. We use the `fitted_and_residuals` object to draw the scatterplot using `geom_point()` with the fitted values (`waiting_hat`) on the x-axis and the residuals (`residual`) on the y-axis. In addition, we add titles to our axes with `labs()` and draw a horizontal line at 0 for reference using `geom_hline()` and `yintercept = 0`, as shown in the following code: ```{r eval=FALSE} -fitted_and_residuals |> - ggplot(aes(x = waiting_hat, y = residual)) + +ggplot(fitted_and_residuals, aes(x = waiting_hat, y = residual)) + geom_point() + labs(x = "duration", y = "residual") + geom_hline(yintercept = 0, col = "blue") @@ -1009,7 +1007,7 @@ fitted_and_residuals |> In Figure \@ref(fig:scatt-and-residual) we show this residual plot (right plot) alongside the scatterplot for `duration` vs `waiting` (left plot). Note how the residuals on the left plot are determined by the vertical distance between the observed response and the linear regression. On the right plot (residuals), we have removed the effect of the linear regression and the effect of the residuals is seen as the vertical distance from each point to the zero line (y-axis). -Using this residuals plot, it is easier to spot patterns or trends that may be in conflict with the assumptions of the model, as we describe below. 
+Using this residuals plot, it is easier to spot patterns or trends that may be in conflict with the assumptions of the model, as we describe next. ```{r scatt-and-residual, echo=FALSE, fig.cap="The scatterplot and residual plot for the Old Faithful data.", purl=FALSE, message=FALSE, fig.height=ifelse(knitr::is_latex_output(), 2.3, 4)} g1 <- ggplot(old_faithful_2024, aes(x = duration, y = waiting)) + @@ -1029,7 +1027,7 @@ In what follows we show how the residual plot can be used to determine whether t ##### Linearity of relationship {-} We want to check whether the association between the response $y_i$ and the explanatory variable $x_i$ is **L**inear. -We expect, due the error term in the model, that the scatterplot of residuals against fitted values shows some random variation, but the variation should not be systematic in any direction and the trend should not show non-linear patterns. +We expect, due to the error term in the model, that the scatterplot of residuals against fitted values shows some random variation, but the variation should not be systematic in any direction and the trend should not show non-linear patterns. A scatterplot of residuals against fitted values showing no patterns but simply a cloud of points that seems randomly assigned in every direction with the residuals' variation (y-axis) about the same for any fitted values (x-axis) and with points located as much above as below the zero line is called a *null* plot. Plots of residuals against fitted values or regressors that are *null* plots do not show any evidence against the assumptions of the model. @@ -1072,7 +1070,7 @@ If they are not independent, some patterns of dependency may appear in the obser The residuals could be used for this purpose too as they are a rough approximation of the error terms. If data was collected in a time sequence or other type of sequence, the residuals may also help us determine lack of independence by plotting the residuals against time. 
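In ggplot2 terms, such a residuals-versus-time check is a short sketch. Here `residuals_over_time` is a hypothetical data frame with a `time` column recording the order or date of collection and a `residual` column; it is not an object defined in the text.

```{r eval=FALSE}
# Sketch: residuals in collection order; a visible trend or cycle here
# would cast doubt on the independence assumption
ggplot(residuals_over_time, aes(x = time, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0, col = "blue") +
  labs(x = "time", y = "residual")
```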
As it happens, the Old Faithful geyser eruption example does have a time component we can use: the `old_faithful_2024` dataset contains the `date` variable.
-We show the plot of `residuals` against `date` (time):
+We show the plot of `residuals` against `date` (time) in Figure \@ref(fig:time-plot):

```{r time-plot, echo=FALSE, fig.cap="Scatterplot of date (time) vs residuals for the Old Faithful example.", fig.height=ifelse(knitr::is_latex_output(), 3, 4), message=FALSE, purl=FALSE}
old_faithful_2024 |>

@@ -1111,7 +1109,7 @@ We can also use a *quantile-to-quantile* plot or *QQ-plot*.

The QQ-plot creates a scatterplot of the quantiles (or percentiles) of the residuals against the quantiles of a normal distribution. If the residuals follow approximately a normal distribution, the scatterplot would be approximately a 45-degree straight line. To draw a QQ-plot for the Old Faithful geyser eruptions example, we use the `fitted_and_residuals` data frame that contains the residuals of the regression, `ggplot()` with `aes(sample = residual)` and the `geom_qq()` function for drawing the QQ-plot.
-We also include the function `geom_qq_line()` to add a 45 degree line for comparison:
+We also include the function `geom_qq_line()` to add a 45-degree line for comparison:

```{r eval = FALSE}
fitted_and_residuals |>

@@ -1139,7 +1137,7 @@ grid.arrange(g1, g2, ncol=2)
```

The histogram of the residuals shown in Figure \@ref(fig:model1residualshist) (left plot) does not appear exactly normal as there are some deviations, such as the highest bin value appearing just to the right of the center. But the histogram does not seem too far from normality either. The QQ-plot (right plot) supports this conclusion.
-The scatterplot is not exactly on the 45 degree line but it does not deviate much from it either.
+The scatterplot is not exactly on the 45-degree line but it does not deviate much from it either.
We compare these results with residuals found by simulation that do not appear to follow normality, as shown in Figure \@ref(fig:not-normal-residuals). In the case of the model yielding the clearly non-normal residuals on the right, any results from an inference for regression would not be valid.

@@ -1167,11 +1165,9 @@ g1 + g2 + plot_layout(ncol = 2)

##### Equality or constancy of variance for errors {-}

The final assumption we check is the **E**quality or constancy of the variance for the error term across all fitted values or regressor values.
-Constancy of variance is also known as \index{homoskedasticity}*homoskedasticity*.
-
-Using the residuals again as rough estimates of the error terms, we want to check that the dispersion of the residuals is the same for any fitted value $\widehat{y}_i$ or regressor $x_i$.
+Constancy of variance is also known as \index{homoskedasticity}*homoskedasticity*. Using the residuals again as rough estimates of the error terms, we want to check that the dispersion of the residuals is the same for any fitted value $\widehat{y}_i$ or regressor $x_i$.
In Figure \@ref(fig:scatt-and-residual), we showed the scatterplot of residuals against fitted values (right plot).
-We can also produce the scatterplot of residuals against the regressor `duration`:
+We can also produce the scatterplot of residuals against the regressor `duration` in Figure \@ref(fig:residual-plot):

```{r residual-plot, fig.cap="Plot of residuals against the regressor.", message=FALSE, fig.height=ifelse(knitr::is_latex_output(), 1.5, 4)}
ggplot(fitted_and_residuals, aes(x = duration, y = residual)) +

@@ -1180,7 +1176,7 @@ ggplot(fitted_and_residuals, aes(x = duration, y = residual)) +

-With the exception of the change of scale on the x-axis, it is equivalent for visualization purposes to producing a plot of residuals ($e_i$) against either the fitted values ($\widehat{y}_i$) or the regressor values ($x_i$).
+With the exception of the change of scale on the x-axis, plotting the residuals ($e_i$) against the fitted values ($\widehat{y}_i$) is equivalent, for visualization purposes, to plotting them against the regressor values ($x_i$).
This happens because the fitted values are a linear transformation of the regressor values, $\widehat{y}_i = b_0 + b_1\cdot x_i$.

Observe the vertical dispersion or spread of the residuals for different values of `duration`:

@@ -1251,7 +1247,7 @@ On the other hand, when the number of violations to the normality assumption is

Using the advanced methods suggested earlier may correct these problems too.
When the **E**quality or constancy of the variance is not met, adjusting the variance by adding weights to individual observations may be possible if relevant information is available that makes those weights known.
-This method is called *weighted linear regression* or *weighted least squares*, and it is a direct extension to the model we have studied.
+This method is called *weighted linear regression* or *weighted least squares*, and it is a direct extension of the model we have studied.

If information on the weights is not available, some methods can be used to provide an estimator for the internal structure of the variance in the model.
One of the most popular of these methods is called the *sandwich estimator*.

@@ -1271,7 +1267,7 @@ In this way, the stakeholders of any analysis are aware of the model's shortcomi

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Use the `un_member_states_2024` data frame included in the `moderndive` package with response variable fertility rate (`fert_rate`) and the regressor life expectancy (`life_exp`).

- Use the [`get_regression_points()`](https://moderndive.github.io/moderndive/reference/get_regression_points.html) function to get the observed values, fitted values, and residuals for all UN member countries.
+- Use the `get_regression_points()` function to get the observed values, fitted values, and residuals for all UN member countries.
- Perform a residual analysis and look for any systematic patterns in the residuals. Ideally, there should be little to no pattern but comment on what you find here.

@@ -1391,7 +1387,7 @@ se_ci <- bootstrap_distn_slope |>

se_ci
```

-The resulting standard error-based 95% confidence interval for $\beta_1$ of $(`r se_ci[["lower_ci"]]`, `r se_ci[["upper_ci"]]`)$ is slightly different than the percentile-based confidence interval. Note that neither of these confidence intervals contain 0 and are entirely located above 0. This is suggesting that there is in fact a meaningful positive relationship between waiting times and duration for Old Faithful eruptions.
+The resulting standard error-based 95% confidence interval for $\beta_1$ of $(`r se_ci[["lower_ci"]]`, `r se_ci[["upper_ci"]]`)$ is slightly different from the percentile-based confidence interval. Note that neither of these confidence intervals contains 0; both are located entirely above 0. This suggests that there is in fact a meaningful positive relationship between waiting times and duration for Old Faithful eruptions.

### Hypothesis test for population slope using `infer` {#infer-hypo-slr}

@@ -1435,7 +1431,7 @@ if (!file.exists("rds/null_distn_slope.rds")) {

Observe the resulting null distribution for the fitted slope $b_1$ in Figure \@ref(fig:null-distribution-slope).
-```{r null-distribution-slope, echo=FALSE, fig.cap="Null distribution of slopes.", fig.height=ifelse(knitr::is_latex_output(), 2.5, 4), purl=FALSE}
+```{r null-distribution-slope, echo=FALSE, fig.cap="Null distribution of slopes.", fig.height=ifelse(knitr::is_latex_output(), 2, 4), purl=FALSE}
visualize(null_distn_slope)
```

@@ -1546,14 +1542,14 @@ coffee_data
```

By looking at the fourth row we can tell, for example, that the `total_cup_points` are `r coffee_data[4,"total_cup_points"][[1]]` with `aroma` score equal to `r coffee_data[4,"aroma"][[1]]` points, `flavor` score equal to `r coffee_data[4,"flavor"][[1]]` points, `moisture_percentage` equal to `r coffee_data[4,"moisture_percentage"][[1]]`%, and `r as.character(coffee_data[4,"continent_of_origin"][[1]])` is the `continent_of_origin`.
-We also display the summary for these variables:
+We also display the summary for these variables in Table \@ref(tab:coffee-tidy-summary):

```{r eval=FALSE}
coffee_data |>
  tidy_summary()
```

-```{r echo=FALSE}
+```{r coffee-tidy-summary, echo=FALSE}
coffee_data |>
  tidy_summary() |>
  kbl(

@@ -1593,7 +1589,7 @@ Using this notation, for example, $x_{i2}$ is the value of the $i$th observation

We can now create data visualizations. When performing multiple regression it is useful to construct a scatterplot matrix, a matrix that contains the scatterplots for all the variable pair combinations.
-In R, we can use the function `ggpairs()` from package `GGally` to generate a scatterplot matrix and some useful additional information. In Figure \@ref(fig:coffee-scatter-matrix) we present the scatterplot matrix for all the variables of interest.
+In R, we can use the function `ggpairs()` from package `GGally` to generate a scatterplot matrix and some useful additional information. In Figure \@ref(fig:coffee-scatter-matrix), we present the scatterplot matrix for all the variables of interest.
```{r coffee-scatter-matrix, fig.cap="Scatterplot matrix for coffee variables of interest.", fig.height=8.5, purl=FALSE, message=FALSE}
coffee_data |>

@@ -1604,8 +1600,8 @@ We first comment on the plots with the response (`total_cup_points`) on the vert

They are located in the last row of plots in the figure. When plotting `total_cup_points` against `aroma` (bottom row, leftmost plot) or `total_cup_points` against `flavor` (bottom row, second plot from the left), we observe a strong and positive linear relationship. The plot of `total_cup_points` against `moisture_percentage` (bottom row, third plot from the left) does not provide much information and it appears that these variables are not associated in any way, but we observe an outlying observation for `moisture_percentage` around zero.
-The plot of `total_cup_points` versus `continent_of_origin` (bottom row, four plot from the left) shows four histograms, one for each factor level; the fourth group seems to have a larger dispersion, even though the number of observations seems smaller.
-The associated boxplots connecting these two variables (fourth row, rightmost plot) suggest that the first factor level of `"Africa"` has a higher mean cup points than the other three.
+The plot of `total_cup_points` versus `continent_of_origin` (bottom row, fourth plot from the left) shows four histograms, one for each factor level; the fourth group seems to have a larger dispersion, even though the number of observations seems smaller.
+The associated boxplots connecting these two variables (fourth row, rightmost plot) suggest that the first level of the factor (`"Africa"`) has higher mean cup points than the other three.

It is often useful to also find any linear associations between numerical regressors as is the case for the scatterplot for `aroma` against `flavor` which suggests a strong positive linear association.
By contrast, almost no relationship can be found when observing `moisture_percentage` against either `aroma` or `flavor`.
@@ -1626,9 +1622,9 @@ Note also that the correlation between `aroma` and `flavor` (`r corr_table[2,1]`

### Least squares for multiple regression {#least-squares-multiple}

-Observe that we have three numerical regressors and one factor (`continent_of_origin`) with four factor levels: `"Africa"`, `"Asia"`, `"North America"`, and `"South America"`.
+Observe that we have three numerical regressors and one factor (`continent_of_origin`) with four levels: `"Africa"`, `"Asia"`, `"North America"`, and `"South America"`.
To introduce the factor levels in the linear model, we represent the factor levels using dummy variables as described in Subsection \@ref(model4interactiontable).
-These are the dummy variable that we need:
+These are the dummy variables that we need:

$$\begin{aligned}
D_1 &= \left\{
@@ -1723,7 +1719,7 @@ This means that for some random samples the estimated value $b_1$ will be greate

- The least-square estimators are linear combinations of the observed responses $y_1$, $y_2$, $\dots$, $y_n$. For $b_3$, for example, there are known constants $c_1$, $c_2$, $\dots$, $c_n$ such that $b_3 = \sum_{i=1}^n c_iy_i$.

-When using the `lm()` function, all the necessary calculations are done in R. For the coffee example, we get:
+When using the `lm()` function, all the necessary calculations are done in R. For the coffee example, we get the results shown in Table \@ref(tab:regtable-coffee):

```{r, eval=FALSE}
# Fit regression model:
@@ -1851,7 +1847,7 @@ We are now ready to discuss inference for multiple linear regression.

## Theory-based inference for multiple linear regression {#theory-multiple-regression}

-In this section we introduce some of the conceptual framework needed to understand inference in multiple linear regression.
+In this section, we introduce some of the conceptual framework needed to understand inference in multiple linear regression. 
We illustrate this framework using the coffee example and the R function [`get_regression_table()`](https://moderndive.github.io/moderndive/reference/get_regression_table.html) introduced in Subsection \@ref(regression-table).
Inference for multiple linear regression is a natural extension of inference for simple linear regression.
Recall that the linear model for the $i$th observation is given by
@@ -1884,7 +1880,7 @@ $$b_j \sim Normal(\beta_j, SD(b_j))$$
for $j = 1, \dots, p$. Note also that $\sigma$ is typically unknown, and it is estimated using the estimated standard deviation $s$ instead of $\sigma$. The estimated standard deviation for $b_j$ is called the standard error of $b_j$ and written as $SE(b_j)$. Again, remember that the standard error is a function of $s$.
-The standard errors are shown when applying the [`get_regression_table()`](https://moderndive.github.io/moderndive/reference/get_regression_table.html) function on a regression model. This is the output for model `mod_mult`:
+The standard errors are shown when applying the [`get_regression_table()`](https://moderndive.github.io/moderndive/reference/get_regression_table.html) function on a regression model. The output for the model `mod_mult` is in Table \@ref(tab:mult-model-1):

(ref:mod-mult-table) The regression table for `mod_mult`
@@ -1921,7 +1917,7 @@ Using the `coffee_data` example, suppose that we consider the model:
$$y_i = \beta_0 + \beta_1 \cdot x_{i1} + \beta_2 \cdot x_{i2} + \beta_3 \cdot x_{i3} + \epsilon_i\,$$
where $\beta_1$, $\beta_2$, and $\beta_3$ are the parameters for the partial slopes of `aroma`, `flavor`, and `moisture_percentage`, respectively.
Recall that we assume that the parameters $\beta_1$, $\beta_2$, and $\beta_3$ are constants, unknown to us, but constants.
-We use multiple linear regression and find the least-square estimates $b_1$, $b_2$, and $b_3$, respectively. 
The results using the `coffee_data` are calculated and stored in the object `mod_mult_1` and shown below: +We use multiple linear regression and find the least-square estimates $b_1$, $b_2$, and $b_3$, respectively. The results using the `coffee_data` are calculated and stored in the object `mod_mult_1` and shown in Table \@ref(tab:regtable-coffee-mult-1): ```{r, eval=FALSE} # Fit regression model: @@ -1944,7 +1940,7 @@ lm_data <- data.frame("Coefficients" = c("b0", "b1", "b2", "b3", "s"),"Values" = lm_data |> kbl( digits = 3, - caption = "Coffee example linear regression coefficients", + caption = "Coffee example linear regression coefficients with three regressors", booktabs = TRUE, linesep = "" ) |> @@ -1958,7 +1954,7 @@ Now, assume that we decide instead to construct a model without the regressor `f $$y_i = \beta_0 + \beta_1 \cdot x_{i1} + \beta_3 \cdot x_{i3} + \epsilon_i\,.$$ Using multiple regression we compute the least-square estimates $b'_1$ and $b'_3$, respectively, with the $'$ denoting potentially different coefficient values in this different model. 
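
Why would $b'_1$ differ from $b_1$? A toy simulation (made-up data and coefficients, not `coffee_data`; everything here is an illustrative assumption) sketches the mechanism: when a dropped regressor is correlated with a kept one, the kept regressor's slope absorbs part of the dropped regressor's effect.

```r
# Simulate two correlated regressors and a response built from both.
set.seed(1)
x1 <- rnorm(100)
x2 <- 0.8 * x1 + rnorm(100, sd = 0.6)  # x2 is correlated with x1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(100)

coef(lm(y ~ x1 + x2))  # slope of x1 is near its true value of 2
coef(lm(y ~ x1))       # slope of x1 inflates once x2 is dropped
```

In the second fit, the slope of `x1` moves toward $2 + 3 \cdot 0.8 = 4.4$: its own effect plus the correlated part of the effect of `x2` that it now absorbs.
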
-The results using the `coffee_data` are calculated and stored in the object `mod_mult_2` and shown below:
+The results using the `coffee_data` are calculated and stored in the object `mod_mult_2` and shown in Table \@ref(tab:regtable-coffee-mult-2):

```{r, eval=FALSE}
# Fit regression model:
@@ -1981,7 +1977,7 @@ lm_data <- data.frame("Coefficients" = c("b'0", "b'1", "b'3", "s'"),
lm_data |>
  kbl(
    digits = 3,
-    caption = "Coffee example linear regression coefficients",
+    caption = "Coffee example linear regression coefficients with two regressors",
    booktabs = TRUE,
    linesep = ""
  ) |>
@@ -2050,10 +2046,10 @@ b_1 &\pm q \cdot SE(b_1)\\
= (`r round(lb_mult,2)` &, `r round(ub_mult,2)`)
\end{aligned}$$

-The interpretation of this interval is the customary: "We are 95% confident that the population partial slope for `aroma` ($\beta_1$) in the model `mod_mult` is a number between `r round(lb_mult,2)` and `r round(ub_mult,2)`".
+The interpretation of this interval is the customary one: "We are 95% confident that the population partial slope for `aroma` ($\beta_1$) in the model `mod_mult` is a number between `r round(lb_mult,2)` and `r round(ub_mult,2)`."

We find these values using [`get_regression_table()`](https://moderndive.github.io/moderndive/reference/get_regression_table.html) in model `mod_mult`.
-This time, however, we add the argument `conf.level = 0.98` to get 98% confidence intervals.
+This time, however, we add the argument `conf.level = 0.98` to get 98% confidence intervals as shown in Table \@ref(tab:mult-model-2). 
```{r eval=FALSE}
get_regression_table(mod_mult, conf.level = 0.98)
```
@@ -2110,7 +2106,7 @@ H_0: \beta_2 = 0\quad \text{with }\beta_0, \beta_1, \beta_3, \beta_{02}, \beta_{
H_A: \beta_2 \ne 0\quad \text{with }\beta_0, \beta_1, \beta_3, \beta_{02}, \beta_{03}, \beta_{04} \text{ given and arbitrary.}
\end{aligned}$$

-The relevant code is shown below with the output:
+The relevant code is shown as follows, along with the output:

(ref:mod-mult-1-table) The regression table for total_cup_points by aroma + flavor + moisture_percentage
@@ -2146,11 +2142,11 @@ Table \@ref(tab:mult-model-3) provides information for all the coefficients. Obs

### Hypothesis test for model comparison {#hypo-test-mult-lm-1}

-There is another hypothesis test that can be performed for multiple linear regression, a test that compares two models, one with a given set of regressors called the *full model* and the other with with only a subset of those regressors called the *reduced model*.
+There is another hypothesis test that can be performed for multiple linear regression, a test that compares two models, one with a given set of regressors called the *full model* and the other with only a subset of those regressors called the *reduced model*.

Using the `coffee_data` example, suppose that the full model is:

-$$Y = \beta_0 + \beta_1 \cdot X_1 + \beta_2 \cdot X_2 + \beta_3 \cdot X_3 + \epsilon$$ where $\beta_1$, $\beta_2$, and $\beta_3$ are the parameters for the partial slopes of `aroma`, `flavor`, and `moisture_precentage`, respectively. The multiple linear regression outcome using the `coffee_data` dataset on this model are stored in the object `mod_mult_1`. The reduced model does not contain the regressor `flavor` and is given by
+$$Y = \beta_0 + \beta_1 \cdot X_1 + \beta_2 \cdot X_2 + \beta_3 \cdot X_3 + \epsilon$$ where $\beta_1$, $\beta_2$, and $\beta_3$ are the parameters for the partial slopes of `aroma`, `flavor`, and `moisture_percentage`, respectively. 
The multiple linear regression outcomes using the `coffee_data` dataset on this model are stored in the object `mod_mult_1`. The reduced model does not contain the regressor `flavor` and is given by

$$Y = \beta_0 + \beta_1 \cdot X_1 + \beta_3 \cdot X_3 + \epsilon.$$
@@ -2196,7 +2192,7 @@ The test statistic is given in the second row for the column `F`.
The test statistic is $F = `r round(anova_table[2,5],2)`$ and the associated $p$-value is near zero. In conclusion, we reject the null hypothesis and conclude that the full model was needed.

-The ANOVA test can be used to test more then one regressor at a time.
+The ANOVA test can be used to test more than one regressor at a time.
This is useful when you have factors (categorical variables) in your model, as all the factor levels should be tested simultaneously.
We use our example again, this time making the full model the model with factor `continent_of_origin` in addition to all three numerical regressors, and the reduced model is the model without `continent_of_origin`.
These models have been computed already in `mod_mult` and `mod_mult_1`, respectively.
@@ -2233,11 +2229,11 @@ In particular, we continue using the plot of residuals against the fitted values
While most of the treatment and interpretations are similar to those presented for simple linear regression, when the plot of residuals against fitted values is not a null plot, we know that there is some sort of violation of one or more of the assumptions of the model.
It is no longer clear what the reason for this is, but at least one assumption is not met.
-On the other hand, if the residuals against fitted values plot appears to be close to a null plot, none of the assumptions have been broken and we can proceed with the use of this model.
+On the other hand, if the residuals-against-fitted-values plot appears to be close to a null plot, none of the assumptions have been broken and we can proceed with the use of this model.

Let's use the coffee example. 
Recall that we have shown that regressors `aroma`, `flavor`, and `continent_of_origin` were statistically significant, but `moisture_percentage` was not. -So, we create a model only with the relevant regressors, called `mod_mult_final`, determine the residuals, and create a plot of residuals against fitted values and a QQ-plot using the `grid.arrange()` function from the `gridExtra` package: +So, we create a model only with the relevant regressors called `mod_mult_final`, determine the residuals, and create a plot of residuals against fitted values and a QQ-plot using the `grid.arrange()` function from the `gridExtra` package in Figure \@ref(fig:grid-arrange-plot-check): ```{r} # Fit regression model: @@ -2247,7 +2243,7 @@ mod_mult_final <- lm(total_cup_points ~ aroma + flavor + continent_of_origin, fit_and_res_mult <- get_regression_points(mod_mult_final) ``` -```{r, fig.cap="Residuals vs. fitted values plot and QQ-plot for the multiple regression model.", fig.height=ifelse(knitr::is_latex_output(), 4, 4)} +```{r, grid-arrange-plot-check, fig.cap="Residuals vs. fitted values plot and QQ-plot for the multiple regression model.", fig.height=ifelse(knitr::is_latex_output(), 2, 4)} g1 <- fit_and_res_mult |> ggplot(aes(x = total_cup_points_hat, y = residual)) + geom_point() + @@ -2261,8 +2257,8 @@ grid.arrange(g1, g2, ncol=2) The plot of residuals against fitted values (left) appears to be close to a null plot. This result is desirable because it supports the **L**inearity condition as no patterns are observed in this plot. The **E**qual or constant variance also holds as the vertical dispersion seems to be fairly uniform for any fitted values. -Since we have assumed the data collected was random and there are no time sequences or other sequences to consider, the assumption of **I**ndependence seem to be acceptable too. -Finally, the QQ-plot suggests (with the exception of one or two observations) that the residuals follow approximately the normal distribution. 
We can conclude that this model seems to be good enough in holding the assumptions of the model.
+Since we have assumed the data collected was random and there are no time sequences or other sequences to consider, the assumption of **I**ndependence seems to be acceptable too.
+Finally, the QQ-plot suggests (with the exception of one or two observations) that the residuals follow approximately the normal distribution. We can conclude that this model seems to be good enough to hold the assumptions of the model.

This example concludes our treatment of theory-based inference.
We now proceed to study simulation-based inference.
@@ -2326,7 +2322,7 @@ observed_fit <- coffee_data |>
observed_fit
```

-As we would expect, these values match up with the values of `mod_mult_table` given in the first two columns there:
+As we would expect, these values match up with the values of `mod_mult_table` given in the first two columns of Table \@ref(tab:mod-mult-table-again):

(ref:mod-mult-again) The regression table for `mod_mult`
@@ -2334,7 +2330,7 @@ As we would expect, these values match up with the values of `mod_mult_table` gi
mod_mult_table
```

-```{r echo=FALSE}
+```{r mod-mult-table-again, echo=FALSE}
mod_mult_table |>
  kbl(
    digits = 3,
@@ -2348,11 +2344,11 @@ mod_mult_table |>
)
```

-We will use the `observed_fit` values as our point estimates for the partial slopes in the confidence intervals.
+The `observed_fit` values will be our point estimates for the partial slopes in the confidence intervals.

#### Bootstrap distribution for the partial slopes {-}

-We will now construct the bootstrap distribution for the partial slopes using the `infer` workflow. Just like in Section \@ref(infer-ci-slr), we are now resampling entire rows of values. We construct the bootstrap distribution for the fitted partial slopes using our full sample of `r n_coffee` coffees:
+We will now find the bootstrap distribution for the partial slopes using the `infer` workflow. 
Just like in Section \@ref(infer-ci-slr), we are now resampling entire rows of values to construct the bootstrap distribution for the fitted partial slopes using our full sample of `r n_coffee` coffees: 1. `specify()` the variables of interest in `coffee_data` with the formula: `total_cup_points ~ aroma + flavor + moisture_percentage + continent_of_origin`. 2. `generate()` replicates by using `bootstrap` resampling with replacement @@ -2371,7 +2367,7 @@ coffee_data |> ```{r echo=FALSE} # Fix the width for the explanatory variable output -#options(width = 125) +#options(width = 150) if (!file.exists("rds/generated_distn_slopes.rds")) { set.seed(76) generated_distn_slopes <- coffee_data |> @@ -2420,10 +2416,31 @@ boot_distribution_mlr Since we have 7 coefficients in our model corresponding to the intercept, three levels of `continent_of_origin`, `aroma`, `flavor`, and `moisture_percentage`, we have 7 rows for each `replicate`. This results in a total of `r nrow(boot_distribution_mlr)` rows in the `boot_distribution_mlr` data frame. We can visualize the bootstrap distribution for the partial slopes in Figure \@ref(fig:boot-distn-slopes). -```{r boot-distn-slopes, fig.cap="Bootstrap distributions of partial slopes.", fig.height=8, fig.width=6} +```{r eval=FALSE} visualize(boot_distribution_mlr) ``` +```{r echo=FALSE} +boot_mlr_viz <- visualize(boot_distribution_mlr) +if (!file.exists("images/boot_mlr_viz.png")) { + ggsave( + filename = "images/boot_mlr_viz.png", + plot = boot_mlr_viz, + width = 6, + height = 11, + dpi = 320 + ) +} +``` + +```{r boot-distn-slopes, echo=FALSE, out.width="68%", fig.height=12, fig.cap="Bootstrap distributions of partial slopes."} +if(is_latex_output()) { + knitr::include_graphics("images/boot_mlr_viz.png") +} else { + boot_mlr_viz +} +``` + #### Confidence intervals for the partial slopes {-} As we did in Section \@ref(infer-ci-slr), we can construct 95% confidence intervals for the partial slopes. 
Here, we focus on using the percentile-based method with the `get_confidence_interval()` function and `type = "percentile"` to construct these intervals.
@@ -2437,9 +2454,11 @@ confidence_intervals_mlr <- boot_distribution_mlr |>
confidence_intervals_mlr
```

-We can also visualize these confidence intervals in Figure \@ref(fig:ci-slopes-multiple).
+In reviewing the confidence intervals, we note that the intervals for `aroma` and `flavor` do not include 0. This suggests that both are statistically significant: `aroma` and `flavor` have a meaningful and reliable impact on the response variable in the presence of the other predictors.

-In reviewing the confidence intervals, we note that the confidence intervals for `aroma` and `flavor` do not include 0, which suggests that they are statistically significant. We also note that 0 is included in the confidence interval for `moisture_percentage`, which again provides evidence that it might not be a useful regressor in this multiple linear regression model.
+We also note that 0 is included in the confidence interval for `moisture_percentage`. This again provides evidence that it might not be a useful regressor in this multiple linear regression model: `moisture_percentage` does not contribute much to explaining the variability in the response variable after accounting for the other predictors.
+
+We can also visualize these confidence intervals, as shown in Figure \@ref(fig:ci-slopes-multiple).

```{r ci-slopes-multiple, fig.cap="95% confidence intervals for the partial slopes.", fig.height=8.5, fig.width=6}
visualize(boot_distribution_mlr) +
@@ -2470,12 +2489,35 @@ null_distribution_mlr

We can now conduct hypothesis tests for the partial slopes in multiple linear regression.
We can use the permutation test to test the null hypothesis $H_0: \beta_i = 0$ versus the alternative hypothesis $H_A: \beta_i \neq 0$ for each of the partial slopes. 
Let's use a significance level of $\alpha = 0.05$. We can visualize the $p$-values in the null distribution by comparing them to the observed test statistics. We do this by adding a `shade_p_value()` layer to `visualize()`.

-```{r, fig.height=ifelse(knitr::is_latex_output(), 6, 8), fig.width=6, fig.cap="Shaded p-values for the partial slopes in this multiple linear regression."}
+```{r eval=FALSE}
visualize(null_distribution_mlr) +
  shade_p_value(obs_stat = observed_fit, direction = "two-sided")
```

-From these visualizations, we can surmise that `aroma` and `flavor` are statistically significant, as their observed test statistics fall far to the right of the null distribution. On the other hand, `moisture_percentage` is not statistically significant, as its observed test statistic falls within the null distribution. We can also compute the numerical $p$-values using the `get_p_value()` function.
+```{r echo=FALSE}
+mlr_pvalue_viz <- visualize(null_distribution_mlr) +
+  shade_p_value(obs_stat = observed_fit, direction = "two-sided")
+#if (!file.exists("images/mlr_pvalue_viz.png")) {
+  ggsave(
+    filename = "images/mlr_pvalue_viz.png",
+    plot = mlr_pvalue_viz,
+    width = 6,
+    height = 11,
+    dpi = 320
+  )
+#}
+```
+
+```{r shaded-p-values-partial, echo=FALSE, out.width="55%", fig.height=12, fig.cap="Shaded p-values for the partial slopes in this multiple regression."}
+if(is_latex_output()) {
+  knitr::include_graphics("images/mlr_pvalue_viz.png")
+} else {
+  mlr_pvalue_viz
+}
+```
+
+
+From these visualizations in Figure \@ref(fig:shaded-p-values-partial), we can surmise that `aroma` and `flavor` are statistically significant, as their observed test statistics fall far to the right of the null distribution. On the other hand, `moisture_percentage` is not statistically significant, as its observed test statistic falls within the null distribution. We can also compute the numerical $p$-values using the `get_p_value()` function. 
```{r} null_distribution_mlr |> @@ -2515,7 +2557,7 @@ These results match up with our findings from the visualizations of the shaded $ - A. The variable is likely to have no effect on the response. - B. The null hypothesis should be accepted. -- C. The variable is likely statistically significant, and we should reject the null hypothesis. +- C. The variable is likely statistically significant, and we should reject the null. - D. The observed data should be discarded. @@ -2530,7 +2572,7 @@ These results match up with our findings from the visualizations of the shaded $ ### Summary of statistical inference -We've finished the last two scenarios from the "Scenarios of sampling for inference" table, which we re-display in Table \@ref(tab:table-ch10). +We've finished the last two scenarios, which we re-display in Table \@ref(tab:table-ch10). ```{r table-ch10, echo=FALSE, message=FALSE, purl=FALSE} # The following Google Doc is published to CSV and loaded using read_csv(): diff --git a/11-tell-your-story-with-data.Rmd b/11-tell-your-story-with-data.Rmd index 8fee106b9..08116ee8f 100755 --- a/11-tell-your-story-with-data.Rmd +++ b/11-tell-your-story-with-data.Rmd @@ -31,7 +31,7 @@ Recall in the Preface and at the end of chapters throughout this book, we displa (ref:finalflowchart) *ModernDive* flowchart. 
-```{r moderndive-figure-conclusion, echo=FALSE, out.width="100%", out.height="100%", fig.cap="(ref:finalflowchart)", purl=FALSE}
+```{r moderndive-figure-conclusion, echo=FALSE, out.width="100%", fig.cap="(ref:finalflowchart)", purl=FALSE}
include_graphics("images/flowcharts/flowchart/flowchart.002.png")
```
@@ -68,7 +68,7 @@ This philosophy is also well-summarized in ["Practical Data Science for Stats"](
In other words, to be equipped to "think with data" in the 21st century and beyond, analysts need practice going through the ["data/science pipeline"](http://r4ds.had.co.nz/explore-intro.html) we saw in the Preface (re-displayed in Figure \@ref(fig:pipeline-figure-conclusion)). It is our opinion that, for too long, statistics education has only focused on parts of this pipeline, instead of going through it in its *entirety*.

-```{r pipeline-figure-conclusion, fig.cap="Data/science pipeline.", out.height="80%", out.width="80%", echo=FALSE, purl = FALSE}
+```{r pipeline-figure-conclusion, fig.cap="Data/science pipeline.", out.width="80%", echo=FALSE, purl = FALSE}
include_graphics("images/r4ds/data_science_pipeline.png")
```
@@ -83,6 +83,7 @@ Let's load all the packages needed for this chapter (this assumes you've already
library(tidyverse)
library(moderndive)
library(fivethirtyeight)
+library(infer)
```

```{r message=FALSE, echo=FALSE, purl=FALSE}
@@ -172,7 +173,7 @@ Observe that the mean `price` of `r mean(house_prices$price) |> dollar()` is lar
However, the median is not as sensitive to such outlier house prices. This is why news about the real estate market generally reports median house prices and not mean/average house prices. We say here that the median is more *robust to outliers* than the mean. 
Similarly, while the standard deviation and interquartile range (IQR) are both measures of spread and variability, the IQR, being based on the quantiles as `Q3 - Q1`, is more *robust to outliers*.\index{outliers}

-Let's now perform the last of the three common steps in an exploratory data analysis: creating data visualizations. Let's first create *univariate* visualizations. These are plots focusing on a single variable at a time. Since `price` and `sqft_living` are numerical variables, we can visualize their distributions using a `geom_histogram()` as seen in Section \@ref(histograms) on histograms. On the other hand, since `condition` is categorical, we can visualize its distribution using a `geom_bar()`. Recall from Section \@ref(geombar) on barplots that since `condition` is not "pre-counted", we use a `geom_bar()` and not a `geom_col()`.
+Let's now perform the last of the three common steps in an exploratory data analysis: creating data visualizations. Let's first create *univariate* visualizations. These are plots focusing on a single variable at a time. Since `price` and `sqft_living` are numerical variables, we can visualize their distributions using a `geom_histogram()` as seen in Section \@ref(histograms) on histograms. On the other hand, since `condition` is categorical, we can visualize its distribution using a `geom_bar()`. Recall from Section \@ref(geombar) on barplots that since `condition` is not "pre-counted," we use a `geom_bar()` and not a `geom_col()`.

```{r, eval=FALSE, message=FALSE}
# Histogram of house price:
@@ -212,7 +213,7 @@ p3 <- ggplot(house_prices, aes(x = condition)) +
p1 + p2 + p3 + plot_layout(ncol = 2)
```

-First, observe in the bottom plot that most houses are of condition "3", with a few more of conditions "4" and "5", and almost none that are "1" or "2".
+First, observe in the bottom plot that most houses are of condition `3`, with a few more of conditions `4` and `5`, and almost none that are `1` or `2`. 
Next, see in the histogram for `price` (the top-left plot) that a majority of houses cost less than two million dollars. Observe also that the x-axis stretches out to 8 million dollars, even though there do not appear to be any houses close to that price. This is because there are a *very small number* of houses with prices closer to 8 million as noted in the `tidy_summary()`. These are the outlier house prices we mentioned earlier. We say that the variable `price` is *right-skewed* as exhibited by the long right tail.\index{skew}
@@ -556,7 +557,7 @@ We first make this prediction visually in Figure \@ref(fig:house-price-interacti
* The regression line for the condition = `r ex_condition` homes and
* The vertical dashed black line at `log10_size` equals `r ex_size_log10`, since our predictor variable is the log10 transformed square feet of living space of $\log10(`r ex_size`) = `r ex_size_log10`$.

-```{r house-price-interaction-3, echo=FALSE, message=FALSE, fig.cap="Interaction model with prediction.", fig.height=ifelse(knitr::is_latex_output(), 4.5, 5), purl=FALSE}
+```{r house-price-interaction-3, echo=FALSE, message=FALSE, fig.cap="Interaction model with prediction.", fig.height=ifelse(knitr::is_latex_output(), 5, 5), purl=FALSE}
new_house <- tibble(log10_size = log10(1900), condition = factor(5)) %>% # Doesn't work when switching to |> from %>%
  get_regression_points(price_interaction, newdata = .)
@@ -599,21 +600,20 @@ Let's next check the results of our multiple linear regression on house prices u

#### Theory-based hypothesis testing for partial slopes {-}

-Recall the results of our theory-based inference that were shown when we interpreted the `estimate` column of `get_regression_table(price_interaction)` in Table \@ref(tab:seattle-interaction) and are shown in what follows. We can now use these values to perform hypothesis tests on the partial slopes. 
+Recall the results of our theory-based inference that were shown when we interpreted the `estimate` column of `get_regression_table(price_interaction)` in Table \@ref(tab:seattle-interaction) and are shown again in Table \@ref(tab:price-inter-table-again). We can now use these values to perform hypothesis tests on the partial slopes. ```{r, eval=FALSE} -price_interaction <- lm(log10_price ~ log10_size * condition, - data = house_prices) +price_interaction <- lm(log10_price ~ log10_size * condition, data = house_prices) get_regression_table(price_interaction) ``` -```{r echo=FALSE, purl=FALSE} +```{r price-inter-table-again, echo=FALSE, purl=FALSE} price_interaction <- lm(log10_price ~ log10_size * condition, data = house_prices ) get_regression_table(price_interaction) |> kbl( digits = 3, - caption = "Regression table for interaction model", + caption = "Regression table for interaction model (shown again)", booktabs = TRUE, linesep = "" ) |> @@ -633,7 +633,9 @@ Let's begin by retrieving the observed fit values from our `price_interaction` m ```{r} observed_fit_coefficients <- house_prices |> - specify(log10_price ~ log10_size * condition) |> + specify( + log10_price ~ log10_size * condition + ) |> fit() observed_fit_coefficients ``` @@ -642,7 +644,9 @@ Next, we build a null distribution of the partial slopes using the `generate()` ```{r eval=FALSE} null_distribution_housing <- house_prices |> - specify(log10_price ~ log10_size * condition) |> + specify( + log10_price ~ log10_size * condition + ) |> hypothesize(null = "independence") |> generate(reps = 1000, type = "permute") |> fit() @@ -669,25 +673,26 @@ We then visualize this null distribution and shade the observed fit values to se ```{r eval=FALSE} visualize(null_distribution_housing) + - shade_p_value(obs_stat = observed_fit_coefficients, direction = "two-sided") + shade_p_value(obs_stat = observed_fit_coefficients, + direction = "two-sided") ``` ```{r echo=FALSE, message=FALSE} null_housing_shaded <- 
visualize(null_distribution_housing) + shade_p_value(obs_stat = observed_fit_coefficients, direction = "two-sided") -if (!file.exists("images/patchwork_boxplot.png")) { +if (!file.exists("images/null_housing_shaded.png")) { ggsave( filename = "images/null_housing_shaded.png", plot = null_housing_shaded, width = 6, - height = 9, + height = 11, dpi = 320 ) } ``` -```{r echo=FALSE, out.height="100%"} +```{r echo=FALSE, out.width="90%"} if(is_latex_output()) knitr::include_graphics("images/null_housing_shaded.png") ``` @@ -777,7 +782,7 @@ ggplot(US_births_1999, aes(x = date, y = births)) + title = "US Births in 1999") ``` -We see a big dip occurring just before January 1st, 2000, most likely due to the holiday season. However, what about the large spike of over 14,000 births occurring just before October 1st, 1999? What could be the reason for this anomalously high spike? +In Figure \@ref(fig:us-births), we see a big dip occurring just before January 1st, 2000, most likely due to the holiday season. However, what about the large spike of over 14,000 births occurring just before October 1st, 1999? What could be the reason for this anomalously high spike? Let's sort the rows of `US_births_1999` in descending order of the number of births. Recall from Section \@ref(arrange) that we can use the `arrange()` function from the `dplyr` function to do this, making sure to sort `births` in `desc`ending order: @@ -801,11 +806,9 @@ The date with the highest number of births (14,540) is in fact 1999-09-09. If we \vspace{-0.25in} ``` -Time to think with data and further tell your story with data! How could statistical modeling help you here? What types of statistical inference would be helpful? What else can you find and where can you take this analysis? What assumptions did you make in this analysis? We leave these questions to you as the reader to explore and examine. - -Remember to get in touch with us via our contact info in the Preface. We’d love to see what you come up with! 
+Time to think with data and further tell your story with data! How could statistical modeling help you here? What types of statistical inference would be helpful? What else can you find and where can you take this analysis? What assumptions did you make in this analysis? We leave these questions to you as the reader to explore and examine. Remember to get in touch with us via our contact info in the Preface. We’d love to see what you come up with! -Please check out additional problem sets and labs at as well. +Please check out additional problem sets and labs at . ### Scripts of R code @@ -814,7 +817,7 @@ Please check out additional problem sets and labs at