-
Notifications
You must be signed in to change notification settings - Fork 2
/
Day2_intro_to_R.qmd
1072 lines (676 loc) · 31.3 KB
/
Day2_intro_to_R.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
output:
html_document:
df_print: paged
code_download: TRUE
toc: true
toc_depth: 1
editor_options:
chunk_output_type: console
editor:
markdown:
wrap: sentence
---
```{r, setup, include=FALSE}
knitr::opts_chunk$set(
eval=FALSE, warning=FALSE, error=FALSE
)
```
# R: Tools of the Trade
In this section, we're going to cover (or review) some fundamental R tools such as:
- File paths
- Working directory
- RStudio Projects
- Code scripts and R Documents
## File paths
- You can use the Files tab to access the files on your computer
- **Absolute paths** always start with the root directory and provide the full path to the file or directory. For example, the absolute path to the home directory of a user named "john" would be "/home/john".
- On the other hand, a **relative path** is a path to a file or directory that is relative to the current directory. For example, from the directory "/home/john", the relative path to a data set located in the Desktop would be "Desktop/data.csv".
## Working directory
Working directory: where R starts looking for files on your computer.
In other words, where you're "working from." This is also where R will save any output you save (e.g., processed data or plots).
You can set it with functions.
```{r}
getwd() # function to get working directory
setwd("<insert valid path here>") # function to set working directory
```
### EXERCISE
Set the working directory for your RStudio session to `R-fundamentals-summer-workshop` or `R-fundamentals-summer-workshop-main` (or whatever the correct path is for your project folder).
```{r}
# run this code to change your working directory only for this exercise
setwd("~")
# type code below to change the working directory back to your R workshop folder
```
## RStudio Projects
R experts keep all the files associated with a project together -- input data, R scripts, analytical results, figures.
This is such a wise and common practice that RStudio has built-in support for this via **projects**.
There are multiple ways to create a project. Here we cover two of them.
**Option 1 -- Create a new directory:**
On the top-right corner of RStudio, click on the button that has a blue cube with an R inside and says "Project: (None):. ![](images/project1.jpeg)
Then, click on "New Project..." in the drop-down menu.
![](images/project2.jpeg)
Then, click on "New Directory".
![](images/project3.jpeg)
Then, click on "New Project".
![](images/project4a.jpeg)
Then, populate a directory name (whatever name you want to be associated with the folder), select a location, and click on "Create Project". In this case, the folder will be called "test_project" and will be located in the Desktop.
![](images/project5a.jpeg)
**Option 2 -- Project in an existing directory:**
On the top-right corner of RStudio, click on the button that has a blue cube with an R inside and says "Project: (None):. ![](images/project1.jpeg)
Then, click on "New Project..." in the drop-down menu.
![](images/project2.jpeg)
Then, click on "Existing Directory".
![](images/project3.jpeg)
Then, select the location of your existing directory and click on "Create Project".
![](images/project4b.jpeg)
Regardless of which option you use, you'll end up with a directory containing a file ending in **".Rproj"**.
Double clicking on that file automatically opens up the project and associated documents in the correct working directory. **THIS IS HOW WE WANT TO OPEN RSTUDIO FOR THIS WORKSHOP.**
### EXERCISE
Check that you are currently working within the `R-fundamentals-summer-workshop` or `R-fundamentals-summer-workshop-main` Project. If not, save any changes you've made to this file (or others in this folder), and then open the relevant .RProj file to open the project (which will restart your R session as well).
```{r}
```
## Code scripts and R Document files
This is an R Quarto file. It is very similar to an R Markdown file. There are two view modes for this file - Source and Visual. The code in the gray chunks (surrounded by \`\`\` in Source view) is R code that you can run. There's a green triangle you can click to run all of the code in a chunk, or put your cursor in a line and press Command+Enter (Mac) or Control+Enter (PC) to execute that command (which can span multiple lines).
We're using an R Quarto Document instead of an R script so that we can include additional text and explanation in the file. In an .R script, everything has to be an R command or comment. Here, between chunks of R code, we can include text and other information.
# Packages
R comes with lots of built-in basic functions (commands). But we almost always need to use functions from additional packages as well. Packages can also include data sets sometimes.
## Installing
- RStudio often prompts you to install packages you're missing: if you open a file that uses a package that you don't have installed
- Do NOT put installation commands in R scripts or R Markdown files (included below only because we're teaching)
```{r, installpackages, eval=FALSE}
# install 1
install.packages("praise")
# install a few at once (put names in a vector)
install.packages(c("palmerpenguins", "tidyverse", "janitor"))
```
Installing a package is similar to going to the store to buy cold medication of a particular brand. Once you buy it, you have access to the medication, but you haven't used it yet.
Packages are installed from a repository. CRAN (Comprehensive R Archive Network) is the most common. People also install packages from Bioconductor and GitHub (process differs for these).
If you're asked a question about whether you want to install packages from source, say NO. This is because installing from source requires compiling into computer-readable documents which take time. If you say NO then R uses a pre-compiled version.
You only need to install packages once (for each major version of R).
To update a package, re-install it.
### EXERCISE
Let's install all the packages that we're going to use throughout the workshop!
Write code to install these packages:`praise`, `palmerpenguins`, `janitor`, `tidyverse`, `RColorBrewer`, `ggpubr`, `gridExtra`, and `grid`.
```{r, installpackagesworkshop, eval=FALSE}
```
## What if you can't install packages in your computer? Posit Cloud to the rescue!
If for some reason you can't install packages in your computer, you can use R and RStudio within Posit Cloud.
Posit Cloud allows you to run R on the cloud (in other words, other people's computers).
When using cloud services, keep in mind that you're sending data to other people's computers. Always consider the type of data that you're using and keep Northwestern's [research data](https://researchintegrity.northwestern.edu/resources/research-data-policy.pdf) and [data classification](https://policies.northwestern.edu/docs/data-classification-policy.pdf) policies in mind. You can always consult with [Research Computing and Data Services' Data Management team](https://www.it.northwestern.edu/departments/it-services-support/research/data-storage/). For this workshop, it's okay to use Posit Cloud.
To use Posit Cloud, go to <https://posit.cloud/>.
![](images/posit_cloud_main.jpeg)
The first time you use it, follow these steps:
1. [Sign up for a free acount](https://posit.cloud/plans/free) ![](images/posit_cloud_sign_up.jpeg)
2. Once you are logged in, click on "New Project" on the top right. ![](images/posit_cloud_new_project.jpeg)
3. Then click on New RStudio Project. It'll take a couple of moments to get the project ready. ![](images/posit_cloud_rstudio_project.jpeg)
4. Once you see RStudio's interface, you're good to go! ![](images/posit_cloud_rstudio.jpeg)
## Using Packages
To use a package in your current session, use the `library()` function to load all of the functions (and other objects) in the package into your environment. You do not need quotes around the package names.
Loading a package is similar to opening the cold medication after you come home from the store. Not only you already have the medication, but you also opened it, so that it's ready for you to use.
You need a separate library command for each package.
Put these at the top of your script or R Markdown file so people (collaborators or your future self) can see which packages are needed for the code.
Only load the packages you need. Packages can conflict with each other, so don't load ones you aren't using.
```{r}
library(praise)
library(palmerpenguins) # we're going to use this later
```
Now that we loaded `praise`, we can use functions within it!
```{r}
praise()
```
Note that, as opposed to installing, you need to load packages in each R session where you want to use them.
# Import Data
## Reading in a Data File
Most of the time, you'll read in data from a file. For example:
```{r}
evp <- read.csv("data/ev_police_jan.csv")
```
Most data in R will be stored in a data frame (more below).
As with other variables, when we type the name of the data frame, it will print the contents of the variable to the console.
```{r, eval=FALSE}
evp
```
Note: Evanston police data comes from the [Stanford Open Policing Project](https://openpolicing.stanford.edu/data/). This is a small subset of data: just January 2017.
TIPS: if you get an error message that a file can't be found when you're trying to import it:
1) Check the spelling of the filename for typos
2) Check your working directory (`getwd()`) and make sure the path to the file is correct and completely specified given what your working directory is.
3) Make sure you extracted the contents of the .zip file you downloaded. We've seen some problems on Windows computers especially where a .zip file isn't really unzipped - it's just letting you see inside without actually expanding the contents and creating real files. Right click on the file and look for an Extract All or similar option.
## From a Package
Today, we're going to work with data on penguins from <https://github.com/allisonhorst/palmerpenguins>. It's been included in an R package: `palmerpenguins`.
There's a data frame in the package called penguins. Like with functions in a package, we can use it once we've loaded the package with the library command above.
```{r}
penguins
```
## Manual Creation
It's also possible to create a data frame from scratch. You will not do this often.
```{r, manualdf}
month_info <- data.frame(
month = month.name,
index = 1:12,
days = c(31,28,31,30,31,30,31,31,30,31,30,31)
)
month_info
```
# Data Frame Basics
- Rectangles of data
- Rows are observations
- Columns are variables
- Columns (variables) have names
- Multiple vectors stuck together - each column is a vector
- Each column has 1 type of data (numeric, character, logical, etc.)
- Data frames can have columns with different types of data
## What is a data frame?
It has columns and rows. Let's look at it:
```{r, eval=FALSE}
View(evp)
```
Or we can look at a few rows in the console:
```{r}
head(evp)
```
What size is it?
```{r}
dim(evp)
nrow(evp)
ncol(evp)
```
What are the variables?
```{r}
names(evp)
```
What type of data is in each variable?
```{r}
str(evp)
```
Get some basic info about the variables:
```{r}
summary(evp)
```
## EXERCISE
Using the `penguins` data frame, write commands to figure out:
- How many rows and columns it has?
- What are the names of the variables?
```{r}
# load the package
library(palmerpenguins)
```
## Working with Variables
Each column (variable) is a vector. We can access them individually with the `$` notation using the name of each variable:
```{r}
evp$location
penguins$species
```
We can do things like count values, summarize numeric values, etc.
```{r}
table(evp$location)
median(month_info$days)
```
We can also use the vectors in expressions:
```{r}
month_info$days/7
```
## EXERCISE
Import data from `"data/nu_degrees.csv"` and save it in a variable (choose an appropriate name).
View the data.
What is the average (the function is `mean()`) number of bachelor's degrees issued in a year?
What is the standard deviation in the number of doctoral degrees issued each year? The function is `sd()`.
How many total masters degrees were issued for the years in the data? Hint: `sum()`.
```{r}
```
Bonus -- challenge: How many total degrees were issued each year? How many total degrees were issued across all years and types?
```{r}
```
## Indexing by Position: Vectors
We can access the individual values in vectors or data frames by their index position. We put the number of the element we want in `[]`.
Let's start with vectors:
```{r}
month_info$month
month_info$month[1]
```
We can also select a range of values:
```{r}
1:5 # shortcut for making a vector with values 1, 2, 3, 4, 5
month_info$month[1:5]
```
Or non-adjacent values by using a vector of values.
```{r}
# We can make our own vector with the `c()` function:
c(1, 3, 5, 7, 9, 11)
month_info$month[c(1, 3, 5, 7, 9, 11)]
```
Negative indices mean to exclude the value at the given position:
```{r}
month_info$month[-1]
month_info$month[-1:-6]
table(month_info$days)
table(month_info$days[-2])
```
## EXERCISE
Using the data on Northwestern degrees from the last exercise:
What is the mean number of bachelors degrees issued in the first 5 years of the data?
What is the mean number of bachelors degrees issued in the last 5 years of the data?
What is the mean number of doctoral degrees per year excluding the most recent year of 2021-22?
```{r}
```
## Detour: Tibbles vs. Data Frames
`class()` tells us the type of the object -- what's stored in the variable.
```{r}
class(evp)
```
What is `penguins`?
```{r}
class(penguins)
```
"tbl_df" is a tibble data frame. These behave a little bit differently from normal data frames. You'll see tibbles instead of data frames within the tidyverse set of packages (and those packages that work within that framework).
The biggest difference is that tibbles give you a tibble back when subsetting with \[\], while data frames sometimes give you a vector.
For now: treat tibbles as data frames; be aware that there may be a few differences (one is coming below!). Mostly tibbles are more consistent in their behavior than data frames -- they like to stay as tibbles.
## Indexing by Position: Data Frames
With data frames, there are two dimensions (rows and columns), so we need two indices:
```{r}
month_info
month_info[1, 1]
month_info[2, 1]
month_info[1, 2]
```
There is a shortcut if we want all rows or all columns -- we can leave one position blank:
Select first row
```{r}
month_info[1, ]
```
Select first two rows
```{r}
month_info[1:2, ]
```
Select first column
```{r}
month_info[ ,1]
```
This gave us a vector back, but with `penguins`:
```{r}
penguins[ ,1]
```
we get a data frame/tibble back. This is one difference between tibbles and regular data frames.
We can select rows and columns at the same time:
```{r}
month_info[6:7, 1:2]
```
If we want rows or columns that aren't next to each other, you can use a vector.
```{r}
month_info[c(1, 3), ]
```
## EXERCISE
Using the Northwestern degree data set from above: select rows 2 through 5, and columns 1 through 2
```{r}
```
Bonus -- challenge: look up the `seq()` function in the Help. Use it to select every other row from the degree data.
```{r}
```
## Indexing by Names
We've seen that we can reference columns by name with `$` notation (no quotes on names)
```{r}
names(month_info)
month_info$days
```
Note that the `$` notation got us a vector back.
```{r}
month_info$days
```
Use names in `[]`: put them in quotes
```{r}
month_info[ ,"days"]
```
With penguins (tibble), it stays as a tibble:
```{r}
penguins[ ,"bill_length_mm"]
```
Multiple columns by name, need to use a vector of names:
```{r}
month_info[ ,c("month", "days")]
```
## Boolean/Conditional Selection
Suppose we want to select rows from a data frame that might meet a specific condition. For example, select all data for patients aged greater than 65, or select all sales items that were sold in the month of December.
If we have a Boolean vector (`TRUE` and `FALSE` values) that is the same length as the number of rows, we can use it to select data.
For example, suppose your data set is as below:
```
mpg cyl disp hp drat wt qsec vs am gear carb
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
```
You want to choose rows where the **condition** `mpg` ≥ 15 evaluates to TRUE
```
mpg cyl disp hp drat wt qsec vs am gear carb condition
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 FALSE
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 FALSE
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 FALSE
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 FALSE
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 FALSE
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 TRUE
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 TRUE
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 TRUE
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 TRUE
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 TRUE
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 TRUE
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 TRUE
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 TRUE
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 TRUE
```
Therefore, the selected rows will be the ones where the condition evaluated to TRUE:
```
mpg cyl disp hp drat wt qsec vs am gear carb condition
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 TRUE
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 TRUE
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 TRUE
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 TRUE
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 TRUE
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 TRUE
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 TRUE
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 TRUE
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 TRUE
```
And R will return the following subset of the original data frame to you:
```
mpg cyl disp hp drat wt qsec vs am gear carb
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
```
Let's see another example...
Say, you want to find the names of the months which have exactly 31 days.
```{r}
month_info
month_info$days # vector containing number of days in each month
month_info$days == 31 # boolean vector of T,F values based on a condition
month_info$month[month_info$days == 31] # works!
```
The command above works because R selects the values which evaluated to TRUE and discards the values that evaluated to FALSE.
Another example:
```{r}
month_info$month
length(month_info$month)
# create a Boolean vector of length equal to number of rows in month_info
selection_vector <- c(T, T, T, T, T, T, F, F, F, F, F, F)
length(selection_vector)
month_info$month[selection_vector]
month_info$month[!selection_vector] # inverted selection
```
Let's see an example with the penguins data frame:
Start with the criteria we want:
```{r}
penguins$bill_length_mm < 34
```
Hmmm - there's some missing values in there (NA) - we'll come back to that in a minute.
Select the **rows** (all columns) where bill length is less than 34:
```{r}
penguins[penguins$bill_length_mm < 34, ]
```
If I forget the `,` in the `[]`:
```{r, eval=FALSE}
penguins[penguins$bill_length_mm < 34]
```
It tries to index the columns instead, and our vector is too long.
Multiple conditions
```{r}
penguins[penguins$bill_length_mm < 34 & penguins$bill_depth_mm < 16, ]
dim(penguins[penguins$bill_length_mm < 34 & penguins$bill_depth_mm < 16, ])
```
```{r}
penguins[penguins$bill_length_mm < 34 | penguins$bill_length_mm > 58, ]
dim(penguins[penguins$bill_length_mm < 34 | penguins$bill_length_mm > 58, ])
```
## EXERCISE
Which years were more than 600 doctoral degrees awarded according to the `nu_degrees.csv` data?
You might want to break this down into two steps: 1) write an expression for determining whether there were more than 600 doctoral degrees, then 2) use that expression to index the correct column of your degrees data.
```{r}
```
# Missing Values
`NA` is the symbol for missing data; it can be used with any data type
```{r}
NA
y <- c("dog", "cat", NA, NA, "bird")
y
x <- c(NA, 2, NA, 4, NA, 6, NA, 6, 6, 4)
x
is.na(y)
is.na(x)
```
A common operation is to count how many missing values there are in a vector. We use the function `is.na()` to get TRUE when the value is `NA` and FALSE otherwise. Then we sum that result; this works because TRUE converts to 1 and FALSE converts to 0. So it counts the number of missing values (the number of TRUEs).
```{r}
sum(c(TRUE, FALSE))
x
sum(is.na(x))
```
Different functions handle missing values in different ways. Most commonly, you'll get an answer of `NA` unless you tell the function to remove or exclude the missing values in the calculation.
```{r}
mean(x)
mean(x, na.rm=TRUE)
```
If we don't want the missing rows included in our results above when we were selecting rows of penguins:
```{r}
is.na(penguins$bill_length_mm)
!is.na(penguins$bill_length_mm) # ! is not
dim(penguins)
# not missing bill length AND bill length < 34
penguins[!is.na(penguins$bill_length_mm) & penguins$bill_length_mm < 34, ]
dim(penguins[!is.na(penguins$bill_length_mm) & penguins$bill_length_mm < 34, ])
penguins[penguins$bill_length_mm < 34, ]
dim(penguins[penguins$bill_length_mm < 34, ])
```
The function `complete.cases()` can be used to find which rows have no missing values at all. The function `na.omit()` directly returns the subset of data with complete cases. You may want to use these functions at some point.
## EXERCISE
How many missing values in `penguins$bill_depth_mm`?
What is the mean of the non-missing values of `penguins$bill_depth_mm`?
```{r}
```
Bonus -- challenge: Select the rows from penguins where `penguins$bill_depth_mm` is missing.
```{r}
```
Bonus -- challenge: What is the mean of non-missing values of `penguins$bill_depth_mm` for penguins of species Gentoo?
```{r}
```
# Column Names
## Fixing Bad Names on Import
What if the variable/column names in your dataset have spaces, start with numbers, or are otherwise not names that you can use with R?
First, take a look at `annoying.csv` in Excel.
Then, run this code:
```{r}
bad_data <- read.csv("data/annoying.csv")
bad_data
names(bad_data)
```
The package `{janitor}` expedites the initial data exploration and cleaning that comes with any new data set.
```{r}
library(janitor)
bad_data <- clean_names(bad_data)
bad_data
names(bad_data)
```
## Renaming Columns
Sometimes, we want to change the column names of a dataframe after importing into R. For example, in `bad_data` above, perhaps we want the columns `x2022` and `x2023` to be called `year_2022` and `year_2023` respectively.
Just like we use `names()` to get the names, use the same function to assign the names. This is different syntax than we see other places in R; `names()` is a special type of function called a replacement function.
We can change the name of a data frame column as follows:
```{r}
names(bad_data)
names(bad_data)[2]
names(bad_data)[2] <- "year_2022"
names(bad_data) # inspect new names
bad_data
```
You can also change all or several column names at once.
```{r}
names(bad_data)
names(bad_data) <- c("fruit", "year_2022", "year_2023", "year_2024")
names(bad_data) # inspect new names
bad_data
```
# Making new columns
We can add a new column to the data frame by naming it with the `$` notation, and assigning a value to it.
For example, make a column that has bill length in **centimeters (cm)** instead of **millimeters (mm)**
```{r}
names(penguins)
penguins$bill_length_cm <- penguins$bill_length_mm / 10 # make new variable: note CM instead of MM in the name
names(penguins) # check to see that it was added
```
Columns will be added to the end (right).
```{r}
View(penguins)
```
## EXERCISE
Using `month_info`: make a new variable as part of `month_info`, called `weeks`, that is the number of days divided by 7.
```{r}
```
Make a new column in the degrees data with the total number of degrees awarded each year
```{r}
```
Bonus -- challenge: Make a new logical column called `long_month` that is TRUE if the month has 31 days.
```{r}
```
# Dropping Columns
To remove a column from a data frame, we select the data we want to keep, and then save that back to the same column:
```{r}
month_info
month_info[ , 1:3]
month_info <- month_info[ , -c(2,3)]
month_info
```
Of course, you can also do it with names instead of indices.
# Replacing Values
How do we change our data? A common task might be replacing missing values -- either actual NA values or other values that indicate missing. One way to do this is by replacing missing values by the mean of the non-missing values in that column.
Let's go through an example to replace missing values in the `bill_depth_mm` column of the `penguins` data frame.
First, we identify and select the values we want to change:
```{r}
# find NA / missing values in the column bill_depth_mm
penguins$bill_depth_mm[is.na(penguins$bill_depth_mm)]
```
Next, compute the replacement value, which is the mean of non-missing values of `bill_depth_mm` in this case:
```{r}
bill_depth_mean <- mean(penguins$bill_depth_mm, na.rm=TRUE)
bill_depth_mean
```
Now, assign the replacement value `bill_depth_mean` to the missing observations:
```{r}
penguins$bill_depth_mm[is.na(penguins$bill_depth_mm)] <- bill_depth_mean
sum(is.na(penguins$bill_depth_mm))
```
You can replace any values you need to replace (it doesn't have to be missing), and often-times it is common to use functions to compute the replacement values.
## EXERCISE
Here is code to make a new data frame. Run the code and take a look at the data frame
```{r}
set.seed(12924)
some_numbers <- data.frame(v1 = rnorm(10), v2 = runif(10))
some_numbers
```
You decide that you don't want any negative numbers in your data frame. Write code to replace all values of column v1 in `some_numbers` that are below 0 with the value 0.
```{r}
# Identify the values to be replaced/changed in some_numbers
# What is the new/replacement value?
# Assign the new/replacement value to the elements identified in the first step
# check some_numbers to see if the replacement worked!
some_numbers
```
# Bonus: Building Up Expressions
What if we want to select observations with more complicated criteria?
Which penguins are heavier than average?
Let's build this up together...
```{r}
```
What proportion of Adelie penguins have above average flipper length?
```{r}
```
# PRACTICE SET 1
Set 1 exercises use the policing data here:
```{r}
evp <- read.csv("data/ev_police_jan.csv")
```
If you mess up the data frame, you can always read it back in from the file.
## EXERCISE 1
What are the variable names, and how many observations are there?
```{r}
```
## EXERCISE 2
Are there any missing values in the data?
There are a few ways you might approach this
```{r}
```
## EXERCISE 3
What is the most common location? You may want to use the `table()` function.
Hint: you can use the `sort()` function in combination with the `table()` function: `sort(table(x))`
```{r}
```
## EXERCISE 4
How many stops found contraband?
Hint (for one approach to this): if you `sum()` a boolean variable, TRUE = 1, FALSE = 0, so it counts the TRUE values. Remember to deal with missing values with `sum()`: `sum(x, na.rm=TRUE)`. You could also make a table.
```{r}
```
What proportion of all stops in the data found contraband?
```{r}
```
## EXERCISE 5
Rename the column "raw_row_number" to "id"
```{r}
```
## EXERCISE 6
Subset the data frame to just have rows where subject_sex is "female" and save the result in a new variable.
```{r}
```
## EXERCISE 7
Make a new column in the data frame called evanston that is TRUE if the location is 60201 or 60202 and FALSE otherwise.
```{r}
```
## EXERCISE 8
This exercise is more challenging than the others.
What is the *make* of the oldest vehicles in the data set.