# User Documentation
## Purpose of the Outdoor Equity App
The [Outdoor Equity App](https://shinyapps.bren.ucsb.edu/oe_app/) provides tools to explore reservation trends at overnight reservable sites and to analyze who has access to these sites. The intended audience includes federal public land managers and researchers, as well as nonprofit organizations and recreation users.
The [Recreation Information Database (RIDB) data](https://ridb.recreation.gov/landing) are comprehensive with respect to the site and the reservation itself, but include no information about the visitor beyond their home ZIP code. [US Census American Community Survey (ACS) data](https://www.census.gov/programs-surveys/acs/data.html) are used to approximate socioeconomic demographics by joining ZIP-code-level estimates to the RIDB data via the visitors' home ZIP codes.
## How to Use the Outdoor Equity App
### About the App
The About tab of the [Outdoor Equity App](https://shinyapps.bren.ucsb.edu/oe_app/) includes background information about what the App is, why outdoor recreation is important, the App creators, and what data is used in the App. These sections are similar to the [About Section][About] of this technical documentation. The About tab also includes example questions that a user might explore through the different parts of the Analysis tab.
The Analysis tab of the App consists of three parts: Data Summary, Data Relationships, and Visitorshed Maps. Each of these pages includes a brief explanatory section on how to interpret the plot or map, a section to subset the data to the desired campsite, and the plot or map outputs.
The Metadata tab of the App includes metadata for all variables in the combined RIDB and ACS dataset. This section mirrors the Metadata section in the [Products and Deliverables Section][Products and Deliverables] of this document.
The Data Download section of the App allows a user to download a subset of the data, including as many or as few campsites and variables as they require for further analysis or use.
## How to Maintain the Outdoor Equity App
### Repository Directory Structure
Our code is version controlled in a [GitHub repository](https://github.com/outdoor-equity/outdoor-equity). This includes the data cleaning, wrangling, and analysis, creation of plots and maps, and structure of the shiny app. An overview of the directory structure can be found in the [Appendix Section][Repository Directory Structure].
### Data Preparation Methods
#### RIDB Data {#ridb-data}
[RIDB data](https://ridb.recreation.gov/landing) are available through direct download as CSV files or as JSON files via the Recreation.gov API. API access requires creating a Recreation.gov account and requesting second-tier API access via the Recreation.gov website's [Contact Us](https://recreationonestopprod.servicenowservices.com/external?id=external_contact_us) page. CSV files are readily available for download via the [Recreation.gov website](https://ridb.recreation.gov/download). Data are collected each time a visitor makes a reservation through Recreation.gov. Data packages are posted annually in the spring by Recreation One Stop (R1S) and contain the previous fiscal year's reservations (ex: the 2018 package covers 2017-10-01 through 2018-09-30). Data packages are available for download from 2006 to present. Each annual data package contains between 2 million and 5 million observations (or reservations) and includes variables in character, numeric, and date/time formats about each reservation. A shift in the data collection and storage processes occurred in 2019, changing which variables are available and how they are labeled. Currently, the Outdoor Equity App contains only data for reservable sites in California and reservations for fiscal year 2018 due to the time limitations of this project. To learn more about expanding the App temporally and spatially, see the [How to Expand the Outdoor Equity App Section](#how-to-expand-the-outdoor-equity-app).
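As a quick reference, a downloaded annual RIDB CSV can be read directly into R. The sketch below is illustrative only; the file path is hypothetical and should point to wherever the data package was saved.
```{r, eval=FALSE}
# minimal sketch: read one annual RIDB data package into R
# (file path is hypothetical)
library(readr)
ridb_data <- read_csv("data/raw/ridb_fy2018.csv")
```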
We created a reproducible workflow for cleaning and wrangling data, which is employed in the `data_wrangle_and_clean.Rmd` document that sources custom functions to prepare RIDB data for joining with ACS data. All custom functions are listed in the [Access, Clean, and Wrangle Data Section][Access, Clean, and Wrangle Data] of this document. These functions rely heavily on functions from the `tidyverse` collection of packages [@R-tidyverse].
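A minimal sketch of how this workflow fits together is shown below. The `r/` directory matches where the ACS functions are sourced from later in this chapter; the exact RIDB script locations are an assumption based on the file names listed in this section.
```{r, eval=FALSE}
# minimal sketch of `data_wrangle_and_clean.Rmd`: source the custom cleaning
# functions, then apply them in sequence to the raw RIDB data
source("r/function_ridb_subset-pre2018.R")             # subset to state + overnight use
source("r/function_ridb_variable_calculate-pre2018.R") # calculate derived variables
source("r/function_join_ridb_acs.R")                   # join RIDB and ACS data
```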
```{r, include=FALSE}
ridb_data_ca_overnight_only <- ridb_data %>%
janitor::clean_names() %>%
filter(facility_state == "CA") %>%
filter(use_type == "Overnight")
```
First, we use `function_ridb_subset-pre2018.R` to subset the RIDB data, keeping only reservations within the selected state that are listed as "Overnight" in the `use_type` variable. For the 2018 California dataset this results in a starting "raw" data frame with `r comma(nrow(ridb_data_ca_overnight_only))` reservations.
```{r, eval=FALSE}
# filter for state
filter(facility_state == state_abbrev) %>%
# filter for use type
filter(use_type == "Overnight")
```
We then select only the necessary variables, including information about the site (agency, park or forest name, site name, site type, and site location) and information about the reservation (home ZIP code, total paid, visit start and end dates, visit order date, and number of people in party).
```{r, eval=FALSE}
# select variables
select(c("agency", "parent_location", "region_description",
"park", "site_type", "facility_id", "facility_state",
"facility_longitude", "facility_latitude",
"customer_zip", "total_paid", "start_date",
"end_date", "order_date", "number_of_people")) %>%
mutate(site_type = tolower(site_type)) %>%
filter(!site_type %in% c("historic tour", "hiking zone",
"group picnic area", "cave tour",
"management", "picnic",
"entry point", "trailhead"))
```
Next, we normalize the customer ZIP code values by filtering for valid US ZIP codes and shortening all 9-digit ZIP codes to their first 5 digits.
```{r, eval=FALSE}
# filter out invalid ZIP codes
filter(str_detect(string = customer_zip,
pattern = "^[:digit:]{5}(?!.)") |
str_detect(string = customer_zip,
pattern = "^[:digit:]{5}(?=-)")) %>%
filter(!customer_zip %in% c("00000", "99999")) %>%
mutate(customer_zip = str_extract(string = customer_zip,
pattern = "[:digit:]{5}"))
```
This function results in the removal of `r comma(nrow(ridb_data_ca_overnight_only) - nrow(data_joined_2018))` reservations (or `r percent((nrow(ridb_data_ca_overnight_only) - nrow(data_joined_2018)) / nrow(ridb_data_ca_overnight_only), accuracy = 0.01)`) from the original "raw" dataset that included all reservations for reservable overnight campsites in California.
```{r, include=FALSE}
no_valid_booking_window <- data_joined_2018 %>%
mutate(booking_window = as.numeric(difftime(start_date, order_date),
units = "days")) %>%
filter(booking_window < 0)
```
We created a second custom function, `function_ridb_variable_calculate-pre2018.R`, to calculate and manipulate variables of interest. We use the start, end, and order dates to calculate the length of stay and booking window (number of days from order date to start date) of each reservation. The booking window calculation returns negative values for some reservations, a known issue that others working with the RIDB data have encountered. In the 2018 California data, `r comma(nrow(no_valid_booking_window))` reservations (or `r percent(nrow(no_valid_booking_window) / nrow(ridb_data_ca_overnight_only), accuracy = 0.01)`) lack a valid booking window and are removed from plots that visualize the booking window variable.
```{r, eval=FALSE}
mutate(start_date = as.Date(start_date),
end_date = as.Date(end_date),
order_date = as.Date(order_date),
# calculate new variables
length_of_stay = as.numeric(difftime(end_date, start_date),
units = "days"),
booking_window = as.numeric(difftime(start_date, order_date),
units = "days"))
```
We calculate the daily cost per reservation by dividing the total paid by the length of stay, then calculate the daily cost per visitor by dividing the daily cost by the number of people in the party.
```{r, eval=FALSE}
# calculate new variables
mutate(daily_cost = total_paid / length_of_stay,
daily_cost_per_visitor = daily_cost / number_of_people)
```
We aggregate the `site_type` variable to create 7 broader site categories.
```{r, eval=FALSE}
# aggregate site type
mutate(aggregated_site_type =
case_when(site_type %in% c("walk to", "hike to",
"group hike to", "group walk to"
) ~ "remote",
site_type %in% c("cabin nonelectric", "cabin electric",
"yurt","shelter nonelectric"
) ~ "shelter",
site_type %in% c("boat in", "anchorage") ~ "water",
site_type %in% c("group equestrian",
"equestrian nonelectric"
) ~ "equestrian",
site_type %in% c("rv nonelectric", "rv electric",
"group rv area nonelectric"
) ~ "rv only",
site_type %in% c("group standard nonelectric",
"standard nonelectric",
"standard electric",
"group standard area nonelectric",
"group standard electric"
) ~ "rv or tent",
site_type %in% c("tent only nonelectric",
"group tent only area nonelectric",
"tent only electric"
) ~ "tent only"))
```
We create an administrative unit variable by combining the `parent_location` and `region_description` variables, as different federal agencies track administrative unit information in different variables. Then we update the `agency`, `admin_unit`, and `park` character strings using multiple functions from the `stringr` package [@R-stringr]. These updates expand acronyms, remove unnecessary characters (such as "--" or location codes), and normalize any discrepancies in character strings.
```{r, eval=FALSE}
mutate(admin_unit =
case_when(agency == "USFS" ~ parent_location,
agency %in% c(
"NPS", "BOR", "USACE"
) ~ region_description)) %>%
# edit values
mutate(
# agency abbreviations to names
agency = str_replace(string = agency,
pattern = "NPS",
replacement = "National Park Service"),
agency = str_replace(string = agency,
pattern = "USFS",
replacement = "US Forest Service"),
agency = str_replace(string = agency,
pattern = "USACE",
replacement = "US Army Corps of Engineers"),
agency = str_replace(string = agency,
pattern = "BOR",
replacement = "Bureau of Reclamation"),
# update admin_unit values (generic)
admin_unit = str_replace(string = admin_unit,
pattern = paste(c("NF - FS", "NF -FS",
"NF- FS", "NF-FS",
"-FS", " - FS"),
collapse = "|"),
replacement = "National Forest"),
admin_unit = str_to_title(admin_unit),
admin_unit = str_replace(string = admin_unit,
pattern = "And",
replacement = "&"),
# update park values (generic)
park = str_remove(string = park,
pattern = paste(c("\\(.*", " \\(.*",
"---.*", " ---.*",
",.*"),
collapse = "|")),
park = str_to_title(park),
park = str_replace(string = park,
pattern = "Cg",
replacement = "Campground"),
park = str_replace(string = park,
pattern = "Nhp",
replacement = "National Historic Park"),
park = str_replace(string = park,
pattern = "@",
replacement = "At"),
park = str_replace(string = park,
pattern = "&",
replacement = "And"),
park = str_replace(string = park,
pattern = paste(c("/", " / "), collapse = "|"),
replacement = " "),
park = str_remove_all(string = park,
pattern = " \\d.*"),
# update park values (CA specific)
park = str_remove(string = park,
pattern = paste(c(" - Angeles Nf", " -Hwy"),
collapse = "|")),
park = str_replace(string = park,
pattern = "Tunnel Mills Il",
replacement = "Tunnel Mills"))
```
We calculate distance traveled by measuring the distance from the facility's latitude and longitude coordinates to the centroid of the visitor's home ZIP code, using both the `tidycensus` package [@R-tidycensus] and the `sf` package [@R-sf].
```{r, eval=FALSE}
# convert facility coordinates to sf point geometries and reproject to NAD83
df_geometries <- df %>%
  st_as_sf(coords = c("facility_longitude", "facility_latitude"),
           crs = 4326) %>%
  st_transform(crs = 4269) # NAD83 (EPSG:4269); st_distance() returns distances in meters
# get centroid of geometries for all US ZIP codes
df_zip_centroids_us <-
get_acs(geography = "zcta", year = 2018,
geometry = TRUE,
summary_var = "B01001_001",
survey = "acs5",
variables = c(male = "B01001_002")) %>%
select(NAME, geometry) %>%
mutate(zip_code = str_sub(NAME, start = -5, end = -1)) %>%
select(zip_code, geometry) %>%
st_centroid()
# join data and calculate `distance_traveled` variable
df_joined_geometries <-
left_join(x = df_geometries %>% as.data.frame(),
y = df_zip_centroids_us %>% as.data.frame(),
by = c("customer_zip" = "zip_code")) %>%
st_sf(sf_column_name = 'geometry.x') %>%
mutate(distance_traveled_m = st_distance(x = geometry.x,
y = geometry.y,
by_element = TRUE),
distance_traveled_m = as.numeric(distance_traveled_m))
```
Finally, we add a variable indicating the state or territory in which each customer ZIP code is located. This portion of the code utilizes the `zipcodeR` package [@R-zipcodeR].
```{r, eval=FALSE}
# create df of fips and full state names
fips_list <- c(
"01", "02", "04", "05", "06", "08", "09", "10", "11", "12",
"13", "15", "16", "17", "18", "19", "20", "21", "22", "23",
"24", "25", "26", "27", "28", "29", "30", "31", "32", "33",
"34", "35", "36", "37", "38", "39", "40", "41", "42", "44",
"45", "46", "47", "48", "49", "50", "51", "53", "54", "55",
"56", "72")
state_list <- c(
"AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC", "FL",
"GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME",
"MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH",
"NJ", "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI",
"SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI",
"WY", "PR")
states_names_list <- c(
"Alabama", "Alaska", "Arizona", "Arkansas", "California",
"Colorado", "Connecticut", "Delaware",
"District of Columbia", "Florida","Georgia", "Hawaii",
"Idaho", "Illinois", "Indiana", "Iowa", "Kansas",
"Kentucky", "Louisiana", "Maine","Maryland",
"Massachusetts", "Michigan", "Minnesota", "Mississippi",
"Missouri", "Montana", "Nebraska", "Nevada", "New Hampshire",
"New Jersey", "New Mexico", "New York", "North Carolina",
"North Dakota", "Ohio", "Oklahoma", "Oregon", "Pennsylvania",
"Rhode Island","South Carolina", "South Dakota", "Tennessee",
"Texas", "Utah", "Vermont", "Virginia", "Washington",
"West Virginia", "Wisconsin","Wyoming", "Puerto Rico")
df_states_fips <- as.data.frame(list(fips = fips_list,
state = state_list,
state_full = states_names_list))
# loop through state df to get all ZIP codes w/in state
df_states_zip_codes <- data.frame()
for (i in seq_along(fips_list)){
state <- zipcodeR::search_fips(state_fips = fips_list[[i]]) %>%
select(zipcode, state)
df_states_zip_codes <- rbind(df_states_zip_codes, state)
}
# add full state name and fips code to list of all ZIP codes for each state
df_states_fips_zip_codes <-
left_join(x = df_states_zip_codes,
y = df_states_fips,
by = "state") %>%
select(-fips) %>%
rename(customer_zip_state = state,
customer_zip_state_full = state_full,
zip_code = zipcode)
```
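The resulting lookup table is then joined back onto the RIDB data so that each reservation carries its home state. A minimal sketch is shown below; the name of the working RIDB data frame (`df_joined_geometries`, carried over from the distance calculation step) is an assumption.
```{r, eval=FALSE}
# minimal sketch (assumed object names): attach home state to each reservation
df_with_states <-
  left_join(x = df_joined_geometries,
            y = df_states_fips_zip_codes,
            by = c("customer_zip" = "zip_code"))
```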
#### U.S. Census Data
US Census data are publicly accessible in many ways. Our product utilizes the R package `tidycensus` [@R-tidycensus] to access the necessary variables from the 2018 [American Community Survey (ACS)](https://www.census.gov/programs-surveys/acs/data.html) via API. API access requires an API key, which can then be saved into your RStudio environment. Learn more about API keys and working with `tidycensus` [here](https://walker-data.com/tidycensus/articles/basic-usage.html).
```{r, eval=FALSE, message=FALSE}
# API set up
# Only have to run this the first time using on a new machine
census_api <- source("private/census-api.R")
census_api_key(key = census_api[[1]], install = TRUE, overwrite = TRUE)
# run in console:
readRenviron("~/.Renviron")
```
Sample data are collected for the ACS each year and include many variables that cover social, economic, housing, and demographic characteristics. All variable options can be easily viewed using the code below:
```{r, eval=FALSE}
# look at variable options
View(load_variables(2018, "acs5", cache = TRUE))
```
The Outdoor Equity App utilizes the [ACS 5-year data](https://www.census.gov/data/developers/data-sets/acs-5year.html), which provides estimates based on data collected over the designated 5-year period. We used the 5-year ACS data rather than the 1-year ACS data because it increases the "statistical reliability of the data for less populated areas and small population subgroups" [@ACS].
The variables included in the App are median income, race, language(s) spoken at home, and highest level of education attained. All variables are represented as numeric estimates. Data are called by geographic region, in our case the ZIP Code Tabulation Area (ZCTA), and include an estimated number of people that fall into each category within each ZIP code, a margin of error, and an estimated total population for the area. Within our custom functions `function_acs_education.R`, `function_acs_language.R`, `function_acs_median_income.R`, and `function_acs_race.R` we first import just the necessary columns for each ACS variable. See Table \@ref(tab:acs-race-tab) for an example of the race variable.
```{r, message=FALSE}
# import variables for race
race_df <-
get_acs(
geography = "zcta", year = 2018,
state = "CA",
summary_var = "B03002_001", #Estimate!!Total:
variables = c(
white = "B03002_003", #White alone
black = "B03002_004", #Black or African American alone
native_american = "B03002_005", #American Indian and Alaska Native alone
asian = "B03002_006", #Asian alone
pacific_islander = "B03002_007", #Native Hawaiian/Other Pacific Islander alone
other = "B03002_008", #Some other race alone
multiracial = "B03002_009", #Two or more races
hispanic_latinx = "B03002_012" #Hispanic or Latino
))
```
```{r, acs-race-tab, message=FALSE, echo=FALSE}
knitr::kable(head(race_df, 10),
caption = "Example of raw ACS Race Variable.")
```
We calculate percentages for the categories within each ACS variable for race, language(s) spoken in the home, and highest level of education by dividing the estimate for each category by the estimated total population for that ZIP code.
```{r, eval=FALSE}
race_df_percent <-
race_df %>%
group_by(zip_code, race) %>%
summarise(estimate = sum(estimate),
moe = sum(moe),
summary_est = unique(summary_est),
summary_moe = unique(summary_moe),
percent = estimate / summary_est)
```
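The grouping above assumes the raw `get_acs()` output has already been tidied so that the ZCTA and group labels live in `zip_code` and `race` columns. A minimal sketch of that intermediate step, mirroring the pattern used for median income below, might look like this (the exact code inside `function_acs_race.R` may differ):
```{r, eval=FALSE}
# minimal sketch (assumed intermediate step): tidy the raw get_acs() output
race_df <- race_df %>%
  clean_names() %>%                                       # lower_snake_case column names
  rename(race = variable) %>%                             # group labels (white, black, ...)
  mutate(zip_code = str_sub(name, start = -5, end = -1))  # extract 5-digit ZCTA from NAME
```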
For median-income, we use the estimated median-income as is, without further adjustments.
```{r, eval=FALSE}
median_income_df <-
get_acs(geography = "zcta", year = 2018,
state = "CA",
variables = c(
median_income = "B19013_001"
)) %>%
clean_names() %>%
rename(median_income = estimate) %>%
mutate(zip_code = str_sub(name, start = -5, end = -1)) %>%
select(median_income, zip_code)
```
We transform the ACS dataframe by pivoting wider, which moves the different categories and percentages (ex: racial groups) into their own columns to create a single row for each ZIP code. This is necessary for joining the ACS data sets to the RIDB data set.
```{r, eval=FALSE}
select(zip_code, summary_est, race, percent) %>%
rename(zip_code_population = summary_est) %>%
# create column for each percentage for each group (pivot_wider)
# necessary to be able to left_join() with RIDB data
pivot_wider(names_from = "race",
values_from = "percent")
```
Finally, we use the `function_join_ridb_acs.R` custom function to join the RIDB and ACS data to create the dataset that is available for download on the App and utilized for all plots and maps.
```{r, eval=FALSE}
joined_df <-
left_join(x = ridb_df,
y = acs_df_race,
by = c("customer_zip" = "zip_code")) %>%
left_join(y = acs_df_education,
by = c("customer_zip" = "zip_code")) %>%
left_join(y = acs_df_median_income,
by = c("customer_zip" = "zip_code")) %>%
left_join(y = acs_df_language,
by = c("customer_zip" = "zip_code"))
```
### Statistical Analysis and Data Wrangling for Plots
Each type of plot and map requires unique data wrangling, which is explained in this section.
#### Data Summary
We created custom functions for data wrangling and plotting for use within the Outdoor Equity App code. Wrangling begins with filtering based on the user's selected campsite. We then further subset the dataset to include only the columns necessary for plotting.
```{r, eval=FALSE}
# example for booking window data summary
booking_window_rdf <- reactive({
validate(
need(siteInput != "",
"Please select a reservable site to visualize.")
) # EO validate
ridb_df %>%
filter(park %in% siteInput,
booking_window > 0,
booking_window != "Inf") %>%
select(park, booking_window) %>%
filter(!is.na(booking_window))
})
```
We create histograms, bar charts, or density plots to show the distribution of the data for each variable. For ACS variables, data from reservations are plotted against data for the full California population, allowing a user to compare the distribution within the reservation data to that of the California census and see where reservations under- or over-represent that variable relative to California residents. We use the `plotly` package [@R-plotly] to allow a user to hover over the plot to view helper text, and we include additional helper text outside of the plots for more complicated plots.
```{r, eval=FALSE}
## -- example for booking window data summary -- ##
# parameters
hist_colors <- c("#64863C", "#466C04")
quant_80_color <- c("#FACE00")
caption_color <- c("#ac8d00")
# plot for shiny app
plotly <- ggplot(
data = booking_window_rdf()) +
geom_histogram(
aes(x = booking_window,
text = paste0(percent(..count.. / nrow(booking_window_rdf()),
accuracy = 0.1),
" of all visits are reserved between ", xmin,
" and ", xmax,
" days before the start of the visit",
"<br>",
"(Total reservations to site: ",
comma(nrow(booking_window_rdf()), accuracy = 1),
")")),
binwidth = 7,
center = 7 / 2,
fill = hist_colors[[1]],
col = hist_colors[[2]], size = 0.05) +
labs(x = "Days in advance before visit (each bar = 1 week)",
y = "") +
scale_x_continuous(limits = c(0, x_max)) +
geom_vline(xintercept = quant_80,
linetype = "dashed", alpha = 0.5, color = quant_80_color) +
theme_minimal() +
theme(plot.background = element_rect("white"),
panel.grid.major.y = element_blank())
# add 6 month / 1 year annotation if data allows
if (x_max <= 180) {
# don't include 6 month or 1 year annotation
plotly
} else if (x_max > 180 & x_max <= 360){
# include 6 month annotation
plotly <- plotly +
geom_vline(xintercept = 180,
linetype = "dashed", size = .3, alpha = .5) +
annotate("text", label = "6 months", size = 3,
x = 180, y = 5)
} else if (x_max >= 360) {
# include 6 month and 1 year annotation
plotly <- plotly +
geom_vline(xintercept = 180,
linetype = "dashed", size = .3, alpha = .5) +
annotate("text", label = "6 months", size = 3,
x = 180, y = 5) +
geom_vline(xintercept = 360,
linetype = "dashed", size = .3, alpha = .5) +
annotate("text", label = "1 year", size = 3,
x = 360, y = 5)
} # EO else if for plotly
ggplotly(plotly,
tooltip = list("text"),
dynamicTicks = TRUE) %>%
layout(title =
list(text =
paste0('<b>', siteInput, '<br>', admin_unitInput, '</b>',
'<br>',
'Number of days from reservation to start of visit'),
font = list(size = 15)),
xaxis = list(separatethousands = TRUE),
yaxis = list(separatethousands = TRUE),
margin = list(b = 130, t = 100),
annotations =
list(x = x_max/2, y = -0.6,
text =
paste0("80% of reservations reserve their visit less than ",
'<b>', quant_80, '</b>',
" days before the start date",
"<br>",
"(shown on plot with yellow dotted line)."),
showarrow = F,
                  xref = 'paper', yref = 'paper',
xanchor = 'middle', yanchor = 'auto',
xshift = 0, yshift = 0,
font = list(size = 12, color = caption_color))) %>%
config(modeBarButtonsToRemove = list("pan", "select", "lasso2d", "autoScale2d",
"hoverClosestCartesian",
"hoverCompareCartesian"))
```
#### Data Relationships
Visualizing the relationship between RIDB and ACS variables is challenging because the socioeconomic data available are an estimate for the ZIP code population, rather than data about the individual making the reservation. In order to create simple, easy-to-interpret plots, we determined whether reservations were from a location with "high" percentages of a given ACS variable category (ex: one of the eight racial groups). We defined "high" as anything above the weighted 3rd quartile (weighted based on the California census). This method allows reservations to be included in the "high" category for multiple values within one ACS variable (ex: a reservation might be from a ZIP code that falls into the "high" category for both Black and Asian folks). By allowing a reservation to appear in multiple categories within a single ACS variable, we retain the uncertainty about the actual socioeconomic status of the person making the reservation.
```{r, include=FALSE}
source("r/function_acs_education.R")
source("r/function_acs_language.R")
source("r/function_acs_race.R")
source("r/function_acs_median_income.R")
# read in census data for CA
acs_subset_calculate_race(geography = "zcta", year = 2018,
state = "California")
acs_subset_calculate_education(geography = "zcta", year = 2018,
state = "California")
acs_subset_calculate_language(geography = "zcta", state = "California")
acs_subset_calculate_median_income(geography = "zcta", year = 2018,
state = "California")
# combine all CA census variables
acs_ca_all <- left_join(x = data_acs_2018_education_percent_California,
y = data_acs_2018_median_income_California,
by = "zip_code") %>%
left_join(y = data_acs_2018_race_percent_California,
by = "zip_code") %>%
left_join(y = data_acs_2020_language_percent_California,
by = "zip_code")
data_acs_ca_all <- acs_ca_all %>%
mutate(mean_zip_code_population = rowMeans(acs_ca_all[,c(2,8,17)])) %>%
select(-c("zip_code_population.x",
"zip_code_population.y",
"zip_code_population"))
```
Example code for calculating the "high" threshold for language can be seen below:
```{r, eval=FALSE}
# `function_acs_top_quartile_language.R` code
threshold_df <- acs_df %>%
# filter variables of interest
select(zip_code, english_only, not_english_only, mean_zip_code_population) %>%
pivot_longer(cols = 2:3,
names_to = "language",
values_to = "language_percentage") %>%
# filter category of interest
filter(language == language_group) %>%
drop_na(language_percentage)
# weighted median value (weighted based on ZIP code populations)
weighted_half <- weighted.mean(x = threshold_df$language_percentage,
w = threshold_df$mean_zip_code_population)
# drop rows below weighted median
df_half <- threshold_df %>% filter(language_percentage >= weighted_half)
# weighted 3rd quartile
## weighted median value of top half (weighted based on ZIP code populations)
weighted_quartile <- weighted.mean(x = df_half$language_percentage,
w = df_half$mean_zip_code_population)
```
```{r}
source("r/function_acs_top_quartile_language.R")
# language groups
language_group <- c("english_only", "not_english_only")
# calculate value of 3rd quartile for each language group
language_quants_df <-
language_group %>%
map_dbl(language_top_quartile, acs_df = data_acs_ca_all) %>%
cbind("language_group" = language_group,
"weighted_quartile" = .) %>%
as.data.frame() %>%
mutate(language_group = str_replace_all(string = language_group,
pattern = "_",
replacement = " "),
language_group = str_to_title(language_group),
weighted_quartile = percent(as.numeric(weighted_quartile),
accuracy = 0.01)) %>%
rename("Language Group" = language_group,
"Threshold" = weighted_quartile)
```
```{r, include=FALSE}
source("r/function_acs_top_quartile_education.R")
# education groups
education_group <-
c("hs_GED_or_below", "some_college", "college", "master_or_above")
# calculate value of 3rd quartile for each educational group
education_quants_df <-
education_group %>%
map_dbl(education_top_quartile, acs_df = data_acs_ca_all) %>%
cbind("education_group" = education_group,
"weighted_quartile" = .) %>%
as.data.frame() %>%
mutate(education_group = str_replace_all(string = education_group,
pattern = "_",
replacement = " "),
education_group = str_to_title(education_group),
weighted_quartile = percent(as.numeric(weighted_quartile),
accuracy = 0.01)) %>%
rename("Education Group" = education_group,
"Threshold" = weighted_quartile)
```
```{r, include=FALSE}
source("r/function_acs_top_quartile_race.R")
# race groups
race_group <- c(
"other", "pacific_islander", "multiracial", "asian",
"black", "white", "native_american", "hispanic_latinx")
# calculate value of 3rd quartile for each racial group
race_quants_df <-
race_group %>%
map_dbl(race_top_quartile, acs_df = data_acs_ca_all) %>%
cbind("race_group" = race_group,
"weighted_quartile" = .) %>%
as.data.frame() %>%
mutate(race_group = str_replace_all(string = race_group,
pattern = "_",
replacement = " "),
race_group = str_to_title(race_group),
weighted_quartile = percent(as.numeric(weighted_quartile),
accuracy = 0.01)) %>%
rename("Race Group" = race_group,
"Threshold" = weighted_quartile)
```
The thresholds used within the Outdoor Equity App can be seen in Table \@ref(tab:thres-tab1) and Table \@ref(tab:thres-tab2).
```{r, thres-tab1, echo=FALSE}
knitr::kable(list(language_quants_df, education_quants_df),
caption = "Threshold values for ACS language and education variables.")
```
```{r, thres-tab2, echo=FALSE}
knitr::kable(race_quants_df,
caption = "Threshold values for ACS race variable.")
```
We create lollipop and bar plots to show the relationships of the data for each pair of variables. We use the `plotly` package [@R-plotly] to allow a user to hover over the plot to view helper text. An example of the code used to create the plot comparing education to booking window is shown below:
```{r, eval=FALSE}
## -- data wrangle -- ##
# create reactive dataframe and further subset
rdf <- reactive ({
validate(
need(siteInput != "",
"Please select a reservable site to visualize.")
) # EO validate
education_top_quartile_df %>%
# filter to user site of choice
filter(park == siteInput) %>%
# select to variables of interest
select(park, customer_zip,
education, education_percentage, education_y_lab,
booking_window) %>%
drop_na(booking_window, education_percentage) %>%
filter(booking_window >= 0) %>%
# summarize to inner quartile range, median, and total reservations
group_by(education, education_y_lab) %>%
summarize(median_booking_window = median(booking_window),
quartile_lower = quantile(booking_window)[[2]],
quartile_upper = quantile(booking_window)[[4]],
count = n())
}) #EO reactive df
validate(need(
nrow(rdf()) > 0,
paste0("There are no reservations to ", siteInput, ", ", admin_unitInput,
" that come from communities in the high range for any educational",
"categories.")
)) # EO validate
## -- create plot -- ##
# parameters
education_group_colors <-
c(
"HS, GED,\nor Below" = "#a6cee3",
"Some College or\nTrade School" = "#1f78b4",
"Associates or\nBachelors Degree" = "#b2df8a",
"Masters Degree\nor Above" = "#33a02c"
)
# create plot
plotly <- ggplot(data = rdf(),
aes(x = median_booking_window,
y = education_y_lab)) +
geom_segment(aes(xend = 0, yend = education_y_lab)) +
geom_point(
aes(
color = education_y_lab,
fill = education_y_lab,
text = paste0(
comma(count, accuracy = 1),
" unique visits were made by people who live in ZIP codes with high ",
" rates of",
"<br>", education,
" as the maximum level of education. ",
"Typically these visitors reserved their visit between",
"<br>", comma(quartile_lower, accuracy = 1), " and ",
comma(quartile_upper, accuracy = 1),
" days before the start of their trip, with a median booking window of ",
comma(median_booking_window, accuracy = 1), " days."
)
),
size = 3.5, shape = 21, stroke = 2 ) +
scale_y_discrete(expand = c(0.45, 0)) +
scale_fill_manual(values = education_group_colors) +
scale_color_manual(values = education_group_colors) +
labs(x = paste("Estimated Number of Days in Advance Site is Reserved"),
y = "") +
theme_minimal() +
theme(
plot.background = element_rect("white"),
panel.grid.major.y = element_blank(),
legend.position = "none")
# create plotly
ggplotly(plotly,
tooltip = list("text")) %>%
config(
modeBarButtonsToRemove =
list("zoom", "pan", "select", "lasso2d", "autoScale2d",
"hoverClosestCartesian", "hoverCompareCartesian")) %>%
layout(title = list(
text = paste0( '<b>', siteInput, '<br>', admin_unitInput, '</b>', '<br>',
                   'Number of Days in Advance Site is Reserved by Visitors with ',
                   'Different Levels of Education'),
font = list(size = 15) )) %>%
add_annotations(
text = "Reservations from ZIP codes<br>with high proportions of:",
x = -0.15, y = 0.9,
font = list(size = 11),
xref = 'paper', yref = 'paper',
showarrow = FALSE)
```
#### Spatial Analysis
We created a state-level visitorshed map to show the number of visitors to overnight reservable sites from each state. Using the RIDB data, we use the `zipcodeR` package [@R-zipcodeR] to determine the state in which each ZIP code, and thus each visitor, is located. We then create a dataframe with the necessary state geometries using the `tigris` package [@R-tigris], and finally simplify the geometries with the `rmapshaper` package [@R-rmapshaper] to lower the load time within the App.
```{r, eval=FALSE}
# ZIP codes and respective states
fips_list <-
c("01", "02", "04", "05", "06", "08", "09", "10", "11", "12",
"13", "15", "16", "17", "18", "19", "20", "21", "22", "23",
"24", "25", "26", "27", "28", "29", "30", "31", "32", "33",
"34", "35", "36", "37", "38", "39", "40", "41", "42", "44",
"45", "46", "47", "48", "49", "50", "51", "53", "54", "55",
"56", "72")
state_list <-
c("AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC", "FL",
"GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME",
"MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH",
"NJ", "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI",
"SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI",
"WY", "PR")
states_names_list <-
c("Alabama", "Alaska", "Arizona", "Arkansas", "California",
"Colorado", "Connecticut", "Delaware", "District of Columbia",
"Florida","Georgia", "Hawaii", "Idaho", "Illinois", "Indiana",
"Iowa", "Kansas", "Kentucky", "Louisiana", "Maine","Maryland",
"Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri",
"Montana", "Nebraska", "Nevada", "New Hampshire","New Jersey",
"New Mexico", "New York", "North Carolina", "North Dakota", "Ohio",
"Oklahoma", "Oregon", "Pennsylvania", "Rhode Island","South Carolina",
"South Dakota", "Tennessee", "Texas", "Utah", "Vermont", "Virginia",
"Washington", "West Virginia", "Wisconsin","Wyoming", "Puerto Rico")
# create dataframe of states
df_states_fips <- as.data.frame(list(fips = fips_list,
state = state_list,
state_full = states_names_list))
# loop through state df to get all ZIP codes w/in state
df_states_zip_codes <- data.frame()
for (i in seq_along(fips_list)){
state <- zipcodeR::search_fips(state_fips = fips_list[[i]]) %>%
select(zipcode, state)
df_states_zip_codes <- rbind(df_states_zip_codes, state)
}
# add full state name, fips code, etc. to list of all ZIP codes for each state
df_states_fips_zip_codes <-
left_join(x = df_states_zip_codes,
y = df_states_fips,
by = "state") %>%
select(fips, zipcode) %>%
rename(zip_code = zipcode)
# get state geometries for full US
df_state_geometries_us <- tigris::states(year = 2018) %>%
select(GEOID, STUSPS, NAME, geometry) %>%
rename(fips = GEOID,
state_abbrev = STUSPS,
state = NAME) %>%
rmapshaper::ms_simplify(keep = 0.005, keep_shapes = TRUE)
```
We then calculate the number of reservations per state for the site (as chosen by the app user) and use the `tmap` package [@R-tmap] to create an interactive map that colors each state by its total number of reservations.
```{r, eval=FALSE}
## -- data wrangle -- ##
# reactive data frame for siteInput
rdf <- reactive ({
validate(
need(siteInput != "",
"Please select a reservable site to visualize.")
) # EO validate
ridb_df %>%
filter(park %in% siteInput) %>%
select(agency, admin_unit, park, customer_zip_state_full,
customer_zip_state)
})
# value of total reservations for this park
total_reservations <- nrow(rdf())
# number of reservations and % per state
map_data <- rdf() %>%
group_by(customer_zip_state_full, customer_zip_state) %>%
summarize(number_reservations = n(),
percentage_reservations =
percent((number_reservations / total_reservations),
accuracy = 0.01)) %>%
filter(!is.na(customer_zip_state_full))
# add state geometries
map_data_geometries <-
state_geometries_df %>%
left_join(y = map_data,
by = c("state_abbrev" = "customer_zip_state")) %>%
mutate_at(vars(number_reservations),
~replace(number_reservations, is.na(number_reservations), 0)) %>%
mutate_at(vars(percentage_reservations),
~replace(percentage_reservations, is.na(percentage_reservations),
0)) %>%
select(customer_zip_state_full, number_reservations,
percentage_reservations, geometry)
## -- create map -- ##
tmap_mode("view")
tm_shape(map_data_geometries) +
tm_borders(col = "grey", alpha = 0.5) +
tm_fill(col = "number_reservations",
title = "Number of Visits",
palette = "YlGn",
n = 10,
style = "jenks",
id = "customer_zip_state_full",
popup.vars = c("Total Visits" = "number_reservations",
"Percentage of All Visits" = "percentage_reservations")
) +
tm_view(set.view = c(-101.834335, 40.022356, 2)) +
tmap_options(basemaps = 'https://server.arcgisonline.com/ArcGIS/rest/services/Canvas/World_Light_Gray_Base/MapServer/tile/{z}/{y}/{x}')
```
A ZIP-code-level visitorshed map shows the number of visitors to the site from each ZIP code; it covers California ZIP codes only, due to the computational load of calculating geometries for all ZIP codes in the United States. We then create a dataframe with the necessary geometries using the `tigris` package [@R-tigris], and simplify the geometries with the `rmapshaper` package [@R-rmapshaper] to lower the load time within the App.
```{r, eval=FALSE}
## -- ZIP codes and respective states -- ##
fips_list <-
c("01", "02", "04", "05", "06", "08", "09", "10", "11", "12",
"13", "15", "16", "17", "18", "19", "20", "21", "22", "23",
"24", "25", "26", "27", "28", "29", "30", "31", "32", "33",
"34", "35", "36", "37", "38", "39", "40", "41", "42", "44",
"45", "46", "47", "48", "49", "50", "51", "53", "54", "55",
"56", "72")
state_list <-
c("AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC", "FL",
"GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME",
"MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH",
"NJ", "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI",
"SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI",
"WY", "PR")
states_names_list <-
c("Alabama", "Alaska", "Arizona", "Arkansas", "California",
"Colorado", "Connecticut", "Delaware", "District of Columbia",
"Florida","Georgia", "Hawaii", "Idaho", "Illinois", "Indiana",
"Iowa", "Kansas", "Kentucky", "Louisiana", "Maine","Maryland",
"Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri",
"Montana", "Nebraska", "Nevada", "New Hampshire","New Jersey",
"New Mexico", "New York", "North Carolina", "North Dakota", "Ohio",
"Oklahoma", "Oregon", "Pennsylvania", "Rhode Island","South Carolina",
"South Dakota", "Tennessee", "Texas", "Utah", "Vermont", "Virginia",
"Washington", "West Virginia", "Wisconsin","Wyoming", "Puerto Rico")
# create dataframe of states
df_states_fips <- as.data.frame(list(fips = fips_list,
state = state_list,
state_full = states_names_list))
# loop through state df to get all ZIP codes w/in state
df_states_zip_codes <- data.frame()
for (i in seq_along(fips_list)){
state <- zipcodeR::search_fips(state_fips = fips_list[[i]]) %>%
select(zipcode, state)
df_states_zip_codes <- rbind(df_states_zip_codes, state)
}
# add full state name, fips code, etc. to list of all ZIP codes for each state
df_states_fips_zip_codes <-
left_join(x = df_states_zip_codes,
y = df_states_fips,
by = "state") %>%
rename(state_abbrev = state,
zip_code = zipcode) %>%
relocate(fips, .before = 2)
## -- CA ZIP geometries -- ##
df_zip_geometries_ca <- get_acs(geography = "zcta", year = 2018,
geometry = TRUE,
state = "California",
summary_var = "B01001_001",
variables = c(male = "B01001_002")) %>%
select(NAME, geometry) %>%
mutate(zip_code = str_sub(NAME, start = -5, end = -1)) %>%
select(zip_code, geometry) %>%
rmapshaper::ms_simplify(keep = 0.005, keep_shapes = TRUE) %>%
left_join(y = df_states_fips_zip_codes,
by = "zip_code") %>%
relocate(zip_code, .before = 1)
```
We calculate the number of reservations per ZIP code for the site (as chosen by the app user) and use `tmap` to create an interactive map that colors each ZIP code by its total number of reservations.
```{r, eval=FALSE}
## -- data wrangle -- ##
# reactive data frame for siteInput
rdf <- reactive ({
validate(
need(siteInput != "",
"Please select a reservable site to visualize.")
) # EO validate
ridb_df %>%
select(agency, admin_unit, park, customer_zip, facility_latitude,
facility_longitude) %>%
filter(park %in% siteInput)
})
# number of reservations per ZIP code
map_data <- rdf() %>%
group_by(customer_zip) %>%
summarize(number_reservations = n())
# join with ZIP geometries
map_data_geometries <-
zip_geometries_df %>%
left_join(map_data,
by = c("zip_code" = "customer_zip")) %>%
mutate_at(vars(number_reservations),
~replace(number_reservations, is.na(number_reservations), 0)) %>%
select(zip_code, number_reservations, geometry)
# value of total CA reservations for this park
total_reservations <- sum(map_data_geometries$number_reservations)
# add percentage of all CA reservations for each ZIP
map_data_geometries <- map_data_geometries %>%
mutate(percentage_reservations =
percent((number_reservations / total_reservations),
accuracy = 0.01)) %>%
mutate_at(vars(percentage_reservations),
~replace(percentage_reservations, is.na(percentage_reservations),
0))
# calculate location of park for point on map
park_location_geom <- rdf() %>%