-
Notifications
You must be signed in to change notification settings - Fork 0
/
CheckPopSyn3.Rmd
91 lines (67 loc) · 2.86 KB
/
CheckPopSyn3.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
---
title: "CheckingPopSyn3"
author: "Mausam Duggal"
date: "April 29, 2016"
output:
html_document:
highlight: tango
theme: cerulean
---
```{r init, echo=FALSE, message=FALSE}
library(dplyr); library(ggplot2); library(knitr)
```
#### PopSyn3 Check
This code compares the input population and households prepared for all of Ontario by MTO to that output by PopSyn3.
The table "joined" has the final comparison columns followed by dot plots at the end.
```{r Set Working Directory}
opts_knit$set(root.dir = 'c:/personal/r')
```
Batch in PopSyn3 Pop and HHold Outputs and CSD level Inputs
for comparing popsyn3 accuracy
```{r Files for comparison}
pop3 <- read.csv(file = "c:/personal/r/synpop_person_with_required_fields.csv", stringsAsFactors = FALSE)
hh3 <- read.csv(file = "c:/personal/r/synpop_hh.csv", stringsAsFactors = FALSE)
#' read control total file
csdin <- read.csv(file = "c:/personal/r/tazData.csv", stringsAsFactors = FALSE)
```
With the files batched in, summarize the Control and Input files across CSDUID
In the PopSyn3 outputs, the CSDUID is in the TAZ field.
```{r Summarize the files}
# Summarize control data files:
csdin.sum <- csdin %>% group_by(csduid)%>%
summarise(hhin=sum(tothh), popin=sum(totpop))
# summarize Popsyn3 Population data:
pop3.sum <- pop3 %>% group_by(taz) %>%
summarise(popl3 = sum(finalweight))
# summarize and count Popsyn3 Household data:
hh3.sum <- hh3 %>% group_by(taz) %>%
summarise(hh3 = sum(finalweight))
```
Now join the summarized datasets to produce a mastew copy for comparison
Also add two new columns that calculate the differences in population and households
```{r Establish Differences}
#' The GGH area does not have any corressponding CSD IDs so it is removed
joined <- merge(csdin.sum,pop3.sum, by.x="csduid", by.y="taz", all.x = T) %>%
subset(., csduid!=0)
#now join the housing data
joined <- merge(joined, hh3.sum, by.x="csduid", by.y="taz", all.x = T)
#' estimate differences between input and output fields
joined$popdiff <- joined$popin-joined$popl3
joined$hhdiff <- joined$hhin-joined$hh3
# get rid of records that had a zero in the Input population
joined <- subset(joined, joined$popin !=0)
```
Report totals of the differences
```{r Basic Statistics of the differences}
# Sum up the population and household difference columns
sum(joined$popdiff)
sum(joined$hhdiff)
```
Create Dot Plots of the differences
```{r Create Dot Plots of Differences}
ggplot(joined, aes(x=csduid, y=hhdiff))+geom_line(color="grey")+geom_point(color="red")+
ggtitle("Household Differences (Input Households - PopSyn3 Households)")
# plot the differences of population
ggplot(joined, aes(x=csduid, y=popdiff))+geom_line(color="grey")+geom_point(color="red")+
ggtitle("Population Differences (Input Population - PopSyn3 Population)")
```