2011DataComparisons_1.Rmd

---
title: "Checking PopSyn3, NHS, and TSRC"
author: "Mausam Duggal"
date: "April 29, 2016"
output: 
  html_document: 
    highlight: tango
    theme: cerulean
---

```{r init, echo=FALSE, message=FALSE}
library(dplyr); library(ggplot2); library(knitr); library(gaussfacts); library(rmsfact); library(reshape2); library(DT); library(png); library(grid)
```

#### 2011 Data Comparison 

Rolf and Carlos have noted some significant differences in the data being generated by PopSyn3 and those after expanding the TSRC. Of special significance was the differences observed in the frquency counts by **income** segments. Given the importance of getting the data inputs correct from the point of view of model estimation as well as aligning all the input datasets so that they match across the various models, we are putting together a comparison of the different datasets by certain key variables.

The TSRC data has records that are **18 years and above** only. This presents some compatibility challenges as the 2011NHS data that I have handy is for 15 years and above. So, two sets of comparisons were generated. First, by slicing PopSyn3 estimates from 18 years and above. This allows a direct comparison between PopSyn3 outputs and TSRC, while leaving 2011 NHS as the odd one out. Second, by slicing PopSYn3 from 15 years and above to make it compatible with the 2011NHS and leaving TSRC as the odd one out.

Finally, as a sanity check, we have also compared **PopSyn3 to the inputs** provided by MTO both at the aggregate level and by income segments.

```{r echo=TRUE}
opts_knit$set(root.dir = 'c:/personal/r')
#wd <- setwd("c:/personal/r")

```

Something light before we jump into the data

```{r}

gaussfact()
rmsfact()
```


Batch in the PopSyn3 and 2011NHS data gotten from a semi-custom tabulation
```{r echo=TRUE}

nhs <- read.csv("c:/personal/r/2011NHS.csv", stringsAsFactors = FALSE)
hhold <- read.csv("c:/personal/r/households.csv", stringsAsFactors = FALSE)
pp <- read.csv("c:/personal/r/persons.csv", stringsAsFactors = FALSE)

# read in the PopSyn3 control files as well
csdin <- read.csv("c:/personal/r/tazData.csv", stringsAsFactors = FALSE)
```

#### Clean the 2011 NHS.

The 2011NHS semi-custom tabulation is an extremely big dataset with **2621 columns**. A lot of these are not needed for what we want to do, so it is best to bring this to a manageable size. Further, the data is not really designed in a format that is conducive to data wrangling. So, some cleaning steps are needed.

```{r warning=FALSE, tidy=TRUE, echo=TRUE}

#' filter out records that correspond to only those that belong to Ontario
on <- nhs[c(19844:40115), ] 

#' only keep the income columns
on1 <- on[c(1, 2231:2621)]

#' only keep records with DA numbers. This will get rid of sub-totals by various
#' sub-level geographies.

on2 <- filter(on1, grepl("^[35]",Geography))

#write.csv(on2, "2011NHS_ON2.csv")
#' All the columns come in as characters. So, we need to convert them back to 
#' numeric for undertaking analysis

# create vector of column names to apply the conversion to
cols <- colnames(on2)
cols1 <- cols[-1]    # get rid of the first element in the column vector

#' conver to numeric
on2[,cols1]<-lapply(cols1, function(x) as.numeric(as.character(on2[,x])))
on2[is.na(on2)] <- 0    # set NA to zero

```

#### Get the PopSyn3 ready for comparison

The 2011 PopSyn3 outputs have already been batched in. Given that the person file only includes a placeholder for the household id, it is necessary to join the two dataframe, strip out unnecessary columns and create the necessary columns for estimating the number of people in each income category. 

Of note, all those person records with a **population below 15** are excluded because that is what the 2011NHS used for reporting their numbers. This could present a challenge with the TSRC data because people below that age will need to be removed as well for an accurate comparison.

```{r warning=FALSE, tidy=TRUE, echo=TRUE}
#' join the tables to get the income variable mapped to the person table.
#' Further only keep persons 15 years or older.

pp1 <- left_join(pp, hhold, by.x = "hhid", by.y = "hhid") %>% 
  subset(., age>17) %>% subset(., select = c(hhid, hhinc))

#' create an income categorical variable
pp1 <- transform(pp1, incat = ifelse(hhinc < 50001, 1,
                                     ifelse(hhinc > 50000 & hhinc < 80001, 2,
                                            ifelse(hhinc > 80000 & hhinc < 100001, 3, 4))))

# summarise to get counts by income category
pp1.sum <- pp1 %>% group_by(incat) %>% summarise(cnt = n())

```

#### Populate Rolf and Carlos' TSRC expanded numbers

These are just manually batched in. Of note, currently the income categories **don't quite align** because the TSRC data breaks at 70,000 CAD; however, the 2011 NHS goes all the way to 80,000 CAD. So request Rolf and Carlos to produce TSRC with revised income categories. Also need to confirm that the **TSRC is reporting household income** levels and not those of the individual person.

```{r warning=FALSE, tidy=TRUE, echo=TRUE}

#' Input Carlos' values. First create an empty dataframe create data frame to receive results

# get column names for dataframe
cs <- c('Less50k', '50-80k', '80-100k', '>100k', 'Source', 'Total')
rs <- c('2011NHS', '2011PopSyn3', 'TSRC')

on3_sum <- data.frame(matrix(NA_real_, ncol=6, nrow=3))
  # set column names and add in the Source field
  colnames(on3_sum) <- cs
  rownames(on3_sum) <- rs
  on3_sum$Source <- rs
  
# change these values with those from Carlos 
tsrc <- c(3187556, 1347092, 1590159, 2506893)

#' Populate the table
for (i in 1:4){
  on3_sum[3,i] <- tsrc[i]
}

```

#### Populate the PopSyn3 numbers

```{r warning=FALSE, tidy=TRUE, echo=TRUE}

#' Populate the table
for (i in 1:4){
  on3_sum[2,i] <- pp1.sum[i,2]
}

```

#### Populate the 2011NHS numbers

```{r}
# get columns for estimating annual income
on2_t <- on2[c(5:17)]

# get column sums
on2_sum <- as.data.frame(t(colSums(on2_t, na.rm=TRUE)))

# populate the cells
on3_sum[1,1] <- on2_sum %>% .[, c(1:7)] %>% rowSums(na.rm = TRUE)

on3_sum[1,2] <- on2_sum %>% .[, c(8:9)] %>% rowSums(na.rm = TRUE)

on3_sum[1,3] <- on2_sum[,10]

on3_sum[1,4] <- on2_sum %>% .[, c(11:13)] %>% rowSums(na.rm = TRUE)

#' add in grand totals
on3_sum$Total = apply(on3_sum[,c(1:4)],1,sum)

#' display the results
datatable(on3_sum)

```

#### Compare how these numbers stack up.

```{r warning=FALSE, tidy=TRUE, echo=TRUE}
# melt the data to long format to make the bar graph
on4 <- melt(on3_sum, id.vars="Source") 
colnames(on4) <- c("Source", "Income", "Persons")

ggplot(data=on4, aes(x=Income, y=Persons, fill = factor(Source))) +
    geom_bar(position="dodge", stat = "identity") +
  scale_fill_discrete(name = "Data Source",
                      labels=c("2011NHS", "2011PopSyn3", "TSRC (less50k, 50k-70k, 70k-100k, 100k+)")) +
  xlab("Income Categories") + ylab("Persons") + 
  ggtitle("Comparison of Persons by Income Categories")

```

#### Display average household income choropleth map from ArcMap.

```{r}
img <- readPNG("Inc.png") 
      grid.raster(img)
```


Not particularly surprising that none of the datasets match. But, there are some interesting variances. For example, 2011NHS reporting a greater number of persons in the less than 50k CAD household income, infact close to 68% of the population greater than or equal to 15 years. The 2011 PopSyn3 tends to overpredict the greater than 100k income category.  

#### Check what the inputs that went in for PopSyn3 look like and compare them to PopSyn3's hhold level output

This is really a sanity check to make sure PopSyn3's inputs match those that PopSyn3 produces 

```{r}
csdin_sum <- as.data.frame(t(colSums(csdin[c('hhinc0_29', 'hhinc30_59', 'hhinc60_99', 
                                             'hhinc100pl')], na.rm = TRUE)))

# add in income category and populate it for PopSyn3 households
hhold <- transform(hhold, Incat = ifelse(hhinc < 30000, 1,
                                         ifelse(hhinc >=30000 & hhinc < 60000, 2,
                                                ifelse(hhinc >=60000 & hhinc < 100000, 3, 4))))

# Now summarize and prepare for comparison
hhold_sum <- hhold %>% group_by(Incat) %>% summarise(cnt = n()) 
hhold_sum1 <- as.data.frame(t(hhold_sum[-1]))
  colnames(hhold_sum1) <- colnames(csdin_sum)

```

#### Litmus test of PopSyn3

```{r}

#' display the results
com <- rbind(csdin_sum, hhold_sum1)
rownames(com) <- c("PopSyn3_Inputs", "PopSyn3_Outputs")
datatable(com)

```