25-01-2022
R-Ladies Brisbane GitHub hosts an event for R-Ladies Gaborone introducing R packages for data cleaning, namely naniar, janitor, Amelia and datawizard.
The presentation used the 2022 NYC housing dataset to demonstrate these packages.
Step 1 loads the packages (install them first if needed), then wrangles and explores the data with datawizard and janitor.
library(Amelia)
library(naniar)
library(data.table)
library(datawizard)
library(janitor)
library(readr)
library(ggplot2)
library(dplyr)
HP <- read_csv("https://raw.githubusercontent.com/sndaba/RPackagesForDataCleaning/main/NYC_2022.csv")
View(HP)
HP <- datawizard::data_remove(HP,c("latitude","longitude","id")) #drop columns that are not needed
HP <- datawizard::data_reorder(HP,c("host_id","host_name","name")) #move these columns to the front, in this order
HP <- janitor::clean_names(HP) #standardise column names to lower-case snake_case
HP <- datawizard::data_rename(HP,"price","house_price") #rename column price to house_price
janitor::get_dupes(HP,colnames(HP)) #checks whether there are any duplicates
janitor::tabyl(HP,host_name) %>% adorn_pct_formatting(digits=0,affix_sign=TRUE) #frequency table of host_name
janitor::top_levels(as.factor(HP$house_price),5) %>% #shows the lowest, middle and highest numeric range
adorn_pct_formatting(digits = 0, affix_sign=TRUE)
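To see what the janitor calls above do without downloading the NYC file, here is a minimal sketch on a small made-up data frame. The `toy` data and its column names are hypothetical and chosen only for illustration.

```r
library(janitor)

# Hypothetical toy data (not the NYC dataset); rows 2 and 3 are identical
toy <- data.frame(
  "Host ID"   = c(1, 2, 2, 3),
  "Room Type" = c("Private room", "Entire home", "Entire home", "Private room"),
  check.names = FALSE
)

toy <- clean_names(toy)  # standardise names to lower-case snake_case
names(toy)               # "host_id" "room_type"

dupes <- get_dupes(toy)  # no columns named, so all columns are compared
nrow(dupes)              # 2: both copies of the duplicated row are returned

tabyl(toy, room_type)    # counts and proportions per room type
```

Calling `get_dupes()` without column names prints a message and falls back to all columns, which gives the same full-row check used on `HP`.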
Step 2 looks for missing values and visualises the findings using naniar and ggplot2.
naniar::any_miss(HP) #check for NA
naniar::miss_var_summary(HP) #NA frequency
naniar::gg_miss_var(HP) #NA visualization
naniar::gg_miss_upset(HP,order.by="freq") #variable NA values relationship
ggplot2::ggplot(HP,aes(x=year_built,y=year_remod_add))+ #scatter plot; geom_miss_point() also displays the missing values
geom_miss_point()+
facet_wrap(~calculated_host_listings_count)+
theme_dark()
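The numeric summaries behind these plots are easier to read on a tiny example. The sketch below uses a hypothetical `toy` data frame with three missing values; only the naniar functions are real, the data are made up.

```r
library(naniar)

# Hypothetical toy data with three missing values
toy <- data.frame(
  price   = c(100, NA, 250, NA),
  reviews = c(5, 3, NA, 8),
  room    = c("a", "b", "c", "d")
)

any_miss(toy)                  # TRUE: at least one NA is present
n_miss(toy)                    # 3: total number of NAs
smry <- miss_var_summary(toy)  # one row per variable with n_miss and pct_miss
smry
```

`miss_var_summary()` sorts variables by their number of missing values, so `price` appears first here.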
In Step 3, the data frame is converted to a data table using data.table, and multiple imputation is used for the missing values with Amelia. The numeric columns of the original data are then mean-imputed with naniar.
drop_dt <- data.table::as.data.table(HP) #convert data.frame to data.table
drop_col <- c('name', #non-numeric columns to drop before imputation
'host_name',
'neighbourhood',
'neighbourhood_group',
'room_type')
col <- drop_dt[,!drop_col,with=FALSE] #new table without the dropped columns
res.amelia <- Amelia::amelia(col,m=5) #create 5 imputed data sets
Amelia::compare.density(res.amelia,var="house_price") #density plot comparing observed and imputed values
HP <- naniar::impute_mean_if(HP,.predicate = is.numeric) #replace NA in numeric columns with the column mean
naniar::any_miss(HP) #check whether any NA remain
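A small sketch of the mean-imputation step, again on hypothetical data, shows why the final `any_miss()` check is worth running: `impute_mean_if()` with `is.numeric` only fills numeric columns, so missing values in character columns are left in place.

```r
library(naniar)

# Hypothetical toy data: one NA in a numeric column, one in a character column
toy <- data.frame(
  price = c(100, NA, 300),
  room  = c("a", NA, "c")
)

filled <- impute_mean_if(toy, .predicate = is.numeric)

filled$price      # 100 200 300: the NA is replaced by the mean of 100 and 300
filled$room       # "a" NA "c": not numeric, so left untouched
any_miss(filled)  # TRUE: the character NA is still there
```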
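Finally, Step 4 converts the data frame to a data table, filters rows based on conditions, selects columns and groups the data. It then introduces data.table keys: setting a key sorts the data table by that column, which lets data.table use binary search when subsetting.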
View(HP_dt <- data.table::as.data.table(HP)) #set data frame to data table
class(HP_dt)
head(HP_dt[room_type=="Private room" & house_price>181500],4) #filtering rows based on conditions
head(HP_dt[,.(host_id,host_name,name)],4) #select given columns
head(HP_dt[neighbourhood=="Harlem",.(neighbourhood, #select given cols by row selection
number_of_reviews,
availability_365
)],4)
head(average <- HP_dt[,.(mean_price=mean(house_price)), #grouping using by
by=neighbourhood],4)
head(chain_gang <- HP_dt[,.(.N,maximum=max(house_price), #several summaries per group; the minimum is piped into round()
minimum=min(house_price)
%>% round(2)),by=neighbourhood],4)
#keys: setting a key sorts the data table by that column and enables binary-search subsetting
data.table::setkey(HP_dt,neighbourhood) #setting key for the data table
data.table::key(HP_dt) #check data table key
head(HP_dt[.("Harlem")],3) #select rows using key
room_by_neighbourhood <- HP_dt[.("Bedford-Stuyvesant"),
.(neighbourhood_group,neighbourhood,
room_type)]
head(room_by_neighbourhood)
#group using keyby
head(ans <- HP_dt["Hell's Kitchen",.(neighbourhood,
max_review=number_of_reviews),
keyby=availability_365],4)
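The key-based steps can be tried on a four-row table. This is a minimal sketch with made-up values; the column names mirror those used above, but the `toy` table itself is hypothetical.

```r
library(data.table)

# Hypothetical toy data (not the NYC dataset)
toy <- data.table(
  neighbourhood = c("Harlem", "Chelsea", "Harlem", "Astoria"),
  house_price   = c(120, 300, 80, 150)
)

setkey(toy, neighbourhood)  # sorts rows by neighbourhood and marks it as the key
key(toy)                    # "neighbourhood"
toy$neighbourhood           # "Astoria" "Chelsea" "Harlem" "Harlem"

harlem <- toy[.("Harlem")]  # subset on the key using binary search
harlem$house_price          # 120 80

# keyby groups like by, and also returns the result sorted and keyed by the group
by_nb <- toy[, .(n = .N, mean_price = mean(house_price)), keyby = neighbourhood]
by_nb
```

Because the table is physically sorted by the key, `toy[.("Harlem")]` does not need to scan every row, which is where the speed benefit on larger tables comes from.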
Useful URLs shared during the presentation
CODATA webinar: importance of data cleaning
janitor (CRAN): Simple Tools for Examining and Cleaning Dirty Data
naniar: Data Structures, Summaries, and Visualisations for Missing Data
Amelia: A Program for Missing Data
Multiple imputation: The validity of multiple-imputation-based analyses
datawizard: Package to easily manipulate, clean, transform, and prepare your data for analysis
data.table: Provides a high-performance version of base R's data.frame with syntax and feature enhancements for ease of use, convenience and programming speed