Update vars_funs interface in the R package to match the Python package (#34)

* Add basic Python project with just vars_rename

* Add unit tests for vars_rename

* Add pytest-coverage workflow

* Add Development docs to python/README.md

* Clean up docs in vars_funs.py

* Fix typo in pytest-coverage workflow

* Accept any python >=3.9 in python package

* Use optional-dependencies for dev deps in pyproject.toml

* Fix vars_rename docstring in Python package

* Update typing in vars_funs.py to be compatible with Python 3.9

* Add Sphinx docs for Python package

* Update actions/checkout versions across workflows

* Add Python docs generation to docs workflow

* Fix ruff linter errors

* Install both test and docs requirements when running pytest

* Fix paths in pytest-coverage workflow

* Better path management in docs conf.py

* Rename build jobs in docs workflow

* Include csv files in package data when building Python package

* Temporarily disable branch restriction for docs deployment to test it out

* Update deploy-pages version

* Revert "Temporarily disable branch restriction for docs deployment to test it out"

This reverts commit 9cc256b.

* Fix broken link in Python docs

* Switch to new style python type hints since we don't support 3.9 anyway

* Remove unnecessary templates_path config from pyproject.toml

* Empty commit to try to bust build-pkgdown-site actions cache

* Draft Python version of vars_recode

* Remove unnecessary .python-version file

* Add pip install directions to README and index.rst for Python package

* Remove unnecessary uv.lock file

* Rename 'test' -> 'dev' in pyproject optional-dependencies

* Switch order of authors in pyproject.toml

* Capitalize VAR_NAME_PREFIX constant in vars_funs.py

* Remove unnecessary OutputType enum from vars_funs.py

* Remove duplicative type checking in vars_funs.py

* Wrap Python tests in classes for clearer organization

* Change chars_sample fixtures to symlinks to R data in Python package

* WIP add vars_recode

* Add tests for vars_recode and fixup logic

* Add docs for vars_dict and vars_recode in Python package

* Remove unnecessary select_dtypes filter in Python vars_recode

* Add python/ subdir to RBuildignore so it does not get built into R package

* Update docs to fix incorrect EXT_WALL code translation

* Clarify docs for vars_dict data object in reference.rst

* Stricter dictionary schema validation in Python version of vars_recode

* Remove outdated comment in python/ccao/vars_funs.py

Co-authored-by: Dan Snow <[email protected]>

* Remove deprecated `vars_check_class` function from R package

* Update roxygen docs to remove vars_check_class and tweak vars_recode docs

* Rename `type` and `dictionary` params in R package vars funs to match Python

* Fix error in var_funs example code

* Remove unused `class_dict` data object from the R package

* Update R package docs to reflect removal of `class_dict` data object

* Remove unused class_dict.rda object

* Remove pytest and ruff caches from R build

* Remove warnings and fallback behavior for deprecated `type` and `dict` args in `vars_funs.R`

---------

Co-authored-by: Dan Snow <[email protected]>
jeancochrane and dfsnow authored Dec 6, 2024
1 parent 2b027d4 commit 6445f79
Showing 13 changed files with 116 additions and 521 deletions.
2 changes: 2 additions & 0 deletions .Rbuildignore
@@ -16,6 +16,8 @@
^\.github$
^\.gitlab$
^\.pre-commit-config\.yaml$
^\.pytest_cache$
^\.ruff_cache$
^_pkgdown\.yml$
^cache$
^ci$
1 change: 0 additions & 1 deletion NAMESPACE
@@ -15,7 +15,6 @@ export(town_get_assmnt_year)
export(town_get_triad)
export(val_limit_ratios)
export(val_round_fmv)
export(vars_check_class)
export(vars_recode)
export(vars_rename)
importFrom(magrittr,"%>%")
28 changes: 0 additions & 28 deletions R/data.R
@@ -72,34 +72,6 @@
"chars_sample_universe"


#' Data dictionary of Cook County property classes
#'
#' A dataset containing a translation for property class codes to
#' human-readable class descriptions. Also describes which classes are included
#' in residential regressions and reporting classes.
#'
#' @format A data frame with 197 rows and 8 variables:
#' \describe{
#' \item{major_class_code}{First digit of class code, major class}
#' \item{major_class_type}{Human-readable description of the major class}
#' \item{assessment_level}{Level of assessment for the property class}
#' \item{regression_class}{Boolean indicating whether or not this class is
#' included in CAMA regressions}
#' \item{modeling_group}{Modeling group used for internal CCAO data selection}
#' \item{reporting_group}{Reporting group name used for internal CCAO reports
#' and aggregate statistics}
#' \item{class_code}{Full, 3-digit class code of the property sub-class}
#' \item{class_desc}{Human-readable description of the property sub-class}
#' \item{min_size}{Integer of minimum size for 200-class property codes}
#' \item{max_size}{Integer of maximum size for 200-class property codes}
#' \item{min_age}{Integer of minimum age for 200-class property codes}
#' \item{max_age}{Integer of maximum age for 200-class property codes}
#' }
#'
#' @note Includes all Cook County real property classes.
"class_dict"


#' Data dictionary of Certificate of Error reason codes
#'
#' A dataset containing numeric codes and corresponding text explanations for
160 changes: 51 additions & 109 deletions R/vars_funs.R
@@ -1,88 +1,25 @@
#' Check if a property class falls within its expected square footage and age
#' boundaries
#'
#' @description Check property characteristics against class definitions as defined # nolint
#' \href{https://datascience.cookcountyassessor.com/wiki/data/class-definitions.pdf}{here}. # nolint
#'
#' @param age Integer or numeric vector of ages of properties. Either 1 long
#' or the same length as \code{sqft} and \code{class}.
#' @param sqft Integer or numeric vector of the square footage of properties.
#' Either 1 long or the same length as \code{age} and \code{class}.
#' @param class String or character vector of class codes. Either 1 long or
#' the same length as \code{sqft} and \code{age}.
#'
#' @return A logical vector indicating that the specified class falls within
#' the parameters specified by \code{\link{class_dict}}. Throws error if input
#' data types are incorrect or if length conditions of input vectors
#' are not met.
#'
#' @examples
#' vars_check_class(50, 800, "202")
#' vars_check_class(c(50, 80), c(800, 1000), c("202", "203"))
#' vars_check_class(c(50, 80), 1000, "210")
#' vars_check_class(50, c(800, 2000), "202")
#' vars_check_class(50, 1000, c("202", "203"))
#' @importFrom magrittr %>%
#' @importFrom rlang .data
#' @family vars_funs
#' @export
vars_check_class <- function(age, sqft, class) {
# Simple error checking
stopifnot(
is.numeric(age),
is.numeric(sqft),
is.character(class)
)

# Take only the classes from the dictionary which are residential (200)
res_classes <- dplyr::filter(
ccao::class_dict, substr(.data$class_code, 1, 1) == "2"
)

# Element-wise comparison to test that age & sqft return the expected class
mapply(
function(x, y, z) {
idx <- res_classes$min_age <= x &
res_classes$max_age >= x &
res_classes$min_size <= y &
res_classes$max_size >= y

possible_classes <- res_classes$class_code[idx]

if (length(possible_classes) == 0) {
return(NA)
} else {
return(z %in% possible_classes)
}
},
x = age, y = sqft, z = class,
USE.NAMES = FALSE,
SIMPLIFY = TRUE
)
}


#' Bulk rename variables from CCAO SQL to standardized or pretty names
#' and vice versa
#'
#' @description Bulk rename columns from one type of CCAO data to another. For
#' example, rename all columns pulled from SQL to their standard names used
#' in modeling. Or, rename all standard modeling names to "pretty" names for
#' publication. This function will only rename things specified in
#' the user-supplied \code{dict} argument, all other names in the data will
#' remain unchanged.
#' the user-supplied \code{dictionary} argument, all other names in the data
#' will remain unchanged.
#'
#' Options for \code{names_from} and \code{names_to} are specific to the
#' specified \code{dict}. Run this function with \code{names_from} equal to
#' \code{NULL} to see a list of available options for the specified dictionary.
#' specified \code{dictionary}. Run this function with \code{names_from} equal
#' to \code{NULL} to see a list of available options for the specified
#' dictionary.
#'
#' @param data A data frame or tibble with columns to be renamed.
#' @param names_from The source/name type of data. See description
#' @param names_to The target names. See description
#' @param type Output type. Either \code{"inplace"}, which renames the input
#' data frame, or \code{"vector"}, which returns a named character vector with
#' the construction new_col_name = old_col_name.
#' @param dict The dictionary used to translate names. Uses
#' @param output_type Output type. Either \code{"inplace"}, which renames the
#' input data frame, or \code{"vector"}, which returns a named character
#' vector with the construction new_col_name = old_col_name.
#' @param dictionary The dictionary used to translate names. Uses
#' \code{\link{vars_dict}} by default. Use \code{\link{vars_dict_legacy}} for
#' legacy data column names.
#'
@@ -98,21 +35,21 @@ vars_check_class <- function(age, sqft, class) {
#' data = sample_data,
#' names_from = "sql",
#' names_to = "standard",
#' dict = ccao::vars_dict_legacy
#' dictionary = ccao::vars_dict_legacy
#' )
#' vars_rename(
#' data = sample_data,
#' names_from = "sql",
#' names_to = "pretty",
#' dict = ccao::vars_dict_legacy
#' dictionary = ccao::vars_dict_legacy
#' )
#'
#' # No renames will occur since no column names here are from SQL
#' vars_rename(
#' data = class_dict[1:5, 1:5],
#' data = chars_sample_athena[1:5, 1:10],
#' names_from = "sql",
#' names_to = "pretty",
#' dict = ccao::vars_dict_legacy
#' dictionary = ccao::vars_dict_legacy
#' )
#'
#' # With data from Athena
@@ -123,33 +60,33 @@ vars_check_class <- function(age, sqft, class) {
#' data = sample_data_athena,
#' names_from = "athena",
#' names_to = "model",
#' dict = ccao::vars_dict
#' dictionary = ccao::vars_dict
#' )
#' vars_rename(
#' data = sample_data_athena,
#' names_from = "athena",
#' names_to = "pretty",
#' dict = ccao::vars_dict
#' dictionary = ccao::vars_dict
#' )
#' @md
#' @family vars_funs
#' @export
vars_rename <- function(data,
names_from = NULL,
names_to = NULL,
type = "inplace",
dict = ccao::vars_dict) {
output_type = "inplace",
dictionary = ccao::vars_dict) {
# Check input data dictionary
stopifnot(
is.data.frame(dict),
sum(startsWith(names(dict), "var_name_")) >= 2,
nrow(dict) > 0
is.data.frame(dictionary),
sum(startsWith(names(dictionary), "var_name_")) >= 2,
nrow(dictionary) > 0
)

# Get vector of possible inputs to names_from and names_to from dictionary
poss_names_args <- gsub(
"var_name_", "",
names(dict)[startsWith(names(dict), "var_name_")]
names(dictionary)[startsWith(names(dictionary), "var_name_")]
)

# If args aren't in possible, throw error and list possible args
@@ -171,7 +108,7 @@ vars_rename <- function(data,
is.data.frame(data) | is.character(data),
tolower(names_from) %in% poss_names_args,
tolower(names_to) %in% poss_names_args,
tolower(type) %in% c("inplace", "vector")
tolower(output_type) %in% c("inplace", "vector")
)

# If the input is a dataframe, extract the names from that dataframe
@@ -181,15 +118,15 @@ vars_rename <- function(data,
to <- paste0("var_name_", names_to)

# Rename using dict, replacing any NAs with the original column names
names_wm <- dict[[to]][match(names_lst, dict[[from]])]
names_wm <- dictionary[[to]][match(names_lst, dictionary[[from]])]
names_wm[is.na(names_wm)] <- names_lst[is.na(names_wm)]

# Return names inplace if the input data is a data frame, else return a
# character vector of new names
if (is.data.frame(data) && type == "inplace") {
if (is.data.frame(data) && output_type == "inplace") {
names(data) <- names_wm
return(data)
} else if (is.character(data) || type == "vector") {
} else if (is.character(data) || output_type == "vector") {
return(names_wm)
}
}
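
The lookup in the hunk above can be tried without the package installed. What follows is only a minimal sketch: `rename_sketch` and `toy_dict` are hypothetical names, the dictionary values are made up for illustration, and the real function performs more input checks than shown here. It mirrors the `match()`-based rename and the renamed `output_type` and `dictionary` arguments.

```r
# Toy dictionary with made-up values; the package default is ccao::vars_dict
toy_dict <- data.frame(
  var_name_athena = c("char_yrblt", "char_bldg_sf"),
  var_name_pretty = c("Year Built", "Building Square Feet"),
  stringsAsFactors = FALSE
)

# Hypothetical helper that mirrors the match()-based lookup in the diff
rename_sketch <- function(data, names_from, names_to,
                          output_type = "inplace", dictionary = toy_dict) {
  stopifnot(
    is.data.frame(data) || is.character(data),
    output_type %in% c("inplace", "vector")
  )
  names_lst <- if (is.data.frame(data)) names(data) else data
  from <- paste0("var_name_", names_from)
  to <- paste0("var_name_", names_to)

  # Look up each name; names absent from the dictionary keep their old value
  names_wm <- dictionary[[to]][match(names_lst, dictionary[[from]])]
  names_wm[is.na(names_wm)] <- names_lst[is.na(names_wm)]

  if (is.data.frame(data) && output_type == "inplace") {
    names(data) <- names_wm
    return(data)
  }
  names_wm
}

df <- data.frame(char_yrblt = 1950, pin = "123", stringsAsFactors = FALSE)
renamed <- rename_sketch(df, "athena", "pretty")
new_names <- rename_sketch(df, "athena", "pretty", output_type = "vector")
```

Here `pin` is not in the toy dictionary, so it passes through unchanged in both output modes.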
@@ -204,7 +141,7 @@ vars_rename <- function(data,
#' must be specified via a user-defined dictionary. The default dictionary is
#' \code{\link{vars_dict}}.
#'
#' Options for \code{type} are:
#' Options for \code{code_type} are:
#'
#' - \code{"long"}, which transforms EXT_WALL = 1 to EXT_WALL = Frame
#' - \code{"short"}, which transforms EXT_WALL = 1 to EXT_WALL = FRME
@@ -215,11 +152,11 @@ vars_rename <- function(data,
#' @param cols A \code{<tidy-select>} column selection or vector of column
#' names. Looks for all columns with numerically encoded character
#' values by default.
#' @param type Output/recode type. See description for options.
#' @param code_type Output/recode type. See description for options.
#' @param as_factor If \code{TRUE}, re-encoded values will be returned as
#' factors with their levels pre-specified by the dictionary. Otherwise, will
#' return re-encoded values as characters only.
#' @param dict The dictionary used to translate encodings. Uses
#' @param dictionary The dictionary used to translate encodings. Uses
#' \code{\link{vars_dict}} by default. Use \code{\link{vars_dict_legacy}} for
#' legacy data column encodings.
#'
@@ -238,12 +175,12 @@ vars_rename <- function(data,
#' sample_data
#' vars_recode(
#' data = sample_data,
#' dict = ccao::vars_dict_legacy
#' dictionary = ccao::vars_dict_legacy
#' )
#' vars_recode(
#' data = sample_data,
#' type = "short",
#' dict = ccao::vars_dict_legacy
#' code_type = "short",
#' dictionary = ccao::vars_dict_legacy
#' )
#'
#' # Recode only the specified columns
@@ -253,26 +190,26 @@ vars_rename <- function(data,
#' vars_recode(
#' data = gar_sample,
#' cols = dplyr::starts_with("GAR"),
#' dict = ccao::vars_dict_legacy
#' dictionary = ccao::vars_dict_legacy
#' )
#' vars_recode(
#' data = gar_sample,
#' cols = "GAR1_SIZE",
#' dict = ccao::vars_dict_legacy
#' dictionary = ccao::vars_dict_legacy
#' )
#'
#' # Using data from Athena
#' sample_data_athena <- chars_sample_athena[1:5, c(1:5, 10:20)]
#' sample_data_athena
#' vars_recode(
#' data = sample_data_athena,
#' type = "code",
#' dict = ccao::vars_dict_legacy
#' code_type = "code",
#' dictionary = ccao::vars_dict_legacy
#' )
#' vars_recode(
#' data = sample_data_athena,
#' type = "long",
#' dict = ccao::vars_dict_legacy
#' code_type = "long",
#' dictionary = ccao::vars_dict_legacy
#' )
#' @md
#' @importFrom magrittr %>%
@@ -281,18 +218,23 @@ vars_rename <- function(data,
#' @export
vars_recode <- function(data,
cols = dplyr::everything(),
type = "long",
code_type = "long",
as_factor = TRUE,
dict = ccao::vars_dict) {
dictionary = ccao::vars_dict) {
# Check input data dictionary
stopifnot(
is.data.frame(dict),
sum(startsWith(names(dict), "var_name_")) >= 1,
nrow(dict) > 0
is.data.frame(dictionary),
sum(startsWith(names(dictionary), "var_name_")) >= 1,
nrow(dictionary) > 0
)

# Check that the dictionary contains the correct columns
if (!any(c("var_code", "var_value", "var_value_short") %in% names(dict))) {
if (
!any(
c("var_code", "var_value", "var_value_short")
%in% names(dictionary)
)
) {
stop(
"Input dictionary must contain the following columns: ",
"var_code, var_value, var_value_short"
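
A note on the column check in this hunk: because it uses `!any(...)`, the error fires only when none of the three columns is present, while the message reads as if all three are required. The snippet below is a sketch of a stricter variant, offered as an assumption about what may have been intended and not as what the commit does; `check_dictionary_cols` is a hypothetical helper.

```r
# Hypothetical stricter check: report every required column that is missing
check_dictionary_cols <- function(dictionary) {
  required <- c("var_code", "var_value", "var_value_short")
  missing_cols <- setdiff(required, names(dictionary))
  if (length(missing_cols) > 0) {
    stop(
      "Input dictionary is missing the following columns: ",
      paste(missing_cols, collapse = ", ")
    )
  }
  invisible(TRUE)
}

# A dictionary with only two of the three columns
partial_dict <- data.frame(var_code = "1", var_value = "Frame")

# The any()-based condition from the diff accepts this dictionary
passes_any <- any(
  c("var_code", "var_value", "var_value_short") %in% names(partial_dict)
)

# The stricter sketch rejects it and names the missing column
strict_result <- tryCatch(
  check_dictionary_cols(partial_dict),
  error = function(e) conditionMessage(e)
)
```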
@@ -302,20 +244,20 @@ vars_recode <- function(data,
# Error/input checking
stopifnot(
is.data.frame(data),
type %in% c("code", "short", "long"),
code_type %in% c("code", "short", "long"),
is.logical(as_factor)
)

# Translate inputs to column names
var <- switch(type,
var <- switch(code_type,
"code" = "var_code",
"long" = "var_value",
"short" = "var_value_short"
)

# Convert chars dict into long format that can be easily referenced using
# any possible input column names
dict_long <- dict %>%
dict_long <- dictionary %>%
dplyr::filter(
.data$var_type == "char" & .data$var_data_type == "categorical"
) %>%
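
To illustrate how the renamed `code_type` argument selects an output encoding, here is a simplified sketch in base R. `recode_sketch` and `toy_dict` are hypothetical, the second dictionary row is invented for illustration, and unlike the real `vars_recode` this version only accepts numeric codes as input and works on a single vector instead of a data frame.

```r
# Toy dictionary; the first row follows the EXT_WALL example in the docs,
# the second row is made up for illustration
toy_dict <- data.frame(
  var_code = c("1", "2"),
  var_value = c("Frame", "Masonry"),
  var_value_short = c("FRME", "MASR"),
  stringsAsFactors = FALSE
)

# Hypothetical single-vector recode mirroring the switch() in the diff
recode_sketch <- function(x, code_type = "long", as_factor = TRUE,
                          dictionary = toy_dict) {
  stopifnot(
    code_type %in% c("code", "short", "long"),
    is.logical(as_factor)
  )

  # Translate the requested code type into a dictionary column name
  var <- switch(code_type,
    "code" = "var_code",
    "long" = "var_value",
    "short" = "var_value_short"
  )

  # Look up each input code and pull the requested encoding
  out <- dictionary[[var]][match(as.character(x), dictionary$var_code)]
  if (as_factor) {
    out <- factor(out, levels = dictionary[[var]])
  }
  out
}

long_vals <- recode_sketch(c(1, 2, 1), code_type = "long", as_factor = FALSE)
short_vals <- recode_sketch(c(1, 2, 1), code_type = "short")
```

With `as_factor = TRUE` the factor levels come from the dictionary, so levels stay stable even when a value does not occur in the input.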
1 change: 0 additions & 1 deletion _pkgdown.yml
@@ -43,7 +43,6 @@ reference:
- chars_fix_age
- chars_sparsify
- chars_update
- vars_check_class
- vars_recode
- vars_rename
- subtitle: Adjust estimated values
6 changes: 0 additions & 6 deletions data-raw/class_dict.R

This file was deleted.
