diff --git a/NEWS.md b/NEWS.md
index 1fc2f97f..48fe9035 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -1,4 +1,4 @@
-# janitor 2.0.1 (unreleased)
+# janitor 2.0.1 (2020-04-12)

 ## Bug fixes and Breaking changes

diff --git a/docs/404.html b/docs/404.html
index 453c9135..7f43f6e3 100644
--- a/docs/404.html
+++ b/docs/404.html
@@ -8,11 +8,13 @@
vignettes/janitor.Rmd
janitor.Rmd
clean_names()
Call this function every time you read data.
-It works in a %>%
pipeline, and handles problematic variable names, especially those that are so well-preserved by readxl::read_excel()
and readr::read_csv()
.
It works in a %>%
pipeline, and handles problematic variable names, especially those that are so well-preserved by readxl::read_excel()
and readr::read_csv()
.
compare_df_cols()
For cases when you are given a set of data files that should be identical, and you wish to read and combine them for analysis. But then dplyr::bind_rows()
or rbind()
fails, because of different columns or because the column classes don’t match across data.frames.
compare_df_cols()
takes unquoted names of data.frames / tibbles, or a list of data.frames, and returns a summary of how they compare. See what the column types are, which are missing or present in the different inputs, and how column types differ.
df1 <- data.frame(a = 1:2, b = c("big", "small")) # a factor by default df2 <- data.frame(a = 10:12, b = c("medium", "small", "big"), c = 0, stringsAsFactors = FALSE) df3 <- df1 %>% - dplyr::mutate(b = as.character(b)) + dplyr::mutate(b = as.character(b)) compare_df_cols(df1, df2, df3) #> column_name df1 df2 df3 @@ -313,7 +313,7 @@
Convert a mix of date and datetime formats to date
Building on
-excel_numeric_to_date()
, the new functions convert_to_date()
and convert_to_datetime()
are more robust to a mix of inputs. Handy when reading many spreadsheets that should have the same column formats, but don’t.For instance, here a vector with a date and an Excel datetime sees both values succcesfully converted to Date class:
+For instance, here a vector with a date and an Excel datetime sees both values successfully converted to Date class:
convert_to_date(c("2020-02-29", "40000.1")) #> [1] "2020-02-29" "2009-07-06"
vignettes/tabyls.Rmd
tabyls.Rmd
This vignette demonstrates tabyl
in the context of studying humans in the starwars
dataset from dplyr:
This is often called a “crosstab” or “contingency” table. Calling tabyl
on two columns of a data.frame produces the same result as the common combination of dplyr::count()
, followed by tidyr::pivot_wider()
to wide form:
t2 <- humans %>% tabyl(gender, eye_color) @@ -348,8 +348,8 @@
This can be handy when you have a data.frame that is not a simple tabulation generated by
tabyl
but would still benefit from the adorn_
formatting functions. A simple example: calculate the proportion of records meeting a certain condition, then format the results.
-percent_above_165_cm <- humans %>% - group_by(gender) %>% - summarise(pct_above_165_cm = mean(height > 165, na.rm = TRUE)) + group_by(gender) %>% + summarise(pct_above_165_cm = mean(height > 165, na.rm = TRUE)) percent_above_165_cm %>% adorn_pct_formatting() @@ -358,11 +358,11 @@#> <chr> <chr> #> 1 feminine 12.5% #> 2 masculine 100.0%
You can control which columns are adorned by using the
+...
argument. It accepts the tidyselect helpers. That is, you can specify columns the same way you would using dplyr::select()
.You can control which columns are adorned by using the
...
argument. It accepts the tidyselect helpers. That is, you can specify columns the same way you would using dplyr::select()
.For instance, say you have a numeric column that should not be included in percentage formatting and you wish to exempt it. Here, only the
count
column is adorned:@@ -101,7 +105,6 @@ Changelog -mtcars %>% - count(cyl, gear) %>% - rename(proportion = n) %>% + count(cyl, gear) %>% + rename(proportion = n) %>% adorn_percentages("col", na.rm = TRUE, proportion) %>% adorn_pct_formatting(,,,proportion) # the commas say to use the default values of the other arguments #> cyl gear proportion @@ -392,8 +392,8 @@
Here’s a more complex example that uses a data.frame of means, not counts. We create a table containing the mean of a 3rd variable when grouped by two other variables, then use
adorn_
functions to round the values and append Ns. The first part is pretty straightforward:library(tidyr) # for spread() mpg_by_cyl_and_am <- mtcars %>% - group_by(cyl, am) %>% - summarise(mpg = mean(mpg)) %>% + group_by(cyl, am) %>% + summarise(mpg = mean(mpg)) %>% spread(am, mpg) mpg_by_cyl_and_am @@ -427,6 +427,7 @@
+diff --git a/docs/authors.html b/docs/authors.html index 8b21e98f..4d70ae9e 100644 --- a/docs/authors.html +++ b/docs/authors.html @@ -8,11 +8,13 @@Authors • janitor + + @@ -36,10 +38,12 @@ + + @@ -67,7 +71,7 @@ janitor - 2.0.0 + 2.0.1
Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.
-– “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insight” - The New York Times, 2014
+– “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insight“ - The New York Times, 2014
the latest development version from GitHub with
install.packages("devtools") -devtools::install_github("sfirke/janitor")
library(readxl); library(janitor); library(dplyr); library(here) roster_raw <- read_excel(here("dirty_data.xlsx")) # available at http://github.com/sfirke/janitor -glimpse(roster_raw) +glimpse(roster_raw) #> Rows: 13 #> Columns: 11 #> $ `First Name` <chr> "Jason", "Jason", "Alicia", "Ada", "Desus", "Chien-Shiung", "Chien-Shiung", NA,… @@ -173,7 +173,7 @@.name_repair = make_clean_names) # Tells read_excel() how to repair repetitive column names, overriding the # default repair setting -glimpse(roster_raw_cleaner) +glimpse(roster_raw_cleaner) #> Rows: 13 #> Columns: 11 #> $ first_name <chr> "Jason", "Jason", "Alicia", "Ada", "Desus", "Chien-Shiung", "Chien-Shiung", NA, "… @@ -190,9 +190,9 @@
This can be further cleaned:
+roster <- roster_raw_cleaner %>% remove_empty(c("rows", "cols")) %>% - mutate(hire_date = excel_numeric_to_date(hire_date), - cert = coalesce(certification, certification_2)) %>% # from dplyr - select(-certification, -certification_2) # drop unwanted columns + mutate(hire_date = excel_numeric_to_date(hire_date), + cert = coalesce(certification, certification_2)) %>% # from dplyr + select(-certification, -certification_2) # drop unwanted columns roster #> # A tibble: 12 x 8 @@ -227,7 +227,7 @@
Finding duplicates
Use
-get_dupes()
to identify and examine duplicate records during data cleaning. Let’s see if any teachers are listed more than once:@@ -101,7 +105,6 @@ Changelog -roster %>% get_dupes(contains("name")) +roster %>% get_dupes(contains("name")) #> # A tibble: 4 x 9 #> first_name last_name dupe_count employee_status subject hire_date percent_allocat… full_time cert #> <chr> <chr> <int> <chr> <chr> <date> <dbl> <chr> <chr> @@ -272,7 +272,7 @@#> <NA> 2 0.16666667 NA
Two variables:
@@ -101,7 +105,6 @@ Changelog -roster %>% - filter(hire_date > as.Date("1950-01-01")) %>% + filter(hire_date > as.Date("1950-01-01")) %>% tabyl(employee_status, full_time) #> employee_status No Yes #> Administration 0 1 @@ -319,7 +319,7 @@
- submit suggestions and report bugs: https://github.com/sfirke/janitor/issues
-- let me know what you think on Mastodon @samfirke@a2mi.social +
- let me know what you think on Mastodon: @samfirke@a2mi.social
- compose a friendly e-mail to:
diff --git a/docs/issue_template.html b/docs/issue_template.html index 2d9cf95e..ce4dc689 100644 --- a/docs/issue_template.html +++ b/docs/issue_template.html @@ -8,11 +8,13 @@NA • janitor + + @@ -36,10 +38,12 @@ + + @@ -67,7 +71,7 @@ janitor - 2.0.0 + 2.0.1++janitor 2.0.1 (2020-04-12) 2020-04-12 +
++++Bug fixes and Breaking changes
+Transliteration of characters within
+make_clean_names()
now operates across operating systems, independent of differences in stringi
installations (Fix #365, thanks to @eamoncaddigan for reporting and @billdenney for fixing). This bug patch represents a breaking change with the way that
+make_clean_names()
worked in janitor versions 1.2.1.9000 and 2.0.0 as the transliterations are now more generalized and follow a more best-practice approach to transliterating to ASCII.@@ -101,7 +105,6 @@ Changelog --janitor 2.0.0 (2020-04-07) Unreleased +janitor 2.0.0 (2020-04-07) 2020-04-08
-Breaking Changes
+Breaking changes
- @@ -255,7 +269,7 @@
clean_names()
and make_clean_names()
are now more locale-independent and translation to ASCII is simpler (in many cases, Unicode is removed, e.g., the Greek character “delta” becomes a “d”). You may also now control how substitutions occur and add your own substitutions (like “%” becoming “percent”). As a result of these changes, the clean names generated by these functions may break with what was produced in prior versions of janitor. (Fix #331, thanks to @billdenney)
-Major Features
+Major featuresThe new function
row_to_names()
handles the case where a dirty data file is read in with its names stored as a row of the data.frame, rather than in the names. This function sets the names of the data.frame to this row and optionally cleans up the rows above and including where the names were stored. Thanks to @billdenney for writing this feature.@@ -288,7 +302,7 @@A fully-overhauled
tabyl
-
tabyl()
is now a single function that can count combinations of one, two, or three variables, à la base R’s table()
. The resulting tabyl
data.frames can be manipulated and formatted using a family of adorn_
functions. See the tabyls vignette for more.The now-redundant legacy functions
+crosstab()
and adorn_crosstab()
have been deprecated, but remain in the package for now. Existing code that relies on the version of tabyl
present in janitor versions <= 0.3.1 will break if the sort
argument was used, as that argument no longer exists in tabyl
(use dplyr::arrange()
instead).The now-redundant legacy functions
crosstab()
and adorn_crosstab()
have been deprecated, but remain in the package for now. Existing code that relies on the version of tabyl
present in janitor versions <= 0.3.1 will break if the sort
argument was used, as that argument no longer exists in tabyl
(use dplyr::arrange()
instead).@@ -302,7 +316,7 @@
-Major Features
+Major features
clean_names()
transliterates accented letters, e.g., çãüœ
becomes cauoe
(#120). Thanks to @fernandovmacedo.- @@ -317,7 +331,7 @@
-Minor Features
+Minor features
- The utility function
round_half_up()
is now exported for public use. It’s an exact implementation of http://stackoverflow.com/questions/12688717/round-up-from-5-in-r/12688836#12688836, written by @mrdwab.- @@ -373,16 +387,16 @@
-Major Features
+Major featuresDeprecated the following functions:
-use_first_valid_of() - use dplyr::coalesce() instead
+use_first_valid_of() - use dplyr::coalesce() instead
-convert_to_NA() - use dplyr::na_if() instead
+convert_to_NA() - use dplyr::na_if() instead
@@ -390,7 +404,7 @@
add_totals_row()
and add_totals_col()
- replaced by the single function adorn_totals()
@@ -101,7 +105,6 @@ Changelog --Minor Features
+Minor features
adorn_totals()
andns_to_percents()
can now be called on data.frames that have non-numeric columns beyond the first one (those columns will be ignored) (#57) diff --git a/docs/pkgdown.yml b/docs/pkgdown.yml index 404831b4..f32affa8 100644 --- a/docs/pkgdown.yml +++ b/docs/pkgdown.yml @@ -4,5 +4,5 @@ pkgdown_sha: ~ articles: janitor: janitor.html tabyls: tabyls.html -last_built: 2020-04-08T03:24Z +last_built: 2020-04-12T14:27Z diff --git a/docs/planning.html b/docs/planning.html index 8a95e634..88e62703 100644 --- a/docs/planning.html +++ b/docs/planning.html @@ -8,11 +8,13 @@High-level package planning • janitor + + @@ -36,10 +38,12 @@ + + @@ -67,7 +71,7 @@ janitor - 2.0.0 + 2.0.1
This function is deprecated, use adorn_totals
instead.
add_totals_col(dat, na.rm = TRUE)- +
should missing values (including NaN) be omitted from the calculations? |
Returns a data.frame with a totals column containing row-wise sums.
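As a rough sketch of the replacement named above, the same row-wise totals column can be produced with `adorn_totals(where = "col")`. The `counts` data.frame is made up for illustration and is not part of the package documentation.

```r
library(janitor)

# Hypothetical example data: the first column is descriptive, the rest are counts
counts <- data.frame(
  group = c("a", "b"),
  x = c(1, 2),
  y = c(10, 20)
)

# where = "col" appends a column (named "Total" by default) of row-wise sums
with_totals <- adorn_totals(counts, where = "col")
with_totals
```

Passing `where = "row"` instead covers the `add_totals_row()` case, and `where = c("row", "col")` adds both.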
-This function is deprecated, use adorn_totals
instead.
add_totals_row(dat, fill = "-", na.rm = TRUE)- +
should missing values (including NaN) be omitted from the calculations? |
Returns a data.frame with a totals row, consisting of "Total" in the first column and column sums in the others.
-This function is deprecated, use the adorn_
family of functions instead.
adorn_crosstab( @@ -145,7 +143,7 @@- +Add presentation formatting to a crosstabulation table.
show_totals = FALSE, rounding = "half to even" )
method to use for truncating percentages - either "half to even", the base R default method, or "half up", where 14.5 rounds up to 15. |
Returns a data.frame.
-This function adds back the underlying Ns to a tabyl
whose percentages were calculated using adorn_percentages()
, to display the Ns and percentages together. You can also call it on a non-tabyl data.frame to which you wish to append Ns.
adorn_ns(dat, position = "rear", ns = attr(dat, "core"), ...)- +
a data.frame with Ns appended
-diff --git a/docs/reference/adorn_pct_formatting.html b/docs/reference/adorn_pct_formatting.html index e22e5de0..2cf8f624 100644 --- a/docs/reference/adorn_pct_formatting.html +++ b/docs/reference/adorn_pct_formatting.html @@ -8,11 +8,13 @@@@ -105,7 +106,6 @@ ChangelogFormat a data.frame of decimals as percentages. — adorn_pct_formatting • janitor + + @@ -36,14 +38,13 @@ - + - @@ -71,7 +72,7 @@ janitor - 2.0.0 + 2.0.1
Numeric columns get multiplied by 100 and formatted as percentages according to user specifications. This function defaults to excluding the first column of the input data.frame, assuming that it contains a descriptive variable, but this can be overridden by specifying the columns to adorn in the ...
argument. Non-numeric columns are always excluded.
adorn_pct_formatting( @@ -144,7 +142,7 @@- +Format a data.frame of decimals as percentages.
affix_sign = TRUE, ... )
a data.frame with formatted percentages
-diff --git a/docs/reference/adorn_percentages.html b/docs/reference/adorn_percentages.html index f02da2cb..92d2397e 100644 --- a/docs/reference/adorn_percentages.html +++ b/docs/reference/adorn_percentages.html @@ -8,11 +8,13 @@@@ -105,7 +106,6 @@ ChangelogConvert a data.frame of counts to percentages. — adorn_percentages • janitor + + @@ -36,14 +38,13 @@ - + - @@ -71,7 +72,7 @@ janitor - 2.0.0 + 2.0.1
This function defaults to excluding the first column of the input data.frame, assuming that it contains a descriptive variable, but this can be overridden by specifying the columns to adorn in the ...
argument.
adorn_percentages(dat, denominator = "row", na.rm = TRUE, ...)- +
Returns a data.frame of percentages, expressed as numeric values between 0 and 1.
-diff --git a/docs/reference/adorn_rounding.html b/docs/reference/adorn_rounding.html index 772eedae..0e845bcd 100644 --- a/docs/reference/adorn_rounding.html +++ b/docs/reference/adorn_rounding.html @@ -8,11 +8,13 @@@@ -106,7 +107,6 @@ ChangelogRound the numeric columns in a data.frame. — adorn_rounding • janitor + + @@ -36,15 +38,14 @@ - + - @@ -72,7 +73,7 @@ janitor - 2.0.0 + 2.0.1
Can run on any data.frame with at least one numeric column. This function defaults to excluding the first column of the input data.frame, assuming that it contains a descriptive variable, but this can be overridden by specifying the columns to round in the ...
argument.
If you're formatting percentages, e.g., the result of adorn_percentages()
, use adorn_pct_formatting()
instead. This is a more flexible variant for ad-hoc usage. Compared to adorn_pct_formatting()
, it does not multiply by 100 or pad the numbers with spaces for alignment in the results data.frame. This function retains the class of numeric input columns.
adorn_rounding(dat, digits = 1, rounding = "half to even", ...)- +
Returns the data.frame with rounded numeric columns.
-diff --git a/docs/reference/adorn_title.html b/docs/reference/adorn_title.html index ee675dab..035ba2ed 100644 --- a/docs/reference/adorn_title.html +++ b/docs/reference/adorn_title.html @@ -8,11 +8,13 @@mtcars %>% tabyl(am, cyl) %>% adorn_percentages("all") %>% - mutate(dummy = "a") %>% + mutate(dummy = "a") %>% adorn_rounding()#> am 4 6 8 dummy #> 0 0.1 0.1 0.4 a #> 1 0.2 0.1 0.1 a@@ -192,7 +189,7 @@Examp mtcars %>% tabyl(am, cyl) %>% adorn_percentages("row") %>% - adorn_rounding(digits = 1, rounding = "half up", starts_with("8"))
#> am 4 6 8 #> 0 0.1578947 0.2105263 0.6 #> 1 0.6153846 0.2307692 0.2
This function adds the column variable name to the top of a tabyl
for a complete display of information. This makes the tabyl prettier, but renders the data.frame less useful for further manipulation.
adorn_title(dat, placement = "top", row_name, col_name)- +
the input tabyl, augmented with the column title. Non-tabyl inputs that are of class tbl_df
are downgraded to basic data.frames so that the title row prints correctly.
@@ -176,8 +173,8 @@Examp # Adding a title to a non-tabyl library(tidyr); library(dplyr) mtcars %>% - group_by(gear, am) %>% - summarise(avg_mpg = mean(mpg)) %>% + group_by(gear, am) %>% + summarise(avg_mpg = mean(mpg)) %>% spread(gear, avg_mpg) %>% adorn_title("top", row_name = "Gears", col_name = "Cylinders")
#> Cylinders #> 1 Gears 3 4 5 diff --git a/docs/reference/adorn_totals.html b/docs/reference/adorn_totals.html index 9a4bb732..8784bd58 100644 --- a/docs/reference/adorn_totals.html +++ b/docs/reference/adorn_totals.html @@ -8,11 +8,13 @@@@ -105,7 +106,6 @@ ChangelogAppend a totals row and/or column to a data.frame. — adorn_totals • janitor + + @@ -36,14 +38,13 @@ - + - @@ -71,7 +72,7 @@ janitor - 2.0.0 + 2.0.1
This function defaults to excluding the first column of the input data.frame, assuming that it contains a descriptive variable, but this can be overridden by specifying the columns to be totaled in the ...
argument. Non-numeric columns are converted to character class and have a user-specified fill character inserted in the totals row.
adorn_totals(dat, where = "row", fill = "-", na.rm = TRUE, name = "Total", ...)- +
Returns a data.frame augmented with a totals row, column, or both. The data.frame is now also of class tabyl
and stores information about the attached totals and underlying data in the tabyl attributes.
mtcars %>% diff --git a/docs/reference/as_tabyl.html b/docs/reference/as_tabyl.html index 2ca5ff64..43b16e3b 100644 --- a/docs/reference/as_tabyl.html +++ b/docs/reference/as_tabyl.html @@ -8,11 +8,13 @@@@ -112,7 +113,6 @@ ChangelogAdd + + @@ -36,8 +38,8 @@ - + @@ -78,7 +79,7 @@ janitor - 2.0.0 + 2.0.1tabyl
attributes to a data.frame. — as_tabyl • janitor
tabyl
attributes to a data.frame.A tabyl
is a data.frame containing counts of a variable or co-occurrences of two variables (a.k.a., a contingency table or crosstab). This specialized kind of data.frame has attributes that enable adorn_
functions to be called for precise formatting and presentation of results. E.g., display results as a mix of percentages, Ns, add totals rows or columns, rounding options, in the style of Microsoft Excel PivotTable.
A tabyl
can be the result of a call to janitor::tabyl()
, in which case these attributes are added automatically. This function adds tabyl
class attributes to a data.frame that isn't the result of a call to tabyl
but meets the requirements of a two-way tabyl:
1) First column contains values of variable 1
2) Column names 2:n are the values of variable 2
3) Numeric values in columns 2:n are counts of the co-occurrences of the two variables.*
* = this is the ideal form of a tabyl, but janitor's adorn_
functions tolerate and ignore non-numeric columns in positions 2:n.
For instance, the result of dplyr::count()
followed by tidyr::spread()
can be treated as a tabyl
.
The result of calling tabyl()
on a single variable is a special class of one-way tabyl; this function only pertains to the two-way tabyl.
as_tabyl(dat, axes = 2, row_var_name = NULL, col_var_name = NULL)- +
(optional) the name of the variable in the column dimension; used by |
Returns the same data.frame, but with the additional class of "tabyl" and the attribute "core".
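The requirements above can be illustrated with a small sketch, assuming dplyr and tidyr are installed. It builds a two-way count table by hand (using `tidyr::pivot_wider()` in place of `spread()`), converts it with `as_tabyl()`, and then applies `adorn_` functions. The object names are illustrative only.

```r
library(janitor)
library(dplyr)
library(tidyr)

# Build a two-way count table without calling tabyl():
# first column holds values of cyl, remaining columns are values of am
counts_wide <- mtcars %>%
  count(cyl, am) %>%
  pivot_wider(names_from = am, values_from = n)

# Add the tabyl class and attributes so adorn_ functions can be applied
tab <- as_tabyl(counts_wide)

tab %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting()
```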
-as_tabyl(mtcars)#> mpg cyl disp hp drat wt qsec vs am gear carb diff --git a/docs/reference/chisq.test.html b/docs/reference/chisq.test.html index 165046ef..97d210c6 100644 --- a/docs/reference/chisq.test.html +++ b/docs/reference/chisq.test.html @@ -8,11 +8,13 @@@@ -107,7 +108,6 @@ ChangelogApply stats::chisq.test to a two-way tabyl — chisq.test • janitor + + @@ -36,8 +38,8 @@ - + @@ -45,7 +47,6 @@ - @@ -73,7 +74,7 @@ janitor - 2.0.0 + 2.0.1
This generic function overrides stats::chisq.test. If the passed table is a two-way tabyl, it runs it through janitor::chisq.test.tabyl, otherwise it just calls stats::chisq.test.
-chisq.test(x, ...) @@ -148,7 +146,7 @@- +Apply stats::chisq.test to a two-way tabyl
# S3 method for tabyl chisq.test(x, tabyl_results = TRUE, ...)
if TRUE and x is a tabyl object, also return `observed`, `expected`, `residuals` and `stdres` as tabyl |
The result is the same as the one of stats::chisqt.test. If `tabyl_results` +
The result is the same as the one of stats::chisq.test. If `tabyl_results` is TRUE, the returned tables `observed`, `expected`, `residuals` and `stdres` are converted to tabyls.
-tab <- tabyl(mtcars, gear, cyl) diff --git a/docs/reference/clean_names.html b/docs/reference/clean_names.html index a1c1b670..5b04017c 100644 --- a/docs/reference/clean_names.html +++ b/docs/reference/clean_names.html @@ -8,11 +8,13 @@@@ -112,7 +113,6 @@ ChangelogCleans names of an object (usually a data.frame). — clean_names • janitor + + @@ -36,8 +38,8 @@ - + @@ -78,7 +79,7 @@ janitor - 2.0.0 + 2.0.1
Resulting names are unique and consist only of the _
character, numbers, and letters.
Capitalization preferences can be specified using the case
parameter.
Accented characters are transliterated to ASCII. For example, an "o" with a @@ -148,7 +147,6 @@
This function takes and returns a data.frame, for ease of piping with
`%>%`
. For the underlying function that works on a character vector
of names, see make_clean_names
.
clean_names(dat, ...) @@ -164,7 +162,7 @@- +Cleans names of an object (usually a data.frame).
# S3 method for tbl_graph clean_names(dat, ...)
... | Arguments passed on to
|
---|
Returns the data.frame with clean names.
-clean_names()
is intended to be used on data.frames
@@ -229,7 +226,6 @@
clean_names()
on sf
and tbl_graph
(from
tidygraph
) objects. For cleaning named lists and vectors, consider
using make_clean_names()
.
-
# not run: diff --git a/docs/reference/compare_df_cols.html b/docs/reference/compare_df_cols.html index 41fc5f78..58ca7237 100644 --- a/docs/reference/compare_df_cols.html +++ b/docs/reference/compare_df_cols.html @@ -9,11 +9,13 @@@@ -108,7 +109,6 @@ ChangelogGenerate a comparison of data.frames (or similar objects) that indicates if they will successfully bind together by rows. — compare_df_cols • janitor + + @@ -37,16 +39,15 @@ + - - @@ -74,7 +75,7 @@ janitor - 2.0.0 + 2.0.1
Generate a comparison of data.frames (or similar objects) that indicates if they will successfully bind together by rows.
-compare_df_cols( @@ -148,7 +146,7 @@- +Generate a comparison of data.frames (or similar objects) that indicates if bind_method = c("bind_rows", "rbind"), strict_description = FALSE )
bind_method | What method of binding should be used to determine
matches? With "bind_rows", columns missing from a data.frame would be
-considered a match (as in |
@@ -178,7 +176,7 @@
---|
A data.frame with a column named "column_name" with a value named @@ -189,7 +187,6 @@
describe_class
).
-
Due to the returned "column_name" column, no input data.frame may be @@ -201,13 +198,11 @@
strict_description
is FALSE
, data.frames may still bind
because some classes (like factors and characters) can bind even if they
appear to differ.
-
Other Data frame type comparison:
compare_df_cols_same()
,
describe_class()
#> column_name data.frame(A = 1) data.frame(B = 2) diff --git a/docs/reference/compare_df_cols_same.html b/docs/reference/compare_df_cols_same.html index 49ce3c17..ad313d7b 100644 --- a/docs/reference/compare_df_cols_same.html +++ b/docs/reference/compare_df_cols_same.html @@ -8,11 +8,13 @@@@ -106,7 +107,6 @@ ChangelogDo the the data.frames have the same columns & types? — compare_df_cols_same • janitor + + @@ -36,15 +38,14 @@ - + - @@ -72,7 +73,7 @@ janitor - 2.0.0 + 2.0.1
Check whether a set of data.frames are row-bindable. Calls
compare_df_cols()
and returns TRUE if there are no mis-matching rows.
compare_df_cols_same( @@ -144,7 +142,7 @@- +Do the the data.frames have the same columns & types?
bind_method = c("bind_rows", "rbind"), verbose = TRUE )
bind_method | What method of binding should be used to determine
matches? With "bind_rows", columns missing from a data.frame would be
-considered a match (as in |
@@ -168,18 +166,16 @@ Print the mismatching columns if binding will fail. |
---|
TRUE
if row binding will succeed or FALSE
if it will
fail.
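A minimal sketch of the TRUE/FALSE behaviour described above, using made-up data.frames. The second call sets `verbose = FALSE` so the mismatching columns are not printed.

```r
library(janitor)

# Hypothetical inputs: two with matching column classes, one with a differing class
df_int  <- data.frame(a = 1:2, b = c("x", "y"), stringsAsFactors = FALSE)
df_int2 <- data.frame(a = 3:4, b = c("z", "w"), stringsAsFactors = FALSE)
df_chr  <- data.frame(a = c("1", "2"), b = c("x", "y"), stringsAsFactors = FALSE)

# Column classes agree, so row binding should succeed
same_types <- compare_df_cols_same(df_int, df_int2)

# Column `a` is integer in one input and character in the other
mixed_types <- compare_df_cols_same(df_int, df_chr, verbose = FALSE)
```

Checking this before calling `dplyr::bind_rows()` lets a script stop early with a clear message rather than fail partway through a bind.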
Other Data frame type comparison:
compare_df_cols()
,
describe_class()
#> [1] TRUE#> [1] TRUE#> [1] TRUE#> column_name ..1 ..2 diff --git a/docs/reference/convert_to_NA.html b/docs/reference/convert_to_NA.html index d5bb85b4..9a38f535 100644 --- a/docs/reference/convert_to_NA.html +++ b/docs/reference/convert_to_NA.html @@ -8,11 +8,13 @@@@ -105,7 +106,6 @@ ChangelogConvert string values to true + + @@ -36,14 +38,13 @@ - + - @@ -71,7 +72,7 @@ janitor - 2.0.0 + 2.0.1NA
values. — convert_to_NA • janitor
NA
values.Converts instances of user-specified strings into NA
. Can operate on either a single vector or an entire data.frame.
convert_to_NA(dat, strings)- +
character vector of strings to convert. |
Returns a cleaned object. Can be a vector, data.frame, or tibble::tbl_df
depending on the provided input.
Deprecated, do not use in new code. Use dplyr::na_if()
instead.
janitor_deprecated
Convert many date and datetime formats as may be received from Microsoft Excel
-convert_to_date( @@ -156,7 +154,7 @@- +Convert many date and datetime formats as may be received from Microsoft character_fun = lubridate::ymd_hms, string_conversion_failure = c("error", "warning") )
POSIXct objects for `convert_to_datetime()` or Date objects for `convert_to_date()`.
-Character conversion checks if it matches something that looks like @@ -199,21 +196,16 @@
convert_to_datetime
: Convert to a date-time (POSIXct)
Other If your input data has a mix of Excel numeric dates and actual dates, - see the more powerful functions `convert_to_date` and - `convert_to_datetime`.: +
Other Date-time cleaning:
excel_numeric_to_date()
convert_to_date("2009-07-06")#> [1] "2009-07-06"convert_to_date(40000)#> [1] "2009-07-06"convert_to_date("40000.1")#> [1] "2009-07-06"# Mixed date source data can be provided. diff --git a/docs/reference/crosstab.html b/docs/reference/crosstab.html index d67c6982..a53f3c67 100644 --- a/docs/reference/crosstab.html +++ b/docs/reference/crosstab.html @@ -8,11 +8,13 @@@@ -105,7 +106,6 @@ Changelog -Generate a crosstabulation of two vectors. — crosstab • janitor + + @@ -36,14 +38,13 @@ - + - @@ -71,7 +72,7 @@ janitor - 2.0.0 + 2.0.1
This function is deprecated, use tabyl(dat, var1, var2)
instead.
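A brief sketch of the replacement call form named above, using the built-in `mtcars` data as a stand-in for `dat`:

```r
library(janitor)

# Two-way tabulation: rows are values of cyl, columns are values of am
tab <- tabyl(mtcars, cyl, am)
tab
```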
crosstab(...)- +
arguments |
Describe the class(es) of an object
-describe_class(x, strict_description = TRUE) @@ -144,7 +142,7 @@- +Describe the class(es) of an object
# S3 method for default describe_class(x, strict_description = TRUE)
A character scalar describing the class(es) of an object where if the scalar will match, columns in a data.frame (or similar object) should bind together without issue.
-For package developers, an S3 generic method can be written for
describe_class()
for custom classes that may need more definition
than the default method. This function is called by compare_df_cols
.
Other Data frame type comparison:
compare_df_cols_same()
,
compare_df_cols()
diff --git a/docs/reference/excel_numeric_to_date.html b/docs/reference/excel_numeric_to_date.html index b182b72b..4f78bb54 100644 --- a/docs/reference/excel_numeric_to_date.html +++ b/docs/reference/excel_numeric_to_date.html @@ -8,11 +8,13 @@describe_class(1)#> [1] "numeric"#> [1] "factor(levels=c(\"A\"))"#> [1] "ordered, factor(levels=c(\"A\", \"B\"))"#> [1] "factor"
Converts numbers like 42370
into date values like
2016-01-01
.
Defaults to the modern Excel date encoding system. However, Excel for Mac @@ -152,7 +153,8 @@
A list of all timezones is available from base::OlsonNames()
, and the
current timezone is available from base::Sys.timezone()
.
If your input data has a mix of Excel numeric dates and actual dates, see the +more powerful functions `convert_to_date()` and `convert_to_datetime()`.
excel_numeric_to_date( @@ -162,7 +164,7 @@- +Convert dates encoded as serial numbers to Date class.
round_seconds = TRUE, tz = "" )
Returns a vector of class Date if include_time
is
FALSE
. Returns a vector of class POSIXlt if include_time
is
TRUE
.
When using include_time=TRUE
, days with leap seconds will not
be accurately handled as they do not appear to be accurately handled by
Windows (as described in
https://support.microsoft.com/en-us/help/2722715/support-for-the-leap-second).
Other If your input data has a mix of Excel numeric dates and actual dates, - see the more powerful functions `convert_to_date` and - `convert_to_datetime`.: +
Other Date-time cleaning:
convert_to_date()
excel_numeric_to_date(40000)#> [1] "2009-07-06"excel_numeric_to_date(40000.5) # No time is included#> [1] "2009-07-06"excel_numeric_to_date(40000.5, include_time = TRUE) # Time is included#> [1] "2009-07-06 13:00:00 EDT"excel_numeric_to_date(40000.521, include_time = TRUE) # Time is included#> [1] "2009-07-06 13:30:14 EDT"excel_numeric_to_date(40000.521, include_time = TRUE, diff --git a/docs/reference/fisher.test.html b/docs/reference/fisher.test.html index 4cc9026d..9d2823e0 100644 --- a/docs/reference/fisher.test.html +++ b/docs/reference/fisher.test.html @@ -8,11 +8,13 @@@@ -107,7 +108,6 @@ Changelog -Apply stats::fisher.test to a two-way tabyl — fisher.test • janitor + + @@ -36,8 +38,8 @@ - + @@ -45,7 +47,6 @@ - @@ -73,7 +74,7 @@ janitor - 2.0.0 + 2.0.1
This generic function overrides stats::fisher.test. If the passed table is a two-way tabyl, it runs it through janitor::fisher.test.tabyl, otherwise it just calls stats::fisher.test.
-fisher.test(x, ...) @@ -148,7 +146,7 @@- +Apply stats::fisher.test to a two-way tabyl
# S3 method for tabyl fisher.test(x, ...)
if x is a vector, must be another vector or factor of the same length |
The result is the same as the one of stats::fisher.test.
-tab <- tabyl(mtcars, gear, cyl) diff --git a/docs/reference/get_dupes.html b/docs/reference/get_dupes.html index b4669f6e..ab3802c7 100644 --- a/docs/reference/get_dupes.html +++ b/docs/reference/get_dupes.html @@ -8,11 +8,13 @@@@ -105,7 +106,6 @@ ChangelogGet rows of a + + @@ -36,14 +38,13 @@ - + - @@ -71,7 +72,7 @@ janitor - 2.0.0 + 2.0.1data.frame
with identical values for the specified variables. — get_dupes • janitor
data.frame
with identical values for the specifie
For hunting duplicate records during data cleaning. Specify the data.frame and the variable combination to search for duplicates and get back the duplicated rows.
-get_dupes(dat, ...)- +
Unquoted variable names to search for duplicates. This takes a tidyselect specification. |
Returns a data.frame with the full records where the specified variables have duplicated values, as well as a variable dupe_count showing the number of rows sharing that combination of duplicated values. If the input data.frame was of class tbl_df, the output is as well.
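The idea of returning duplicated rows together with a count can be sketched in base R. This is only an illustration under stated assumptions, not janitor's implementation: the `records` data frame and its columns `id`, `visit`, and `score` are hypothetical, and the sketch does not reproduce the column ordering or sorting of `get_dupes()`.

```r
# Hypothetical toy data; `id`, `visit`, and `score` are illustrative names.
records <- data.frame(
  id    = c(1, 1, 2, 3, 3, 3),
  visit = c("a", "a", "a", "b", "b", "c"),
  score = c(10, 11, 12, 13, 14, 15)
)

# Build one key per row from the variables being checked for duplicates.
combo <- paste(records$id, records$visit, sep = "\r")

# Count how many rows share each key, then map the count back to every row.
records$dupe_count <- as.integer(table(combo)[combo])

# Keep the full records whose key occurs more than once.
dupes <- records[records$dupe_count > 1, ]
dupes
```

Here rows 1, 2, 4, and 5 are kept, each with a `dupe_count` of 2.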
get_dupes(mtcars, mpg, hp)#> mpg hp dupe_count cyl disp drat wt qsec vs am gear carb diff --git a/docs/reference/index.html b/docs/reference/index.html index 00d2fcae..0555c822 100644 --- a/docs/reference/index.html +++ b/docs/reference/index.html @@ -8,11 +8,13 @@@@ -101,7 +105,6 @@ ChangelogFunction reference • janitor + + @@ -36,10 +38,12 @@ + + @@ -67,7 +71,7 @@ janitor - 2.0.0 + 2.0.1
janitor has simple little tools for examining and cleaning dirty data.
-These functions have already become defunct or may be defunct as soon as the next release.
-Resulting strings are unique and consist only of the _
character,
-numbers, and letters. By default, the resulting strings will only consist of
-ASCII characters, but non-ASCII (e.g. Unicode) may be allowed by setting
-ascii=FALSE
. Capitalization preferences can be specified using the
-case
parameter.
Resulting strings are unique and consist only of the _
+character, numbers, and letters. By default, the resulting strings will only
+consist of ASCII characters, but non-ASCII (e.g. Unicode) may be allowed by
+setting ascii=FALSE
. Capitalization preferences can be specified
+using the case
parameter.
For use on the names of a data.frame, e.g., in a `%>%`
pipeline,
call the convenience function clean_names
.
When ascii=TRUE
(the default), accented characters are transliterated to
-ASCII. For example, an "o" with a German umlaut over it becomes "o", and the
-Spanish character "enye" becomes "n".
The order of operations is: replace
, (optional) ASCII conversion, removing
-initial spaces and punctuation, apply base::make.names()
, apply
-to_any_case
, and add numeric suffixes to duplicates.
See the documentation for snakecase::to_any_case
` for more about how to control its behavior.
When ascii=TRUE
(the default), accented characters are transliterated
+to ASCII. For example, an "o" with a German umlaut over it becomes "o", and
+the Spanish character "enye" becomes "n".
The order of operations is: replace
, (optional) ASCII conversion,
+removing initial spaces and punctuation, apply base::make.names()
,
+apply to_any_case
, and add numeric suffixes to
+duplicates.
See the documentation for snakecase::to_any_case
for more about how
+to control its behavior.
On some systems, not all transliterators to ASCII are available. If this is
+the case on your system, all available transliterators will be used, and a
+warning will be issued once per session indicating that results may be
+different when run on a different system. That warning can be disabled with
+options(janitor_warn_transliterators=FALSE)
.
If the objective of your call to make_clean_names()
is only to translate to
+ASCII, try the following instead:
+stringi::stri_trans_general(x, id="Any-Latin;Greek-Latin;Latin-ASCII")
.
make_clean_names( @@ -175,7 +193,7 @@- +Cleans a vector of text, typically containing the names of an object.
numerals = "asis", ... )
case | The desired target case (default is |
|
---|---|---|
use_make_names | -Should `make.names()` be applied to ensure that the -output is usable as a name without quoting? (Avoiding `make.names()` + | Should |
Returns the "cleaned" character vector.
-diff --git a/docs/reference/pipe.html b/docs/reference/pipe.html index a7e08d3e..85077719 100644 --- a/docs/reference/pipe.html +++ b/docs/reference/pipe.html @@ -8,11 +8,13 @@@@ -105,7 +106,6 @@ ChangelogPipe operator — %>% • janitor + + @@ -36,14 +38,13 @@ - + - @@ -71,7 +72,7 @@ janitor - 2.0.0 + 2.0.1
Exported from the magrittr package. To learn more, run ?magrittr::`%>%`
.
lhs %>% rhs- + +
mtcars %>% diff --git a/docs/reference/remove_constant.html b/docs/reference/remove_constant.html index 7d5afc04..06f4cf8d 100644 --- a/docs/reference/remove_constant.html +++ b/docs/reference/remove_constant.html @@ -8,11 +8,13 @@@@ -105,7 +106,6 @@ ChangelogRemove constant columns from a data.frame or matrix. — remove_constant • janitor + + @@ -36,14 +38,13 @@ - + - @@ -71,7 +72,7 @@ janitor - 2.0.0 + 2.0.1
Remove constant columns from a data.frame or matrix.
-remove_constant(dat, na.rm = FALSE, quiet = TRUE)- +
remove_empty()
for removing empty
columns or rows.
Other remove functions:
remove_empty()
diff --git a/docs/reference/remove_empty.html b/docs/reference/remove_empty.html index 90fa6cc7..1598a6b6 100644 --- a/docs/reference/remove_empty.html +++ b/docs/reference/remove_empty.html @@ -8,11 +8,13 @@# To find the columns that are constant data.frame(A=1, B=1:3) %>% - dplyr::select_at(setdiff(names(.), names(remove_constant(.)))) %>% + dplyr::select_at(setdiff(names(.), names(remove_constant(.)))) %>% unique()#> A #> 1 1
Removes all rows and/or columns from a data.frame or matrix that are composed entirely of NA values.
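As a rough base R sketch of what "composed entirely of NA" means, the following marks all-NA rows and columns and drops them. The data frame `dat` and its columns are hypothetical, and both masks are computed on the original input, which is an assumption of this sketch rather than a statement about how `remove_empty()` proceeds internally.

```r
# Hypothetical input: row 2 and column `b` contain only NA values.
dat <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, NA),
  c = c("x", NA, "z")
)

# A row is empty when every cell in it is NA; likewise for a column.
empty_rows <- rowSums(is.na(dat)) == ncol(dat)
empty_cols <- colSums(is.na(dat)) == nrow(dat)

# drop = FALSE keeps the result a data.frame even if one column remains.
cleaned <- dat[!empty_rows, !empty_cols, drop = FALSE]
cleaned
```

The result keeps rows 1 and 3 and columns `a` and `c`.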
remove_empty(dat, which = c("rows", "cols"), quiet = TRUE)- +
Returns the object without its missing rows or columns.
-remove_constant()
for removing
constant columns.
Other remove functions:
remove_constant()
# not run: diff --git a/docs/reference/remove_empty_cols.html b/docs/reference/remove_empty_cols.html index 317dddfd..d4cde55e 100644 --- a/docs/reference/remove_empty_cols.html +++ b/docs/reference/remove_empty_cols.html @@ -8,11 +8,13 @@@@ -105,7 +106,6 @@ ChangelogRemoves empty columns from a data.frame. — remove_empty_cols • janitor + + @@ -36,14 +38,13 @@ - + - @@ -71,7 +72,7 @@ janitor - 2.0.0 + 2.0.1
This function is deprecated, use remove_empty("cols") instead.
remove_empty_cols(dat)- +
the input data.frame. |
Returns the data.frame with no empty columns.
-# not run: diff --git a/docs/reference/remove_empty_rows.html b/docs/reference/remove_empty_rows.html index 88f17516..7815ceee 100644 --- a/docs/reference/remove_empty_rows.html +++ b/docs/reference/remove_empty_rows.html @@ -8,11 +8,13 @@@@ -105,7 +106,6 @@ ChangelogRemoves empty rows from a data.frame. — remove_empty_rows • janitor + + @@ -36,14 +38,13 @@ - + - @@ -71,7 +72,7 @@ janitor - 2.0.0 + 2.0.1
This function is deprecated, use remove_empty("rows") instead.
remove_empty_rows(dat)- +
the input data.frame. |
Returns the data.frame with no empty rows.
-# not run: diff --git a/docs/reference/round_half_up.html b/docs/reference/round_half_up.html index bb9747f1..0c1df91e 100644 --- a/docs/reference/round_half_up.html +++ b/docs/reference/round_half_up.html @@ -8,11 +8,13 @@@@ -106,7 +107,6 @@ ChangelogRound a numeric vector; halves will be rounded up, ala Microsoft Excel. — round_half_up • janitor + + @@ -36,15 +38,14 @@ - + - @@ -72,7 +73,7 @@ janitor - 2.0.0 + 2.0.1
In base R round(), halves are rounded to even, e.g., 12.5 and 11.5 are both rounded to 12. This function rounds 12.5 to 13 (assuming digits = 0). Negative halves are rounded away from zero, e.g., -0.5 is rounded to -1.
This may skew subsequent statistical analysis of the data, but may be desirable in certain contexts. This function is implemented exactly from http://stackoverflow.com/a/12688836; see that question and comments for discussion of this issue.
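The difference from base R rounding can be illustrated with a short base R sketch. The helper name `half_up` is hypothetical, and the code is one common way to round halves away from zero; it is not presented as janitor's source.

```r
# Hypothetical helper: round halves away from zero.
half_up <- function(x, digits = 0) {
  sgn <- sign(x)
  # Work on the magnitude, shifted so the target digit is the units digit.
  z <- abs(x) * 10^digits
  # Adding 0.5 and truncating rounds halves up; the small epsilon guards
  # against values stored just below an exact half.
  z <- trunc(z + 0.5 + sqrt(.Machine$double.eps))
  sgn * z / 10^digits
}

round(12.5)       # base R rounds the half to even
half_up(12.5)     # the half is rounded up
half_up(-0.5)     # negative halves move away from zero
half_up(1.125, 2)
```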
-round_half_up(x, digits = 0)- +
how many digits should be displayed after the decimal point? |
round_half_up(12.5)#> [1] 13round_half_up(1.125, 2)#> [1] 1.13round_half_up(1.125, 1)#> [1] 1.1round_half_up(-0.5, 0) # negatives get rounded away from zero#> [1] -1diff --git a/docs/reference/round_to_fraction.html b/docs/reference/round_to_fraction.html index 71059a41..bde0e340 100644 --- a/docs/reference/round_to_fraction.html +++ b/docs/reference/round_to_fraction.html @@ -8,11 +8,13 @@@@ -113,7 +114,6 @@ ChangelogRound to the nearest fraction of a specified denominator. — round_to_fraction • janitor + + @@ -36,8 +38,8 @@ - + @@ -79,7 +80,7 @@ janitor - 2.0.0 + 2.0.1
Round a decimal to the precise decimal value of a specified fractional denominator. Common use cases include addressing floating point imprecision and enforcing that data values fall into a certain set.
@@ -150,11 +149,10 @@Set denominator = 1
to round to whole numbers.
The digits
argument allows for rounding of the subsequent result.
round_to_fraction(x, denominator, digits = Inf)- +
the input x rounded to a decimal value that has an integer numerator relative to denominator (possibly subsequently rounded to a number of decimal digits).
If digits is Inf, x is rounded to the fraction and then kept at full precision. If digits is "auto", the number of digits is automatically selected as ceiling(log10(denominator)) + 1.
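The rounding described here can be sketched in base R as follows. The helper name `to_fraction` is hypothetical, and the sketch uses base `round()`, so its handling of exact halves is an assumption and may differ from `round_to_fraction()`.

```r
# Hypothetical helper: round x to the nearest multiple of 1/denominator.
to_fraction <- function(x, denominator, digits = Inf) {
  # Scale so the wanted fractions become whole numbers, round, scale back.
  out <- round(x * denominator) / denominator
  # "auto" picks the number of digits from the size of the denominator.
  if (identical(digits, "auto")) {
    digits <- ceiling(log10(denominator)) + 1
  }
  # Inf means: keep the fraction at full precision.
  if (is.finite(digits)) {
    out <- round(out, digits)
  }
  out
}

to_fraction(1.6, denominator = 2)
to_fraction(pi, denominator = 7)                    # 22/7
to_fraction(pi, denominator = 7, digits = "auto")   # 2 decimal digits
```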
diff --git a/docs/reference/row_to_names.html b/docs/reference/row_to_names.html index 21f55668..92f5b8e5 100644 --- a/docs/reference/row_to_names.html +++ b/docs/reference/row_to_names.html @@ -8,11 +8,13 @@round_to_fraction(1.6, denominator = 2)#> [1] 1.5round_to_fraction(pi, denominator = 7) # 22/7#> [1] 3.142857#> [1] 8.142857 9.250000#> [1] 8.143 9.250#> [1] 8.1400 9.2500 10.2997
Elevate a row to be the column names of a data.frame.
-row_to_names(dat, row_number, remove_row = TRUE, remove_rows_above = TRUE)- +
A data.frame with new names (and some rows removed, if specified)
-x <- data.frame(X_1 = c(NA, "Title", 1:3), diff --git a/docs/reference/signif_half_up.html b/docs/reference/signif_half_up.html index 600546fd..12c6f020 100644 --- a/docs/reference/signif_half_up.html +++ b/docs/reference/signif_half_up.html @@ -8,11 +8,13 @@@@ -112,7 +113,6 @@ ChangelogRound a numeric vector to the specified number of significant digits; halves will be rounded up. — signif_half_up • janitor + + @@ -36,8 +38,8 @@ - + @@ -78,7 +79,7 @@ janitor - 2.0.0 + 2.0.1
In base R signif()
, halves are rounded to even, e.g.,
signif(11.5, 2)
and signif(12.5, 2)
are both rounded to 12.
This function rounds 12.5 to 13 (assuming digits = 2
). Negative halves
@@ -148,11 +147,10 @@
signif_half_up(x, digits = 6)- +
integer indicating the number of significant digits to be used. |
signif_half_up(12.5, 2)#> [1] 13signif_half_up(1.125, 3)#> [1] 1.13signif_half_up(-2.5, 1) # negatives get rounded away from zero#> [1] -3diff --git a/docs/reference/tabyl.html b/docs/reference/tabyl.html index cfe0a371..866adb30 100644 --- a/docs/reference/tabyl.html +++ b/docs/reference/tabyl.html @@ -8,11 +8,13 @@@@ -107,7 +108,6 @@ ChangelogGenerate a frequency table (1-, 2-, or 3-way). — tabyl • janitor + + @@ -36,8 +38,8 @@ - + @@ -45,7 +47,6 @@ - @@ -73,7 +74,7 @@ janitor - 2.0.0 + 2.0.1
A fully-featured alternative to table()
. Results are data.frames and can be formatted and enhanced with janitor's family of adorn_
functions.
Specify a data.frame and the one, two, or three unquoted column names you want to tabulate. Specifying three variables generates a list of 2-way tabyls, split by the third variable.
Alternatively, you can tabulate a single variable that isn't in a data.frame by calling tabyl
on a vector, e.g., tabyl(mtcars$gear)
.
tabyl(dat, ...) @@ -148,7 +146,7 @@- +Generate a frequency table (1-, 2-, or 3-way).
# S3 method for data.frame tabyl(dat, var1, var2, var3, show_na = TRUE, show_missing_levels = TRUE, ...)
(optional) the column name of the third variable (the list in a 3-way tabulation). |
Returns a data.frame with frequencies and percentages of the tabulated variable(s). A 3-way tabulation returns a list of data.frames.
-diff --git a/docs/reference/top_levels.html b/docs/reference/top_levels.html index 9366f597..6e8af455 100644 --- a/docs/reference/top_levels.html +++ b/docs/reference/top_levels.html @@ -8,11 +8,13 @@@@ -105,7 +106,6 @@ ChangelogGenerate a frequency table of a factor grouped into top-n, bottom-n, and all other levels. — top_levels • janitor + + @@ -36,14 +38,13 @@ - + - @@ -71,7 +72,7 @@ janitor - 2.0.0 + 2.0.1
Get a frequency table of a factor variable, grouped into categories by level.
-top_levels(input_vec, n = 2, show_na = FALSE)- +
should cases where the variable is NA be shown? |
Returns a data.frame (actually a tbl_df) with the frequencies of the grouped, tabulated variable. Includes counts, percentages, and valid percentages (calculated omitting NA values, if present in the vector and show_na = TRUE).
#> as.factor(mtcars$hp) n percent diff --git a/docs/reference/untabyl.html b/docs/reference/untabyl.html index 9d6d6907..1c348746 100644 --- a/docs/reference/untabyl.html +++ b/docs/reference/untabyl.html @@ -8,11 +8,13 @@@@ -105,7 +106,6 @@ ChangelogRemove + + @@ -36,14 +38,13 @@ - + - @@ -71,7 +72,7 @@ janitor - 2.0.0 + 2.0.1tabyl
attributes from a data.frame. — untabyl • janitor
tabyl
attributes from a data.frame.Strips away all tabyl
-related attributes from a data.frame.
untabyl(dat)- +
a data.frame of class |
Returns the same data.frame, but without the tabyl class and attributes.
diff --git a/docs/reference/use_first_valid_of.html b/docs/reference/use_first_valid_of.html index 23cddb93..84cc36e0 100644 --- a/docs/reference/use_first_valid_of.html +++ b/docs/reference/use_first_valid_of.html @@ -8,11 +8,13 @@@@ -105,7 +106,6 @@ ChangelogReturns first non-NA value from a set of vectors. — use_first_valid_of • janitor + + @@ -36,14 +38,13 @@ - + - @@ -71,7 +72,7 @@ janitor - 2.0.0 + 2.0.1
At each position of the input vectors, iterates through in order and returns the first non-NA value. This is a robust replacement for the common ifelse(!is.na(x), x, ifelse(!is.na(y), y, z)). It's more readable and handles problems like ifelse's inability to work with dates in this way.
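A minimal base R sketch of the first-non-NA idea, including the date case, is shown below. The helper name `first_valid` is hypothetical and only illustrates the technique; it is not the package's implementation.

```r
# Hypothetical helper: at each position, take the first non-NA value.
first_valid <- function(...) {
  vecs <- list(...)
  out <- vecs[[1]]
  for (v in vecs[-1]) {
    # Fill only the positions that are still missing.
    still_missing <- is.na(out)
    out[still_missing] <- v[still_missing]
  }
  out
}

x <- as.Date(c("2020-01-01", NA, NA))
y <- as.Date(c("2020-02-02", "2020-02-03", NA))
z <- as.Date(c("2020-03-01", "2020-03-02", "2020-03-03"))

res <- first_valid(x, y, z)
res

# For comparison: ifelse() drops the Date class and returns plain numbers.
class(ifelse(!is.na(x), x, y))
```

Because the sketch fills values by subassignment, the result keeps the class of the first vector, which is why the dates survive.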
use_first_valid_of(..., if_all_NA = NA)- +
what value should be used when all of the vectors return |
Returns a single vector with the selected values.
-Deprecated, do not use in new code. Use dplyr::coalesce()
instead.
Deprecated, do not use in new code. Use dplyr::coalesce()
instead.
janitor_deprecated