Commit

update description of crosstab()
closes #33
sfirke committed Jul 29, 2016
1 parent e494f5d commit 6d67f46
Showing 2 changed files with 59 additions and 53 deletions.
28 changes: 17 additions & 11 deletions vignettes/introduction.Rmd
@@ -46,7 +46,7 @@ names(clean_df) # they are clean
## `tabyl()` - a better version of `table()`
`tabyl()` takes a vector and returns a frequency table, like `table()`. But its additional features are:

+ It returns a data.frame (actually, a `tbl_df`) - for manipulating further, or printing with `knitr::kable()`.
+ It returns a data.frame - for manipulating further, or printing with `knitr::kable()`.
+ It automatically calculates percentages
+ It can (optionally) display `NA` values
+ When `NA` values are present, it will calculate an additional column `valid_percent` in the style of SPSS
@@ -64,36 +64,42 @@ table(x)
## Crosstabulate two variables with `crosstab()`
`crosstab()` generates a crosstab table. There are many R crosstab functions already; this one is distinguished by:

+ It returns a data.frame (actually, a `tbl_df`)
+ It returns a data.frame
+ It is simple.
+ It calculates frequencies by default but can calculate row, column, and table-wise percentages.
+ It can (optionally) display `NA` values
+ It can be called with `%>%` in a pipeline.

It wraps the common pipeline of `group_by %>% summarise %>% mutate %>% spread` from the dplyr and tidyr packages, often used in exploratory analysis.

Usage:
```{r}
y <- c(1, 1, 2, 1, 2)
x <- c("a", "a", "b", "b", NA)
crosstab(x, y)
crosstab(x, y, percent = "row")
```
This gives the same result as the much longer pipeline:

If the variables are in the same data frame, call `crosstab` with the `%>%` pipe:
```{r}
dat <- data.frame(x, y, stringsAsFactors = FALSE)
dat %>%
crosstab(x, y, percent = "row")
```

This function wraps the common pipeline of `group_by %>% summarise %>% mutate %>% spread` from the dplyr and tidyr packages, often used in exploratory analysis. The simple `crosstab` call above produces the same result* as this much longer pipeline:
```{r, message=FALSE, results = "hide"}
library(dplyr) ; library(tidyr)
data_frame(x, y) %>%
group_by(x, y) %>%
tally() %>%
mutate(percent = n / sum(n, na.rm = TRUE)) %>%
select(-n) %>%
spread(y, percent) %>%
spread(y, percent, fill = 0) %>%
ungroup()
```
And is more featured than the base R equivalents:
```{r, results="hide"}
table(x, y)
prop.table(table(x, y), 1)
```
And is more featured than the base R equivalents `table(dat$x, dat$y)` and `prop.table(table(dat$x, dat$y), 1)`.

\**not exactly: the long pipeline returns a `tibble`, while crosstab() returns a `data.frame` that prints fully in the console.*

## Explore records with duplicated values for specific combinations of variables with `get_dupes()`
This is for hunting down and examining duplicate records during data cleaning - usually when there shouldn't be any.
84 changes: 42 additions & 42 deletions vignettes/introduction.md
@@ -1,6 +1,6 @@
Intro to janitor functions
================
2016-07-23
2016-07-28

- [Major functions](#major-functions)
- [Clean data.frame names with `clean_names()`](#clean-data.frame-names-with-clean_names)
@@ -54,7 +54,7 @@ names(clean_df) # they are clean

`tabyl()` takes a vector and returns a frequency table, like `table()`. But its additional features are:

- It returns a data.frame (actually, a `tbl_df`) - for manipulating further, or printing with `knitr::kable()`.
- It returns a data.frame - for manipulating further, or printing with `knitr::kable()`.
- It automatically calculates percentages
- It can (optionally) display `NA` values
- When `NA` values are present, it will calculate an additional column `valid_percent` in the style of SPSS
@@ -63,13 +63,11 @@ names(clean_df) # they are clean
``` r
x <- c("a", "b", "c", "c", NA)
tabyl(x, sort = TRUE)
#> # A tibble: 4 x 4
#> x n percent valid_percent
#> <chr> <int> <dbl> <dbl>
#> 1 c 2 0.4 0.50
#> 2 a 1 0.2 0.25
#> 3 b 1 0.2 0.25
#> 4 <NA> 1 0.2 NA
#> x n percent valid_percent
#> 1 c 2 0.4 0.50
#> 2 a 1 0.2 0.25
#> 3 b 1 0.2 0.25
#> 4 <NA> 1 0.2 NA
```

Compare to:
@@ -86,34 +84,43 @@ Crosstabulate two variables with `crosstab()`

`crosstab()` generates a crosstab table. There are many R crosstab functions already; this one is distinguished by:

- It returns a data.frame (actually, a `tbl_df`)
- It returns a data.frame
- It is simple.
- It calculates frequencies by default but can calculate row, column, and table-wise percentages.
- It can (optionally) display `NA` values
- It can be called with `%>%` in a pipeline.

It wraps the common pipeline of `group_by %>% summarise %>% mutate %>% spread` from the dplyr and tidyr packages, often used in exploratory analysis.
Usage:

``` r
y <- c(1, 1, 2, 1, 2)
x <- c("a", "a", "b", "b", NA)

crosstab(x, y)
#> # A tibble: 3 x 3
#> x 1 2
#> * <chr> <dbl> <dbl>
#> 1 a 2 0
#> 2 b 1 1
#> 3 <NA> 0 1
#> x 1 2
#> 1 a 2 0
#> 2 b 1 1
#> 3 <NA> 0 1
crosstab(x, y, percent = "row")
#> # A tibble: 3 x 3
#> x 1 2
#> * <chr> <dbl> <dbl>
#> 1 a 1.0 0.0
#> 2 b 0.5 0.5
#> 3 <NA> 0.0 1.0
#> x 1 2
#> 1 a 1.0 0.0
#> 2 b 0.5 0.5
#> 3 <NA> 0.0 1.0
```

This gives the same result as the much longer pipeline:
If the variables are in the same data frame, call `crosstab` with the `%>%` pipe:

``` r
dat <- data.frame(x, y, stringsAsFactors = FALSE)
dat %>%
crosstab(x, y, percent = "row")
#> x 1 2
#> 1 a 1.0 0.0
#> 2 b 0.5 0.5
#> 3 <NA> 0.0 1.0
```

This function wraps the common pipeline of `group_by %>% summarise %>% mutate %>% spread` from the dplyr and tidyr packages, often used in exploratory analysis. The simple `crosstab` call above produces the same result\* as this much longer pipeline:

``` r
library(dplyr) ; library(tidyr)
@@ -122,16 +129,13 @@ data_frame(x, y) %>%
tally() %>%
mutate(percent = n / sum(n, na.rm = TRUE)) %>%
select(-n) %>%
spread(y, percent) %>%
spread(y, percent, fill = 0) %>%
ungroup()
```

And is more featured than the base R equivalents:
And is more featured than the base R equivalents `table(dat$x, dat$y)` and `prop.table(table(dat$x, dat$y), 1)`.

``` r
table(x, y)
prop.table(table(x, y), 1)
```
\**not exactly: the long pipeline returns a `tibble`, while crosstab() returns a `data.frame` that prints fully in the console.*

Explore records with duplicated values for specific combinations of variables with `get_dupes()`
------------------------------------------------------------------------------------------------
@@ -218,19 +222,15 @@ Originally designed for use with Likert survey data stored as factors. Returns a
f <- factor(c("strongly agree", "agree", "neutral", "neutral", "disagree", "strongly agree"),
levels = c("strongly agree", "agree", "neutral", "disagree", "strongly disagree"))
top_levels(f)
#> # A tibble: 3 x 3
#> f n percent
#> <fctr> <int> <dbl>
#> 1 strongly agree, agree 3 0.5000000
#> 2 neutral 2 0.3333333
#> 3 disagree, strongly disagree 1 0.1666667
#> f n percent
#> 1 strongly agree, agree 3 0.5000000
#> 2 neutral 2 0.3333333
#> 3 disagree, strongly disagree 1 0.1666667
top_levels(f, n = 1, sort = TRUE)
#> # A tibble: 3 x 3
#> f n percent
#> <fctr> <int> <dbl>
#> 1 agree, neutral, disagree 4 0.6666667
#> 2 strongly agree 2 0.3333333
#> 3 strongly disagree NA NA
#> f n percent
#> 1 agree, neutral, disagree 4 0.6666667
#> 2 strongly agree 2 0.3333333
#> 3 strongly disagree NA NA
```

`remove_empty_cols()` and `remove_empty_rows()`
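
For readers without janitor installed, the base R equivalents that the revised vignette text mentions can be run directly. This is a minimal sketch and not part of the commit: it reuses the `x` and `y` vectors from the vignette, and `useNA = "ifany"` is an addition made here so that the `NA` row appears, as it does in the `crosstab()` output shown in the diff.

```r
# Same example vectors as in the vignette
y <- c(1, 1, 2, 1, 2)
x <- c("a", "a", "b", "b", NA)

# Frequency crosstab; useNA = "ifany" (an addition in this sketch, not in the
# vignette) keeps NA values of x as their own row instead of dropping them
counts <- table(x, y, useNA = "ifany")

# Row-wise proportions: each cell is divided by the total of its row
row_pct <- prop.table(counts, 1)

print(counts)
print(row_pct)
```

Both results are `table` objects, not data.frames; returning a data.frame is one of the features the vignette lists for `crosstab()`.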
