Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adorn_crosstab's helper functions always skip the first column #57

Closed
sfirke opened this issue Aug 28, 2016 · 9 comments
Closed

adorn_crosstab's helper functions always skip the first column #57

sfirke opened this issue Aug 28, 2016 · 9 comments
Assignees

Comments

@sfirke
Copy link
Owner

sfirke commented Aug 28, 2016

add_totals_row, add_totals_col, and ns_to_percents all have it hard-coded that the first column is a character vector to be ignored. Fine when they were non-exported helper functions for adorn_crosstab, but if exported, this should be more flexible. There could be multiple non-numeric columns, in any order, or none.

Default to skipping an arbitrarily-wide number of initial non-numeric columns? Or detect non-numeric columns, and exclude (for summing) or leave as is (for in-place conversion to percentages)?

@sfirke
Copy link
Owner Author

sfirke commented Jan 31, 2017

I guess the latter? Here's a good use case for this:
http://stackoverflow.com/questions/41949529/frequency-table-and-group-by-multiple-variables-in-r

sfirke added a commit that referenced this issue Feb 2, 2017
@sfirke
Copy link
Owner Author

sfirke commented Feb 3, 2017

I kind of went with a hybrid. add_totals_col is simple, you just skip the non-numeric columns.

For the other two, add_totals_row and ns_to_percents, the first column always gets left alone. The result of say mtcars %>% crosstab(cyl, carb) will have a numeric first column. And I could make crosstab results have a character-class first column, but many users will be coming to this from dplyr::count + tidyr::spread and would still be in that pickle of their numeric 1st column getting summed when in practice it's not truly a numeric value anymore. And I don't like coercing the 1st column of the input data.frame to a character, since then the result can't be filtered for say am > 0.

So, first column always gets skipped and the word "Total" placed at the bottom.

But other non-numeric columns can occur in any position, mixed in among numeric vectors. This could be nice, say if a data source has a right-most text column but you still want a totals row.

Oh and if there are those additional non-numeric columns, and you call add_totals_row, there's a fill argument to specify the character string that goes in those bottom row cells.

@rgknight what do you think about that approach for the three functions, in particular add_totals_row and ns_to_percents which are more complicated? You can check out the "improve_adorn_helpers" branch, it is working. Try:

dat <- structure(list(a = c("hi", "lo"), b = c(1, 2), c = c(5, 10), 
    d = c("big", "small"), e = c(20, NA)), .Names = c("a", "b", 
"c", "d", "e"), row.names = c(NA, -2L), class = "data.frame")

to play around with.

Then if you are good with it, I'll write tests and merge it in pending Travis/codecov. But I didn't want to write tests yet in case you had other thoughts re: how these should work.

@sfirke sfirke self-assigned this Feb 3, 2017
@rgknight
Copy link
Collaborator

rgknight commented Feb 6, 2017

For version 0.X:

I think ns_to_percents may not be working yet. The approach to add looks good.

For version 1.0:

I think there are two potential approaches that will work well here. One is to make everything an option instead of a helper function, and the other is to create a short prefix for all tabyl modifiers so that you know they are intended to modify a tabyl (assuming that crosstab and tabyl are eventually merged.

These would both mean moving away from making these fully-featured export functions. I think you will get into trouble if you start trying to support use cases for these functions other than modifying a tabyl. I keep going back to Hadley's idea that tidyr does less than reshape2 -- I think you will be more successful if you focus on making tabyl really good pivot table alternative.

If these were options, I would use:

tabyl(df, rows, cols, split, values = c("N", "row_percent", "col_percent"), totals = c("row", "col"))

Then row_percent and col_percent would be like adding ns_to_percents with the respective denom, and totals = "row" would give you the same thing as adding %>% adorn_total_row.

(Random thought: maybe we eliminate the split and just split on grouped vars when df is a grouped_df. Might be a bad idea though).

The second approach would be to make these all start with something like adorn_ or fmt_ (for format) so that users know that all of these things are intended to modify a tabyl. Then you would do

mtcars %>%
  tabyl(am, cycle) %>%
  adorn_values(row_percent) %>%
  adorn_totals(row)

Or something like that, which would let us add multiple values under adorn_values in the future. Looking at both, I'm leaning toward the adorn_ approach.

@sfirke
Copy link
Owner Author

sfirke commented Mar 17, 2017

Yeah I like the adorn_ approach better. I've had people using the add_totals_ functions with non-tabyl outputs, which would be lost if we made it an option to tabyl. Also a tabyl in 1.0 that subsumes crosstab will already be a more complex function, and starts to exceed the tidyverse principle of bite-sized simple functions.

I never loved "adorn" but at least that word is short and isn't used widely - and I couldn't think of a better one.

Well maybe I should take this to the 1.0 version on the "improve_adorn_helpers" branch, make both of the add_totals functions into a single function adorn_totals or adorn_add_totals (?) that takes "col" "row" or "both". I think ns_to_percents() isn't as simple a case so I'll just improve it slightly as stated above, for now.

@sfirke
Copy link
Owner Author

sfirke commented Mar 17, 2017

For consistency, should remove_empty_rows and remove_empty_cols be a single function too? remove_empty("cols") and it defaults to "both"? Dunno, "remove_empty()" isn't fully self-descriptive. remove_empty_lines()? Line is kind of synonymous with "row" though.

@sfirke
Copy link
Owner Author

sfirke commented Mar 17, 2017

@rgknight I'm grateful for your vision for 1.0 and a cleaner API, at the expense of some small helper functions or trying to do everything. The more I think about it, the more I am convinced it's the way to go.

@rgknight
Copy link
Collaborator

:-) Very glad that you have found this helpful. Huge props to you for being open to the feedback. It's not at all easy to open up your vision in this way!

I really like remove_empty() with it defaulting to both rows and columns, similar to how distinct() works now in dplyr.

@sfirke
Copy link
Owner Author

sfirke commented Mar 19, 2017

Whew I got the two add_totals functions merged into a single function and passing tests. The remaining things to do:

  • Decide what to call it. If adorn_totals, it clashes a little semantically with adorn_crosstab because in the latter, it's the crosstab being adorned.
  • Deprecate the old functions, add_totals_row_ and _col
  • Fill in a few unit tests that have long been missing, like whether na.rm flag works. Skeletons already exist for them.

@rgknight thoughts on keeping it as add_totals vs. switching to adorn_totals? You can add totals already through adorn_crosstab so I don't know how helpful it is to use the same word to draw associations between the two.

@sfirke
Copy link
Owner Author

sfirke commented Mar 22, 2017

I'll just go with add_totals for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants