offer other case options as an argument in clean_names #96
Comments
For the record, I only use snake case but I can imagine other people using |
Just some notes on that. It is super easy to get Pascal and camel case once you, @sfirke, have converted to clean snake_case. So it might still be useful to build on the already available clean_names() function, which has all the tests for valid data.frame names. Of course you can wrap snakecase, just let me know any requirements. Also note that snakecase currently imports purrr, stringr and magrittr pipes, but I could translate it to a base R version if this is an issue. In the long run I think it would be best if the functionality were implemented in stringr/stringi. However, at the moment it is not, which is, as far as I know, because these packages aim for more generality and it is not always obvious what the snake case result should look like. |
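To illustrate the point about getting camel and Pascal case from names that are already clean snake_case, here is a minimal base R sketch. The helper name snake_to_camel and its pascal argument are made up for this example; they are not part of janitor or snakecase, and snakecase handles many cases this does not.

```r
# Hypothetical helper (not part of janitor or snakecase): convert names that
# are already clean snake_case into lowerCamelCase or PascalCase, base R only.
snake_to_camel <- function(x, pascal = FALSE) {
  # Upper-case the letter or digit that follows each underscore and drop
  # the underscore itself (digits are unaffected by upper-casing).
  out <- gsub("_([a-z0-9])", "\\U\\1", x, perl = TRUE)
  if (pascal) {
    # PascalCase additionally capitalises the first character.
    out <- sub("^([a-z])", "\\U\\1", out, perl = TRUE)
  }
  out
}

snake_to_camel(c("hi_there", "leading_spaces"))
snake_to_camel(c("hi_there", "leading_spaces"), pascal = TRUE)
```

Note that dropping underscores can make previously distinct names collide, so the duplicate check would probably have to run after the case conversion.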
Just wanted to let you know that snakecase is on cran now and it is planned to support a lot of stuff that might be interesting for you. See the issues for that or this talk: |
Nice work! Since I have a special character in my name I like the related issue 😀 |
@Tazinho congrats on the package being on CRAN! I have a data source at work that gives me camelCase and it looks like I think this would be nice to offer camelCase and PascalCase as options to I like minimizing dependencies to the extent possible, but I don't think it'll be a dealbreaker here. |
I will think about resolving stringr to stringi. In general the stringi/stringr dependency seems to be a really good thing. Tazinho/snakecase#46 |
I just finished a bigger update on the dev version of the snakecase package, including a change to the #120 (transliteration of special characters). You can now do the following:
devtools::install_github("Tazinho/snakecase")
library(snakecase)
to_any_case("àngst häschen", case = "all_caps",
            replace_special_characters = c("german", "Latin-ASCII"),
            postprocess = "-")
[1] "ANGST-HAESCHEN"
So you can supply a string of combinations of entries from the I was also thinking about the dependencies stringr and stringi. The problem is, that I only need Pls let me know your thoughts on that. |
Hi, since snakecase is ready for CRAN (when stringi is updated), I thought it is a good time for this issue and just played around a bit with your testcases. I tried to give you the same parsing (default) and case options from the snakecase pkg and a good default transliteration. At the same time I wanted to give you backward compatibility. I think
Basically this looks like this:
### ____________________________________________________________________________
### LIBRARIES
> # devtools::install_github("Tazinho/snakecase")
> # library(snakecase)
> # library(magrittr)
### ____________________________________________________________________________
### OLD VERSION
> clean_names2 <- function(old_names){
+ new_names <- old_names %>%
+ gsub("'", "", .) %>% # remove quotation marks
+ gsub("\"", "", .) %>% # remove quotation marks
+ gsub("%", "percent", .) %>%
+ gsub("^[ ]+", "", .) %>%
+ make.names(.) %>%
+ gsub("[.]+", "_", .) %>% # convert 1+ periods to single _
+ gsub("[_]+", "_", .) %>% # fix rare cases of multiple consecutive underscores
+ tolower(.) %>%
+ gsub("_$", "", .) %>% # remove string-final underscores
+ stringi::stri_trans_general("latin-ascii")
+
+ # Handle duplicated names - they mess up dplyr pipelines
+ # This appends the column number to repeated instances of duplicate variable names
+ dupe_count <- sapply(1:length(new_names), function(i) {
+ sum(new_names[i] == new_names[1:i]) })
+
+ new_names[dupe_count > 1] <- paste(new_names[dupe_count > 1],
+ dupe_count[dupe_count > 1],
+ sep = "_")
+ new_names
+ }
### ____________________________________________________________________________
### NEW VERSION
> clean_names3 <- function(old_names, case = "snake"){
+ new_names <- old_names %>%
+ gsub("'", "", .) %>% # remove quotation marks
+ gsub("\"", "", .) %>% # remove quotation marks
+ gsub("%", "percent", .) %>%
+ gsub("^[ ]+", "", .) %>%
+ make.names(.) %>%
+ # gsub("[.]+", "_", .) %>% # convert 1+ periods to single _
+ # gsub("[_]+", "_", .) %>% # fix rare cases of multiple consecutive underscores
+ # tolower(.) %>%
+ # gsub("_$", "", .) %>% # remove string-final underscores
+ # stringi::stri_trans_general("latin-ascii")
+ to_any_case(case = case, preprocess = "\\.",
+ replace_special_characters = c("german", "Latin-ASCII"))
+
+ # Handle duplicated names - they mess up dplyr pipelines
+ # This appends the column number to repeated instances of duplicate variable names
+ dupe_count <- sapply(1:length(new_names), function(i) {
+ sum(new_names[i] == new_names[1:i]) })
+
+ new_names[dupe_count > 1] <- paste(new_names[dupe_count > 1],
+ dupe_count[dupe_count > 1],
+ sep = "_")
+ new_names
+ }
### ____________________________________________________________________________
### TEST CASES
> string <- c("sp ace", "repeated", "a**#@", "%", "#", "!",
+ "d(!)9", "REPEATED", "can\"'t", "hi_`there`", " leading spaces",
+ "€", "ação", "farœ", "r.stüdio:v.1.0.143")
### ____________________________________________________________________________
### COMPARISON
> clean_names2(string)
[1] "sp_ace" "repeated" "a"
[4] "percent" "x" "x_2"
[7] "d_9" "repeated_2" "cant"
[10] "hi_there" "leading_spaces" "x_3"
[13] "acao" "faroe" "r_studio_v_1_0_143"
> clean_names3(string)
[1] "sp_ace" "repeated" "a"
[4] "percent" "x" "x_2"
[7] "d_9" "repeated_2" "cant"
[10] "hi_there" "leading_spaces" "x_3"
[13] "acao" "faroe" "r_stuedio_v_1_0_143"
> clean_names3(string, case = "all_caps")
[1] "SP_ACE" "REPEATED" "A"
[4] "PERCENT" "X" "X_2"
[7] "D_9" "REPEATED_2" "CANT"
[10] "HI_THERE" "LEADING_SPACES" "X_3"
[13] "ACAO" "FAROE" "R_STUEDIO_V_1_0_143"
Further I would add "_" around "percent" during the replacement and switch from sapply to vapply. I think that's basically what can be done. Let me know what you think. If you like it, I will add a pull request within the next few days after the next CRAN update. Best, |
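The suggested switch from sapply to vapply could look roughly like the sketch below. It pulls the duplicate handling from the functions above into a standalone helper; the name dedupe_names is hypothetical and only used for illustration. It assumes the names contain no NA values.

```r
# Hypothetical helper name; mirrors the duplicate handling used in
# clean_names2() and clean_names3() above, but with vapply and seq_along.
dedupe_names <- function(new_names) {
  # For each position i, count how often new_names[i] has occurred
  # up to and including position i.
  dupe_count <- vapply(seq_along(new_names), function(i) {
    sum(new_names[i] == new_names[seq_len(i)])
  }, FUN.VALUE = integer(1))

  # Append the running count to the second and later occurrences.
  repeated <- dupe_count > 1
  new_names[repeated] <- paste(new_names[repeated], dupe_count[repeated], sep = "_")
  new_names
}

dedupe_names(c("x", "repeated", "x", "repeated", "x"))
```

vapply always returns an integer vector here, even for zero-length input, and seq_along avoids the 1:0 problem that 1:length(new_names) has when there are no names.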
Thanks Malte! I'm tied up at the moment but am excited to dig into this. I'll take a thorough look through and reply within the next week. |
Sounds good, thanks :) |
Err I am late w/ this but will play around with this tonight. And I think you're right about |
@Tazinho I played around with I do wonder about this line: I'm not sure what my preference is: skip this transformation? Allow users to pass through their own lists? Something else? But wanted to note this as one part I'm unsure about. |
This sounds super good. Thanks also for taking a deeper look into the package. I will try to handle all issues by next week at the latest. About I'd also like to mention that there are some Scandinavian letters which are transliterated by "Latin-ASCII" to two characters (so word length can change anyway). So as a German I can say that ü, ä, ö and ß are really annoying. I think the native transliteration to ue, ae, oe and ss is definitely clear. These characters are to my best knowledge also only in the German alphabet. So transliteration of these cases is always obvious, besides maybe in abbreviations and that ß is always small, which is a general problem for parsing. One problem is that I am currently doing this only for German characters. So if you like the idea in general, there might be more transliteration lists in the future, which you might wanna add. I would suggest that I implement for now only The list thing is also a good idea and I might switch someday from character to list, but your users won't notice it. Only if you give them the |
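For readers unfamiliar with the German convention described above, here is a small base R sketch of that mapping. It is only an illustration under my own assumptions: the names german_map and transliterate_german are invented, and this is not how snakecase implements its "german" option. A general step such as stringi::stri_trans_general(x, "Latin-ASCII") would still be needed afterwards for other characters.

```r
# Illustrative mapping only; names and structure are assumptions, not the
# snakecase implementation. Unicode escapes keep the source file ASCII.
german_map <- setNames(
  c("ae", "oe", "ue", "Ae", "Oe", "Ue", "ss"),
  c("\u00e4", "\u00f6", "\u00fc", "\u00c4", "\u00d6", "\u00dc", "\u00df")
)

transliterate_german <- function(x) {
  # Replace each special character literally (fixed = TRUE, no regex).
  for (ch in names(german_map)) {
    x <- gsub(ch, german_map[[ch]], x, fixed = TRUE)
  }
  x
}

transliterate_german(c("h\u00e4schen", "stra\u00dfe", "r.st\u00fcdio"))
```

Running the German mapping before a generic Latin-ASCII step matters, because the generic step would otherwise turn ü into u rather than ue.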
Came here to cheerfully report that my long-running woes of Windows tests failing b/c of encoding in the My reward for revisiting this old issue is that I got to see this dinosaur GIF I'd missed the first time ^^^ 😀 |
Suggested by @masalmon on Twitter. I only use snake_case, but it would be more inclusive of other coding styles to offer camelCase or maybe PascalCase. Perhaps wrap @Tazinho's snakecase package?