directly: UTF8 -> ASCII #269
Comments
BTW, on Windows:
> stringi::stri_trans_general("œ", "Latin-ASCII")
[1] "\u009c"
on Ubuntu:
> stringi::stri_trans_general("œ", "Latin-ASCII")
[1] "oe" |
Yes, |
Hmmmm
|
What about your |
> charToRaw("œ")
[1] 9c
> Encoding("œ")
[1] "latin1"
> Sys.getlocale()
[1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252" |
OK, and what about |
and |
I guess stringi should handle R's "latin1" on Windows as Windows-1252 (https://en.wikipedia.org/wiki/Windows-1252) |
Works so far:
> stri_trans_general(stri_encode("œ", from="WINDOWS-1252", to="UTF-8"), "Latin-ASCII")
[1] "oe" |
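To illustrate why the workaround helps, here is a minimal sketch, assuming stringi is installed and that ICU accepts the encoding names "windows-1252" and "ISO-8859-1". The byte 0x9c (what charToRaw("œ") returned in the Windows session above) decodes to œ (U+0153) under Windows-1252, but to a C1 control character (U+009C) under ISO-8859-1, which matches the "\u009c" result reported on Windows.

```r
library(stringi)

# The single byte reported by charToRaw("œ") in the Windows session above.
x <- rawToChar(as.raw(0x9c))

# Decode the same byte under two different assumptions about its encoding.
as_cp1252 <- stri_encode(x, from = "windows-1252", to = "UTF-8")
as_latin1 <- stri_encode(x, from = "ISO-8859-1", to = "UTF-8")

# Inspect the resulting code points (as integers).
stri_enc_toutf32(as_cp1252)[[1]]
stri_enc_toutf32(as_latin1)[[1]]

# Only the Windows-1252 reading gives Latin-ASCII a letter to transliterate.
stri_trans_general(as_cp1252, "Latin-ASCII")
```

The variable names are illustrative only; the point is that the input encoding has to be identified correctly before any transliteration is attempted.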
Cool. I guess this should be the default conversion on Windows instead of |
Thanks! I will try it, when it's ready :) |
Hi! Are you able to compile the latest git version of stringi on Windows? Could you please check whether it works now? |
I got -- before:
after:
|
Doesn't seem to have fixed it for me:
|
@sfirke your |
|
Can't reproduce that on my Windows 7 :(
Could you please give |
I'm running R 3.4.0 but also tried on R 3.3.3 and get the same result. In case more info helps:
|
Please let me know, if you still need any testers. |
I failed to identify where the problem is (so far). :( |
I can get the oe character and the euro symbol to convert if I insert
|
I can get more Windows users to test certain commands if helpful in learning more. |
Thanks for all these. What about |
Maybe one of you could help me hack on that in C++? The relevant files are stri_container_utf8.cpp and stri_container_utf16.cpp. In lines 84 and 102, respectively, a WINDOWS-1252 codec is opened. I wonder if Moreover, lines 121 and ~148, respectively, are actually responsible for using the opened converters. |
Here's
As to the C++ debugging, I have never done anything in C++ ... but your guidance seems like enough of a starting point that I will take a look and see what I can find out! |
I can no longer reproduce this error, which is great, though I'm sorry to have wasted your time. As best I can tell, I had reloaded the patched version of stringi, not realizing that the changes to C++ wouldn't take until I restarted R. (This came to me after I added your print statement suggested above - even when I deleted + installed the package to clear the print statements out, I was still getting the "I'M HERE", indicating that the old C++ was getting called). I think this can be closed. Thanks for working through this with me, Marek, and your great work on this package! |
Heh :) @Tazinho, can you confirm? If so, please close the issue. |
Yes! Here is the complete output. Thanks a lot :)
> # install.packages("devtools")
> # devtools::install_github("gagolews/stringi")
> library(stringi)
> devtools::session_info()
Session info ------------------------------------------------------------------------
setting value
version R version 3.3.2 (2016-10-31)
system x86_64, mingw32
ui RStudio (1.0.143)
language (EN)
collate German_Germany.1252
tz Europe/Berlin
date 2017-06-23
Packages ----------------------------------------------------------------------------
package * version date source
base * 3.3.2 2016-10-31 local
datasets * 3.3.2 2016-10-31 local
devtools 1.13.2 2017-06-02 CRAN (R 3.3.3)
digest 0.6.12 2017-01-27 CRAN (R 3.3.3)
graphics * 3.3.2 2016-10-31 local
grDevices * 3.3.2 2016-10-31 local
memoise 1.1.0 2017-04-21 CRAN (R 3.3.3)
methods * 3.3.2 2016-10-31 local
packrat 0.4.8-12 2017-06-23 Github (rstudio/packrat@b55feb6)
rstudioapi 0.6 2016-06-27 CRAN (R 3.3.3)
stats * 3.3.2 2016-10-31 local
stringi * 1.1.6 2017-06-23 Github (gagolews/stringi@dc94c5a)
tools 3.3.2 2016-10-31 local
utils * 3.3.2 2016-10-31 local
withr 1.0.2 2016-06-20 CRAN (R 3.3.3)
> charToRaw("œ")
[1] 9c
> Encoding("œ")
[1] "latin1"
> Sys.getlocale()
[1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"
> stringi::stri_trans_general("\u0153", "Latin-ASCII")
[1] "oe"
> stringi::stri_trans_general(stri_encode("œ", from="WINDOWS-1252", to="UTF-8"), "Latin-ASCII")
[1] "oe"
> stri_trans_general("œ", "Latin-ASCII")
[1] "oe" |
@gagolews Are you aware of any way to transliterate ö, ü, ä, ß to oe, ue, ae and ss, as mentioned in the first post? I implemented this as a lookup in the snakecase package, but only for these German cases. It would be interesting to know whether there is a more general way that covers all letters that belong to only one specific language's alphabet and have an obvious transliteration. Of course, it would also be nice if this could be an option in |
At least ß works:
|
I guess ICU does not support this... |
Sorry to necro this thread, but I stumbled across a problem similar to the one @Tazinho had and only just now realized that it's possible to chain the transformation identifiers together:
stringi::stri_trans_general(" ö, ü, ä, ß ", "de-ASCII; Latin-ASCII")
#> [1] " oe, ue, ae, ss "
Created on 2019-05-02 by the reprex package (v0.2.1) |
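The chaining idea can be wrapped in a small helper. This is only a sketch: `to_ascii` is a hypothetical name, not part of stringi, and the default chain "de-ASCII; Latin-ASCII" is simply the one from the comment above. It applies the transforms and warns if any non-ASCII character survives.

```r
library(stringi)

# Hypothetical helper (not part of stringi): apply a chain of ICU transforms
# and warn if anything non-ASCII is left afterwards.
to_ascii <- function(x, id = "de-ASCII; Latin-ASCII") {
  out <- stri_trans_general(x, id)
  # Missing values are passed through and not counted as leftovers.
  leftover <- !is.na(out) & !stri_enc_isascii(out)
  if (any(leftover)) {
    warning("non-ASCII characters remain in ", sum(leftover), " element(s)")
  }
  out
}

# \u escapes keep the script itself ASCII-only, which sidesteps the
# native-encoding problems discussed earlier in this thread.
to_ascii("\u00f6, \u00fc, \u00e4, \u00df")
```

Putting the language-specific transform first matters: once Latin-ASCII has already turned ö into a plain o, there is nothing left for de-ASCII to expand.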
Is there a way to go directly from UTF-8 to ASCII?
So far I only see the option to combine transforms from stringi::stri_trans_list(), going from some national encoding to Latin and then to ASCII via stringi::stri_trans_general(). However, I am looking for something like uni2ascii, mentioned here on SO:
http://stackoverflow.com/questions/17517319/r-replacing-foreign-characters-in-a-string
This relates to Tazinho/snakecase#36.
It would also be nice to have a "more meaningful", language-dependent conversion, such as the German ö -> oe.
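For anyone exploring which transforms are available, a minimal sketch, assuming stringi is installed: list the identifiers known to the bundled ICU and filter for those that mention ASCII. The exact set depends on the ICU version, so no output is shown here.

```r
library(stringi)

# All transform identifiers known to this ICU build.
ids <- stri_trans_list()

# Keep only identifiers that mention ASCII; candidates for an ASCII-only target.
ascii_ids <- grep("ASCII", ids, value = TRUE)
ascii_ids
```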