directly: UTF8 -> ASCII #269

Closed
Tazinho opened this issue May 22, 2017 · 32 comments

Comments

@Tazinho

Tazinho commented May 22, 2017

Is there a way to go directly from UTF-8 to ASCII?
So far I only see the option of combining transforms from
stringi::stri_trans_list()
to go from some national script to Latin and then to ASCII via stringi::stri_trans_general().

However, I am looking for something like uni2ascii, mentioned here on SO:
http://stackoverflow.com/questions/17517319/r-replacing-foreign-characters-in-a-string

Regarding this issue: Tazinho/snakecase#36
It would also be nice to have a "more meaningful", language-dependent conversion, such as the German ö -> oe.

@Tazinho
Author

Tazinho commented May 22, 2017

BTW:
I would have been fine with my first approach to this via "Latin-ASCII",
which, however, gives different results depending on the system:

on windows:

> stringi::stri_trans_general("œ", "Latin-ASCII")
[1] "\u009c"

on ubuntu:

> stringi::stri_trans_general("œ", "Latin-ASCII")
[1] "oe"

@gagolews
Owner

Yes, Latin-ASCII seems to be a nice solution. The problem on Windows is most likely due to the fact that you do not input œ as UTF-8 but via R-flavored latin-1. This is not official Latin-1: according to https://en.wikipedia.org/wiki/ISO/IEC_8859-1, that encoding does not support such a character.
Try calling enc2utf8 on the first argument.

@Tazinho
Author

Tazinho commented May 23, 2017

Hmmmm
still on windows:

stringi::stri_trans_general(enc2utf8("œ"), "Latin-ASCII")
[1] "\u009c"
enc2utf8("œ")
[1] "\u009c"

@gagolews
Owner

gagolews commented May 23, 2017

What about your charToRaw("œ"), Encoding("œ"), and Sys.getlocale()?

@Tazinho
Author

Tazinho commented May 23, 2017

> charToRaw("œ")
[1] 9c
> Encoding("œ")
[1] "latin1"
> Sys.getlocale()
[1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"

@gagolews
Owner

OK, and what about stringi::stri_trans_general("\u0153", "Latin-ASCII")?

@gagolews
Owner

and stringi::stri_trans_general(stri_encode("œ", from="WINDOWS-1252", to="UTF-8"), "Latin-ASCII")

@gagolews
Owner

I guess stringi should handle R's "latin1" on Windows as Windows-1252 (https://en.wikipedia.org/wiki/Windows-1252)

@Tazinho
Author

Tazinho commented May 23, 2017

Works so far:
Windows

> stri_trans_general(stri_encode("œ", from="WINDOWS-1252", to="UTF-8"), "Latin-ASCII")
[1] "oe"

@gagolews
Owner

gagolews commented May 23, 2017

Cool. I guess this should be the default conversion on Windows instead of stri_encode("œ", from="ISO-8859-1", to="UTF-8") (which is done internally). I opened an issue for that; as soon as it's fixed and on CRAN, this will work in a platform-independent manner.
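
To make the difference between the two conversions concrete, here is a minimal sketch, assuming stringi is installed. It builds the byte 0x9C from a raw vector, so it does not depend on how the console or source file encodes a literal œ. The variable names are illustrative only and not part of stringi.

```r
library(stringi)

# The single byte 0x9C, which is what charToRaw("œ") reported above.
x <- as.raw(0x9c)

# Decode the same byte under the two candidate source encodings.
as_latin1 <- stri_encode(x, from = "ISO-8859-1", to = "UTF-8")
as_cp1252 <- stri_encode(x, from = "WINDOWS-1252", to = "UTF-8")

utf8ToInt(as_latin1)  # 156, i.e. U+009C (a C1 control character)
utf8ToInt(as_cp1252)  # 339, i.e. U+0153 (the oe ligature)

# Only the Windows-1252 reading gives Latin-ASCII something to transliterate.
stri_trans_general(as_cp1252, "Latin-ASCII")
```

Under ISO-8859-1 the byte decodes to a control character, which would explain the "\u009c" output seen earlier in this thread. Under Windows-1252 it decodes to U+0153, which Latin-ASCII turns into "oe".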

@Tazinho
Author

Tazinho commented May 23, 2017

Thanks! I will try it, when it's ready :)

@gagolews
Owner

Hi! Are you able to compile the latest git version of stringi on Windows? Could you please check whether it works now?

@gagolews
Owner

I got -- before:

> Sys.setlocale(locale="German_Germany.1252")
[1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"
> Encoding("\x80")
[1] "latin1"
> stringi::stri_trans_general("\x80", "latin-ascii")
[1] "\u0080"

after:

> Sys.setlocale(locale="German_Germany.1252")
[1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"
> Encoding("\x80")
[1] "latin1"
> stringi::stri_trans_general("\x80", "latin-ascii")
[1] "€"

@sfirke

sfirke commented May 25, 2017

Doesn't seem to have fixed it for me:

> stringi::stri_trans_general("œ", "Latin-ASCII")
[1] "\u009c"
> stringi::stri_trans_general("\x80", "latin-ascii")
[1] "\u0080"

session_info() says stringi * 1.1.6 2017-05-25 Github (gagolews/stringi@27b47f4) is my current version of stringi. I'm running Windows 10

@gagolews
Owner

@sfirke your Sys.getlocale() and Encoding("\x80") and stri_enc_mark("\x80")?

@sfirke

sfirke commented May 25, 2017

> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
> Encoding("\x80")
[1] "latin1"
> stri_enc_mark("\x80")
[1] "latin1"

@gagolews
Owner

Can't reproduce that on my Windows 7 :(

> Sys.setlocale("LC_COLLATE", "English_United States.1252")
[1] "English_United States.1252"
> Sys.setlocale("LC_CTYPE", "English_United States.1252")
[1] "English_United States.1252"
> Sys.setlocale("LC_MONETARY", "English_United States.1252")
[1] "English_United States.1252"
> Sys.setlocale("LC_TIME", "English_United States.1252")
[1] "English_United States.1252"
> Sys.setlocale("LC_NUMERIC", "C")
[1] "C"
> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
> library('stringi')
> stri_enc_get()
[1] "windows-1252"
> stri_enc_mark("\x80")
[1] "latin1"
> Encoding("\x80")
[1] "latin1"
> "\x80"
[1] "€"
> stringi::stri_trans_general("\x80", "latin-ascii")
[1] "€"
> stringi::stri_trans_general("œ", "Latin-ASCII")
[1] "oe"

Could you please give stri_enc_toutf8("\x80") a try too?

@sfirke

sfirke commented May 28, 2017

> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
> stri_enc_toutf8("\x80")
[1] "\u0080"

I'm running R 3.4.0 but also tried on R 3.3.3 and get the same result. In case more info helps:

> devtools::session_info()
Session info -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
setting  value                       
version  R version 3.4.0 (2017-04-21)
system   x86_64, mingw32             
ui       RStudio (1.0.143)           
language (EN)                        
collate  English_United States.1252  
tz       America/New_York            
date     2017-05-28                  

Packages ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
package  * version date       source                           
devtools   1.12.0  2016-12-05 CRAN (R 3.4.0)                   
digest     0.6.12  2017-01-27 CRAN (R 3.4.0)                   
memoise    1.1.0   2017-04-21 CRAN (R 3.4.0)                   
stringi  * 1.1.6   2017-05-25 Github (gagolews/stringi@27b47f4)
withr      1.0.2   2016-06-20 CRAN (R 3.4.0) 

@Tazinho
Author

Tazinho commented Jun 11, 2017

Please let me know if you still need any testers.

@gagolews
Owner

gagolews commented Jun 12, 2017

I failed to identify where the problem is (so far). :(

@sfirke

sfirke commented Jun 21, 2017

I can get the oe character and the euro symbol to convert if I insert stri_encode as you did in an example above. Do you see anything useful in this output?

> stringi::stri_trans_general("œ", "Latin-ASCII")
[1] "\u009c"
> stri_trans_general(stri_encode("œ", from="WINDOWS-1252", to="UTF-8"), "Latin-ASCII")
[1] "oe"
> stringi::stri_trans_general("\u0153", "Latin-ASCII")
[1] "oe"
> stri_encode("œ", from="ISO-8859-1", to="UTF-8")
[1] "\u009c"
> stringi::stri_trans_general("\x80", "latin-ascii")
[1] "\u0080"
> stringi::stri_trans_general(stri_encode("\x80", from="WINDOWS-1252", to="UTF-8"), "latin-ascii")
[1] "€"
> 

@sfirke

sfirke commented Jun 21, 2017

I can get more Windows users to test certain commands if helpful in learning more.

@gagolews
Owner

Thanks for all of these. What about stri_enc_info("WINDOWS-1252")?

@gagolews
Owner

Maybe one of you could help me hack on that in C++? The relevant files are stri_container_utf8.cpp and stri_container_utf16.cpp. In lines 84 and 102, respectively, a WINDOWS-1252 codec is opened. I wonder whether #if defined(_WIN32) || defined(_WIN64) is TRUE on Windows 10 (?). Adding instructions like Rprintf("I'M HERE\n"); to the code could help determine whether the preprocessor-conditional parts are executed.

Moreover, lines 121 and ~148, respectively, are actually responsible for using the opened converters.

@sfirke

sfirke commented Jun 21, 2017

Here's stri_enc_info("WINDOWS-1252"):

> stri_enc_info("WINDOWS-1252")
$Name.friendly
[1] "windows-1252"

$Name.ICU
[1] "ibm-5348_P100-1997"

$Name.UTR22
[1] "ibm-5348_P100-1997"

$Name.IBM
[1] "ibm-5348"

$Name.WINDOWS
[1] "windows-1252"

$Name.JAVA
[1] "windows-1252"

$Name.IANA
[1] "windows-1252"

$Name.MIME
[1] NA

$ASCII.subset
[1] TRUE

$Unicode.1to1
[1] TRUE

$CharSize.8bit
[1] TRUE

$CharSize.min
[1] 1

$CharSize.max
[1] 1

As to the C++ debugging, I have never done anything in C++ ... but your guidance seems like enough of a starting point that I will take a look and see what I can find out!

@sfirke

sfirke commented Jun 23, 2017

I can no longer reproduce this error, which is great, though I'm sorry to have wasted your time. As best I can tell, I had reloaded the patched version of stringi, not realizing that the changes to C++ wouldn't take until I restarted R. (This came to me after I added your print statement suggested above - even when I deleted + installed the package to clear the print statements out, I was still getting the "I'M HERE", indicating that the old C++ was getting called).

I think this can be closed. Thanks for working through this with me, Marek, and your great work on this package!

@gagolews
Owner

Heh :)

@Tazinho, can you confirm? If so, please close the issue.

@Tazinho
Author

Tazinho commented Jun 23, 2017

Yes! Here is the complete output. Thanks a lot :)

> # install.packages("devtools")
> # devtools::install_github("gagolews/stringi")
> library(stringi)
> devtools::session_info()
Session info ------------------------------------------------------------------------
 setting  value                       
 version  R version 3.3.2 (2016-10-31)
 system   x86_64, mingw32             
 ui       RStudio (1.0.143)           
 language (EN)                        
 collate  German_Germany.1252         
 tz       Europe/Berlin               
 date     2017-06-23                  

Packages ----------------------------------------------------------------------------
 package    * version  date       source                           
 base       * 3.3.2    2016-10-31 local                            
 datasets   * 3.3.2    2016-10-31 local                            
 devtools     1.13.2   2017-06-02 CRAN (R 3.3.3)                   
 digest       0.6.12   2017-01-27 CRAN (R 3.3.3)                   
 graphics   * 3.3.2    2016-10-31 local                            
 grDevices  * 3.3.2    2016-10-31 local                            
 memoise      1.1.0    2017-04-21 CRAN (R 3.3.3)                   
 methods    * 3.3.2    2016-10-31 local                            
 packrat      0.4.8-12 2017-06-23 Github (rstudio/packrat@b55feb6) 
 rstudioapi   0.6      2016-06-27 CRAN (R 3.3.3)                   
 stats      * 3.3.2    2016-10-31 local                            
 stringi    * 1.1.6    2017-06-23 Github (gagolews/stringi@dc94c5a)
 tools        3.3.2    2016-10-31 local                            
 utils      * 3.3.2    2016-10-31 local                            
 withr        1.0.2    2016-06-20 CRAN (R 3.3.3)                   
> charToRaw("œ")
[1] 9c
> Encoding("œ")
[1] "latin1"
> Sys.getlocale()
[1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"
> stringi::stri_trans_general("\u0153", "Latin-ASCII")
[1] "oe"
> stringi::stri_trans_general(stri_encode("œ", from="WINDOWS-1252", to="UTF-8"), "Latin-ASCII")
[1] "oe"
> stri_trans_general("œ", "Latin-ASCII")
[1] "oe"

@Tazinho
Author

Tazinho commented Aug 3, 2017

@gagolews Are you aware of any way to transliterate ö, ü, ä, ß to oe, ue, ae and ss, as mentioned in the first post?

I implemented this as a lookup in the snakecase package, but only for these German cases. It would be interesting to know whether there is a more general way that covers all letters which belong to only one specific language's alphabet and have an obvious transliteration. Of course, it would also be nice if this could be an option in stri_trans_general, but I don't know whether this is a stringi or an ICU issue.
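
For reference, a lookup of this kind could look roughly like the sketch below. This is not the actual snakecase implementation. The function name translit_de and the vectors from and to are hypothetical, and the characters are written as \u escapes so the code does not depend on the file encoding.

```r
library(stringi)

# Hypothetical lookup table: German umlauts and sharp s.
from <- c("\u00e4", "\u00f6", "\u00fc", "\u00c4", "\u00d6", "\u00dc", "\u00df")
to   <- c("ae",     "oe",     "ue",     "Ae",     "Oe",     "Ue",     "ss")

# vectorize_all = FALSE applies every pattern in turn to each input string.
translit_de <- function(x) {
  stri_replace_all_fixed(x, from, to, vectorize_all = FALSE)
}

translit_de("Gr\u00f6\u00dfe")  # "Groesse"
```

The drawback is the one described above: every language needs its own table, which is why a rule built into ICU would be preferable.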

@gagolews
Owner

gagolews commented Aug 4, 2017

At least ß works:

> stringi::stri_trans_general(" ö, ü, ä, ß ", "Latin-ASCII")
[1] " o, u, a, ss "

@gagolews
Owner

gagolews commented Aug 4, 2017

I guess ICU does not support this...

@zkamvar

zkamvar commented May 2, 2019

Sorry to necro this thread, but I stumbled across a problem similar to the one @Tazinho had and only just now realized that it's possible to chain the transformation identifiers together:

stringi::stri_trans_general(" ö, ü, ä, ß ", "de-ASCII; Latin-ASCII")
#> [1] " oe, ue, ae, ss "

Created on 2019-05-02 by the reprex package (v0.2.1)
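
The chaining can also be wrapped in a small helper. The following is only a sketch: to_ascii and its first argument are hypothetical names, and it assumes the ICU build bundled with stringi provides the "de-ASCII" transform, as it did in the reprex above.

```r
library(stringi)

# Hypothetical helper: run an optional language-specific rule first,
# then fall back to the generic Latin-ASCII transform.
to_ascii <- function(x, first = "de-ASCII") {
  id <- paste(c(first, "Latin-ASCII"), collapse = "; ")
  stri_trans_general(x, id)
}

out <- to_ascii(" \u00f6, \u00fc, \u00e4, \u00df ")
out                    # same input as above, so " oe, ue, ae, ss "
stri_enc_isascii(out)  # confirms that only ASCII is left
```

Passing first = NULL reduces the identifier to plain "Latin-ASCII". The transform identifiers available in a given installation can be listed with stri_trans_list().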
