Work around bug in enc2utf8 / translateCharUTF8 on Windows #287

Closed
patperry opened this issue Nov 8, 2017 · 3 comments

patperry commented Nov 8, 2017

R's handling of native text is buggy on Windows. Specifically, R marks all Windows-1252 text as Latin-1. This causes problems when converting strings marked "latin1" to UTF-8: bytes in the range 0x80 to 0x9F get translated to the C1 control characters U+0080 to U+009F instead of the characters Windows-1252 assigns to those bytes, such as curly quotes. See, for example, the input string "You don‘t get “your” money’s worth":

[Screenshot, 2017-11-08 at 2:39:48 pm]
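To make the mis-translation concrete, here is a minimal C sketch of what a strict Latin-1 reading does to a single byte. It is an illustration only, not R's actual conversion code, and the function names `encode_utf8_small` and `latin1_byte_to_utf8` are hypothetical. Under Latin-1, each byte maps to the code point with the same numeric value, so the Windows-1252 left curly quote (byte 0x93) comes out as U+0093.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point below U+0800 as UTF-8.
 * Returns the number of bytes written to out (1 or 2). */
static size_t encode_utf8_small(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    out[0] = (unsigned char)(0xC0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
}

/* Hypothetical helper: Latin-1 reading of one byte.
 * Every byte maps to the code point with the same value,
 * so 0x80..0x9F become the C1 controls U+0080..U+009F. */
size_t latin1_byte_to_utf8(unsigned char byte, unsigned char *out)
{
    return encode_utf8_small(byte, out);
}
```

With this reading, byte 0x93 is encoded as the two bytes C2 93 (U+0093), whereas the character intended under Windows-1252 is U+201C, whose UTF-8 encoding is E2 80 9C.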

More context: https://stat.ethz.ch/pipermail/r-devel/2017-September/074908.html

You can work around this bug by interpreting CE_LATIN1 as Windows-1252 on Windows. Feel free to copy code from https://github.com/patperry/r-utf8/blob/master/src/util.c#L59
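The workaround can be sketched roughly as follows. This is a self-contained illustration written for this thread, not the code from the linked util.c and not what stringi does; the names `cp1252_high`, `cp1252_to_codepoint`, and `cp1252_to_utf8` are hypothetical. It assumes the caller has already decided that the input should be read as Windows-1252, for instance because the string's encoding mark is CE_LATIN1 and the build targets Windows. The five bytes that Windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) are passed through as the matching C1 code points, which is one possible choice among several.

```c
#include <stddef.h>
#include <stdint.h>

/* Windows-1252 code points for bytes 0x80..0x9F.
 * Undefined bytes keep their own value (a design choice of this sketch). */
static const uint16_t cp1252_high[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

/* Map one Windows-1252 byte to its Unicode code point. */
static uint32_t cp1252_to_codepoint(unsigned char byte)
{
    if (byte >= 0x80 && byte <= 0x9F)
        return cp1252_high[byte - 0x80];
    return byte; /* outside 0x80..0x9F, Windows-1252 agrees with Latin-1 */
}

/* Convert n Windows-1252 bytes to NUL-terminated UTF-8.
 * dst must have room for 3 * n + 1 bytes.
 * Returns the UTF-8 length, excluding the terminating NUL. */
size_t cp1252_to_utf8(const unsigned char *src, size_t n, unsigned char *dst)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = cp1252_to_codepoint(src[i]);
        if (cp < 0x80) {
            dst[len++] = (unsigned char)cp;
        } else if (cp < 0x800) {
            dst[len++] = (unsigned char)(0xC0 | (cp >> 6));
            dst[len++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            /* All table entries are below U+10000, so 3 bytes suffice. */
            dst[len++] = (unsigned char)(0xE0 | (cp >> 12));
            dst[len++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            dst[len++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    dst[len] = '\0';
    return len;
}
```

In an R extension, a routine like this would presumably be applied only on Windows and only to strings whose encoding mark is CE_LATIN1, leaving every other case to R's own translation.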

gagolews (Owner) commented

Hi, I've already worked around that in #270

patperry (Author) commented

Great to hear! Sorry for not checking the devel version before posting.

gagolews (Owner) commented

No worries!
I'll be filing a CRAN update of stringi today.
