Should be able to supply encoding #40
Comments
@romainfrancois can you look into this? We need to be able to accept an arbitrary encoding and convert to UTF-8 for R. |
I'll have a look at how it is done in R. Encoding is something I don't quite understand yet, and so it has been ignored in Rcpp, etc. So would this be an argument to the function, or would we have to detect the encoding somehow? |
It would be an argument to a function. Detecting encoding automatically is difficult - the stringi package has some code using ICU (see bottom of http://docs.rexamine.com/stringi/compat_tab_conversion.html) |
I think I have a handle on how to do this now - need to use iconv |
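As a rough illustration of the `iconv` approach mentioned here, the sketch below uses base R only and converts a Shift-JIS byte string to UTF-8. The variable names are made up for the example, and the encoding name `"SHIFT-JIS"` is an assumption about what the local iconv implementation accepts (`iconvlist()` shows the names available on a given platform).

```r
# Shift-JIS bytes for "こんにちは", given as raw bytes so the example
# does not depend on the encoding of this script or the session locale.
sjis_bytes <- as.raw(c(0x82, 0xb1, 0x82, 0xf1, 0x82, 0xc9,
                       0x82, 0xbf, 0x82, 0xcd))
sjis_text <- rawToChar(sjis_bytes)

# The caller states the source encoding explicitly; nothing is guessed.
# "SHIFT-JIS" is assumed to be a name the platform's iconv recognises.
utf8_text <- iconv(sjis_text, from = "SHIFT-JIS", to = "UTF-8")

# iconv() marks the result as UTF-8 when converting to UTF-8.
Encoding(utf8_text)
```

If the name is not recognised or the bytes are invalid for the declared encoding, `iconv()` returns `NA` rather than a converted string, so a caller would probably want to check for that.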
Actually, there is no automatic conversion to UTF-8. I think we could automatically detect the encoding of a file using the chardet command-line tool.
It seems to work pretty well. The fileEncoding option would still be useful to me. |
@blaquans right, there's no automatic conversion because this is an open issue. Character encoding detection is difficult to do well automatically, and I think it is dangerous to turn on by default. |
Base functions accept encodings such as read.csv(..., fileEncoding="SJIS"). |
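For comparison, here is a minimal sketch of the base R behaviour referred to above. It writes a small Shift-JIS file to a temporary path and reads it back with `fileEncoding`. The file contents and variable names are invented for the example, the encoding name `"SHIFT-JIS"` is assumed to be supported by the platform's iconv, and the session is assumed to run in a UTF-8 locale (base R re-encodes to the native encoding, not necessarily UTF-8).

```r
# Build a two-line CSV whose second line is "こんにちは" in Shift-JIS.
# Raw bytes are written directly so no implicit re-encoding happens.
sjis_bytes <- as.raw(c(0x82, 0xb1, 0x82, 0xf1, 0x82, 0xc9,
                       0x82, 0xbf, 0x82, 0xcd))
path <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("greeting\n"), sjis_bytes, charToRaw("\n")), path)

# fileEncoding declares the encoding of the file on disk; base R
# converts the contents to the session's native encoding while reading.
df <- read.csv(path, fileEncoding = "SHIFT-JIS", stringsAsFactors = FALSE)
df$greeting

unlink(path)
```

The point of the comparison is that the encoding is supplied by the caller as an argument, which is the interface being asked for in this issue.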
The interface will probably get nicer, but this now works :)
x <- c("こんにちは")
x
#> [1] "こんにちは"
Encoding(x)
#> [1] "UTF-8"
y <- iconv(x, "UTF-8", "shift-jis")
y
#> [1] "\x82\xb1\x82\xf1\x82\u0242\xbf\x82\xcd"
Encoding(y)
#> [1] "unknown"
ja <- locale("ja", encoding = "shift-jis")
z <- parse_character(y, locale = ja)
z
#> [1] "こんにちは"
Encoding(z)
#> [1] "UTF-8" |
Output should always be utf-8