Should be able to supply encoding #40

hadley · 2014-06-19T10:55:04Z

Output should always be utf-8

hadley · 2014-06-23T14:59:18Z

@romainfrancois can you look into this? We need to be able to accept arbitrary encoding and convert to utf-8 for R.

romainfrancois · 2014-06-23T15:02:48Z

I'll have a look at how it is done in R. Encoding is something I don't quite understand yet, and so it has been ignored in Rcpp, etc.

So would this be an argument to the function, or would we have to detect the encoding somehow?

hadley · 2014-06-23T15:06:33Z

It would be an argument to a function. Detecting encoding automatically is difficult; the stringi package has some code for it that uses ICU (see the bottom of http://docs.rexamine.com/stringi/compat_tab_conversion.html).

hadley · 2015-03-09T20:51:47Z

I think I have a handle on how to do this now: we need to use iconv.
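
A minimal sketch of the kind of conversion iconv performs, using base R only. The sample bytes and variable names are illustrative assumptions, not code from this issue or from readr's implementation:

```r
# Bytes for "café" encoded as Latin-1 (0xE9 is "é" in Latin-1).
raw_bytes <- as.raw(c(0x63, 0x61, 0x66, 0xe9))
latin1 <- rawToChar(raw_bytes)

# Convert from the declared source encoding to UTF-8.
utf8 <- iconv(latin1, from = "latin1", to = "UTF-8")

# The result is marked as UTF-8, and "é" is now the two bytes c3 a9.
Encoding(utf8)
charToRaw(utf8)
```

The key point is that the source encoding has to be supplied by the caller; iconv does not guess it.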

pachevalier · 2015-04-10T16:18:28Z

Actually, there is no automatic conversion to UTF-8. I think we could automatically detect the encoding of a file using the chardet command-line tool.

> system("chardet sources/DE_PF_et_FA.txt")
sources/DE_PF_et_FA.txt: windows-1252 with confidence 0.73

It seems to work pretty well. The fileEncoding option would still be useful to me.

hadley · 2015-04-10T16:19:43Z

@blaquans right, there's no automatic conversion because this is an open issue. Character encoding detection is difficult to do well automatically, and I think it is dangerous to turn on by default.

okumuralab · 2015-04-11T05:10:42Z

Base functions accept encodings such as read.csv(..., fileEncoding="SJIS").
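
For comparison, a small base R sketch of the fileEncoding argument mentioned above. The temporary file and its Latin-1 contents are made up for illustration, and the result assumes a session whose native encoding can represent "é" (for example a UTF-8 locale):

```r
# Write a tiny CSV whose only data value is "café" in Latin-1 bytes.
path <- tempfile(fileext = ".csv")
bytes <- c(charToRaw("name\ncaf"), as.raw(0xe9), charToRaw("\n"))
writeBin(bytes, path)

# Declare the file's encoding so base R re-encodes it while reading.
df <- read.csv(path, fileEncoding = "latin1", stringsAsFactors = FALSE)
df$name
```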

hadley · 2015-09-09T22:04:42Z

The interface will probably get nicer, but this now works :)

x <- c("こんにちは")
x
#> [1] "こんにちは"
Encoding(x)
#> [1] "UTF-8"

y <- iconv(x, "UTF-8", "shift-jis")
y
#> [1] "\x82\xb1\x82\xf1\x82\u0242\xbf\x82\xcd"
Encoding(y)
#> [1] "unknown"

ja <- locale("ja", encoding = "shift-jis")
z <- parse_character(y, locale = ja)
z
#> [1] "こんにちは"
Encoding(z)
#> [1] "UTF-8"
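
The same kind of locale can also be passed when reading a file. A hedged sketch, assuming an installed readr version whose read_csv() takes a locale argument in the same way parse_character() does above; the temporary file and its contents are invented for illustration:

```r
library(readr)

# Write a tiny CSV whose only data value is "café" in Latin-1 bytes.
path <- tempfile(fileext = ".csv")
bytes <- c(charToRaw("name\ncaf"), as.raw(0xe9), charToRaw("\n"))
writeBin(bytes, path)

# Declare the input encoding via the locale; the strings come back as UTF-8.
df <- read_csv(path, locale = locale(encoding = "latin1"))
df$name
charToRaw(df$name)
```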

hadley mentioned this issue May 5, 2015

UTF8/character encoding #164

Closed

This was referenced Jul 5, 2015

support UTF-8 fules, allow the 'encoding' option in read_csv #191

Closed

Import de données et readr larmarange/analyse-R#44

Closed

hadley mentioned this issue Aug 3, 2015

Character encoding problems with write_csv #227

Closed

hadley closed this as completed in 9e1e2e2 Sep 9, 2015

briatte mentioned this issue Sep 15, 2015

Use read_tsv from readr instead of read.table rOpenGov/eurostat#29

Closed

davidski mentioned this issue Sep 16, 2015

Support for UTF-8-BOM #263

Closed

lock bot locked and limited conversation to collaborators Sep 25, 2018
