Quoted values are never turned into missing values #295
Comments
Do you have readr 0.2.0? |
I do. I first saw the problem in 0.1.1 and upgraded to check that it still occurred in 0.2.0 — which it did. |
The current behaviour is a result of
// Taken from `src/TokenizerDelim.cpp`, lines 229-246,
// commit ef750db855f9434e78bd89e8944e8b1c547bf23a
Token TokenizerDelim::fieldToken(SourceIterator begin, SourceIterator end,
                                 bool hasEscapeB, bool hasNull,
                                 int row, int col) {
  Token t(begin, end, row, col, hasNull, (hasEscapeB) ? this : NULL);
  if (trimWS_)
    t.trim();
  t.flagNA(NA_);
  return t;
}

Token TokenizerDelim::stringToken(SourceIterator begin, SourceIterator end,
                                  bool hasEscapeB, bool hasEscapeD, bool hasNull,
                                  int row, int col) {
  Token t(begin, end, row, col, hasNull, (hasEscapeD || hasEscapeB) ? this : NULL);
  if (trimWS_)
    t.trim();
  return t;
}
Adding the line An alternative behaviour is that the argument passed to By the looks of it, the latter behaviour would require more effort to implement. Thoughts? |
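The difference between the two functions quoted above can be illustrated with a small standalone sketch. This is not readr's code: `SimpleToken`, `makeFieldToken`, `makeStringToken`, and the `flagQuoted` parameter are hypothetical stand-ins, and the NA check is assumed to be a plain string comparison.

```cpp
#include <string>
#include <utility>

// Hypothetical, simplified stand-in for readr's Token (not the real class).
enum class TokenType { String, Missing };

struct SimpleToken {
  std::string text;
  TokenType type;

  explicit SimpleToken(std::string s)
      : text(std::move(s)), type(TokenType::String) {}

  // Mark the token as missing when its text equals the NA string.
  void flagNA(const std::string& na) {
    if (text == na) type = TokenType::Missing;
  }
};

// Unquoted field: the NA check is always applied, as in fieldToken().
SimpleToken makeFieldToken(std::string text, const std::string& na) {
  SimpleToken t(std::move(text));
  t.flagNA(na);
  return t;
}

// Quoted field: with flagQuoted == false this mirrors stringToken() as
// quoted above (no NA check); with flagQuoted == true it behaves as if
// the flagNA() call had been added.
SimpleToken makeStringToken(std::string text, const std::string& na,
                            bool flagQuoted) {
  SimpleToken t(std::move(text));
  if (flagQuoted) t.flagNA(na);
  return t;
}
```

With `na = "?"`, `makeStringToken("?", "?", false)` stays a string token, while `makeStringToken("?", "?", true)` comes back flagged as missing, the same as the unquoted field.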
I just ran into this with polygraphing's data on screenplays. For example, all character columns are quoted in this csv: character_list5.csv. Little example:
library(readr)
read_csv('"a"\n"?"', na = "?")
#> Source: local data frame [1 x 1]
#>
#> a
#> (chr)
#> 1 ?
read.csv(text = '"a"\n"?"', na.strings = "?")
#> a
#> 1 NA |
I think this probably needs an additional option. Otherwise how do you distinguish NA from "NA"? But it varies from file to file |
Would you ever need to distinguish
|
When Namibia is in your dataset and you're using ISO 3166-2 country codes 😬. |
I also wondered how this could ever come up in real life. Maybe it's contrived, but here's a little demo.
library(tibble)
library(readr)
x <- frame_data(
~country, ~code,
"Belize", "BZ",
"Namibia", "NA",
"Narnia", NA_character_
)
as.data.frame(x)
#> country code
#> 1 Belize BZ
#> 2 Namibia NA
#> 3 Narnia <NA>
write_csv(x, "test-readr.csv")
x2 <- read_csv("test-readr.csv")
identical(x, x2)
#> [1] TRUE
write.csv(x, "test-base.csv", row.names = FALSE)
x2_base <- read.csv("test-base.csv")
identical(x, x2_base)
#> [1] FALSE
x2_base
#> country code
#> 1 Belize BZ
#> 2 Namibia <NA>
#> 3 Narnia <NA> |
Two can play at that game :) Here's an inconsistency in readr's handling of missing values:
library(readr)
quoted.csv <- '1,2,3\n4,"Unknown",6'
x1 <- read_csv(quoted.csv, na = "Unknown", col_names = F)
write_csv(x1, "test.csv", col_names = F)
x2 <- read_csv("test.csv", na = "Unknown", col_names = F)
x1
# X1 X2 X3
# 1 1 2 3
# 2 4 Unknown 6
x2
# X1 X2 X3
# 1 1 2 3
# 2 4 <NA> 6
However my earlier question wasn't about the exact string
While readr says they're not, most — if not all — other CSV readers say they are. |
Here's another example that bit me at the weekend:
readr::read_csv('"x","y","z"\n1,"",\n', na = c('""','','NA'))
Source: local data frame [1 x 3]
x y z
<int> <chr> <chr>
1 1 NA
Which is a bit of a pain if you have a
readr::read_csv('"x","y","z"\n1,"",\n', na = c('""','','NA'), col_types = "iic")
Source: local data frame [1 x 3]
x y z
<int> <int> <chr>
1 1 NA NA
which is how I solved my immediate problem, but otherwise you are stumped. This is causing problems reading CSV files from the Canadian Historical Climate Data website. |
This should be an option, probably defaulting to treating quoted and unquoted missing values identically. |
Got bit by this today. With regard to the option implementation I reckon something like: So the default behaviour will remove both "NA" and NA given the argument If you set the flag to TRUE you get more control, allowing you to specify |
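As a rough illustration of how such an option might behave, here is a self-contained sketch that splits one CSV record and decides per field whether it counts as missing. Everything in it is an assumption made for illustration: `Field`, `parseRecord`, and the `quotedNA` flag are hypothetical names, not readr's API, and the flag's polarity (`true` means quoted matches are also treated as missing) is just one possible choice.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type: the field text, whether it was quoted in the
// input, and whether it should be treated as a missing value.
struct Field {
  std::string text;
  bool quoted;
  bool missing;
};

// Split one CSV record on commas, honouring double quotes ("" inside a
// quoted field is an escaped quote). A field is missing when its text
// matches one of the `na` strings and, if it was quoted, only when the
// hypothetical `quotedNA` option is true.
std::vector<Field> parseRecord(const std::string& line,
                               const std::vector<std::string>& na,
                               bool quotedNA) {
  std::vector<Field> out;
  std::string cur;
  bool inQuotes = false;
  bool wasQuoted = false;

  auto finish = [&]() {
    bool matches = false;
    for (const auto& s : na) {
      if (cur == s) matches = true;
    }
    out.push_back(Field{cur, wasQuoted, matches && (quotedNA || !wasQuoted)});
    cur.clear();
    wasQuoted = false;
  };

  for (std::size_t i = 0; i < line.size(); ++i) {
    const char c = line[i];
    if (inQuotes) {
      if (c == '"' && i + 1 < line.size() && line[i + 1] == '"') {
        cur += '"';  // escaped quote inside a quoted field
        ++i;
      } else if (c == '"') {
        inQuotes = false;  // closing quote
      } else {
        cur += c;
      }
    } else if (c == '"') {
      inQuotes = true;
      wasQuoted = true;
    } else if (c == ',') {
      finish();
    } else {
      cur += c;
    }
  }
  finish();  // last field of the record
  return out;
}
```

For the record `4,"Unknown",6` with `na = {"Unknown"}`, the second field is missing when `quotedNA` is `true` and is kept as the string `Unknown` when it is `false`. For `NA,"NA"` with `quotedNA = false`, only the unquoted field is missing, which is the distinction the Namibia example needs.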
I think the argument should be called |
As mentioned in a comment on #111, if a cell value is quoted it can never be a null value:
I think the second call to read_csv should result in the "Unknown" cell being converted into an NA value; that is, the output of both calls to read_csv should be identical. It's probably good practice to ignore quotes when typing variables (as mentioned in a comment on #155).
This is causing problems parsing the Guardian's The Counted data, where all CSV fields are quoted and null values are specified using Unknown.