Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quoted values are never turned into missing values #295

Closed
flother opened this issue Oct 20, 2015 · 13 comments
Closed

Quoted values are never turned into missing values #295

flother opened this issue Oct 20, 2015 · 13 comments
Assignees
Labels
feature a feature request or enhancement
Milestone

Comments

@flother
Copy link

flother commented Oct 20, 2015

As mentioned in a comment on #111, if a cell value is quoted it can never be a null value:

> unquoted.csv <- "1,2,3\n4,Unknown,6"
> quoted.csv <- '1,2,3\n4,"Unknown",6'
> read_csv(unquoted.csv, na="Unknown", col_names=F)
  X1   X2 X3
1  1    2  3
2  4 <NA>  6
> read_csv(quoted.csv, na="Unknown", col_names=F)
  X1      X2 X3
1  1       2  3
2  4 Unknown  6

I think the second call to read_csv should result in the "Unknown" cell being converted into a NA value — that is, the output of both calls to read_csv should be identical. It's probably good practice to ignore quotes when typing variables (as mentioned in a comment on #155).

This is causing problems parsing the Guardian's The Counted data, where all CSV fields are quoted and null values are specified using Unknown.

@hadley
Copy link
Member

hadley commented Oct 20, 2015

Do you have readr 0.2.0?

@flother
Copy link
Author

flother commented Oct 20, 2015

I do. I first saw the problem in 0.1.1 and upgraded to check that it still occurred in 0.2.0 — which it did.

@asnr
Copy link
Contributor

asnr commented Dec 1, 2015

The current behaviour is a result of TokenizerDelim::stringToken not calling Token::flagNA:

// Taken from `src/TokenizerDelim.cpp`, lines 229-246,
// commit ef750db855f9434e78bd89e8944e8b1c547bf23a

Token TokenizerDelim::fieldToken(SourceIterator begin, SourceIterator end,
                                 bool hasEscapeB, bool hasNull,
                                 int row, int col) {
  Token t(begin, end, row, col, hasNull, (hasEscapeB) ? this : NULL);
  if (trimWS_)
    t.trim();
  t.flagNA(NA_);
  return t;
}

Token TokenizerDelim::stringToken(SourceIterator begin, SourceIterator end,
                                  bool hasEscapeB, bool hasEscapeD, bool hasNull,
                                  int row, int col) {
  Token t(begin, end, row, col, hasNull, (hasEscapeD || hasEscapeB) ? this : NULL);
  if (trimWS_)
    t.trim();
  return t;
}

Adding the line t.flagNA(NA_) should lead to both Unknown and "Unknown" being turned into NA in read_csv(x, na="Unknown"). In #111 it was decided that this behaviour isn't desirable.

An alternative behaviour is that the argument passed to na is matched strictly, so that read_csv(x, na="Unknown") will not read "Unknown" as NA and read_csv(x, na='"Unknown"') will not read Unknown as NA.

By the looks of it, the latter behaviour would require more effort to implement.

Thoughts?

@jennybc
Copy link
Member

jennybc commented Apr 9, 2016

I just ran into this with polygraphing's data on screenplays. For example, all character columns are quoted in this csv: character_list5.csv. na should be c("", "NA", "NULL", "?") but read_csv() doesn't actually turn the question marks into NAs (they appear in gender). I was surprised.

Little example

library(readr)
read_csv('"a"\n"?"', na = "?")
#> Source: local data frame [1 x 1]
#> 
#>       a
#>   (chr)
#> 1     ?
read.csv(text = '"a"\n"?"', na.strings = "?")
#>    a
#> 1 NA

@hadley
Copy link
Member

hadley commented Apr 9, 2016

I think this probably needs an additional option. Otherwise how do you distinguish NA from "NA"? But it varies from file to file

@flother
Copy link
Author

flother commented Apr 9, 2016

Would you ever need to distinguish NA from "NA"? Pandas treats them both as NA values:

>>> import io
>>> import pandas as pd
>>> pd.read_csv(io.StringIO(u'1,2,3\n4,NA,6'), na_values=["NA"])
   1   2  3
0  4 NaN  6
>>> pd.read_csv(io.StringIO(u'1,2,3\n4,"NA",6'), na_values=["NA"])
   1   2  3
0  4 NaN  6

@jennybc
Copy link
Member

jennybc commented Apr 9, 2016

When Namibia is in your dataset and you're using ISO 3166-2 country codes 😬.

@jennybc
Copy link
Member

jennybc commented Apr 11, 2016

@flother

I also wondered how this could ever come up in real life. Maybe it's contrived but here's little demo. readr is at least self-consistent!

library(tibble)
x <- frame_data(
  ~country, ~code,
  "Belize", "BZ",
  "Namibia", "NA",
  "Narnia", NA_character_
)
as.data.frame(x)
#>   country code
#> 1  Belize   BZ
#> 2 Namibia   NA
#> 3  Narnia <NA>

write_csv(x, "test-readr.csv")
x2 <- read_csv("test-readr.csv")
identical(x, x2)
#> [1] TRUE

write.csv(x, "test-base.csv", row.names = FALSE)
x2_base <- read.csv("test-base.csv")
identical(x, x2_base)
#> [1] FALSE
x2_base
#>   country code
#> 1  Belize   BZ
#> 2 Namibia <NA>
#> 3  Narnia <NA>

@flother
Copy link
Author

flother commented Apr 12, 2016

Two can play at that game :) Here's an inconsistency in readr's handling of missing values:

library(readr)

quoted.csv <- '1,2,3\n4,"Unknown",6'
x1 <- read_csv(quoted.csv, na = "Unknown", col_names = F)
write_csv(x1, "test.csv", col_names = F)
x2 <- read_csv("test.csv", na = "Unknown", col_names = F)
x1
#   X1      X2 X3
# 1  1       2  3
# 2  4 Unknown  6
x2
#   X1   X2 X3
# 1  1    2  3
# 2  4 <NA>  6

However my earlier question wasn't about the exact string NA, but about whether there's a need to distinguish between a missing-value string with quotes and without. If Unknown is the NA value, are these two CSV rows identical?

Blah,"Unknown"
Blah,Unknown

While readr says they're not, most — if not all — other CSV readers say they are.

@gavinsimpson
Copy link

Here's another example that bit me at the weekend:

readr::read_csv('"x","y","z"\n1,"",\n', na = c('""','','NA'))

Source: local data frame [1 x 3]

      x     y     z
  <int> <chr> <chr>
1     1          NA

Which is a bit of a pain if you have a " quoted CSV but "" means missing (NA). You can work around this if you know the entire column is a numeric/integer

readr::read_csv('"x","y","z"\n1,"",\n', na = c('""','','NA'), col_types = "iic")

Source: local data frame [1 x 3]

      x     y     z
  <int> <int> <chr>
1     1    NA    NA

which is how I solved my immediate problem, but otherwise you are stumped.

This is causing problems reading CSV files from the Canadian Historical Climate Data website.

@hadley hadley added feature a feature request or enhancement ready labels Jun 2, 2016
@hadley
Copy link
Member

hadley commented Jun 2, 2016

This should be an optional, probably defaulting to treating quoted and unquoted missing values identically.

@MilesMcBain
Copy link

MilesMcBain commented Jun 27, 2016

Got bit by this today. With regards to the option implementation I reckon something like: parse_missing_before_unquote=F

So the default behaviour will remove both "NA" and NA given the argument na = c("NA") (I agree with @flother in that this is the expectation).

If you set the flag to TRUE you get more control, allowing you specify na = c("NA") to match just NA and na = c("\"NA\"") to match "NA". That should keep the Namibians happy.

@hadley
Copy link
Member

hadley commented Jul 6, 2016

I think the argument should be called quoted_na

@lock lock bot locked and limited conversation to collaborators Sep 25, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
feature a feature request or enhancement
Projects
None yet
Development

No branches or pull requests

7 participants