Quoted values are never turned into missing values #295

flother · 2015-10-20T16:11:58Z

As mentioned in a comment on #111, if a cell value is quoted it can never be a null value:

> unquoted.csv <- "1,2,3\n4,Unknown,6"
> quoted.csv <- '1,2,3\n4,"Unknown",6'
> read_csv(unquoted.csv, na="Unknown", col_names=F)
  X1   X2 X3
1  1    2  3
2  4 <NA>  6
> read_csv(quoted.csv, na="Unknown", col_names=F)
  X1      X2 X3
1  1       2  3
2  4 Unknown  6

I think the second call to read_csv should convert the "Unknown" cell into an NA value, so that both calls to read_csv produce identical output. It's probably good practice to ignore quotes when typing variables (as mentioned in a comment on #155).

This is causing problems parsing the Guardian's The Counted data, where all CSV fields are quoted and null values are specified using Unknown.


hadley · 2015-10-20T19:45:29Z

Do you have readr 0.2.0?

flother · 2015-10-20T20:11:46Z

I do. I first saw the problem in 0.1.1 and upgraded to check that it still occurred in 0.2.0 — which it did.

asnr · 2015-12-01T04:42:50Z

The current behaviour is a result of TokenizerDelim::stringToken not calling Token::flagNA:

// Taken from `src/TokenizerDelim.cpp`, lines 229-246,
// commit ef750db855f9434e78bd89e8944e8b1c547bf23a

Token TokenizerDelim::fieldToken(SourceIterator begin, SourceIterator end,
                                 bool hasEscapeB, bool hasNull,
                                 int row, int col) {
  Token t(begin, end, row, col, hasNull, (hasEscapeB) ? this : NULL);
  if (trimWS_)
    t.trim();
  t.flagNA(NA_);
  return t;
}

Token TokenizerDelim::stringToken(SourceIterator begin, SourceIterator end,
                                  bool hasEscapeB, bool hasEscapeD, bool hasNull,
                                  int row, int col) {
  Token t(begin, end, row, col, hasNull, (hasEscapeD || hasEscapeB) ? this : NULL);
  if (trimWS_)
    t.trim();
  return t;
}

Adding the line t.flagNA(NA_) to stringToken should lead to both Unknown and "Unknown" being turned into NA in read_csv(x, na="Unknown"). In #111 it was decided that this behaviour isn't desirable.

An alternative behaviour is that the argument passed to na is matched strictly, so that read_csv(x, na="Unknown") will not read "Unknown" as NA and read_csv(x, na='"Unknown"') will not read Unknown as NA.

By the looks of it, the latter behaviour would require more effort to implement.

Thoughts?
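To make the difference concrete, here is a minimal Python model of the two token functions quoted above. It is only a sketch, not readr's code: Token, field_token, string_token, parse_row and the flag_quoted switch are hypothetical names that mirror the C++ loosely, None stands in for NA, and the splitter assumes no embedded commas or escaped quotes.

```python
from dataclasses import dataclass

NA = None  # stand-in for R's NA in this sketch


@dataclass
class Token:
    text: str
    is_na: bool = False

    def flag_na(self, na_strings):
        # Mark the token as missing when its text matches any NA string.
        if self.text in na_strings:
            self.is_na = True
        return self


def field_token(text, na_strings):
    # Unquoted field: always checked against the NA strings.
    return Token(text).flag_na(na_strings)


def string_token(text, na_strings, flag_quoted=False):
    # Quoted field (quotes already stripped by the caller).
    # flag_quoted=False models the behaviour reported in this issue;
    # flag_quoted=True models adding the flagNA call (hypothetical switch).
    token = Token(text)
    if flag_quoted:
        token.flag_na(na_strings)
    return token


def parse_row(line, na_strings, flag_quoted=False):
    # Naive splitter: assumes no embedded commas or escaped quotes.
    values = []
    for raw in line.split(","):
        if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
            token = string_token(raw[1:-1], na_strings, flag_quoted)
        else:
            token = field_token(raw, na_strings)
        values.append(NA if token.is_na else token.text)
    return values
```

With this model, parse_row('4,"Unknown",6', ["Unknown"]) keeps the quoted cell as text, while passing flag_quoted=True turns it into None, matching what the unquoted row already does.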

jennybc · 2016-04-09T07:14:57Z

I just ran into this with polygraphing's data on screenplays. For example, all character columns are quoted in this csv: character_list5.csv. na should be c("", "NA", "NULL", "?") but read_csv() doesn't actually turn the question marks into NAs (they appear in gender). I was surprised.

Little example

library(readr)
read_csv('"a"\n"?"', na = "?")
#> Source: local data frame [1 x 1]
#> 
#>       a
#>   (chr)
#> 1     ?
read.csv(text = '"a"\n"?"', na.strings = "?")
#>    a
#> 1 NA

hadley · 2016-04-09T17:31:40Z

I think this probably needs an additional option. Otherwise how do you distinguish NA from "NA"? But it varies from file to file

flother · 2016-04-09T22:19:10Z

Would you ever need to distinguish NA from "NA"? Pandas treats them both as NA values:

>>> import io
>>> import pandas as pd
>>> pd.read_csv(io.StringIO(u'1,2,3\n4,NA,6'), na_values=["NA"])
   1   2  3
0  4 NaN  6
>>> pd.read_csv(io.StringIO(u'1,2,3\n4,"NA",6'), na_values=["NA"])
   1   2  3
0  4 NaN  6

jennybc · 2016-04-09T22:29:10Z

When Namibia is in your dataset and you're using ISO 3166-2 country codes 😬.

jennybc · 2016-04-11T15:19:41Z

@flother

I also wondered how this could ever come up in real life. Maybe it's contrived, but here's a little demo. readr is at least self-consistent!

library(tibble)
x <- frame_data(
  ~country, ~code,
  "Belize", "BZ",
  "Namibia", "NA",
  "Narnia", NA_character_
)
as.data.frame(x)
#>   country code
#> 1  Belize   BZ
#> 2 Namibia   NA
#> 3  Narnia <NA>

write_csv(x, "test-readr.csv")
x2 <- read_csv("test-readr.csv")
identical(x, x2)
#> [1] TRUE

write.csv(x, "test-base.csv", row.names = FALSE)
x2_base <- read.csv("test-base.csv")
identical(x, x2_base)
#> [1] FALSE
x2_base
#>   country code
#> 1  Belize   BZ
#> 2 Namibia <NA>
#> 3  Narnia <NA>
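The round trip above only works if the writer and reader agree on how to tell a literal "NA" string from a missing value. The sketch below shows one such convention in Python. It assumes that the writer quotes any string that collides with the NA marker, which is what the identical() result suggests but which is not checked against readr's source here. The names write_field, read_field and NA_MARKER are hypothetical.

```python
NA_MARKER = "NA"  # assumed marker for missing values


def write_field(value):
    # Missing values are written as the bare marker.
    if value is None:
        return NA_MARKER
    # Literal text that collides with the marker is quoted to keep it distinct.
    if value == NA_MARKER:
        return f'"{value}"'
    return value


def read_field(raw):
    # Only the bare, unquoted marker counts as missing.
    if raw == NA_MARKER:
        return None
    # Quoted text is unwrapped and kept as a string.
    if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
        return raw[1:-1]
    return raw


codes = ["BZ", "NA", None]
round_tripped = [read_field(write_field(code)) for code in codes]
```

Under this convention Namibia's code survives the round trip and Narnia's missing code stays missing. A reader that ignores quotes when matching NA strings cannot make that distinction.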

flother · 2016-04-12T21:47:03Z

Two can play at that game :) Here's an inconsistency in readr's handling of missing values:

library(readr)

quoted.csv <- '1,2,3\n4,"Unknown",6'
x1 <- read_csv(quoted.csv, na = "Unknown", col_names = F)
write_csv(x1, "test.csv", col_names = F)
x2 <- read_csv("test.csv", na = "Unknown", col_names = F)
x1
#   X1      X2 X3
# 1  1       2  3
# 2  4 Unknown  6
x2
#   X1   X2 X3
# 1  1    2  3
# 2  4 <NA>  6

However my earlier question wasn't about the exact string NA, but about whether there's a need to distinguish between a missing-value string with quotes and without. If Unknown is the NA value, are these two CSV rows identical?

Blah,"Unknown"
Blah,Unknown

While readr says they're not, most — if not all — other CSV readers say they are.
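As one data point, Python's standard csv module discards the quotes while parsing, so code downstream of the reader cannot tell the two rows apart. This is a small sketch using only the standard library. The read_rows helper and its na_values parameter are hypothetical names for illustration, not part of the csv module.

```python
import csv
import io

data = 'Blah,"Unknown"\nBlah,Unknown\n'

# csv.reader removes the surrounding quotes, so both rows parse to the same list.
rows = list(csv.reader(io.StringIO(data)))


def read_rows(text, na_values=("Unknown",)):
    # Replace any cell matching an NA string with None after parsing.
    reader = csv.reader(io.StringIO(text))
    return [[None if cell in na_values else cell for cell in row] for row in reader]
```

Here rows[0] and rows[1] compare equal, and read_rows(data) marks the second cell as missing in both rows. This shows the behaviour of one reader only. It does not settle what every CSV reader does.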

gavinsimpson · 2016-06-01T22:04:59Z

Here's another example that bit me at the weekend:

readr::read_csv('"x","y","z"\n1,"",\n', na = c('""','','NA'))

Source: local data frame [1 x 3]

      x     y     z
  <int> <chr> <chr>
1     1          NA

Which is a bit of a pain if you have a "-quoted CSV in which "" means missing (NA). You can work around this if you know the entire column is numeric/integer:

readr::read_csv('"x","y","z"\n1,"",\n', na = c('""','','NA'), col_types = "iic")

Source: local data frame [1 x 3]

      x     y     z
  <int> <int> <chr>
1     1    NA    NA

which is how I solved my immediate problem, but otherwise you are stumped.

This is causing problems reading CSV files from the Canadian Historical Climate Data website.

hadley · 2016-06-02T03:06:10Z

This should be an option, probably defaulting to treating quoted and unquoted missing values identically.

MilesMcBain · 2016-06-27T04:16:35Z

Got bit by this today. With regard to the option implementation, I reckon something like: parse_missing_before_unquote=F

So the default behaviour will treat both "NA" and NA as missing given the argument na = c("NA") (I agree with @flother that this is the expectation).

If you set the flag to TRUE you get more control, allowing you to specify na = c("NA") to match just NA and na = c("\"NA\"") to match "NA". That should keep the Namibians happy.
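The proposed semantics can be sketched in a few lines of Python. This only illustrates the proposal as worded above and is not an implementation in readr. parse_field is a hypothetical helper, the flag name is taken from the comment, and None stands in for NA.

```python
def parse_field(raw, na, parse_missing_before_unquote=False):
    # Detect a field wrapped in double quotes and compute its unquoted text.
    quoted = len(raw) >= 2 and raw.startswith('"') and raw.endswith('"')
    unquoted = raw[1:-1] if quoted else raw
    if parse_missing_before_unquote:
        # Strict mode: compare the raw field, quotes included, against na.
        return None if raw in na else unquoted
    # Default mode: quotes are ignored when looking for missing values.
    return None if unquoted in na else unquoted
```

With the default, parse_field('"NA"', ["NA"]) and parse_field('NA', ["NA"]) both give None. With the flag set to True, the quoted "NA" comes back as the string NA unless na itself contains the quoted form, so a country code column can keep Namibia.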

hadley · 2016-07-06T18:16:00Z

I think the argument should be called quoted_na


jennybc mentioned this issue Apr 11, 2016

NA printing tidyverse/tibble#69

Closed

hadley added the feature and ready labels Jun 2, 2016

hadley assigned jimhester Jul 6, 2016

hadley modified the milestone: 0.3.0 Jul 13, 2016

jimhester mentioned this issue Jul 13, 2016

quoted_na argument to control behavior of missing values inside strings #471

Merged

jimhester added in progress and removed ready labels Jul 13, 2016

jimhester added a commit to jimhester/readr that referenced this issue Jul 13, 2016

quoted_na argument to control behavior of missing values inside strings

256f1f8

Fixes tidyverse#295

hadley closed this as completed in #471 Jul 13, 2016

hadley pushed a commit that referenced this issue Jul 13, 2016

quoted_na argument to control behavior of missing values inside strings (#471)

63a7ce9

Fixes #295

hadley removed the in progress label Jul 13, 2016

jennybc mentioned this issue Jul 16, 2016

Handle quoted NA strings #481

Closed

MilesMcBain mentioned this issue Jan 24, 2017

behaviour of is.na njtierney/naniar#31

Closed

njtierney mentioned this issue Mar 1, 2018

blog posts and other resources to look at njtierney/naniar#26

Open

lock bot locked and limited conversation to collaborators Sep 25, 2018
