
write_csv (and any other in the write_delim family) can't handle NaN values #1082

Closed
jmobrien opened this issue Mar 12, 2020 · 2 comments · Fixed by sthagen/tidyverse-readr#2

@jmobrien

Hi, I'm working with readr version 1.3.1. In this and previous versions, I've found that whenever there are NaNs in the data, write_csv presents no options for handling them. They are always written out as NaN, no matter what is passed to the parameter "na".

Within R, of course, NaNs function as a kind of NA, and can sometimes emerge organically from normal operations where "true" NAs are present, as shown in the reprex.

The complementary reading functions within R (read_csv and read.csv) can recognize the NaNs and thus preserve them and any expected behavior, but this presents problems for moving data to any outside environment where the user would expect that data to be marked as missing.

I could see preserving NaNs for some users if that avoids information loss in their data, but I'd love to at least have a choice of behavior. And if I'm being opinionated, I think the reasonable default behavior would handle NaNs uniformly with all the other classes/types of NA when writing, fitting the expectations of most users, who are reasoning from the fact that is.na(NaN) evaluates to TRUE.

library(tibble)
library(dplyr)
library(readr)

is.na(NaN)  # TRUE -- NaNs are considered missing data

# Data with some NAs and NaNs
dat <-
  tibble(a = c(NA, 1:10),
         b = c(NA, 11:20),
         c = c(NA, 21:25, NaN, 27:30),
         d = c(NA, 31:35, NA, 37:40)
  ) %>%
  # NaNs can derive from NAs in some operations:
  mutate(e = rowMeans(., na.rm = TRUE))

# write_csv writes the NaNs out verbatim:
write_csv(dat, "./file.csv")

# Explicitly specifying the na parameter does nothing for them:
write_csv(dat, na = "XXX", "./file2.csv")
@jimhester (Collaborator) commented Mar 13, 2020

I don't think the readr behavior is likely to change. However, you can convert the NaNs to normal NAs yourself prior to writing if this is the behavior you prefer.

dat <-
  tibble::tibble(a = c(NA, 1:10),
                 b = c(NA, 11:20),
                 c = c(NA, 21:25, NaN, 27:30),
                 d = c(NA, 31:35, NA, 37:40)
  )

dat[] <- lapply(dat, function(x) { x[is.nan(x)] <- NA; x })
dat
#> # A tibble: 11 x 4
#>        a     b     c     d
#>    <int> <int> <dbl> <int>
#>  1    NA    NA    NA    NA
#>  2     1    11    21    31
#>  3     2    12    22    32
#>  4     3    13    23    33
#>  5     4    14    24    34
#>  6     5    15    25    35
#>  7     6    16    NA    NA
#>  8     7    17    27    37
#>  9     8    18    28    38
#> 10     9    19    29    39
#> 11    10    20    30    40

Created on 2020-03-13 by the reprex package (v0.3.0)
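For those already using dplyr, the same conversion can be expressed with mutate(across(...)). This is only a sketch, assuming dplyr >= 1.0.0 (where across() was introduced); is.nan() returns FALSE for every element of non-double columns, so applying it across everything() is safe for mixed column types:

```r
library(dplyr)

dat <- tibble::tibble(
  a = c(NA, 1:10),
  c = c(NA, 21:25, NaN, 27:30)
)

# Replace NaN with NA in every column; is.nan() is FALSE for all
# elements of non-double columns, so mixed types are handled safely.
dat_clean <- dat %>%
  mutate(across(everything(), ~ replace(.x, is.nan(.x), NA)))

any(vapply(dat_clean, function(x) any(is.nan(x)), logical(1)))
#> FALSE
```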

@jmobrien (Author)

Of course it's possible to work around this. But that assumes at least one extra step prior to every write operation: either going ahead and doing an adjustment pass over the whole data like you described, or something like

any(vapply(dat, function(x) any(is.nan(x)), logical(1)))

# or

purrr::map_lgl(dat, ~ any(is.nan(.x))) %>% any()

just to see whether any fix is needed. Given that there would be additional steps if the above returns TRUE, in most cases I may as well just run something like your lapply code regardless. But if I, as a user, end up needing to consistently write custom code so another function can be trustworthy and viable, isn't that a strong argument that the function has something that needs addressing?

I'm guessing that it has something to do with NaN being an IEEE 754-defined value, the fact that this is all being passed to C for speed, and a situation where internally implementing something like what you described prior to the .Call() would undermine the speed case. If so, I get the thinking. The additional speed is nice, as are the other features like sensible defaults and pipe suitability. Still, it's just really strange to see an R function trying to improve on a basic workflow task while introducing such a low-level deviation from R-like behavior (and, more practically, deviating from the explicit behavior of its predecessor write.csv()).

And the behavior isn't even documented: I had to do my own manual investigation when my files weren't readable by another program (Mplus, in my case). At the absolute least, surely someone could cut down on future users' confusion with an update to the help page?
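In the meantime, the conversion and the write can be bundled into a small wrapper so the check never has to be repeated by hand. This is only a sketch; write_csv_nan is a hypothetical helper name, not part of readr:

```r
library(readr)

# Hypothetical wrapper (not part of readr): blank out NaNs before
# delegating to write_csv(), so the `na` argument covers them too.
write_csv_nan <- function(x, path, na = "NA", ...) {
  x[] <- lapply(x, function(col) {
    if (is.double(col)) col[is.nan(col)] <- NA
    col
  })
  readr::write_csv(x, path, na = na, ...)
}

# Usage, with the reprex's dat:
# write_csv_nan(dat, "./file.csv", na = "XXX")
```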
