
write_csv (and any other in the write_delim family) can't handle NaN values #1082

Closed
jmobrien opened this issue Mar 12, 2020 · 2 comments · Fixed by sthagen/tidyverse-readr#2

@jmobrien

Hi, I'm working with readr version 1.3.1. In this and previous versions, I've found that whenever there are NaNs in the data, write_csv presents no options for handling them. They are always written out as NaN, no matter what is passed to the parameter "na".

Within R, of course, NaNs function as a kind of NA, and can sometimes emerge organically from normal operations where "true" NAs are present, as shown in the reprex.

The complementary reading functions within R (read_csv and read.csv) can recognize the NaNs and thus preserve them and any expected behavior, but this presents problems for moving data to any outside environment where the user would expect that data to be marked as missing.

I could see preserving NaNs for some users if that avoids information loss in their data, but I'd love to at least have a choice of behavior. And if I'm being opinionated, I think the reasonable default behavior would handle NaNs uniformly with all the other classes/types of NA when writing, fitting the expectations of most users, who are reasoning from the fact that is.na(NaN) evaluates to TRUE.

library(tibble)
library(dplyr)
library(readr)

is.na(NaN)  # TRUE -- NaNs are considered missing data

# Data with some NAs and NaNs
dat <-
  tibble(a = c(NA, 1:10),
         b = c(NA, 11:20),
         c = c(NA, 21:25, NaN, 27:30),
         d = c(NA, 31:35, NA, 37:40)
  ) %>%
  # NaNs can derive from NAs in some operations:
  mutate(e = rowMeans(., na.rm = TRUE))

# write_csv writes the NaNs out verbatim:
write_csv(dat, "./file.csv")

# Explicitly specifying the na parameter does nothing for them:
write_csv(dat, na = "XXX", "./file2.csv")
@jimhester (Collaborator) commented Mar 13, 2020

I don't think the readr behavior is likely to change. However, you can convert the NaNs to normal NAs yourself prior to writing if this is the behavior you prefer.

dat <-
  tibble::tibble(a = c(NA, 1:10),
                 b = c(NA, 11:20),
                 c = c(NA, 21:25, NaN, 27:30),
                 d = c(NA, 31:35, NA, 37:40)
  )

dat[] <- lapply(dat, function(x) { x[is.nan(x)] <- NA; x })
dat
#> # A tibble: 11 x 4
#>        a     b     c     d
#>    <int> <int> <dbl> <int>
#>  1    NA    NA    NA    NA
#>  2     1    11    21    31
#>  3     2    12    22    32
#>  4     3    13    23    33
#>  5     4    14    24    34
#>  6     5    15    25    35
#>  7     6    16    NA    NA
#>  8     7    17    27    37
#>  9     8    18    28    38
#> 10     9    19    29    39
#> 11    10    20    30    40

Created on 2020-03-13 by the reprex package (v0.3.0)
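For those already using dplyr, the same conversion can be expressed with mutate(across(...)). This is only a sketch, assuming dplyr >= 1.0.0 (where across() was introduced); is.nan() returns FALSE for every element of non-double columns, so applying it across everything() is safe for mixed column types:

```r
library(dplyr)

dat <- tibble::tibble(
  a = c(NA, 1:10),
  c = c(NA, 21:25, NaN, 27:30)
)

# Replace NaN with NA in every column; is.nan() is FALSE for all
# elements of non-double columns, so mixed types are handled safely.
dat_clean <- dat %>%
  mutate(across(everything(), ~ replace(.x, is.nan(.x), NA)))

any(vapply(dat_clean, function(x) any(is.nan(x)), logical(1)))
#> FALSE
```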

@jmobrien (Author)

Of course it's possible to work around this. But that assumes at least one extra step prior to every write operation: either going ahead and doing an adjustment pass over the whole data like you described, or something like

any(vapply(dat, function(x) any(is.nan(x)), logical(1)))

# or

purrr::map_lgl(dat, ~ any(is.nan(.x))) %>% any()

just to see whether any fix is needed. Given that there would be additional steps if the above returns TRUE, in most cases I may as well just run something like your lapply code regardless. But if I, as a user, end up needing to consistently write custom code so another function can be trustworthy and viable, isn't that a strong argument that the function has something that needs addressing?

I'm guessing that it has something to do with NaN being an IEEE 754-defined value, the fact that this is all being passed to C for speed, and a situation where internally implementing something like what you described prior to the .Call() would undermine the speed case. If so, I get the thinking. The additional speed is nice, as are the other features like sensible defaults and pipe suitability. Still, it's just really strange to see an R function trying to improve on a basic workflow task while introducing such a low-level deviation from R-like behavior (and, more practically, deviating from the explicit behavior of its predecessor write.csv()).

And the behavior isn't even documented: I had to do my own manual investigation when my files weren't readable by another program (Mplus, in my case). At the absolute least, surely someone could cut down on future users' confusion with an update to the help page?
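In the meantime, the conversion and the write can be bundled into a small wrapper so the check never has to be repeated by hand. This is only a sketch; write_csv_nan is a hypothetical helper name, not part of readr:

```r
library(readr)

# Hypothetical wrapper (not part of readr): blank out NaNs before
# delegating to write_csv(), so the `na` argument covers them too.
write_csv_nan <- function(x, path, na = "NA", ...) {
  x[] <- lapply(x, function(col) {
    if (is.double(col)) col[is.nan(col)] <- NA
    col
  })
  readr::write_csv(x, path, na = na, ...)
}

# Usage, with the reprex's dat:
# write_csv_nan(dat, "./file.csv", na = "XXX")
```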
