read_csv() creates "spec" attributes which don't get updated nor lost by some tibble transformations #934

Closed
prosoitos opened this issue Dec 5, 2018 · 9 comments

Comments

@prosoitos

prosoitos commented Dec 5, 2018

read_csv() (and probably the other read_delim() functions, though I haven't tested them) creates a "spec" attribute that is neither updated nor dropped when the tibble is transformed, so it can end up completely out of sync with the tibble it is attached to.

This does not break anything, but it makes for very odd str() output and unnecessarily long dput() output.


Example:

I read a .csv file with read_csv() to create a tibble. I change its names, I select one variable out, I replace the remaining variables with vectors of different types and I end up with this tibble:

tbl <- structure(list(
  A = c("a", "b", "c"),
  B = 1:3),
  row.names = c(NA, -3L),
  spec = structure(list(
    cols = list(
      date = structure(list(format = ""), class = c("collector_date", "collector")),
      species = structure(list(), class = c("collector_character", "collector")),
      abundance = structure(list(), class = c("collector_double", "collector"))),
    default = structure(list(), class = c("collector_guess", "collector"))),
    class = "col_spec"), class = c("tbl_df", "tbl", "data.frame"))

So now, this is my tibble:

tbl
#>   A B
#> 1 a 1
#> 2 b 2
#> 3 c 3

And it gives these str() and dput() outputs:

str(tbl)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    3 obs. of  2 variables:
#>  $ A: chr  "a" "b" "c"
#>  $ B: int  1 2 3
#>  - attr(*, "spec")=List of 2
#>   ..$ cols   :List of 3
#>   .. ..$ date     :List of 1
#>   .. .. ..$ format: chr ""
#>   .. .. ..- attr(*, "class")= chr  "collector_date" "collector"
#>   .. ..$ species  : list()
#>   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
#>   .. ..$ abundance: list()
#>   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
#>   ..$ default: list()
#>   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
#>   ..- attr(*, "class")= chr "col_spec"

(Now the "spec" attribute has a wrong number of variables, wrong variable names, and wrong variable types 🤣 )

dput(tbl)
#> structure(list(A = c("a", "b", "c"), B = 1:3), row.names = c(NA, 
#> -3L), spec = structure(list(cols = list(date = structure(list(
#>     format = ""), class = c("collector_date", "collector")), 
#>     species = structure(list(), class = c("collector_character", 
#>     "collector")), abundance = structure(list(), class = c("collector_double", 
#>     "collector"))), default = structure(list(), class = c("collector_guess", 
#> "collector"))), class = "col_spec"), class = c("tbl_df", "tbl", 
#> "data.frame"))

(That's a long dput() output for such a simple tibble 😳 )

Of course, using the result of dput() to create the object in the first place in this example makes it look very circular and kind of silly, but I have to do that to demonstrate the idea without forcing you to download a .csv file or create one. But this is not just a theoretical point: a real life and very common scenario where this kicks in is this:

You create a tibble with read_csv(), you want to create a reprex, so you transform your tibble to make it very simple, then you use dput() to create the data of your very basic tibble and you end up with a ton of silly attributes in the output. Of course, you can simply get rid of the "spec" attribute from the result of dput() and all is good:

structure(list(
  A = c("a", "b", "c"),
  B = 1:3),
  row.names = c(NA, -3L),
  class = c("tbl_df", "tbl", "data.frame"))

gives the same tibble and it is easy enough to do. And even if you don't, those wrong attributes don't get in the way of anything. So not a big deal. But because it is somewhat hard to get new users to create good reproducible examples, it is a little quirk that doesn't help with explaining dput().
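For what it's worth, the attribute can also be dropped from the object before calling dput(), rather than edited out of the output afterwards. A minimal sketch in base R, assuming a plain data frame with a made-up placeholder "spec" attribute standing in for the real col_spec object that read_csv() attaches:

```r
# Stand-in for a tibble created by read_csv(): a data frame carrying a
# stale "spec" attribute. The attribute content here is a placeholder,
# not a real readr col_spec object.
tbl <- structure(
  list(A = c("a", "b", "c"), B = 1:3),
  row.names = c(NA, -3L),
  spec = list(cols = list(date = "stale"), default = "stale"),
  class = "data.frame"
)

# Assigning NULL removes the attribute; the data itself is untouched.
attr(tbl, "spec") <- NULL

# dput() now only has the columns, row names and class to report.
dput(tbl)
```

The same assignment should work on a real tibble, since attr<-() is base R and does not depend on the class of the object.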

@prosoitos prosoitos changed the title read_csv() creates "spec" attributes which don't get updated nor lost by tibble transformations read_csv() creates "spec" attributes which don't get updated nor lost by some tibble transformations Dec 5, 2018
@jennybc
Member

jennybc commented Dec 5, 2018

As for the "making a reprex" side of this ...

I would define a small tibble inline:

library(tidyverse)

tbl <- tibble(
  A = c("a", "b", "c"),
  B = 1:3
)
## or
tbl <- tribble(
  ~A, ~B,
  "a", 1,
  "b", 2,
  "c", 3
)

For larger tibbles, there are nicer ways than dput() to get the code needed to define them.
Let’s use the mtcars example that ships with readr:

x <- read_csv(readr_example("mtcars.csv"))
#> Parsed with column specification:
#> cols(
#>   mpg = col_double(),
#>   cyl = col_double(),
#>   disp = col_double(),
#>   hp = col_double(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double(),
#>   vs = col_double(),
#>   am = col_double(),
#>   gear = col_double(),
#>   carb = col_double()
#> )

x2 <- x %>% 
  select(mpg, cyl, disp) %>% 
  head(3)

Yes, it is true that attributes defined at read time persist after data manipulation.

str(x2)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    3 obs. of  3 variables:
#>  $ mpg : num  21 21 22.8
#>  $ cyl : num  6 6 4
#>  $ disp: num  160 160 108
#>  - attr(*, "spec")=
#>   .. cols(
#>   ..   mpg = col_double(),
#>   ..   cyl = col_double(),
#>   ..   disp = col_double(),
#>   ..   hp = col_double(),
#>   ..   drat = col_double(),
#>   ..   wt = col_double(),
#>   ..   qsec = col_double(),
#>   ..   vs = col_double(),
#>   ..   am = col_double(),
#>   ..   gear = col_double(),
#>   ..   carb = col_double()
#>   .. )

datapasta and deparse both offer ways to get nice code for x2 – much better than dput().

x2_tribble_source <- datapasta::tribble_construct(x2)
cat(x2_tribble_source)
#> tibble::tribble(
#>   ~mpg, ~cyl, ~disp,
#>     21,    6,   160,
#>     21,    6,   160,
#>   22.8,    4,   108
#>   )

x2_df_source <- datapasta::df_construct(x2)
cat(x2_df_source)
#> data.frame(
#>          mpg = c(21, 21, 22.8),
#>          cyl = c(6, 6, 4),
#>         disp = c(160, 160, 108)
#> )

Created on 2018-12-04 by the reprex package (v0.2.1.9000)

@prosoitos
Author

prosoitos commented Dec 5, 2018

Thank you very much Jenny for your thoughts.

I would define a small tibble inline:

Of course. Me too. The only reason I used dput() to create this small tibble was in order to use read_csv(). This is very artificial for the purpose of this issue.


For larger tibbles, there are nicer ways than dput() to get the code needed to define them.

The problem wasn't the size of the tibble, but using read_csv(). But I am extremely thankful for your comment for 2 reasons:

  • I don't think enough of using the datasets that come with R and/or packages when putting reproducible examples together. So this is a great reminder to use them!
  • Because I haven't built any package and because I am still quite new to R, while I have used mtcars plenty before, I would have never thought of using it in this way 🙂 Shame on me for my naivety, but I hadn't really thought about the form in which the data was stored in the package (an actual .csv file...) and that this .csv file could be used to make a reprex with read_csv()! (I had never paid attention to the readr_example() function.) I feel silly, but this has been enlightening to me. Thank you!
    (Edit: Oh, I see now... I just realized that mtcars is part of the base R package datasets, in some binary form, but that readr ships with a couple of datasets in various text formats for exactly the kind of situation I was in with this issue... all this makes more sense now. Really cool. I had missed that whole part of the readr package).

datapasta and deparse both offer ways to get nice code for x2 – much better than dput()

Ah! Great! I was playing with dput() because I am putting a little workshop together and here again, I am really thankful for your comment!

deparse::deparsec(tbl)
#> tibble(A = c("a", "b", "c"), B = 1:3)

is certainly a lot nicer than my previous dput(tbl) and its crazy output with all the carry-over attributes.

datapasta::dmdclip(tbl) and datapasta::dpasta(tbl) are really sweet too!

Definitely updating my workshop right now!


All that said, and even if, with those great alternatives to dput() it matters even less, those carry-over "spec" attributes are still a little surprising when looking at the str() of a tibble. Maybe not worth the trouble to worry about it though (?)
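If the goal is only a quieter str(), its give.attr argument hides attributes without modifying the object. A small sketch, again using a plain data frame with a placeholder "spec" attribute (an assumption for illustration, not what readr actually stores):

```r
# Data frame with a placeholder "spec" attribute, standing in for a
# tibble that was read with read_csv() and then transformed.
tbl <- structure(
  list(A = c("a", "b", "c"), B = 1:3),
  row.names = c(NA, -3L),
  spec = list(cols = list(date = "stale")),
  class = "data.frame"
)

# give.attr = FALSE suppresses the attr(*, ...) lines in the display;
# the attribute itself stays on the object.
str(tbl, give.attr = FALSE)
```

This only changes what is displayed, so dput() would still include the attribute.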

Anyway, thank you very sincerely. This has been very helpful for me.

@jimhester
Collaborator

tibble methods preserving additional attributes is very new; it seems to have been added in tidyverse/tibble@2cabe6d#diff-ccca386aac53cf0029fb15ebff8901d5, which is not yet on CRAN.

The original behavior was that they were lost as soon as you performed a manipulation.

Anyway, spec is meant to store how the data was originally read by readr, not how it currently looks, so even if it is preserved by further manipulations I don't think there is an issue.

@jennybc
Member

jennybc commented Dec 5, 2018

Continuing with the "reprex tips" re: workshop, both read.csv() and read_csv() also support the inline provision of what would normally be in the file. It's terribly hard to read though, so is only relevant for a very small example where it is somehow important to use read.csv() or read_csv().

read.csv(text = "A,B\na,1\nb,2\nc,3")
#>   A B
#> 1 a 1
#> 2 b 2
#> 3 c 3

readr::read_csv("A,B\na,1\nb,2\nc,3")
#> # A tibble: 3 x 2
#>   A         B
#>   <chr> <dbl>
#> 1 a         1
#> 2 b         2
#> 3 c         3

Created on 2018-12-05 by the reprex package (v0.2.1.9000)

You might want to favour datapasta over deparse, because datapasta is on CRAN.

@jimhester
Collaborator

This is going truly far afield, but you can use glue::trim() for this as well, e.g.

readr::read_csv(glue::trim("
A,B
a,1
b,2
c,3
"))
#> # A tibble: 3 x 2
#>   A         B
#>   <chr> <dbl>
#> 1 a         1
#> 2 b         2
#> 3 c         3

Created on 2018-12-05 by the reprex package (v0.2.1)

jimhester added a commit that referenced this issue Dec 5, 2018
To ensure the spec is dropped once they are subset.

Fixes #934
@prosoitos
Author

prosoitos commented Dec 5, 2018

tibble methods preserving additional attributes is very new; it seems to have been added in tidyverse/tibble@2cabe6d#diff-ccca386aac53cf0029fb15ebff8901d5, which is not yet on CRAN.

Weird... here is the output of sessioninfo::session_info() for the tibble package I am running (sorry, I should have included this in the issue):

 tibble      * 1.4.2      2018-01-22 [2] CRAN (R 3.5.0)                  

I don't think I am running the devel version...

Anyway, it was not important as you said and thanks for fixing it!

@prosoitos
Author

Thank you Jenny for the additional tips.

You might want to favour datapasta over deparse, because datapasta is on CRAN.

Both are amazing. But I was thinking of using deparse for my workshop actually... After playing with both for a bit, I thought that it was particularly clean and simple.

Of all the options, this was the one I had settled on:

tbl <- tibble::tibble(
  A = c("a", "b", "c"),
  B = 1:3
)

deparse::deparsec(tbl)
#> tibble(A = c("a", "b", "c"), B = 1:3)

@prosoitos
Author

This is going truly far afield, but you can use glue::trim() for this as well

I won't use this for my workshop, but this can be handy to create toy examples quickly. Thanks 🙂

@lock

lock bot commented Jun 3, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Jun 3, 2019