read_csv() creates "spec" attributes which don't get updated nor lost by some tibble transformations #934

prosoitos · 2018-12-05T03:58:43Z

read_csv() (and probably the other read_delim functions, though I haven't tested them) creates "spec" attributes which are neither updated nor dropped when the tibble gets transformed, and which can thus end up in total mismatch with the tibble they are associated with.

This does not create any problem, but it makes for very weird outputs to str() and creates unnecessarily lengthy outputs to dput().

Example:

I read a .csv file with read_csv() to create a tibble. I change its names, I select one variable out, I replace the remaining variables with vectors of different types and I end up with this tibble:

tbl <- structure(list(
  A = c("a", "b", "c"),
  B = 1:3),
  row.names = c(NA, -3L),
  spec = structure(list(
    cols = list(
      date = structure(list(format = ""), class = c("collector_date", "collector")),
      species = structure(list(), class = c("collector_character", "collector")),
      abundance = structure(list(), class = c("collector_double", "collector"))),
    default = structure(list(), class = c("collector_guess", "collector"))),
    class = "col_spec"), class = c("tbl_df", "tbl", "data.frame"))

So now, this is my tibble:

tbl
#>   A B
#> 1 a 1
#> 2 b 2
#> 3 c 3

And it gives these str() and dput() outputs:

str(tbl)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    3 obs. of  2 variables:
#>  $ A: chr  "a" "b" "c"
#>  $ B: int  1 2 3
#>  - attr(*, "spec")=List of 2
#>   ..$ cols   :List of 3
#>   .. ..$ date     :List of 1
#>   .. .. ..$ format: chr ""
#>   .. .. ..- attr(*, "class")= chr  "collector_date" "collector"
#>   .. ..$ species  : list()
#>   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
#>   .. ..$ abundance: list()
#>   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
#>   ..$ default: list()
#>   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
#>   ..- attr(*, "class")= chr "col_spec"

(Now the "spec" attribute has a wrong number of variables, wrong variable names, and wrong variable types 🤣 )

dput(tbl)
#> structure(list(A = c("a", "b", "c"), B = 1:3), row.names = c(NA, 
#> -3L), spec = structure(list(cols = list(date = structure(list(
#>     format = ""), class = c("collector_date", "collector")), 
#>     species = structure(list(), class = c("collector_character", 
#>     "collector")), abundance = structure(list(), class = c("collector_double", 
#>     "collector"))), default = structure(list(), class = c("collector_guess", 
#> "collector"))), class = "col_spec"), class = c("tbl_df", "tbl", 
#> "data.frame"))

(That's a long dput() output for such a simple tibble 😳 )

Of course, using the result of dput() to create the object in the first place in this example makes it look very circular and kind of silly, but I have to do that to demonstrate the idea without forcing you to download a .csv file or create one. But this is not just a theoretical point: a real life and very common scenario where this kicks in is this:

You create a tibble with read_csv(), you want to create a reprex, so you transform your tibble to make it very simple, then you use dput() to create the data of your very basic tibble and you end up with a ton of silly attributes in the output. Of course, you can simply get rid of the "spec" attribute from the result of dput() and all is good:

structure(list(
  A = c("a", "b", "c"),
  B = 1:3),
  row.names = c(NA, -3L),
  class = c("tbl_df", "tbl", "data.frame"))

gives the same tibble and it is easy enough to do. And even if you don't, those wrong attributes don't get in the way of anything. So not a big deal. But because it is somewhat hard to get new users to create good reproducible examples, it is a little quirk that doesn't help with explaining dput().
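
Instead of editing the dput() output by hand, the attribute can also be removed from the object before calling dput(). The following is only a minimal sketch in base R: it assumes a plain data frame, and the "spec" value is a made-up placeholder rather than a real col_spec object from readr.

```r
# A small data frame carrying a stand-in "spec" attribute.
# The attribute value is a hypothetical placeholder, not a real readr col_spec.
df <- data.frame(A = c("a", "b", "c"), B = 1:3, stringsAsFactors = FALSE)
attr(df, "spec") <- list(cols = list(date = "placeholder"))

# Setting an attribute to NULL removes it.
attr(df, "spec") <- NULL

# dput() now only has the data, the names, the class and the row names to print.
dput(df)
```

The same `attr(x, "spec") <- NULL` assignment should work on a tibble, since it is an ordinary attribute assignment.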


jennybc · 2018-12-05T07:05:01Z

As for the "making a reprex" side of this ...

I would define a small tibble inline:

library(tidyverse)

tbl <- tibble(
  A = c("a", "b", "c"),
  B = 1:3
)
## or
tbl <- tribble(
  ~A, ~B,
  "a", 1,
  "b", 2,
  "c", 3
)

For larger tibbles, there are nicer ways than dput() to get the code needed to define them.
Let’s use the mtcars example that ships with readr:

x <- read_csv(readr_example("mtcars.csv"))
#> Parsed with column specification:
#> cols(
#>   mpg = col_double(),
#>   cyl = col_double(),
#>   disp = col_double(),
#>   hp = col_double(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double(),
#>   vs = col_double(),
#>   am = col_double(),
#>   gear = col_double(),
#>   carb = col_double()
#> )

x2 <- x %>% 
  select(mpg, cyl, disp) %>% 
  head(3)

Yes, it is true that attributes defined at read time persist after data manipulation.

str(x2)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    3 obs. of  3 variables:
#>  $ mpg : num  21 21 22.8
#>  $ cyl : num  6 6 4
#>  $ disp: num  160 160 108
#>  - attr(*, "spec")=
#>   .. cols(
#>   ..   mpg = col_double(),
#>   ..   cyl = col_double(),
#>   ..   disp = col_double(),
#>   ..   hp = col_double(),
#>   ..   drat = col_double(),
#>   ..   wt = col_double(),
#>   ..   qsec = col_double(),
#>   ..   vs = col_double(),
#>   ..   am = col_double(),
#>   ..   gear = col_double(),
#>   ..   carb = col_double()
#>   .. )

datapasta and deparse both offer ways to get nice code for x2 – much better than dput().

x2_tribble_source <- datapasta::tribble_construct(x2)
cat(x2_tribble_source)
#> tibble::tribble(
#>   ~mpg, ~cyl, ~disp,
#>     21,    6,   160,
#>     21,    6,   160,
#>   22.8,    4,   108
#>   )

x2_df_source <- datapasta::df_construct(x2)
cat(x2_df_source)
#> data.frame(
#>          mpg = c(21, 21, 22.8),
#>          cyl = c(6, 6, 4),
#>         disp = c(160, 160, 108)
#> )

Created on 2018-12-04 by the reprex package (v0.2.1.9000)

prosoitos · 2018-12-05T08:03:17Z

Thank you very much Jenny for your thoughts.

I would define a small tibble inline:

Of course. Me too. The only reason I used dput() to create this small tibble was in order to use read_csv(). This is very artificial for the purpose of this issue.

For larger tibbles, there are nicer ways than dput() to get the code needed to define them.

The problem wasn't the size of the tibble, but using read_csv(). But I am extremely thankful for your comment for 2 reasons:

I don't think enough of using the datasets that come with R and/or packages when putting reproducible examples together. So this is a great reminder to use them!
Because I haven't built any package and because I am still quite new to R, while I have used mtcars plenty before, I would have never thought of using it in this way 🙂 Shame on me for my naivety, but I hadn't really thought about the form in which the data was stored in the package (an actual .csv file...) and that this .csv file could be used to make a reprex with read_csv()! (I had never paid attention to the readr_example() function.) I feel silly, but this has been enlightening to me. Thank you!
(Edit: Oh, I see now... I just realized that mtcars is part of the base R package datasets, in some binary form, but that readr ships with a couple of datasets in various text formats for exactly the kind of situation I was in with this issue... all this makes more sense now. Really cool. I had missed that whole part of the readr package).

datapasta and deparse both offer ways to get nice code for x2 – much better than dput()

Ah! Great! I was playing with dput() because I am putting a little workshop together and here again, I am really thankful for your comment!

deparse::deparsec(tbl)
#> tibble(A = c("a", "b", "c"), B = 1:3)

is certainly a lot nicer than my previous dput(tbl) and its crazy output with all the carry-over attributes.

datapasta::dmdclip(tbl) and datapasta::dpasta(tbl) are really sweet too!

Definitely updating my workshop right now!

All that said, and even if, with those great alternatives to dput() it matters even less, those carry-over "spec" attributes are still a little surprising when looking at the str() of a tibble. Maybe not worth the trouble to worry about it though (?)

Anyway, thank you very sincerely. This has been very helpful for me.

jimhester · 2018-12-05T14:09:21Z

tibble methods preserving additional attributes is very new; it seems to have been added in tidyverse/tibble@2cabe6d#diff-ccca386aac53cf0029fb15ebff8901d5, which is not yet on CRAN.

The original behavior was they were lost as soon as you performed a manipulation.

Anyway spec is meant to store how the data was originally read by readr, not how it currently looks, so even if it is preserved by further manipulations I don't think there is an issue.
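
For reference, the stored specification can be read back with readr's spec() accessor. This is a small sketch that assumes readr is installed and uses the mtcars.csv example file that ships with it; the variable names are only illustrative.

```r
library(readr)

# Read the example file bundled with readr.
x <- read_csv(readr_example("mtcars.csv"))

# spec() returns the column specification recorded at read time,
# i.e. how the file was parsed, not how the object looks now.
s <- spec(x)
print(s)
```

Because the specification describes the read step, it is expected to list every column of the file, whatever is done to the object afterwards.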

jennybc · 2018-12-05T17:23:23Z

Continuing with the "reprex tips" re: workshop, both read.csv() and read_csv() also support the inline provision of what would normally be in the file. It's terribly hard to read though, so is only relevant for a very small example where it is somehow important to use read.csv() or read_csv().

read.csv(text = "A,B\na,1\nb,2\nc,3")
#>   A B
#> 1 a 1
#> 2 b 2
#> 3 c 3

readr::read_csv("A,B\na,1\nb,2\nc,3")
#> # A tibble: 3 x 2
#>   A         B
#>   <chr> <dbl>
#> 1 a         1
#> 2 b         2
#> 3 c         3

Created on 2018-12-05 by the reprex package (v0.2.1.9000)

You might want to favour datapasta over deparse, because datapasta is on CRAN.

jimhester · 2018-12-05T18:09:48Z

This is going truly far afield, but you can use glue::trim() for this as well, e.g.

readr::read_csv(glue::trim("
A,B
a,1
b,2
c,3
"))
#> # A tibble: 3 x 2
#>   A         B
#>   <chr> <dbl>
#> 1 a         1
#> 2 b         2
#> 3 c         3

Created on 2018-12-05 by the reprex package (v0.2.1)

prosoitos · 2018-12-05T22:10:18Z

tibble methods preserving additional attributes is very new; it seems to have been added in tidyverse/tibble@2cabe6d#diff-ccca386aac53cf0029fb15ebff8901d5, which is not yet on CRAN.

Weird... here is the output of sessioninfo::session_info() for the tibble package I am running (sorry, I should have included this in the issue):

 tibble      * 1.4.2      2018-01-22 [2] CRAN (R 3.5.0)

I don't think I am running the devel version...

Anyway, it was not important as you said and thanks for fixing it!

prosoitos · 2018-12-05T22:21:58Z

Thank you Jenny for the additional tips.

You might want to favour datapasta over deparse, because datapasta is on CRAN.

Both are amazing. But I was thinking of using deparse for my workshop actually... After playing with both for a bit, I thought that it was particularly clean and simple.

Of all the options, this was the one I had settled on:

tbl <- tibble::tibble(
  A = c("a", "b", "c"),
  B = 1:3
)

deparse::deparsec(tbl)
#> tibble(A = c("a", "b", "c"), B = 1:3)

prosoitos · 2018-12-05T22:24:34Z

This is going truly far afield, but you can use glue::trim() for this as well

I won't use this for my workshop, but this can be handy to create toy examples quickly. Thanks 🙂

lock · 2019-06-03T22:50:07Z

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

prosoitos changed the title ~~read_csv() creates "spec" attributes which don't get updated nor lost by tibble transformations~~ read_csv() creates "spec" attributes which don't get updated nor lost by some tibble transformations Dec 5, 2018

jimhester added a commit that referenced this issue Dec 5, 2018

Add a subset method for spec_tbl_df objects

136a7b1

To ensure the spec is dropped once they are subset. Fixes #934
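
The idea behind such a subset method can be sketched in base R. This is an illustration under stated assumptions, not the code from the commit: the class name spec_df_demo is hypothetical (chosen so it does not clash with readr's spec_tbl_df), and the "spec" value is a placeholder string.

```r
# Hypothetical S3 class "spec_df_demo": its `[` method delegates to the next
# method (here `[.data.frame`), then removes the stale attribute and the marker class.
`[.spec_df_demo` <- function(x, ...) {
  out <- NextMethod()
  attr(out, "spec") <- NULL
  class(out) <- setdiff(class(out), "spec_df_demo")
  out
}

# A data frame tagged with the demo class and a stand-in "spec" attribute.
df <- data.frame(A = c("a", "b", "c"), B = 1:3)
attr(df, "spec") <- "stand-in for a col_spec object"
class(df) <- c("spec_df_demo", class(df))

# Subsetting goes through the method above, so the result no longer
# carries the attribute or the marker class; the original is untouched.
sub <- df[1:2, ]
is.null(attr(sub, "spec"))
inherits(sub, "spec_df_demo")
```

Dropping the marker class along with the attribute means later subsets of the result fall straight through to the ordinary data frame method.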

jimhester mentioned this issue Dec 5, 2018

Add a subset method for spec_tbl_df objects #936

Merged

jimhester closed this as completed in #936 Dec 5, 2018

jimhester added a commit that referenced this issue Dec 5, 2018

Add a subset method for spec_tbl_df objects

e94b8bc

To ensure the spec is dropped once they are subset. Fixes #934

lock bot locked and limited conversation to collaborators Jun 3, 2019
