incorrect ordering when having multiple across calls inside arrange #6538

Closed
mgacc0 opened this issue Nov 14, 2022 · 6 comments

mgacc0 commented Nov 14, 2022

There's something fishy here...
With multiple across() calls inside arrange(), the ordering is sometimes correct and sometimes not:

  library(tidyverse)

  df <- tribble(
    ~other_text, ~categ_1, ~categ_2, ~points_1, ~points_2, ~total,
    "x",      "A",      "B",       22L,       20L,    42L,
    "z",      "A",      "B",       20L,       22L,    42L,
    "y",      "A",      "B",       22L,       20L,    42L
  )
  df
#> # A tibble: 3 × 6
#>   other_text categ_1 categ_2 points_1 points_2 total
#>   <chr>      <chr>   <chr>      <int>    <int> <int>
#> 1 x          A       B             22       20    42
#> 2 z          A       B             20       22    42
#> 3 y          A       B             22       20    42
  
  set.seed(3)
  purrr::map_lgl(1:20,
                 ~ identical(
                   df %>%
                     slice_sample(n = nrow(.)) %>%
                     arrange(desc(total),
                             categ_1, categ_2,
                             desc(points_1), desc(points_2)),
                   df %>%
                     slice_sample(n = nrow(.)) %>%
                     arrange(desc(total),
                             across(starts_with("categ_")),
                             across(starts_with("points_"), desc))
                 ))
#>  [1] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE
#> [13]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE

mgacc0 commented Nov 14, 2022

  sessioninfo::package_info("attached")
#>  package   * version date (UTC) lib source
#>  dplyr     * 1.0.10  2022-09-01 [1] CRAN (R 4.2.1)
#>  forcats   * 0.5.2   2022-08-19 [1] CRAN (R 4.2.1)
#>  ggplot2   * 3.4.0   2022-11-04 [1] CRAN (R 4.2.2)
#>  purrr     * 0.3.5   2022-10-06 [1] CRAN (R 4.2.1)
#>  readr     * 2.1.3   2022-10-01 [1] CRAN (R 4.2.1)
#>  stringr   * 1.4.1   2022-08-20 [1] CRAN (R 4.2.1)
#>  tibble    * 3.1.8   2022-07-22 [1] CRAN (R 4.2.1)
#>  tidyr     * 1.2.1   2022-09-08 [1] CRAN (R 4.2.1)
#>  tidyverse * 1.3.2   2022-07-18 [1] CRAN (R 4.2.2)
#> 

DavisVaughan commented Nov 14, 2022

Nothing fishy here. You can reproduce this with two identical calls to arrange() that don't use across():

library(tidyverse)

df <- tribble(
  ~other_text, ~categ_1, ~categ_2, ~points_1, ~points_2, ~total,
  "x",      "A",      "B",       22L,       20L,    42L,
  "z",      "A",      "B",       20L,       22L,    42L,
  "y",      "A",      "B",       22L,       20L,    42L
)

set.seed(3)
purrr::map_lgl(1:20,
               ~ identical(
                 df %>%
                   slice_sample(n = nrow(.)) %>%
                   arrange(desc(total),
                           categ_1, categ_2,
                           desc(points_1), desc(points_2)),
                 df %>%
                   slice_sample(n = nrow(.)) %>%
                   arrange(desc(total),
                           categ_1, categ_2,
                           desc(points_1), desc(points_2))
               ))
#>  [1] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE
#> [13]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE

The problem is that you have two rows that are identical in every column except other_text, which you aren't ordering by. So in some cases, after sampling the rows, other_text = "x" comes first, and in other cases other_text = "y" comes first. Since every other column is the same for those rows, arrange() just leaves them in whatever order it got from the sampling, so sometimes the two results don't match up.
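
A minimal sketch of one way to make the comparison deterministic, assuming other_text is an acceptable tie-breaker. That choice and the helper name sort_rows are illustrative and not from the thread. With other_text as the last ordering key no rows are tied, so the result should no longer depend on the order produced by slice_sample(). Explicit column names are used instead of across() to keep the sketch independent of the dplyr version.

```r
library(dplyr)

df <- tibble::tribble(
  ~other_text, ~categ_1, ~categ_2, ~points_1, ~points_2, ~total,
  "x",      "A",      "B",       22L,       20L,    42L,
  "z",      "A",      "B",       20L,       22L,    42L,
  "y",      "A",      "B",       22L,       20L,    42L
)

# Hypothetical helper: same ordering keys as in the thread, plus
# other_text as a final tie-breaker so that no two rows compare equal.
sort_rows <- function(data) {
  data %>%
    arrange(desc(total),
            categ_1, categ_2,
            desc(points_1), desc(points_2),
            other_text)
}

set.seed(3)
# Shuffle the rows twice, sort both copies, and compare; repeat 20 times.
matches <- vapply(1:20, function(i) {
  identical(
    sort_rows(slice_sample(df, n = nrow(df))),
    sort_rows(slice_sample(df, n = nrow(df)))
  )
}, logical(1))

all_match <- all(matches)
all_match
```

Whether other_text is a meaningful tie-breaker depends on the real data; any column (or row id) that distinguishes the otherwise identical rows would serve the same purpose.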

mgacc0 commented Nov 14, 2022

Sorry @DavisVaughan, I was trying to write a reproducible example that was simpler than the real one.
My real case is something like this:

library(tidyverse)
df <- tibble::tribble(
  ~resultado_1, ~resultado_2, ~resultado_3, ~nota_1, ~nota_2,
  "SU",         "SU",           "",    26.3,   18.75,
  "SU",         "SU",           "",    28.3,    22.5
)
df
#> # A tibble: 2 × 5
#>   resultado_1 resultado_2 resultado_3 nota_1 nota_2
#>   <chr>       <chr>       <chr>        <dbl>  <dbl>
#> 1 SU          SU          ""            26.3   18.8
#> 2 SU          SU          ""            28.3   22.5
identical(
  df %>%
    mutate(across(
      starts_with("resultado_"),
      ~ case_when(. %in% c("SU", "EX") ~ 1L,
                  . == "NS" ~ 2L,
                  . == "" ~ 3L)
    )) %>%
    arrange(resultado_1, resultado_2, resultado_3,
            desc(nota_1), desc(nota_2)),
  df %>%
    arrange(
      across(starts_with("resultado_"), ~ case_when(
        . %in% c("SU", "EX") ~ 1,
        . == "NS" ~ 2,
        . == "" ~ 3
      )),
      across(starts_with("nota_"), desc)
    )
)
#> [1] FALSE

Could you check it?

DavisVaughan commented Nov 14, 2022

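Those are just different data frames because you mutated the resultado_* columns in one but not the other: the first result has integer resultado_* columns, while the second still has the original character columns. I'm fairly confident it is working correctly.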

library(tidyverse)
df <- tibble::tribble(
  ~resultado_1, ~resultado_2, ~resultado_3, ~nota_1, ~nota_2,
  "SU",         "SU",           "",    26.3,   18.75,
  "SU",         "SU",           "",    28.3,    22.5
)
df
#> # A tibble: 2 × 5
#>   resultado_1 resultado_2 resultado_3 nota_1 nota_2
#>   <chr>       <chr>       <chr>        <dbl>  <dbl>
#> 1 SU          SU          ""            26.3   18.8
#> 2 SU          SU          ""            28.3   22.5

df %>%
  mutate(across(
    starts_with("resultado_"),
    ~ case_when(. %in% c("SU", "EX") ~ 1L,
                . == "NS" ~ 2L,
                . == "" ~ 3L)
  )) %>%
  arrange(resultado_1, resultado_2, resultado_3,
          desc(nota_1), desc(nota_2))
#> # A tibble: 2 × 5
#>   resultado_1 resultado_2 resultado_3 nota_1 nota_2
#>         <int>       <int>       <int>  <dbl>  <dbl>
#> 1           1           1           3   28.3   22.5
#> 2           1           1           3   26.3   18.8

df %>%
  arrange(
    across(starts_with("resultado_"), ~ case_when(
      . %in% c("SU", "EX") ~ 1,
      . == "NS" ~ 2,
      . == "" ~ 3
    )),
    across(starts_with("nota_"), desc)
  )
#> # A tibble: 2 × 5
#>   resultado_1 resultado_2 resultado_3 nota_1 nota_2
#>   <chr>       <chr>       <chr>        <dbl>  <dbl>
#> 1 SU          SU          ""            28.3   22.5
#> 2 SU          SU          ""            26.3   18.8

Created on 2022-11-14 with reprex v2.0.2.9000
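
The distinction above can be sketched on a reduced version of the data. This assumes a hypothetical helper rank_resultado (not from the thread) and only two of the original columns. An expression passed to arrange() is used as a temporary sort key and is not written back to the data frame, so the row order can match a mutate-then-arrange pipeline even though identical() on the whole data frames is FALSE.

```r
library(dplyr)

# Reduced, illustrative version of the data from the thread
df <- tibble::tribble(
  ~resultado_1, ~nota_1,
  "SU",         26.3,
  "SU",         28.3
)

# Hypothetical helper mapping result codes to an integer rank
rank_resultado <- function(x) {
  case_when(x %in% c("SU", "EX") ~ 1L,
            x == "NS" ~ 2L,
            x == "" ~ 3L)
}

# Sort by the computed rank without storing it
sorted <- df %>%
  arrange(rank_resultado(resultado_1), desc(nota_1))

# Store the rank first, then sort by the stored column
mutated <- df %>%
  mutate(resultado_1 = rank_resultado(resultado_1)) %>%
  arrange(resultado_1, desc(nota_1))

same_row_order <- identical(sorted$nota_1, mutated$nota_1)  # compares order only
same_data      <- identical(sorted, mutated)                # also compares column types
key_type_kept  <- is.character(sorted$resultado_1)          # arrange() did not modify the column
```

Comparing a column that neither pipeline modifies (here nota_1) isolates the ordering from the column-type difference.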

mgacc0 commented Nov 14, 2022

What I intended to write in my previous comment was:

library(tidyverse)
df <- tibble::tribble(
  ~resultado_1, ~resultado_2, ~resultado_3, ~nota_1, ~nota_2,
  "SU",         "SU",           "",    26.3,   18.75,
  "SU",         "SU",           "",    28.3,    22.5
)
df
#> # A tibble: 2 × 5
#>   resultado_1 resultado_2 resultado_3 nota_1 nota_2
#>   <chr>       <chr>       <chr>        <dbl>  <dbl>
#> 1 SU          SU          ""            26.3   18.8
#> 2 SU          SU          ""            28.3   22.5
identical(
  df %>%
    mutate(across(
      starts_with("resultado_"),
      ~ case_when(. %in% c("SU", "EX") ~ 1L,
                  . == "NS" ~ 2L,
                  . == "" ~ 3L)
    )) %>%
    arrange(resultado_1, resultado_2, resultado_3,
            desc(nota_1), desc(nota_2)),
  df %>%
    arrange(
      across(starts_with("resultado_"), ~ case_when(
        . %in% c("SU", "EX") ~ 1,
        . == "NS" ~ 2,
        . == "" ~ 3
      )),
      across(starts_with("nota_"), desc)
    ) %>%
    mutate(across(
      starts_with("resultado_"),
      ~ case_when(. %in% c("SU", "EX") ~ 1L,
                  . == "NS" ~ 2L,
                  . == "" ~ 3L)
    ))
)

This returns TRUE when using a recent tidyverse/dplyr version from GitHub.
But the current dplyr version on CRAN (1.0.10) returns FALSE.

sessioninfo::package_info("attached")
 package   * version     date (UTC) lib source
 dplyr     * 1.0.99.9000 2022-11-14 [1] Github (tidyverse/dplyr@50c58dd)
 forcats   * 0.5.2       2022-08-19 [1] CRAN (R 4.2.1)
 ggplot2   * 3.4.0       2022-11-04 [1] CRAN (R 4.2.2)
 purrr     * 0.3.5       2022-10-06 [1] CRAN (R 4.2.1)
 readr     * 2.1.3       2022-10-01 [1] CRAN (R 4.2.1)
 stringr   * 1.4.1       2022-08-20 [1] CRAN (R 4.2.1)
 tibble    * 3.1.8       2022-07-22 [1] CRAN (R 4.2.1)
 tidyr     * 1.2.1       2022-09-08 [1] CRAN (R 4.2.1)
 tidyverse * 1.3.2       2022-07-18 [1] CRAN (R 4.2.2)

Since Oct 4th (#6490) I had had a recent version from GitHub installed locally.
But in the last few days I unintentionally overwrote it with the older version from CRAN.
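
A small sketch of how the installed version could be checked before rerunning the comparison. The threshold 1.0.10 is the CRAN release reported in this thread as returning FALSE; treating every version at or below it as affected is an assumption.

```r
# Version of dplyr found on the current library paths
installed <- utils::packageVersion("dplyr")
print(installed)

# Assumption: releases up to and including 1.0.10 show the behaviour
# reported in this thread; later development versions do not.
if (installed <= "1.0.10") {
  message("dplyr ", format(installed),
          " is at or below 1.0.10; the comparison above may return FALSE")
}
```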

So, this issue is analogous to #6490 and should remain closed.

Thanks for your help and diagnostics @DavisVaughan.
