incorrect ordering when having multiple `across` calls inside `arrange` #6538

mgacc0 · 2022-11-14T15:35:54Z

There's something fishy here...
When `arrange()` contains multiple `across()` calls, the ordering is sometimes correct and sometimes not:

  library(tidyverse)

  df <- tribble(
    ~other_text, ~categ_1, ~categ_2, ~points_1, ~points_2, ~total,
    "x",      "A",      "B",       22L,       20L,    42L,
    "z",      "A",      "B",       20L,       22L,    42L,
    "y",      "A",      "B",       22L,       20L,    42L
  )
  df
#> # A tibble: 3 × 6
#>   other_text categ_1 categ_2 points_1 points_2 total
#>   <chr>      <chr>   <chr>      <int>    <int> <int>
#> 1 x          A       B             22       20    42
#> 2 z          A       B             20       22    42
#> 3 y          A       B             22       20    42
  
  set.seed(3)
  purrr::map_lgl(1:20,
                 ~ identical(
                   df %>%
                     slice_sample(n = nrow(.)) %>%
                     arrange(desc(total),
                             categ_1, categ_2,
                             desc(points_1), desc(points_2)),
                   df %>%
                     slice_sample(n = nrow(.)) %>%
                     arrange(desc(total),
                             across(starts_with("categ_")),
                             across(starts_with("points_"), desc))
                 ))
#>  [1] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE
#> [13]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE


mgacc0 · 2022-11-14T15:37:33Z

Related to #6490 and https://stackoverflow.com/questions/73943949/sophisticated-formula-inside-arrange/73945836

mgacc0 · 2022-11-14T15:57:59Z

  sessioninfo::package_info("attached")
#>  package   * version date (UTC) lib source
#>  dplyr     * 1.0.10  2022-09-01 [1] CRAN (R 4.2.1)
#>  forcats   * 0.5.2   2022-08-19 [1] CRAN (R 4.2.1)
#>  ggplot2   * 3.4.0   2022-11-04 [1] CRAN (R 4.2.2)
#>  purrr     * 0.3.5   2022-10-06 [1] CRAN (R 4.2.1)
#>  readr     * 2.1.3   2022-10-01 [1] CRAN (R 4.2.1)
#>  stringr   * 1.4.1   2022-08-20 [1] CRAN (R 4.2.1)
#>  tibble    * 3.1.8   2022-07-22 [1] CRAN (R 4.2.1)
#>  tidyr     * 1.2.1   2022-09-08 [1] CRAN (R 4.2.1)
#>  tidyverse * 1.3.2   2022-07-18 [1] CRAN (R 4.2.2)
#>

DavisVaughan · 2022-11-14T16:09:24Z

Nothing fishy here. You can reproduce this with identical calls to arrange() that don't use across():

library(tidyverse)

df <- tribble(
  ~other_text, ~categ_1, ~categ_2, ~points_1, ~points_2, ~total,
  "x",      "A",      "B",       22L,       20L,    42L,
  "z",      "A",      "B",       20L,       22L,    42L,
  "y",      "A",      "B",       22L,       20L,    42L
)

set.seed(3)
purrr::map_lgl(1:20,
               ~ identical(
                 df %>%
                   slice_sample(n = nrow(.)) %>%
                   arrange(desc(total),
                           categ_1, categ_2,
                           desc(points_1), desc(points_2)),
                 df %>%
                   slice_sample(n = nrow(.)) %>%
                   arrange(desc(total),
                           categ_1, categ_2,
                           desc(points_1), desc(points_2))
               ))
#>  [1] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE
#> [13]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE

The problem is that you have two rows that are identical in every column except other_text, which you aren't ordering by. So in some cases, after you sample the rows, other_text = "x" comes first, and in other cases other_text = "y" comes first. Since every other column is the same for those rows, arrange() just leaves them in whatever order it got from the sampling, so sometimes they don't match up.
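
One way to check this explanation is to add the remaining column as a final tie-breaker, so that no two rows compare equal. The sketch below is not from the thread: the helper name `shuffle_then_arrange()` is hypothetical, and it assumes a dplyr version that provides `slice_sample()` (1.0.0 or later). It deliberately avoids `across()` so that it only exercises the tie-handling point.

```r
library(dplyr)

df <- tibble::tribble(
  ~other_text, ~categ_1, ~categ_2, ~points_1, ~points_2, ~total,
  "x",      "A",      "B",       22L,       20L,    42L,
  "z",      "A",      "B",       20L,       22L,    42L,
  "y",      "A",      "B",       22L,       20L,    42L
)

# Hypothetical helper: shuffle the rows, then sort them.
# `other_text` is added as the last key so ties between "x" and "y" are broken.
shuffle_then_arrange <- function(data) {
  data %>%
    slice_sample(n = nrow(data)) %>%
    arrange(desc(total),
            categ_1, categ_2,
            desc(points_1), desc(points_2),
            other_text)
}

set.seed(3)
# Compare two independently shuffled-and-sorted copies, 20 times.
all_match <- vapply(
  1:20,
  function(i) identical(shuffle_then_arrange(df), shuffle_then_arrange(df)),
  logical(1)
)
all(all_match)
```

With every row uniquely ordered, the result no longer depends on how the sampling happened to shuffle the tied rows, so the 20 comparisons should all agree.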

mgacc0 · 2022-11-14T17:04:32Z

Sorry @DavisVaughan, I was trying to write a reproducible example that was simpler than the real one.
My real case is something like this:

library(tidyverse)
df <- tibble::tribble(
  ~resultado_1, ~resultado_2, ~resultado_3, ~nota_1, ~nota_2,
  "SU",         "SU",           "",    26.3,   18.75,
  "SU",         "SU",           "",    28.3,    22.5
)
df
#> # A tibble: 2 × 5
#>   resultado_1 resultado_2 resultado_3 nota_1 nota_2
#>   <chr>       <chr>       <chr>        <dbl>  <dbl>
#> 1 SU          SU          ""            26.3   18.8
#> 2 SU          SU          ""            28.3   22.5
identical(
  df %>%
    mutate(across(
      starts_with("resultado_"),
      ~ case_when(. %in% c("SU", "EX") ~ 1L,
                  . == "NS" ~ 2L,
                  . == "" ~ 3L)
    )) %>%
    arrange(resultado_1, resultado_2, resultado_3,
            desc(nota_1), desc(nota_2)),
  df %>%
    arrange(
      across(starts_with("resultado_"), ~ case_when(
        . %in% c("SU", "EX") ~ 1,
        . == "NS" ~ 2,
        . == "" ~ 3
      )),
      across(starts_with("nota_"), desc)
    )
)
#> [1] FALSE

Could you check it?

DavisVaughan · 2022-11-14T18:20:09Z

Those are just different data frames, because you mutated the resultado_* columns in one but not the other. I'm fairly confident it is working right:

library(tidyverse)
df <- tibble::tribble(
  ~resultado_1, ~resultado_2, ~resultado_3, ~nota_1, ~nota_2,
  "SU",         "SU",           "",    26.3,   18.75,
  "SU",         "SU",           "",    28.3,    22.5
)
df
#> # A tibble: 2 × 5
#>   resultado_1 resultado_2 resultado_3 nota_1 nota_2
#>   <chr>       <chr>       <chr>        <dbl>  <dbl>
#> 1 SU          SU          ""            26.3   18.8
#> 2 SU          SU          ""            28.3   22.5

df %>%
  mutate(across(
    starts_with("resultado_"),
    ~ case_when(. %in% c("SU", "EX") ~ 1L,
                . == "NS" ~ 2L,
                . == "" ~ 3L)
  )) %>%
  arrange(resultado_1, resultado_2, resultado_3,
          desc(nota_1), desc(nota_2))
#> # A tibble: 2 × 5
#>   resultado_1 resultado_2 resultado_3 nota_1 nota_2
#>         <int>       <int>       <int>  <dbl>  <dbl>
#> 1           1           1           3   28.3   22.5
#> 2           1           1           3   26.3   18.8

df %>%
  arrange(
    across(starts_with("resultado_"), ~ case_when(
      . %in% c("SU", "EX") ~ 1,
      . == "NS" ~ 2,
      . == "" ~ 3
    )),
    across(starts_with("nota_"), desc)
  )
#> # A tibble: 2 × 5
#>   resultado_1 resultado_2 resultado_3 nota_1 nota_2
#>   <chr>       <chr>       <chr>        <dbl>  <dbl>
#> 1 SU          SU          ""            28.3   22.5
#> 2 SU          SU          ""            26.3   18.8

Created on 2022-11-14 with reprex v2.0.2.9000
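
To compare only the row order, both pipelines need to return the original, unmutated columns. A minimal sketch of one way to do that follows. It is not from the thread: the helper `resultado_rank()` and the temporary `*_rank` columns are hypothetical names, and the `across()` branch assumes a dplyr version in which functions inside `arrange(across())` behave as expected (see #6490).

```r
library(dplyr)

df <- tibble::tribble(
  ~resultado_1, ~resultado_2, ~resultado_3, ~nota_1, ~nota_2,
  "SU",         "SU",           "",    26.3,   18.75,
  "SU",         "SU",           "",    28.3,    22.5
)

# Hypothetical helper: map each result code to an integer sort key.
resultado_rank <- function(x) {
  case_when(x %in% c("SU", "EX") ~ 1L,
            x == "NS" ~ 2L,
            x == "" ~ 3L)
}

# Variant 1: store the keys in temporary columns, sort, then drop them again,
# so the original resultado_* columns stay untouched.
by_columns <- df %>%
  mutate(across(starts_with("resultado_"), resultado_rank,
                .names = "{.col}_rank")) %>%
  arrange(resultado_1_rank, resultado_2_rank, resultado_3_rank,
          desc(nota_1), desc(nota_2)) %>%
  select(-ends_with("_rank"))

# Variant 2: compute the same keys on the fly inside arrange().
by_across <- df %>%
  arrange(
    across(starts_with("resultado_"), resultado_rank),
    across(starts_with("nota_"), desc)
  )

identical(by_columns, by_across)
```

Because neither result keeps the recoded columns, any difference reported by `identical()` would come from the row order rather than from the column types.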

mgacc0 · 2022-11-14T18:53:31Z

What I meant to write was:

library(tidyverse)
df <- tibble::tribble(
  ~resultado_1, ~resultado_2, ~resultado_3, ~nota_1, ~nota_2,
  "SU",         "SU",           "",    26.3,   18.75,
  "SU",         "SU",           "",    28.3,    22.5
)
df
#> # A tibble: 2 × 5
#>   resultado_1 resultado_2 resultado_3 nota_1 nota_2
#>   <chr>       <chr>       <chr>        <dbl>  <dbl>
#> 1 SU          SU          ""            26.3   18.8
#> 2 SU          SU          ""            28.3   22.5
identical(
  df %>%
    mutate(across(
      starts_with("resultado_"),
      ~ case_when(. %in% c("SU", "EX") ~ 1L,
                  . == "NS" ~ 2L,
                  . == "" ~ 3L)
    )) %>%
    arrange(resultado_1, resultado_2, resultado_3,
            desc(nota_1), desc(nota_2)),
  df %>%
    arrange(
      across(starts_with("resultado_"), ~ case_when(
        . %in% c("SU", "EX") ~ 1,
        . == "NS" ~ 2,
        . == "" ~ 3
      )),
      across(starts_with("nota_"), desc)
    ) %>%
    mutate(across(
      starts_with("resultado_"),
      ~ case_when(. %in% c("SU", "EX") ~ 1L,
                  . == "NS" ~ 2L,
                  . == "" ~ 3L)
    ))
)

This returns TRUE with a recent tidyverse/dplyr version from GitHub.
But the current dplyr version on CRAN (1.0.10) returns FALSE.

sessioninfo::package_info("attached")
#>  package   * version     date (UTC) lib source
#>  dplyr     * 1.0.99.9000 2022-11-14 [1] Github (tidyverse/dplyr@50c58dd)
#>  forcats   * 0.5.2       2022-08-19 [1] CRAN (R 4.2.1)
#>  ggplot2   * 3.4.0       2022-11-04 [1] CRAN (R 4.2.2)
#>  purrr     * 0.3.5       2022-10-06 [1] CRAN (R 4.2.1)
#>  readr     * 2.1.3       2022-10-01 [1] CRAN (R 4.2.1)
#>  stringr   * 1.4.1       2022-08-20 [1] CRAN (R 4.2.1)
#>  tibble    * 3.1.8       2022-07-22 [1] CRAN (R 4.2.1)
#>  tidyr     * 1.2.1       2022-09-08 [1] CRAN (R 4.2.1)
#>  tidyverse * 1.3.2       2022-07-18 [1] CRAN (R 4.2.2)

Since Oct 4th (#6490) I had a recent version from GitHub installed locally.
But in the last few days I had unintentionally overwritten it with the older one from CRAN.
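
A quick way to spot this kind of accidental downgrade is to check the installed version before running the comparison. This is only a sketch: the threshold `1.0.99.9000` is taken from the session info above (the development build that returned TRUE), not from an official statement of where the fix landed, and the variable name `has_dev_build` is hypothetical.

```r
# Version of dplyr currently installed in the active library
installed <- utils::packageVersion("dplyr")

# Assumption: builds at or above the development version shown in this thread
# include the behaviour discussed in #6490.
has_dev_build <- installed >= "1.0.99.9000"

if (!has_dev_build) {
  message("dplyr ", as.character(installed),
          " is older than the development build used in this thread.")
}
```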

So, this issue is analogous to #6490 and should remain closed.

Thanks for your help and diagnostics @DavisVaughan.

DavisVaughan closed this as completed Nov 14, 2022
