Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rank behaves differently from other systems #7024

Open
Jefffrey opened this issue Jan 26, 2025 · 2 comments
Open

Rank behaves differently from other systems #7024

Jefffrey opened this issue Jan 26, 2025 · 2 comments

Comments

@Jefffrey
Copy link
Contributor

Something interesting about the rank function in arrow-rs is that it seems to behave differently compared to others.

Postgres:

rank () → bigint
Returns the rank of the current row, with gaps; that is, the row_number of the first row in its peer group.

DuckDB:

Description The rank of the current row with gaps; same as row_number of its first peer.

SQL Server:

If two or more rows tie for a rank, each tied row receives the same rank. For example, if the two top salespeople have the same SalesYTD value, they are both ranked one.

It seems these other ones would have ties take the lower values, whereas for arrow-rs the existing rank function (and subsequently what is introduced here) takes the higher value; I guess this might be worth a broader discussion, as this PR itself currently follows the existing behaviour

🤔

Originally posted by @Jefffrey in #6912 (review)

@Jefffrey
Copy link
Contributor Author

arrow-rs

    #[test]
    fn test123() {
        let values = StringArray::from(vec![Some("a"), Some("a"), Some("b"), None]);
        let res = rank(
            &values,
            Some(SortOptions {
                descending: false,
                nulls_first: false,
            }),
        )
        .unwrap();
        dbg!(res);
    }

Output:

arrow-rs$ cargo test -p arrow-ord --lib rank::tests::test123 -- --nocapture
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.05s
     Running unittests src/lib.rs (/media/jeffrey/1tb_860evo_ssd/.cargo_target_cache/debug/deps/arrow_ord-34ffc585b0d5be26)

running 1 test
[arrow-ord/src/rank.rs:373:9] res = [
    2,
    2,
    3,
    4,
]
test rank::tests::test123 ... ok

Postgres

postgres=# WITH  vals (k) AS (VALUES ('a'), ('a'), ('b'), (null))
SELECT rank() over (order by k asc nulls last) FROM vals;
 rank
------
    1
    1
    3
    4
(4 rows)

DuckDB

D select rank() over (order by col0 asc nulls last) from (values ('a'), ('a'), ('b'), (null));
┌────────────────────────────────────────────┐
│ rank() OVER (ORDER BY col0 ASC NULLS LAST) │
│                   int64                    │
├────────────────────────────────────────────┤
│                                          1 │
│                                          1 │
│                                          3 │
│                                          4 │
└────────────────────────────────────────────┘

Options

  • Accept this difference and maybe enhance docs to point out this difference in behaviour
  • Change to postgres/duckdb behaviour (subtle breaking change?)
  • Make configurable (API breaking change?)

@tustvold
Copy link
Contributor

Changing to match postgres behaviour in a breaking release would be my vote

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants