min and max expressions produce incorrect results #2850

Closed
cbilot opened this issue Mar 8, 2022 · 2 comments
Labels
bug Something isn't working


@cbilot

cbilot commented Mar 8, 2022

python 3.10.2
polars 0.13.9
Linux Mint 20.3

Describe your bug.

The min and max expressions yield the wrong results.

What are the steps to reproduce the behavior?

This is a hard bug to reproduce (with simple values), but I found this data reproduces it.

import polars as pl

df = pl.DataFrame(
    {
        "id": [
            130352258,
            130352432,
            130352277,
            130352611,
            130352833,
            130352305,
            130352764,
            130352475,
            130352368,
            130352346,
        ]
    }
)
df
shape: (10, 1)
┌───────────┐
│ id        │
│ ---       │
│ i64       │
╞═══════════╡
│ 130352258 │
├╌╌╌╌╌╌╌╌╌╌╌┤
│ 130352432 │
├╌╌╌╌╌╌╌╌╌╌╌┤
│ 130352277 │
├╌╌╌╌╌╌╌╌╌╌╌┤
│ 130352611 │
├╌╌╌╌╌╌╌╌╌╌╌┤
│ ...       │
├╌╌╌╌╌╌╌╌╌╌╌┤
│ 130352764 │
├╌╌╌╌╌╌╌╌╌╌╌┤
│ 130352475 │
├╌╌╌╌╌╌╌╌╌╌╌┤
│ 130352368 │
├╌╌╌╌╌╌╌╌╌╌╌┤
│ 130352346 │
└───────────┘

Note that the minimum of id is 130352258, and the maximum is 130352833.

Now let's run a simple query to calculate the minimum and the maximum...

df.select([pl.col("id").min().alias("min"), pl.col("id").max().alias("max")])
shape: (1, 2)
┌───────────┬───────────┐
│ min       ┆ max       │
│ ---       ┆ ---       │
│ i64       ┆ i64       │
╞═══════════╪═══════════╡
│ 130352258 ┆ 130352368 │
└───────────┴───────────┘

The minimum calculated by the query is correct, but the maximum is not.

Now, let's sort the data and repeat the query.

df = df.sort("id")

df.select([pl.col("id").min().alias("min"), pl.col("id").max().alias("max")])
shape: (1, 2)
┌───────────┬───────────┐
│ min       ┆ max       │
│ ---       ┆ ---       │
│ i64       ┆ i64       │
╞═══════════╪═══════════╡
│ 130352258 ┆ 130352833 │
└───────────┴───────────┘

Now we get the correct results.

Now let's permute the rows slightly, and repeat the process ...

df = pl.DataFrame(
    {
        "id": [
            130352432,
            130352277,
            130352611,
            130352833,
            130352305,
            130352258,
            130352764,
            130352475,
            130352368,
            130352346,
        ]
    }
)

df.select([pl.col("id").min().alias("min"), pl.col("id").max().alias("max")])
shape: (1, 2)
┌───────────┬───────────┐
│ min       ┆ max       │
│ ---       ┆ ---       │
│ i64       ┆ i64       │
╞═══════════╪═══════════╡
│ 130352346 ┆ 130352833 │
└───────────┴───────────┘

This time, the calculated maximum is correct, but the minimum is not.

However, sorting the data once again yields the correct results.

df = df.sort("id")

df.select([pl.col("id").min().alias("min"), pl.col("id").max().alias("max")])
shape: (1, 2)
┌───────────┬───────────┐
│ min       ┆ max       │
│ ---       ┆ ---       │
│ i64       ┆ i64       │
╞═══════════╪═══════════╡
│ 130352258 ┆ 130352833 │
└───────────┴───────────┘

As you permute the rows, the values you get for min and max can change.
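
Because min and max should not depend on row order, the observation above can be turned into a small permutation check. The following is only a sketch: `find_bad_permutation` and `python_min_max` are hypothetical helper names, not part of polars, and the function under test is passed in as a callable so the same loop can wrap a polars query or any other implementation. Python's built-in `min` and `max` serve as the reference.

```python
import random

# The ten ids from the report above.
IDS = [
    130352258, 130352432, 130352277, 130352611, 130352833,
    130352305, 130352764, 130352475, 130352368, 130352346,
]


def python_min_max(values):
    """Reference implementation using Python's built-ins."""
    return min(values), max(values)


def find_bad_permutation(values, compute_min_max, trials=100, seed=0):
    """Shuffle `values` up to `trials` times and return the first ordering
    for which `compute_min_max` disagrees with the built-in min/max.
    Returns None if every tried ordering agrees."""
    rng = random.Random(seed)
    expected = (min(values), max(values))
    for _ in range(trials):
        # random.sample with k == len(values) yields a shuffled copy.
        permuted = rng.sample(list(values), k=len(values))
        if tuple(compute_min_max(permuted)) != expected:
            return permuted
    return None


# Hypothetical polars wrapper (assumes `import polars as pl`), using the
# same query as in this report:
#
# def polars_min_max(values):
#     out = pl.DataFrame({"id": values}).select(
#         [pl.col("id").min().alias("min"), pl.col("id").max().alias("max")]
#     )
#     return out[0, "min"], out[0, "max"]

if __name__ == "__main__":
    print(find_bad_permutation(IDS, python_min_max))
```

Passing the polars wrapper instead of `python_min_max` would return a concrete failing row order on an affected version, which may be handy for a bug report.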

There may be a simpler example. But I started with 1.25 billion records, concatenated using scan_ipc of a wildcard/glob of 91 ipc files, with all calculations done in lazy mode. So it was quite the debugging process to nail down just what was going wrong. And these numbers seem to produce aberrant results. No idea why ...

cbilot added the bug label on Mar 8, 2022
@ritchie46
Member

Sorry about this 😞. We switched to std::simd in the latest release, which likely caused this regression.

This snippet fails on 0.13.9 and succeeds on 0.13.8:

# df is the DataFrame from the report above
minimum = 130352258
maximum = 130352833

for _ in range(10):
    permuted = df.sample(frac=1.0)
    computed = permuted.select([pl.col("id").min().alias("min"), pl.col("id").max().alias("max")])
    assert computed[0, "min"] == minimum
    assert computed[0, "max"] == maximum

I will issue and release a fix immediately.

Thanks for taking the effort to produce such a good example! 👍

@ritchie46
Member

Fixed and released.
