min and max expressions produce incorrect results #2850

cbilot · 2022-03-08T04:36:36Z

python 3.10.2
polars 0.13.9
Linux Mint 20.3

Describe your bug.

The min and max expressions yield the wrong results.

What are the steps to reproduce the behavior?

This is a hard bug to reproduce (with simple values), but I found this data reproduces it.

import polars as pl

df = pl.DataFrame(
    {
        "id": [
            130352258,
            130352432,
            130352277,
            130352611,
            130352833,
            130352305,
            130352764,
            130352475,
            130352368,
            130352346,
        ]
    }
)
df

shape: (10, 1)
┌───────────┐
│ id        │
│ ---       │
│ i64       │
╞═══════════╡
│ 130352258 │
├╌╌╌╌╌╌╌╌╌╌╌┤
│ 130352432 │
├╌╌╌╌╌╌╌╌╌╌╌┤
│ 130352277 │
├╌╌╌╌╌╌╌╌╌╌╌┤
│ 130352611 │
├╌╌╌╌╌╌╌╌╌╌╌┤
│ ...       │
├╌╌╌╌╌╌╌╌╌╌╌┤
│ 130352764 │
├╌╌╌╌╌╌╌╌╌╌╌┤
│ 130352475 │
├╌╌╌╌╌╌╌╌╌╌╌┤
│ 130352368 │
├╌╌╌╌╌╌╌╌╌╌╌┤
│ 130352346 │
└───────────┘

Note that the minimum of id is 130352258, and the maximum is 130352833.
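As a cross-check that does not depend on Polars, the expected values can be confirmed with Python's built-in `min` and `max`. This is a minimal sketch; the variable names `ids`, `expected_min`, and `expected_max` are chosen here for illustration and do not appear in the original report.

```python
# Values copied from the "id" column in the report above.
ids = [
    130352258, 130352432, 130352277, 130352611, 130352833,
    130352305, 130352764, 130352475, 130352368, 130352346,
]

# The built-ins scan the list one element at a time, so they serve as an
# independent reference for what the query should return.
expected_min = min(ids)
expected_max = max(ids)

print(expected_min, expected_max)  # 130352258 130352833
```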

Now let's run a simple query to calculate the minimum and the maximum...

df.select([pl.col("id").min().alias("min"), pl.col("id").max().alias("max")])

shape: (1, 2)
┌───────────┬───────────┐
│ min       ┆ max       │
│ ---       ┆ ---       │
│ i64       ┆ i64       │
╞═══════════╪═══════════╡
│ 130352258 ┆ 130352368 │
└───────────┴───────────┘

The minimum calculated by the query is correct, but the maximum is not.

Now, let's sort the data and repeat the query.

df = df.sort("id")

df.select([pl.col("id").min().alias("min"), pl.col("id").max().alias("max")])

shape: (1, 2)
┌───────────┬───────────┐
│ min       ┆ max       │
│ ---       ┆ ---       │
│ i64       ┆ i64       │
╞═══════════╪═══════════╡
│ 130352258 ┆ 130352833 │
└───────────┴───────────┘

Now we get the correct results.

Now let's permute the rows slightly, and repeat the process ...

df = pl.DataFrame(
    {
        "id": [
            130352432,
            130352277,
            130352611,
            130352833,
            130352305,
            130352258,
            130352764,
            130352475,
            130352368,
            130352346,
        ]
    }
)

df.select([pl.col("id").min().alias("min"), pl.col("id").max().alias("max")])

shape: (1, 2)
┌───────────┬───────────┐
│ min       ┆ max       │
│ ---       ┆ ---       │
│ i64       ┆ i64       │
╞═══════════╪═══════════╡
│ 130352346 ┆ 130352833 │
└───────────┴───────────┘

This time, the calculated maximum is correct, but the minimum is not.

However, sorting the data once again yields the correct results.

df = df.sort("id")

df.select([pl.col("id").min().alias("min"), pl.col("id").max().alias("max")])

shape: (1, 2)
┌───────────┬───────────┐
│ min       ┆ max       │
│ ---       ┆ ---       │
│ i64       ┆ i64       │
╞═══════════╪═══════════╡
│ 130352258 ┆ 130352833 │
└───────────┴───────────┘

As you permute the rows, the values you get for min and max can change.
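Since min and max should not depend on row order, one way to detect this kind of bug is to run an aggregation over many orderings of the same values and count the distinct answers. The sketch below is plain Python and does not call Polars; `orderings`, `distinct_results`, and `broken_max` are hypothetical helpers written for this illustration, and `broken_max` is a deliberately wrong toy, not the Polars code.

```python
import random

ids = [
    130352258, 130352432, 130352277, 130352611, 130352833,
    130352305, 130352764, 130352475, 130352368, 130352346,
]

def orderings(values, extra_shuffles=20, seed=0):
    """Yield every cyclic rotation of `values`, then a few seeded shuffles."""
    for k in range(len(values)):
        yield values[k:] + values[:k]
    rng = random.Random(seed)
    for _ in range(extra_shuffles):
        shuffled = list(values)
        rng.shuffle(shuffled)
        yield shuffled

def distinct_results(aggregate, values):
    """Apply `aggregate` to each ordering and collect the distinct answers."""
    return {aggregate(order) for order in orderings(values)}

# An order-independent aggregation gives exactly one distinct answer.
print(distinct_results(min, ids))  # {130352258}
print(distinct_results(max, ids))  # {130352833}

# Toy aggregator that only looks at the first 8 values (hypothetical bug).
def broken_max(values):
    return max(values[:8])

# More than one distinct answer means the result depends on row order.
print(len(distinct_results(broken_max, ids)) > 1)  # True
```

To apply the same check to a real aggregation, pass a callable that builds the DataFrame from the list and returns the computed value.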

There may be a simpler example, but I started with 1.25 billion records, concatenated using scan_ipc over a wildcard/glob of 91 IPC files, with all calculations done in lazy mode. So it was quite the debugging process to nail down just what was going wrong. These numbers seem to produce aberrant results, though I have no idea why.


ritchie46 · 2022-03-08T06:52:28Z

Sorry for this 😞. We switched to std::simd in the latest release, which likely caused this regression.
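The thread does not say what exactly went wrong in the std::simd code path. As background only, here is a minimal pure-Python sketch of how a lane-wise (SIMD-style) min/max reduction is commonly structured. The function name `lanewise_min_max` and the lane width of 8 are assumptions for illustration; this is not the arrow2 or Polars implementation.

```python
def lanewise_min_max(values, lanes=8):
    """Min/max via a lane-wise reduction, mimicking the shape of a SIMD kernel."""
    if not values:
        raise ValueError("empty input")

    # Number of elements that fit into whole chunks of `lanes` values.
    n_full = len(values) - len(values) % lanes
    lo = hi = None

    if n_full:
        # Step 1: initialise one accumulator per lane from the first chunk.
        lane_min = list(values[:lanes])
        lane_max = list(values[:lanes])
        # Step 2: fold each later full chunk into the accumulators, lane by lane.
        for start in range(lanes, n_full, lanes):
            chunk = values[start:start + lanes]
            lane_min = [min(a, b) for a, b in zip(lane_min, chunk)]
            lane_max = [max(a, b) for a, b in zip(lane_max, chunk)]
        # Step 3: horizontal reduction across the lanes.
        lo, hi = min(lane_min), max(lane_max)

    # Step 4: fold in the remainder that did not fill a whole chunk.
    for v in values[n_full:]:
        lo = v if lo is None else min(lo, v)
        hi = v if hi is None else max(hi, v)

    return lo, hi


ids = [
    130352258, 130352432, 130352277, 130352611, 130352833,
    130352305, 130352764, 130352475, 130352368, 130352346,
]
print(lanewise_min_max(ids))  # (130352258, 130352833)
```

Under the assumed lane width, the 10 rows split into one full chunk of 8 and a remainder of 2, so the result is only correct if both the chunk path and the remainder path are combined properly. A kernel of this shape is worth testing against inputs whose length is not a multiple of the lane width.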

This snippet fails on 0.13.9 and succeeds on 0.13.8:

# `df` is the DataFrame from the report above; `pl` is polars.
minimum = 130352258
maximum = 130352833

for _ in range(10):
    permuted = df.sample(frac=1.0)
    computed = permuted.select([pl.col("id").min().alias("min"), pl.col("id").max().alias("max")])
    assert computed[0, "min"] == minimum
    assert computed[0, "max"] == maximum

I will issue and release a fix immediately.

Thanks for taking the effort to produce such a good example! 👍

ritchie46 · 2022-03-08T09:23:18Z

Fixed and released.

cbilot added the bug label Mar 8, 2022

ritchie46 mentioned this issue Mar 8, 2022

undo std::simd incorrect aggregation #2852

Merged

ritchie46 closed this as completed Mar 8, 2022

jorgecarleitao mentioned this issue Mar 8, 2022

Fixed error in computing min_max in std::simd jorgecarleitao/arrow2#894

Merged
