polars-u64-idx: frame_equal fails when number of rows reaches 2^32 #3511

Closed
cbilot opened this issue May 26, 2022 · 0 comments · Fixed by #3563

Labels
bug Something isn't working

cbilot commented May 26, 2022

polars-u64-idx 0.13.38
python 3.10.4
Linux Mint 20.3

Describe your bug.

The frame_equal method returns False when a DataFrame is compared with itself once the number of rows exceeds 2^32 - 1.

What are the steps to reproduce the behavior?

Despite the large number of rows, this MWE is designed to run with polars-u64-idx on any computer with a reasonable amount of RAM.

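import polars as pl

# 2**32 rows is one more than the largest value a u32 can hold.
_nbr_rows = 2**32
df = pl.select(pl.repeat(False, n=_nbr_rows, eager=True, name="col1"))
df

# Comparing the frame with itself should return True.
df.frame_equal(df)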
>>> df
shape: (4294967296, 1)
┌───────┐
│ col1  │
│ ---   │
│ bool  │
╞═══════╡
│ false │
├╌╌╌╌╌╌╌┤
│ false │
├╌╌╌╌╌╌╌┤
│ false │
├╌╌╌╌╌╌╌┤
│ false │
├╌╌╌╌╌╌╌┤
│ ...   │
├╌╌╌╌╌╌╌┤
│ false │
├╌╌╌╌╌╌╌┤
│ false │
├╌╌╌╌╌╌╌┤
│ false │
├╌╌╌╌╌╌╌┤
│ false │
└───────┘
>>> df.frame_equal(df)
False

However, if we reduce _nbr_rows to 2^32 - 1, frame_equal returns True:

shape: (4294967295, 1)
┌───────┐
│ col1  │
│ ---   │
│ bool  │
╞═══════╡
│ false │
├╌╌╌╌╌╌╌┤
│ false │
├╌╌╌╌╌╌╌┤
│ false │
├╌╌╌╌╌╌╌┤
│ false │
├╌╌╌╌╌╌╌┤
│ ...   │
├╌╌╌╌╌╌╌┤
│ false │
├╌╌╌╌╌╌╌┤
│ false │
├╌╌╌╌╌╌╌┤
│ false │
├╌╌╌╌╌╌╌┤
│ false │
└───────┘
>>> df.frame_equal(df)
True

Other Notes:

I've checked that I'm running polars-u64-idx:

>>> pl.select(pl.repeat(False, n=(2**32) + 100,
...           eager=True, name="col1")
...           ).with_row_count()
shape: (4294967396, 2)
┌────────────┬───────┐
│ row_nr     ┆ col1  │
│ ---        ┆ ---   │
│ u64        ┆ bool  │
╞════════════╪═══════╡
│ 0          ┆ false │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1          ┆ false │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2          ┆ false │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3          ┆ false │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ ...        ┆ ...   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 4294967392 ┆ false │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 4294967393 ┆ false │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 4294967394 ┆ false │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 4294967395 ┆ false │
└────────────┴───────┘

The row_nr datatype is u64.

This came about as I tried to figure out why large datasets I wrote to parquet and IPC formats seemed corrupted. I'm guessing it's not about file corruption or file format issues ... but rather a row-indexing issue with frame_equal (which I was using to test the datasets read in from the files).
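
To illustrate why a threshold at exactly 2^32 points toward an indexing problem, here is a minimal sketch in plain Python of what happens if a row count passes through a 32-bit unsigned integer at some point. This is only an assumption about the kind of bug involved, not a diagnosis of the polars internals, and `wrap_u32` is a hypothetical helper written for this example.

```python
# Hypothetical helper: mimic storing a value in a 32-bit unsigned integer
# by keeping only the low 32 bits. Not part of polars.
U32_MASK = 0xFFFFFFFF


def wrap_u32(n: int) -> int:
    """Return n truncated to 32 bits, as a u32 conversion would wrap it."""
    return n & U32_MASK


# The three row counts used in this report.
row_counts = [2**32 - 1, 2**32, 2**32 + 100]

for n in row_counts:
    wrapped = wrap_u32(n)
    status = "preserved" if wrapped == n else "wrapped"
    print(f"{n} -> {wrapped} ({status})")
```

Under that assumption, 2^32 - 1 survives the conversion unchanged, while 2^32 wraps to 0 and 2^32 + 100 wraps to 100, which would match a comparison that starts failing at exactly 2^32 rows.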
