pl.read_ipc(..., columns=specific_column_order, use_pyarrow=False) does not preserve the column order #1761

Closed
ghuls opened this issue Nov 13, 2021 · 4 comments · Fixed by #3591
Labels
bug Something isn't working

Comments

@ghuls
Collaborator

ghuls commented Nov 13, 2021

Are you using Python or Rust?

Python.

What version of polars are you using?

git commit: e8f6b0f

What operating system are you using polars on?

CentOS 7

Describe your bug.

pl.read_ipc(..., columns=specific_column_order, use_pyarrow=False) does not preserve the column order specified by the user, because the columns are sorted by column index before they are passed to the arrow2 IPC read function.
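
A minimal pure-Python sketch of the mechanism described above, not the actual polars code: the requested column names are resolved to positions in the file schema, and those positions are sorted before the read, so the order the user asked for is lost. The helper name `resolve_projection` and its `sort_indices` flag are hypothetical and exist only for illustration.

```python
def resolve_projection(schema, columns, sort_indices=True):
    """Map requested column names to their positions in the file schema."""
    indices = [schema.index(name) for name in columns]
    # Sorting mirrors the behaviour reported in this issue: the projection
    # is handed to the reader in file order, so the user's order is dropped.
    return sorted(indices) if sort_indices else indices


schema = ["a", "b", "c", "d"]
requested = ["c", "b", "d", "a"]

sorted_projection = resolve_projection(schema, requested)
print([schema[i] for i in sorted_projection])  # ['a', 'b', 'c', 'd']
```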

What are the steps to reproduce the behavior?

In [28]: import polars as pl

In [29]: df = pl.DataFrame([
    ...:     pl.Series("a", [2, 5]),
    ...:     pl.Series("b", [4, 3]),
    ...:     pl.Series("c", [1, 6]),
    ...:     pl.Series("d", [7, 9]),
    ...: ])

In [30]: df.to_ipc('test.feather')

In [31]: df_read = pl.read_ipc('test.feather', columns=["a", "b", "c", "d"], use_pyarrow=False)

In [32]: df_read_out_of_order = pl.read_ipc('test.feather', columns=["c", "b", "d", "a"], use_pyarrow=False)

In [33]: df
Out[33]: 
shape: (2, 4)
┌─────┬─────┬─────┬─────┐
│ a   ┆ b   ┆ c   ┆ d   │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ 2   ┆ 4   ┆ 1   ┆ 7   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 5   ┆ 3   ┆ 6   ┆ 9   │
└─────┴─────┴─────┴─────┘

In [34]: df_read
Out[34]: 
shape: (2, 4)
┌─────┬─────┬─────┬─────┐
│ a   ┆ b   ┆ c   ┆ d   │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ 2   ┆ 4   ┆ 1   ┆ 7   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 5   ┆ 3   ┆ 6   ┆ 9   │
└─────┴─────┴─────┴─────┘

In [35]: df_read_out_of_order
Out[35]: 
shape: (2, 4)
┌─────┬─────┬─────┬─────┐
│ a   ┆ b   ┆ c   ┆ d   │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ 2   ┆ 4   ┆ 1   ┆ 7   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 5   ┆ 3   ┆ 6   ┆ 9   │
└─────┴─────┴─────┴─────┘

In [36]: df_read_out_of_order_pyarrow = pl.read_ipc('test.feather', columns=["c", "b", "d", "a"], use_pyarrow=True)

In [37]: df_read_out_of_order_pyarrow
Out[37]: 
shape: (2, 4)
┌─────┬─────┬─────┬─────┐
│ c   ┆ b   ┆ d   ┆ a   │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ 1   ┆ 4   ┆ 7   ┆ 2   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 6   ┆ 3   ┆ 9   ┆ 5   │
└─────┴─────┴─────┴─────┘

@jorgecarleitao
Collaborator

Maybe we should address this in arrow2; atm it does not support projections over IPC with an order different from the one on the file, but I can't find a reason why it shouldn't; it is just a column swap before creating the record batch.
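
The column swap mentioned above could look roughly like the following pure-Python sketch. It is an illustration under stated assumptions, not arrow2's implementation: `read_projected` is a hypothetical helper, and a plain dict stands in for the IPC file, with dict order playing the role of the on-disk column order.

```python
def read_projected(file_columns, requested):
    """Read columns in file order, then swap them into the requested order.

    `file_columns` maps column name -> values and stands in for the IPC file.
    """
    schema = list(file_columns)
    # Step 1: the projection in file order, i.e. sorted by column position.
    projection = sorted(schema.index(name) for name in requested)
    batch = [(schema[i], file_columns[schema[i]]) for i in projection]
    # Step 2: the column swap, applied before the final batch is built.
    by_name = dict(batch)
    return [(name, by_name[name]) for name in requested]


file_columns = {"a": [2, 5], "b": [4, 3], "c": [1, 6], "d": [7, 9]}
result = read_projected(file_columns, ["c", "b", "d", "a"])
print([name for name, _ in result])  # ['c', 'b', 'd', 'a']
```

Reordering after the read keeps the reader itself unchanged; the swap only permutes references to columns that were already loaded.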

@ritchie46
Member

> Maybe we should address this in arrow2; atm it does not support projections over IPC with an order different from the one on the file, but I can't find a reason why it shouldn't; it is just a column swap before creating the record batch.

Then we ditch the sort 👍

@ghuls
Collaborator Author

ghuls commented Feb 18, 2022

@jorgecarleitao I just retested it with the latest polars version and it seems it is still not supported in the latest arrow2.

@jorgecarleitao
Collaborator

Being addressed upstream: jorgecarleitao/arrow2#961
