Column order/data scrambled when reading back from ipc file (regression?) #3704

Closed
alexander-beedie opened this issue Jun 15, 2022 · 1 comment · Fixed by #3706 or #3947
Labels
bug Something isn't working

Comments

@alexander-beedie
Collaborator

alexander-beedie commented Jun 15, 2022

What language/platform are you using?

Python 3.9, macOS 12.4, Polars 0.13.46

Describe your bug.

The columns parameter of pl.read_ipc appears to have regressed in one of the recent releases: when the requested column order differs from the order in the file, the column names and data can load in an order that is neither the original order nor the requested order. (A somewhat similar issue was previously fixed by #3591.)

What are the steps to reproduce the behavior?

import polars as pl
df = pl.DataFrame(
    data = [
        ['x',123, 4.5, 'misc'],
        ['y',456,10.0,'other'],
        ['z',789,10.0,'value'],
    ],
    columns = ['a','b','c','d'],
)
print( df )
# ┌─────┬─────┬──────┬───────┐
# │ a   ┆ b   ┆ c    ┆ d     │
# │ --- ┆ --- ┆ ---  ┆ ---   │
# │ str ┆ i64 ┆ f64  ┆ str   │
# ╞═════╪═════╪══════╪═══════╡
# │ x   ┆ 123 ┆ 4.5  ┆ misc  │
# │ y   ┆ 456 ┆ 10.0 ┆ other │
# │ z   ┆ 789 ┆ 10.0 ┆ value │
# └─────┴─────┴──────┴───────┘

# save frame data to feather/ipc file
df.write_ipc( 'test.feather' )

# load back in requested (different) column order: data gets scrambled
dx = pl.read_ipc( 'test.feather', columns=['a','c','d','b'] )
print( dx )
# ┌─────┬───────┬─────┬──────┐
# │ a   ┆ c     ┆ d   ┆ b    │  << column *names* are in the requested order,
# │ --- ┆ ---   ┆ --- ┆ ---  │     but the associated column *data* is incorrect
# │ str ┆ str   ┆ i64 ┆ f64  │     
# ╞═════╪═══════╪═════╪══════╡     col 'b' should have i64 data, not f64
# │ x   ┆ misc  ┆ 123 ┆ 4.5  │     col 'c' should have f64 data, not str
# │ y   ┆ other ┆ 456 ┆ 10.0 │     col 'd' should have str data, not i64
# │ z   ┆ value ┆ 789 ┆ 10.0 │
# └─────┴───────┴─────┴──────┘

What is the actual behavior?

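The column names come back in the requested order, but the data under them belongs to other columns: 'b' holds the f64 data of 'c', 'c' holds the str data of 'd', and 'd' holds the i64 data of 'b'.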
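
To make the mismatch concrete, the printed output can be compared against the source data by value. This is a minimal sketch in plain Python (no Polars required); the `source` and `loaded` literals are transcribed by hand from the tables in this report, and `origin_of` is a hypothetical helper name, not part of any library.

```python
# Source frame, transcribed from the first table in this report.
source = {
    "a": ["x", "y", "z"],
    "b": [123, 456, 789],
    "c": [4.5, 10.0, 10.0],
    "d": ["misc", "other", "value"],
}

# Result of read_ipc(..., columns=['a','c','d','b']) as printed above:
# (column name, column values) in the order they came back.
loaded = [
    ("a", ["x", "y", "z"]),
    ("c", ["misc", "other", "value"]),
    ("d", [123, 456, 789]),
    ("b", [4.5, 10.0, 10.0]),
]


def origin_of(values, source):
    """Return the name of the source column whose values match, else None."""
    for name, column in source.items():
        if column == values:
            return name
    return None


# Map each loaded column name to the source column its data really came from.
mapping = {name: origin_of(values, source) for name, values in loaded}
print(mapping)
# {'a': 'a', 'c': 'd', 'd': 'b', 'b': 'c'}
```

Only 'a' is paired with its own data; the other three names each carry a different column's values.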

What is the expected behavior?

Each requested column should load with its own data, in the requested order.

dx = pl.read_ipc( 'test.feather', columns=['a','c','d','b'] )
print( dx )
# ┌─────┬──────┬───────┬─────┐
# │ a   ┆ c    ┆ d     ┆ b   │
# │ --- ┆ ---  ┆ ---   ┆ --- │
# │ str ┆ f64  ┆ str   ┆ i64 │
# ╞═════╪══════╪═══════╪═════╡
# │ x   ┆ 4.5  ┆ misc  ┆ 123 │
# │ y   ┆ 10.0 ┆ other ┆ 456 │
# │ z   ┆ 10.0 ┆ value ┆ 789 │
# └─────┴──────┴───────┴─────┘
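
The expectation amounts to two separate checks: the names come back in the requested order, and each name carries its own source data. Below is a minimal sketch of such a check in plain Python, assuming both frames have first been converted to plain dicts of lists (for example with `DataFrame.to_dict(as_series=False)`, if your Polars version provides it). `projection_problems` is a hypothetical helper, not a Polars function.

```python
def projection_problems(source, loaded, requested):
    """List the ways `loaded` deviates from a correct projection of `source`.

    source, loaded: dicts mapping column name -> list of values
                    (dict order is taken as the column order).
    requested:      column names in the order they were asked for.
    """
    problems = []
    # Check 1: the column names must appear in the requested order.
    if list(loaded) != list(requested):
        problems.append(f"order: got {list(loaded)}, expected {list(requested)}")
    # Check 2: every column must hold the values stored under its own name.
    for name, values in loaded.items():
        if name in source and values != source[name]:
            problems.append(f"data: column {name!r} does not hold its own values")
    return problems


source = {
    "a": ["x", "y", "z"],
    "b": [123, 456, 789],
    "c": [4.5, 10.0, 10.0],
    "d": ["misc", "other", "value"],
}
requested = ["a", "c", "d", "b"]

# The expected result shown above passes both checks.
expected = {name: source[name] for name in requested}
print(projection_problems(source, expected, requested))
# []
```

Keeping the two checks separate makes it clear whether a failure is an ordering problem, a name/data pairing problem, or both.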
alexander-beedie added the bug label Jun 15, 2022
@alexander-beedie
Collaborator Author

alexander-beedie commented Jun 21, 2022

Actually, this still seems to be misbehaving (in a slightly different way?).
Using the same DataFrame as above:

# save frame data to feather/ipc file in column-default order
df.write_ipc( 'test.feather' )

# load back in requested (different) column order
dx = pl.read_ipc( 'test.feather', columns=['a','c','d','b'] )

print( dx )
# ┌─────┬───────┬─────┬──────┐
# │ a   ┆ d     ┆ b   ┆ c    │  <<< columns not in requested order 
# │ --- ┆ ---   ┆ --- ┆ ---  │      (though associated with the correct datatype)
# │ str ┆ str   ┆ i64 ┆ f64  │ 
# ╞═════╪═══════╪═════╪══════╡
# │ x   ┆ misc  ┆ 123 ┆ 4.5  │
# ├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
# │ y   ┆ other ┆ 456 ┆ 10.0 │
# ├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
# │ z   ┆ value ┆ 789 ┆ 10.0 │
# └─────┴───────┴─────┴──────┘
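
One pattern in the printed tables may help narrow this down: in both this output and the original one, the data comes back in the order a, d, b, c, which is the inverse of the requested permutation of the file's columns. The issue does not state the cause, so this is only an observation about the output, not a diagnosis. A small plain-Python sketch of the arithmetic:

```python
file_order = ["a", "b", "c", "d"]   # order in which the frame was written
requested = ["a", "c", "d", "b"]    # order passed via `columns`

# The requested order expressed as positions into the file.
perm = [file_order.index(name) for name in requested]

# Inverse permutation: inverse[perm[i]] == i
inverse = [0] * len(perm)
for i, p in enumerate(perm):
    inverse[p] = i

# Columns of the file taken in inverse-permutation order.
observed = [file_order[i] for i in inverse]
print(perm, inverse, observed)
# [0, 2, 3, 1] [0, 3, 1, 2] ['a', 'd', 'b', 'c']
```

Since names and data are paired correctly in this second output, reordering after the read with something like `dx.select(['a','c','d','b'])` should restore the requested layout as a stopgap. That would not have helped with the original output, where the pairing itself was wrong.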
