fix(rust/python): optimize.compact not working with tables with mixed large/normal arrow #1926

Merged

Conversation

ion-elgreco
Collaborator

@ion-elgreco ion-elgreco commented Nov 30, 2023

Description

  • Fixes `optimize.compact` failing when a table contains Parquet files with a mix of large and normal Arrow types. The fix casts the record batch to normal Arrow types.
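
To illustrate the idea behind the cast, here is a minimal sketch in plain Python. It is not the delta-rs implementation: Arrow types are modeled as plain strings, and `LARGE_TO_NORMAL` and `normalize_schema` are hypothetical names used only for this example. It shows that two file schemas that differ only in large vs. normal types become identical once the large types are mapped to their normal counterparts.

```python
# Simplified model (an assumption for illustration): Arrow types as strings.
# Hypothetical mapping from "large" Arrow type names to their normal variants.
LARGE_TO_NORMAL = {
    "large_string": "string",
    "large_binary": "binary",
}


def normalize_schema(schema):
    """Return a copy of `schema` (field name -> type name) with large
    types replaced by their normal counterparts."""
    return {name: LARGE_TO_NORMAL.get(dtype, dtype) for name, dtype in schema.items()}


# Two files of the same table, written by different writers.
file_a = {"id": "int64", "name": "string"}
file_b = {"id": "int64", "name": "large_string"}

# The raw schemas disagree, which is what breaks compaction.
assert file_a != file_b

# After normalization both files agree on one schema.
assert normalize_schema(file_a) == normalize_schema(file_b)
```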

Issues

@github-actions github-actions bot added binding/python Issues for the Python package binding/rust Issues for the Rust crate crate/core labels Nov 30, 2023
@ion-elgreco ion-elgreco marked this pull request as ready for review November 30, 2023 13:28
@ion-elgreco ion-elgreco changed the title fix(rust/python): optimize not working with tables with mixed large and normal arrow schemas fix(rust/python): optimize.compact not working with tables with mixed large and normal arrow schemas Nov 30, 2023
@ion-elgreco ion-elgreco changed the title fix(rust/python): optimize.compact not working with tables with mixed large and normal arrow schemas fix(rust/python): optimize.compact not working with tables with mixed large/normal arrow Nov 30, 2023
@ion-elgreco ion-elgreco enabled auto-merge (squash) December 1, 2023 20:56
Member

@rtyler rtyler left a comment

I don't have an opinion on the specific issue being addressed here, but making `cast_record_batch` a public API is fine with me.

@ion-elgreco ion-elgreco merged commit 18c4834 into delta-io:main Dec 2, 2023
24 checks passed
@ion-elgreco
Collaborator Author

@rtyler yeah, `cast_record_batch` is needed because some writers can write large Arrow data into a Parquet file. According to Will, the Arrow writers serialize the Arrow schema into the Parquet file's metadata, so when we re-read these files, some record batches may have the large dtypes while others have the normal ones.
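
As a rough sketch of how such a mix could be spotted, the snippet below collects, per column, the distinct type names seen across the schemas embedded in each file. This is plain Python with types modeled as strings; the file names and the `mixed_columns` helper are hypothetical and not part of delta-rs or Arrow.

```python
from collections import defaultdict


def mixed_columns(file_schemas):
    """Given {file name: {field name: type name}}, return the columns
    whose type differs between files, with the sorted types seen."""
    seen = defaultdict(set)
    for schema in file_schemas.values():
        for name, dtype in schema.items():
            seen[name].add(dtype)
    return {name: sorted(types) for name, types in seen.items() if len(types) > 1}


# Hypothetical schemas as they might be recovered from each file's metadata.
file_schemas = {
    "part-0.parquet": {"id": "int64", "name": "string"},
    "part-1.parquet": {"id": "int64", "name": "large_string"},
}

mixed = mixed_columns(file_schemas)
print(mixed)
```

Any column reported here would yield record batches with differing dtypes on re-read, which is the case the cast is meant to smooth over.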

rtyler added a commit to rtyler/delta-rs that referenced this pull request Dec 11, 2023
dependency

A dependency from optimize on the cast_record_batch function was added
which cannot be met without the `datafusion` feature enabled

See delta-io#1926
rtyler added a commit that referenced this pull request Dec 11, 2023
dependency

A dependency from optimize on the cast_record_batch function was added
which cannot be met without the `datafusion` feature enabled

See #1926
Development

Successfully merging this pull request may close these issues.

optimize.compact() fails with bad schema after updating to pyarrow 8.0
2 participants