Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python, rust): arrow large/view types passthrough, rust default engine #2738

Merged
merged 5 commits into from
Aug 10, 2024

Conversation

ion-elgreco
Copy link
Collaborator

@ion-elgreco ion-elgreco commented Aug 6, 2024

Description

Allows large/view types to be passed through during write, and prevents unnecessary potentially costly casting that could fail.

In Pyarrow engine only normal/large modes can be used, in Rust engine we always passthrough since we allow passthrough throughout the codebase now, this notion can be removed once pyarrow engine is fully deprecated.

Can be merged after: #2727

Related issues

@github-actions github-actions bot added binding/python Issues for the Python package binding/rust Issues for the Rust crate labels Aug 6, 2024
@ion-elgreco ion-elgreco changed the title feat(rust): arrow large/view types passthrough feat(python, rust): arrow large/view types passthrough Aug 6, 2024
@ion-elgreco ion-elgreco added this to the python v0.19 milestone Aug 7, 2024
@ion-elgreco ion-elgreco marked this pull request as draft August 7, 2024 00:30
@ion-elgreco
Copy link
Collaborator Author

@rtyler still a couple test failures, so will take another round tomorrow on this!

@ion-elgreco ion-elgreco marked this pull request as ready for review August 7, 2024 13:14
@ion-elgreco ion-elgreco changed the title feat(python, rust): arrow large/view types passthrough feat(python, rust): arrow large/view types passthrough, rust default engine Aug 7, 2024
@ion-elgreco
Copy link
Collaborator Author

ion-elgreco commented Aug 7, 2024

@aersam hey! Could you perhaps take a look on the refactored casting logic :) I reintroduce some old code since we now have to allow large/view types passthrough

Cargo.toml Outdated Show resolved Hide resolved
@ion-elgreco ion-elgreco enabled auto-merge August 7, 2024 15:15
@ion-elgreco ion-elgreco force-pushed the feat/type_passthrough_possible branch 2 times, most recently from 98cef9b to e04661d Compare August 7, 2024 17:09
Copy link
Collaborator

@roeap roeap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this I feel our conversion code is getting a bit excessive and we have to eventually step back and see if we can find a cleaner solution to all of this - I do not however have a better idea ready :).

My main question would be what now our minimal python / pyarrow support would look like as we are not testing this anymore? Should there not be some test for this?

.github/workflows/python_build.yml Outdated Show resolved Hide resolved
crates/core/src/kernel/arrow/mod.rs Outdated Show resolved Hide resolved
@roeap roeap disabled auto-merge August 7, 2024 21:05
@ion-elgreco
Copy link
Collaborator Author

ion-elgreco commented Aug 7, 2024

Looking at this I feel our conversion code is getting a bit excessive and we have to eventually step back and see if we can find a cleaner solution to all of this - I do not however have a better idea ready :).

My main question would be what now our minimal python / pyarrow support would look like as we are not testing this anymore? Should there not be some test for this?

Yup.. same goes for the writer. Also now with this passthrough change we allow more flexibility between utf8/binary/list flavours, but we are now less flexible on writing as a side effect. So a batch that is int64 to a table with int32 for example. Before this "might" have worked, however now it won't because we can't merge those schema's

Not sure what to do here, we can add this functionality in the schema merge to simply check if something is a supertype but it doesn't sit right

@ion-elgreco
Copy link
Collaborator Author

Hmm I'll put this back in draft and think of a redesign.

@ion-elgreco ion-elgreco marked this pull request as draft August 7, 2024 23:05
@ion-elgreco ion-elgreco marked this pull request as ready for review August 8, 2024 21:28
@ion-elgreco ion-elgreco requested review from hntd187 and roeap August 8, 2024 21:28
@ion-elgreco ion-elgreco force-pushed the feat/type_passthrough_possible branch 3 times, most recently from 0df9df4 to 9db4e16 Compare August 9, 2024 11:01
rtyler
rtyler previously approved these changes Aug 10, 2024
Copy link
Member

@rtyler rtyler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've only partially reviewed this but recognize the work @ion-elgreco has put in. This also touches a lot of other stuff we're currently looking at, so I would rather iterate on it in main to avoid monster conflicts should more tuning be needed

@rtyler rtyler enabled auto-merge August 10, 2024 13:11
@ion-elgreco ion-elgreco force-pushed the feat/type_passthrough_possible branch from 9db4e16 to 002fbe4 Compare August 10, 2024 16:30
Copy link
Collaborator

@roeap roeap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking good - left some very minor questions, thus did not want to hit approve, as that would auto-merge stuff, and wanted to leave the option to comment on these ...

.github/workflows/python_build.yml Outdated Show resolved Hide resolved
Comment on lines +311 to +314
let input_schema = match schema {
Some(schema) => schema,
None => snapshot.input_schema()?,
};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is interesting, since there seem to be some issues with the wrapped schema also in the pruning logic, which I have yet to fully grok. Why did we need to change here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we take the arrow_schema it was always wrapping partition values in dictionary, somewhere down the line where it was going to do divide by partition values, it would fail because a slice to get a value from a DictionaryArray will always return None

Comment on lines +30 to +33
pub(crate) fn merge_delta_type(
left: &DeltaDataType,
right: &DeltaDataType,
) -> Result<DeltaDataType, ArrowError> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this may come in hayndy when we do type widening as well :).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:)

Comment on lines +20 to +22
NORMAL = "NORMAL"
LARGE = "LARGE"
PASSTHROUGH = "PASSTHROUGH"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we add some comments on what the behavior of these modes is?

@rtyler rtyler added this pull request to the merge queue Aug 10, 2024
Merged via the queue into delta-io:main with commit da824cd Aug 10, 2024
21 checks passed
@ion-elgreco ion-elgreco mentioned this pull request Dec 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/python Issues for the Python package binding/rust Issues for the Rust crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rust writer incorrectly casts Large Arrow to normal arrow types
4 participants