feat(python): expose rust writer as additional engine v2 #1891

Merged

ion-elgreco
Collaborator

@ion-elgreco ion-elgreco commented Nov 20, 2023

Description

  • Adds rust writer as additional engine in python
  • Adds overwrite schema functionality to the rust writer. @roeap feel free to point out improvements 😄

A couple of gaps will remain between the current Rust writer and the pyarrow writer. We will have to address these in a later PR:

  • Replacewhere (partition filter / predicate) overwrite
    (users can work around this by calling DeltaTable.delete and then appending)

Related Issue(s)

ion-elgreco and others added 22 commits November 18, 2023 15:38
Exposes added `convert to delta` functionality by @junjunjd to Python
API.

- closes delta-io#1767

---------

Co-authored-by: Robert Pack <[email protected]>
# Description
This refactors the merge operation to use DataFusion's DataFrame and
LogicalPlan APIs

The NLJ is eliminated and the query planner can pick the optimal join
operator. This also enables the operation to use multiple threads and
should result in significant speed up.
Merge is still limited to using a single thread in some areas. When
collecting benchmarks, I encountered multiple OoM issues with
Datafusion's hash join implementation. There are multiple tickets
upstream open regarding this. For now, I've limited the number of
partitions to just 1 to prevent this.

Predicates passed as SQL are also easier to use now. Previously, manual
casting was required to ensure data types were aligned. Now the logical
plan performs type coercion when optimizing the plan.

# Related Issues
- enhances delta-io#850
- closes delta-io#1790 
- closes delta-io#1753
# Description
Implements benchmarks that are similar to Spark's Delta benchmarks.

This gives us a standard benchmark for measuring improvements to merge,
and some pieces can be factored out to build a framework for
benchmarking Delta workflows.
@github-actions github-actions bot added binding/python Issues for the Python package binding/rust Issues for the Rust crate crate/core labels Nov 20, 2023
@rtyler
Member

rtyler commented Nov 21, 2023

@ion-elgreco should #1872 be closed with this being its replacement?

@ion-elgreco
Collaborator Author

@rtyler yes will do that!

r3stl355 and others added 12 commits November 25, 2023 09:38
Signed-off-by: Nikolay Ulmasov <[email protected]>
# Description

Adds docs on how to append, overwrite, delete rows, and Z Order Delta
tables.

Will add much more detailed pages in the future. Just getting the
high-level skeleton of the docs developed.
…#1836)

# Description
get_actions wrongly assumes that partition_columns from schema and
partitionValues from log must be the same. This is not true since
partition_columns are logical column names while partitionValues are
physical column names.
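
The distinction above can be illustrated with a minimal sketch. It assumes a hypothetical `resolve_partition_values` helper and plain `HashMap`s standing in for the column mapping and the `partitionValues` entry of an add action; none of these names come from the delta-rs code base.

```rust
use std::collections::HashMap;

/// Hypothetical helper: looks up each logical partition column in
/// `partition_values`, which is keyed by PHYSICAL column name.
/// Columns without a mapping entry are assumed to use the same name for both.
fn resolve_partition_values(
    partition_columns: &[String],
    logical_to_physical: &HashMap<String, String>,
    partition_values: &HashMap<String, Option<String>>,
) -> Vec<(String, Option<String>)> {
    partition_columns
        .iter()
        .map(|logical| {
            // Translate the logical name before touching the log entry.
            let physical = logical_to_physical.get(logical).unwrap_or(logical);
            let value = partition_values.get(physical).cloned().flatten();
            (logical.clone(), value)
        })
        .collect()
}

fn main() {
    let partition_columns = vec!["date".to_string()];

    // Illustrative mapping; the physical name is made up for this example.
    let mut logical_to_physical = HashMap::new();
    logical_to_physical.insert("date".to_string(), "col-1234".to_string());

    let mut partition_values = HashMap::new();
    partition_values.insert("col-1234".to_string(), Some("2023-11-20".to_string()));

    let resolved =
        resolve_partition_values(&partition_columns, &logical_to_physical, &partition_values);
    println!("{:?}", resolved);
}
```

Looking up `"date"` directly in `partition_values` would find nothing here, which is the kind of mismatch the fix addresses.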

Tests pending

# Related Issue(s)

- closes delta-io#1835

# Documentation

https://github.com/delta-io/delta/blob/master/PROTOCOL.md#writer-requirements-for-column-mapping
"Track partition values and column level statistics with the physical
name of the column in the transaction log."

---------

Co-authored-by: Will Jones <[email protected]>
…iter/merge (delta-io#1820)

This ports some functionality that @stinodego and I had worked on in
Polars, where we converted a pyarrow schema to a compatible delta
schema. It converts the following:

- uint -> int
- timestamp(any timeunit) -> timestamp(us)

I adjusted the functionality to do schema conversion from large to
normal types when necessary, which is still needed in MERGE as a
workaround for delta-io#1753.
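
As a rough illustration of the conversion rules listed above, here is a sketch over a hypothetical, simplified `DataType` enum. The actual change operates on pyarrow schemas in Python; the same-width mapping (for example `UInt8` to `Int8`) and the `LargeUtf8` case are assumptions made for this example.

```rust
// Hypothetical, simplified type model; not the pyarrow or arrow-rs types.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum TimeUnit {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int8,
    Int16,
    Int32,
    Int64,
    UInt8,
    UInt16,
    UInt32,
    UInt64,
    Timestamp(TimeUnit),
    Utf8,
    LargeUtf8,
}

/// Maps a type to a Delta-compatible one following the rules in the text:
/// unsigned ints become signed ints, and any timestamp unit becomes microseconds.
fn to_delta_compatible(data_type: &DataType) -> DataType {
    match data_type {
        // Assumption: same-width mapping.
        DataType::UInt8 => DataType::Int8,
        DataType::UInt16 => DataType::Int16,
        DataType::UInt32 => DataType::Int32,
        DataType::UInt64 => DataType::Int64,
        DataType::Timestamp(_) => DataType::Timestamp(TimeUnit::Microsecond),
        // "Large to normal" conversion, shown for strings only.
        DataType::LargeUtf8 => DataType::Utf8,
        other => other.clone(),
    }
}

fn main() {
    let inputs = [
        DataType::UInt32,
        DataType::Timestamp(TimeUnit::Nanosecond),
        DataType::LargeUtf8,
        DataType::Int64,
    ];
    for input in &inputs {
        println!("{:?} -> {:?}", input, to_delta_compatible(input));
    }
}
```

Note that a same-width unsigned-to-signed conversion cannot represent the upper half of the unsigned range, so values above the signed maximum would need separate handling.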

Additional things I've added:

- Schema conversion for every input in write_deltalake/merge
- Add Pandas dataframe conversion
- Add Pandas dataframe as input in merge

- closes delta-io#686
- closes delta-io#1467

---------

Co-authored-by: Will Jones <[email protected]>
Collaborator

@roeap roeap left a comment

looking good, just some minor comments :)

Comment on lines 500 to 502
metadata.schema = schema.clone().try_into().unwrap();
let metadata_action = Metadata::try_from(metadata).unwrap();
actions.push(Action::Metadata(metadata_action));
Collaborator

let's handle these unwraps.

Collaborator Author

@roeap Can you check if the change is ok? I think you want it to return the error, so I used the question mark operator.

Collaborator

yes, exactly right :)
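
The change discussed in this thread can be sketched as follows. The types here (`ArrowLikeSchema`, `DeltaLikeSchema`, `WriteError`) are hypothetical stand-ins rather than the delta-rs types; only the pattern of replacing `unwrap()` with the `?` operator is the point.

```rust
use std::convert::TryFrom;
use std::fmt;

// Hypothetical error type standing in for the crate's real error enum.
#[derive(Debug, PartialEq)]
struct WriteError(String);

impl fmt::Display for WriteError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "write error: {}", self.0)
    }
}

impl std::error::Error for WriteError {}

// Hypothetical schema types; the real ones carry far more detail.
#[derive(Debug, Clone, PartialEq)]
struct ArrowLikeSchema {
    fields: Vec<String>,
}

#[derive(Debug, PartialEq)]
struct DeltaLikeSchema {
    fields: Vec<String>,
}

impl TryFrom<ArrowLikeSchema> for DeltaLikeSchema {
    type Error = WriteError;

    fn try_from(schema: ArrowLikeSchema) -> Result<Self, Self::Error> {
        if schema.fields.is_empty() {
            return Err(WriteError("schema has no fields".to_string()));
        }
        Ok(DeltaLikeSchema { fields: schema.fields })
    }
}

/// With `unwrap()`, a failed conversion would panic. With `?`, the error is
/// returned to the caller, who can decide how to report it.
fn build_metadata_schema(schema: &ArrowLikeSchema) -> Result<DeltaLikeSchema, WriteError> {
    let converted = DeltaLikeSchema::try_from(schema.clone())?;
    Ok(converted)
}

fn main() {
    let valid = ArrowLikeSchema { fields: vec!["id".to_string()] };
    let empty = ArrowLikeSchema { fields: vec![] };
    println!("{:?}", build_metadata_schema(&valid));
    println!("{:?}", build_metadata_schema(&empty));
}
```

For `?` to compile, the enclosing function must return a `Result` whose error type can be built from the conversion error, which is why the function signature matters as much as the call site.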

python/deltalake/writer.py Outdated
.or_else(|_| this.snapshot.arrow_schema())
.unwrap_or(schema.clone());

if schema != table_schema {
Contributor

Do the schemas here have Eq or PartialEq on them? This might not be as straightforward as this.

Collaborator Author

They implement PartialEq. This piece was also reused from src/lib.rs in the python module.
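
To make the reviewer's concern concrete, here is a small sketch of what a derived `PartialEq` does on a schema-like type. `Field` and `Schema` are hypothetical, simplified structs, not the Arrow types, so this only shows how strict a structural comparison is, not how the real schema comparison behaves in every detail.

```rust
// Hypothetical, simplified schema model for illustration only.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: String,
    nullable: bool,
}

#[derive(Debug, Clone, PartialEq)]
struct Schema {
    fields: Vec<Field>,
}

fn field(name: &str, data_type: &str, nullable: bool) -> Field {
    Field {
        name: name.to_string(),
        data_type: data_type.to_string(),
        nullable,
    }
}

/// Mirrors the `schema != table_schema` check: true when the incoming schema
/// differs from the table schema under derived (structural) equality.
fn schema_changed(incoming: &Schema, table: &Schema) -> bool {
    incoming != table
}

fn main() {
    let table = Schema {
        fields: vec![field("id", "int64", false), field("name", "utf8", true)],
    };

    // Same columns in a different order.
    let reordered = Schema {
        fields: vec![field("name", "utf8", true), field("id", "int64", false)],
    };

    // Same columns, but one nullability flag differs.
    let nullability_differs = Schema {
        fields: vec![field("id", "int64", true), field("name", "utf8", true)],
    };

    println!("identical changed: {}", schema_changed(&table.clone(), &table));
    println!("reordered changed: {}", schema_changed(&reordered, &table));
    println!("nullability changed: {}", schema_changed(&nullability_differs, &table));
}
```

Under structural equality, column order and nullability both count as differences, so a schema that is logically compatible with the table can still compare as not equal.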

@ion-elgreco ion-elgreco merged commit e6ad2e0 into delta-io:main Nov 29, 2023
24 checks passed
ion-elgreco added a commit to ion-elgreco/delta-rs that referenced this pull request Dec 1, 2023
- Adds rust writer as additional engine in python
- Adds overwrite schema functionality to the rust writer. @roeap feel
free to point out improvements 😄

A couple of gaps will remain between the current Rust writer and the
pyarrow writer. We will have to address these in a later PR:
- Replacewhere (partition filter / predicate) overwrite
(users can work around this by calling DeltaTable.delete and then
appending)

- closes delta-io#1861

---------

Signed-off-by: Nikolay Ulmasov <[email protected]>
Co-authored-by: Robert Pack <[email protected]>
Co-authored-by: Robert Pack <[email protected]>
Co-authored-by: David Blajda <[email protected]>
Co-authored-by: Nikolay Ulmasov <[email protected]>
Co-authored-by: Matthew Powers <[email protected]>
Co-authored-by: Thomas Frederik Hoeck <[email protected]>
Co-authored-by: Adrian Ehrsam <[email protected]>
Co-authored-by: Will Jones <[email protected]>
Co-authored-by: Marijn Valk <[email protected]>
Development

Successfully merging this pull request may close these issues.

Add a Rust-backed engine for write_deltalake
10 participants