-
Notifications
You must be signed in to change notification settings - Fork 421
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(python): expose rust writer as additional engine v2 #1891
feat(python): expose rust writer as additional engine v2 #1891
Conversation
Exposes added `convert to delta` functionality by @junjunjd to Python API. - closes delta-io#1767 --------- Co-authored-by: Robert Pack <[email protected]>
# Description This refactors the merge operation to use DataFusion's DataFrame and LogicalPlan APIs The NLJ is eliminated and the query planner can pick the optimal join operator. This also enables the operation to use multiple threads and should result in significant speed up. Merge is still limited to using a single thread in some area. When collecting benchmarks, I encountered multiple OoM issues with Datafusion's hash join implementation. There are multiple tickets upstream open regarding this. For now, I've limited the number of partitions to just 1 to prevent this. Predicates passed as SQL are also easier to use now. Manual casting was required to ensure data types were aligned. Now the logical plan will perform type coercion when optimizing the plan. # Related Issues - enhances delta-io#850 - closes delta-io#1790 - closes delta-io#1753
# Description Implements benchmarks that are similar to Spark's Delta benchmarks. Enable us to have a standard benchmark to measure improvements to merge and some pieces can be factored out to build a framework for bench marking delta workflows.
This reverts commit 57565b5.
@ion-elgreco should #1872 be closed with this being its replacement? |
@rtyler yes will do that! |
Signed-off-by: Nikolay Ulmasov <[email protected]>
# Description Adds docs on how to append, overwrite, delete rows, and Z Order Delta tables. Will add much more detailed pages in the future. Just getting the high-level skeleton of the docs developed.
…#1836) # Description get_actions wrongly assumes that partition_columns from schema and partitionValues from log must be the same. This is not true since partition_columns are logical column names while partitionValues are physical column names. Tests pending # Related Issue(s) - closes delta-io#1835 # Documentation https://github.com/delta-io/delta/blob/master/PROTOCOL.md#writer-requirements-for-column-mapping "Track partition values and column level statistics with the physical name of the column in the transaction log." --------- Co-authored-by: Will Jones <[email protected]>
…iter/merge (delta-io#1820) This ports some functionality that @stinodego and I had worked on in Polars. Where we converted a pyarrow schema to a compatible delta schema. It converts the following: - uint -> int - timestamp(any timeunit) -> timestamp(us) I adjusted the functionality to do schema conversion from large to normal when necessary, which is still needed in MERGE as workaround delta-io#1753. Additional things I've added: - Schema conversion for every input in write_deltalake/merge - Add Pandas dataframe conversion - Add Pandas dataframe as input in merge - closes delta-io#686 - closes delta-io#1467 --------- Co-authored-by: Will Jones <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking good, just some minor comments :)
metadata.schema = schema.clone().try_into().unwrap(); | ||
let metadata_action = Metadata::try_from(metadata).unwrap(); | ||
actions.push(Action::Metadata(metadata_action)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's handle these unwraps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@roeap Can you check if the change is ok? I think you want that it returns the error, so I used the question mark operator
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, exactly right :)
.or_else(|_| this.snapshot.arrow_schema()) | ||
.unwrap_or(schema.clone()); | ||
|
||
if schema != table_schema { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do the schemas here have Eq or PartialEq on them? This might not be as straight forward as this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It has PartialEq on them. This piece was also reused from src/lib.rs in the python module
- Adds rust writer as additional engine in python - Adds overwrite schema functionality to the rust writer. @roeap feel free to point out improvements 😄 A couple gaps will exist between current Rust writer and pyarrow writer. We will have to solve this in a later PR: - Replacewhere (partition filter / predicate) overwrite (users however can solve this by doing DeltaTabel.delete and then append) - closes delta-io#1861 --------- Signed-off-by: Nikolay Ulmasov <[email protected]> Co-authored-by: Robert Pack <[email protected]> Co-authored-by: Robert Pack <[email protected]> Co-authored-by: David Blajda <[email protected]> Co-authored-by: Nikolay Ulmasov <[email protected]> Co-authored-by: Matthew Powers <[email protected]> Co-authored-by: Thomas Frederik Hoeck <[email protected]> Co-authored-by: Adrian Ehrsam <[email protected]> Co-authored-by: Will Jones <[email protected]> Co-authored-by: Marijn Valk <[email protected]>
Description
A couple gaps will exist between current Rust writer and pyarrow writer. We will have to solve this in a later PR:
(users however can solve this by doing DeltaTabel.delete and then append)
Related Issue(s)