Write support for additional Arrow datatypes #1044
Signed-off-by: Chitral Verma <[email protected]>
@chitralverma I would recommend writing a unit test for the Python writer to be sure. IIRC we validate the schema, so we might need a modified equality function that ignores the large string difference.
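A rough sketch of what such a relaxed comparison could look like. It is an assumption-laden illustration, not the writer's actual validation code: schemas are modelled as lists of `(field_name, type_name)` pairs with plain strings standing in for Arrow types, and the names `normalize_type` and `schemas_equal_ignoring_large` are hypothetical.

```python
# Hypothetical helper: type names are plain strings here, not real Arrow types.
LARGE_TO_REGULAR = {
    "large_string": "string",
    "large_binary": "binary",
    "large_list": "list",
}


def normalize_type(type_name):
    """Map a 'large' type name to its regular counterpart; leave others alone."""
    return LARGE_TO_REGULAR.get(type_name, type_name)


def schemas_equal_ignoring_large(left, right):
    """Compare two schemas given as lists of (field_name, type_name) pairs."""
    if len(left) != len(right):
        return False
    return all(
        l_name == r_name and normalize_type(l_type) == normalize_type(r_type)
        for (l_name, l_type), (r_name, r_type) in zip(left, right)
    )


table_schema = [("id", "int64"), ("name", "string")]
data_schema = [("id", "int64"), ("name", "large_string")]
same = schemas_equal_ignoring_large(table_schema, data_schema)
```

Normalizing both sides before comparing keeps the check strict for everything except the large/regular distinction, so a genuine mismatch such as `int64` versus `int32` is still rejected.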
+1 to what @wjones127 said :) Thanks @chitralverma for taking a stab at this.
Cool, I'll add some tests on the Python side.
@chitralverma Any updates on the Python tests? I'm happy to help if you are too busy.
@chitralverma any news? This would enable a delta-writer in polars as well. 👍
I will work on the pending tests for this and rebase the changes soon. Thanks for the reminder; I was completely swamped with work, so I couldn't continue with the delta writer for polars and the delta sharing source.
…usion's CREATE EXTERNAL TABLE (delta-io#1043)

# Description

We've recently added Delta table support to [Seafowl](https://github.com/splitgraph/seafowl) using delta-rs, which utilizes the new `OPTIONS` clause in sqlparser/DataFusion. It allows propagating a set of key/values down to the `DeltaTableBuilder`, which in turn can use those to instantiate a corresponding object store client. This means someone can now define a delta table without relying on env vars as:

```sql
CREATE EXTERNAL TABLE my_delta
STORED AS DELTATABLE
OPTIONS ('AWS_ACCESS_KEY_ID' 'secret', 'AWS_SECRET_ACCESS_KEY' 'also_secret', 'AWS_REGION' 'eu-west-3')
LOCATION 's3://my-bucket/my-delta-table/'
```

I've also changed the existing datafusion integration tests to use this approach to exercise it. I'm not sure whether it makes sense to merge this PR upstream, but opening this PR just in case it does.

# Related Issue(s)

Didn't find any related issues.

# Documentation
# Description

Integrating with polars requires the `DeltaStorageHandler` to be serializable with pickle. This PR implements the required dunder methods to make it so. Unfortunately, we lose the ability to instantiate the `DeltaStorageHandler` with an existing object store, but I believe this is not a critical loss.

cc @chitralverma @ritchie46

# Related Issue(s)

closes delta-io#1015

# Documentation

<!--- Share links to useful documentation --->
# Description

~~This PR updates datafusion and related dependencies to their latest versions. Since datafusion now has improved support for loading partition columns with non string types, we update our scan methods to take advantage of that.~~

While working on dependencies, I took the opportunity to do some housekeeping.

- do not use chrono with default features
- make `aws-profile` from object_store optional. The upstream crate explicitly discourages its usage, and it brings quite a few new dependencies, as it pulls in some of the AWS SDK.
- rename the `datafusion-ext` feature to `datafusion`. The ext suffix dates from a time when there were fewer options for defining features. I kept the old feature around as an alias.

# Related Issue(s)

closes delta-io#914

# Documentation

<!--- Share links to useful documentation --->

Co-authored-by: R. Tyler Croy <[email protected]>
# Description

Update GitHub Actions to avoid warnings and deprecations. Unfortunately there is no updated version of `actions-rs/toolchain` (yet?), but at least some warnings will go away.

# Related Issue(s)

closes delta-io#978

# Documentation

<!--- Share links to useful documentation --->
# Description Bump version for new release in crates.io # Related Issue(s) blocks delta-io#973
# Description

Moving the `vacuum` operation into the operations module and adopting `IntoFuture` for the command builder. This breaks the APIs for the builder (now with consistent setter names), but we are able to keep the APIs for `DeltaTable` in Rust and Python.

In a follow-up I would like to move the optimize command as well. This, however, may require refactoring the `PartitionValue`, since we can only deal with `static` lifetimes when using `IntoFuture`. A while back we talked about pulling in `ScalarValue` from datafusion to optimize that implementation, and maybe that's a good opportunity to look into that as well.

# Related Issue(s)

<!--- For example: - closes delta-io#106 --->

# Documentation

<!--- Share links to useful documentation --->

Co-authored-by: Will Jones <[email protected]>
Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.23.0 to 1.23.1. - [Release notes](https://github.com/tokio-rs/tokio/releases) - [Commits](tokio-rs/tokio@tokio-1.23.0...tokio-1.23.1) --- updated-dependencies: - dependency-name: tokio dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]>
…o#1058)

# Description

The current Python wrapper has no functionality to create checkpoints. This PR exposes the Rust functionality that creates a checkpoint at the current table version.

# Documentation

Sample usage:

```python
delta_table = DeltaTable(some_path)
# apply actions...
delta_table.create_checkpoint()
```

Co-authored-by: Ilya Moshkov <[email protected]>
Co-authored-by: Will Jones <[email protected]>
# Description

Exposes a function to get a dataframe of add actions for a selected version of the table.

TODO:

* [x] add unit tests
* [x] write user guide
* [x] handle partition columns
* [x] handle stats
* [x] handle tags
* [x] add a `flatten` option

# Related Issue(s)

- closes delta-io#1031

# Documentation

<!--- Share links to useful documentation --->
# Description

Recently we moved some of our storage configuration via a property bag upstream to the object_store crate. This allows us to simplify our configuration handling here and make S3 configuration consistent with Azure and GCP.

I think as a follow-up it would be great to migrate dynamodb_lock to using the official SDKs as well, and then see what we still need from the S3 storage options.

# Related Issue(s)

closes delta-io#999

# Documentation

<!--- Share links to useful documentation --->

Co-authored-by: Will Jones <[email protected]>
# Description

This PR contains some improvements and refactoring for handling storage locations.

- Removes the `StorageLocation` struct (a left-over from previous clean up)
- allows for creating tables using local file paths (including relative)
- persists options during serialization (this will not work for custom storage backends, but still extends what the previous approach could do)
- adopts `PrefixObjectStore` from upstream crate in favour of maintaining that logic here.
- run `cargo clippy --fix` on `/rust`

# Related Issue(s)

Closes delta-io#998

# Documentation

<!--- Share links to useful documentation --->
…ypes # Conflicts: # python/tests/test_writer.py
Signed-off-by: Chitral Verma <[email protected]>
Signed-off-by: Chitral Verma <[email protected]>
Signed-off-by: Chitral Verma <[email protected]>
@wjones127 I have fixed the linting issues and updated the branch with main. I also added support for uint* and float16. Now only the following types remain. Any ideas for these? I don't think we can map all of them.
Signed-off-by: Chitral Verma <[email protected]>
From the above remaining types, here is a proposed strategy,
Any suggestions on this?
Personally, I'd suggest "Error as not supported" for all of those. That requires the user to explicitly acknowledge that they are writing unsupported types, and cast to supported types themselves.
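To make the "error as not supported" idea concrete, here is a minimal sketch of a pre-write check. Everything in it is hypothetical: the names `UnsupportedTypeError` and `check_supported`, the plain-string type names, and the placeholder type `some_unmapped_type` are illustrations only and do not reflect which types the writer actually rejects.

```python
class UnsupportedTypeError(TypeError):
    """Raised when a schema contains types with no supported mapping."""


def check_supported(fields, supported):
    """Raise if any (field_name, type_name) pair uses a type not in `supported`.

    All offending fields are reported together so the caller can cast
    them in one pass before retrying the write.
    """
    unsupported = [(name, type_name) for name, type_name in fields
                   if type_name not in supported]
    if unsupported:
        details = ", ".join(f"{name}: {type_name}" for name, type_name in unsupported)
        raise UnsupportedTypeError(
            f"Unsupported types, cast them to a supported type before writing: {details}"
        )


# Illustrative values only; the real supported set lives in the writer.
supported_types = {"string", "int64"}
fields = [("id", "int64"), ("extra", "some_unmapped_type")]

try:
    check_supported(fields, supported_types)
    message = None
except UnsupportedTypeError as err:
    message = str(err)
```

Failing loudly rather than silently coercing keeps the decision about how to cast with the user, which is the point made in the comment above.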
In that case, the PR is ready.
Signed-off-by: Chitral Verma <[email protected]>
Thanks for sticking with this @chitralverma
I have a few changes suggested. Once addressed I think this is good to merge.
Co-authored-by: Will Jones <[email protected]>
Signed-off-by: Chitral Verma <[email protected]>
Signed-off-by: Chitral Verma <[email protected]>
@wjones127 Thanks for the review comments, I have made the requested changes.
Some tests timed out.
Failures are unrelated. GitHub is being rather weird today.
Thanks @wjones127
Description
Added the missing mappings from the following Arrow types to Delta types:

- `LargeUtf8` (LargeString) -> `string`
- `LargeBinary` -> `binary`
- `FixedSizeBinary(_)` -> `binary`
- `LargeList(_)` -> `array`
- `UInt8` -> `byte`
- `UInt16` -> `short`
- `UInt32` -> `int`
- `UInt64` -> `long`
- `Date64` -> `date`
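For reference, the mappings above can be restated as a small lookup table. This is only a Python illustration of the list in this description, not the actual implementation in the PR; the names `ARROW_TO_DELTA` and `delta_type_for` are hypothetical, and type names are handled as plain strings.

```python
# Mappings copied from the description above; keys are Arrow type names
# without their parameters, values are Delta type names.
ARROW_TO_DELTA = {
    "LargeUtf8": "string",
    "LargeBinary": "binary",
    "FixedSizeBinary": "binary",
    "LargeList": "array",
    "UInt8": "byte",
    "UInt16": "short",
    "UInt32": "int",
    "UInt64": "long",
    "Date64": "date",
}


def delta_type_for(arrow_type):
    """Return the Delta type name for an Arrow type name such as 'FixedSizeBinary(16)'."""
    # Drop any parameter list, e.g. "(16)" or "(_)", before the lookup.
    base = arrow_type.split("(", 1)[0].strip()
    try:
        return ARROW_TO_DELTA[base]
    except KeyError:
        raise ValueError(f"No mapping listed for Arrow type {arrow_type!r}") from None


mapped = [delta_type_for(t) for t in ("LargeUtf8", "FixedSizeBinary(16)", "UInt64")]
```

Types that are not in the table raise a `ValueError` here, in line with the "error as not supported" approach agreed on in the conversation.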
Related Issue(s)
closes #1024