Write support for additional Arrow datatypes #1044

Merged
merged 92 commits into delta-io:main from support-arrow-datatypes
Mar 30, 2023

Conversation

chitralverma
Contributor

@chitralverma chitralverma commented Dec 30, 2022

Description

Added the missing mappings from the following Arrow types to Delta types:

  • LargeUtf8 (LargeString) -> string
  • LargeBinary -> binary
  • FixedSizeBinary(_) -> binary
  • LargeList(_) -> array
  • UInt8 -> byte
  • UInt16 -> short
  • UInt32 -> int
  • UInt64 -> long
  • Date64 -> date
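The mapping above can be pictured as a simple lookup table. The following is only a sketch: plain strings stand in for Arrow's `DataType` variants, and `ARROW_TO_DELTA` and `delta_type_for` are hypothetical names that do not appear in the PR.

```python
# Sketch only: strings stand in for Arrow DataType variants.
# ARROW_TO_DELTA and delta_type_for are hypothetical names for illustration.
ARROW_TO_DELTA = {
    "LargeUtf8": "string",
    "LargeBinary": "binary",
    "FixedSizeBinary": "binary",
    "LargeList": "array",
    "UInt8": "byte",
    "UInt16": "short",
    "UInt32": "int",
    "UInt64": "long",
    "Date64": "date",
}


def delta_type_for(arrow_type: str) -> str:
    """Return the Delta type name for an Arrow type name such as 'FixedSizeBinary(16)'."""
    # Drop any parameters, e.g. 'FixedSizeBinary(16)' -> 'FixedSizeBinary'.
    base = arrow_type.split("(", 1)[0].strip()
    try:
        return ARROW_TO_DELTA[base]
    except KeyError:
        raise ValueError(f"no Delta mapping listed for Arrow type {arrow_type!r}") from None


print(delta_type_for("LargeUtf8"))
print(delta_type_for("FixedSizeBinary(16)"))
```

Note that the unsigned integer types map to Delta's signed types of the same width, exactly as listed above.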

Related Issue(s)

closes #1024

@chitralverma
Contributor Author

@houqp this should be enough to close #1024 right?

Signed-off-by: Chitral Verma <[email protected]>
@wjones127
Collaborator

@chitralverma I would recommend writing a unit test for the Python writer to be sure. IIRC we validate the schema, so we might need a modified equality function that ignores the large string difference.
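A modified equality check of the kind suggested here could look roughly like the sketch below. It assumes a schema is represented as a list of `(column name, type name)` pairs; `LARGE_TO_REGULAR`, `normalize`, and `schemas_equivalent` are hypothetical helpers, and treating `large_binary` like `large_string` is an extension beyond what the comment mentions.

```python
# Hypothetical helpers: a schema is a list of (column name, type name) pairs.
LARGE_TO_REGULAR = {
    "large_string": "string",
    "large_binary": "binary",  # assumption: handled the same way as large_string
}


def normalize(type_name: str) -> str:
    """Collapse a 'large' type name onto its regular counterpart."""
    return LARGE_TO_REGULAR.get(type_name, type_name)


def schemas_equivalent(left, right) -> bool:
    """Compare two schemas while ignoring the large/regular distinction."""
    if len(left) != len(right):
        return False
    return all(
        l_name == r_name and normalize(l_type) == normalize(r_type)
        for (l_name, l_type), (r_name, r_type) in zip(left, right)
    )


table_schema = [("id", "int64"), ("name", "string")]
incoming_schema = [("id", "int64"), ("name", "large_string")]
print(schemas_equivalent(table_schema, incoming_schema))
```

Column order and names are still compared strictly; only the type names are normalized.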

@houqp
Member

houqp commented Dec 30, 2022

+1 to what @wjones127 said :) Thanks @chitralverma for taking a stab at this.

@chitralverma
Contributor Author

cool, I'll add some tests on python side

@BMeyn

BMeyn commented Mar 11, 2023

@chitralverma Any updates on the Python tests? I'm happy to help if you are too busy.

@ritchie46

@chitralverma any news? This would enable a delta-writer in polars as well. 👍

@chitralverma
Contributor Author

chitralverma commented Mar 14, 2023

I will work on the tests that are pending for this and rebase the changes soon.

Thanks for the reminder. I was completely swamped with work, so I couldn't continue with the Delta writer for Polars and the Delta Sharing source.

gruuya and others added 12 commits March 17, 2023 19:47
…usion's CREATE EXTERNAL TABLE (delta-io#1043)

# Description
We've recently added Delta table support to
[Seafowl](https://github.com/splitgraph/seafowl) using delta-rs, which
utilizes the new `OPTIONS` clause in sqlparser/DataFusion. It allows
propagating a set of key/values down to the `DeltaTableBuilder`, which
in turn can use those to instantiate a corresponding object store
client. This means someone can now define a delta table without relying
on env vars as:
```sql
CREATE EXTERNAL TABLE my_delta
STORED AS DELTATABLE
OPTIONS ('AWS_ACCESS_KEY_ID' 'secret', 'AWS_SECRET_ACCESS_KEY' 'also_secret', 'AWS_REGION' 'eu-west-3') 
LOCATION 's3://my-bucket/my-delta-table/'
```

I've also changed the existing datafusion integration tests to use this
approach to exercise it.

I'm not sure whether it makes sense to merge this PR upstream, but
opening this PR just in case it does.

# Related Issue(s)
Didn't find any related issues.

# Documentation
# Description

Integrating with polars requires the `DeltaStorageHandler` to be
serializable with pickle. This PR implements the required dunder methods
to make it so.

Unfortunately we lost the ability to instantiate the
`DeltaStorageHandler` with an existing object store, however I do
believe that this is not a critical loss.
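As a rough illustration of the pickling approach described above, the sketch below serializes only plain configuration and rebuilds the underlying client on load. `StorageHandler` is a hypothetical stand-in, not the actual `DeltaStorageHandler`, and a `threading.Lock` plays the role of a native client that cannot be pickled.

```python
import pickle
import threading


class StorageHandler:
    """Hypothetical stand-in for a handler wrapping a native, non-picklable client."""

    def __init__(self, table_uri, options=None):
        self.table_uri = table_uri
        self.options = dict(options or {})
        self._client = self._connect()

    def _connect(self):
        # A lock cannot be pickled, so it is a convenient placeholder
        # for a native object store client.
        return threading.Lock()

    def __getstate__(self):
        # Serialize only the plain-data configuration, never the client.
        return {"table_uri": self.table_uri, "options": self.options}

    def __setstate__(self, state):
        # Rebuild the client from the configuration after unpickling.
        self.table_uri = state["table_uri"]
        self.options = state["options"]
        self._client = self._connect()


handler = StorageHandler("s3://my-bucket/my-delta-table/", {"AWS_REGION": "eu-west-3"})
restored = pickle.loads(pickle.dumps(handler))
print(restored.table_uri, restored.options)
```

A design like this only works if the handler can always be rebuilt from its configuration, which may be one reason why passing in an existing object store no longer fits.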

cc @chitralverma @ritchie46

# Related Issue(s)

closes delta-io#1015

# Documentation

# Description

~~This PR updates datafusion and related dependencies to their latest
versions. Since datafusion now has improved support for loading
partition columns with non string types, we update our scan methods to
take advantage of that.~~

While working on dependencies, I took the opportunity to do some
housekeeping.

- do not use chrono with default features
- make `aws-profile` from object_store optional. The upstream crate
explicitly discourages its usage, and it brings quite a few new
dependencies, as it pulls in some aws sdk.
- rename `datafusion-ext` feature to `datafusion`. The ext suffix is
still from a time when there were fewer options to define features. I
kept the old feature around as an alias.

# Related Issue(s)

closes delta-io#914

# Documentation


Co-authored-by: R. Tyler Croy <[email protected]>
# Description

Update GitHub Actions to avoid warnings and deprecations. Unfortunately
there is no updated version of `actions-rs/toolchain` (yet?), but at
least some warnings will go away.

# Related Issue(s)

closes delta-io#978

# Documentation

# Description

Bump version for new release in crates.io

# Related Issue(s)
 
blocks delta-io#973
# Description

Moving the `vacuum` operation into the operations module and adopting
`IntoFuture` for the command builder. This is breaking the APIs for the
builder (now with consistent setter names) but we are able to keep the
APIs for `DeltaTable` in rust and python.

In a follow-up I would like to move the optimize command as well. This,
however, may require refactoring the `PartitionValue`, since we can only
deal with `static` lifetimes when using `IntoFuture`. A while back we
talked about pulling in `ScalarValue` from datafusion to optimize that
implementation, and maybe that's a good opportunity to look into that as
well.
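For readers more familiar with Python, the builder-plus-`IntoFuture` pattern has a loose analogue in an object that defines `__await__`, so the configured builder itself can be awaited. The sketch below is not the delta-rs API: `VacuumBuilder`, its setters, the in-memory file map, and the one-week default retention are all assumptions made for illustration.

```python
import asyncio


class VacuumBuilder:
    """Hypothetical awaitable builder, a Python analogue of Rust's IntoFuture."""

    def __init__(self, files):
        self._files = dict(files)    # file path -> age in hours
        self._retention_hours = 168  # assumed default of one week
        self._dry_run = False

    def with_retention_hours(self, hours):
        self._retention_hours = hours
        return self

    def with_dry_run(self, dry_run):
        self._dry_run = dry_run
        return self

    async def _execute(self):
        # Files older than the retention period are candidates for removal.
        stale = sorted(p for p, age in self._files.items() if age > self._retention_hours)
        if not self._dry_run:
            for path in stale:
                del self._files[path]
        return stale

    def __await__(self):
        # Awaiting the builder runs the command it describes.
        return self._execute().__await__()


async def main():
    files = {"a.parquet": 200, "b.parquet": 10}
    removed = await VacuumBuilder(files).with_retention_hours(24).with_dry_run(True)
    print(removed)


if __name__ == "__main__":
    asyncio.run(main())
```

The consistent `with_*` setter names mirror the point made above about the builder API; the command only runs once the builder is awaited.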

# Related Issue(s)

# Documentation


Co-authored-by: Will Jones <[email protected]>
Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.23.0 to 1.23.1.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](tokio-rs/tokio@tokio-1.23.0...tokio-1.23.1)

---
updated-dependencies:
- dependency-name: tokio
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
…o#1058)

# Description
The current Python wrapper has no functionality for creating checkpoints.
This PR exposes the Rust functionality that creates a checkpoint at the
current table version.


# Documentation
Sample of usage:
```Python
from deltalake import DeltaTable

delta_table = DeltaTable(some_path)  # some_path: location of an existing Delta table
# apply actions...
delta_table.create_checkpoint()
```

Co-authored-by: Ilya Moshkov <[email protected]>
Co-authored-by: Will Jones <[email protected]>
# Description

Exposes a function to get a dataframe of add actions for a selected
version of the table.

TODO:

 * [x] add unit tests
 * [x] write user guide
 * [x] handle partition columns
 * [x] handle stats
 * [x] handle tags
 * [x] add a `flatten` option
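One way to picture what a `flatten` option might do is to turn nested fields of an add action into top-level columns with dotted names. The sketch below is a generic illustration over plain dictionaries; the `flatten` function, the sample record, and the dotted naming scheme are assumptions rather than the implementation in this PR.

```python
def flatten(record, prefix=""):
    """Flatten nested dictionaries into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested structures such as partition values or stats.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# Hypothetical add action with nested partition values and statistics.
add_action = {
    "path": "part-0.parquet",
    "partition_values": {"year": "2023"},
    "min": {"id": 1},
    "max": {"id": 9},
}
print(flatten(add_action))
```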

# Related Issue(s)

- closes delta-io#1031

# Documentation

# Description
Recently we moved some of our storage configuration via a property bag
upstream to the object_store crate. This allows us to simplify our
configuration handling here and make S3 configuration consistent with
azure and gcp.

I think as a follow up it would be great to migrate dynamodb_lock to
using the official SDKs as well, and then see what we still need from
the s3 storage options.

# Related Issue(s)

closes delta-io#999

# Documentation


Co-authored-by: Will Jones <[email protected]>
# Description

This PR contains some improvements and refactoring for handling storage
locations.

- removes the `StorageLocation` struct (a left-over from a previous
clean-up)
- allows for creating tables using local file paths (including relative)
- persists options during serialization (this will not work for custom
storage backends, but still extends what the previous approach could do)
- adopts `PrefixObjectStore` from the upstream crate in favour of
maintaining that logic here.
- runs `cargo clippy --fix` on `/rust`

# Related Issue(s)

Closes delta-io#998

# Documentation

…ypes

# Conflicts:
#	python/tests/test_writer.py
Signed-off-by: Chitral Verma <[email protected]>
@chitralverma chitralverma changed the title Support for additional Arrow datatypes Write support for additional Arrow datatypes Mar 27, 2023
@chitralverma
Contributor Author

chitralverma commented Mar 27, 2023

@wjones127 I have fixed the linting issues and updated the branch with main. I also added support for uint* and float16. Only the following types remain. Any ideas for these? I don't think we can map all of them.

        DataType::Null => {}
        DataType::Time32(_) => {}
        DataType::Time64(_) => {}
        DataType::Duration(_) => {}
        DataType::Interval(_) => {}
        DataType::Float16 => {} // parquet doesn't support this.
        DataType::Union(_, _, _) => {}
        DataType::Dictionary(_, _) => {}
        DataType::RunEndEncoded(_, _) => {}

@chitralverma
Contributor Author

chitralverma commented Mar 28, 2023

@wjones127 I have fixed the linting issues and updated the branch with main. I also added support for uint* and float16. Only the following types remain. Any ideas for these? I don't think we can map all of them.

        DataType::Null => {}
        DataType::Time32(_) => {}
        DataType::Time64(_) => {}
        DataType::Duration(_) => {}
        DataType::Interval(_) => {}
        DataType::Union(_, _, _) => {}
        DataType::Dictionary(_, _) => {}
        DataType::RunEndEncoded(_, _) => {}

For the remaining types above, here is a proposed strategy:

  • Time32 => Int32
  • Time64 => Int64
  • Duration => Int64
  • Interval => Int64
  • Null => String or Error as Not Supported
  • Union | Dictionary | RunEndEncoded => Error as Not Supported

Any suggestions on this?

@stinodego

Personally, I'd suggest "Error as not supported" for all of those. That requires the user to explicitly acknowledge that they are writing unsupported types, and cast to supported types themselves.
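The "error as not supported" approach could be sketched as a schema check that fails loudly and names the offending columns, leaving the cast to the user. Everything below is hypothetical: `UNSUPPORTED`, `UnsupportedTypeError`, and `check_schema` are illustrative names, and type names are plain strings rather than Arrow `DataType` values.

```python
# Type names taken from the list of remaining Arrow types discussed above.
UNSUPPORTED = {
    "Null", "Time32", "Time64", "Duration", "Interval",
    "Union", "Dictionary", "RunEndEncoded",
}


class UnsupportedTypeError(TypeError):
    """Raised when a schema contains a type with no Delta equivalent."""


def check_schema(fields):
    """Raise if any (column name, type name) pair uses an unsupported type."""
    bad = [(name, t) for name, t in fields if t.split("(", 1)[0] in UNSUPPORTED]
    if bad:
        details = ", ".join(f"{name}: {t}" for name, t in bad)
        raise UnsupportedTypeError(
            f"unsupported Arrow types, cast them before writing: {details}"
        )


check_schema([("id", "Int64"), ("name", "LargeUtf8")])  # passes silently

try:
    check_schema([("id", "Int64"), ("elapsed", "Duration(Second)")])
except UnsupportedTypeError as err:
    print(err)
```

Rejecting the write, instead of silently storing a `Duration` as an integer, keeps the choice of target type with the user.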

@chitralverma
Contributor Author

> Personally, I'd suggest "Error as not supported" for all of those. That requires the user to explicitly acknowledge that they are writing unsupported types, and cast to supported types themselves.

In that case, the PR is ready.
@wjones127 can you please review

Signed-off-by: Chitral Verma <[email protected]>
Collaborator

@wjones127 wjones127 left a comment


Thanks for sticking with this @chitralverma

I have a few changes suggested. Once addressed I think this is good to merge.

Review threads on python/src/schema.rs: 6, all resolved (5 outdated)
@chitralverma chitralverma requested review from wjones127 and removed request for rtyler, xianwill, houqp, mosyp, fvaleye and roeap March 29, 2023 06:55
@chitralverma
Contributor Author

> Thanks for sticking with this @chitralverma
>
> I have a few changes suggested. Once addressed I think this is good to merge.

@wjones127 Thanks for the review comments, I have made the requested changes.

@chitralverma
Contributor Author

chitralverma commented Mar 30, 2023

some tests timed out

@wjones127
Collaborator

Failures are unrelated. GitHub is being rather weird today.

@wjones127 wjones127 enabled auto-merge (squash) March 30, 2023 17:11
@wjones127 wjones127 merged commit d9920aa into delta-io:main Mar 30, 2023
@chitralverma
Contributor Author

Thanks @wjones127

@chitralverma chitralverma deleted the support-arrow-datatypes branch May 11, 2023 13:09
Labels
binding/python Issues for the Python package binding/rust Issues for the Rust crate delta-rs-crate rust
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for LargeUtf8 column type