[RFC] Use arrow for checkpoint reading and state handling #1776

Closed
roeap opened this issue Oct 26, 2023 · 3 comments · Fixed by #2037
Labels
enhancement New feature or request

Comments

@roeap
Collaborator

roeap commented Oct 26, 2023

Description

With the switch to the updated schema / types in #1756 we also gained quite a bit of code to read checkpoints and, more generally, handle the table state as arrow record batches. In the past we have tried several times to switch to arrow-based state.

I believe now is a good time, as a lot of the required work has already been done in kernel. In this context I would also question the value of continuing to maintain the parquet2 variant of the checkpoint parsing, or rather its arrow2 equivalent. IIRC, polars is currently vendoring large parts of the arrow2 codebase, and since they are its main maintainers, there is likely to be little maintenance of arrow2 going forward.

As there is zero-copy interop between arrow and arrow2, I think we can simply remove that support from our codebase, which would, I believe, significantly reduce complexity.
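
Not part of the original proposal, but as a rough sketch of what arrow-based checkpoint reading could look like using the arrow-rs `parquet` crate (function name, error handling and batch size are illustrative, not the delta-rs implementation):

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn read_checkpoint(path: &str) -> Result<(), Box<dyn std::error::Error>> {
    // Open one checkpoint part file and build an arrow reader over it.
    let file = File::open(path)?;
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?
        .with_batch_size(8192)
        .build()?;

    // Each batch carries the checkpoint actions (add / remove / metaData /
    // protocol / txn) as top-level struct columns, ready for columnar
    // state handling without per-row deserialization.
    for batch in reader {
        let batch = batch?;
        println!("{} rows x {} columns", batch.num_rows(), batch.num_columns());
    }
    Ok(())
}
```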

Use Case

Related Issue(s)

@roeap roeap added the enhancement New feature or request label Oct 26, 2023
@ion-elgreco
Collaborator

Polars cloned pieces of arrow2 into its own codebase after the main maintainer of arrow2 stepped back; it's called polars-arrow now. So any activity on that fork will likely only happen in polars-arrow, unless folks backport things to arrow2.

@houqp
Member

houqp commented Nov 20, 2023

I agree with deleting the parquet2 support code to keep things simple given the fork. That said, for better performance I recommend carrying the columnar parsing approach from the current parquet2 implementation over to the new arrow-rs based state implementation. If we use arrow record batches to store the state and leverage the parquet crate to parse directly from parquet to arrow, we should get this for free.
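
As a hedged sketch of what "columnar parsing" could mean here: once the checkpoint is read into record batches (as in the sketch above), individual fields can be pulled straight out of the arrow columns instead of deserializing each row into an action struct. The helper below is illustrative only and assumes the column names of the checkpoint schema; it is not a delta-rs API:

```rust
use arrow::array::{Array, StringArray, StructArray};
use arrow::record_batch::RecordBatch;

/// Collect the `path` of every add action in a checkpoint batch by reading
/// the `add.path` column directly (illustrative helper, not a delta-rs API).
fn add_file_paths(batch: &RecordBatch) -> Vec<String> {
    let Some(add) = batch.column_by_name("add") else {
        return Vec::new();
    };
    let add = add
        .as_any()
        .downcast_ref::<StructArray>()
        .expect("`add` is a struct column in checkpoint files");
    let paths = add
        .column_by_name("path")
        .expect("the `add` struct has a `path` field")
        .as_any()
        .downcast_ref::<StringArray>()
        .expect("`path` is a utf8 column");

    // Rows where `add` is null belong to other action types and are skipped.
    (0..add.len())
        .filter(|&i| add.is_valid(i) && paths.is_valid(i))
        .map(|i| paths.value(i).to_owned())
        .collect()
}
```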

@roeap
Collaborator Author

roeap commented Nov 21, 2023

> I recommend taking the columnar based parsing approach

Fully agree. My thinking was to start by loading most of the state in a first pass, and then move to something I recently learned is called a PnM query - i.e. trying our best to get the protocol and metadata as efficiently as possible: scanning all commit files and then selectively getting the data from the checkpoint only if it is not present in any commit JSON file, and loading file actions only when needed ...
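
A rough sketch of that replay order, with entirely hypothetical types (the `Action` enum, the commit iterator and the checkpoint fallback are placeholders, not delta-rs APIs): walk the commit files newest to oldest, stop as soon as both actions are found, and only touch the checkpoint for whatever is still missing.

```rust
// All types here are placeholders to illustrate the replay order only.
struct Protocol;
struct Metadata;

enum Action {
    Protocol(Protocol),
    Metadata(Metadata),
    Other, // add / remove / txn actions are skipped in this first pass
}

fn protocol_and_metadata(
    commits_newest_first: impl Iterator<Item = Vec<Action>>,
    read_checkpoint: impl FnOnce() -> (Protocol, Metadata),
) -> (Protocol, Metadata) {
    let (mut protocol, mut metadata) = (None, None);
    for actions in commits_newest_first {
        for action in actions {
            match action {
                Action::Protocol(p) if protocol.is_none() => protocol = Some(p),
                Action::Metadata(m) if metadata.is_none() => metadata = Some(m),
                _ => {}
            }
        }
        if protocol.is_some() && metadata.is_some() {
            // Both found in commit JSONs - no need to open the checkpoint.
            break;
        }
    }
    match (protocol, metadata) {
        (Some(p), Some(m)) => (p, m),
        // Fall back to the checkpoint; a real implementation would only
        // fetch whichever action is still missing.
        _ => read_checkpoint(),
    }
}
```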

roeap added a commit that referenced this issue Dec 11, 2023
~~based on #1807~~

# Description

In the effort to advance protocol support and move our internal APIs
closer to the kernel library, it is advantageous to leverage the
expression handling logic from kernel, specifically for filtering actions
etc.

This PR just adds the expression definitions and evaluation logic.
Integrating it with our current codebase and basing the existing
partition handling logic on it is left for follow-up PRs to keep things
reviewable.

related: #1894, #1776
rtyler added a commit that referenced this issue Jan 23, 2024
# Description

This is still very much a work in progress; opening it up for visibility
and discussion.

I do hope that we can finally make the switch to arrow-based log
handling. Aside from the hoped-for reduction in memory footprint, I
also believe it opens us up to many future optimizations.

To make the transition we introduce two new structs (a rough sketch of the intended shapes follows below):

- `Snapshot` - a half-lazy snapshot, which only tries to get the
`Protocol` & `Metadata` actions as soon as possible. These drive all our
planning activities, and without them there is not much we can do.
- `EagerSnapshot` - an intermediary structure which eagerly loads file
actions and does log replay, serving as a compatibility layer for the
current `DeltaTable` APIs.
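
For orientation only, a sketch of the intended shapes as described above (hypothetical field layout, not the actual definitions in this PR):

```rust
use arrow::record_batch::RecordBatch;

// Placeholder action types standing in for the kernel `Protocol` / `Metadata`.
pub struct Protocol { /* reader / writer versions, table features */ }
pub struct Metadata { /* table id, schema string, partition columns, config */ }

/// Half-lazy snapshot: only `Protocol` and `Metadata` are resolved eagerly,
/// everything else is read from the log on demand.
pub struct Snapshot {
    pub version: i64,
    pub protocol: Protocol,
    pub metadata: Metadata,
    // log segment / log store handle omitted
}

/// Compatibility layer for the current `DeltaTable` APIs: file actions are
/// replayed eagerly and kept in memory as arrow record batches.
pub struct EagerSnapshot {
    pub snapshot: Snapshot,
    pub files: Vec<RecordBatch>,
}
```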

One conceptually larger change relates to how we view the
availability of information. Up until now `DeltaTableState` could be
initialized empty, containing no useful information for any code to work
with. State (snapshots) now always needs to be created valid. The thing
that may not yet be initialized is the `DeltaTable`, which now only
carries the table configuration and the `LogStore`; the state / snapshot
is optional. Consequently, all code that works against a snapshot no
longer needs to handle the case that metadata / schema etc. may not be available.
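
The same change expressed as a sketch (placeholder types, not the real delta-rs definitions):

```rust
use std::sync::Arc;

// Placeholder types for illustration only.
pub struct DeltaTableConfig;
pub struct LogStore;
/// Always valid once constructed: protocol, metadata and schema are present.
pub struct DeltaTableState;

pub struct DeltaTable {
    pub config: DeltaTableConfig,
    pub log_store: Arc<LogStore>,
    /// `None` until the table has been loaded; code holding a
    /// `DeltaTableState` never has to deal with missing metadata.
    pub state: Option<DeltaTableState>,
}
```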

This also has implications for the datafusion integration. We mostly
work against snapshots already, but we should abolish most traits
implemented for `DeltaTable`, as it does not provide (and never has
provided) the information that is at least required to execute a query.

Some larger notable changes include:

* remove `DeltaTableMetadata` and always use the `Metadata` action.
* arrow and parquet are now required, so the corresponding features were
removed. Personally I would also argue that if you cannot read checkpoints,
you cannot read delta tables :) - so hopefully users weren't relying on
arrow-free builds.

### Major follow-ups:

* (pre-0.17) review the integration with `log_store` and `object_store`.
Currently we mostly use `ObjectStore` inside the state handling;
what we really use is `head` / `list_from` / `get` - my hope is
that we end up with a single abstraction...
* test cleanup - we are currently dealing with test flakiness and have
several approaches to scaffolding tests. Since we have the
`deltalake-test` crate now, this can be reconciled.
* ...
* do more processing on borrowed data ...
* perform file-heavy operations on arrow data
* update checkpoint writing to leverage new state handling and arrow ...
* switch to exposing URL in public APIs

## Questions

* should paths be percent-encoded when written to checkpoint?

# Related Issue(s)

supersedes: #454
supersedes: #1837
closes: #1776
closes: #425 (should also be addressed in the current implementation)
closes: #288 (multi-part checkpoints are deprecated)
related: #435

# Documentation

<!---
Share links to useful documentation
--->

---------

Co-authored-by: R. Tyler Croy <[email protected]>
RobinLin666 pushed a commit to RobinLin666/delta-rs that referenced this issue Feb 2, 2024