[RFC] Use arrow for checkpoint reading and state handling #1776

Closed
roeap opened this issue Oct 26, 2023 · 3 comments · Fixed by #2037
Labels
enhancement New feature or request

Comments

@roeap
Collaborator

roeap commented Oct 26, 2023

Description

With the switch to the updated schema / types in #1756 we also gained quite a bit of code to read checkpoints and, more generally, handle the table state as arrow record batches. In the past we have tried several times to switch to arrow-based state.

I believe now is a good time, as a lot of the required work has already been done in kernel. In this context I would also question the value of continuing to maintain the parquet2 variant of the checkpoint parsing, or rather its arrow2 equivalent. IIRC, polars is currently vendoring large parts of the arrow2 codebase, and since they are its main maintainers, there is likely to be little maintenance of arrow2 going forward.

As there is zero-copy interop between arrow and arrow2, I think we can simply remove that support from our codebase, which would, I believe, significantly reduce complexity.
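
Not part of the original proposal, but as a rough sketch of what arrow-based checkpoint reading could look like using the arrow-rs `parquet` crate (function name, error handling and batch size are illustrative, not the delta-rs implementation):

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn read_checkpoint(path: &str) -> Result<(), Box<dyn std::error::Error>> {
    // Open one checkpoint part file and build an arrow reader over it.
    let file = File::open(path)?;
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?
        .with_batch_size(8192)
        .build()?;

    // Each batch carries the checkpoint actions (add / remove / metaData /
    // protocol / txn) as top-level struct columns, ready for columnar
    // state handling without per-row deserialization.
    for batch in reader {
        let batch = batch?;
        println!("{} rows x {} columns", batch.num_rows(), batch.num_columns());
    }
    Ok(())
}
```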

Use Case

Related Issue(s)

@roeap roeap added the enhancement New feature or request label Oct 26, 2023
@ion-elgreco
Collaborator

Polars cloned pieces of arrow2 into its own codebase after the main maintainer of arrow2 stepped back; it's called polars-arrow now. So any activity on that fork will likely only happen in polars-arrow, unless folks backport things to arrow2.

@houqp
Member

houqp commented Nov 20, 2023

I agree with deleting the parquet2 support code to keep things simple given the fork. That said, for better performance I recommend carrying the columnar parsing approach from the current parquet2 implementation over to the new arrow-rs based state implementation. If we use arrow record batches to store the state and leverage the parquet crate to parse directly from parquet to arrow, we should get this for free.
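
As a hedged sketch of what "columnar parsing" could mean here: once the checkpoint is read into record batches (as in the sketch above), individual fields can be pulled straight out of the arrow columns instead of deserializing each row into an action struct. The helper below is illustrative only and assumes the column names of the checkpoint schema; it is not a delta-rs API:

```rust
use arrow::array::{Array, StringArray, StructArray};
use arrow::record_batch::RecordBatch;

/// Collect the `path` of every add action in a checkpoint batch by reading
/// the `add.path` column directly (illustrative helper, not a delta-rs API).
fn add_file_paths(batch: &RecordBatch) -> Vec<String> {
    let Some(add) = batch.column_by_name("add") else {
        return Vec::new();
    };
    let add = add
        .as_any()
        .downcast_ref::<StructArray>()
        .expect("`add` is a struct column in checkpoint files");
    let paths = add
        .column_by_name("path")
        .expect("the `add` struct has a `path` field")
        .as_any()
        .downcast_ref::<StringArray>()
        .expect("`path` is a utf8 column");

    // Rows where `add` is null belong to other action types and are skipped.
    (0..add.len())
        .filter(|&i| add.is_valid(i) && paths.is_valid(i))
        .map(|i| paths.value(i).to_owned())
        .collect()
}
```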

@roeap
Collaborator Author

roeap commented Nov 21, 2023

> I recommend taking the columnar based parsing approach

Fully agree. My thinking was to start by loading most of the state in a first pass, and then move to something I recently learned is called a PnM query - i.e. trying our best to get the protocol and metadata as efficiently as possible: scanning all commit files and then selectively getting the data from the checkpoint only if it is not present in any commit JSON file, and loading file actions only when needed ...
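
A rough sketch of that replay order, with entirely hypothetical types (the `Action` enum, the commit iterator and the checkpoint fallback are placeholders, not delta-rs APIs): walk the commit files newest to oldest, stop as soon as both actions are found, and only touch the checkpoint for whatever is still missing.

```rust
// All types here are placeholders to illustrate the replay order only.
struct Protocol;
struct Metadata;

enum Action {
    Protocol(Protocol),
    Metadata(Metadata),
    Other, // add / remove / txn actions are skipped in this first pass
}

fn protocol_and_metadata(
    commits_newest_first: impl Iterator<Item = Vec<Action>>,
    read_checkpoint: impl FnOnce() -> (Protocol, Metadata),
) -> (Protocol, Metadata) {
    let (mut protocol, mut metadata) = (None, None);
    for actions in commits_newest_first {
        for action in actions {
            match action {
                Action::Protocol(p) if protocol.is_none() => protocol = Some(p),
                Action::Metadata(m) if metadata.is_none() => metadata = Some(m),
                _ => {}
            }
        }
        if protocol.is_some() && metadata.is_some() {
            // Both found in commit JSONs - no need to open the checkpoint.
            break;
        }
    }
    match (protocol, metadata) {
        (Some(p), Some(m)) => (p, m),
        // Fall back to the checkpoint; a real implementation would only
        // fetch whichever action is still missing.
        _ => read_checkpoint(),
    }
}
```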

roeap added a commit that referenced this issue Dec 11, 2023
~~based on #1807~~

# Description

In the effort to advance protocol support and move our internal APIs
closer to the kernel library, it is advantageous to leverage the
expression handling logic from kernel, specifically for filtering actions
etc.

This PR just adds the expression definitions and evaluation logic.
Integrating it with our current codebase and basing the existing
partition handling logic on it is left for follow-up PRs to keep things
reviewable.

related: #1894, #1776
rtyler added a commit that referenced this issue Jan 23, 2024
# Description

This is still very much a work in progress; opening it up for visibility
and discussion.

I do hope that we can finally make the switch to arrow-based log
handling. Aside from the hoped-for reduction in memory footprint, I
also believe it opens us up to many future optimizations.

To make the transition we introduce two new structs (a rough sketch of the intended shapes follows below):

- `Snapshot` - a half-lazy snapshot, which only tries to get the
`Protocol` & `Metadata` actions as soon as possible. These drive all our
planning activities, and without them there is not much we can do.
- `EagerSnapshot` - an intermediary structure which eagerly loads file
actions and does log replay, serving as a compatibility layer for the
current `DeltaTable` APIs.
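
For orientation only, a sketch of the intended shapes as described above (hypothetical field layout, not the actual definitions in this PR):

```rust
use arrow::record_batch::RecordBatch;

// Placeholder action types standing in for the kernel `Protocol` / `Metadata`.
pub struct Protocol { /* reader / writer versions, table features */ }
pub struct Metadata { /* table id, schema string, partition columns, config */ }

/// Half-lazy snapshot: only `Protocol` and `Metadata` are resolved eagerly,
/// everything else is read from the log on demand.
pub struct Snapshot {
    pub version: i64,
    pub protocol: Protocol,
    pub metadata: Metadata,
    // log segment / log store handle omitted
}

/// Compatibility layer for the current `DeltaTable` APIs: file actions are
/// replayed eagerly and kept in memory as arrow record batches.
pub struct EagerSnapshot {
    pub snapshot: Snapshot,
    pub files: Vec<RecordBatch>,
}
```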

One conceptually larger change relates to how we view the
availability of information. Up until now `DeltaTableState` could be
initialized empty, containing no useful information for any code to work
with. State (snapshots) now always needs to be created valid. The thing
that may not yet be initialized is the `DeltaTable`, which now only
carries the table configuration and the `LogStore`; the state / snapshot
is optional. Consequently, all code that works against a snapshot no
longer needs to handle the case that metadata / schema etc. may not be available.
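
The same change expressed as a sketch (placeholder types, not the real delta-rs definitions):

```rust
use std::sync::Arc;

// Placeholder types for illustration only.
pub struct DeltaTableConfig;
pub struct LogStore;
/// Always valid once constructed: protocol, metadata and schema are present.
pub struct DeltaTableState;

pub struct DeltaTable {
    pub config: DeltaTableConfig,
    pub log_store: Arc<LogStore>,
    /// `None` until the table has been loaded; code holding a
    /// `DeltaTableState` never has to deal with missing metadata.
    pub state: Option<DeltaTableState>,
}
```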

This also has implications for the datafusion integration. We mostly
work against snapshots already, but we should abolish most traits
implemented for `DeltaTable`, as it does not provide (and never has
provided) the information that is at least required to execute a query.

Some larger notable changes include:

* remove `DeltaTableMetadata` and always use the `Metadata` action.
* arrow and parquet are now required, so the corresponding features were
removed. Personally I would also argue that if you cannot read checkpoints,
you cannot read delta tables :) - so hopefully users weren't relying on
arrow-free builds.

### Major follow-ups:

* (pre-0.17) review the integration with `log_store` and `object_store`.
Currently we mostly use `ObjectStore` inside the state handling;
what we really use is `head` / `list_from` / `get` - my hope is
that we end up with a single abstraction...
* test cleanup - we are currently dealing with test flakiness and have
several approaches to scaffolding tests. Since we have the
`deltalake-test` crate now, this can be reconciled.
* ...
* do more processing on borrowed data ...
* perform file-heavy operations on arrow data
* update checkpoint writing to leverage new state handling and arrow ...
* switch to exposing URL in public APIs

## Questions

* should paths be percent-encoded when written to checkpoint?

# Related Issue(s)

supersedes: #454
supersedes: #1837
closes: #1776
closes: #425 (should also be addressed in the current implementation)
closes: #288 (multi-part checkpoints are deprecated)
related: #435

# Documentation

<!---
Share links to useful documentation
--->

---------

Co-authored-by: R. Tyler Croy <[email protected]>
RobinLin666 pushed a commit to RobinLin666/delta-rs that referenced this issue Feb 2, 2024