[discussion] move delta log handling to new struct DeltaLog and harmonize operations #661

Closed
roeap opened this issue Jun 27, 2022 · 5 comments
Labels
enhancement New feature or request

Comments

@roeap
Collaborator

roeap commented Jun 27, 2022

Description

As we add more and more high-level operations (vacuum, optimize, writes, ...), our DeltaTable struct keeps growing in complexity, and maybe it's time for us to have another look at our APIs.

The main proposal would be:

  1. Move delta log handling into a new DeltaLog struct. This would likely also make things like Support file index #528 easier.
  2. Implement all operations as separate structs, much like the current optimize or the high-level write operations, potentially leveraging the upcoming into_future feature, and use a builder pattern to configure each specific operation.
  3. The DeltaTable APIs could then look somewhat like the current pyspark APIs.
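
To make item 2 more concrete, here is a minimal sketch of what "one struct per operation, configured through a builder" could look like. Everything in it is hypothetical and only for illustration: `VacuumBuilder`, `VacuumPlan`, the `with_*` methods, the `DeltaLog` stand-in, and the 7-day default are assumptions, not the existing delta-rs API. The sketch is synchronous to stay self-contained; with the into_future feature, `execute` could presumably be replaced by awaiting the builder directly.

```rust
use std::time::Duration;

/// Hypothetical stand-in for the proposed log handle (not the delta-rs API).
#[derive(Debug, Clone)]
pub struct DeltaLog {
    pub table_uri: String,
}

/// Hypothetical description of what a vacuum run would do.
#[derive(Debug, PartialEq)]
pub struct VacuumPlan {
    pub table_uri: String,
    pub retention: Duration,
    pub dry_run: bool,
}

/// One struct per operation; options are set through builder methods.
#[derive(Debug)]
pub struct VacuumBuilder {
    log: DeltaLog,
    retention: Duration,
    dry_run: bool,
}

impl VacuumBuilder {
    /// Starts from defaults; the 7-day retention is an assumed default.
    pub fn new(log: DeltaLog) -> Self {
        Self {
            log,
            retention: Duration::from_secs(7 * 24 * 60 * 60),
            dry_run: false,
        }
    }

    /// Overrides the retention period.
    pub fn with_retention(mut self, retention: Duration) -> Self {
        self.retention = retention;
        self
    }

    /// Only report what would be deleted, without deleting anything.
    pub fn with_dry_run(mut self, dry_run: bool) -> Self {
        self.dry_run = dry_run;
        self
    }

    /// Consumes the builder and returns the resolved plan.
    /// A real operation would do I/O here and would likely be async.
    pub fn execute(self) -> VacuumPlan {
        VacuumPlan {
            table_uri: self.log.table_uri,
            retention: self.retention,
            dry_run: self.dry_run,
        }
    }
}
```

Usage would then read as a chain, for example `VacuumBuilder::new(log).with_dry_run(true).execute()`, and the DeltaTable struct itself would only need thin convenience methods that hand out such builders.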

What do you think? @houqp, @rtyler, @xianwill, @wjones127, @mosyp, @fvaleye

@roeap roeap added the enhancement New feature or request label Jun 27, 2022
@Blajda
Collaborator

Blajda commented Jun 29, 2022

Hi @roeap, I'm interested in taking on the work of factoring the vacuum code out into a struct, as described in line item 2. It would also be a good opportunity to add commit information, since it is now supported.

I like the DeltaLog proposal, since it would allow us to create different DeltaTable variants. One variant could support append-only write operations without having to track the log in memory, which can be an issue for larger tables. Going further, we could also use the trait system to validate at compile time whether a particular operation requires a DeltaLog.
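
As a rough sketch of the compile-time check mentioned above, a capability trait could be implemented only by the table variants that actually hold a log, and operations that need the log could require that trait as a bound. All names here (`HasLog`, `TrackedTable`, `AppendOnlyTable`, `latest_version`) are hypothetical and chosen for illustration only; they are not part of delta-rs.

```rust
/// Hypothetical in-memory view of the log (not the delta-rs API).
#[derive(Debug)]
pub struct DeltaLog {
    pub version: i64,
}

/// Capability trait: implemented only by variants that track the log.
pub trait HasLog {
    fn log(&self) -> &DeltaLog;
}

/// Variant that keeps the log in memory.
#[derive(Debug)]
pub struct TrackedTable {
    log: DeltaLog,
}

impl TrackedTable {
    pub fn new(log: DeltaLog) -> Self {
        Self { log }
    }
}

impl HasLog for TrackedTable {
    fn log(&self) -> &DeltaLog {
        &self.log
    }
}

/// Append-only variant that deliberately does not implement `HasLog`.
#[derive(Debug, Default)]
pub struct AppendOnlyTable {
    pub pending_files: Vec<String>,
}

impl AppendOnlyTable {
    /// Queues a data file for the next commit; no log state is needed.
    pub fn append(&mut self, path: &str) {
        self.pending_files.push(path.to_string());
    }
}

/// An operation that needs the log. The `T: HasLog` bound means that
/// passing an `AppendOnlyTable` is rejected by the compiler.
pub fn latest_version<T: HasLog>(table: &T) -> i64 {
    table.log().version
}
```

With this shape, `latest_version(&tracked_table)` compiles, while `latest_version(&append_only_table)` fails with an unsatisfied trait bound instead of failing at runtime.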

@roeap
Collaborator Author

roeap commented Jun 29, 2022

@Blajda - obviously any contribution is highly welcome :).

Currently I am working on #632, where I am thinking a bit about how to structure our codebase to handle our table state / log more efficiently... One thing that keeps coming up is that it might be beneficial to pass the object store and state around, rather than the whole table. The proposed delta log would probably be more or less immutable (except for a cache, maybe)...

I guess what I am saying is, I'd be very keen on seeing an implementation for vacuum, but I am not sure yet which way we will end up agreeing on and pursuing :).

@wjones127
Collaborator

I think a reorganization is a good idea. I'd like to see some examples of how usage would change, though. Since many operations depend on having an up-to-date DeltaLog state, it's unclear how you can separate that from the top-level DeltaTable struct.

@roeap
Collaborator Author

roeap commented Jul 1, 2022

You are very much correct. Most of that thinking evolved while working on #632. Essentially, the DeltaLog would just get some config and an object store instance and handle reading / querying the log. This should align well with how the reference implementation handles the log. The actual state would remain in the existing DeltaTableState struct, the Spark equivalent being the Snapshot. One thing that is still a bit fuzzy is the transaction. If I remember correctly, Spark also exposes file (data) reading APIs on that, so it can track which files were read by a transaction. Not sure if we want to go this way as well ...
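
A minimal sketch of the split described above, under stated assumptions: the `ObjectStore` trait, `InMemoryStore`, `DeltaLogConfig`, and the fields on `DeltaTableState` are simplified, hypothetical stand-ins and not the real delta-rs or object store types. The only detail taken from the Delta protocol is the commit file naming (a 20-digit, zero-padded version followed by `.json`). The log handle only knows how to read commits; the replayed result lives in a separate state value.

```rust
use std::collections::BTreeMap;

/// Hypothetical minimal storage abstraction standing in for an object store.
pub trait ObjectStore {
    fn get(&self, path: &str) -> Option<Vec<u8>>;
}

/// In-memory store, used here only to keep the sketch self-contained.
#[derive(Default)]
pub struct InMemoryStore {
    objects: BTreeMap<String, Vec<u8>>,
}

impl InMemoryStore {
    pub fn put(&mut self, path: &str, data: &[u8]) {
        self.objects.insert(path.to_string(), data.to_vec());
    }
}

impl ObjectStore for InMemoryStore {
    fn get(&self, path: &str) -> Option<Vec<u8>> {
        self.objects.get(path).cloned()
    }
}

/// Hypothetical configuration for the log handle.
pub struct DeltaLogConfig {
    pub log_dir: String,
}

/// Simplified snapshot of the table, kept separate from the log handle.
#[derive(Debug, Default, PartialEq)]
pub struct DeltaTableState {
    pub version: Option<u64>,
    pub commits_read: usize,
}

/// Immutable handle: config plus store, with read-only log access.
pub struct DeltaLog<S: ObjectStore> {
    config: DeltaLogConfig,
    store: S,
}

impl<S: ObjectStore> DeltaLog<S> {
    pub fn new(config: DeltaLogConfig, store: S) -> Self {
        Self { config, store }
    }

    /// Commit files are named by a 20-digit, zero-padded version.
    fn commit_path(&self, version: u64) -> String {
        format!("{}/{:020}.json", self.config.log_dir, version)
    }

    /// Returns the raw bytes of one commit, if it exists.
    pub fn read_commit(&self, version: u64) -> Option<Vec<u8>> {
        self.store.get(&self.commit_path(version))
    }

    /// Replays commits from version 0 until the first missing one.
    /// A real implementation would parse actions and use checkpoints.
    pub fn snapshot(&self) -> DeltaTableState {
        let mut state = DeltaTableState::default();
        let mut version = 0u64;
        while self.read_commit(version).is_some() {
            state.version = Some(version);
            state.commits_read += 1;
            version += 1;
        }
        state
    }
}
```

Because `snapshot` takes `&self` and returns a fresh `DeltaTableState`, operations can pass the store and state around independently, which matches the "more or less immutable" log idea discussed earlier in the thread.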

I will try to finish an initial mergeable version of #632, and then try and explore some of the options we have ...

@roeap
Collaborator Author

roeap commented Feb 23, 2023

Closing this, since discussions advanced quite a bit, and this is now obsolete.

@roeap roeap closed this as completed Feb 23, 2023