feat: introduce CDC write-side support for the Update operations #2486
I think it's better to disable it until all operations (delete and merge) are supported; otherwise we cannot push any Python releases until those are added.
How would you disable this? It doesn't make sense to me to include short-term configuration or feature flags. The protocol states that when the change data feed table feature is enabled, writers can optionally produce CDC files. For now, our writers will just optionally create them only on updates 😆
#[cfg(feature = "cdf")]
{
    writer_features.insert(WriterFeatures::ChangeDataFeed);
    writer_features.insert(WriterFeatures::GeneratedColumns);
Is this needed? I don't remember generated columns being required
@ion-elgreco writer versions 4-6 require this 😦 The protocol says: "If the table is on a Writer Version starting from 4 up to 6, Generated Columns are always supported."
Writer versions before 7 are annoying, and they're annoying after 7 too 🤣
But we are not adding the support here, right?
The protocol states this: "The value of delta.generationExpression SHOULD be parsed as a SQL expression.
Writers MUST enforce that any data writing to the table satisfy the condition (<value> <=> <generation expression>) IS TRUE. <=> is the NULL-safe equal operator which performs an equality comparison like the = operator but returns TRUE rather than NULL if both operands are NULL"
We might only support v7 with CDF at this stage
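To make the quoted `<=>` semantics concrete, here is a minimal Rust sketch that models SQL NULL as `Option::None`. The function names `sql_eq`, `null_safe_eq`, and `satisfies_generation` are hypothetical and are not part of delta-rs or DataFusion; this only illustrates the difference between `=` and `<=>`, not how a writer would actually evaluate a generation expression.

```rust
/// SQL `=`: the result is NULL (modelled as `None`) when either operand is NULL.
fn sql_eq<T: PartialEq>(a: Option<T>, b: Option<T>) -> Option<bool> {
    match (a, b) {
        (Some(x), Some(y)) => Some(x == y),
        _ => None,
    }
}

/// SQL `<=>`: NULL-safe equality, which always yields TRUE or FALSE.
fn null_safe_eq<T: PartialEq>(a: Option<T>, b: Option<T>) -> bool {
    match (a, b) {
        // Both operands NULL: TRUE rather than NULL.
        (None, None) => true,
        (Some(x), Some(y)) => x == y,
        // Exactly one operand NULL: FALSE.
        _ => false,
    }
}

/// Hypothetical per-value check: the written value must be NULL-safe equal
/// to the result of the generation expression (assumed to be computed elsewhere).
fn satisfies_generation<T: PartialEq>(written: Option<T>, generated: Option<T>) -> bool {
    null_safe_eq(written, generated)
}
```

With plain `=`, a row where both the written value and the generated value are NULL would evaluate to NULL and fail an `IS TRUE` check; with `<=>` it passes.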
Doing further acceptance testing, I have identified what I believe to be a bug in DataFusion and will put this into Draft until I can figure out the path forward.
In discussion with @ion-elgreco: because of apache/datafusion#10749, which is really an issue with arrow-rs, we decided that we can move forward without struct/list CDC working with the following conditions:
8e4b248 to 58e4d60
Can you fix that failing Python test? Just replace the pandas code with the equivalent pyarrow code.
Then we can merge!
This change introduces a `CDCTracker` which helps collect changes during merges and update. This is admittedly rather inefficient, but my hope is that this provides a place to start iterating and improving upon the writer code.

There is still additional work which needs to be done to handle table features properly for other code paths (see the middleware discussion we have had in Slack), but this produces CDC files for Update operations.

Fixes delta-io#604
Fixes delta-io#2095
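As a rough illustration of what collecting changes for an update involves, the sketch below compares the rows before and after an update and tags the differences with the `update_preimage` and `update_postimage` change types used by Delta change data feed. Everything here is an assumption made for illustration: the `Row` alias, the `ChangeRow` struct, and `collect_update_changes` are hypothetical names, and the naive set-difference comparison is not the `CDCTracker` implementation from this PR, which operates on DataFusion data rather than in-memory tuples.

```rust
/// Hypothetical row shape (id, value); a stand-in for real record batches.
type Row = (i64, String);

/// A row tagged with its change type, mirroring the `_change_type` column idea.
#[derive(Debug, Clone, PartialEq)]
struct ChangeRow {
    row: Row,
    change_type: &'static str,
}

/// Naive set-difference sketch: rows present only in `before` become pre-images,
/// rows present only in `after` become post-images. Unchanged rows are skipped.
/// This is quadratic in the number of rows and is only meant to show the idea.
fn collect_update_changes(before: &[Row], after: &[Row]) -> Vec<ChangeRow> {
    let mut changes = Vec::new();
    for row in before {
        if !after.contains(row) {
            changes.push(ChangeRow { row: row.clone(), change_type: "update_preimage" });
        }
    }
    for row in after {
        if !before.contains(row) {
            changes.push(ChangeRow { row: row.clone(), change_type: "update_postimage" });
        }
    }
    changes
}
```

A row that an update leaves untouched appears in both inputs and therefore produces no change rows, which is why only the modified row shows up twice in the output, once per image.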
This test has highlighted an apparent race condition when handling structs or lists in how excerpt() is treated by the CDCObserver.
Basically, for older minWriterVersions we don't really have to worry about generated columns unless an expression has been set, in which case we must fail to write since we cannot honor generationExpression.
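The rule described in this commit message could look roughly like the following sketch: scan the schema for a `delta.generationExpression` metadata key and refuse to write if one is found. The `Field` struct, `WriteCheckError` enum, and `check_generated_columns` function are hypothetical simplifications, not the delta-rs types or the code in this PR; only the metadata key name comes from the protocol text quoted earlier in the thread.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified schema field; not the delta-rs `StructField`.
struct Field {
    name: String,
    metadata: HashMap<String, String>,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum WriteCheckError {
    GeneratedColumnsUnsupported { column: String },
}

/// Metadata key named in the protocol quote above.
const GENERATION_EXPRESSION_KEY: &str = "delta.generationExpression";

/// Fail the write if any column carries a generation expression, since this
/// sketch assumes the writer cannot evaluate and enforce such expressions.
fn check_generated_columns(fields: &[Field]) -> Result<(), WriteCheckError> {
    for field in fields {
        if field.metadata.contains_key(GENERATION_EXPRESSION_KEY) {
            return Err(WriteCheckError::GeneratedColumnsUnsupported {
                column: field.name.clone(),
            });
        }
    }
    // No generation expressions set: nothing to enforce, so writing is safe.
    Ok(())
}
```

Failing early with the offending column name keeps the writer from silently producing data that violates a generation expression it cannot check.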
90c1f83 to 87c01cc
🎉🎉