-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sqlcapture: Treat tables with omitted primary-key columns as keyless #2396
base: main
Are you sure you want to change the base?
Conversation
Apparently computed columns can be a table's primary key. Huh.
We omit columns when they're things we can't reliably capture, as is the case with SQL Server or Postgres computed columns. At least in SQL Server, but I think probably also in Postgres, it is possible to use such a column as the primary key of a table. When this happens we currently emit a nonsensical schema in which the column isn't defined as a property of the output documents, but _is_ listed as required and is suggested as the collection key. The better outcome here is to just treat the table as keyless if that happens. I've opted to put that logic in the sqlcapture catalog generation code, because it is never correct to try and do primary-key things if the table's primary key contains omitted columns. It might be smarter to eventually integrate a "is any key column generated" check into the secondary index selection logic so that if the primary key of a table is generated but there's a unique secondary index that isn't we use that. But that seemed like a lot of additional complexity for a very niche edge case, so I'm just going with the simple fix for a situation that blatantly cannot ever be correct here.
var suggestedCollectionKey = table.PrimaryKey | ||
var suggestedKeyHasOmittedColumn = false | ||
for _, key := range suggestedCollectionKey { | ||
if table.Columns[key].OmitColumn { | ||
suggestedKeyHasOmittedColumn = true | ||
break | ||
} | ||
} | ||
if suggestedKeyHasOmittedColumn { | ||
suggestedCollectionKey = nil | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a bandaid on top of an interface problem: should we expect the underlying DiscoverTables
to not return an omitted primary key as table.PrimaryKey
? that seems to make more sense to me
In connectors' implementations of DiscoverTables we should check before setting PrimaryKey:
if !info.Columns[key].Omitted {
info.PrimaryKey = key
}
(it needs to be actually a loop as it is here, there)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general the connector-specific discovery logic is expected to faithfully report what the database tables look like without much "editorializing". That's why we omit tables and columns by marking a "omit this from the generated schema" flag instead of actually leaving them out of the discovery results, for instance.
The rationale is that we use this information for other purposes inside the connector, and often we're setting ourselves up for failure if we lie about what the database actually told us. Also we log the complete discovery results for every table at debug level, which is actually very useful information to have and in that context there's an important distinction between "the table has a primary key which happens to include generated columns" and "the table actually has no declared primary key at all".
(In fact the connector logic which selects a unique secondary index as a primary key when no actual primary key exists is a violation of this principle, and we should probably fix that at some point. But in practice that's a pretty small and mostly harmless substitution so it's not an urgent problem.)
Description:
We omit columns from discovered schemas when they're things we can't reliably capture, as is the case with SQL Server or Postgres computed columns.
At least in SQL Server, but I think probably also in Postgres, it is possible to use such a column as the primary key of a table. When this happens we currently emit a nonsensical schema in which the column isn't defined as a property of the output documents, but is listed as required and is suggested as the collection key.
This is incorrect both because the schema is incoherent, and because that column isn't going to be reliably present (it won't be present in replicated changes, specifically) so it can't be a collection key anyway. The best outcome here is just to treat the table as keyless, so I've put some sanity-checking logic in the
sqlcapture
catalog generation code to do that.It might be smarter to eventually integrate a "is any key column generated" check into the secondary index selection logic for impacted databases, so that if the primary key of a table is generated but there's a unique secondary index that isn't we use that. But that seemed like a lot of additional complexity for a very niche edge case, so I'm just going with the simple fix for a situation that blatantly cannot ever be correct here.
This change is![Reviewable](https://camo.githubusercontent.com/1541c4039185914e83657d3683ec25920c672c6c5c7ab4240ee7bff601adec0b/68747470733a2f2f72657669657761626c652e696f2f7265766965775f627574746f6e2e737667)