Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sqlcapture: Treat tables with omitted primary-key columns as keyless #2396

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

willdonnelly
Copy link
Member

@willdonnelly willdonnelly commented Feb 13, 2025

Description:

We omit columns from discovered schemas when they're things we can't reliably capture, as is the case with SQL Server or Postgres computed columns.

At least in SQL Server, but I think probably also in Postgres, it is possible to use such a column as the primary key of a table. When this happens we currently emit a nonsensical schema in which the column isn't defined as a property of the output documents, but is listed as required and is suggested as the collection key.

This is incorrect both because the schema is incoherent, and because that column isn't going to be reliably present (it won't be present in replicated changes, specifically) so it can't be a collection key anyway. The best outcome here is just to treat the table as keyless, so I've put some sanity-checking logic in the sqlcapture catalog generation code to do that.

It might be smarter to eventually integrate a "is any key column generated" check into the secondary index selection logic for impacted databases, so that if the primary key of a table is generated but there's a unique secondary index that isn't we use that. But that seemed like a lot of additional complexity for a very niche edge case, so I'm just going with the simple fix for a situation that blatantly cannot ever be correct here.


This change is Reviewable

Apparently computed columns can be a table's primary key. Huh.
We omit columns when they're things we can't reliably capture,
as is the case with SQL Server or Postgres computed columns.

At least in SQL Server, but I think probably also in Postgres,
it is possible to use such a column as the primary key of a
table. When this happens we currently emit a nonsensical schema
in which the column isn't defined as a property of the output
documents, but _is_ listed as required and is suggested as the
collection key.

The better outcome here is to just treat the table as keyless
if that happens. I've opted to put that logic in the sqlcapture
catalog generation code, because it is never correct to try and
do primary-key things if the table's primary key contains omitted
columns.

It might be smarter to eventually integrate a "is any key column
generated" check into the secondary index selection logic so that
if the primary key of a table is generated but there's a unique
secondary index that isn't we use that. But that seemed like a
lot of additional complexity for a very niche edge case, so I'm
just going with the simple fix for a situation that blatantly
cannot ever be correct here.
@willdonnelly willdonnelly added the change:unplanned Unplanned change, useful for things like doc updates label Feb 13, 2025
@willdonnelly willdonnelly requested a review from a team February 13, 2025 00:45
Comment on lines +72 to +83
var suggestedCollectionKey = table.PrimaryKey
var suggestedKeyHasOmittedColumn = false
for _, key := range suggestedCollectionKey {
if table.Columns[key].OmitColumn {
suggestedKeyHasOmittedColumn = true
break
}
}
if suggestedKeyHasOmittedColumn {
suggestedCollectionKey = nil
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a bandaid on top of an interface problem: should we expect the underlying DiscoverTables to not return an omitted primary key as table.PrimaryKey? that seems to make more sense to me

In connectors' implementations of DiscoverTables we should check before setting PrimaryKey:

if !info.Columns[key].Omitted {
	info.PrimaryKey = key
}

(it needs to be actually a loop as it is here, there)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general the connector-specific discovery logic is expected to faithfully report what the database tables look like without much "editorializing". That's why we omit tables and columns by marking a "omit this from the generated schema" flag instead of actually leaving them out of the discovery results, for instance.

The rationale is that we use this information for other purposes inside the connector, and often we're setting ourselves up for failure if we lie about what the database actually told us. Also we log the complete discovery results for every table at debug level, which is actually very useful information to have and in that context there's an important distinction between "the table has a primary key which happens to include generated columns" and "the table actually has no declared primary key at all".

(In fact the connector logic which selects a unique secondary index as a primary key when no actual primary key exists is a violation of this principle, and we should probably fix that at some point. But in practice that's a pretty small and mostly harmless substitution so it's not an urgent problem.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
change:unplanned Unplanned change, useful for things like doc updates
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants