sqlcapture: Treat tables with omitted primary-key columns as keyless #2396

willdonnelly · 2025-02-13T00:45:55Z

Description:

We omit columns from discovered schemas when they're things we can't reliably capture, as is the case with SQL Server or Postgres computed columns.

At least in SQL Server, but I think probably also in Postgres, it is possible to use such a column as the primary key of a table. When this happens we currently emit a nonsensical schema in which the column isn't defined as a property of the output documents, but is listed as required and is suggested as the collection key.

This is incorrect both because the schema is incoherent, and because that column isn't going to be reliably present (it won't be present in replicated changes, specifically) so it can't be a collection key anyway. The best outcome here is just to treat the table as keyless, so I've put some sanity-checking logic in the sqlcapture catalog generation code to do that.

It might be smarter to eventually integrate a "is any key column generated" check into the secondary index selection logic for impacted databases, so that if the primary key of a table is generated but there's a unique secondary index that isn't we use that. But that seemed like a lot of additional complexity for a very niche edge case, so I'm just going with the simple fix for a situation that blatantly cannot ever be correct here.

This change is

Apparently computed columns can be a table's primary key. Huh.

We omit columns when they're things we can't reliably capture, as is the case with SQL Server or Postgres computed columns. At least in SQL Server, but I think probably also in Postgres, it is possible to use such a column as the primary key of a table. When this happens we currently emit a nonsensical schema in which the column isn't defined as a property of the output documents, but _is_ listed as required and is suggested as the collection key. The better outcome here is to just treat the table as keyless if that happens. I've opted to put that logic in the sqlcapture catalog generation code, because it is never correct to try and do primary-key things if the table's primary key contains omitted columns. It might be smarter to eventually integrate a "is any key column generated" check into the secondary index selection logic so that if the primary key of a table is generated but there's a unique secondary index that isn't we use that. But that seemed like a lot of additional complexity for a very niche edge case, so I'm just going with the simple fix for a situation that blatantly cannot ever be correct here.

mdibaiee · 2025-02-13T13:45:52Z

sqlcapture/discovery.go

+		var suggestedCollectionKey = table.PrimaryKey
+		var suggestedKeyHasOmittedColumn = false
+		for _, key := range suggestedCollectionKey {
+			if table.Columns[key].OmitColumn {
+				suggestedKeyHasOmittedColumn = true
+				break
+			}
+		}
+		if suggestedKeyHasOmittedColumn {
+			suggestedCollectionKey = nil
+		}
+


This seems like a bandaid on top of an interface problem: should we expect the underlying DiscoverTables to not return an omitted primary key as table.PrimaryKey? that seems to make more sense to me

In connectors' implementations of DiscoverTables we should check before setting PrimaryKey:

if !info.Columns[key].Omitted { info.PrimaryKey = key }

(it needs to be actually a loop as it is here, there)

In general the connector-specific discovery logic is expected to faithfully report what the database tables look like without much "editorializing". That's why we omit tables and columns by marking a "omit this from the generated schema" flag instead of actually leaving them out of the discovery results, for instance.

The rationale is that we use this information for other purposes inside the connector, and often we're setting ourselves up for failure if we lie about what the database actually told us. Also we log the complete discovery results for every table at debug level, which is actually very useful information to have and in that context there's an important distinction between "the table has a primary key which happens to include generated columns" and "the table actually has no declared primary key at all".

(In fact the connector logic which selects a unique secondary index as a primary key when no actual primary key exists is a violation of this principle, and we should probably fix that at some point. But in practice that's a pretty small and mostly harmless substitution so it's not an urgent problem.)

willdonnelly added 2 commits February 12, 2025 18:26

source-sqlserver: Add TestComputedPrimaryKey

b482814

Apparently computed columns can be a table's primary key. Huh.

willdonnelly added the change:unplanned Unplanned change, useful for things like doc updates label Feb 13, 2025

willdonnelly requested a review from a team February 13, 2025 00:45

mdibaiee reviewed Feb 13, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sqlcapture: Treat tables with omitted primary-key columns as keyless #2396

sqlcapture: Treat tables with omitted primary-key columns as keyless #2396

willdonnelly commented Feb 13, 2025 •

edited by jgraettinger

Loading

mdibaiee Feb 13, 2025

willdonnelly Feb 13, 2025

sqlcapture: Treat tables with omitted primary-key columns as keyless #2396

Are you sure you want to change the base?

sqlcapture: Treat tables with omitted primary-key columns as keyless #2396

Conversation

willdonnelly commented Feb 13, 2025 • edited by jgraettinger Loading

mdibaiee Feb 13, 2025

Choose a reason for hiding this comment

willdonnelly Feb 13, 2025

Choose a reason for hiding this comment

willdonnelly commented Feb 13, 2025 •

edited by jgraettinger

Loading