Fix spurious warnings and bogus index when reflecting Iceberg tables #520
I'll see if I can figure out a way to make it work on 351.
c396080 to ac86862
The 351 check passes now. This is the query that the code uses to check whether the catalog's connector is Hive, based on what it returns:

    SELECT
        COUNT(*)
    FROM "system"."metadata"."table_properties"
    WHERE "catalog_name" = :catalog_name
        AND "property_name" = 'bucketing_version'

If there's a better way to do this that passes the 351 test, I'd be happy to change the code!
For reference, the earlier alternative was reading from
For the test, then, you can make it "pass" by
cc: @damian3031 @ebyhr @dungdm93 what do you think?
264ea95 to 35343f1
@hashhar I realized I could query
35343f1 to 4d1f272
@@ -229,6 +229,33 @@ def _get_partitions(
        partition_names = [desc[0] for desc in res.cursor.description]
        return partition_names

    def _has_connector_name(self, connection: Connection):
We now need to issue 3 queries instead of the 1 in the old version.
I'd recommend not trying to do any "detection" and instead just querying system.metadata.catalogs and ignoring the failure if it doesn't exist.
When it exists, we identify whether the catalog is Hive or something else.
When it doesn't exist, we can determine Hive or something else by looking at the "format" of the output from the actual query to $partitions.
Also, I wonder if we can simply issue the query to $partitions and use the output shape to determine what connector it is. After all, we don't care what CONNECTOR it is; we care that we return whatever the partition columns are.
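One way to read the "output shape" suggestion is sketched below. It is only an illustration, not code from this PR: the helper name `partition_columns_from_shape` is hypothetical, and the metadata-style column names in the second call are an assumption about what a non-Hive `$partitions` result might expose. The idea is to accept the result as a partition listing only when every output column is also a real column of the table.

```python
from typing import List, Sequence


def partition_columns_from_shape(
    output_columns: Sequence[str], table_columns: Sequence[str]
) -> List[str]:
    """Return the partition columns implied by a $partitions result, or [].

    Hypothetical helper: the result is treated as a Hive-style listing only
    when every output column is also a column of the table itself.
    """
    known = set(table_columns)
    if output_columns and all(name in known for name in output_columns):
        return list(output_columns)
    # Any other shape (e.g. metadata-style columns) yields no partition index.
    return []


# Hive-style shape: the output columns are the table's partition columns.
hive_like = partition_columns_from_shape(["ds", "region"], ["id", "ds", "region"])

# Metadata-style shape (assumed column names): not all are table columns.
other = partition_columns_from_shape(
    ["record_count", "file_count", "total_size", "data"], ["id", "ds", "region"]
)
```

A check like this needs no connector lookup at all, but it can misfire if a table happens to have columns named exactly like the metadata columns, so it is a heuristic rather than a guarantee.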
Description
Previously, during reflection, TrinoDialect.get_indexes() called _get_partitions(), which executed a query and assumed that, if this query executed successfully, the table in hand was a Hive table. The issue (#518) was that this same query also succeeds for Iceberg tables, resulting in a spurious index (containing no columns) being added to the table, and a series of warnings from SQLAlchemy about index keys not being located in the column names for the table.
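The symptom described above can be reproduced in miniature. The snippet below is a simplified stand-in for the column matching a reflection layer performs, not SQLAlchemy's actual code; the function name `resolve_index_columns`, the table name, and the column names are all made up for illustration.

```python
import warnings
from typing import List, Sequence


def resolve_index_columns(
    table_name: str, index_columns: Sequence[str], table_columns: Sequence[str]
) -> List[str]:
    """Keep only index columns that exist on the table, warning about the rest.

    Simplified illustration of reflection-time matching; not SQLAlchemy's code.
    """
    known = set(table_columns)
    resolved = []
    for name in index_columns:
        if name in known:
            resolved.append(name)
        else:
            warnings.warn(
                f"index key {name!r} was not located in columns "
                f"for table {table_name!r}"
            )
    return resolved


# Index "columns" that are not table columns (names assumed for illustration).
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    cols = resolve_index_columns(
        "events", ["record_count", "file_count"], ["id", "ts"]
    )
```

Here every reported index key misses, so the call emits one warning per key and the resulting index ends up with no columns, which matches the behavior reported in #518.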
This PR adds a check to TrinoDialect.get_indexes() to ensure that the catalog in hand is a Hive catalog before calling _get_partitions().
I looked at creating a test method, but I couldn't see how to do so. Instead, I bench-tested the fix against Hive and Iceberg catalogs with the test app from #518. The output for an Iceberg table is:
For a Hive table it is:
Non-technical explanation
This PR fixes spurious warnings and a bogus index being added to the metadata when reflecting Iceberg tables.
Release notes
( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text: