feat(ingestion): add option for optimized skipping of schemas #2209

thomasplarsson · 2021-03-10T21:25:41Z

Checklist

The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
Links to related issues (if applicable)
Tests for the changes have been added/updated (if applicable)
Docs related to the changes have been added/updated (if applicable)

thomasplarsson · 2021-03-10T21:30:03Z

@hsheth2 I know this is not the prettiest change, but I wanted to put the PR out there to make any discussion of issue #2208 more concrete. For context, I stumbled on this while trying to ingest from AWS Athena: our non-production environment has a truly polluted metastore with a ridiculous number of garbage schemas, whereas the number of schemas I am interested in ingesting is fewer than 10.

hsheth2

Overall looks pretty good! I made some minor suggestions.

I ran into this same issue when I was ingesting from BigQuery, so I'm glad we've got a good solution for it.

hsheth2 · 2021-03-10T22:19:13Z

metadata-ingestion/src/datahub/ingestion/source/sql_common.py

-                        dataset_snapshot.aspects.append(dataset_properties)
-                    schema_metadata = get_schema_metadata(
-                        self.report, dataset_name, platform, columns
+            if sql_config.schema_pattern.allowed(schema):


Could we use something along these lines:

    if not sql_config.schema_pattern.allowed(schema):
        self.report.report_dropped(schema)
        continue

It will help localize the reporting alongside the if statement and keep the indentation a bit less deep. I was planning on doing this for table_pattern as well.

Agree and fixed.
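
For readers following the thread, here is a minimal, self-contained sketch of the guard-clause shape suggested above. It is not the code from this PR: `AllowDenyPattern` and `Report` are simplified stand-ins written for illustration, and `iter_tables` and `list_tables` are hypothetical names. The point it tries to show is that a schema rejected by the pattern is reported and skipped before its tables are ever listed.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List, Tuple


@dataclass
class AllowDenyPattern:
    """Simplified stand-in (assumption): regex allow and deny lists."""

    allow: List[str] = field(default_factory=lambda: [".*"])
    deny: List[str] = field(default_factory=list)

    def allowed(self, name: str) -> bool:
        # A deny match always wins; otherwise at least one allow must match.
        if any(re.match(p, name) for p in self.deny):
            return False
        return any(re.match(p, name) for p in self.allow)


@dataclass
class Report:
    """Simplified stand-in (assumption) that records dropped names."""

    dropped: List[str] = field(default_factory=list)

    def report_dropped(self, name: str) -> None:
        self.dropped.append(name)


def iter_tables(
    schemas: Iterable[str],
    list_tables: Callable[[str], List[str]],  # hypothetical table lister
    schema_pattern: AllowDenyPattern,
    report: Report,
) -> Iterator[Tuple[str, str]]:
    for schema in schemas:
        # Guard clause: report and skip early, keeping the body un-nested.
        if not schema_pattern.allowed(schema):
            report.report_dropped(schema)
            continue
        # Only allowed schemas reach the (potentially expensive) listing.
        for table in list_tables(schema):
            yield schema, table
```

With the inverted condition, the reporting sits next to the check, and the same shape could be repeated one level down for a table pattern.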

hsheth2

Looks great! Thanks for adding this.

Fixes: #2208

shirshanka

LGTM!

hsheth2 requested changes Mar 10, 2021

hsheth2 approved these changes Mar 11, 2021

thomas.larsson added 4 commits March 11, 2021 08:50

feat(ingestion): add option for optimized skipping of schemas

77cbd7a

Fixes: #2208

Invert conditional to avoid too deep nesting.

f5c506c

Documenting the new option in the README.

0e10087

Black code formatting.

922b5a9

shirshanka approved these changes Mar 11, 2021

shirshanka merged commit 1f1518c into datahub-project:master Mar 11, 2021
