Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ingestion): add option for optimized skipping of schemas #2209

Merged
merged 4 commits into from
Mar 11, 2021
Merged

feat(ingestion): add option for optimized skipping of schemas #2209

merged 4 commits into from
Mar 11, 2021

Conversation

thomasplarsson
Copy link
Contributor

@thomasplarsson thomasplarsson commented Mar 10, 2021

Fixes: #2208

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable)

@thomasplarsson
Copy link
Contributor Author

thomasplarsson commented Mar 10, 2021

@hsheth2 This is not the prettiest change I know, but I wanted to just put the PR out there to make any discussion about issue #2208 more concrete. For context, I am stumbling on this when trying to ingest from aws athena where we have in our non-production environment a truly polluted metastore with a ridiculous number of garbage schemas, whereas the number of schemas I would be interested in ingesting from is on the order of <10.

Copy link
Collaborator

@hsheth2 hsheth2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks pretty good! Made some minor suggestions

I ran into this same issue when I was ingesting from BigQuery, so I'm glad we've got a good solution for it

dataset_snapshot.aspects.append(dataset_properties)
schema_metadata = get_schema_metadata(
self.report, dataset_name, platform, columns
if sql_config.schema_pattern.allowed(schema):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use something along these lines:

if not sql_config.schema_pattern.allowed(schema):
    self.report.report_dropped(schema)
    continue

It will help localize the reporting alongside the if statement and keep the indentation a bit less deep. I was planning on doing this for table_pattern as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree and fixed.

Copy link
Collaborator

@hsheth2 hsheth2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thanks for adding this.

Copy link
Contributor

@shirshanka shirshanka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@shirshanka shirshanka merged commit 1f1518c into datahub-project:master Mar 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add option to do early skips of complete schemas when ingesting from rdbms-like sources
3 participants