feat(ingest): add classification for sql sources #10013

mayurinehate
Collaborator

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@mayurinehate mayurinehate requested a review from hsheth2 March 8, 2024 13:46
@github-actions github-actions bot added the labels ingestion (PR or Issue related to the ingestion of metadata) and community-contribution (PR or Issue raised by member(s) of DataHub Community) Mar 8, 2024
@mayurinehate mayurinehate force-pushed the master+ing-319-classification_for_sql_sources branch from 309e4ab to 764ea8a Compare March 8, 2024 13:51
else:
query = sa.select([sa.text("*")]).select_from(table).limit(sample_size)
query_results = self.engine.execute(query)
# Not ideal - creates a parallel structure. Can we use pandas here ?
Collaborator

yeah refactoring the query generation a bit would be nice

@mayurinehate mayurinehate marked this pull request as ready for review March 12, 2024 05:01
@mayurinehate mayurinehate requested a review from hsheth2 March 12, 2024 05:11
Collaborator

@hsheth2 hsheth2 left a comment

a few minor comments, but otherwise looks good


# Not ideal - creates a parallel structure in column_values. Can we use pandas here ?
for row in query_results.fetchall():
if isinstance(row, LegacyRow):
Collaborator

this LegacyRow thing seems fishy - is row always going to be a LegacyRow? won't it be a Row object in some cases?

we're on sqlalchemy 1.4, so LegacyRow should be deprecated https://docs.sqlalchemy.org/en/20/changelog/migration_14.html#rowproxy-is-no-longer-a-proxy-is-now-called-row-and-behaves-like-an-enhanced-named-tuple

Collaborator Author

Agreed. Since the LegacyRow import and its usage in sql_common.py already work fine, the current code flow should also work for now. I've added a fix for this in a follow-up PR.
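
For illustration, one way to avoid the `isinstance(row, LegacyRow)` check is to treat every row as a plain sequence. Per the migration note linked above, a SQLAlchemy 1.4 `Row` behaves like a named tuple, so positional access should work whether the result yields `Row` or `LegacyRow`. The sketch below is not the code from this PR or the follow-up; `rows_to_columns` is a hypothetical helper name, and plain tuples stand in for database rows so that it runs without a database.

```python
from typing import Any, Dict, Iterable, List, Sequence


def rows_to_columns(
    column_names: Sequence[str], rows: Iterable[Sequence[Any]]
) -> Dict[str, List[Any]]:
    """Transpose tuple-like rows into {column name: [values, ...]}.

    Hypothetical helper, not part of DataHub. It relies only on positional
    iteration, so it does not care which row class the driver returns.
    """
    column_values: Dict[str, List[Any]] = {name: [] for name in column_names}
    for row in rows:
        # No isinstance check: each row is consumed as an ordinary sequence.
        for name, value in zip(column_names, row):
            column_values[name].append(value)
    return column_values


# Plain tuples stand in for the rows a query would return.
sample = rows_to_columns(
    ["id", "email"],
    [(1, "a@example.com"), (2, "b@example.com")],
)
```

With a real result object, the inputs would presumably be `list(query_results.keys())` and `query_results.fetchall()`. Note that duplicate column names would collapse into a single key here, so this sketch assumes the sampled columns have distinct names.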

…sification_for_sql_sources' into master+ing-319-classification_for_sql_sources
@hsheth2 hsheth2 merged commit 2de0e62 into datahub-project:master Mar 12, 2024
53 checks passed