feat(ingest): add classification for sql sources #10013

mayurinehate · 2024-03-08T13:46:13Z

Checklist

The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
Links to related issues (if applicable)
Tests for the changes have been added/updated (if applicable)
Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

hsheth2 · 2024-03-09T01:46:31Z

metadata-ingestion/src/datahub/ingestion/source/sql/data_reader.py

+        else:
+            query = sa.select([sa.text("*")]).select_from(table).limit(sample_size)
+        query_results = self.engine.execute(query)
+        # Not ideal - creates a parallel structure. Can we use pandas here ?


yeah refactoring the query generation a bit would be nice

hsheth2

a few minor comments, but otherwise looks good

hsheth2 · 2024-03-12T05:18:22Z

metadata-ingestion/src/datahub/ingestion/source/sql/data_reader.py

+
+        # Not ideal - creates a parallel structure in column_values. Can we use pandas here ?
+        for row in query_results.fetchall():
+            if isinstance(row, LegacyRow):


this LegacyRow thing seems fishy - is row always going to be a LegacyRow? won't it be a Row object in some cases?

We're on SQLAlchemy 1.4, so LegacyRow should be deprecated. See https://docs.sqlalchemy.org/en/20/changelog/migration_14.html#rowproxy-is-no-longer-a-proxy-is-now-called-row-and-behaves-like-an-enhanced-named-tuple

Agreed. Since the LegacyRow import and the related functionality in sql_common.py are working fine, the current code flow should also work for now. I've added a fix for this in a follow-up PR.
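One way to sidestep the `isinstance(row, LegacyRow)` check discussed in this thread is to rely only on rows being tuple-like and to take column names from the result metadata. The sketch below is not the code from this PR or its follow-up; `rows_to_column_values`, the `users` table, and the sample data are hypothetical, and the standard-library `sqlite3` driver stands in for the SQLAlchemy engine so the snippet runs on its own.

```python
import sqlite3
from typing import Any, Dict, Iterable, List, Sequence


def rows_to_column_values(
    column_names: Sequence[str], rows: Iterable[Sequence[Any]]
) -> Dict[str, List[Any]]:
    """Pivot row-oriented query results into a column -> values mapping.

    Only assumes each row is tuple-like, so no isinstance check on the
    row class is needed. Duplicate column names would share one list.
    """
    column_values: Dict[str, List[Any]] = {name: [] for name in column_names}
    for row in rows:
        for name, value in zip(column_names, row):
            column_values[name].append(value)
    return column_values


# Demo with an in-memory SQLite database (hypothetical table and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "a@example.com"), (2, "b@example.com"), (3, "c@example.com")],
)
sample_size = 2
cursor = conn.execute("SELECT * FROM users ORDER BY id LIMIT ?", (sample_size,))
names = [desc[0] for desc in cursor.description]  # column names from the cursor
sample = rows_to_column_values(names, cursor.fetchall())
conn.close()
print(sample)
```

With SQLAlchemy 1.4, the same helper could presumably be fed `list(query_results.keys())` and `query_results.fetchall()`, since both `LegacyRow` and `Row` iterate like tuples; that wiring is an assumption and is not shown in the diff above.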

mayurinehate requested a review from hsheth2 March 8, 2024 13:46

github-actions bot added the ingestion and community-contribution labels Mar 8, 2024

feat(ingest): add classification for sql sources

764ea8a

mayurinehate force-pushed the master+ing-319-classification_for_sql_sources branch from 309e4ab to 764ea8a Compare March 8, 2024 13:51

vercel bot deployed to Preview March 8, 2024 14:23

hsheth2 reviewed Mar 9, 2024

mayurinehate and others added 3 commits March 11, 2024 11:18

Update metadata-ingestion/src/datahub/ingestion/source/sql/sql_common.py

b39e693

Co-authored-by: Harshal Sheth <[email protected]>

Update metadata-ingestion/src/datahub/ingestion/source/sql/data_reade…

ee729cf

…r.py Co-authored-by: Harshal Sheth <[email protected]>

refactor changes

0876747

vercel bot deployed to Preview March 11, 2024 06:25

test more sources

dc89e49

mayurinehate marked this pull request as ready for review March 12, 2024 05:01

Merge branch 'master' into master+ing-319-classification_for_sql_sources

1b94481

mayurinehate requested a review from hsheth2 March 12, 2024 05:11

hsheth2 approved these changes Mar 12, 2024

vercel bot deployed to Preview March 12, 2024 05:35

mayurinehate added 2 commits March 12, 2024 11:18

changes, fix lint

45d01a0

Merge remote-tracking branch 'refs/remotes/origin/master+ing-319-clas…

03b8e2a

…sification_for_sql_sources' into master+ing-319-classification_for_sql_sources

vercel bot deployed to Preview March 12, 2024 06:05

mayurinehate added 2 commits March 12, 2024 11:42

revert accidental format

e9cae8e

add dependency

577d79c

vercel bot deployed to Preview March 12, 2024 06:42

mayurinehate mentioned this pull request Mar 12, 2024

feat(ingest): add classification to bigquery, redshift #10031

Merged

5 tasks

hsheth2 merged commit 2de0e62 into datahub-project:master Mar 12, 2024
53 checks passed
