
feat: Enrich superset ingestion #11688

Merged Dec 11, 2024

Conversation

hwmarkcheng
Contributor

@hwmarkcheng hwmarkcheng commented Oct 22, 2024

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

Description:

PR to enrich Superset ingestion with relevant metadata.
Superset holds useful metadata about its physical datasets (raw tables, 1:1 with e.g. a Redshift table) and virtual datasets (custom SQL): owners, their own schema fields, the custom SQL itself, etc.

The current iteration of Superset ingestion simply links to the schema/table specified in the API (which can be inaccurate, especially for virtual datasets). We've added a few changes to enrich the ingestion process:

  • Creates a Dataset entity for each physical/virtual dataset in Superset
  • Adds schema fields for charts
  • SQL parsing for virtual datasets
  • Lineage for tables (including virtual datasets); table-level lineage for virtual datasets is accurate
  • Column-level lineage (working, but low accuracy on retrieving some columns; improvements will likely have to come in another contribution)
  • Tagging for physical/virtual datasets
  • Retrieves owners and metrics from datasets/charts to sync in (WIP: will map Preset owners to DataHub owners, and metrics to glossary terms)

@github-actions github-actions bot added ingestion PR or Issue related to the ingestion of metadata community-contribution PR or Issue raised by member(s) of DataHub Community labels Oct 22, 2024
@ssidorenko
Contributor

Very interested in these changes you are proposing! I'm just wondering why you are removing the platform check condition for PostgreSQL?

@hwmarkcheng
Contributor Author

hwmarkcheng commented Oct 29, 2024

Happy to discuss these changes with the community; let me know if you have any questions.
Looking to update the tests (fix lint) before marking this ready for review.

@Masterchen09
Contributor

Masterchen09 commented Oct 30, 2024

Great to see that the Superset source is getting some more love. :-)

Maybe two additions from my side, as we are internally using a completely rewritten custom version of the Superset source, which also adds support for virtual datasets and SQL parsing (although we are not parsing lineage at the field level and only extract the tables/views from the SQL):

I have seen that you are adding a tag to distinguish between physical and virtual datasets... what about the subtypes aspect? I think using the subtypes aspect would be better than adding a tag in this case -> https://datahubproject.io/docs/generated/metamodel/entities/dataset/#subtypes

In our case we also use templating a lot in our virtual datasets (I think templating is used in all but two or three of them), and I noticed that with the template "syntax" in the SQL, the SQL usually cannot be parsed. I have written a "dummy" template processor which covers nearly all of Superset's functions (not everything can be implemented as a dummy processor, because some information that Superset has when opening a dashboard is missing in the context of DataHub, which would produce invalid results) and allows us to also parse these SQLs. Of course we do not have access to the actual filters of the dashboard etc., so when "weird" templates are used (e.g. when a different source table is used depending on a filter value) it is possible that the lineage will not be complete, but this is still better than not being able to parse the SQL at all -> it is a best-effort approach. This is something I could contribute back once your enhancements are merged. :-)

edit: There is one more thing... for virtual datasets it would also be great to ingest the view logic as a viewProperties aspect of the dataset; that way users could check the SQL directly in DataHub -> https://datahubproject.io/docs/generated/metamodel/entities/dataset/#viewproperties
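
For illustration, the two aspects suggested here could be sketched roughly as follows. The payloads are shown as plain dicts mirroring the subTypes and viewProperties aspect shapes; a real implementation would build SubTypesClass / ViewPropertiesClass from the DataHub SDK and emit them as change proposals, and the function name and subtype strings here are assumptions, not the PR's actual code:

```python
from typing import Any, Dict


def build_dataset_aspects(is_virtual: bool, sql: str) -> Dict[str, Dict[str, Any]]:
    """Build subTypes (and, for virtual datasets, viewProperties) payloads.

    Plain-dict sketch only: real ingestion code would construct
    SubTypesClass / ViewPropertiesClass from datahub.metadata.schema_classes.
    """
    aspects: Dict[str, Dict[str, Any]] = {
        # Use subtypes rather than tags to distinguish physical vs. virtual.
        "subTypes": {"typeNames": ["Virtual" if is_virtual else "Physical"]},
    }
    if is_virtual and sql:
        # Surface the virtual dataset's SQL so users can read it in DataHub.
        aspects["viewProperties"] = {
            "materialized": False,
            "viewLanguage": "SQL",
            "viewLogic": sql,
        }
    return aspects
```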

@hwmarkcheng
Contributor Author

@Masterchen09
These are great suggestions! I think adding physical/virtual datasets with subtypes would make sense.

We've also looked to address the Jinja templating issue; our team here uses Preset (a managed instance of Superset). We've worked with them to add rendered SQL (the fully expanded SQL output) as a parameter to the datasets API endpoint. The PR has been merged in: apache/superset#30721
This should at least let our lineage system accurately parse for tables.

There is one more thing... for virtual datasets it would also be great to ingest the view logic as a viewProperties aspect of the dataset, that's how users could check the SQL directly in DataHub ->

Essentially sync the SQL text into the view definition?

Collaborator

@hsheth2 hsheth2 left a comment


We really appreciate the PR, and apologies for taking so long to review it.

I got through reviewing about half of this PR before getting somewhat lost in the code.

I think it'd be significantly easier for us to accept these changes if it were split across a couple PRs. In general, small PRs are easier to review and easier to debug/revert in case anything goes wrong.

It'd also let us have more fine-grained conversations about specific pieces - like I'd like to understand the metrics piece and the virtual/physical tables stuff in more depth before we integrate those changes.

"GEOMETRY": NullType,
"HLLSKETCH": NullType,
"TIMETZ": TimeType,
"VARBYTE": StringType,
Collaborator


how was this list generated?

we've got a bunch of similar string type -> DataHub type mappings, and ideally we'd reuse one of those instead of creating another one. The dbt source has a mapping that might be a good candidate for reuse

Contributor Author


This is pretty similar to the Redshift mapping. Is there a file I can reference? I can't seem to find the dbt source mapping.

@@ -106,7 +210,7 @@ class SupersetConfig(
api_key: Optional[str] = Field(default=None, description="Preset.io API key.")
api_secret: Optional[str] = Field(default=None, description="Preset.io API secret.")
manager_uri: str = Field(
default="https://api.app.preset.io", description="Preset.io API URL"
default="https://api.app.preset.io/", description="Preset.io API URL"
Collaborator


I'm a bit surprised to see this config in the SupersetConfig class? shouldn't this be in the PresetConfig instead?

and ctx.pipeline_config.sink.config
):
self.sink_config = ctx.pipeline_config.sink.config
self.rest_emitter = DatahubRestEmitter(
Collaborator


we can use ctx.graph

return None
raise ValueError("Could not construct dataset URN")

def parse_owner_payload(self, payload, owners_dict):
Collaborator


type annotations?
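
For illustration, a fully annotated, pure version of this method might look like the sketch below. The payload shape ({"result": [{"owner_id": ..., "email": ...}]}) is an assumption for the example; the actual Superset response may nest these fields differently:

```python
from typing import Any, Dict


def parse_owner_payload(payload: Dict[str, Any]) -> Dict[int, str]:
    """Map owner IDs to owner emails from an owners API payload.

    Pure function: returns a fresh dict instead of mutating one passed in.
    The payload shape assumed here is illustrative only.
    """
    owners: Dict[int, str] = {}
    for entry in payload.get("result", []):
        owner_id = entry.get("owner_id")
        email = entry.get("email")
        if owner_id is not None and email:
            owners[owner_id] = email
    return owners
```

Callers could then merge results with owners_dict.update(parse_owner_payload(payload)), keeping each parse side-effect free.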

chart_payload = self.get_all_chart_owners()
dashboard_payload = self.get_all_dashboard_owners()

owners_dict = self.parse_owner_payload(dataset_payload, owners_dict)
Collaborator


This would be cleaner; pure methods are easier to reason about.

Suggested change
owners_dict = self.parse_owner_payload(dataset_payload, owners_dict)
owners_dict.update(self.parse_owner_payload(dataset_payload))

owners_dict[owner_id] = email
return owners_dict

def build_preset_owner_dict(self) -> dict:
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we tighten the type annotations e.g. Dict[str, str] or something?

while (current_chart_page - 1) * PAGE_SIZE <= total_chart_owners:
full_owners_response = self.session.get(
f"{self.config.connect_uri}/api/v1/chart/related/owners",
params=f"q=(page:{current_chart_page},page_size:{PAGE_SIZE})",
Collaborator


given how standardized the pagination logic is, can we extract it into a helper method?
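
For illustration, such a helper could be sketched like this. The fetch_page callable and the {"count", "result"} payload shape are assumptions (the real session.get call with the q=(page:...,page_size:...) query string would live inside the callable):

```python
from typing import Any, Callable, Dict, Iterator, Optional


def paginate(
    fetch_page: Callable[[int, int], Dict[str, Any]],
    page_size: int = 100,
) -> Iterator[Any]:
    """Yield every item from a Superset-style paginated endpoint.

    fetch_page(page, page_size) is assumed to return a payload shaped
    like {"count": <total>, "result": [...]}.
    """
    page = 0
    fetched = 0
    total: Optional[int] = None
    while total is None or fetched < total:
        payload = fetch_page(page, page_size)
        total = payload.get("count", 0)
        results = payload.get("result", [])
        if not results:
            break  # guard against looping forever if "count" over-reports
        yield from results
        fetched += len(results)
        page += 1
```

Each endpoint (chart owners, dashboard owners, datasets) would then only supply its own fetch_page closure instead of repeating the while-loop.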

custom_properties["CertifiedBy"] = dashboard_data.get("certified_by")
custom_properties["CertifiedBy"] = dashboard_data.get(
"certified_by", "None"
)
Collaborator


why did this need to change?

Contributor Author

@hwmarkcheng hwmarkcheng Nov 27, 2024


Had to add this to pass the lint check.

owners_dict = self.parse_owner_payload(dashboard_payload, owners_dict)
return owners_dict

def build_owners_urn_list(self, data):
Collaborator


needs type annotations + ideally a comment explaining what the input dict structure looks like

mce = MetadataChangeEvent(proposedSnapshot=chart_snapshot)
yield MetadataWorkUnit(id=chart_snapshot.urn, mce=mce)
yield from self._get_domain_wu(
title=chart_data.get("slice_name", ""),
Collaborator

@hsheth2 hsheth2 Nov 25, 2024


what is slice_name?

Contributor Author


Slice name is essentially the chart name; it's what Superset calls a chart.

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Nov 25, 2024
@hwmarkcheng
Contributor Author

Hey Harshal!

Thanks for checking this out! We'll look to break this into smaller PRs so it's easier to review:
Llance:master (our fork) ← Mark/base-superset-change ← Enos/preset-Ownership
                                                     ← Mark/Lineage
                                                     ← Mark/Glossary metrics

@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Nov 26, 2024
@hwmarkcheng
Contributor Author

@hsheth2 -> Split the first PR here: #11972

Please take a look when you have the time!

@hsheth2 hsheth2 removed the needs-review Label for PRs that need review from a maintainer. label Nov 29, 2024
@datahub-cyborg datahub-cyborg bot added the pending-submitter-response Issue/request has been reviewed but requires a response from the submitter label Nov 29, 2024
@hsheth2 hsheth2 merged commit f1ef4f8 into datahub-project:master Dec 11, 2024
87 of 89 checks passed