
feat: Enrich superset ingestion #11688

Merged Dec 11, 2024

Conversation

hwmarkcheng
Contributor

@hwmarkcheng hwmarkcheng commented Oct 22, 2024

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

Description:

PR to enrich Superset ingestion with relevant metadata.
Superset holds useful metadata about its physical datasets (raw tables, 1:1 with e.g. a Redshift table) and virtual datasets (custom SQL): owners, their own schema fields, the custom SQL itself, etc.

The current iteration of Superset ingestion simply links to the schema/table specified in the API (which can be inaccurate, especially for virtual datasets). We've added a few changes to enrich the ingestion process:

  • Creates a Dataset entity for each physical/virtual dataset in Superset
  • Adds schema fields for charts
  • SQL parsing for virtual datasets
  • Lineage for tables (including virtual datasets); table-level lineage for virtual datasets is accurate
  • Column-level lineage (working, but low accuracy on retrieving some columns; improvements will likely have to come in another contribution)
  • Tagging for physical/virtual datasets
  • Retrieves owners and metrics from datasets/charts to sync in (WIP: will map Preset owners to DataHub owners, and metrics to glossary terms)

@github-actions github-actions bot added ingestion PR or Issue related to the ingestion of metadata community-contribution PR or Issue raised by member(s) of DataHub Community labels Oct 22, 2024
@ssidorenko
Contributor

Very interested in these changes you are proposing! I'm just wondering why you are removing the platform check condition for PostgreSQL?

@hwmarkcheng
Contributor Author

hwmarkcheng commented Oct 29, 2024

Happy to discuss these changes with the community; let me know if you have any questions.
Looking to update the tests (fix lint) before marking this ready for review.

@Masterchen09
Contributor

Masterchen09 commented Oct 30, 2024

Great to see that the Superset source is getting some more love. :-)

Maybe two additions from my side, as we are internally using a completely rewritten custom version of the Superset source, which also adds support for virtual datasets and SQL parsing (although we are not parsing lineage at the field level and only extract the tables/views from the SQL):

I have seen that you are adding a tag to distinguish between physical and virtual datasets... what about the subtypes aspect? I think using the subtypes aspect would be better than adding a tag in this case -> https://datahubproject.io/docs/generated/metamodel/entities/dataset/#subtypes

In our case we also use templating a lot in our virtual datasets (I think templating is used in all but two or three of them), and I noticed that with the template "syntax" in the SQL, the SQL usually cannot be parsed. I have written a "dummy" template processor which covers nearly all of Superset's functions (not everything can be implemented as a dummy processor, because some information that Superset has when opening a dashboard is missing in the context of DataHub, which would produce invalid results) and allows us to also parse these SQLs. Of course we do not have access to the actual filters of the dashboard etc., so when "weird" templates are used (e.g. when a different source table is used depending on a filter value) it is possible that the lineage will not be complete, but this is still better than not being able to parse the SQL at all -> it is a best-effort approach. This is something I could contribute back once your enhancements are merged. :-)

edit: There is one more thing... for virtual datasets it would also be great to ingest the view logic as a viewProperties aspect of the dataset; that way users could check the SQL directly in DataHub -> https://datahubproject.io/docs/generated/metamodel/entities/dataset/#viewproperties
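
For illustration, the two aspects suggested here could be sketched roughly as follows. The payloads are shown as plain dicts mirroring the subTypes and viewProperties aspect shapes; a real implementation would build SubTypesClass / ViewPropertiesClass from the DataHub SDK and emit them as change proposals, and the function name and subtype strings here are assumptions, not the PR's actual code:

```python
from typing import Any, Dict


def build_dataset_aspects(is_virtual: bool, sql: str) -> Dict[str, Dict[str, Any]]:
    """Build subTypes (and, for virtual datasets, viewProperties) payloads.

    Plain-dict sketch only: real ingestion code would construct
    SubTypesClass / ViewPropertiesClass from datahub.metadata.schema_classes.
    """
    aspects: Dict[str, Dict[str, Any]] = {
        # Use subtypes rather than tags to distinguish physical vs. virtual.
        "subTypes": {"typeNames": ["Virtual" if is_virtual else "Physical"]},
    }
    if is_virtual and sql:
        # Surface the virtual dataset's SQL so users can read it in DataHub.
        aspects["viewProperties"] = {
            "materialized": False,
            "viewLanguage": "SQL",
            "viewLogic": sql,
        }
    return aspects
```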

@hwmarkcheng
Contributor Author

@Masterchen09
These are great suggestions! I think adding physical/virtual datasets with subtypes would make sense.

We've also looked to address the Jinja templating issue; our team here uses Preset (a managed instance of Superset). We've worked with them to add rendered SQL (the fully expanded SQL output) as a parameter to the datasets API endpoint. The PR has been merged in: apache/superset#30721
This should at least let our lineage system accurately parse for tables.

There is one more thing... for virtual datasets it would also be great to ingest the view logic as a viewProperties aspect of the dataset, that's how users could check the SQL directly in DataHub ->

Essentially sync the SQL text into the view definition?

Collaborator

@hsheth2 hsheth2 left a comment


We really appreciate the PR, and apologies for taking so long to review it.

I got through reviewing about half of this PR before getting somewhat lost in the code.

I think it'd be significantly easier for us to accept these changes if it were split across a couple PRs. In general, small PRs are easier to review and easier to debug/revert in case anything goes wrong.

It'd also let us have more fine-grained conversations about specific pieces - like I'd like to understand the metrics piece and the virtual/physical tables stuff in more depth before we integrate those changes.

"GEOMETRY": NullType,
"HLLSKETCH": NullType,
"TIMETZ": TimeType,
"VARBYTE": StringType,
Collaborator


how was this list generated?

we've got a bunch of similar string type -> DataHub type mappings, and ideally we'd reuse one of those instead of creating another one. The dbt source has a mapping that might be a good candidate for reuse

Contributor Author


This is pretty similar to the Redshift mapping. Is there a file I can reference? I can't seem to find the dbt source mapping.

@@ -106,7 +210,7 @@ class SupersetConfig(
api_key: Optional[str] = Field(default=None, description="Preset.io API key.")
api_secret: Optional[str] = Field(default=None, description="Preset.io API secret.")
manager_uri: str = Field(
default="https://api.app.preset.io", description="Preset.io API URL"
default="https://api.app.preset.io/", description="Preset.io API URL"
Collaborator


I'm a bit surprised to see this config in the SupersetConfig class? shouldn't this be in the PresetConfig instead?

and ctx.pipeline_config.sink.config
):
self.sink_config = ctx.pipeline_config.sink.config
self.rest_emitter = DatahubRestEmitter(
Collaborator


we can use ctx.graph

return None
raise ValueError("Could not construct dataset URN")

def parse_owner_payload(self, payload, owners_dict):
Collaborator


type annotations?
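
For illustration, a fully annotated, pure version of this method might look like the sketch below. The payload shape ({"result": [{"owner_id": ..., "email": ...}]}) is an assumption for the example; the actual Superset response may nest these fields differently:

```python
from typing import Any, Dict


def parse_owner_payload(payload: Dict[str, Any]) -> Dict[int, str]:
    """Map owner IDs to owner emails from an owners API payload.

    Pure function: returns a fresh dict instead of mutating one passed in.
    The payload shape assumed here is illustrative only.
    """
    owners: Dict[int, str] = {}
    for entry in payload.get("result", []):
        owner_id = entry.get("owner_id")
        email = entry.get("email")
        if owner_id is not None and email:
            owners[owner_id] = email
    return owners
```

Callers could then merge results with owners_dict.update(parse_owner_payload(payload)), keeping each parse side-effect free.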

chart_payload = self.get_all_chart_owners()
dashboard_payload = self.get_all_dashboard_owners()

owners_dict = self.parse_owner_payload(dataset_payload, owners_dict)
Collaborator


This would be cleaner; pure methods are easier to reason about.

Suggested change
owners_dict = self.parse_owner_payload(dataset_payload, owners_dict)
owners_dict.update(self.parse_owner_payload(dataset_payload))

owners_dict[owner_id] = email
return owners_dict

def build_preset_owner_dict(self) -> dict:
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we tighten the type annotations e.g. Dict[str, str] or something?

while (current_chart_page - 1) * PAGE_SIZE <= total_chart_owners:
full_owners_response = self.session.get(
f"{self.config.connect_uri}/api/v1/chart/related/owners",
params=f"q=(page:{current_chart_page},page_size:{PAGE_SIZE})",
Collaborator


given how standardized the pagination logic is, can we extract it into a helper method?
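
For illustration, such a helper could be sketched like this. The fetch_page callable and the {"count", "result"} payload shape are assumptions (the real session.get call with the q=(page:...,page_size:...) query string would live inside the callable):

```python
from typing import Any, Callable, Dict, Iterator, Optional


def paginate(
    fetch_page: Callable[[int, int], Dict[str, Any]],
    page_size: int = 100,
) -> Iterator[Any]:
    """Yield every item from a Superset-style paginated endpoint.

    fetch_page(page, page_size) is assumed to return a payload shaped
    like {"count": <total>, "result": [...]}.
    """
    page = 0
    fetched = 0
    total: Optional[int] = None
    while total is None or fetched < total:
        payload = fetch_page(page, page_size)
        total = payload.get("count", 0)
        results = payload.get("result", [])
        if not results:
            break  # guard against looping forever if "count" over-reports
        yield from results
        fetched += len(results)
        page += 1
```

Each endpoint (chart owners, dashboard owners, datasets) would then only supply its own fetch_page closure instead of repeating the while-loop.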

custom_properties["CertifiedBy"] = dashboard_data.get("certified_by")
custom_properties["CertifiedBy"] = dashboard_data.get(
"certified_by", "None"
)
Collaborator


why did this need to change?

Contributor Author

@hwmarkcheng hwmarkcheng Nov 27, 2024


Had to add this to pass the lint check.

owners_dict = self.parse_owner_payload(dashboard_payload, owners_dict)
return owners_dict

def build_owners_urn_list(self, data):
Collaborator


needs type annotations + ideally a comment explaining what the input dict structure looks like

mce = MetadataChangeEvent(proposedSnapshot=chart_snapshot)
yield MetadataWorkUnit(id=chart_snapshot.urn, mce=mce)
yield from self._get_domain_wu(
title=chart_data.get("slice_name", ""),
Collaborator

@hsheth2 hsheth2 Nov 25, 2024


what is slice_name?

Contributor Author


Slice name is essentially the chart name; it's what Superset calls a chart.

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Nov 25, 2024
@hwmarkcheng
Contributor Author

Hey Harshal!

Thanks for checking this out! We'll look to break this into smaller PRs so it's easier to review:
Llance:master (our fork) ← Mark/base-superset-change ← Enos/preset-Ownership
                                                     ← Mark/Lineage
                                                     ← Mark/Glossary metrics

@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Nov 26, 2024
@hwmarkcheng
Contributor Author

@hsheth2 -> Split the first PR here: #11972

Please take a look when you have the time!

@hsheth2 hsheth2 removed the needs-review Label for PRs that need review from a maintainer. label Nov 29, 2024
@datahub-cyborg datahub-cyborg bot added the pending-submitter-response Issue/request has been reviewed but requires a response from the submitter label Nov 29, 2024
@hsheth2 hsheth2 merged commit f1ef4f8 into datahub-project:master Dec 11, 2024
87 of 89 checks passed