feat: Enrich superset ingestion #11688
Conversation
Very interested in those changes that you are proposing! I'm just wondering why you are removing the platform check condition for
Happy to discuss these changes with the community, let me know if you have any questions.
Great to see that the Superset source is getting some more love. :-) Maybe two additions from my side, as we are internally using a completely rewritten custom version of the Superset source, which also adds support for virtual datasets and SQL parsing (although we are not parsing lineage on field level and only extract the tables/views from the SQL):

I have seen that you are adding a tag to distinguish between physical and virtual datasets... what about the subtypes aspect? I think using the subtypes aspect would be better than adding a tag in this case -> https://datahubproject.io/docs/generated/metamodel/entities/dataset/#subtypes

In our case we also use templating a lot in our virtual datasets (I think templating is used in all but two or three of them), and I noticed that with the template "syntax" in the SQL, the SQL usually cannot be parsed... I have written a "dummy" template processor which covers nearly all functions of Superset (not everything can be implemented as a dummy processor, because some information which Superset has when opening a dashboard is missing in the context of DataHub, which would result in invalid results) and which allows us to parse these SQLs as well. Of course we do not have access to the actual filters of the dashboard etc., so in case "weird" templates are used (e.g. when a different source table is used depending on a filter value) it is possible that the lineage will not be complete (but this is still better than not being able to parse the SQL at all -> it is a best-effort approach). This is something I could contribute back once your enhancements are merged. :-)

edit: There is one more thing... for virtual datasets it would also be great to ingest the view logic as a viewProperties aspect of the dataset, so that users could check the SQL directly in DataHub -> https://datahubproject.io/docs/generated/metamodel/entities/dataset/#viewproperties
@Masterchen09 We've also looked at addressing the jinja templating issue; our team here uses Preset (a managed instance of Superset). We've worked with them to add rendered SQL (the fully expanded SQL output) as a parameter on the datasets API endpoint. The PR has been merged in: apache/superset#30721
We really appreciate the PR, and apologies for taking so long to review it.
I got through reviewing about half of this PR before getting somewhat lost in the code.
I think it'd be significantly easier for us to accept these changes if it were split across a couple PRs. In general, small PRs are easier to review and easier to debug/revert in case anything goes wrong.
It'd also let us have more fine-grained conversations about specific pieces - like I'd like to understand the metrics piece and the virtual/physical tables stuff in more depth before we integrate those changes
"GEOMETRY": NullType,
"HLLSKETCH": NullType,
"TIMETZ": TimeType,
"VARBYTE": StringType,
how was this list generated?
we've got a bunch of similar string type -> datahub type mappings, and ideally we'd reuse one of those instead of creating another one. The dbt source has a mapping that might be a good candidate for reuse
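To illustrate the kind of reusable mapping being discussed, here is a minimal sketch of a shared string-type lookup. The names and values are assumptions for illustration: the real source would map to DataHub schema classes (e.g. NullType, TimeType, StringType from datahub.metadata.schema_classes) rather than strings, and the mapping itself would live in a shared module rather than being redefined per source.

```python
import re

# Hypothetical shared mapping of uppercase SQL type names to DataHub type
# class names (strings stand in for the real schema classes here).
SQL_TYPE_TO_DATAHUB_TYPE = {
    "GEOMETRY": "NullType",
    "HLLSKETCH": "NullType",
    "TIMETZ": "TimeType",
    "VARBYTE": "StringType",
}


def resolve_sql_type(raw_type: str) -> str:
    """Normalize a raw column type (e.g. 'VARCHAR(255)') by stripping any
    parenthesized length/precision suffix and upper-casing, then look it up,
    falling back to NullType for anything unrecognized."""
    base = re.sub(r"\(.*\)$", "", raw_type.strip()).upper()
    return SQL_TYPE_TO_DATAHUB_TYPE.get(base, "NullType")
```

With a shared helper like this, each source only supplies (or extends) the dictionary instead of re-implementing the lookup logic.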
This is pretty similar to the redshift mapping. Is this a file I can reference? Can't seem to find the dbt source mapping
@@ -106,7 +210,7 @@ class SupersetConfig(
     api_key: Optional[str] = Field(default=None, description="Preset.io API key.")
     api_secret: Optional[str] = Field(default=None, description="Preset.io API secret.")
     manager_uri: str = Field(
-        default="https://api.app.preset.io", description="Preset.io API URL"
+        default="https://api.app.preset.io/", description="Preset.io API URL"
I'm a bit surprised to see this config in the SupersetConfig class? Shouldn't this be in the PresetConfig instead?
    and ctx.pipeline_config.sink.config
):
    self.sink_config = ctx.pipeline_config.sink.config
    self.rest_emitter = DatahubRestEmitter(
we can use ctx.graph
-        return None
+        raise ValueError("Could not construct dataset URN")

    def parse_owner_payload(self, payload, owners_dict):
type annotations?
chart_payload = self.get_all_chart_owners()
dashboard_payload = self.get_all_dashboard_owners()

owners_dict = self.parse_owner_payload(dataset_payload, owners_dict)
this would be cleaner. pure methods are easier to reason about
- owners_dict = self.parse_owner_payload(dataset_payload, owners_dict)
+ owners_dict.update(self.parse_owner_payload(dataset_payload))
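A pure version of the parser along the lines of this suggestion might look like the sketch below. The payload shape (a `result` list of owner objects with `value` and `extra.email` fields) is an assumption based on how Superset's related-owners endpoints are typically shaped; the actual field names in the PR may differ.

```python
from typing import Any, Dict


def parse_owner_payload(payload: Dict[str, Any]) -> Dict[int, str]:
    """Pure function: builds and returns a fresh owner-id -> email mapping
    instead of mutating a dict that the caller passes in."""
    owners: Dict[int, str] = {}
    for owner in payload.get("result", []):
        owner_id = owner.get("value")
        email = owner.get("extra", {}).get("email")
        if owner_id is not None and email:
            owners[owner_id] = email
    return owners
```

Callers can then merge results without threading a mutable dict through every call, e.g. `owners_dict.update(parse_owner_payload(dataset_payload))`.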
owners_dict[owner_id] = email
return owners_dict

def build_preset_owner_dict(self) -> dict:
can we tighten the type annotations, e.g. Dict[str, str] or something?
while (current_chart_page - 1) * PAGE_SIZE <= total_chart_owners:
    full_owners_response = self.session.get(
        f"{self.config.connect_uri}/api/v1/chart/related/owners",
        params=f"q=(page:{current_chart_page},page_size:{PAGE_SIZE})",
given how standardized the pagination logic is, can we extract it into a helper method?
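One way to extract the pagination loop, as suggested here, is a generic helper that takes the page-fetching call as a parameter. This is a sketch under assumptions: `fetch_page` stands in for a method wrapping `self.session.get(...)`, pages are assumed to be 0-indexed (as in Superset's REST API), and the response is assumed to carry `count` and `result` keys as the related-owners endpoints do.

```python
from typing import Any, Callable, Dict, Iterator, List

PAGE_SIZE = 25  # illustrative; the real source defines its own constant


def paginate(fetch_page: Callable[[int, int], Dict[str, Any]]) -> Iterator[Any]:
    """Generic pagination loop. fetch_page(page, page_size) is expected to
    return {'count': <total>, 'result': [...]}; yields every item across
    all pages, stopping once the total count is exhausted."""
    page = 0
    while True:
        response = fetch_page(page, PAGE_SIZE)
        results: List[Any] = response.get("result", [])
        yield from results
        page += 1
        if not results or page * PAGE_SIZE >= response.get("count", 0):
            break
```

Each caller (charts, dashboards, datasets) would then only supply a small lambda or bound method hitting its own `/related/owners` endpoint, rather than repeating the while-loop.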
-        custom_properties["CertifiedBy"] = dashboard_data.get("certified_by")
+        custom_properties["CertifiedBy"] = dashboard_data.get(
+            "certified_by", "None"
+        )
why did this need to change?
Had to add this in to pass lint check
owners_dict = self.parse_owner_payload(dashboard_payload, owners_dict)
return owners_dict

def build_owners_urn_list(self, data):
needs type annotations + ideally a comment explaining what the input dict structure looks like
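A fully annotated version along the lines of this review comment might look like the sketch below. It assumes the input is the owner-id -> email mapping built earlier in the source, and that emails are used as the corpuser name; `urn:li:corpuser:<name>` is DataHub's standard corpuser URN form, but the exact input shape in the PR is an assumption.

```python
from typing import Dict, List


def build_owners_urn_list(owners_dict: Dict[int, str]) -> List[str]:
    """Convert an owner mapping into DataHub corpuser URNs.

    Input: mapping of Superset owner id -> email address, e.g.
        {1: "alice@example.com"}  (shape assumed from the owner parsing above)
    Output: one "urn:li:corpuser:<email>" string per owner.
    """
    return [f"urn:li:corpuser:{email}" for email in owners_dict.values()]
```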
mce = MetadataChangeEvent(proposedSnapshot=chart_snapshot)
yield MetadataWorkUnit(id=chart_snapshot.urn, mce=mce)
yield from self._get_domain_wu(
    title=chart_data.get("slice_name", ""),
what is slice_name?
Slice name is essentially chart name, it's what Superset calls it
Hey Harshal! Thanks for checking this out! We'll look to break this into smaller PRs so it's easier to review:
Checklist
Description:
PR to enrich Superset Ingestion with relevant metadata
Superset has useful metadata for its physical datasets (raw tables, 1:1 with e.g. a Redshift table) and virtual datasets (custom SQL), e.g. owners, their own schema fields, the custom SQL itself, etc.
The current iteration of Superset ingestion simply links to the schema/table specified in the API (which can be inaccurate, especially for virtual datasets). We've added a few changes to enrich the ingestion process.