-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(ingest/powerbi): powerbi dataset profiling #9355
Conversation
try: | ||
logger.info(f"Column data query for {dataset.name}, {column.name}") | ||
query = f""" | ||
EVALUATE ROW( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add query.py in rest_api_wrapper package. In query.py add a DaxQuery class having static method to return the qury
@@ -47,6 +50,13 @@ def __init__(self, config: PowerBiDashboardSourceConfig) -> None: | |||
tenant_id=self.__config.tenant_id, | |||
) | |||
|
|||
self.__profiling_resolver = PowerBiDatasetProfilingResolver( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are only two way to get the metadata from PowerBI either using regular API or Admin API. You can clearly see this segregation in PowerBI Rest API documentation: https://learn.microsoft.com/en-us/rest/api/power-bi/.
As per doc it looks like it comes under regular-api, so please add profile_dataset in base resolver and provide implementation in both regular and admin resolver (here it is just pass). Invoke the powerbi_profiler.py from regular api's profile_dataset method
@@ -108,6 +112,10 @@ class Measure: | |||
BooleanTypeClass, DateTypeClass, NullTypeClass, NumberTypeClass, StringTypeClass | |||
] = dataclasses.field(default_factory=NullTypeClass) | |||
description: Optional[str] = None | |||
min: Optional[str] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to have MeasureProfiling dataclass and add the reference here
@@ -107,6 +107,10 @@ By default, extracting endorsement information to tags is disabled. The feature | |||
|
|||
Please note that the default implementation overwrites tags for the ingested entities, if you need to preserve existing tags, consider using a [transformer](../../../../metadata-ingestion/docs/transformer/dataset_transformer.md#simple-add-dataset-globaltags) with `semantics: PATCH` tags instead of `OVERWRITE`. | |||
|
|||
## Profiling | |||
|
|||
The profiling implementation is done through querying [DAX query endpoint](https://learn.microsoft.com/en-us/rest/api/power-bi/datasets/execute-queries). Therefore the principal needs to have permission to query the datasets to be profiled. Profiling is done with column based queries to be able to handle wide datasets without timeouts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional documentation need to be added:
- Please also check this doc: https://datahubproject.io/docs/quick-ingestion-guides/powerbi/setup and add/update any steps if any specific permission is required for profiling. The markdown file is available at
docs/quick-ingestion-guides/powerbi/setup.md
- On Source class add decorator
@capability( SourceCapability.DATA_PROFILING, "Optionally enabled via configuration
profiling.enabled", )
Thank you for the feedback! I'll start working with this asap. |
@sid-acryl can you take another look at this PR? |
f"Table {table.name} in {dataset.name}, not allowed for profiling" | ||
) | ||
return | ||
logger.info(f"Profiling table: {table.name}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets make it debug
] = [*(table.columns or []), *(table.measures or [])] | ||
for column in columns: | ||
allowed_column = self.__config.profile_pattern.allowed( | ||
f"{workspace.name}.{dataset.name}.{table.name}.{column.name}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please give example value of profile_pattern
config to make it clear for user that he needs to specify fully qualified name of table
} | ||
], | ||
"serializerSettings": { | ||
"includeNulls": True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add comments for Why we need to include Nulls
query = DaxQuery.row_count_query(table.name) | ||
try: | ||
data = self._execute_profiling_query(dataset, query) | ||
rows = data["results"][0]["tables"][0]["rows"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add checks here to avoid key error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks like they're already catching KeyError below, so it should be ok
from typing import Dict, List | ||
|
||
|
||
def get_column_name(table_and_col: str) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make it -> Optional[str]
Thank you for the review and feedback! I'll continue working on this still this week. |
Take into account that the profiling implementation exeutes fairly big amount of DAX queries and for big datasets this is substantial load to the PowerBI system. | ||
|
||
The `profiling_pattern` setting may be used to limit profiling actions to only a certain set of resources in PowerBI. Both allow and deny rules are matched against following pattern for every table in a PowerBI Dataset: `workspace_name.dataset_name.table_name`. User may limit profiling with these settings at table level, dataset level or workspace level. | ||
|
||
## Admin Ingestion vs. Basic Ingestion | ||
PowerBI provides two sets of API i.e. [Basic API and Admin API](https://learn.microsoft.com/en-us/rest/api/power-bi/). | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please update the Caveats of setting admin_apis_only to true:
and add a bullet point for dataset profiling
is not available
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay, I added mention that the dataset profiling is not available through the Admin API
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
DAX REST API based profiling implementation to query some details of PowerBI Datasets in the workspace. Does not introduce any breaking changes.
This implementation counts only amount of distinct values, min and max of a column. So it won't completely fulfill the possibilities Datahub UI provides. Extending the implementation to include other metrics isn't very complicated.
Is this something you would like to accept and include in Datahub or is using the DAX API for gathering information considered not a good practice?
Checklist