Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ingest/powerbi): powerbi dataset profiling #9355

Merged
merged 18 commits into from
Jun 28, 2024

Conversation

looppi
Copy link
Contributor

@looppi looppi commented Nov 30, 2023

DAX REST API based profiling implementation to query some details of PowerBI Datasets in the workspace. Does not introduce any breaking changes.

This implementation counts only amount of distinct values, min and max of a column. So it won't completely fulfill the possibilities Datahub UI provides. Extending the implementation to include other metrics isn't very complicated.

Is this something you would like to accept and include in Datahub or is using the DAX API for gathering information considered not a good practice?

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

try:
logger.info(f"Column data query for {dataset.name}, {column.name}")
query = f"""
EVALUATE ROW(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add query.py in rest_api_wrapper package. In query.py add a DaxQuery class having static method to return the qury

@@ -47,6 +50,13 @@ def __init__(self, config: PowerBiDashboardSourceConfig) -> None:
tenant_id=self.__config.tenant_id,
)

self.__profiling_resolver = PowerBiDatasetProfilingResolver(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are only two way to get the metadata from PowerBI either using regular API or Admin API. You can clearly see this segregation in PowerBI Rest API documentation: https://learn.microsoft.com/en-us/rest/api/power-bi/.

As per doc it looks like it comes under regular-api, so please add profile_dataset in base resolver and provide implementation in both regular and admin resolver (here it is just pass). Invoke the powerbi_profiler.py from regular api's profile_dataset method

@@ -108,6 +112,10 @@ class Measure:
BooleanTypeClass, DateTypeClass, NullTypeClass, NumberTypeClass, StringTypeClass
] = dataclasses.field(default_factory=NullTypeClass)
description: Optional[str] = None
min: Optional[str] = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to have MeasureProfiling dataclass and add the reference here

@@ -107,6 +107,10 @@ By default, extracting endorsement information to tags is disabled. The feature

Please note that the default implementation overwrites tags for the ingested entities, if you need to preserve existing tags, consider using a [transformer](../../../../metadata-ingestion/docs/transformer/dataset_transformer.md#simple-add-dataset-globaltags) with `semantics: PATCH` tags instead of `OVERWRITE`.

## Profiling

The profiling implementation is done through querying [DAX query endpoint](https://learn.microsoft.com/en-us/rest/api/power-bi/datasets/execute-queries). Therefore the principal needs to have permission to query the datasets to be profiled. Profiling is done with column based queries to be able to handle wide datasets without timeouts.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional documentation need to be added:

  1. Please also check this doc: https://datahubproject.io/docs/quick-ingestion-guides/powerbi/setup and add/update any steps if any specific permission is required for profiling. The markdown file is available at docs/quick-ingestion-guides/powerbi/setup.md
  2. On Source class add decorator @capability( SourceCapability.DATA_PROFILING, "Optionally enabled via configuration profiling.enabled", )

@looppi
Copy link
Contributor Author

looppi commented Jan 4, 2024

Thank you for the feedback! I'll start working with this asap.

@hsheth2
Copy link
Collaborator

hsheth2 commented Jan 18, 2024

@sid-acryl can you take another look at this PR?

@hsheth2 hsheth2 added the community-contribution PR or Issue raised by member(s) of DataHub Community label Jan 18, 2024
f"Table {table.name} in {dataset.name}, not allowed for profiling"
)
return
logger.info(f"Profiling table: {table.name}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets make it debug

] = [*(table.columns or []), *(table.measures or [])]
for column in columns:
allowed_column = self.__config.profile_pattern.allowed(
f"{workspace.name}.{dataset.name}.{table.name}.{column.name}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please give example value of profile_pattern config to make it clear for user that he needs to specify fully qualified name of table

}
],
"serializerSettings": {
"includeNulls": True,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add comments for Why we need to include Nulls

query = DaxQuery.row_count_query(table.name)
try:
data = self._execute_profiling_query(dataset, query)
rows = data["results"][0]["tables"][0]["rows"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add checks here to avoid key error

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like they're already catching KeyError below, so it should be ok

from typing import Dict, List


def get_column_name(table_and_col: str) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make it -> Optional[str]

@looppi
Copy link
Contributor Author

looppi commented Feb 28, 2024

Thank you for the review and feedback! I'll continue working on this still this week.

Take into account that the profiling implementation exeutes fairly big amount of DAX queries and for big datasets this is substantial load to the PowerBI system.

The `profiling_pattern` setting may be used to limit profiling actions to only a certain set of resources in PowerBI. Both allow and deny rules are matched against following pattern for every table in a PowerBI Dataset: `workspace_name.dataset_name.table_name`. User may limit profiling with these settings at table level, dataset level or workspace level.

## Admin Ingestion vs. Basic Ingestion
PowerBI provides two sets of API i.e. [Basic API and Admin API](https://learn.microsoft.com/en-us/rest/api/power-bi/).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please update the Caveats of setting admin_apis_only to true: and add a bullet point for dataset profiling is not available

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay, I added mention that the dataset profiling is not available through the Admin API

Copy link
Contributor

@sid-acryl sid-acryl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@shirshanka shirshanka merged commit b223281 into datahub-project:master Jun 28, 2024
58 checks passed
aviv-julienjehannet pushed a commit to aviv-julienjehannet/datahub that referenced this pull request Jul 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community-contribution PR or Issue raised by member(s) of DataHub Community ingestion PR or Issue related to the ingestion of metadata poc-marathon-dec-2023
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants