feat(ingest/powerbi): powerbi dataset profiling #9355

Merged on Jun 28, 2024 (18 commits)
2 changes: 1 addition & 1 deletion docs/quick-ingestion-guides/powerbi/setup.md
@@ -70,7 +70,7 @@ In order to configure ingestion from PowerBI, you'll first have to ensure you ha
- `Enhance admin APIs responses with detailed metadata`
- `Enhance admin APIs responses with DAX and mashup expressions`

- f. **Add Security Group to Workspace:** Navigate to `Workspaces` window and open workspace which you want to ingest as shown in below screenshot and click on `Access` and add `powerbi-connector-app-security-group` as member
+ f. **Add Security Group to Workspace:** Navigate to `Workspaces` window and open workspace which you want to ingest as shown in below screenshot and click on `Access` and add `powerbi-connector-app-security-group` as member. For most cases `Viewer` role is enough, but for profiling the `Contributor` role is required.

<p align="center">
<img width="75%" alt="workspace-window-underlined" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/guides/powerbi/workspace-window-undrlined.png"/>
9 changes: 9 additions & 0 deletions metadata-ingestion/docs/sources/powerbi/powerbi_pre.md
@@ -107,6 +107,14 @@ By default, extracting endorsement information to tags is disabled. The feature

Please note that the default implementation overwrites tags for the ingested entities, if you need to preserve existing tags, consider using a [transformer](../../../../metadata-ingestion/docs/transformer/dataset_transformer.md#simple-add-dataset-globaltags) with `semantics: PATCH` tags instead of `OVERWRITE`.

## Profiling

Profiling is implemented by querying the [DAX query endpoint](https://learn.microsoft.com/en-us/rest/api/power-bi/datasets/execute-queries). The principal therefore needs permission to query the datasets to be profiled, which usually means the service principal should have the `Contributor` role on the workspace to be ingested. Profiling uses column-based queries so that wide datasets can be handled without timeouts.

Keep in mind that the profiling implementation executes a fairly large number of DAX queries; for big datasets this puts a substantial load on the PowerBI system.

The `profile_pattern` setting may be used to limit profiling to a certain set of resources in PowerBI. Both allow and deny rules are matched against the following pattern for every table in a PowerBI dataset: `workspace_name.dataset_name.table_name`. Profiling can thus be limited at the table, dataset, or workspace level.
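
To make the fully qualified name format concrete, here is a minimal sketch of how allow and deny rules could be evaluated against `workspace_name.dataset_name.table_name`. `SimpleAllowDeny` is a hypothetical stand-in written for this example, not DataHub's `AllowDenyPattern`; it assumes that deny rules take precedence, that patterns are matched from the start of the string, and that matching is case-insensitive. The workspace, dataset and table names are invented.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class SimpleAllowDeny:
    """Hypothetical stand-in for an allow/deny regex filter."""

    allow: List[str] = field(default_factory=lambda: [".*"])
    deny: List[str] = field(default_factory=list)

    def allowed(self, name: str) -> bool:
        # Assumption: a matching deny rule always wins over allow rules.
        if any(re.match(p, name, re.IGNORECASE) for p in self.deny):
            return False
        return any(re.match(p, name, re.IGNORECASE) for p in self.allow)


def qualified_table_name(workspace: str, dataset: str, table: str) -> str:
    # Same format as documented: workspace_name.dataset_name.table_name
    return f"{workspace}.{dataset}.{table}"


# Profile every table of one dataset except its staging tables.
pattern = SimpleAllowDeny(
    allow=[r"Finance\.Sales Model\..*"],
    deny=[r"Finance\.Sales Model\.Staging.*"],
)

orders = qualified_table_name("Finance", "Sales Model", "Orders")
staging = qualified_table_name("Finance", "Sales Model", "Staging Orders")
other = qualified_table_name("HR", "People", "Employees")
```

With these rules, `pattern.allowed(orders)` is true while the staging table and the table from the other workspace are rejected. Dots in workspace or dataset names should be escaped (`\.`) because the rules are regular expressions.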

## Admin Ingestion vs. Basic Ingestion
PowerBI provides two sets of API i.e. [Basic API and Admin API](https://learn.microsoft.com/en-us/rest/api/power-bi/).

Contributor:

Could you please update the caveats of setting `admin_apis_only` to `true` and add a bullet point stating that dataset profiling is not available?

Contributor Author:

Sorry for the delay. I added a mention that dataset profiling is not available through the Admin API.

@@ -139,6 +147,7 @@ If you don't want to add a service principal as a member in your workspace, then
Caveats of setting `admin_apis_only` to `true`:
- Report's pages would not get ingested as page API is not available in PowerBI Admin API
- [PowerBI Parameters](https://learn.microsoft.com/en-us/power-query/power-query-query-parameters) would not get resolved to actual values while processing M-Query for table lineage
- Dataset profiling is unavailable, as it requires access to the workspace API


### Basic Ingestion: Service Principal As Member In Workspace
10 changes: 10 additions & 0 deletions metadata-ingestion/docs/sources/powerbi/powerbi_recipe.yml
@@ -64,6 +64,16 @@ source:
    # extract powerbi dataset table schema
    extract_dataset_schema: true

    # Enable PowerBI dataset profiling
    profiling:
      enabled: false
    # Pattern to limit which resources to profile
    # Matched resource format is following:
    # workspace_name.dataset_name.table_name
    profile_pattern:
      deny:
        - .*


sink:
  # sink configs
19 changes: 19 additions & 0 deletions metadata-ingestion/src/datahub/ingestion/source/powerbi/config.py
@@ -46,6 +46,7 @@ class Constant:
    Authorization = "Authorization"
    WORKSPACE_ID = "workspaceId"
    DASHBOARD_ID = "powerbi.linkedin.com/dashboards/{}"
    DATASET_EXECUTE_QUERIES = "DATASET_EXECUTE_QUERIES_POST"
    DATASET_ID = "datasetId"
    REPORT_ID = "reportId"
    SCAN_ID = "ScanId"
@@ -59,9 +60,12 @@ class Constant:
    STATUS = "status"
    CHART_ID = "powerbi.linkedin.com/charts/{}"
    CHART_KEY = "chartKey"
    COLUMN_TYPE = "columnType"
    DATA_TYPE = "dataType"
    DASHBOARD = "dashboard"
    DASHBOARDS = "dashboards"
    DASHBOARD_KEY = "dashboardKey"
    DESCRIPTION = "description"
    OWNERSHIP = "ownership"
    BROWSERPATH = "browsePaths"
    DASHBOARD_INFO = "dashboardInfo"
@@ -108,6 +112,7 @@ class Constant:
    TABLES = "tables"
    EXPRESSION = "expression"
    SOURCE = "source"
    SCHEMA_METADATA = "schemaMetadata"
    PLATFORM_NAME = "powerbi"
    REPORT_TYPE_NAME = BIAssetSubTypes.REPORT
    CHART_COUNT = "chartCount"
@@ -228,6 +233,13 @@ class OwnershipMapping(ConfigModel):
    )


class PowerBiProfilingConfig(ConfigModel):
    enabled: bool = pydantic.Field(
        default=False,
        description="Whether profiling of PowerBI datasets should be done",
    )


class PowerBiDashboardSourceConfig(
    StatefulIngestionConfigBase, DatasetSourceConfigMixin
):
@@ -406,6 +418,13 @@ class PowerBiDashboardSourceConfig(
        "Works for M-Query where native SQL is used for transformation.",
    )

    profile_pattern: AllowDenyPattern = pydantic.Field(
        default=AllowDenyPattern.allow_all(),
        description="Regex patterns to filter tables for profiling during ingestion. Note that only tables "
        "allowed by the `table_pattern` will be considered. Matched format is 'workspacename.datasetname.tablename'",
    )
    profiling: PowerBiProfilingConfig = PowerBiProfilingConfig()

    @root_validator(skip_on_failure=True)
    def validate_extract_column_level_lineage(cls, values: Dict) -> Dict:
        flags = [
61 changes: 61 additions & 0 deletions metadata-ingestion/src/datahub/ingestion/source/powerbi/powerbi.py
@@ -67,7 +67,9 @@
    CorpUserKeyClass,
    DashboardInfoClass,
    DashboardKeyClass,
    DatasetFieldProfileClass,
    DatasetLineageTypeClass,
    DatasetProfileClass,
    DatasetPropertiesClass,
    GlobalTagsClass,
    OtherSchemaClass,
@@ -483,9 +485,64 @@ def to_datahub_dataset(
                    Constant.DATASET,
                    dataset.tags,
                )
            self.extract_profile(dataset_mcps, workspace, dataset, table, ds_urn)

        return dataset_mcps

    def extract_profile(
        self,
        dataset_mcps: List[MetadataChangeProposalWrapper],
        workspace: powerbi_data_classes.Workspace,
        dataset: powerbi_data_classes.PowerBIDataset,
        table: powerbi_data_classes.Table,
        ds_urn: str,
    ) -> None:
        if not self.__config.profiling.enabled:
            # Profiling not enabled
            return

        if not self.__config.profile_pattern.allowed(
            f"{workspace.name}.{dataset.name}.{table.name}"
        ):
            logger.info(
                f"Table {table.name} in {dataset.name}, not allowed for profiling"
            )
            return
        logger.debug(f"Profiling table: {table.name}")

        profile = DatasetProfileClass(timestampMillis=builder.get_sys_time())
        profile.rowCount = table.row_count
        profile.fieldProfiles = []

        columns: List[
            Union[powerbi_data_classes.Column, powerbi_data_classes.Measure]
        ] = [*(table.columns or []), *(table.measures or [])]
        for column in columns:
            allowed_column = self.__config.profile_pattern.allowed(
                f"{workspace.name}.{dataset.name}.{table.name}.{column.name}"
Contributor:

Could you please give an example value for the `profile_pattern` config, to make it clear to users that they need to specify the fully qualified name of the table?

            )
            if column.isHidden or not allowed_column:
                logger.info(f"Column {column.name} not allowed for profiling")
                continue
            measure_profile = column.measure_profile
            if measure_profile:
                field_profile = DatasetFieldProfileClass(column.name or "")
                field_profile.sampleValues = measure_profile.sample_values
                field_profile.min = measure_profile.min
                field_profile.max = measure_profile.max
                field_profile.uniqueCount = measure_profile.unique_count
                profile.fieldProfiles.append(field_profile)

        profile.columnCount = table.column_count

        mcp = MetadataChangeProposalWrapper(
            entityType="dataset",
            entityUrn=ds_urn,
            aspectName="datasetProfile",
            aspect=profile,
        )
        dataset_mcps.append(mcp)

    @staticmethod
    def transform_tags(tags: List[str]) -> GlobalTagsClass:
        return GlobalTagsClass(
@@ -1180,6 +1237,10 @@ def report_to_datahub_work_units(
    SourceCapability.LINEAGE_FINE,
    "Disabled by default, configured using `extract_column_level_lineage`. ",
)
@capability(
    SourceCapability.DATA_PROFILING,
    "Optionally enabled via configuration profiling.enabled",
)
class PowerBiDashboardSource(StatefulIngestionSourceBase, TestableSource):
    """
    This plugin extracts the following:
@@ -85,6 +85,14 @@ def __hash__(self):
        return hash(self.__members())


@dataclass
class MeasureProfile:
    min: Optional[str] = None
    max: Optional[str] = None
    unique_count: Optional[int] = None
    sample_values: Optional[List[str]] = None


@dataclass
class Column:
    name: str
@@ -96,6 +104,7 @@ class Column:
    columnType: Optional[str] = None
    expression: Optional[str] = None
    description: Optional[str] = None
    measure_profile: Optional[MeasureProfile] = None


@dataclass
@@ -108,6 +117,7 @@ class Measure:
        BooleanTypeClass, DateTypeClass, NullTypeClass, NumberTypeClass, StringTypeClass
    ] = dataclasses.field(default_factory=NullTypeClass)
    description: Optional[str] = None
    measure_profile: Optional[MeasureProfile] = None


@dataclass
@@ -117,6 +127,8 @@ class Table:
    expression: Optional[str] = None
    columns: Optional[List[Column]] = None
    measures: Optional[List[Measure]] = None
    row_count: Optional[int] = None
    column_count: Optional[int] = None

    # Pointer to the parent dataset.
    dataset: Optional["PowerBIDataset"] = None