feat(ingest): framework - client side changes for monitoring and reporting (#3807)
Showing 32 changed files with 2,100 additions and 559 deletions.
# Datahub's Reporting Framework for Ingestion Job Telemetry
Datahub's reporting framework allows for configuring reporting providers with the ingestion pipelines to send
telemetry about ingestion job runs to external systems for monitoring purposes. It is powered by Datahub's
stateful ingestion framework. The `datahub` reporting provider comes with the standard client installation,
and allows for reporting ingestion job telemetry to the Datahub backend as the destination.

**_NOTE_**: This feature requires the server to be `statefulIngestion` capable.
This is a feature of the metadata service with version >= `0.8.20`.

To check if you are running a stateful ingestion capable server:
```console
curl http://<datahub-gms-endpoint>/config

{
  models: { },
  statefulIngestionCapable: true, # <-- this should be present and true
  retention: "true",
  noCode: "true"
}
```
## Config details
The ingestion reporting providers are a list of reporting provider configurations under the `reporting` config
param of the pipeline, with each reporting provider configuration being a type and config pair object. The telemetry data will
be sent to all the reporting providers in this list.

Note that a `.` is used to denote nested fields, and `[idx]` is used to denote an element of an array of objects in the YAML recipe.

| Field | Required | Default | Description |
|---|---|---|---|
| `reporting[idx].type` | ✅ | `datahub` | The type of the ingestion reporting provider registered with datahub. |
| `reporting[idx].config` | | The `datahub_api` config if set at pipeline level. Otherwise, the default `DatahubClientConfig`. See the [defaults](https://github.com/linkedin/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/graph/client.py#L19) here. | The configuration required for initializing the datahub reporting provider. |
| `pipeline_name` | ✅ | | The name of the ingestion pipeline. This is used as a part of the identifying key for the telemetry data reported by each job in the ingestion pipeline. |

#### Supported sources
* All SQL-based sources.
* snowflake_usage.
#### Sample configuration
```yaml
source:
  type: "snowflake"
  config:
    username: <user_name>
    password: <password>
    role: <role>
    host_port: <host_port>
    warehouse: <ware_house>
    # Rest of the source specific params ...

# This is mandatory. Changing it will cause old telemetry correlation to be lost.
pipeline_name: "my_snowflake_pipeline_1"

# Pipeline-level datahub_api configuration.
datahub_api: # Optional. But if provided, this config will be used by the "datahub" ingestion state provider.
  server: "http://localhost:8080"

sink:
  type: "datahub-rest"
  config:
    server: 'http://localhost:8080'

reporting:
  - type: "datahub" # Required
    config: # Optional.
      datahub_api: # default value
        server: "http://localhost:8080"
```
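The requirements called out in the sample recipe (`pipeline_name` is mandatory, each `reporting` entry needs a `type`) can be checked before running a pipeline. The sketch below is our own illustration; `validate_recipe` is a hypothetical helper, not part of the framework.

```python
def validate_recipe(recipe: dict) -> list:
    """Collect violations of the documented requirements: pipeline_name is
    mandatory, and every reporting entry needs a type."""
    problems = []
    if not recipe.get("pipeline_name"):
        problems.append("pipeline_name is required for telemetry correlation")
    for idx, provider in enumerate(recipe.get("reporting", [])):
        if "type" not in provider:
            problems.append(f"reporting[{idx}].type is required")
    return problems


# The sample recipe above, expressed as the equivalent Python dict.
recipe = {
    "source": {"type": "snowflake", "config": {"username": "<user_name>"}},
    "pipeline_name": "my_snowflake_pipeline_1",
    "datahub_api": {"server": "http://localhost:8080"},
    "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
    "reporting": [{"type": "datahub", "config": {}}],
}
```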
## Reporting Ingestion State Provider (Developer Guide)
An ingestion reporting state provider is responsible for saving and retrieving the ingestion telemetry
associated with the ingestion runs of various jobs inside the source connector of the ingestion pipeline.
The data model used for capturing the telemetry is [DatahubIngestionRunSummary](https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/datajob/datahub/DatahubIngestionRunSummary.pdl).
A reporting ingestion state provider needs to implement the [IngestionReportingProviderBase](https://github.com/linkedin/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/api/ingestion_job_reporting_provider_base.py)
interface and register itself with datahub by adding an entry under the `datahub.ingestion.checkpointing_provider.plugins`
key of the entry_points section in [setup.py](https://github.com/linkedin/datahub/blob/master/metadata-ingestion/setup.py)
with its type and implementation class as shown below.
```python
entry_points = {
    # <snip other keys>
    "datahub.ingestion.checkpointing_provider.plugins": [
        "datahub = datahub.ingestion.source.state_provider.datahub_ingestion_checkpointing_provider:DatahubIngestionCheckpointingProvider",
    ],
}
```
### Datahub Reporting Ingestion State Provider
This is the reporting state provider implementation that is available out of the box in datahub. Its type is `datahub` and it is implemented on top
of the `datahub_api` client and the timeseries aspect capabilities of the Datahub backend.
#### Config details

Note that a `.` is used to denote nested fields in the YAML recipe.

| Field | Required | Default | Description |
|---|---|---|---|
| `type` | ✅ | `datahub` | The type of the ingestion reporting provider registered with datahub. |
| `config` | | The `datahub_api` config if set at pipeline level. Otherwise, the default `DatahubClientConfig`. See the [defaults](https://github.com/linkedin/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/graph/client.py#L19) here. | The configuration required for initializing the datahub reporting provider. |
67 changes: 67 additions & 0 deletions
metadata-ingestion/src/datahub/ingestion/api/committable.py
```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum, auto
from typing import Generic, List, Optional, TypeVar


class CommitPolicy(Enum):
    ALWAYS = auto()
    ON_NO_ERRORS = auto()
    ON_NO_ERRORS_AND_NO_WARNINGS = auto()


@dataclass
class _CommittableConcrete:
    name: str
    commit_policy: CommitPolicy
    committed: bool


# The concrete portion of Committable is separated from the abstract portion due to
# https://github.com/python/mypy/issues/5374#issuecomment-568335302.
class Committable(_CommittableConcrete, ABC):
    def __init__(self, name: str, commit_policy: CommitPolicy):
        super(Committable, self).__init__(name, commit_policy, committed=False)

    @abstractmethod
    def commit(self) -> None:
        pass


StateKeyType = TypeVar("StateKeyType")
StateType = TypeVar("StateType")
# TODO: Add a better alternative to a string for the filter.
FilterType = TypeVar("FilterType")


@dataclass
class _StatefulCommittableConcrete(Generic[StateType]):
    state_to_commit: StateType


class StatefulCommittable(
    Committable,
    _StatefulCommittableConcrete[StateType],
    Generic[StateKeyType, StateType, FilterType],
):
    def __init__(
        self, name: str, commit_policy: CommitPolicy, state_to_commit: StateType
    ):
        # _CommittableConcrete's __init__ is invoked first from this class.
        super(StatefulCommittable, self).__init__(
            name=name, commit_policy=commit_policy
        )
        # _StatefulCommittableConcrete comes after _CommittableConcrete in the __mro__.
        super(_CommittableConcrete, self).__init__(state_to_commit=state_to_commit)

    def has_successfully_committed(self) -> bool:
        # Trivially successful if there was nothing to commit, or the commit happened.
        return not self.state_to_commit or self.committed

    @abstractmethod
    def get_previous_states(
        self,
        state_key: StateKeyType,
        last_only: bool = True,
        filter_opt: Optional[FilterType] = None,
    ) -> List[StateType]:
        pass
```
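The `CommitPolicy` enum above controls when a committable's pending state is flushed. A self-contained sketch of how a pipeline's commit step might interpret the three policies; `should_commit` is our own illustration, not part of this commit, and the real decision logic lives in the pipeline:

```python
from enum import Enum, auto


class CommitPolicy(Enum):
    ALWAYS = auto()
    ON_NO_ERRORS = auto()
    ON_NO_ERRORS_AND_NO_WARNINGS = auto()


def should_commit(policy: CommitPolicy, errors: int, warnings: int) -> bool:
    # ALWAYS commits unconditionally; the other two gate on the run's outcome.
    if policy is CommitPolicy.ALWAYS:
        return True
    if policy is CommitPolicy.ON_NO_ERRORS:
        return errors == 0
    return errors == 0 and warnings == 0
```

Reporting providers default to `ALWAYS` (telemetry should flow even for failed runs), while checkpointing providers default to `ON_NO_ERRORS` so that a broken run does not advance the checkpoint.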
64 changes: 64 additions & 0 deletions
metadata-ingestion/src/datahub/ingestion/api/ingestion_job_checkpointing_provider_base.py
```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

from datahub.ingestion.api.committable import CommitPolicy
from datahub.ingestion.api.common import PipelineContext
from datahub.ingestion.api.ingestion_job_state_provider import (
    IngestionJobStateProvider,
    IngestionJobStateProviderConfig,
    JobId,
    JobStateFilterType,
    JobStateKey,
    JobStatesMap,
)
from datahub.metadata.schema_classes import DatahubIngestionCheckpointClass

#
# Common type exports
#
JobId = JobId
JobStateKey = JobStateKey
JobStateFilterType = JobStateFilterType

#
# Checkpoint state specific types
#
CheckpointJobStateType = DatahubIngestionCheckpointClass
CheckpointJobStatesMap = JobStatesMap[CheckpointJobStateType]


class IngestionCheckpointingProviderConfig(IngestionJobStateProviderConfig):
    pass


@dataclass()
class IngestionCheckpointingProviderBase(
    IngestionJobStateProvider[CheckpointJobStateType]
):
    """
    The base class (non-abstract) for all checkpointing state provider implementations.
    This class is implemented this way as a concrete class is needed to work with the registry,
    but we don't want to implement any of the functionality yet.
    """

    def __init__(
        self, name: str, commit_policy: CommitPolicy = CommitPolicy.ON_NO_ERRORS
    ):
        super(IngestionCheckpointingProviderBase, self).__init__(name, commit_policy)

    @classmethod
    def create(
        cls, config_dict: Dict[str, Any], ctx: PipelineContext, name: str
    ) -> "IngestionJobStateProvider":
        raise NotImplementedError("Sub-classes must override this method.")

    def get_previous_states(
        self,
        state_key: JobStateKey,
        last_only: bool = True,
        filter_opt: Optional[JobStateFilterType] = None,
    ) -> List[CheckpointJobStatesMap]:
        raise NotImplementedError("Sub-classes must override this method.")

    def commit(self) -> None:
        raise NotImplementedError("Sub-classes must override this method.")
```
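The base class stays concrete (so the plugin registry can instantiate references to it) while every method raises `NotImplementedError`. A self-contained toy mirroring that subclass-and-override pattern; `ProviderBase` and `InMemoryProvider` are illustrative stand-ins, not the real datahub classes:

```python
from typing import Any, Dict, List


class ProviderBase:
    """Concrete for registry compatibility; real providers must override everything."""

    def __init__(self, name: str):
        self.name = name
        self.state_to_commit: Any = None

    @classmethod
    def create(cls, config_dict: Dict[str, Any], name: str) -> "ProviderBase":
        raise NotImplementedError("Sub-classes must override this method.")

    def get_previous_states(self, state_key: str, last_only: bool = True) -> List[dict]:
        raise NotImplementedError("Sub-classes must override this method.")

    def commit(self) -> None:
        raise NotImplementedError("Sub-classes must override this method.")


class InMemoryProvider(ProviderBase):
    """A toy provider that keeps job states in a dict instead of the Datahub backend."""

    def __init__(self, name: str):
        super().__init__(name)
        self._states: Dict[str, List[dict]] = {}

    @classmethod
    def create(cls, config_dict: Dict[str, Any], name: str) -> "InMemoryProvider":
        return cls(name)

    def get_previous_states(self, state_key: str, last_only: bool = True) -> List[dict]:
        states = self._states.get(state_key, [])
        return states[-1:] if last_only else states

    def commit(self) -> None:
        if self.state_to_commit is not None:
            self._states.setdefault(self.name, []).append(self.state_to_commit)
            self.state_to_commit = None
```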
60 changes: 60 additions & 0 deletions
metadata-ingestion/src/datahub/ingestion/api/ingestion_job_reporting_provider_base.py
```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

from datahub.ingestion.api.committable import CommitPolicy
from datahub.ingestion.api.common import PipelineContext
from datahub.ingestion.api.ingestion_job_state_provider import (
    IngestionJobStateProvider,
    IngestionJobStateProviderConfig,
    JobId,
    JobStateFilterType,
    JobStateKey,
    JobStatesMap,
)
from datahub.metadata.schema_classes import DatahubIngestionRunSummaryClass

#
# Common type exports
#
JobId = JobId
JobStateKey = JobStateKey
JobStateFilterType = JobStateFilterType

#
# Reporting state specific types
#
ReportingJobStateType = DatahubIngestionRunSummaryClass
ReportingJobStatesMap = JobStatesMap[ReportingJobStateType]


class IngestionReportingProviderConfig(IngestionJobStateProviderConfig):
    pass


@dataclass()
class IngestionReportingProviderBase(IngestionJobStateProvider[ReportingJobStateType]):
    """
    The base class (non-abstract) for all reporting state provider implementations.
    This class is implemented this way as a concrete class is needed to work with the registry,
    but we don't want to implement any of the functionality yet.
    """

    def __init__(self, name: str, commit_policy: CommitPolicy = CommitPolicy.ALWAYS):
        super(IngestionReportingProviderBase, self).__init__(name, commit_policy)

    @classmethod
    def create(
        cls, config_dict: Dict[str, Any], ctx: PipelineContext, name: str
    ) -> "IngestionJobStateProvider":
        raise NotImplementedError("Sub-classes must override this method.")

    def get_previous_states(
        self,
        state_key: JobStateKey,
        last_only: bool = True,
        filter_opt: Optional[JobStateFilterType] = None,
    ) -> List[ReportingJobStatesMap]:
        raise NotImplementedError("Sub-classes must override this method.")

    def commit(self) -> None:
        raise NotImplementedError("Sub-classes must override this method.")
```