feat(ingest/dbt): point dbt assertions at dbt nodes #10055

Merged 3 commits on Mar 19, 2024
1 change: 1 addition & 0 deletions docs/how/updating-datahub.md
@@ -22,6 +22,7 @@ This file documents any backwards-incompatible changes in DataHub and assists pe

- #9934 - Stateful ingestion is now enabled by default if datahub-rest sink is used or if a `datahub_api` is specified. It will still be disabled by default when any other sink type is used.
- #10002 - The `DataHubGraph` client no longer makes a request to the backend during initialization. If you want to preserve the old behavior, call `graph.test_connection()` after constructing the client.
- #10055 - Assertion entities generated by dbt are now associated with the dbt dataset entity, and not the entity in the data warehouse.

### Potential Downtime

8 changes: 4 additions & 4 deletions metadata-ingestion/docs/sources/dbt/README.md
@@ -5,10 +5,10 @@ Ingesting metadata from dbt requires either using the **dbt** module or the **db
| Source Concept | DataHub Concept | Notes |
| --------------- | ------------------------------------------------------------- | ------------------ |
| `"dbt"` | [Data Platform](../../metamodel/entities/dataPlatform.md) | |
| dbt Source | [Dataset](../../metamodel/entities/dataset.md) | Subtype `source` |
| dbt Seed | [Dataset](../../metamodel/entities/dataset.md) | Subtype `seed` |
| dbt Model | [Dataset](../../metamodel/entities/dataset.md) | Subtype `model` |
| dbt Snapshot | [Dataset](../../metamodel/entities/dataset.md) | Subtype `snapshot` |
| dbt Source | [Dataset](../../metamodel/entities/dataset.md) | Subtype `Source` |
| dbt Seed | [Dataset](../../metamodel/entities/dataset.md) | Subtype `Seed` |
| dbt Model | [Dataset](../../metamodel/entities/dataset.md) | Subtype `Model` |
| dbt Snapshot | [Dataset](../../metamodel/entities/dataset.md) | Subtype `Snapshot` |
| dbt Test | [Assertion](../../metamodel/entities/assertion.md) | |
| dbt Test Result | [Assertion Run Result](../../metamodel/entities/assertion.md) | |

59 changes: 54 additions & 5 deletions metadata-ingestion/docs/sources/dbt/dbt.md
@@ -166,18 +166,33 @@ The example below sets the query tag `tag` key's value as a global tag.

### Integrating with dbt test

To integrate with dbt tests, the `dbt` source needs access to the `run_results.json` file generated after a `dbt test` execution. Typically, this is written to the `target` directory. A common pattern you can follow is:
To integrate with dbt tests, the `dbt` source needs access to the `run_results.json` file generated after a `dbt test` or `dbt build` execution. Typically, this is written to the `target` directory. A common pattern you can follow is:

1. Run `dbt docs generate` and upload `manifest.json` and `catalog.json` to a location accessible to the `dbt` source (e.g. s3 or local file system)
2. Run `dbt test` and upload `run_results.json` to a location accessible to the `dbt` source (e.g. s3 or local file system)
3. Run `datahub ingest -c dbt_recipe.dhub.yaml` with the following config parameters specified
- test_results_path: pointing to the run_results.json file that you just created
1. Run `dbt build`
2. Copy the `target/run_results.json` file to a separate location. This is important, because otherwise subsequent `dbt` commands will overwrite the run results.
3. Run `dbt docs generate` to generate the `manifest.json` and `catalog.json` files
4. The dbt source makes use of the manifest, catalog, and run results files, so they will need to be moved to a location accessible to the `dbt` source (e.g. s3 or local file system). In the ingestion recipe, set the `test_results_path` config to the location of the `run_results.json` file from the `dbt build` or `dbt test` run, as sketched in the recipe below.
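
For reference, a minimal recipe sketch; the file paths below are placeholders and the sink configuration is omitted:

```yaml
source:
  type: dbt
  config:
    manifest_path: ./dbt-artifacts/manifest.json # copied from target/manifest.json
    catalog_path: ./dbt-artifacts/catalog.json # copied from target/catalog.json
    test_results_path: ./dbt-artifacts/run_results.json # from the dbt build / dbt test run
    target_platform: postgres
```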

The connector will produce the following things:

- Assertion definitions that are attached to the dataset (or datasets)
- Results from running the tests attached to the timeline of the dataset

:::note Missing test results?

The most common reason for missing test results is that the `run_results.json` with the test result information is getting overwritten by a subsequent `dbt` command. We recommend copying the `run_results.json` file before running other `dbt` commands.

```sh
dbt source snapshot-freshness
dbt build
cp target/run_results.json target/run_results_backup.json
dbt docs generate

# Reference target/run_results_backup.json in the dbt source config.
```

:::

#### View of dbt tests for a dataset

![test view](https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/dbt-tests-view.png)
@@ -220,3 +235,37 @@ source:
    entities_enabled:
      test_results: No
```

### Multiple dbt projects

In more complex dbt setups, you may have multiple dbt projects, where models from one project are used as sources in another project.
DataHub supports this setup natively.

Each dbt project should have its own dbt ingestion recipe, and the `platform_instance` field in the recipe should be set to the dbt project name.

For example, if you have two dbt projects `analytics` and `data_mart`, you would have two ingestion recipes.
If you have models in the `data_mart` project that are used as sources in the `analytics` project, the lineage will be automatically captured.

```yaml
# Analytics dbt project
source:
  type: dbt
  config:
    platform_instance: analytics
    target_platform: postgres
    manifest_path: analytics/target/manifest.json
    catalog_path: analytics/target/catalog.json
    # ... other configs
```

```yaml
# Data Mart dbt project
source:
  type: dbt
  config:
    platform_instance: data_mart
    target_platform: postgres
    manifest_path: data_mart/target/manifest.json
    catalog_path: data_mart/target/catalog.json
    # ... other configs
```
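
To make the cross-project setup concrete, the sketch below shows roughly how a model from the `data_mart` project could be declared as a source in the `analytics` project; the file path, schema, and table names are hypothetical and depend on your dbt layout:

```yaml
# analytics project: models/sources.yml (hypothetical path and names)
version: 2

sources:
  - name: data_mart
    schema: data_mart # schema that the data_mart project's models are written to
    tables:
      - name: customers # a model built by the data_mart project
```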
42 changes: 36 additions & 6 deletions metadata-ingestion/src/datahub/ingestion/source/dbt/dbt_common.py
@@ -641,6 +641,34 @@ def get_upstreams(
    return upstream_urns


def get_upstreams_for_test(
    test_node: DBTNode,
    all_nodes_map: Dict[str, DBTNode],
    platform_instance: Optional[str],
    environment: str,
) -> List[str]:
    upstream_urns = []

    for upstream in test_node.upstream_nodes:
        if upstream not in all_nodes_map:
            logger.debug(
                f"Upstream node of test {upstream} not found in all manifest entities."
            )
            continue

        upstream_manifest_node = all_nodes_map[upstream]

        upstream_urns.append(
            upstream_manifest_node.get_urn(
                target_platform=DBT_PLATFORM,
                data_platform_instance=platform_instance,
                env=environment,
            )
        )

    return upstream_urns


def make_mapping_upstream_lineage(
    upstream_urn: str, downstream_urn: str, node: DBTNode
) -> UpstreamLineageClass:
@@ -789,16 +817,18 @@ def create_test_entity_mcps(
),
).as_workunit()

upstream_urns = get_upstreams(
    upstreams=node.upstream_nodes,
    all_nodes=all_nodes_map,
    target_platform=self.config.target_platform,
    target_platform_instance=self.config.target_platform_instance,
    environment=self.config.env,
upstream_urns = get_upstreams_for_test(
    test_node=node,
    all_nodes_map=all_nodes_map,
    platform_instance=self.config.platform_instance,
    environment=self.config.env,
)

# In case a dbt test depends on multiple tables, we create separate assertions for each.
# TODO: This logic doesn't actually work properly, since we're reusing the same assertion_urn
# across multiple upstream tables, so we're actually only creating one assertion and the last
# upstream_urn gets used. Luckily, most dbt tests are associated with a single table, so this
# doesn't cause major issues in practice.
for upstream_urn in sorted(upstream_urns):
    if self.config.entities_enabled.can_emit_node_type("test"):
        yield make_assertion_from_test(