refactor(docs): Update Metadata Events Docs #5173

Merged
2 changes: 2 additions & 0 deletions docs-website/sidebars.js
@@ -49,6 +49,7 @@ module.exports = {
"docs/components",
"docs/architecture/metadata-ingestion",
"docs/architecture/metadata-serving",
"docs/what/mxe",
// "docs/what/gma",
// "docs/what/gms",
],
@@ -297,6 +298,7 @@ module.exports = {
"docs/how/auth/sso/configure-oidc-react-azure",
],
},
"docs/what/mxe",
"docs/how/restore-indices",
"docs/dev-guides/timeline",
"docs/how/extract-container-logs",
17 changes: 8 additions & 9 deletions docs/architecture/metadata-ingestion.md
@@ -8,28 +8,27 @@ DataHub supports an extremely flexible ingestion architecture that can support p
The figure below describes all the options possible for connecting your favorite system to DataHub.
![Ingestion Architecture](../imgs/ingestion-architecture.png)

## MCE: The Center Piece
## Metadata Change Proposal: The Center Piece

The center piece for ingestion is the [Metadata Change Event (MCE)] which represents a metadata change that is being communicated by an upstream system.
MCE-s can be sent over Kafka, for highly scalable async publishing from source systems. They can also be sent directly to the HTTP endpoint exposed by the DataHub service tier to get synchronous success / failure responses.
The center piece for ingestion is the [Metadata Change Proposal (MCP)], which represents a request to make a metadata change to an organization's Metadata Graph.
Metadata Change Proposals can be sent over Kafka for highly scalable asynchronous publishing from source systems. They can also be sent directly to the HTTP endpoint exposed by the DataHub service tier to get synchronous success / failure responses.
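
As a minimal sketch of the synchronous HTTP path, an MCP can be constructed and emitted with the `acryl-datahub` Python library. The GMS address, platform, and dataset URN below are placeholders, and the exact `MetadataChangeProposalWrapper` constructor arguments may vary slightly between library versions:

```python
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import ChangeTypeClass, DatasetPropertiesClass

# Assumed local GMS endpoint; adjust to your deployment.
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

mcp = MetadataChangeProposalWrapper(
    entityType="dataset",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn=make_dataset_urn(platform="hive", name="example_db.example_table", env="PROD"),
    aspectName="datasetProperties",
    aspect=DatasetPropertiesClass(description="Table registered via a push-based MCP."),
)

# The HTTP path returns a synchronous success / failure response (raises on failure).
emitter.emit(mcp)
```

The Kafka path carries the same MCP payload; only the transport and delivery guarantees differ.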

## Pull-based Integration

DataHub ships with a Python based [metadata-ingestion system](../../metadata-ingestion/README.md) that can connect to different sources to pull metadata from them. This metadata is then pushed via Kafka or HTTP to the DataHub storage tier. Metadata ingestion pipelines can be [integrated with Airflow](../../metadata-ingestion/README.md#lineage-with-airflow) to set up scheduled ingestion or capture lineage. If you don't find a source already supported, it is very easy to [write your own](../../metadata-ingestion/README.md#contributing).
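
As an illustration, a pull-based ingestion run can also be triggered programmatically rather than via a YAML recipe; the source type and connection details below are placeholder assumptions:

```python
from datahub.ingestion.run.pipeline import Pipeline

# The same configuration is normally written as a YAML recipe and run with
# `datahub ingest -c recipe.yml`; this is an equivalent in-process sketch.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "mysql",
            "config": {
                "host_port": "localhost:3306",
                "database": "example_db",
                "username": "datahub",
                "password": "datahub",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```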

## Push-based Integration

As long as you can emit a [Metadata Change Event (MCE)] event to Kafka or make a REST call over HTTP, you can integrate any system with DataHub. For convenience, DataHub also provides simple [Python emitters] for you to integrate into your systems to emit metadata changes (MCE-s) at the point of origin.
As long as you can emit a [Metadata Change Proposal (MCP)] event to Kafka or make a REST call over HTTP, you can integrate any system with DataHub. For convenience, DataHub also provides simple [Python emitters] for you to integrate into your systems to emit metadata changes (MCP-s) at the point of origin.
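
A hedged sketch of the Kafka-based push path with the Python Kafka emitter might look like the following. The broker and schema registry addresses are assumed local defaults, and newer library versions infer the entity type and aspect name from the wrapper:

```python
from datahub.emitter.kafka_emitter import DatahubKafkaEmitter, KafkaEmitterConfig
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.schema_classes import StatusClass

# Assumed local broker / schema registry; swap in your own connection details.
emitter = DatahubKafkaEmitter(
    KafkaEmitterConfig.parse_obj(
        {
            "connection": {
                "bootstrap": "localhost:9092",
                "schema_registry_url": "http://localhost:8081",
            }
        }
    )
)

mcp = MetadataChangeProposalWrapper(
    entityUrn="urn:li:dataset:(urn:li:dataPlatform:hive,example_db.example_table,PROD)",
    aspect=StatusClass(removed=False),
)

# The Kafka path is asynchronous; the callback reports per-message success / failure.
emitter.emit(mcp, callback=lambda err, msg: print(err or "delivered"))
emitter.flush()
```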

## Internal Components

### Applying MCE-s to DataHub Service Tier (mce-consumer)
### Applying Metadata Change Proposals to DataHub Metadata Service (mce-consumer-job)

DataHub comes with a Kafka Streams based job, [mce-consumer-job], which consumes the MCE-s and converts them into the [equivalent Pegasus format] and sends it to the DataHub Service Tier (datahub-gms) using the `/ingest` endpoint.
DataHub comes with a Kafka Streams based job, [mce-consumer-job], which consumes the Metadata Change Proposals and writes them into the DataHub Metadata Service (datahub-gms) using the `/ingest` endpoint.

[Metadata Change Event (MCE)]: ../what/mxe.md#metadata-change-event-mce
[Metadata Audit Event (MAE)]: ../what/mxe.md#metadata-audit-event-mae
[MAE]: ../what/mxe.md#metadata-audit-event-mae
[Metadata Change Proposal (MCP)]: ../what/mxe.md#metadata-change-proposal-mcp
[Metadata Change Log (MCL)]: ../what/mxe.md#metadata-change-log-mcl
[equivalent Pegasus format]: https://linkedin.github.io/rest.li/how_data_is_represented_in_memory#the-data-template-layer
[mce-consumer-job]: ../../metadata-jobs/mce-consumer-job
[Python emitters]: ../../metadata-ingestion/README.md#using-as-a-library
28 changes: 14 additions & 14 deletions docs/architecture/metadata-serving.md
@@ -8,49 +8,49 @@ The figure below shows the high-level system diagram for DataHub's Serving Tier.

![datahub-serving](../imgs/datahub-serving.png)

The primary service is called [gms](../../metadata-service) and exposes a REST API and a GraphQL API for performing CRUD operations on metadata. The metadata service also exposes search and graph query API-s to support secondary-index style queries, full-text search queries as well as relationship queries like lineage. In addition, the [datahub-frontend](../../datahub-frontend) service expose a GraphQL API on top of the metadata graph.
The primary component is called [the Metadata Service](../../metadata-service) and exposes a REST API and a GraphQL API for performing CRUD operations on metadata. The service also exposes search and graph query API-s to support secondary-index style queries, full-text search queries as well as relationship queries like lineage. In addition, the [datahub-frontend](../../datahub-frontend) service exposes a GraphQL API on top of the metadata graph.

## DataHub Serving Tier Components

### Metadata Storage

The DataHub Metadata Service (gms) persists metadata in a document store (could be an RDBMS like MySQL, Postgres or a key-value store like Couchbase etc.).
The DataHub Metadata Service persists metadata in a document store (an RDBMS such as MySQL or Postgres, or a NoSQL store such as Cassandra).

### Metadata Commit Log Stream (MAE)
### Metadata Change Log Stream (MCL)

The DataHub Service Tier also emits a commit event [Metadata Audit Event (MAE)] when a metadata change has been successfully committed to persistent storage. This event is sent over Kafka.
The DataHub Service Tier also emits a commit event, a [Metadata Change Log], when a metadata change has been successfully committed to persistent storage. This event is sent over Kafka.

The MAE stream is a public API and can be subscribed to by external systems providing an extremely powerful way to react in real-time to changes happening in metadata. For example, you could build an access control enforcer that reacts to change in metadata (e.g. a previously world-readable dataset now has a pii field) to immediately lock down the dataset in question.
Note that not all MCE-s will result in an MAE, because the DataHub serving tier will ignore any duplicate changes to metadata.
The MCL stream is a public API and can be subscribed to by external systems (for example, the Actions Framework), providing an extremely powerful way to react in real time to changes happening in metadata. For example, you could build an access control enforcer that reacts to changes in metadata (e.g. a previously world-readable dataset now has a PII field) to immediately lock down the dataset in question.
Note that not all MCP-s will result in an MCL, because the DataHub serving tier will ignore any duplicate changes to metadata.
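
As an illustrative sketch only, an external listener on this stream could look roughly like the following. It assumes a local broker and schema registry, the default `MetadataChangeLog_Versioned_v1` topic name, and Avro-encoded messages; a production integration would more likely use the Actions Framework:

```python
from confluent_kafka import DeserializingConsumer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroDeserializer

schema_registry = SchemaRegistryClient({"url": "http://localhost:8081"})

consumer = DeserializingConsumer(
    {
        "bootstrap.servers": "localhost:9092",
        "group.id": "mcl-audit-example",
        "auto.offset.reset": "latest",
        "value.deserializer": AvroDeserializer(schema_registry),
    }
)
consumer.subscribe(["MetadataChangeLog_Versioned_v1"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    # Deserialized MCL record: entityUrn, aspectName, changeType, aspect, ...
    change = msg.value()
    print(change.get("entityUrn"), change.get("aspectName"), change.get("changeType"))
```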

### Metadata Index Applier (mae-consumer-job)

[MAE]-s are consumed by another Kafka Streams job, [mae-consumer-job], which applies the changes to the [graph] and [search index] accordingly.
[Metadata Change Log]s are consumed by another Kafka Streams job, [mae-consumer-job], which applies the changes to the [graph] and [search index] accordingly.
The job is entity-agnostic and will execute corresponding graph & search index builders, which will be invoked by the job when a specific metadata aspect is changed.
The builder should instruct the job how to update the graph and search index based on the metadata change.
The builder can optionally use [Remote DAO] to fetch additional metadata from other sources to help compute the final update.

To ensure that metadata changes are processed in the correct chronological order, MAEs are keyed by the entity [URN] — meaning all MAEs for a particular entity will be processed sequentially by a single Kafka streams thread.
To ensure that metadata changes are processed in the correct chronological order, MCLs are keyed by the entity [URN], meaning all MCLs for a particular entity will be processed sequentially by a single Kafka Streams thread.

### Metadata Query Serving

Primary-key based reads (e.g. getting schema metadata for a dataset based on the `dataset-urn`) on metadata are routed to the document store. Secondary index based reads on metadata are routed to the search index (or alternately can use the strongly consistent secondary index support described [here]()). Full-text and advanced search queries are routed to the search index. Complex graph queries such as lineage are routed to the graph index.
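
For illustration, a primary-key style read through the DataHub Python client might look like the sketch below. The server address and URN are placeholders, and the exact client method names may differ across library versions:

```python
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import SchemaMetadataClass

# Assumed local Metadata Service endpoint.
graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,example_db.example_table,PROD)"

# Primary-key read: fetch a single aspect for a single URN from the document store.
schema = graph.get_aspect(entity_urn=dataset_urn, aspect_type=SchemaMetadataClass)
if schema is not None:
    print([field.fieldPath for field in schema.fields])
```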

[RecordTemplate]: https://github.com/linkedin/rest.li/blob/master/data/src/main/java/com/linkedin/data/template/RecordTemplate.java
[GenericRecord]: https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/generic/GenericRecord.java
[DAO]: https://en.wikipedia.org/wiki/Data_access_object
[Pegasus]: https://linkedin.github.io/rest.li/DATA-Data-Schema-and-Templates
[relationship]: ../what/relationship.md
[entity]: ../what/entity.md
[aspect]: ../what/aspect.md
[GMS]: ../what/gms.md
[MAE]: ../what/mxe.md#metadata-audit-event-mae
[Metadata Change Log]: ../what/mxe.md#metadata-change-log-mcl
[rest.li]: https://rest.li


[Metadata Change Event (MCE)]: ../what/mxe.md#metadata-change-event-mce
[Metadata Audit Event (MAE)]: ../what/mxe.md#metadata-audit-event-mae
[MAE]: ../what/mxe.md#metadata-audit-event-mae
[Metadata Change Proposal (MCP)]: ../what/mxe.md#metadata-change-proposal-mcp
[Metadata Change Log (MCL)]: ../what/mxe.md#metadata-change-log-mcl
[MCP]: ../what/mxe.md#metadata-change-proposal-mcp
[MCL]: ../what/mxe.md#metadata-change-log-mcl

[equivalent Pegasus format]: https://linkedin.github.io/rest.li/how_data_is_represented_in_memory#the-data-template-layer
[graph]: ../what/graph.md
[search index]: ../what/search-index.md
35 changes: 19 additions & 16 deletions docs/deploy/confluent-cloud.md
@@ -2,24 +2,26 @@

DataHub provides the ability to easily leverage Confluent Cloud as your Kafka provider. To do so, you'll need to configure DataHub to talk to a broker and schema registry hosted by Confluent.

Doing this is a matter of configuring the Kafka Producer and Consumers used by DataHub correctly. There are 2 places where Kafka configuration should be provided: the metadata server (GMS) and the frontend server (datahub-frontend). Follow the steps below to configure these components for your deployment.
Doing this is a matter of configuring the Kafka Producer and Consumers used by DataHub correctly. There are 2 places where Kafka configuration should be provided: the metadata service (GMS) and the frontend server (datahub-frontend). Follow the steps below to configure these components for your deployment.

## **Step 1: Create topics in Confluent Control Center**

First, you'll need to create the following new topics in the [Confluent Control Center](https://docs.confluent.io/platform/current/control-center/index.html). By default they have the following names:

1. **MetadataChangeEvent_v4**: Metadata change proposal messages
2. **MetadataAuditEvent_v4**: Metadata change log messages
3. **FailedMetadataChangeEvent_v4**: Failed to process #1 event
4. **DataHubUsageEvent_v1**: User behavior tracking event for UI
5. **MetadataChangeProposal_v1**
6. **FailedMetadataChangeProposal_v1**
7. **MetadataChangeLog_Versioned_v1**
8. **MetadataChangeLog_Timeseries_v1**
1. **MetadataChangeProposal_v1**
2. **FailedMetadataChangeProposal_v1**
3. **MetadataChangeLog_Versioned_v1**
4. **MetadataChangeLog_Timeseries_v1**
5. **DataHubUsageEvent_v1**: User behavior tracking event for UI
6. (Deprecated) **MetadataChangeEvent_v4**: Metadata change proposal messages
7. (Deprecated) **MetadataAuditEvent_v4**: Metadata change log messages
8. (Deprecated) **FailedMetadataChangeEvent_v4**: Failed-to-process MetadataChangeEvent_v4 messages

The last 4 are explained in [MCP/MCL](../advanced/mcp-mcl.md)
The first five topics are the most important; the MCP and MCL topics are explained in more depth in [MCP/MCL](../advanced/mcp-mcl.md). The final three topics are deprecated but still used in certain circumstances, and they are likely to be decommissioned entirely in the future.

To do so, navigate to your **Cluster** and click "Create Topic". Feel free to tweak the default topic configurations to
To create the topics, navigate to your **Cluster** and click "Create Topic". Feel free to tweak the default topic configurations to
match your preferences.

![CreateTopic](../imgs/confluent-create-topic.png)
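
If you prefer to script this step instead of clicking through Control Center, a hedged sketch with the Kafka `AdminClient` might look like the following. The bootstrap server, API key/secret, and partition/replication settings are placeholders to adapt to your Confluent Cloud cluster:

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient(
    {
        "bootstrap.servers": "pkc-xxxxx.us-west-2.aws.confluent.cloud:9092",
        "security.protocol": "SASL_SSL",
        "sasl.mechanisms": "PLAIN",
        "sasl.username": "<confluent-api-key>",
        "sasl.password": "<confluent-api-secret>",
    }
)

topics = [
    "MetadataChangeProposal_v1",
    "FailedMetadataChangeProposal_v1",
    "MetadataChangeLog_Versioned_v1",
    "MetadataChangeLog_Timeseries_v1",
    "DataHubUsageEvent_v1",
]

# create_topics returns a dict of topic name -> future; result() raises on failure.
futures = admin.create_topics(
    [NewTopic(name, num_partitions=1, replication_factor=3) for name in topics]
)
for name, future in futures.items():
    try:
        future.result()
        print(f"created {name}")
    except Exception as e:
        print(f"failed to create {name}: {e}")
```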
@@ -59,13 +61,14 @@ KAFKA_PROPERTIES_BASIC_AUTH_CREDENTIALS_SOURCE=USER_INFO
KAFKA_PROPERTIES_BASIC_AUTH_USER_INFO=P2ETAN5QR2LCWL14:RTjqw7AfETDl0RZo/7R0123LhPYs2TGjFKmvMWUFnlJ3uKubFbB1Sfs7aOjjNi1m23
```

Note that this step is only required if DATAHUB_ANALYTICS_ENABLED is not set to false.
Note that this step is only required if the `DATAHUB_ANALYTICS_ENABLED` environment variable is not explicitly set to `false` for the datahub-frontend container.

If you're deploying with Docker Compose, you do not need to deploy the Zookeeper, Kafka Broker, or Schema Registry containers that ship by default.

### Helm

If you're deploying to K8s using Helm, you can simply change the `datahub-helm` values.yml to point to Confluent Cloud and disable some default containers:
If you're deploying on K8s using Helm, you can simply change the **datahub-helm** `values.yaml` to point to Confluent Cloud and disable some default containers:

First, disable the `cp-schema-registry` service:

Expand Down Expand Up @@ -106,7 +109,7 @@ automatically populate with your new secrets:
You'll need to copy the values of `sasl.jaas.config` and `basic.auth.user.info`
for the next step.

The next step is to create K8s secrets containing the config values you've just generated. Specifically, you'll run the following commands:
The next step is to create K8s secrets containing the config values you've just generated. Specifically, you can run the following commands:

```shell
kubectl create secret generic confluent-secrets --from-literal=sasl_jaas_config="<your-sasl.jaas.config>"
@@ -120,7 +123,7 @@ kubectl create secret generic confluent-secrets --from-literal=sasl_jaas_config=
kubectl create secret generic confluent-secrets --from-literal=basic_auth_user_info="P2ETAN5QR2LCWL14:RTjqw7AfETDl0RZo/7R0123LhPYs2TGjFKmvMWUFnlJ3uKubFbB1Sfs7aOjjNi1m23"
```

Finally, we'll configure our containers to pick up the Confluent Kafka Configs by changing two config blocks in our values.yaml file. You
Finally, we'll configure our containers to pick up the Confluent Kafka Configs by changing two config blocks in our `values.yaml` file. You
should see these blocks commented at the bottom of the template. You'll want to uncomment them and set them to the following values:

```
Expand All @@ -143,6 +146,6 @@ Then simply apply the updated `values.yaml` to your K8s cluster via `kubectl app
## Contribution
Accepting contributions for a setup script compatible with Confluent Cloud!

Currently the kafka-setup-job container we ship with is only compatible with a distribution of Kafka wherein ZooKeeper
The kafka-setup-job container we ship with is only compatible with a distribution of Kafka wherein ZooKeeper
is exposed and available. A version of the job using the [Confluent CLI](https://docs.confluent.io/confluent-cli/current/command-reference/kafka/topic/confluent_kafka_topic_create.html)
would be very useful for the broader community.