feat(ingest): make default rest sink mode env-configurable (#11335)
hsheth2 authored Sep 12, 2024
1 parent 39aa086 commit 2a0d14e
Showing 3 changed files with 35 additions and 18 deletions.
36 changes: 23 additions & 13 deletions docs/how/updating-datahub.md
@@ -19,17 +19,22 @@ This file documents any backwards-incompatible changes in DataHub and assists people when migrating to a new version.
## Next

### Breaking Changes

- #9857 (#10773) The `lower` method was removed from `get_db_name` of the `SQLAlchemySource` class. This change will affect the URNs of all entities related to `SQLAlchemySource` (a sketch of the casing change follows below).

Old `urn`, where the database name (`DemoData`) was lowercased to `demodata`:

```
- urn:li:dataJob:(urn:li:dataFlow:(mssql,demodata.Foo.stored_procedures,PROD),Proc.With.SpecialChar)
```

New `urn`, where the database name `DemoData` keeps its original casing:

```
- urn:li:dataJob:(urn:li:dataFlow:(mssql,DemoData.Foo.stored_procedures,PROD),Proc.With.SpecialChar)
```
Re-running with stateful ingestion should automatically clear up the entities with old URNs and add entities with new URNs, therefore not duplicating the containers or jobs.
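A minimal sketch of the casing change, assuming a simplified `get_db_name`; the inspector plumbing below is hypothetical, and only the removal of the lowercasing step is taken from this changelog entry:

```python
# Hedged sketch: illustrates the effect of dropping `.lower()` in get_db_name.
# FakeInspector stands in for the SQLAlchemy inspector and is not the actual source.
from dataclasses import dataclass


@dataclass
class FakeInspector:
    database: str


def get_db_name_old(inspector: FakeInspector) -> str:
    # Pre-#9857 behavior: database names were lowercased before URN generation.
    return inspector.database.lower()


def get_db_name_new(inspector: FakeInspector) -> str:
    # Post-#9857 behavior: original casing is preserved, which changes existing URNs.
    return inspector.database


inspector = FakeInspector(database="DemoData")
assert get_db_name_old(inspector) == "demodata"  # old URNs contained "demodata"
assert get_db_name_new(inspector) == "DemoData"  # new URNs contain "DemoData"
```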

- #11313 - `datahub get` will no longer return a key aspect for entities that don't exist.

@@ -43,14 +48,15 @@ This file documents any backwards-incompatible changes in DataHub and assists people when migrating to a new version.

### Breaking Changes

- Protobuf CLI will no longer create binary encoded protoc custom properties. A `-protocProp` flag was added in case this behavior is required.
- #10814 The data flow info and data job info aspects will produce an additional field that requires a corresponding server upgrade. Otherwise, the server can reject the aspects.
- #10868 - OpenAPI V3 - Creation of aspects will need to be wrapped within a `value` key, and the API is now symmetric with respect to inputs and outputs (a request sketch follows the example below).

Example Global Tags Aspect:

Previous:

```json
{
"tags": [
@@ -78,34 +84,38 @@ New (optional fields `systemMetadata` and `headers`):
"headers": {}
}
```
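A hedged sketch of the new request shape: the host, endpoint path, URN, and tag below are assumptions, while the payload shape (aspect body under `value`, with optional `systemMetadata` and `headers`) is the documented OpenAPI v3 change.

```python
# Hypothetical sketch: host, endpoint path, URN, and tag are assumptions;
# only the "value"/"systemMetadata"/"headers" wrapping reflects the changelog.
import requests

GMS_HOST = "http://localhost:8080"  # assumed
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,SampleTable,PROD)"  # assumed

payload = {
    "value": {
        "tags": [{"tag": "urn:li:tag:Legacy"}],  # aspect body now nested under "value"
    },
    "systemMetadata": {},  # optional in the new symmetric format
    "headers": {},         # optional in the new symmetric format
}

response = requests.post(
    f"{GMS_HOST}/openapi/v3/entity/dataset/{dataset_urn}/globalTags",  # assumed path
    json=payload,
    timeout=30,
)
response.raise_for_status()
print(response.json())
```

Because the API is now symmetric, reading the aspect back should return the same `value`-wrapped shape that was written.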

- #10858 Profiling configuration for Glue source has been updated (a recipe sketch follows the examples below).

Previously, the configuration was:

```yaml
profiling: {}
```

Now, it needs to be:

```yaml
profiling:
  enabled: true
```
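For context, a hedged sketch of a programmatic recipe with the new explicit opt-in; only `profiling.enabled: true` reflects this change, while the remaining fields (`aws_region`, the sink address) are illustrative assumptions about a typical recipe.

```python
# Hedged sketch: only profiling.enabled=True reflects the documented change;
# the rest of the recipe is illustrative.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "glue",
            "config": {
                "aws_region": "us-east-1",       # illustrative
                "profiling": {"enabled": True},  # new explicit opt-in
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},  # illustrative
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```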

### Potential Downtime

### Deprecations

- OpenAPI v1: OpenAPI v1 is collectively defined as all endpoints which are not prefixed with `/v2` or `/v3`. The v1 endpoints
  will be deprecated in no less than 6 months. Endpoints will be replaced with equivalents in the `/v2` or `/v3` APIs.
  No loss of functionality is expected unless explicitly mentioned in Breaking Changes.

### Other Notable Changes

- #10498 - Tableau ingestion can now be configured to ingest multiple sites at once and add the sites as containers. The feature is currently only available for Tableau Server.
- #10466 - Extends configuration in `~/.datahubenv` to match the `DatahubClientConfig` object definition. See the full configuration at https://datahubproject.io/docs/python-sdk/clients/. The CLI should now respect the updated configurations specified in `~/.datahubenv` across its functions and utilities. This means that for systems where SSL verification is disabled, setting `disable_ssl_verification: true` in `~/.datahubenv` will apply to all CLI calls (a programmatic sketch follows this list).
- #11002 - We will not auto-generate a `~/.datahubenv` file. You must either run `datahub init` to create that file, or set environment variables so that the config is loaded.
- #11023 - Added a new parameter to datahub's `put` cli command: `--run-id`. This parameter is useful to associate a given write with an ingestion run. One use case is to mimic transformers when a transformer for the aspect being written does not exist.
- #11051 - Ingestion reports will now trim the summary text to a maximum of 800k characters to avoid generating `dataHubExecutionRequestResult` aspects that are too large for GMS to handle.
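As a hedged illustration of the `DatahubClientConfig` alignment, here is a programmatic equivalent of a `~/.datahubenv` that disables SSL verification; the import path is an assumption about the SDK layout, and only `server`, `token`, and `disable_ssl_verification` are named in the changelog entry above.

```python
# Hedged sketch: mirrors what `disable_ssl_verification: true` in ~/.datahubenv
# would configure; import path and exact field set are assumptions.
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph

config = DatahubClientConfig(
    server="http://localhost:8080",  # GMS endpoint, illustrative
    token=None,                      # set when the server requires auth
    disable_ssl_verification=True,   # named in the changelog entry above
)

graph = DataHubGraph(config)  # CLI utilities would pick up the same settings from ~/.datahubenv
```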

## 0.13.3

### Breaking Changes
15 changes: 11 additions & 4 deletions metadata-ingestion/src/datahub/ingestion/sink/datahub_rest.py
@@ -9,6 +9,8 @@
from enum import auto
from typing import List, Optional, Tuple, Union

import pydantic

from datahub.configuration.common import (
    ConfigEnum,
    ConfigurationError,
@@ -39,7 +41,7 @@

logger = logging.getLogger(__name__)

_DEFAULT_REST_SINK_MAX_THREADS = int(
    os.getenv("DATAHUB_REST_SINK_DEFAULT_MAX_THREADS", 15)
)

@@ -49,16 +51,21 @@ class RestSinkMode(ConfigEnum):
    ASYNC = auto()

    # Uses the new ingestProposalBatch endpoint. Significantly more efficient than the other modes,
    # but requires a server version that supports it. Added in
    # https://github.com/datahub-project/datahub/pull/10706
    ASYNC_BATCH = auto()


_DEFAULT_REST_SINK_MODE = pydantic.parse_obj_as(
    RestSinkMode, os.getenv("DATAHUB_REST_SINK_DEFAULT_MODE", RestSinkMode.ASYNC)
)


class DatahubRestSinkConfig(DatahubClientConfig):
    mode: RestSinkMode = _DEFAULT_REST_SINK_MODE

    # These only apply in async modes.
    max_threads: int = _DEFAULT_REST_SINK_MAX_THREADS
    max_pending_requests: int = 2000

    # Only applies in async batch mode.
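A minimal, self-contained sketch of the pattern introduced above, using a plain `str` enum in place of DataHub's `ConfigEnum` (an assumption: the real `ConfigEnum` may also normalize case, which this sketch does not):

```python
# Hedged sketch of an env-configurable pydantic default; stand-in types only.
import os
from enum import Enum

import pydantic


class RestSinkMode(str, Enum):
    SYNC = "SYNC"
    ASYNC = "ASYNC"
    ASYNC_BATCH = "ASYNC_BATCH"


# Resolved once at import time: the env var (e.g. DATAHUB_REST_SINK_DEFAULT_MODE=ASYNC_BATCH)
# wins; otherwise the default stays ASYNC. parse_obj_as coerces the string to the enum.
_DEFAULT_REST_SINK_MODE = pydantic.parse_obj_as(
    RestSinkMode, os.getenv("DATAHUB_REST_SINK_DEFAULT_MODE", RestSinkMode.ASYNC)
)


class DatahubRestSinkConfig(pydantic.BaseModel):
    # A recipe-level `mode` still overrides the env-derived default.
    mode: RestSinkMode = _DEFAULT_REST_SINK_MODE


if __name__ == "__main__":
    print(DatahubRestSinkConfig().mode)             # ASYNC unless the env var is set
    print(DatahubRestSinkConfig(mode="SYNC").mode)  # explicit config wins
```

With this default in place, setting `DATAHUB_REST_SINK_DEFAULT_MODE=ASYNC_BATCH` in the executor's environment changes the sink mode for every recipe that does not pin `mode` explicitly, which is the point of this commit.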
@@ -1323,7 +1323,7 @@ def process_dashboard(
dashboard_object.folder.is_personal
or dashboard_object.folder.is_personal_descendant
):
self.reporter.info(
title="Dropped Dashboard",
message="Dropped due to being a personal folder",
context=f"Dashboard ID: {dashboard_id}",
