feat(ingestion): improve logging, docs for bigquery, snowflake, redsh…
anshbansal authored and aditya-radhakrishnan committed Mar 14, 2022
1 parent 0f73518 commit 21d3349
Showing 40 changed files with 953 additions and 457 deletions.
4 changes: 3 additions & 1 deletion metadata-ingestion/setup.py
@@ -211,6 +211,8 @@ def get_long_description():
"types-click==0.1.12",
"boto3-stubs[s3,glue,sagemaker]",
"types-tabulate",
# avrogen package requires this
"types-pytz",
}

base_dev_requirements = {
@@ -223,7 +225,7 @@ def get_long_description():
"flake8>=3.8.3",
"flake8-tidy-imports>=4.3.0",
"isort>=5.7.0",
"mypy>=0.920",
"mypy>=0.920,<0.940",
# pydantic 1.8.2 is incompatible with mypy 0.910.
# See https://github.com/samuelcolvin/pydantic/pull/3175#issuecomment-995382910.
"pydantic>=1.9.0",
35 changes: 19 additions & 16 deletions metadata-ingestion/source_docs/bigquery.md
@@ -1,14 +1,17 @@
# BigQuery

To get all metadata from BigQuery you need to use two plugins, `bigquery` and `bigquery-usage`. Both are described on this page. They require two separate recipes. We understand this is not ideal and we plan to make this easier in the future.

For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).

## Setup
## `bigquery`
### Setup

To install this plugin, run `pip install 'acryl-datahub[bigquery]'`.

## Prerequisites
### Create a datahub profile in GCP:
1. Create a custom role for datahub (https://cloud.google.com/iam/docs/creating-custom-roles#creating_a_custom_role)
### Prerequisites
#### Create a datahub profile in GCP
1. Create a custom role for datahub as per the [GCP docs](https://cloud.google.com/iam/docs/creating-custom-roles#creating_a_custom_role)
2. Grant the following permissions to this role:
```
bigquery.datasets.get
@@ -27,9 +30,9 @@ To install this plugin, run `pip install 'acryl-datahub[bigquery]'`.
logging.logEntries.list # Needed for lineage generation
resourcemanager.projects.get
```
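
For illustration, one way to create such a role is from a YAML role definition passed to `gcloud iam roles create` (a minimal sketch; the file name is hypothetical, and the permission list below is abbreviated to the entries shown above, so use the full list when creating the role):

```yaml
# datahub-role.yaml (hypothetical file name)
# Create with: gcloud iam roles create datahub --project=<project-id> --file=datahub-role.yaml
title: "DataHub Metadata Reader"
description: "Custom role for DataHub metadata ingestion"
stage: "GA"
includedPermissions:
  - bigquery.datasets.get
  - logging.logEntries.list      # needed for lineage generation
  - resourcemanager.projects.get
  # ...plus the remaining permissions from the list above
```
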
### Create a service account:
#### Create a service account

1. Setup a ServiceAccount (https://cloud.google.com/iam/docs/creating-managing-service-accounts#iam-service-accounts-create-console)
1. Set up a service account as per the [GCP docs](https://cloud.google.com/iam/docs/creating-managing-service-accounts#iam-service-accounts-create-console)
and assign the previously created role to this service account.
2. Download a service account JSON keyfile.
Example credential file:
@@ -64,7 +67,7 @@ and assign the previously created role to this service account.
client_id: "123456678890"
```
## Capabilities
### Capabilities
This plugin extracts the following:
@@ -81,11 +84,11 @@ This plugin extracts the following:

:::tip

You can also get fine-grained usage statistics for BigQuery using the `bigquery-usage` source described below.
You can also get fine-grained usage statistics for BigQuery using the `bigquery-usage` source described [below](#bigquery-usage-plugin).

:::

## Quickstart recipe
### Quickstart recipe

Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.

@@ -102,7 +105,7 @@ sink:
# sink configs
```
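
The diff elides the source half of that recipe; as a minimal sketch, a full `bigquery` recipe commonly looks like this (project name and sink address are placeholders):

```yaml
source:
  type: bigquery
  config:
    project_id: "my-gcp-project"   # placeholder

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
```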

## Config details
### Config details

Note that a `.` is used to denote nested fields in the YAML recipe.
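
For example, a hypothetical dotted field such as `credential.project_id` would be written in the recipe as nested YAML:

```yaml
source:
  type: bigquery
  config:
    credential:
      project_id: "my-gcp-project"   # i.e. credential.project_id
```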

@@ -155,7 +158,7 @@ Note: the bigquery_audit_metadata_datasets parameter receives a list of datasets

Note: Since the bigquery source also supports dataset-level lineage, the auth client requires additional permissions to access the Google audit logs. Refer to the permissions section of the bigquery-usage section below, which also accesses the audit logs.

## Profiling
### Profiling
Profiling works on normal, partitioned, and sharded tables, but for performance reasons we only profile the latest partition of partitioned tables and the latest shard of sharded tables.

If a limit/offset parameter is set, or when profiling a partitioned or sharded table, Great Expectations (the profiling framework we use) needs to create temporary
@@ -175,11 +178,11 @@ Due to performance reasons, we only profile the latest partition for Partitioned
You can set the partition explicitly with the `partition.partition_datetime` property if you want (the partition will be applied to all partitioned tables).
:::
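
Applying the dot convention for nested fields, setting the partition explicitly might look like the following in a recipe (the exact nesting and timestamp format are assumptions based on the `partition.partition_datetime` path above):

```yaml
source:
  type: bigquery
  config:
    profiling:
      enabled: true
    partition:
      partition_datetime: "2022-03-01 00:00:00"   # assumed timestamp format
```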

# BigQuery Usage Stats
## `bigquery-usage`

For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).

## Setup
### Setup

To install this plugin, run `pip install 'acryl-datahub[bigquery-usage]'`.

@@ -194,7 +197,7 @@ The Google Identity must have one of the following OAuth scopes granted to it:

And should be authorized on all projects you'd like to ingest usage stats from.

## Capabilities
### Capabilities

This plugin extracts the following:

@@ -208,7 +211,7 @@ This plugin extracts the following:

:::

## Quickstart recipe
### Quickstart recipe

Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.

@@ -230,7 +233,7 @@ sink:
# sink configs
```
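
The source half of this recipe is also elided above; a minimal sketch (the project name is a placeholder):

```yaml
source:
  type: bigquery-usage
  config:
    projects:                      # projects to ingest usage stats from
      - "my-gcp-project"           # placeholder

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
```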

## Config details
### Config details

Note that a `.` is used to denote nested fields in the YAML recipe.

37 changes: 22 additions & 15 deletions metadata-ingestion/source_docs/redshift.md
@@ -1,8 +1,12 @@
# Redshift

To get all metadata from Redshift you need to use two plugins, `redshift` and `redshift-usage`. Both are described on this page. They require two separate recipes. We understand this is not ideal and we plan to make this easier in the future.

For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).

## Setup
## `redshift`

### Setup

To install this plugin, run `pip install 'acryl-datahub[redshift]'`.

@@ -19,7 +23,7 @@ Giving a user unrestricted access to system tables gives the user visibility to

:::

## Capabilities
### Capabilities

This plugin extracts the following:

@@ -41,7 +45,7 @@ You can also get fine-grained usage statistics for Redshift using the `redshift-
| Data Containers | ✔️ | |
| Data Domains | ✔️ | [link](../../docs/domains.md) |

## Quickstart recipe
### Quickstart recipe

Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.

@@ -93,7 +97,7 @@ sink:

</details>
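
For reference, a minimal `redshift` recipe might look like this (host, database, and credentials are placeholders; `${REDSHIFT_PASSWORD}` is read from an environment variable):

```yaml
source:
  type: redshift
  config:
    host_port: "example.redshift.amazonaws.com:5439"   # placeholder
    database: "dev"
    username: "datahub_user"
    password: "${REDSHIFT_PASSWORD}"

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
```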

## Config details
### Config details

Like all SQL-based sources, the Redshift integration supports:
- Stale Metadata Deletion: See [here](./stateful_ingestion.md) for more details on configuration.
@@ -130,11 +134,11 @@ Note that a `.` is used to denote nested fields in the YAML recipe.
| `domain.domain_key.deny` | | | List of regex patterns for tables/schemas that should not be assigned domain_key. Multiple domain keys can be specified. |
| `domain.domain_key.ignoreCase` | | `True` | Whether to ignore case during pattern matching. Multiple domain keys can be specified. |

## Lineage
### Lineage

There are multiple lineage collector implementations as Redshift does not support table lineage out of the box.

### stl_scan_based
#### stl_scan_based
The stl_scan-based collector uses Redshift's [stl_insert](https://docs.aws.amazon.com/redshift/latest/dg/r_STL_INSERT.html) and [stl_scan](https://docs.aws.amazon.com/redshift/latest/dg/r_STL_SCAN.html) system tables to
discover lineage between tables.
Pros:
Expand All @@ -145,7 +149,7 @@ Cons:
- Does not work with Spectrum/external tables because those scans do not show up in the stl_scan table.
- If a table depends on a view, the view won't be listed as a dependency; instead, the table will be connected to the view's dependencies.

### sql_based
#### sql_based
The sql_based collector uses Redshift's [stl_insert](https://docs.aws.amazon.com/redshift/latest/dg/r_STL_INSERT.html) to discover all the insert queries
and uses SQL parsing to discover the dependencies.

Expand All @@ -157,7 +161,7 @@ Cons:
- Slow.
- Less reliable as the query parser can fail on certain queries

### mixed
#### mixed
Uses both collectors above, first applying the sql_based collector and then the stl_scan_based one.

Pros:
@@ -169,10 +173,13 @@ Cons:
- Slow
- May be incorrect at times as the query parser can fail on certain queries
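
As a sketch, the collector can be selected in the recipe (assuming a `table_lineage_mode` option whose values match the headings above):

```yaml
source:
  type: redshift
  config:
    host_port: "example.redshift.amazonaws.com:5439"   # placeholder
    database: "dev"
    include_table_lineage: true
    table_lineage_mode: "mixed"    # assumed option name; or "stl_scan_based" / "sql_based"
```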

# Note
- The redshift stl redshift tables which are used for getting data lineage only retain approximately two to five days of log history. This means you cannot extract lineage from queries issued outside that window.
:::note

The Redshift stl tables used for getting data lineage only retain approximately two to five days of log history. This means you cannot extract lineage for queries issued outside that window.

:::

# Redshift Usage Stats
## `redshift-usage`

This plugin extracts usage statistics for datasets in Amazon Redshift. For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).

@@ -187,10 +194,10 @@ To grant access this plugin for all system tables, please alter your datahub Red
ALTER USER datahub_user WITH SYSLOG ACCESS UNRESTRICTED;
```

## Setup
### Setup
To install this plugin, run `pip install 'acryl-datahub[redshift-usage]'`.

## Capabilities
### Capabilities

| Capability | Status | Details |
| -----------| ------ | ---- |
@@ -210,7 +217,7 @@ This source only does usage statistics. To get the tables, views, and schemas in

:::

## Quickstart recipe
### Quickstart recipe

Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.

@@ -233,7 +240,7 @@ sink:
# sink configs
```
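
The source half is elided above; a minimal sketch (values are placeholders; `email_domain` is appended to Redshift usernames to build user emails):

```yaml
source:
  type: redshift-usage
  config:
    host_port: "example.redshift.amazonaws.com:5439"
    database: "dev"
    username: "datahub_user"
    password: "${REDSHIFT_PASSWORD}"
    email_domain: "example.com"

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
```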

## Config details
### Config details
Note that a `.` is used to denote nested fields in the YAML recipe.

By default, we extract usage stats for the last day, with the recommendation that this source is executed every day.
