feat(ingestion): improve logging, docs for bigquery, snowflake, redsh…
anshbansal authored and aditya-radhakrishnan committed Mar 14, 2022
1 parent 0f73518 commit 21d3349
Showing 40 changed files with 953 additions and 457 deletions.
4 changes: 3 additions & 1 deletion metadata-ingestion/setup.py
@@ -211,6 +211,8 @@ def get_long_description():
"types-click==0.1.12",
"boto3-stubs[s3,glue,sagemaker]",
"types-tabulate",
# avrogen package requires this
"types-pytz",
}

base_dev_requirements = {
@@ -223,7 +225,7 @@ def get_long_description():
"flake8>=3.8.3",
"flake8-tidy-imports>=4.3.0",
"isort>=5.7.0",
"mypy>=0.920",
"mypy>=0.920,<0.940",
# pydantic 1.8.2 is incompatible with mypy 0.910.
# See https://github.com/samuelcolvin/pydantic/pull/3175#issuecomment-995382910.
"pydantic>=1.9.0",
35 changes: 19 additions & 16 deletions metadata-ingestion/source_docs/bigquery.md
@@ -1,14 +1,17 @@
# BigQuery

To get all metadata from BigQuery you need to use two plugins, `bigquery` and `bigquery-usage`. Both are described on this page. They require two separate recipes. We understand this is not ideal and we plan to make this easier in the future.

For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).

## Setup
## `bigquery`
### Setup

To install this plugin, run `pip install 'acryl-datahub[bigquery]'`.

## Prerequisites
### Create a datahub profile in GCP:
1. Create a custom role for datahub (https://cloud.google.com/iam/docs/creating-custom-roles#creating_a_custom_role)
### Prerequisites
#### Create a datahub profile in GCP
1. Create a custom role for datahub as per the [GCP docs](https://cloud.google.com/iam/docs/creating-custom-roles#creating_a_custom_role)
2. Grant the following permissions to this role:
```
bigquery.datasets.get
@@ -27,9 +30,9 @@ To install this plugin, run `pip install 'acryl-datahub[bigquery]'`.
logging.logEntries.list # Needed for lineage generation
resourcemanager.projects.get
```
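
For illustration, one way to create such a role is from a YAML role definition passed to `gcloud iam roles create` (a minimal sketch; the file name is hypothetical, and the permission list below is abbreviated to the entries shown above, so use the full list when creating the role):

```yaml
# datahub-role.yaml (hypothetical file name)
# Create with: gcloud iam roles create datahub --project=<project-id> --file=datahub-role.yaml
title: "DataHub Metadata Reader"
description: "Custom role for DataHub metadata ingestion"
stage: "GA"
includedPermissions:
  - bigquery.datasets.get
  - logging.logEntries.list      # needed for lineage generation
  - resourcemanager.projects.get
  # ...plus the remaining permissions from the list above
```
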
### Create a service account:
#### Create a service account

1. Setup a ServiceAccount (https://cloud.google.com/iam/docs/creating-managing-service-accounts#iam-service-accounts-create-console)
1. Set up a service account as per the [GCP docs](https://cloud.google.com/iam/docs/creating-managing-service-accounts#iam-service-accounts-create-console)
and assign the previously created role to this service account.
2. Download a service account JSON keyfile.
Example credential file:
@@ -64,7 +67,7 @@ and assign the previously created role to this service account.
client_id: "123456678890"
```
## Capabilities
### Capabilities
This plugin extracts the following:
@@ -81,11 +84,11 @@ This plugin extracts the following:

:::tip

You can also get fine-grained usage statistics for BigQuery using the `bigquery-usage` source described below.
You can also get fine-grained usage statistics for BigQuery using the `bigquery-usage` source described [below](#bigquery-usage-plugin).

:::

## Quickstart recipe
### Quickstart recipe

Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.

@@ -102,7 +105,7 @@ sink:
# sink configs
```
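
The diff elides the source half of that recipe; as a minimal sketch, a full `bigquery` recipe commonly looks like this (project name and sink address are placeholders):

```yaml
source:
  type: bigquery
  config:
    project_id: "my-gcp-project"   # placeholder

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
```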

## Config details
### Config details

Note that a `.` is used to denote nested fields in the YAML recipe.
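
For example, a hypothetical dotted field such as `credential.project_id` would be written in the recipe as nested YAML:

```yaml
source:
  type: bigquery
  config:
    credential:
      project_id: "my-gcp-project"   # i.e. credential.project_id
```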

@@ -155,7 +158,7 @@ Note: the bigquery_audit_metadata_datasets parameter receives a list of datasets

Note: Since the bigquery source also supports dataset-level lineage, the auth client requires additional permissions to access the Google audit logs. Refer to the permissions section of the bigquery-usage section below, which also accesses the audit logs.

## Profiling
### Profiling
Profiling works on normal, partitioned, and sharded tables, but for performance reasons we only profile the latest partition of partitioned tables and the latest shard of sharded tables.

If a limit/offset parameter is set, or when profiling a partitioned or sharded table, Great Expectations (the profiling framework we use) needs to create temporary
@@ -175,11 +178,11 @@ Due to performance reasons, we only profile the latest partition for Partitioned
You can set the partition explicitly with the `partition.partition_datetime` property if you want (the partition will be applied to all partitioned tables).
:::
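
Applying the dot convention for nested fields, setting the partition explicitly might look like the following in a recipe (the exact nesting and timestamp format are assumptions based on the `partition.partition_datetime` path above):

```yaml
source:
  type: bigquery
  config:
    profiling:
      enabled: true
    partition:
      partition_datetime: "2022-03-01 00:00:00"   # assumed timestamp format
```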

# BigQuery Usage Stats
## `bigquery-usage`

For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).

## Setup
### Setup

To install this plugin, run `pip install 'acryl-datahub[bigquery-usage]'`.

@@ -194,7 +197,7 @@ The Google Identity must have one of the following OAuth scopes granted to it:

And should be authorized on all projects you'd like to ingest usage stats from.

## Capabilities
### Capabilities

This plugin extracts the following:

@@ -208,7 +211,7 @@ This plugin extracts the following:

:::

## Quickstart recipe
### Quickstart recipe

Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.

@@ -230,7 +233,7 @@ sink:
# sink configs
```
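
The source half of this recipe is also elided above; a minimal sketch (the project name is a placeholder):

```yaml
source:
  type: bigquery-usage
  config:
    projects:                      # projects to ingest usage stats from
      - "my-gcp-project"           # placeholder

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
```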

## Config details
### Config details

Note that a `.` is used to denote nested fields in the YAML recipe.

37 changes: 22 additions & 15 deletions metadata-ingestion/source_docs/redshift.md
@@ -1,8 +1,12 @@
# Redshift

To get all metadata from Redshift you need to use two plugins, `redshift` and `redshift-usage`. Both are described on this page. They require two separate recipes. We understand this is not ideal and we plan to make this easier in the future.

For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).

## Setup
## `redshift`

### Setup

To install this plugin, run `pip install 'acryl-datahub[redshift]'`.

@@ -19,7 +23,7 @@ Giving a user unrestricted access to system tables gives the user visibility to

:::

## Capabilities
### Capabilities

This plugin extracts the following:

@@ -41,7 +45,7 @@ You can also get fine-grained usage statistics for Redshift using the `redshift-
| Data Containers | ✔️ | |
| Data Domains | ✔️ | [link](../../docs/domains.md) |

## Quickstart recipe
### Quickstart recipe

Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.

@@ -93,7 +97,7 @@ sink:

</details>
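
For reference, a minimal `redshift` recipe might look like this (host, database, and credentials are placeholders; `${REDSHIFT_PASSWORD}` is read from an environment variable):

```yaml
source:
  type: redshift
  config:
    host_port: "example.redshift.amazonaws.com:5439"   # placeholder
    database: "dev"
    username: "datahub_user"
    password: "${REDSHIFT_PASSWORD}"

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
```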

## Config details
### Config details

Like all SQL-based sources, the Redshift integration supports:
- Stale Metadata Deletion: See [here](./stateful_ingestion.md) for more details on configuration.
@@ -130,11 +134,11 @@ Note that a `.` is used to denote nested fields in the YAML recipe.
| `domain.domain_key.deny` | | | List of regex patterns for tables/schemas that should not be assigned domain_key. Multiple domain keys can be specified. |
| `domain.domain_key.ignoreCase` | | `True` | Whether to ignore case during pattern matching. Multiple domain keys can be specified. |

## Lineage
### Lineage

There are multiple lineage collector implementations as Redshift does not support table lineage out of the box.

### stl_scan_based
#### stl_scan_based
The stl_scan-based collector uses Redshift's [stl_insert](https://docs.aws.amazon.com/redshift/latest/dg/r_STL_INSERT.html) and [stl_scan](https://docs.aws.amazon.com/redshift/latest/dg/r_STL_SCAN.html) system tables to
discover lineage between tables.
Pros:
Expand All @@ -145,7 +149,7 @@ Cons:
- Does not work with Spectrum/external tables because those scans do not show up in the stl_scan table.
- If a table depends on a view, the view won't be listed as a dependency; instead, the table will be connected to the view's dependencies.

### sql_based
#### sql_based
The sql_based collector uses Redshift's [stl_insert](https://docs.aws.amazon.com/redshift/latest/dg/r_STL_INSERT.html) to discover all the insert queries
and uses SQL parsing to discover the dependencies.

Expand All @@ -157,7 +161,7 @@ Cons:
- Slow.
- Less reliable as the query parser can fail on certain queries

### mixed
#### mixed
Uses both collectors above, first applying the sql_based collector and then the stl_scan_based one.

Pros:
@@ -169,10 +173,13 @@ Cons:
- Slow
- May be incorrect at times as the query parser can fail on certain queries
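
As a sketch, the collector can be selected in the recipe (assuming a `table_lineage_mode` option whose values match the headings above):

```yaml
source:
  type: redshift
  config:
    host_port: "example.redshift.amazonaws.com:5439"   # placeholder
    database: "dev"
    include_table_lineage: true
    table_lineage_mode: "mixed"    # assumed option name; or "stl_scan_based" / "sql_based"
```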

# Note
- The redshift stl redshift tables which are used for getting data lineage only retain approximately two to five days of log history. This means you cannot extract lineage from queries issued outside that window.
:::note

The Redshift stl tables used for getting data lineage only retain approximately two to five days of log history. This means you cannot extract lineage for queries issued outside that window.

:::

# Redshift Usage Stats
## `redshift-usage`

This plugin extracts usage statistics for datasets in Amazon Redshift. For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).

@@ -187,10 +194,10 @@ To grant access this plugin for all system tables, please alter your datahub Red
ALTER USER datahub_user WITH SYSLOG ACCESS UNRESTRICTED;
```

## Setup
### Setup
To install this plugin, run `pip install 'acryl-datahub[redshift-usage]'`.

## Capabilities
### Capabilities

| Capability | Status | Details |
| -----------| ------ | ---- |
@@ -210,7 +217,7 @@ This source only does usage statistics. To get the tables, views, and schemas in

:::

## Quickstart recipe
### Quickstart recipe

Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.

@@ -233,7 +240,7 @@ sink:
# sink configs
```
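
The source half is elided above; a minimal sketch (values are placeholders; `email_domain` is appended to Redshift usernames to build user emails):

```yaml
source:
  type: redshift-usage
  config:
    host_port: "example.redshift.amazonaws.com:5439"
    database: "dev"
    username: "datahub_user"
    password: "${REDSHIFT_PASSWORD}"
    email_domain: "example.com"

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
```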

## Config details
### Config details
Note that a `.` is used to denote nested fields in the YAML recipe.

By default, we extract usage stats for the last day, with the recommendation that this source is executed every day.
