fix(docs): make intro to metadata ingestion easier for beginners (datahub-project#4039)

* fix(docs): fix sidebar titles for clarity

* re-arrange docs to make Intro to Metadata ingestion easier for beginners

* minor changes for readability

* add heading

* docs: add note for common question
anshbansal authored and ne1r0n committed Feb 13, 2022
1 parent 7764bfc commit e4cdfa5
Showing 5 changed files with 137 additions and 147 deletions.
13 changes: 13 additions & 0 deletions docs-website/build.gradle
@@ -81,6 +81,19 @@ task yarnLint(type: YarnTask, dependsOn: [yarnInstall]) {
outputs.cacheIf { true }
}

task yarnLintFix(type: YarnTask, dependsOn: [yarnInstall]) {
inputs.files(projectMdFiles)
args = ['run', 'lint-fix']
outputs.dir("dist")
// tell gradle to apply the build cache
outputs.cacheIf { true }
}

task serve(type: YarnTask, dependsOn: [yarnInstall] ) {
args = ['run', 'serve']
}


task yarnBuild(type: YarnTask, dependsOn: [yarnLint, yarnGenerate]) {
inputs.files(projectMdFiles)
inputs.file("package.json").withPathSensitivity(PathSensitivity.RELATIVE)
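For reference, a rough sketch of invoking the new tasks from the repository root (the `:docs-website:` project path is an assumption based on the file's location):

```shell
# Auto-fix lint issues in the docs sources via prettier (wraps `yarn run lint-fix`)
./gradlew :docs-website:yarnLintFix

# Serve the docs site locally for development
./gradlew :docs-website:serve
```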
5 changes: 3 additions & 2 deletions docs-website/package.json
@@ -12,7 +12,8 @@
"clear": "docusaurus clear && rm -rf genDocs/*",
"generate": "rm -rf genDocs/* && ts-node -O '{ \"lib\": [\"es2020\"], \"target\": \"es6\" }' generateDocsDir.ts && mv -v docs/* genDocs/",
"lint": "prettier -w generateDocsDir.ts sidebars.js src/pages/index.js",
"lint-check": "prettier -l generateDocsDir.ts sidebars.js src/pages/index.js"
"lint-check": "prettier -l generateDocsDir.ts sidebars.js src/pages/index.js",
"lint-fix": "prettier --write generateDocsDir.ts sidebars.js src/pages/index.js"
},
"dependencies": {
"@docusaurus/core": "^2.0.0-beta.7",
@@ -45,4 +46,4 @@
"ts-node": "^9.1.1",
"typescript": "^4.1.5"
}
}
}
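A minimal sketch of running the new script directly, assuming `yarn` and the docs-website dependencies are already installed:

```shell
# From the docs-website directory: format the listed files in place with prettier
yarn run lint-fix
```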
9 changes: 2 additions & 7 deletions docs-website/sidebars.js
@@ -62,18 +62,13 @@ module.exports = {
"docs/saas",
"releases",
],
"Getting Started": [
"docs/quickstart",
"docs/cli",
"metadata-ingestion/README",
"docs/debugging",
],
"Getting Started": ["docs/quickstart", "docs/cli", "docs/debugging"],
"Metadata Ingestion": [
// add a custom label since the default is 'Metadata Ingestion'
// note that we also have to add the path to this file in sidebarsjs_hardcoded_titles in generateDocsDir.ts
{
type: "doc",
label: "Quickstart",
label: "Introduction",
id: "metadata-ingestion/README",
},
{
89 changes: 87 additions & 2 deletions docs/cli.md
@@ -9,7 +9,7 @@ You can find the release notes in [github releases](https://github.com/linkedin/
## Installation
### Using pip

We recommend python virtual environments (venv-s) to namespace pip modules. Here's an example setup:
We recommend python virtual environments (venv-s) to namespace pip modules. The folks over at [Acryl Data](https://www.acryl.io/) maintain a PyPI package for DataHub metadata ingestion. Here's an example setup:

```shell
python3 -m venv datahub-env # create the environment
@@ -20,7 +20,7 @@ source datahub-env/bin/activate # activate the environment

Once inside the virtual environment, install `datahub` using the following commands

```console
```shell
# Requires Python 3.6+
python3 -m pip install --upgrade pip wheel setuptools
python3 -m pip install --upgrade acryl-datahub
@@ -32,8 +32,93 @@ If you run into an error, try checking the [_common setup issues_](../metadata-i

### Using docker

[![Docker Hub](https://img.shields.io/docker/pulls/linkedin/datahub-ingestion?style=plastic)](https://hub.docker.com/r/linkedin/datahub-ingestion)
[![datahub-ingestion docker](https://github.com/linkedin/datahub/actions/workflows/docker-ingestion.yml/badge.svg)](https://github.com/linkedin/datahub/actions/workflows/docker-ingestion.yml)

If you don't want to install locally, you can alternatively run metadata ingestion within a Docker container.
We have prebuilt images available on [Docker hub](https://hub.docker.com/r/linkedin/datahub-ingestion). All plugins will be installed and enabled automatically.

You can use the `datahub-ingestion` docker image as explained in [Docker Images](../docker/README.md). If you are using Kubernetes, you can start a pod with the `datahub-ingestion` docker image and open a shell on it; the datahub CLI will then be available to you inside your Kubernetes cluster.
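As a rough sketch (not a tested manifest), assuming `kubectl run` is available and the image's entrypoint can be overridden:

```shell
# Start a throwaway pod from the ingestion image and open a shell in it
kubectl run datahub-ingestion --rm -it --image=linkedin/datahub-ingestion --command -- /bin/bash

# Inside the pod, the CLI should be on the PATH
datahub version
```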

_Limitation: the datahub_docker.sh convenience script assumes that the recipe and any input/output files are accessible in the current working directory or its subdirectories. Files outside the current working directory will not be found, and you'll need to invoke the Docker image directly._

```shell
# Assumes the DataHub repo is cloned locally.
./metadata-ingestion/scripts/datahub_docker.sh ingest -c ./examples/recipes/example_to_datahub_rest.yml
```
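If the convenience script doesn't fit your setup, you can invoke the image directly. A sketch, assuming the image's entrypoint is the `datahub` command (as the script above relies on); the mount path and recipe name are placeholders:

```shell
# Mount the current directory so the recipe is visible inside the container
docker run --rm -v "$(pwd)":/workspace linkedin/datahub-ingestion ingest -c /workspace/recipe.yml
```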

### Install from source

If you'd like to install from source, see the [developer guide](../metadata-ingestion/developing.md).

## Installing Plugins

We use a plugin architecture so that you can install only the dependencies you actually need. Click the plugin name to learn more about the specific source recipe and any FAQs!

### Sources

| Plugin Name | Install Command | Provides |
|-----------------------------------------------------------------|------------------------------------------------------------| ----------------------------------- |
| [file](../metadata-ingestion/source_docs/file.md) | _included by default_ | File source and sink |
| [athena](../metadata-ingestion/source_docs/athena.md) | `pip install 'acryl-datahub[athena]'` | AWS Athena source |
| [bigquery](../metadata-ingestion/source_docs/bigquery.md) | `pip install 'acryl-datahub[bigquery]'` | BigQuery source |
| [bigquery-usage](../metadata-ingestion/source_docs/bigquery.md) | `pip install 'acryl-datahub[bigquery-usage]'` | BigQuery usage statistics source |
| [datahub-business-glossary](../metadata-ingestion/source_docs/business_glossary.md) | _no additional dependencies_ | Business Glossary File source |
| [dbt](../metadata-ingestion/source_docs/dbt.md) | _no additional dependencies_ | dbt source |
| [druid](../metadata-ingestion/source_docs/druid.md) | `pip install 'acryl-datahub[druid]'` | Druid Source |
| [feast](../metadata-ingestion/source_docs/feast.md) | `pip install 'acryl-datahub[feast]'` | Feast source |
| [glue](../metadata-ingestion/source_docs/glue.md) | `pip install 'acryl-datahub[glue]'` | AWS Glue source |
| [hive](../metadata-ingestion/source_docs/hive.md) | `pip install 'acryl-datahub[hive]'` | Hive source |
| [kafka](../metadata-ingestion/source_docs/kafka.md) | `pip install 'acryl-datahub[kafka]'` | Kafka source |
| [kafka-connect](../metadata-ingestion/source_docs/kafka-connect.md) | `pip install 'acryl-datahub[kafka-connect]'` | Kafka connect source |
| [ldap](../metadata-ingestion/source_docs/ldap.md) | `pip install 'acryl-datahub[ldap]'` ([extra requirements]) | LDAP source |
| [looker](../metadata-ingestion/source_docs/looker.md) | `pip install 'acryl-datahub[looker]'` | Looker source |
| [lookml](../metadata-ingestion/source_docs/lookml.md) | `pip install 'acryl-datahub[lookml]'` | LookML source, requires Python 3.7+ |
| [metabase](../metadata-ingestion/source_docs/metabase.md) | `pip install 'acryl-datahub[metabase]'` | Metabase source |
| [mode](../metadata-ingestion/source_docs/mode.md) | `pip install 'acryl-datahub[mode]'` | Mode Analytics source |
| [mongodb](../metadata-ingestion/source_docs/mongodb.md) | `pip install 'acryl-datahub[mongodb]'` | MongoDB source |
| [mssql](../metadata-ingestion/source_docs/mssql.md) | `pip install 'acryl-datahub[mssql]'` | SQL Server source |
| [mysql](../metadata-ingestion/source_docs/mysql.md) | `pip install 'acryl-datahub[mysql]'` | MySQL source |
| [mariadb](../metadata-ingestion/source_docs/mariadb.md) | `pip install 'acryl-datahub[mariadb]'` | MariaDB source |
| [openapi](../metadata-ingestion/source_docs/openapi.md) | `pip install 'acryl-datahub[openapi]'` | OpenApi Source |
| [oracle](../metadata-ingestion/source_docs/oracle.md) | `pip install 'acryl-datahub[oracle]'` | Oracle source |
| [postgres](../metadata-ingestion/source_docs/postgres.md) | `pip install 'acryl-datahub[postgres]'` | Postgres source |
| [redash](../metadata-ingestion/source_docs/redash.md) | `pip install 'acryl-datahub[redash]'` | Redash source |
| [redshift](../metadata-ingestion/source_docs/redshift.md) | `pip install 'acryl-datahub[redshift]'` | Redshift source |
| [sagemaker](../metadata-ingestion/source_docs/sagemaker.md) | `pip install 'acryl-datahub[sagemaker]'` | AWS SageMaker source |
| [snowflake](../metadata-ingestion/source_docs/snowflake.md) | `pip install 'acryl-datahub[snowflake]'` | Snowflake source |
| [snowflake-usage](../metadata-ingestion/source_docs/snowflake.md) | `pip install 'acryl-datahub[snowflake-usage]'` | Snowflake usage statistics source |
| [sql-profiles](../metadata-ingestion/source_docs/sql_profiles.md) | `pip install 'acryl-datahub[sql-profiles]'` | Data profiles for SQL-based systems |
| [sqlalchemy](../metadata-ingestion/source_docs/sqlalchemy.md) | `pip install 'acryl-datahub[sqlalchemy]'` | Generic SQLAlchemy source |
| [superset](../metadata-ingestion/source_docs/superset.md) | `pip install 'acryl-datahub[superset]'` | Superset source |
| [tableau](../metadata-ingestion/source_docs/tableau.md) | `pip install 'acryl-datahub[tableau]'` | Tableau source |
| [trino](../metadata-ingestion/source_docs/trino.md) | `pip install 'acryl-datahub[trino]'` | Trino source |
| [starburst-trino-usage](../metadata-ingestion/source_docs/trino.md) | `pip install 'acryl-datahub[starburst-trino-usage]'` | Starburst Trino usage statistics source |
| [nifi](../metadata-ingestion/source_docs/nifi.md) | `pip install 'acryl-datahub[nifi]'` | NiFi source |

### Sinks

| Plugin Name | Install Command | Provides |
| --------------------------------------- | -------------------------------------------- | -------------------------- |
| [file](../metadata-ingestion/sink_docs/file.md) | _included by default_ | File source and sink |
| [console](../metadata-ingestion/sink_docs/console.md) | _included by default_ | Console sink |
| [datahub-rest](../metadata-ingestion/sink_docs/datahub.md) | `pip install 'acryl-datahub[datahub-rest]'` | DataHub sink over REST API |
| [datahub-kafka](../metadata-ingestion/sink_docs/datahub.md) | `pip install 'acryl-datahub[datahub-kafka]'` | DataHub sink over Kafka |

These plugins can be mixed and matched as desired. For example:

```shell
pip install 'acryl-datahub[bigquery,datahub-rest]'
```

### Check the active plugins

```shell
datahub check plugins
```

[extra requirements]: https://www.python-ldap.org/en/python-ldap-3.3.0/installing.html#build-prerequisites

## User Guide

The `datahub` CLI allows you to do many things, such as quickstarting a DataHub Docker instance locally, ingesting metadata from your sources, and retrieving and modifying metadata.
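A few common invocations, as a quick illustration (the recipe path reuses the example from above):

```shell
datahub version            # show the installed CLI version
datahub docker quickstart  # spin up a local DataHub instance with Docker
datahub ingest -c ./examples/recipes/example_to_datahub_rest.yml  # run an ingestion recipe
```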