feat(ingest/dagster): Dagster source #10071
Changes from 37 commits
@@ -0,0 +1,85 @@
name: Dagster Plugin
on:
  push:
    branches:
      - master
    paths:
      - ".github/workflows/dagster-plugin.yml"
      - "metadata-ingestion-modules/dagster-plugin/**"
      - "metadata-ingestion/**"
      - "metadata-models/**"
  pull_request:
    branches:
      - master
    paths:
      - ".github/**"
      - "metadata-ingestion-modules/dagster-plugin/**"
      - "metadata-ingestion/**"
      - "metadata-models/**"
  release:
    types: [published]

concurrency:
  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
  cancel-in-progress: true

jobs:
  dagster-plugin:
    runs-on: ubuntu-latest
    env:
      SPARK_VERSION: 3.0.3
      DATAHUB_TELEMETRY_ENABLED: false
    strategy:
      matrix:
        python-version: ["3.8", "3.10"]
        include:
          - python-version: "3.8"
            extraPythonRequirement: "dagster>=1.3.3"
          - python-version: "3.10"
            extraPythonRequirement: "dagster>=1.3.3"
      fail-fast: false
    steps:
      - name: Set up JDK 17
        uses: actions/setup-java@v3
        with:
          distribution: "zulu"
          java-version: 17
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
          cache: "pip"
      - name: Install dependencies
        run: ./metadata-ingestion/scripts/install_deps.sh
      - name: Install dagster package and test (extras ${{ matrix.extraPythonRequirement }})
        run: ./gradlew -Pextra_pip_requirements='${{ matrix.extraPythonRequirement }}' :metadata-ingestion-modules:dagster-plugin:lint :metadata-ingestion-modules:dagster-plugin:testQuick
      - name: pip freeze show list installed
        if: always()
        run: source metadata-ingestion-modules/dagster-plugin/venv/bin/activate && pip freeze
      - uses: actions/upload-artifact@v3
        if: ${{ always() && matrix.python-version == '3.10' && matrix.extraPythonRequirement == 'dagster>=1.3.3' }}
        with:
          name: Test Results (dagster Plugin ${{ matrix.python-version}})
          path: |
            **/build/reports/tests/test/**
            **/build/test-results/test/**
            **/junit.*.xml
      - name: Upload coverage to Codecov
        if: always()
        uses: codecov/codecov-action@v3
        with:
          token: ${{ secrets.CODECOV_TOKEN }}
          directory: .
          fail_ci_if_error: false
          flags: dagster-${{ matrix.python-version }}-${{ matrix.extraPythonRequirement }}
          name: pytest-dagster
          verbose: true

  event-file:
    runs-on: ubuntu-latest
    steps:
      - name: Upload
        uses: actions/upload-artifact@v3
        with:
          name: Event File
          path: ${{ github.event_path }}
@@ -0,0 +1,89 @@
# Dagster Integration

DataHub supports integration of:

- Dagster pipeline metadata
- Job and Op run information
- Lineage information, when present

## Using DataHub's Dagster Sensor

Dagster sensors let you perform actions in response to state changes. DataHub's Dagster sensor emits metadata after every Dagster pipeline run, for both successful and failed runs. For more details about Dagster sensors, see [Sensors](https://docs.dagster.io/concepts/partitions-schedules-sensors/sensors).

### Prerequisites

1. Create a new Dagster project. See <https://docs.dagster.io/getting-started/create-new-project>.
2. There are two ways to define a Dagster definition before starting the Dagster UI: the [Definitions](https://docs.dagster.io/_apidocs/definitions#dagster.Definitions) class (recommended) or [Repositories](https://docs.dagster.io/concepts/repositories-workspaces/repositories#repositories).
3. A newly created Dagster project uses the Definitions class by default.

### Setup

1. Install the required dependency:

```shell
pip install acryl_datahub_dagster_plugin
```

2. Import the sensor definition provided by the DataHub Dagster plugin and add it to your Dagster definition or repository before starting the Dagster UI, as shown below.
**Using the Definitions class:**

```python
{{ inline /metadata-ingestion-modules/dagster-plugin/src/datahub_dagster_plugin/example_jobs/basic_setup.py }}
```

3. The sensor provided by the DataHub Dagster plugin uses the configs below. You can set them using environment variables; if a config is not set, the sensor uses its default value.

**Configuration options:**

| Configuration option | Default value | Description |
|-------------------------------|---------------|-------------|
| datahub_client_config | | The DataHub client config. |
| dagster_url | | The URL of your Dagster webserver. |
| capture_asset_materialization | True | Whether to capture asset keys as Datasets on AssetMaterialization events. |
| capture_input_output | True | Whether to capture and try to parse inputs and outputs from HANDLED_OUTPUT and LOADED_INPUT events. Currently only [PathMetadataValue](https://github.com/dagster-io/dagster/blob/7e08c05dcecef9fd07f887c7846bd1c9a90e7d84/python_modules/dagster/dagster/_core/definitions/metadata/__init__.py#L655) metadata is supported. (EXPERIMENTAL) |
| platform_instance | | Optional. The instance of the platform that all assets produced by this recipe belong to. |
| asset_lineage_extractor | | A callable implementing your own logic to capture asset lineage information. See the example for details. |

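The "environment variable, else default" behavior described in step 3 can be sketched as follows. This is an illustrative sketch only: the `DATAHUB_` prefix, the exact variable names, and the `resolve_option` helper are hypothetical, not the plugin's actual implementation.

```python
import os

# Defaults mirroring an illustrative subset of the table above.
DEFAULTS = {
    "capture_asset_materialization": True,
    "capture_input_output": True,
    "dagster_url": None,
}


def resolve_option(name: str, env_prefix: str = "DATAHUB_"):
    """Return the env-var override for a config option, else its default.

    Boolean options accept "1"/"true"/"yes" (case-insensitive).
    """
    raw = os.environ.get(env_prefix + name.upper())
    if raw is None:
        return DEFAULTS[name]
    if isinstance(DEFAULTS[name], bool):
        return raw.strip().lower() in ("1", "true", "yes")
    return raw
```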
4. Once the Dagster UI is up, turn on the provided sensor. Go to the Overview tab, then the Sensors tab; each defined sensor has a toggle to turn it on or off.

5. The sensor is now ready to emit metadata after every Dagster pipeline run.

### How to validate the installation

1. In the Dagster UI, check Overview -> Sensors for the 'datahub_sensor'.
2. Run a Dagster job. In the Dagster daemon logs, you should see DataHub-related log messages like:

```
datahub_sensor - Emitting metadata...
```

## Dagster Ins and Out

You can provide inputs and outputs to both assets and ops explicitly, using dictionaries of `Ins` and `Out` objects corresponding to the decorated function's arguments, and attach metadata to them. To create upstream and downstream dataset dependencies for assets and ops, provide `ins` and `out` dictionaries with metadata. For reference, see the sample jobs built from assets ([`assets_job.py`](../../metadata-ingestion-modules/dagster-plugin/src/datahub_dagster_plugin/example_jobs/assets_job.py)) and ops ([`ops_job.py`](../../metadata-ingestion-modules/dagster-plugin/src/datahub_dagster_plugin/example_jobs/ops_job.py)).

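The upstream/downstream convention can be illustrated with a small sketch. The `datahub.inputs`/`datahub.outputs` metadata key names and the hand-built URN helper below are assumptions for illustration; check the linked example jobs for the exact keys the plugin expects.

```python
def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Hand-build a DataHub dataset URN (mirrors datahub's make_dataset_urn helper)."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


# Metadata dictionaries you could attach to an op's `ins`/`out` definitions.
# The "datahub.inputs"/"datahub.outputs" key names are illustrative assumptions.
ins_metadata = {
    "datahub.inputs": [make_dataset_urn("snowflake", "db.schema.src_table")]
}
out_metadata = {
    "datahub.outputs": [make_dataset_urn("snowflake", "db.schema.dst_table")]
}
```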
## Define your own logic to capture asset lineage information

You can define custom logic to capture asset lineage information.

The output tuple contains two dictionaries, one for input assets and one for output assets. Each dictionary maps an op key to the set of asset URNs that are upstream or downstream of that op.

```python
from typing import Dict, Set, Tuple

from dagster import RunStatusSensorContext
from datahub.ingestion.graph.client import DataHubGraph
from datahub_dagster_plugin.client.dagster_generator import DagsterGenerator


def asset_lineage_extractor(
    context: RunStatusSensorContext,
    dagster_generator: DagsterGenerator,
    graph: DataHubGraph,
) -> Tuple[Dict[str, Set], Dict[str, Set]]:
    input_assets: Dict[str, Set] = {}
    output_assets: Dict[str, Set] = {}

    # Extract input and output assets from the run context here.
    return input_assets, output_assets
```

[See the example job here](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion-modules/dagster-plugin/src/datahub_dagster_plugin/example_jobs/advanced_ops_jobs.py).

## Debugging

### Connection error for the DataHub REST URL

If you get `ConnectionError: HTTPConnectionPool(host='localhost', port=8080)`, your DataHub GMS service is not up.
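To check quickly whether anything is listening on the GMS host and port at all, here is a small stdlib sketch; the host and port are taken from the default values in the error message above.

```python
import socket


def gms_reachable(host: str = "localhost", port: int = 8080, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the DataHub GMS host/port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if not gms_reachable():
    print("DataHub GMS is not reachable; start it before enabling the sensor.")
```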
@@ -0,0 +1,143 @@
.envrc
src/datahub_dagster_plugin/__init__.py.bak
.vscode/
output
pvenv36/
bq_credentials.json
/tmp
*.bak

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# Generated classes
src/datahub/metadata/
wheels/
junit.quick.xml
@@ -0,0 +1,4 @@
# Datahub Dagster Plugin

See the DataHub Dagster docs for details.
[actionlint] reported by reviewdog 🐶
property "extrapythonrequirement" is not defined in object type {python-version: number} [expression]