
feat(metadata-ingestion): Add support for druid #2235

Merged · 7 commits merged into datahub-project:master on Mar 18, 2021

Conversation

pedro93
Collaborator

@pedro93 pedro93 commented Mar 15, 2021

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable)

Adds support for Druid as a metadata source to be crawled using the ingestion framework.

Test

This was tested in 2 ways:

@pedro93
Collaborator Author

pedro93 commented Mar 15, 2021

This is a WIP, as there are two unknowns at this time:

1. How does the metadata framework plugin system work, and what changes do I need to make for it to work?

2. When running the crawler, it fails with an exception despite creating the expected MCEs:

$ datahub ingest -c /home/pedro/dev/DCData/Data.DataHub/helm/datahub/configs/druid-metadata-crawler/crawler-config.yaml
[2021-03-15 10:41:20,408] DEBUG    {datahub.entrypoints:67} - Using config: {'source': {'type': 'druid', 'config': {'host_port': 'localhost:8082'}}, 'sink': {'type': 'file', 'config': {'filename': './druid.json'}}}
[2021-03-15 10:41:20,408] DEBUG    {datahub.ingestion.run.pipeline:63} - Source type:druid,<class 'datahub.ingestion.source.druid.DruidSource'> configured
[2021-03-15 10:41:20,408] INFO     {datahub.ingestion.sink.file:27} - Will write to druid.json
[2021-03-15 10:41:20,409] DEBUG    {datahub.ingestion.run.pipeline:69} - Sink type:file,<class 'datahub.ingestion.sink.file.FileSink'> configured
[2021-03-15 10:41:20,409] DEBUG    {datahub.ingestion.source.sql_common:197} - sql_alchemy_url=druid://localhost:8082/druid/v2/sql/
[2021-03-15 10:41:20,603] DEBUG    {datahub.ingestion.run.pipeline:38} - sink called success callback
Traceback (most recent call last):
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/sqlalchemy/engine/result.py", line 1215, in _fetchone_impl
    return self.cursor.fetchone()
AttributeError: 'NoneType' object has no attribute 'fetchone'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/bin/datahub", line 33, in <module>
    sys.exit(load_entry_point('datahub', 'console_scripts', 'datahub')())
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/pedro/dev/datahub/metadata-ingestion/src/datahub/entrypoints.py", line 73, in ingest
    pipeline.run()
  File "/home/pedro/dev/datahub/metadata-ingestion/src/datahub/ingestion/run/pipeline.py", line 81, in run
    for wu in self.source.get_workunits():
  File "/home/pedro/dev/datahub/metadata-ingestion/src/datahub/ingestion/source/sql_common.py", line 205, in get_workunits
    for table in inspector.get_table_names(schema):
  File "<string>", line 2, in get_table_names
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/sqlalchemy/util/deprecations.py", line 139, in warned
    return fn(*args, **kwargs)
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/sqlalchemy/engine/reflection.py", line 201, in get_table_names
    tnames = self.dialect.get_table_names(
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/pydruid/db/sqlalchemy.py", line 152, in get_table_names
    return [row.TABLE_NAME for row in result]
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/pydruid/db/sqlalchemy.py", line 152, in <listcomp>
    return [row.TABLE_NAME for row in result]
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/sqlalchemy/engine/result.py", line 1010, in __iter__
    row = self.fetchone()
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/sqlalchemy/engine/result.py", line 1343, in fetchone
    self.connection._handle_dbapi_exception(
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1514, in _handle_dbapi_exception
    util.raise_(exc_info[1], with_traceback=exc_info[2])
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
    raise exception
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/sqlalchemy/engine/result.py", line 1336, in fetchone
    row = self._fetchone_impl()
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/sqlalchemy/engine/result.py", line 1217, in _fetchone_impl
    return self._non_result(None, err)
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/sqlalchemy/engine/result.py", line 1236, in _non_result
    util.raise_(
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
    raise exception
sqlalchemy.exc.ResourceClosedError: This result object does not return rows. It has been closed automatically.

I believe this is due to the logic defined in SQLAlchemySource#get_workunits. My understanding is that this logic assumes every table in Druid's metadata store should be crawled, when some should not be.

I'm not familiar enough with either the database's metadata architecture or the ingestion framework to understand why the tables are empty, or whether they should be crawled in the first place.

@hsheth2
Collaborator

hsheth2 commented Mar 15, 2021

For (1), here's how the plugin system works: in source_registry.py, if the source module can be imported successfully, then we register the plugin as enabled. The setup.py file defines Python "extras", which install the right set of dependencies and ensure that the source gets enabled.
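As a rough sketch of that import-gating pattern (the class and method names here are illustrative, not DataHub's exact implementation), a registry can attempt the import and only enable sources whose dependencies are installed:

```python
# Minimal sketch of an import-gated plugin registry. Illustrative names,
# not DataHub's actual source_registry.py.
import importlib
from typing import Dict, Type

class SourceRegistry:
    def __init__(self) -> None:
        self._sources: Dict[str, Type] = {}
        self._disabled: Dict[str, str] = {}

    def register(self, name: str, module_path: str, class_name: str) -> None:
        """Enable `name` only if its module (and its dependencies) import cleanly."""
        try:
            module = importlib.import_module(module_path)
        except ImportError as e:
            # Missing extra: record why, so the CLI can surface a helpful
            # error if the user later asks for this source.
            self._disabled[name] = str(e)
            return
        self._sources[name] = getattr(module, class_name)

    def get(self, name: str) -> Type:
        if name in self._disabled:
            raise ValueError(f"source {name!r} is disabled: {self._disabled[name]}")
        return self._sources[name]

registry = SourceRegistry()
registry.register("json", "json", "JSONDecoder")        # stdlib: always importable
registry.register("missing", "not_a_real_module", "X")  # import fails -> disabled
```

Installing the matching extra (e.g. `pip install 'acryl-datahub[druid]'`) is what makes the import succeed and the plugin show up as enabled.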

The second issue is more opaque to me, since behavior varies by SQLAlchemy driver. At a high level, we're using SQLAlchemy's fine-grained inspection API to extract metadata from each source.

Based on the stack trace, it seems to be failing on the call to get_table_names(), which corresponds to pydruid's get_table_names() implementation: https://github.com/druid-io/pydruid/blob/a1d9b7ec877800c7b322fb671f3fb46738a58438/pydruid/db/sqlalchemy.py#L144. Somehow that query is failing?
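For reference, the pattern behind that call is an ordinary metadata-tables query: SQLAlchemy's Inspector asks the dialect for table names, and each dialect (pydruid included, via INFORMATION_SCHEMA.TABLES) issues a query like the one below. This sketch uses the standard library's SQLite just to show the idea without extra dependencies:

```python
import sqlite3

# The ingestion framework asks SQLAlchemy's Inspector for table names;
# under the hood each dialect runs a metadata query against the database.
# SQLite's sqlite_master plays the role of Druid's INFORMATION_SCHEMA.TABLES.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, ts TEXT)")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")

rows = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
).fetchall()
table_names = [r[0] for r in rows]
print(table_names)  # ['events', 'users']
```

The ResourceClosedError above suggests pydruid's cursor came back with no result set at all, which is why iterating over it fails rather than returning an empty list.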

@pedro93
Collaborator Author

pedro93 commented Mar 15, 2021

Thank you for the detailed explanation.
Regarding point 2, I have opened an issue in pydruid with the hope that someone there can shed light on what we are seeing.

@pedro93
Collaborator Author

pedro93 commented Mar 16, 2021

FYI @hsheth2: while I figure out what is happening on pydruid's side, I've added a schema deny pattern so we don't crawl what appear to be internal Druid metadata structures, since those aren't metadata I think should be shown.

That should suffice for this PR's scope; my only concern is whether user-specified table & schema patterns may override this behaviour.

Do you see any way to handle such a case?
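The deny-pattern semantics as I understand them (a simplified sketch; field names are illustrative and this is not DataHub's exact AllowDenyPattern class) look like this, with deny taking precedence over allow:

```python
import re
from typing import List, Optional

# Sketch of allow/deny schema filtering. Simplified; not the exact
# implementation in DataHub's metadata-ingestion framework.
class SchemaPattern:
    def __init__(self, allow: Optional[List[str]] = None,
                 deny: Optional[List[str]] = None) -> None:
        self.allow = allow or [".*"]  # allow everything by default
        self.deny = deny or []

    def allowed(self, schema: str) -> bool:
        # Deny wins over allow: internal schemas stay hidden even when
        # the allow list matches everything.
        if any(re.match(p, schema) for p in self.deny):
            return False
        return any(re.match(p, schema) for p in self.allow)

# The deny pattern from this PR excludes Druid's internal schemas.
pattern = SchemaPattern(deny=[r"^(lookup|sys).*"])
print([s for s in ["druid", "sys", "lookup"] if pattern.allowed(s)])  # ['druid']
```

Under these semantics a user-supplied allow pattern cannot re-admit a denied schema, but a user-supplied deny list could replace the default one, which is the concern raised above.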

@pedro93 pedro93 changed the title [WIP] feat(metadata-ingestion): Add support for druid feat(metadata-ingestion): Add support for druid Mar 16, 2021
@@ -42,6 +42,8 @@ ignore_missing_imports = yes
ignore_missing_imports = yes
[mypy-snowflake.*]
ignore_missing_imports = yes
[mypy-druid.*]
Collaborator


I think you'd want [mypy-pydruid.*], since the actual Python package is called pydruid.

@pedro93
Collaborator Author

pedro93 commented Mar 17, 2021

The CI pipeline failed because code coverage is under target (79.16% vs. the required 80%).

@hsheth2
Collaborator

hsheth2 commented Mar 17, 2021

If you don’t mind, can you drop the coverage requirement in setup.cfg to 75%?

Eventually I’ll get around to writing more comprehensive tests, but I don’t want to block you on that.

@pedro93
Collaborator Author

pedro93 commented Mar 17, 2021

Done, appreciate the understanding!

Collaborator

@hsheth2 hsheth2 left a comment


LGTM - awesome work @pedro93!

host_port: localhost:8082
schema_pattern:
  deny:
    - "^(lookup|sys).*"
Contributor


I wonder if we should "hardcode" this inside the Druid config class so this is always added on regardless of what the user configures.
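One way to do that (a sketch of the idea, not the merged code; DataHub's configs are richer than this plain-Python class) is to have the Druid config always append the internal-schema deny pattern, regardless of what the user supplies:

```python
import re
from typing import List, Optional

# Sketch: a Druid source config that always denies internal schemas, even
# when the user provides their own deny list. Illustrative, not the merged
# DruidConfig implementation.
DENY_INTERNAL = r"^(lookup|sys).*"

class DruidConfig:
    def __init__(self, host_port: str,
                 schema_deny: Optional[List[str]] = None) -> None:
        self.host_port = host_port
        deny = list(schema_deny or [])
        # Hardcode the internal deny pattern so user config can't remove it.
        if DENY_INTERNAL not in deny:
            deny.append(DENY_INTERNAL)
        self.schema_deny = deny

    def schema_allowed(self, schema: str) -> bool:
        return not any(re.match(p, schema) for p in self.schema_deny)

cfg = DruidConfig("localhost:8082", schema_deny=[r"^staging.*"])
print([s for s in ["druid", "sys", "staging_a"] if cfg.schema_allowed(s)])  # ['druid']
```

This makes the exclusion of lookup/sys a property of the source rather than of any one recipe file, which addresses the override concern raised earlier in the thread.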

Contributor

@shirshanka shirshanka left a comment


LGTM!

@shirshanka shirshanka merged commit 6a0c402 into datahub-project:master Mar 18, 2021