feat(metadata-ingestion): Add support for druid #2235
Conversation
This is a WIP, as there are 2 unknowns at this time.

1.) How to add the new druid source as a plugin in the ingestion framework.
2.) When running the crawler, it fails with an exception despite creating the expected MCEs:

```
$ datahub ingest -c /home/pedro/dev/DCData/Data.DataHub/helm/datahub/configs/druid-metadata-crawler/crawler-config.yaml
[2021-03-15 10:41:20,408] DEBUG {datahub.entrypoints:67} - Using config: {'source': {'type': 'druid', 'config': {'host_port': 'localhost:8082'}}, 'sink': {'type': 'file', 'config': {'filename': './druid.json'}}}
[2021-03-15 10:41:20,408] DEBUG {datahub.ingestion.run.pipeline:63} - Source type:druid,<class 'datahub.ingestion.source.druid.DruidSource'> configured
[2021-03-15 10:41:20,408] INFO {datahub.ingestion.sink.file:27} - Will write to druid.json
[2021-03-15 10:41:20,409] DEBUG {datahub.ingestion.run.pipeline:69} - Sink type:file,<class 'datahub.ingestion.sink.file.FileSink'> configured
[2021-03-15 10:41:20,409] DEBUG {datahub.ingestion.source.sql_common:197} - sql_alchemy_url=druid://localhost:8082/druid/v2/sql/
[2021-03-15 10:41:20,603] DEBUG {datahub.ingestion.run.pipeline:38} - sink called success callback
Traceback (most recent call last):
File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/sqlalchemy/engine/result.py", line 1215, in _fetchone_impl
return self.cursor.fetchone()
AttributeError: 'NoneType' object has no attribute 'fetchone'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/pedro/dev/datahub/metadata-ingestion/venv/bin/datahub", line 33, in <module>
sys.exit(load_entry_point('datahub', 'console_scripts', 'datahub')())
File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/pedro/dev/datahub/metadata-ingestion/src/datahub/entrypoints.py", line 73, in ingest
pipeline.run()
File "/home/pedro/dev/datahub/metadata-ingestion/src/datahub/ingestion/run/pipeline.py", line 81, in run
for wu in self.source.get_workunits():
File "/home/pedro/dev/datahub/metadata-ingestion/src/datahub/ingestion/source/sql_common.py", line 205, in get_workunits
for table in inspector.get_table_names(schema):
File "<string>", line 2, in get_table_names
File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/sqlalchemy/util/deprecations.py", line 139, in warned
return fn(*args, **kwargs)
File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/sqlalchemy/engine/reflection.py", line 201, in get_table_names
tnames = self.dialect.get_table_names(
File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/pydruid/db/sqlalchemy.py", line 152, in get_table_names
return [row.TABLE_NAME for row in result]
File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/pydruid/db/sqlalchemy.py", line 152, in <listcomp>
return [row.TABLE_NAME for row in result]
File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/sqlalchemy/engine/result.py", line 1010, in __iter__
row = self.fetchone()
File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/sqlalchemy/engine/result.py", line 1343, in fetchone
self.connection._handle_dbapi_exception(
File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1514, in _handle_dbapi_exception
util.raise_(exc_info[1], with_traceback=exc_info[2])
File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
raise exception
File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/sqlalchemy/engine/result.py", line 1336, in fetchone
row = self._fetchone_impl()
File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/sqlalchemy/engine/result.py", line 1217, in _fetchone_impl
return self._non_result(None, err)
File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/sqlalchemy/engine/result.py", line 1236, in _non_result
util.raise_(
File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
raise exception
sqlalchemy.exc.ResourceClosedError: This result object does not return rows. It has been closed automatically.
```

I believe this is due to the logic defined in SQLAlchemySource#get_workunits. My understanding is that this logic assumes all tables in druid's metadata store are meant to be crawled, when they should not be. I'm not familiar enough with either the database metadata architecture or the ingestion framework to understand why these tables are empty and whether they should be crawled to begin with.
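For context, here is a minimal sketch of the inspection flow that get_workunits relies on, using only SQLAlchemy's public inspection API rather than the exact datahub code; the URL is taken from the log above:

```python
from sqlalchemy import create_engine, inspect

# Sketch: reproduce the inspection calls from the stack trace above.
# Requires `pip install pydruid` so SQLAlchemy can resolve the druid:// dialect.
engine = create_engine("druid://localhost:8082/druid/v2/sql/")
inspector = inspect(engine)

for schema in inspector.get_schema_names():
    # The failing call: pydruid's get_table_names queries INFORMATION_SCHEMA
    # and iterates the result, which per the traceback comes back closed
    # for some schemas.
    for table in inspector.get_table_names(schema):
        print(f"{schema}.{table}")
```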
For (1), see the sketch below for how to add a plugin. As for how the plugin system works: sources are registered as setuptools entry points in setup.py, and a recipe's `type` field is resolved against that registry.

The second issue is more opaque to me, since it varies by SQLAlchemy driver. At a high level, we're using SQLAlchemy's fine-grained inspection to extract metadata from each source. Based on the stack trace, it seems to be failing on the call to `get_table_names`.
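Roughly, the registration looks like this in setup.py. This is a sketch: the package name, version, and pydruid pin are placeholders, and the entry-point group and extras layout follow the pattern of the existing sources in this repo.

```python
from setuptools import find_packages, setup

setup(
    name="acryl-datahub",  # placeholder package metadata
    version="0.0.0.dev0",
    package_dir={"": "src"},
    packages=find_packages("src"),
    extras_require={
        # Optional dependency set: installed via `pip install 'acryl-datahub[druid]'`.
        "druid": ["pydruid>=0.6.2"],  # version pin is an assumption
    },
    entry_points={
        # A recipe's `type: druid` key is resolved against this registry.
        "datahub.ingestion.source.plugins": [
            "druid = datahub.ingestion.source.druid:DruidSource",
        ],
    },
)
```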
Thank you for the detailed explanation.
FYI @hsheth2, while I figure out what is happening on pydruid's side, I've added a schema deny pattern so we don't crawl what appear to be internal druid metadata structures, since those aren't metadata that should be shown. It should suffice for this PR's scope; my only concern is whether user-specified table & schema patterns may override this behaviour. Do you see any way to handle such a case?
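For reference, the deny pattern is an ordinary regex match, so it's easy to sanity-check what it excludes. A pure-Python illustration, not framework code:

```python
import re

deny = re.compile(r"^(lookup|sys).*")

# druid's internal metadata schemas are matched (and therefore denied)...
assert deny.match("sys") is not None
assert deny.match("lookup") is not None
# ...while a user-facing schema like `druid` is still crawled.
assert deny.match("druid") is None
```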
metadata-ingestion/setup.cfg

```diff
@@ -42,6 +42,8 @@ ignore_missing_imports = yes
 ignore_missing_imports = yes
 [mypy-snowflake.*]
 ignore_missing_imports = yes
+[mypy-druid.*]
```
I think you'd want `[mypy-pydruid.*]`, since the actual python package is called `pydruid`.
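That is, assuming `pydruid` is the module actually imported, the stanza would read:

```ini
[mypy-pydruid.*]
ignore_missing_imports = yes
```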
The CI pipeline failed because the code coverage percentage is under target (79.16% vs. the required 80%).
If you don’t mind, can you drop the coverage requirement in setup.cfg to 75%? Eventually I’ll get around to writing more comprehensive tests, but I don’t want to block you on that.
Done, appreciate the understanding!
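For the record, with coverage.py configured through setup.cfg the knob would look something like this; the exact section is an assumption, since it depends on how coverage is wired into CI:

```ini
[coverage:report]
fail_under = 75
```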
LGTM - awesome work @pedro93!
```yaml
host_port: localhost:8082
schema_pattern:
  deny:
    - "^(lookup|sys).*"
```
I wonder if we should "hardcode" this inside the Druid config class, so it is always applied regardless of what the user configures.
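One way to do that (a sketch, assuming the druid source follows the BasicSQLAlchemyConfig pattern from sql_common and that AllowDenyPattern keeps its deny list as plain regex strings) is a pydantic validator that re-adds the internal schemas no matter what the user passes:

```python
from pydantic import validator

from datahub.configuration.common import AllowDenyPattern
from datahub.ingestion.source.sql_common import BasicSQLAlchemyConfig


class DruidConfig(BasicSQLAlchemyConfig):
    scheme = "druid"
    # Default: skip druid's internal metadata schemas.
    schema_pattern: AllowDenyPattern = AllowDenyPattern(deny=["^(lookup|sys).*"])

    @validator("schema_pattern")
    def deny_internal_schemas(cls, v: AllowDenyPattern) -> AllowDenyPattern:
        # Even if the user overrides schema_pattern, keep the internal
        # schemas denied so they are never crawled.
        if "^(lookup|sys).*" not in v.deny:
            v.deny.append("^(lookup|sys).*")
        return v
```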
LGTM!
Checklist
Adds support for druid as a metadata source to be crawled using the ingestion framework.
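For reference, the recipe used for local testing, reconstructed from the config logged above:

```yaml
source:
  type: druid
  config:
    host_port: localhost:8082
sink:
  type: file
  config:
    filename: ./druid.json
```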
Test
This was tested in 2 ways: