-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: adding support for glue #2319
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Broader question here - how are we going to handle this vs @adriaanslechten's one (https://gist.github.com/adriaanslechten/829efd1a3bc1842fe283f64aa2a06a1e)?
I've had a talk with Amy and said she could take whatever she would need from that branch. I think this way we can fill in the lacking features quite fast. I'll try to make some concrete suggestions. |
metadata-ingestion/src/datahub/ingestion/source/source_registry.py
Outdated
Show resolved
Hide resolved
@amonkhouse for the mypy issues, seems like you're missing some required arguments. These really should be made optional in the model, but that's a different problem. That said, the diff --git a/metadata-ingestion/src/datahub/ingestion/source/glue.py b/metadata-ingestion/src/datahub/ingestion/source/glue.py
index 0c9855635..1ec59d7cd 100644
--- a/metadata-ingestion/src/datahub/ingestion/source/glue.py
+++ b/metadata-ingestion/src/datahub/ingestion/source/glue.py
@@ -16,6 +16,7 @@ from datahub.metadata.com.linkedin.pegasus2avro.schema import (
BooleanTypeClass,
BytesTypeClass,
DateTypeClass,
+ MySqlDDL,
NullTypeClass,
NumberTypeClass,
SchemaField,
@@ -171,6 +172,7 @@ class GlueSource(Source):
},
},
uri=table.get("Location"),
+ tags=[],
)
def get_schema_metadata(glue_source: GlueSource):
@@ -179,6 +181,7 @@ class GlueSource(Source):
for field in schema:
schema_field = SchemaField(
fieldPath=field["Name"],
+ nullable=True,
nativeDataType=field["Type"],
type=get_column_type(
glue_source, field["Type"], table_name, field["Name"]
@@ -194,6 +197,8 @@ class GlueSource(Source):
platform="urn:li:dataPlatform:glue",
created=AuditStamp(time=sys_time, actor="urn:li:corpuser:etl"),
lastModified=AuditStamp(time=sys_time, actor="urn:li:corpuser:etl"),
+ hash="",
+ platformSchema=MySqlDDL(tableSchema=""),
)
sys_time = int(time.time() * 1000) |
ah thank you, couldn't for the life of me figure out how to fix that last night |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
There are a couple of smaller comments that can be addressed in a follow-on PR.
config: | ||
aws_region: aws_region_name # i.e. "eu-west-1" | ||
env: environment used for the DatasetSnapshot URN, one of "DEV", "EI", "PROD" or "CORP". # Optional, defaults to "PROD". | ||
databases: list of databases to process. # Optional, if not specified then all databases will be processed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would like to standardize on the allow-deny-list pattern for these across all the sources.
e.g. schema_pattern and table_pattern in SQL-based sources such as
https://datahubproject.io/docs/metadata-ingestion#mysql-metadata-mysql
Not a show-stopper, and can be done in a follow-on PR.
@@ -82,6 +82,7 @@ def get_long_description(): | |||
# Sink plugins. | |||
"datahub-kafka": kafka_common, | |||
"datahub-rest": {"requests>=2.25.1"}, | |||
"glue": {"boto3"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Should be moved up to the #Source plugins.
area.
Adding Glue ingestion support.
Checklist