Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: adding support for glue #2319

Merged
merged 1 commit into from
Apr 4, 2021
Merged

feat: adding support for glue #2319

merged 1 commit into from
Apr 4, 2021

Conversation

amonkhouse
Copy link
Contributor

@amonkhouse amonkhouse commented Mar 31, 2021

Adding Glue ingestion support.

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable)

Copy link
Collaborator

@hsheth2 hsheth2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Broader question here - how are we going to handle this vs @adriaanslechten's one (https://gist.github.com/adriaanslechten/829efd1a3bc1842fe283f64aa2a06a1e)?

metadata-ingestion/setup.py Outdated Show resolved Hide resolved
@adriaanslechten
Copy link
Contributor

Broader question here - how are we going to handle this vs @adriaanslechten's one (https://gist.github.com/adriaanslechten/829efd1a3bc1842fe283f64aa2a06a1e)?

I've had a talk with Amy and said she could take whatever she would need from that branch. I think this way we can fill in the lacking features quite fast. I'll try to make some concrete suggestions.

@amonkhouse amonkhouse changed the title Add glue support DRAFT feat: adding support for glue Apr 3, 2021
@amonkhouse amonkhouse marked this pull request as ready for review April 3, 2021 18:04
@hsheth2
Copy link
Collaborator

hsheth2 commented Apr 3, 2021

@amonkhouse for the mypy issues, seems like you're missing some required arguments. These really should be made optional in the model, but that's a different problem. That said, the nullable field would be a good one to fill if possible.

diff --git a/metadata-ingestion/src/datahub/ingestion/source/glue.py b/metadata-ingestion/src/datahub/ingestion/source/glue.py
index 0c9855635..1ec59d7cd 100644
--- a/metadata-ingestion/src/datahub/ingestion/source/glue.py
+++ b/metadata-ingestion/src/datahub/ingestion/source/glue.py
@@ -16,6 +16,7 @@ from datahub.metadata.com.linkedin.pegasus2avro.schema import (
     BooleanTypeClass,
     BytesTypeClass,
     DateTypeClass,
+    MySqlDDL,
     NullTypeClass,
     NumberTypeClass,
     SchemaField,
@@ -171,6 +172,7 @@ class GlueSource(Source):
                     },
                 },
                 uri=table.get("Location"),
+                tags=[],
             )
 
         def get_schema_metadata(glue_source: GlueSource):
@@ -179,6 +181,7 @@ class GlueSource(Source):
             for field in schema:
                 schema_field = SchemaField(
                     fieldPath=field["Name"],
+                    nullable=True,
                     nativeDataType=field["Type"],
                     type=get_column_type(
                         glue_source, field["Type"], table_name, field["Name"]
@@ -194,6 +197,8 @@ class GlueSource(Source):
                 platform="urn:li:dataPlatform:glue",
                 created=AuditStamp(time=sys_time, actor="urn:li:corpuser:etl"),
                 lastModified=AuditStamp(time=sys_time, actor="urn:li:corpuser:etl"),
+                hash="",
+                platformSchema=MySqlDDL(tableSchema=""),
             )
 
         sys_time = int(time.time() * 1000)

@amonkhouse
Copy link
Contributor Author

@amonkhouse for the mypy issues, seems like you're missing some required arguments. These really should be made optional in the model, but that's a different problem. That said, the nullable field would be a good one to fill if possible.

diff --git a/metadata-ingestion/src/datahub/ingestion/source/glue.py b/metadata-ingestion/src/datahub/ingestion/source/glue.py
index 0c9855635..1ec59d7cd 100644
--- a/metadata-ingestion/src/datahub/ingestion/source/glue.py
+++ b/metadata-ingestion/src/datahub/ingestion/source/glue.py
@@ -16,6 +16,7 @@ from datahub.metadata.com.linkedin.pegasus2avro.schema import (
     BooleanTypeClass,
     BytesTypeClass,
     DateTypeClass,
+    MySqlDDL,
     NullTypeClass,
     NumberTypeClass,
     SchemaField,
@@ -171,6 +172,7 @@ class GlueSource(Source):
                     },
                 },
                 uri=table.get("Location"),
+                tags=[],
             )
 
         def get_schema_metadata(glue_source: GlueSource):
@@ -179,6 +181,7 @@ class GlueSource(Source):
             for field in schema:
                 schema_field = SchemaField(
                     fieldPath=field["Name"],
+                    nullable=True,
                     nativeDataType=field["Type"],
                     type=get_column_type(
                         glue_source, field["Type"], table_name, field["Name"]
@@ -194,6 +197,8 @@ class GlueSource(Source):
                 platform="urn:li:dataPlatform:glue",
                 created=AuditStamp(time=sys_time, actor="urn:li:corpuser:etl"),
                 lastModified=AuditStamp(time=sys_time, actor="urn:li:corpuser:etl"),
+                hash="",
+                platformSchema=MySqlDDL(tableSchema=""),
             )
 
         sys_time = int(time.time() * 1000)
platformSchema=MySqlDDL(tableSchema=""),

ah thank you, couldn't for the life of me figure out how to fix that last night

Copy link
Contributor

@shirshanka shirshanka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

There are a couple of smaller comments that can be addressed in a follow-on PR.

config:
aws_region: aws_region_name # i.e. "eu-west-1"
env: environment used for the DatasetSnapshot URN, one of "DEV", "EI", "PROD" or "CORP". # Optional, defaults to "PROD".
databases: list of databases to process. # Optional, if not specified then all databases will be processed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would like to standardize on the allow-deny-list pattern for these across all the sources.

e.g. schema_pattern and table_pattern in SQL-based sources such as
https://datahubproject.io/docs/metadata-ingestion#mysql-metadata-mysql

Not a show-stopper, and can be done in a follow-on PR.

@@ -82,6 +82,7 @@ def get_long_description():
# Sink plugins.
"datahub-kafka": kafka_common,
"datahub-rest": {"requests>=2.25.1"},
"glue": {"boto3"},
Copy link
Contributor

@shirshanka shirshanka Apr 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Should be moved up to the #Source plugins. area.

@shirshanka shirshanka merged commit 7592881 into datahub-project:master Apr 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants