Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ingestion/glue): delta schemas #10299

Merged

Conversation

sgomezvillamor
Copy link
Contributor

@sgomezvillamor sgomezvillamor commented Apr 16, 2024

Hive sync up in Spark is wrongly reported as col (array<string>). Experience shows that the issue happens only when using Scala and for tables (not views), which is actually a very common case. Given that the correct schema is serialized in the spark.sql.sources.schema.part.{i} table parameters, this PR parses and processes the schema from those properties so such a valuable metadata is not lost.

This is solving the following feature request: https://feature-requests.datahubproject.io/p/glue-crawler-support-for-correct-delta-schema-stored-in-properties

This bug in Spark has been reported many times and never solved. Some references here:

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added ingestion PR or Issue related to the ingestion of metadata community-contribution PR or Issue raised by member(s) of DataHub Community datahub-community-champion PRs authored by DataHub Community Champions labels Apr 16, 2024
"nullable": true,
"type": {
"type": {
"com.linkedin.pegasus2avro.schema.NullType": {}
"com.linkedin.pegasus2avro.schema.NumberType": {}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for reviewers:
This is for the new mapping I added here

Comment on lines 209 to 210
num_dataset_schema_invalid: int = 0
num_dataset_buggy_delta_schema: int = 0
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for reviewers:
I added these two mainly for the testing. I'm ok to rename or remove even

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice. Let's please rename as

num_dataset_schema_invalid -> num_dataset_invalid_delta_schema
num_dataset_buggy_delta_schema -> num_dataset_valid_delta_schema

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in d634b10

@sgomezvillamor sgomezvillamor marked this pull request as ready for review April 18, 2024 15:22
@@ -1796,397 +1796,6 @@
"lastRunId": "no-run-id-provided"
}
},
{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for the reviewers:

Removed aspects are not lost; they were duplicated and updating the golden file resulted on removing the duplicated aspects.

@sgomezvillamor
Copy link
Contributor Author

@hsheth2 or @treff7es , as usual contributors to ingestion codebase, could you have a look at this?

@mayurinehate mayurinehate self-requested a review May 8, 2024 06:03
Copy link
Collaborator

@mayurinehate mayurinehate left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add config extract_delta_schema_from_parameters to enable this behavior. can be set to default True. If disabled, none of this delta customization would be used.

Can you confirm if this issue is with particular spark/delta version or all of them ?

Also, if I understand correctly, this workaround can entirely be removed if the issue is fixed in new spark/delta version - for example - if the delta.io PR that you have linked is merged, right ? If so, please add a comment in codebase to mention this.

Comment on lines 209 to 210
num_dataset_schema_invalid: int = 0
num_dataset_buggy_delta_schema: int = 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice. Let's please rename as

num_dataset_schema_invalid -> num_dataset_invalid_delta_schema
num_dataset_buggy_delta_schema -> num_dataset_valid_delta_schema

@@ -1148,9 +1152,35 @@ def get_s3_tags() -> Optional[GlobalTagsClass]:
return new_tags

def get_schema_metadata() -> Optional[SchemaMetadata]:
if not table.get("StorageDescriptor"):
def is_delta_schema(columns: Optional[List[Mapping[str, Any]]]) -> bool:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defining functions within other functions is discouraged as per coding style. Can you please move this function outside ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be refractored to accept both tableParameters and tableStorageDescriptor and then return boolean ? this function can subsume this check as well -> (provider == "delta") and (num_parts > 0)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defining functions within other functions is discouraged as per coding style. Can you please move this function outside ?

While I do agree, I just followed the existing pattern in the code. Note get_owner, get_dataset_properties, get_s3_tags are all defined within _extract_record method.

So, are you suggesting to move is_delta_schema at _extract_record level too or at GlueSource level?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As just moved the is_delta_schema at _extra_record level, to keep it aligned with all others existing methods in the codebase.

return None

def _get_glue_schema_metadata() -> Optional[SchemaMetadata]:
assert table.get("StorageDescriptor")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can remove this assert as this is already checked earlier.

@sgomezvillamor
Copy link
Contributor Author

Thanks for the review @mayurinehate . I will address comments soon.

Can we add config extract_delta_schema_from_parameters to enable this behavior. can be set to default True. If disabled, none of this delta customization would be used.

Sure, good point!

Can you confirm if this issue is with particular spark/delta version or all of them ?

Not a particular version but an long-time overall issue. As you can see in the two links listed in the PR description, community is recurrently complaining about this issue.

Also, if I understand correctly, this workaround can entirely be removed if the issue is fixed in new spark/delta version - for example - if the delta.io PR that you have linked is merged, right ? If so, please add a comment in codebase to mention this.

True. As soon as the hive integration with Spark is correctly providing the schema as expected in the StorageProperties, this could be removed. If that will happen in the PR I linked or some other, hard to predict 😄
I will provide more context in the comment here

# https://github.com/delta-io/delta/pull/2310

@sgomezvillamor
Copy link
Contributor Author

Added the extract_delta_schema_from_parameters config option with False default value in d634b10

@sgomezvillamor
Copy link
Contributor Author

@mayurinehate Could you please review again? Thanks

Copy link
Collaborator

@mayurinehate mayurinehate left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor edits suggested. Everything else looks good to me.

extract_delta_schema_from_parameters: Optional[bool] = Field(
default=False,
description="If enabled, delta schemas can be alternatively fetched from table parameters "
"(https://github.com/delta-io/delta/pull/2310)",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's please move this PR's link as code comment on this config - rather than in description of config,

@sgomezvillamor
Copy link
Contributor Author

@mayurinehate Thanks for the feedback and approval.
I have added your latests suggestions and solved conflicts.

Build is all green but a nifi test. Surprisingly the test complains about not matching golden fine in 3.10 while it works fine in 3.8. Do you think is that related to my updates? or just some random error?

@mayurinehate mayurinehate requested a review from treff7es May 17, 2024 06:04
@mayurinehate
Copy link
Collaborator

CI Failures are unrelated to the changes in the PR.

@treff7es treff7es merged commit 0059960 into datahub-project:master May 17, 2024
57 of 58 checks passed
@sgomezvillamor sgomezvillamor deleted the feat-ingestion-glue-delta-schema branch May 22, 2024 02:31
sleeperdeep pushed a commit to sleeperdeep/datahub that referenced this pull request Jun 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community-contribution PR or Issue raised by member(s) of DataHub Community datahub-community-champion PRs authored by DataHub Community Champions ingestion PR or Issue related to the ingestion of metadata
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants