
feat(ingest): snowflake profile tables only if they have been updated… #5132

Conversation

mayurinehate
Collaborator

… since N days

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added, a usage guide has been added for it.
  • For any breaking change, potential downtime, deprecation, or other big change, an entry has been made in Updating DataHub

@mayurinehate mayurinehate marked this pull request as ready for review June 9, 2022 13:22
@github-actions

github-actions bot commented Jun 9, 2022

Unit Test Results (build & test)

381 tests ±0   381 ✔️ ±0   3m 40s ⏱️ +31s
 89 suites ±0    0 💤 ±0
 89 files ±0     0 ❌ ±0

Results for commit def6bb9. ± Comparison against base commit 503208b.

♻️ This comment has been updated with latest results.

@github-actions

github-actions bot commented Jun 9, 2022

Unit Test Results (metadata ingestion)

    5 files ±0      5 suites ±0   1h 27m 57s ⏱️ -19s
  555 tests ±0    552 ✔️ ±0    3 💤 ±0   0 ❌ ±0
2 552 runs ±0   2 477 ✔️ ±0   75 💤 ±0   0 ❌ ±0

Results for commit def6bb9. ± Comparison against base commit 503208b.


@@ -84,6 +84,11 @@ class GEProfilingConfig(ConfigModel):
description="A positive integer that specifies the maximum number of columns to profile for any table. `None` implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up.",
)

profile_if_updated_since_days: Optional[pydantic.PositiveFloat] = Field(
default=1,
description="Profile table only if it has been updated since these many number of days . `None` implies profile all tables. Only Snowflake supports this.",
Collaborator

Remove space between days and . in days .

Collaborator Author

Done

except ProgrammingError as pe:
# Snowflake needs schema names quoted when fetching table comments.
logger.debug(
f"Encountered ProgrammingError. Retrying with quoted schema name for schema {schema} and view {view}",
Collaborator

If this is required, then why are we not doing this in the try block itself? Can you please give an example of both cases?

Collaborator Author

I simply followed this code for getting table comments - https://github.com/datahub-project/datahub/blob/v0.8.38/metadata-ingestion/src/datahub/ingestion/source/sql/sql_common.py#L999-L1007 - and modified it for views, because I was getting the same error for a view.
I think it's better off in the except clause, since we wouldn't want to apply this for all other sql-common sources.

To give an example: for a schema named public, quotes are not needed, whereas quotes are needed for a schema named test-schema:
show tables like 'tblname' in schema public
show tables like 'tblname' in schema "test-schema"
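The quoting fallback discussed above can be sketched as follows (a minimal illustration, not the actual DataHub code; `show_tables_query` is a hypothetical helper):

```python
# Hypothetical sketch of the quoting behavior discussed in this thread:
# plain identifiers like `public` need no quotes, while schema names with
# special characters like `test-schema` must be double-quoted in Snowflake.
def show_tables_query(schema: str, table: str, quoted: bool = False) -> str:
    # Wrap the schema name in double quotes only when the caller requests it,
    # e.g. after an unquoted attempt raised a ProgrammingError.
    schema_ref = f'"{schema}"' if quoted else schema
    return f"show tables like '{table}' in schema {schema_ref}"
```

Keeping the quoted form in the except-clause retry, as the author describes, means the common case (plain schema names) still takes the fast path used by the other sql-common sources.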

)
except NotImplementedError:
logger.debug("Source does not support generating profile candidates.")
pass
Collaborator

No need for this pass here.

@@ -68,6 +69,7 @@ def __init__(self, config: SnowflakeConfig, ctx: PipelineContext):
super().__init__(config, ctx, "snowflake")
self._lineage_map: Optional[Dict[str, List[Tuple[str, str, str]]]] = None
self._external_lineage_map: Optional[Dict[str, Set[str]]] = None

Collaborator

Please remove this accidental change.

Collaborator Author

Done

FROM INFORMATION_SCHEMA.TABLES
WHERE LAST_ALTERED >= '{date}' AND TABLE_TYPE= 'BASE TABLE'
""".format(
date=datetime.strftime(threshold_time, "%Y-%m-%d %H:%M:%S.%f %z")
Collaborator

Please make the date format string a constant with a descriptive name. Reading it, we are not sure why this specific format is used. Is this specific to Snowflake or to this table, or is it ISO format?

Collaborator Author

Updating the code to use the Snowflake function to_timestamp_ltz, in line with its usage in the other timestamp queries in the Snowflake source.
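A minimal sketch of what the revised candidate query might look like after this change (the helper name `profile_candidates_query` is an assumption for illustration, not the actual DataHub code):

```python
from datetime import datetime, timedelta, timezone

def profile_candidates_query(updated_since_days: float) -> str:
    # Compute the cutoff in Python, but let Snowflake parse the timestamp via
    # to_timestamp_ltz rather than relying on a hand-crafted strftime format.
    threshold = datetime.now(timezone.utc) - timedelta(days=updated_since_days)
    return (
        "SELECT TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME\n"
        "FROM INFORMATION_SCHEMA.TABLES\n"
        f"WHERE LAST_ALTERED >= to_timestamp_ltz('{threshold.isoformat()}')\n"
        "AND TABLE_TYPE = 'BASE TABLE'"
    )
```

Delegating the parsing to to_timestamp_ltz sidesteps the reviewer's concern about an unexplained format constant: the query itself documents the intended timezone semantics.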

@@ -84,6 +84,11 @@ class GEProfilingConfig(ConfigModel):
description="A positive integer that specifies the maximum number of columns to profile for any table. `None` implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up.",
)

profile_if_updated_since_days: Optional[pydantic.PositiveFloat] = Field(
Collaborator

Please update the source report so this variable is present in the report; see the Snowflake source report for an example. Also, it would be great to have the intermediate variables introduced for the profile candidates in the report. This helps with debugging in a production environment.

Collaborator Author

I am adding the actual datetime as profile_if_updated_since: Optional[datetime]; it would be more useful than adding the config variable as-is.

@@ -84,6 +84,11 @@ class GEProfilingConfig(ConfigModel):
description="A positive integer that specifies the maximum number of columns to profile for any table. `None` implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up.",
)

profile_if_updated_since_days: Optional[pydantic.PositiveFloat] = Field(
Collaborator

Please add a note in the updating-datahub.md file, as this is a change in default behaviour.

text(
"""
SELECT TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
Collaborator

Does access to this table require any specific privilege? If yes, please ensure that it is documented in the Snowflake docs and that the provision_role block in the Snowflake source is updated.

Collaborator Author

According to my understanding, this does not require any additional privileges. The same table is already being accessed for fetching details using the snowflake-sqlalchemy dialect.

@anshbansal anshbansal self-assigned this Jun 10, 2022
@mayurinehate mayurinehate force-pushed the oss-selective_profiling_by_update_time branch from 4cd8cca to b232c15 Compare June 13, 2022 04:26
@mayurinehate mayurinehate force-pushed the oss-selective_profiling_by_update_time branch from b232c15 to def6bb9 Compare June 13, 2022 05:08
@mayurinehate mayurinehate requested a review from anshbansal June 13, 2022 05:09
Collaborator

@anshbansal anshbansal left a comment

LGTM

@anshbansal anshbansal merged commit 7b143b0 into datahub-project:master Jun 13, 2022
maggiehays pushed a commit to maggiehays/datahub that referenced this pull request Aug 1, 2022