-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(ingest): snowflake profile tables only if they have been updates… #5132
feat(ingest): snowflake profile tables only if they have been updates… #5132
Conversation
@@ -84,6 +84,11 @@ class GEProfilingConfig(ConfigModel): | |||
description="A positive integer that specifies the maximum number of columns to profile for any table. `None` implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up.", | |||
) | |||
|
|||
profile_if_updated_since_days: Optional[pydantic.PositiveFloat] = Field( | |||
default=1, | |||
description="Profile table only if it has been updated since these many number of days . `None` implies profile all tables. Only Snowflake supports this.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove space between days and .
in days .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
except ProgrammingError as pe: | ||
# Snowflake needs schema names quoted when fetching table comments. | ||
logger.debug( | ||
f"Encountered ProgrammingError. Retrying with quoted schema name for schema {schema} and view {view}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is required then why are we not doing this in the try
block itself? Can you please give and example of both cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I simply followed this code for getting table comments - https://github.com/datahub-project/datahub/blob/v0.8.38/metadata-ingestion/src/datahub/ingestion/source/sql/sql_common.py#L999-L1007 and modified it for view, because I was getting the same error for view.
I think, its better off in except clause since we wouldn't want to apply this for all other sql-common sources.
To give an example - for schema with name public
, quotes are not needed, whereas quotes are needed for schema with name test-schema
show tables like 'tblname' in schema public
show tables like 'tblname' in schema "test-schema"
) | ||
except NotImplementedError: | ||
logger.debug("Source does not support generating profile candidates.") | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for this pass
here.
@@ -68,6 +69,7 @@ def __init__(self, config: SnowflakeConfig, ctx: PipelineContext): | |||
super().__init__(config, ctx, "snowflake") | |||
self._lineage_map: Optional[Dict[str, List[Tuple[str, str, str]]]] = None | |||
self._external_lineage_map: Optional[Dict[str, Set[str]]] = None | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove this accidental change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
FROM INFORMATION_SCHEMA.TABLES | ||
WHERE LAST_ALTERED >= '{date}' AND TABLE_TYPE= 'BASE TABLE' | ||
""".format( | ||
date=datetime.strftime(threshold_time, "%Y-%m-%d %H:%M:%S.%f %z") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make the date string as a constant and add a name. Reading it we are not sure why this specific format is used. Is this specific to snowflake or this table or is this ISO format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updating the code to make use of snowflake function to_timestamp_ltz
in line with usage at other timestamp queries in snowflake source
@@ -84,6 +84,11 @@ class GEProfilingConfig(ConfigModel): | |||
description="A positive integer that specifies the maximum number of columns to profile for any table. `None` implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up.", | |||
) | |||
|
|||
profile_if_updated_since_days: Optional[pydantic.PositiveFloat] = Field( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update the source reports so this variable is present in the report. Please see the snowflake
source report for example. Also, for the intermediate variables introduced for the profile candidates would be great to have them in the report. This helps in debugging in a production environment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am adding actual datetime as profile_if_updated_since: Optional[datetime]
, it would be more useful than adding the config variable as is.
@@ -84,6 +84,11 @@ class GEProfilingConfig(ConfigModel): | |||
description="A positive integer that specifies the maximum number of columns to profile for any table. `None` implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up.", | |||
) | |||
|
|||
profile_if_updated_since_days: Optional[pydantic.PositiveFloat] = Field( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a note in updating-datahub.md
file as this is a change in default behaviour.
text( | ||
""" | ||
SELECT TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME | ||
FROM INFORMATION_SCHEMA.TABLES |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does access to this table require any specific privilege? If yes, please ensure that is added in snowflake docs and provision_role
block in snowflake is updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to my understanding, this does not require any additional priviledges. The same table is already being accessed for fetching details using snowflake sqlalchemy dialect.
4cd8cca
to
b232c15
Compare
b232c15
to
def6bb9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
… since N days
Checklist