feat(classifier): Add support for excluding list of exact column names #9472

ethan-cartwright
Contributor

Description

Extends the classification module's config to accept an ExcludeName list. The related datahub-classify PR uses this list to exclude exact column names from classification.

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

github-actions bot added the "ingestion" label (PR or Issue related to the ingestion of metadata) on Dec 17, 2023
Collaborator

mayurinehate left a comment

Minor changes suggested. Otherwise looks good.

@@ -66,6 +83,9 @@ class Config:
        description="Factors and their weights to consider when predicting info types",
        alias="prediction_factors_and_weights",
    )
    ExcludeName: Optional[ExcludeNameFactorConfig] = Field(
Collaborator

Suggested change
-   ExcludeName: Optional[ExcludeNameFactorConfig] = Field(
+   ExcludeName: Optional[List[str]] = Field(default=None, description="List of exact column names to exclude from classification for this info type")

Collaborator

We do not need a separate class, as it can be represented directly as a list of str.
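
As a rough illustration of this point, the sketch below uses a stdlib dataclass as a stand-in for the pydantic model in the PR. InfoTypeConfigSketch and the sample column names are hypothetical, chosen only to show that a plain optional list carries the same information as a wrapper class would.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InfoTypeConfigSketch:
    """Hypothetical stand-in for the per-info-type config (not the DataHub class)."""

    # A plain optional list of exact column names; None means "no exclusions".
    ExcludeName: Optional[List[str]] = None


# Recipe-like input maps straight onto the field, with no wrapper object needed.
raw = {"ExcludeName": ["email_sent", "email_received"]}
config = InfoTypeConfigSketch(**raw)

# Exact-name exclusion is then an ordinary membership test on the list.
is_excluded = "email_sent" in (config.ExcludeName or [])
```

Leaving the field out entirely yields None, which callers can treat the same as an empty list.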

},
}
).config
if config.info_types_config["Email_Address"].ExcludeName is not None:
Collaborator

You can remove this if/else by asserting the presence of ExcludeName directly.

Suggested change
-   if config.info_types_config["Email_Address"].ExcludeName is not None:
+   assert config.info_types_config["Email_Address"].ExcludeName is not None
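
One way to see why the direct assertion is preferable: a check guarded by an if passes vacuously when the field was never parsed, while a direct assertion fails the test. The sketch below is illustrative only; guarded_check and direct_check are hypothetical helpers, not code from the PR's test file.

```python
from typing import List, Optional


def guarded_check(exclude_name: Optional[List[str]], expected: List[str]) -> None:
    # If exclude_name is None, the body is skipped and nothing is verified.
    if exclude_name is not None:
        assert exclude_name == expected


def direct_check(exclude_name: Optional[List[str]], expected: List[str]) -> None:
    # Asserting presence first turns a missing field into a test failure.
    assert exclude_name is not None
    assert exclude_name == expected
```

The direct assertion also lets a type checker narrow Optional[List[str]] to List[str] for the lines that follow.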

},
}
).config
assert config.info_types_config["Email_Address"].ExcludeName is None
Collaborator

Thank you for adding tests.

@@ -73,6 +73,14 @@ class Config:
        description="Factors and their weights to consider when predicting info types",
        alias="prediction_factors_and_weights",
    )
    StripExclusionFormatting: bool = Field(
        default=True, alias="strip_exclusion_formatting"
    )
Collaborator

StripExclusionFormatting is not an info-type-level config. As per the equivalent datahub-classify PR, it is a global setting that is set only once across all info types. Did you intend this global behavior? If so, strip_exclusion_formatting should be in the DataHubClassifierConfig class in the same file. Also, there is no need to name the field StripExclusionFormatting with the alias strip_exclusion_formatting here. You can name the field strip_exclusion_formatting directly.

Contributor Author

done!
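
The global-versus-per-info-type split discussed in this thread can be sketched as follows, with stdlib dataclasses standing in for the pydantic models. All class and function names here are hypothetical, and the normalization rule (lowercase, drop non-alphanumeric characters) is an assumption about what "strip formatting" means, not the confirmed datahub-classify behavior.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class InfoTypeSketch:
    # Per-info-type setting: exact column names to skip for this info type.
    exclude_name: Optional[List[str]] = None


@dataclass
class ClassifierConfigSketch:
    # Global setting: defined once and applied across all info types.
    strip_exclusion_formatting: bool = True
    info_types_config: Dict[str, InfoTypeSketch] = field(default_factory=dict)


def _normalize(name: str) -> str:
    # Assumed normalization: lowercase, then drop anything that is not a-z or 0-9.
    return re.sub(r"[^a-z0-9]", "", name.lower())


def is_excluded(column: str, info_type: str, config: ClassifierConfigSketch) -> bool:
    names = config.info_types_config[info_type].exclude_name or []
    if config.strip_exclusion_formatting:
        return _normalize(column) in {_normalize(n) for n in names}
    return column in names
```

Keeping the flag on the top-level config means a single switch governs how every info type's exclusion list is compared.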

    StripExclusionFormatting: bool = Field(
        default=True, alias="strip_exclusion_formatting"
    )
    ExcludeName: Optional[List[str]] = Field(
Collaborator

@ethan-cartwright could you please also update classification.md with these changes? Unfortunately, the config docs are not updated automatically yet.

Contributor Author

done!

Collaborator

mayurinehate left a comment

LGTM, except one small comment regarding log levels on datahub-classify PR

@ethan-cartwright
Contributor Author

> LGTM, except one small comment regarding log levels on datahub-classify PR

done in this PR: acryldata/datahub-classify#21

darnaut merged commit dfb2f7e into datahub-project:master on Jan 17, 2024 (52 checks passed)