Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add ability to filter GCP project ingestion based on project labels #10242

Conversation

anaghshineh
Copy link
Contributor

@anaghshineh anaghshineh commented Apr 8, 2024

What Changed?

We've wanted to have more control over project ingestion at my organization. We've been maintaining a "whitelist" of projects we'd like to ingest, but this is becoming unmanageable as we create more projects. We'd like to leverage consistent project labels in GCP to drive ingestion. This PR takes a first pass at allowing folks to define project_labels in their BigQuery ingestion recipes. It uses the GCP Resource Manager search_projects method to query for projects with the given labels. This does not introduce any breaking changes, as ingestion will continue to favor project_ids & will respect any name patterns.

Example:

source:
    type: bigquery
    config:
        project_labels:
            - "team:data"
            - "domain:*"
        credential:
            project_id: '${BIGQUERY_INGESTION_PROJECT}'
            client_email: '${SERVICE_ACCOUNT_CLIENT_EMAIL}'
            private_key_id: '${BIGQUERY_PRIVATE_KEY_ID}'
            private_key: '${BIGQUERY_PRIVATE_KEY}'
            client_id: '${SERVICE_ACCOUNT_UNIQUE_ID}'
        project_id_pattern:
            deny:
                - '^sys-[0-9]{26}$'
sink:
    type: datahub-rest
    config:
        server: '${DATAHUB_BASE_URL}'
        token: '${DATAHUB_ACCESS_TOKEN}'
        max_threads: 20

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added ingestion PR or Issue related to the ingestion of metadata devops PR or Issue related to DataHub backend & deployment community-contribution PR or Issue raised by member(s) of DataHub Community labels Apr 8, 2024
@@ -69,6 +69,9 @@ def __init__(self, **data: Any):
def get_bigquery_client(self) -> bigquery.Client:
client_options = self.extra_client_options
return bigquery.Client(self.project_on_behalf, **client_options)

def get_projects_client(self) -> resourcemanager_v3.ProjectsClient:
return resourcemanager_v3.ProjectsClient()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know if this client needs extra permission, or does it not need any new permission?
If it is needed, then can we make sure not to create this client if, in the config, it is not set to use it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I initially implemented this PR used the list_projects method but landed on the search_projects one due to its flexibility in searching for projects based on parents AND labels. Looking at GCP docs, seems like you'll need to ensure the SA has resourcemanager.projects.get on the projects you want to grab.

Is that your read, too?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anaghshineh There is a easy way to test this out by creating a service account without this permission and try to run this code. If this does indeed require a new permission then please

  • update the docs with the new roles/permissions required
  • make this an optional thing disabled by default. We don't want every orgs ingestion to break due to this new permission. Some orgs have heavier checks on getting new permissions. We don't want them to get stuck on permissions when this gets in.
  • Please add in https://github.com/datahub-project/datahub/blob/master/docs/how/updating-datahub.md that there is new option that requires new permissions

As this is open source we have to be careful w.r.t. not breaking other orgs ingestion as much as possible.

Alice Naghshineh added 2 commits April 17, 2024 09:58
@@ -1,16 +1,17 @@
# Generated requirements file. Run ./regenerate-base-requirements.sh to regenerate.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't make changes in this file. This file is updated separately. This is mainly a caching mechanism for our builds.

Comment on lines +344 to +345
proj_client: resourcemanager_v3.Client = config.get_projects_client()
assert proj_client
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please put this under a config disabled by default. We don't want folks ingestions to break due to new permissions required as mentioned in other comment.

@shirshanka shirshanka added the on-deck PR or Issue that will be reviewed and/or addressed by the DataHub Maintainers in future cycles label Jun 28, 2024
@hsheth2
Copy link
Collaborator

hsheth2 commented Sep 27, 2024

Closing as #11169 was merged

@hsheth2 hsheth2 closed this Sep 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community-contribution PR or Issue raised by member(s) of DataHub Community devops PR or Issue related to DataHub backend & deployment ingestion PR or Issue related to the ingestion of metadata on-deck PR or Issue that will be reviewed and/or addressed by the DataHub Maintainers in future cycles
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants