fix(ingest) Azure AD: support nested groups (datahub-project#4367) (datahub-project#4368)

LGTM - Thanks!
cccs-eric authored and aditya-radhakrishnan committed Mar 14, 2022
1 parent 21d3349 commit 6268d04
Showing 11 changed files with 464 additions and 70 deletions.
4 changes: 2 additions & 2 deletions docker/datahub-frontend/README.md
@@ -23,8 +23,8 @@ You can sign in with `datahub` as username and password.

## Build instructions

If you want to build the `datahub-frontend` Docker image yourself, you can run this command from the root directory of the DataHub repository you have locally:
If you want to build the `datahub-frontend` Docker image yourself, you can run this command from the root directory of the DataHub repository you have locally (using Buildkit):

`docker build -t your_datahub_frontend -f ./docker/datahub-frontend/Dockerfile .`
`DOCKER_BUILDKIT=1 docker build -t your_datahub_frontend -f ./docker/datahub-frontend/Dockerfile .`

Please note the final `.` and that the tag `your_datahub_frontend` is determined by you.
4 changes: 2 additions & 2 deletions docs/how/auth/sso/configure-oidc-react-azure.md
@@ -85,7 +85,7 @@ AUTH_OIDC_CLIENT_ID=your-client-id
AUTH_OIDC_CLIENT_SECRET=your-client-secret
AUTH_OIDC_DISCOVERY_URI=https://login.microsoftonline.com/{tenant ID}/v2.0/.well-known/openid-configuration
AUTH_OIDC_BASE_URL=your-datahub-url
AUTH_OIDC_SCOPE="openid profile email groups"
AUTH_OIDC_SCOPE="openid profile email"
```

Replacing the placeholders above with the client id (step 5), client secret (step 3) and tenant ID (step 6) received from Microsoft Azure.
@@ -101,4 +101,4 @@ docker-compose -p datahub -f docker-compose.yml -f docker-compose.override.yml
Navigate to your DataHub domain to see SSO in action.

## Resources
- [OAuth 2.0 and OpenID Connect Overview](https://developer.okta.com/docs/concepts/oauth-openid/)
- [Microsoft identity platform and OpenID Connect protocol](https://docs.microsoft.com/en-us/azure/active-directory/develop/v2-protocols-oidc/)
2 changes: 0 additions & 2 deletions docs/how/auth/sso/configure-oidc-react.md
@@ -149,8 +149,6 @@ AUTH_OIDC_GROUPS_CLAIM=<your-groups-claim-name>

- `AUTH_OIDC_JIT_PROVISIONING_ENABLED`: Whether DataHub users & groups should be provisioned on login if they do not exist. Defaults to true.
- `AUTH_OIDC_PRE_PROVISIONING_REQUIRED`: Whether the user should already exist in DataHub when they login, failing login if they are not. This is appropriate for situations in which users and groups are batch ingested and tightly controlled inside your environment. Defaults to false.
the userNameClaim field will contain an email address, and we want to omit the domain name suffix of the email, we can specify a custom
regex to do so. (e.g. `([^@]+)`)
- `AUTH_OIDC_EXTRACT_GROUPS_ENABLED`: Only applies if `AUTH_OIDC_JIT_PROVISIONING_ENABLED` is set to true. This determines whether we should attempt to extract a list of group names from a particular claim in the OIDC attributes. Note that if this is enabled, each login will re-sync group membership with the groups in your Identity Provider, clearing the group membership that has been assigned through the DataHub UI. Enable with care! Defaults to false.
- `AUTH_OIDC_GROUPS_CLAIM`: Only applies if `AUTH_OIDC_EXTRACT_GROUPS_ENABLED` is set to true. This determines which OIDC claim will contain a list of string group names. Defaults to 'groups'
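
To illustrate what the two group settings above describe, here is a minimal Python sketch of reading a list of group names from a configurable claim. It is not DataHub's implementation (the frontend is not written in Python); `extract_group_names` and the sample claims are hypothetical and used only for illustration.

```python
from typing import Any, Dict, List


def extract_group_names(claims: Dict[str, Any], groups_claim: str = "groups") -> List[str]:
    """Hypothetical helper: return the string group names found under `groups_claim`."""
    value = claims.get(groups_claim)
    # The claim is expected to hold a list of strings; anything else yields no groups.
    if not isinstance(value, list):
        return []
    return [group for group in value if isinstance(group, str)]


# Sample claims (made up for this sketch).
default_claims = {"sub": "jdoe", "groups": ["analysts", "admins"]}
custom_claims = {"sub": "jdoe", "roles": ["analysts"]}

default_groups = extract_group_names(default_claims)
custom_groups = extract_group_names(custom_claims, groups_claim="roles")
missing_groups = extract_group_names({"sub": "jdoe"})
```

Because each login re-syncs membership from the claim, a user whose token carries no groups claim would end up with no group membership under this model, which is why the setting should be enabled with care.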

2 changes: 1 addition & 1 deletion metadata-ingestion/source_docs/azure-ad.md
@@ -119,7 +119,7 @@ Note that a `.` is used to denote nested fields in the YAML configuration block.
| `client_secret` | string || | Client secret. Found in your app registration on Azure AD Portal |
| `redirect` | string || | Redirect URI. Found in your app registration on Azure AD Portal |
| `authority` | string || | The [authority](https://docs.microsoft.com/en-us/azure/active-directory/develop/msal-client-application-configuration) is a URL that indicates a directory that MSAL can request tokens from. |
| `token_url` | string || | The token URL that acquires a token from Azure AD for authorizing requests |
| `token_url` | string || | The token URL that acquires a token from Azure AD for authorizing requests. This source will only work with v1.0 endpoint. |
| `graph_url` | string | ✅ | | [Microsoft Graph API endpoint](https://docs.microsoft.com/en-us/graph/use-the-api)
| `ingest_users` | bool | | `True` | Whether users should be ingested into DataHub. |
| `ingest_groups` | bool | | `True` | Whether groups should be ingested into DataHub. |
125 changes: 75 additions & 50 deletions metadata-ingestion/src/datahub/ingestion/source/identity/azure_ad.py
@@ -2,6 +2,7 @@
import logging
import re
import urllib
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Dict, Generator, Iterable, List

@@ -10,6 +11,7 @@

from datahub.configuration import ConfigModel
from datahub.configuration.common import AllowDenyPattern
from datahub.emitter.mce_builder import make_group_urn, make_user_urn
from datahub.ingestion.api.common import PipelineContext
from datahub.ingestion.api.source import Source, SourceReport
from datahub.ingestion.api.workunit import MetadataWorkUnit
@@ -58,13 +60,20 @@ class AzureADConfig(ConfigModel):
users_pattern: AllowDenyPattern = AllowDenyPattern.allow_all()
groups_pattern: AllowDenyPattern = AllowDenyPattern.allow_all()

# If enabled, report will contain names of filtered users and groups.
filtered_tracking: bool = True


@dataclass
class AzureADSourceReport(SourceReport):
filtered: List[str] = field(default_factory=list)
filtered_tracking: bool = field(default=True, repr=False)
filtered_count: int = field(default=0)

def report_filtered(self, name: str) -> None:
self.filtered.append(name)
self.filtered_count += 1
if self.filtered_tracking:
self.filtered.append(name)
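
Since the indentation of this hunk is lost in the page text, the reporting change can be sketched as a self-contained snippet. `FilteredReport` is a hypothetical stand-in for `AzureADSourceReport` that omits the `SourceReport` base class; the field names follow the diff.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FilteredReport:
    """Hypothetical stand-in for AzureADSourceReport (no SourceReport base)."""

    filtered: List[str] = field(default_factory=list)
    # repr=False keeps the flag itself out of the printed report.
    filtered_tracking: bool = field(default=True, repr=False)
    filtered_count: int = field(default=0)

    def report_filtered(self, name: str) -> None:
        # The counter is always updated; names are stored only when tracking is on.
        self.filtered_count += 1
        if self.filtered_tracking:
            self.filtered.append(name)


tracked = FilteredReport()
untracked = FilteredReport(filtered_tracking=False)
for name in ["group-a", "group-b"]:
    tracked.report_filtered(name)
    untracked.report_filtered(name)
```

With tracking disabled the report still says how many entries were filtered, but it no longer grows with one string per filtered user or group.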


# Source that extracts Azure AD users, groups and group memberships using Microsoft Graph REST API
@@ -81,7 +90,9 @@ def create(cls, config_dict, ctx):
def __init__(self, config: AzureADConfig, ctx: PipelineContext):
super().__init__(ctx)
self.config = config
self.report = AzureADSourceReport()
self.report = AzureADSourceReport(
filtered_tracking=self.config.filtered_tracking
)
self.token_data = {
"grant_type": "client_credentials",
"client_id": self.config.client_id,
@@ -91,6 +102,8 @@ def __init__(self, config: AzureADConfig, ctx: PipelineContext):
"scope": "https://graph.microsoft.com/.default",
}
self.token = self.get_token()
self.selected_azure_ad_groups: list = []
self.azure_ad_groups_users: list = []

def get_token(self):
token_response = requests.post(self.config.token_url, data=self.token_data)
@@ -107,9 +120,6 @@ def get_token(self):
click.echo("Error: Token response invalid")
exit()

selected_azure_ad_groups: list = []
azure_ad_groups_users: list = []

def get_workunits(self) -> Iterable[MetadataWorkUnit]:
# for future developers: The actual logic of this ingestion wants to be executed, in order:
# 1) the groups
@@ -133,52 +143,26 @@ def get_workunits(self) -> Iterable[MetadataWorkUnit]:
yield wu

# Populate GroupMembership Aspects for CorpUsers
datahub_corp_user_urn_to_group_membership: Dict[str, GroupMembershipClass] = {}
datahub_corp_user_urn_to_group_membership: Dict[
str, GroupMembershipClass
] = defaultdict(lambda: GroupMembershipClass(groups=[]))
if (
self.config.ingest_group_membership
and len(self.selected_azure_ad_groups) > 0
):
# 2) the groups' membership
for azure_ad_group in self.selected_azure_ad_groups:
# Azure supports nested groups, but not DataHub. We need to explode the nested groups into a flat list.
datahub_corp_group_urn = self._map_azure_ad_group_to_urn(azure_ad_group)
if not datahub_corp_group_urn:
error_str = "Failed to extract DataHub Group Name from Azure AD Group named {}. Skipping...".format(
azure_ad_group.get("displayName")
)
error_str = f"Failed to extract DataHub Group Name from Azure AD Group named {azure_ad_group.get('displayName')}. Skipping..."
self.report.report_failure("azure_ad_group_mapping", error_str)
continue
# Extract and map users for each group
for azure_ad_group_users in self._get_azure_ad_group_users(
azure_ad_group
):
# if group doesn't have any members, continue
if not azure_ad_group_users:
continue
for azure_ad_user in azure_ad_group_users:
datahub_corp_user_urn = self._map_azure_ad_user_to_urn(
azure_ad_user
)
if not datahub_corp_user_urn:
error_str = "Failed to extract DataHub Username from Azure ADUser {}. Skipping...".format(
azure_ad_user.get("displayName")
)
self.report.report_failure(
"azure_ad_user_mapping", error_str
)
continue
self.azure_ad_groups_users.append(azure_ad_user)
# update/create the GroupMembership aspect for this group member.
if (
datahub_corp_user_urn
in datahub_corp_user_urn_to_group_membership
):
datahub_corp_user_urn_to_group_membership[
datahub_corp_user_urn
].groups.append(datahub_corp_group_urn)
else:
datahub_corp_user_urn_to_group_membership[
datahub_corp_user_urn
] = GroupMembershipClass(groups=[datahub_corp_group_urn])
self._add_group_members_to_group_membership(
datahub_corp_group_urn,
azure_ad_group,
datahub_corp_user_urn_to_group_membership,
)

if (
self.config.ingest_groups_users
@@ -205,6 +189,53 @@ def get_workunits(self) -> Iterable[MetadataWorkUnit]:
datahub_corp_user_urn_to_group_membership,
)

def _add_group_members_to_group_membership(
self,
parent_corp_group_urn: str,
azure_ad_group: dict,
user_urn_to_group_membership: Dict[str, GroupMembershipClass],
) -> None:
# Extract and map members for each group
for azure_ad_group_members in self._get_azure_ad_group_members(azure_ad_group):
# if group doesn't have any members, continue
if not azure_ad_group_members:
continue
for azure_ad_member in azure_ad_group_members:
odata_type = azure_ad_member.get("@odata.type")
if odata_type == "#microsoft.graph.user":
self._add_user_to_group_membership(
parent_corp_group_urn,
azure_ad_member,
user_urn_to_group_membership,
)
elif odata_type == "#microsoft.graph.group":
# Since DataHub does not support nested group, we add the members to the parent group and not the nested one.
self._add_group_members_to_group_membership(
parent_corp_group_urn,
azure_ad_member,
user_urn_to_group_membership,
)
else:
raise ValueError(
f"Unsupported @odata.type '{odata_type}' found in Azure group member"
)

def _add_user_to_group_membership(
self,
group_urn: str,
azure_ad_user: dict,
user_urn_to_group_membership: Dict[str, GroupMembershipClass],
) -> None:
user_urn = self._map_azure_ad_user_to_urn(azure_ad_user)
if not user_urn:
error_str = f"Failed to extract DataHub Username from Azure ADUser {azure_ad_user.get('displayName')}. Skipping..."
self.report.report_failure("azure_ad_user_mapping", error_str)
else:
self.azure_ad_groups_users.append(azure_ad_user)
# update/create the GroupMembership aspect for this group member.
if group_urn not in user_urn_to_group_membership[user_urn].groups:
user_urn_to_group_membership[user_urn].groups.append(group_urn)
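
The flattening performed by the two new methods can be sketched without any network access. The snippet below is an illustration under stated assumptions, not the source's code: `MEMBERS` is a made-up stand-in for the `/groups/{id}/members` responses, `add_members` is a hypothetical simplification, and the `seen` set is an extra guard against cyclic nesting that the commit itself does not include.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set

# Made-up stand-in for Graph API /groups/{id}/members responses.
MEMBERS: Dict[str, List[dict]] = {
    "g-parent": [
        {"@odata.type": "#microsoft.graph.user", "id": "alice"},
        {"@odata.type": "#microsoft.graph.group", "id": "g-child"},
    ],
    "g-child": [
        {"@odata.type": "#microsoft.graph.user", "id": "bob"},
        {"@odata.type": "#microsoft.graph.user", "id": "alice"},
        # Cycle back to the parent, to exercise the guard below.
        {"@odata.type": "#microsoft.graph.group", "id": "g-parent"},
    ],
}


def add_members(
    parent_urn: str,
    group: dict,
    membership: Dict[str, List[str]],
    seen: Optional[Set[str]] = None,
) -> None:
    """Attach every user reachable from `group` to `parent_urn`."""
    seen = set() if seen is None else seen
    if group["id"] in seen:  # assumption: cycle guard, not part of the commit
        return
    seen.add(group["id"])
    for member in MEMBERS.get(group["id"], []):
        odata_type = member.get("@odata.type")
        if odata_type == "#microsoft.graph.user":
            user_urn = f"urn:li:corpuser:{member['id']}"
            # Avoid duplicates when a user is reachable through several paths.
            if parent_urn not in membership[user_urn]:
                membership[user_urn].append(parent_urn)
        elif odata_type == "#microsoft.graph.group":
            # Nested group: its users are credited to the top-level parent.
            add_members(parent_urn, member, membership, seen)
        else:
            raise ValueError(f"Unsupported @odata.type '{odata_type}'")


membership: Dict[str, List[str]] = defaultdict(list)
add_members("urn:li:corpGroup:parent", {"id": "g-parent"}, membership)
```

As in the diff, the `defaultdict` removes the need for the old "update or create" branch, and the membership check keeps a user from being listed twice under the same parent group.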

def ingest_ad_users(
self,
datahub_corp_user_snapshots: Generator[CorpUserSnapshot, Any, None],
@@ -240,7 +271,7 @@ def _get_azure_ad_groups(self) -> Iterable[List]:
def _get_azure_ad_users(self) -> Iterable[List]:
yield from self._get_azure_ad_data(kind="/users")

def _get_azure_ad_group_users(self, azure_ad_group: dict) -> Iterable[List]:
def _get_azure_ad_group_members(self, azure_ad_group: dict) -> Iterable[List]:
group_id = azure_ad_group.get("id")
kind = f"/groups/{group_id}/members"
yield from self._get_azure_ad_data(kind=kind)
@@ -332,7 +363,7 @@ def _map_azure_ad_group_to_urn(self, azure_ad_group):
return None
# URL-encode the group name so that spaces and special characters are safe in the URN
url_encoded_group_name = urllib.parse.quote(group_name)
return self._make_corp_group_urn(url_encoded_group_name)
return make_group_urn(url_encoded_group_name)

def _map_azure_ad_group_to_group_name(self, azure_ad_group):
return self._extract_regex_match_from_dict_value(
@@ -371,7 +402,7 @@ def _map_azure_ad_user_to_urn(self, azure_ad_user):
user_name = self._map_azure_ad_user_to_user_name(azure_ad_user)
if not user_name:
return None
return self._make_corp_user_urn(user_name)
return make_user_urn(user_name)

def _map_azure_ad_user_to_corp_user(self, azure_ad_user):
full_name = (
@@ -390,12 +421,6 @@ def _map_azure_ad_user_to_corp_user(self, azure_ad_user):
countryCode=azure_ad_user.get("mobilePhone", None),
)

def _make_corp_group_urn(self, groupname: str) -> str:
return f"urn:li:corpGroup:{groupname}"

def _make_corp_user_urn(self, username: str) -> str:
return f"urn:li:corpuser:{username}"
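
For reference, the URN shapes produced by the two removed helpers, combined with the URL-encoding step applied to group names, can be reproduced as below. `corp_group_urn` and `corp_user_urn` are local, hypothetical stand-ins; the commit replaces the private helpers with `make_group_urn` and `make_user_urn` from `datahub.emitter.mce_builder`, whose exact behavior is not shown in this diff.

```python
import urllib.parse


def corp_group_urn(group_name: str) -> str:
    """Hypothetical stand-in: URL-encode the name, then build a corpGroup URN."""
    return f"urn:li:corpGroup:{urllib.parse.quote(group_name)}"


def corp_user_urn(user_name: str) -> str:
    """Hypothetical stand-in mirroring the removed _make_corp_user_urn."""
    return f"urn:li:corpuser:{user_name}"


group_urn = corp_group_urn("Data Platform Team")
user_urn = corp_user_urn("jdoe")
```

Here `group_urn` is `urn:li:corpGroup:Data%20Platform%20Team`: `urllib.parse.quote` percent-encodes the spaces rather than replacing them with underscores.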

def _extract_regex_match_from_dict_value(
self, str_dict: Dict[str, str], key: str, pattern: str
) -> str:
@@ -1,6 +1,5 @@
[
{
"@odata.id": "https://graph.microsoft.com/v2/00000000-0000-0000-0000-000000000000/directoryObjects/005203a5-d73b-4b21-8077-d96ee309b454/Microsoft.DirectoryServices.Group",
"id": "00000000-0000-0000-0000-000000000000",
"deletedDateTime": null,
"classification": null,
@@ -35,7 +34,6 @@
"onPremisesProvisioningErrors": []
},
{
"@odata.id": "https://graph.microsoft.com/v2/00000000-0000-0000-0000-000000000001/directoryObjects/005203a5-d73b-4b21-8077-d96ee309b454/Microsoft.DirectoryServices.Group",
"id": "00000000-0000-0000-0000-0000000000001",
"deletedDateTime": null,
"classification": null,