feat(ingestion) ldap: make ldap attrs keys configurable #4682

Merged
Changes from 14 commits
31 commits
cab247d
feat(ingestion) ldap: make ldap atttrs keys configurable (#4599)
atulsaurav Apr 17, 2022
5e50969
test(ingestion) ldap: Add test case for configurable ldap attrs
atulsaurav Apr 17, 2022
6f5fbab
Add missing coma
atulsaurav Apr 17, 2022
d0099e5
fix: ldap attrs_mapping defaults & test case
atulsaurav Apr 17, 2022
69dfbb3
test(ingestion): fix attrs_mapping in test case
atulsaurav Apr 17, 2022
3b5c051
fix membership test
atulsaurav Apr 17, 2022
0e0946a
Merge branch 'master' into configurable-ldap-ingestion
atulsaurav Jun 6, 2022
7680041
merge upstream changes in prior changes
atulsaurav Jun 7, 2022
bfcc13b
Merge branch 'master' into configurable-ldap-ingestion
atulsaurav Jun 7, 2022
ff8b3c3
Merge branch 'master' into configurable-ldap-ingestion
atulsaurav Jun 7, 2022
015a517
Merge branch 'master' into configurable-ldap-ingestion
atulsaurav Jun 8, 2022
b736547
Doc changes for attrs_mapping between LDAP and DH concepts
atulsaurav Jun 9, 2022
5132e75
Merge branch 'master' into configurable-ldap-ingestion
atulsaurav Jun 9, 2022
7560c03
Update ldap.md
jjoyce0510 Jun 10, 2022
f485349
Changes based on review comments
atulsaurav Jun 20, 2022
3955951
Merge branch 'master' of github.com:datahub-project/datahub into data…
atulsaurav Jun 20, 2022
e077971
Merge branch 'datahub-project-master' into configurable-ldap-ingestion
atulsaurav Jun 20, 2022
474ac7c
Merge branch 'configurable-ldap-ingestion' of github.com:atulsaurav/d…
atulsaurav Jun 20, 2022
4df5568
fix f-string linting error
atulsaurav Jun 20, 2022
74175e0
Fix guess_person_ldap changes from upstream
atulsaurav Jun 20, 2022
68add91
Fix group membership test
atulsaurav Jun 20, 2022
44eebfe
Update docs
atulsaurav Jun 20, 2022
e9aa404
fix default for group description field
atulsaurav Jun 20, 2022
72fad69
Remove breaking change related to Department info
atulsaurav Jun 20, 2022
0fbb294
fix handling of departmentId
atulsaurav Jun 21, 2022
49f14ee
Merge branch 'datahub-project:master' into configurable-ldap-ingestion
atulsaurav Jun 21, 2022
26aa509
Merge branch 'master' into configurable-ldap-ingestion
atulsaurav Jun 21, 2022
d9de05c
Split `attrs_mapping` into `user_attrs_map` & `group_attrs_map`
atulsaurav Jun 21, 2022
9ffb3af
add missing group email attribute
atulsaurav Jun 21, 2022
ec943a4
Merge branch 'configurable-ldap-ingestion' of github.com:atulsaurav/d…
atulsaurav Jun 21, 2022
d3c69b5
Update ldap.md
jjoyce0510 Jun 21, 2022
64 changes: 55 additions & 9 deletions metadata-ingestion/archived/source_docs/ldap.md
@@ -34,6 +34,27 @@ source:
# Options
base_dn: "dc=example,dc=org"

# Optional attribute mapping to allow ldap config differences across orgs
attrs_mapping:
urn: sAMAccountName

# user related attrs
fullName: cn
lastName: sn
firstName: givenName
displayName: displayName
manager: manager
mail: mail
departmentNumber: departmentNumber
title: title

# group related attrs
group_urn: cn
owner: owner
managedBy: managedBy
uniqueMember: uniqueMember
member: member

sink:
# sink configs
```
@@ -42,20 +63,45 @@ sink:

Note that a `.` is used to denote nested fields in the YAML recipe.

| Field | Required | Default | Description |
| ------------------------------ | -------- | ------------------- | ----------------------------------------------------------------------- |
| `ldap_server` | ✅ | | LDAP server URL. |
| `ldap_user` | ✅ | | LDAP user. |
| `ldap_password` | ✅ | | LDAP password. |
| `base_dn` | ✅ | | LDAP DN. |
| `filter` | | `"(objectClass=*)"` | LDAP extractor filter. |
| `drop_missing_first_last_name` | | `True` | If set to true, any users without first and last names will be dropped. |
| `page_size` | | `20` | Size of each page to fetch when extracting metadata. |
| Field | Required | Default | Description |
| -------------------------------- | -------- | ------------------- | ------------------------------------------------------------------------------------------ |
| `ldap_server` | ✅ | | LDAP server URL. |
| `ldap_user` | ✅ | | LDAP user. |
| `ldap_password` | ✅ | | LDAP password. |
| `base_dn` | ✅ | | LDAP DN. |
| `filter` | | `"(objectClass=*)"` | LDAP extractor filter. |
| `drop_missing_first_last_name` | | `True` | If set to true, any users without first and last names will be dropped. |
| `page_size` | | `20` | Size of each page to fetch when extracting metadata. |
| `attrs_mapping.urn` | | `sAMAccountName` | An attribute to use in constructing the DataHub User urn. This should be something that uniquely identifies the user and is stable over time. |
| `attrs_mapping.manager` | | `manager` | Alternate attrs key representing same information as manager in the organization. |
| `attrs_mapping.firstName` | | `givenName` | Alternate attrs key representing same information as givenName in the organization. |
| `attrs_mapping.lastName` | | `sn` | Alternate attrs key representing same information as sn in the organization. |
| `attrs_mapping.fullName` | | `cn` | Alternate attrs key representing same information as cn in the organization. |
| `attrs_mapping.mail` | | `mail` | Alternate attrs key representing same information as mail in the organization. |
| `attrs_mapping.displayName` | | `displayName` | Alternate attrs key representing same information as displayName in the organization. |
| `attrs_mapping.departmentNumber` | | `departmentNumber` | Alternate attrs key representing same information as departmentNumber in the organization. |
| `attrs_mapping.title` | | `title` | Alternate attrs key representing same information as title in the organization. |
| `attrs_mapping.group_urn` | | `cn` | Alternate attrs key representing same information as cn for the LDAP group. |
| `attrs_mapping.owner` | | `owner` | Alternate attrs key representing same information as owner in the organization. |
| `attrs_mapping.managedBy` | | `managedBy` | Alternate attrs key representing same information as managedBy in the organization. |
| `attrs_mapping.uniqueMember` | | `uniqueMember` | Alternate attrs key representing same information as uniqueMember in the organization. |
| `attrs_mapping.member` | | `member` | Alternate attrs key representing same information as member in the organization. |

Set `drop_missing_first_last_name` to true if you have many "headless" LDAP user accounts
for devices or services that should be excluded when they do not contain a first and last name. This only
impacts the ingestion of LDAP users; LDAP groups are unaffected by this config option.
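
As a rough illustration of the check described above, the following Python sketch decides whether a user entry would be dropped. The helper name `should_drop_user` and the sample entries are hypothetical and are not part of the LDAP source; attribute values are modelled as lists of bytes, which is an assumption about how the LDAP client returns them.

```python
from typing import Any, Dict


def should_drop_user(
    attrs: Dict[str, Any],
    drop_missing_first_last_name: bool = True,
    first_name_key: str = "givenName",
    last_name_key: str = "sn",
) -> bool:
    """Hypothetical helper: True when the entry lacks a first or last name."""
    if not drop_missing_first_last_name:
        return False
    return first_name_key not in attrs or last_name_key not in attrs


# Sample entries, made up for illustration.
person = {"cn": [b"Homer Simpson"], "givenName": [b"Homer"], "sn": [b"Simpson"]}
service_account = {"cn": [b"backup-agent"]}

print(should_drop_user(person))           # False
print(should_drop_user(service_account))  # True
```

Only user entries would pass through a check like this; group entries are handled separately and are not filtered by it.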

### Configurable LDAP

Every organization may implement LDAP slightly differently based on its needs. This makes a standard LDAP recipe ineffective, because data can go missing during LDAP ingestion. For instance, the LDAP recipe assumes that department information for a CorpUser is present in the `departmentNumber` attribute. If an organization chose not to implement that attribute, or captures similar information in the `department` attribute instead, that information can be missed during LDAP ingestion (even though it may be present in LDAP in a slightly different form). The LDAP source lets you provide an optional mapping for such variations under `attrs_mapping`. So if an organization represents `departmentNumber` as `department` and `mail` as `email`, the recipe can be adapted to customize that mapping. An example is shown below. If the `attrs_mapping` section is not provided, the default mapping applies.

```yaml
# in config section
attrs_mapping:
departmentNumber: department
mail: email
```
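
The fallback to defaults described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the source's implementation: `DEFAULT_ATTRS_MAPPING` lists only a subset of the defaults from the table, and `resolve_attrs_mapping` is a hypothetical helper name.

```python
from typing import Dict

# Subset of the defaults listed in the table above (illustrative only).
DEFAULT_ATTRS_MAPPING: Dict[str, str] = {
    "urn": "sAMAccountName",
    "mail": "mail",
    "departmentNumber": "departmentNumber",
    "title": "title",
}


def resolve_attrs_mapping(overrides: Dict[str, str]) -> Dict[str, str]:
    """Hypothetical helper: recipe overrides win, defaults fill the gaps."""
    return {**DEFAULT_ATTRS_MAPPING, **overrides}


mapping = resolve_attrs_mapping({"departmentNumber": "department", "mail": "email"})
print(mapping["departmentNumber"])  # department
print(mapping["mail"])              # email
print(mapping["title"])             # title
```

Keys that the recipe does not mention keep their default attribute names, so a partial `attrs_mapping` is enough to cover the attributes that differ in a given organization.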

## Compatibility

Coming soon!
102 changes: 75 additions & 27 deletions metadata-ingestion/src/datahub/ingestion/source/ldap.py
@@ -24,6 +24,28 @@
CorpUserSnapshotClass,
)

# default mapping for attrs
# general attrs
attrs_mapping: Dict[str, Any] = {}
attrs_mapping["urn"] = "sAMAccountName"
Collaborator

So this is the attribute used to construct the urn?

Contributor Author

yes, thanks for clarifying that in the docs!


# user related attrs
attrs_mapping["fullName"] = "cn"
attrs_mapping["lastName"] = "sn"
attrs_mapping["firstName"] = "givenName"
attrs_mapping["displayName"] = "displayName"
attrs_mapping["manager"] = "manager"
attrs_mapping["mail"] = "mail"
attrs_mapping["departmentNumber"] = "departmentNumber"
attrs_mapping["title"] = "title"

# group related attrs
attrs_mapping["group_urn"] = "cn"
attrs_mapping["owner"] = "owner"
attrs_mapping["managedBy"] = "managedBy"
attrs_mapping["uniqueMember"] = "uniqueMember"
Collaborator

What are these attributes? I'm not familiar with these or how they can be mapped to DataHub concepts.

DataHub groups have

  • urn
  • name
  • members
  • [optional] owners

Collaborator

Should we just call uniqueMember as "member"?

It's not clear what the difference between "member" and "uniqueMember" are - do you know?

Contributor

@bda618 bda618 Jun 16, 2022

In my company we use "member" only.

Contributor Author

> What are these attributes? I'm not familiar with these or how they can be mapped to DataHub concepts.
>
> DataHub groups have
>
> * urn
> * name
> * members
> * [optional] owners

Sorry, you are correct! I had somehow missed these. I have corrected this now.

Contributor Author

> Should we just call uniqueMember as "member"?
>
> It's not clear what the difference between "member" and "uniqueMember" are - do you know?

I am calling this as members to keep this aligned with DH concepts. The existing behavior was to look for a uniqueMember attribute, so I am retaining that when the mapping is not provided for this one.

@bda618 thanks for checking! we also just use member attribute.
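
To make the point about a configurable member attribute concrete, here is a minimal Python sketch. It is an illustration only: `members_from_attrs` is a hypothetical helper, not the `parse_from_attrs` function used in the source, and it assumes the attribute values are plain user ids (as with `memberUid` in this PR's test fixture) rather than full DNs.

```python
from typing import Any, Dict, List


def members_from_attrs(attrs: Dict[str, Any], member_key: str) -> List[str]:
    """Hypothetical helper: build corpuser urns from the mapped member attribute."""
    values = attrs.get(member_key, [])
    return [f"urn:li:corpuser:{value.decode()}" for value in values]


# Group entry modelled on the test fixture in this PR; values are bytes.
group = {"cn": [b"simpons-group"], "memberUid": [b"hsimpson", b"lsimpson"]}

print(members_from_attrs(group, "uniqueMember"))  # []
print(members_from_attrs(group, "memberUid"))
# ['urn:li:corpuser:hsimpson', 'urn:li:corpuser:lsimpson']
```

With the default key the group appears to have no members, while pointing the mapping at the attribute the directory actually uses recovers them.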

attrs_mapping["member"] = "member"


def create_controls(pagesize: int) -> SimplePagedResultsControl:
"""
@@ -56,15 +78,6 @@ def set_cookie(
return bool(cookie)


def guess_person_ldap(attrs: Dict[str, Any]) -> Optional[str]:
"""Determine the user's LDAP based on the DN and attributes."""
if "sAMAccountName" in attrs:
return attrs["sAMAccountName"][0].decode()
if "uid" in attrs:
return attrs["uid"][0].decode()
return None


class LDAPSourceConfig(ConfigModel):
"""Config used by the LDAP Source."""

@@ -87,6 +100,21 @@ class LDAPSourceConfig(ConfigModel):
default=20, description="Size of each page to fetch when extracting metadata."
)

# default mapping for attrs
attrs_mapping: Dict[str, Any] = {}


def guess_person_ldap(attrs: Dict[str, Any], config: LDAPSourceConfig) -> Optional[str]:
"""Determine the user's LDAP based on the DN and attributes."""
if config.attrs_mapping["urn"] in attrs:
return attrs[config.attrs_mapping["urn"]][0].decode()
Collaborator

Minor: Any chance this will be None? What then?

Contributor Author

So since we default config.attrs_mapping["urn"] to sAMAccountName, the urn key will always be present in the attrs_mapping dict (whether user specifies urn mapping in the recipe or not.) Also, this function returning None is current behavior but now we give users more flexibility to pick a more reliable attribute in their org to minimize the chances this function will return none.

else: # for backward compatibility
Collaborator

Let's add a log line / warning here that we could not find the configured attribute mapping, falling back to X.

I'm a bit torn on this - we might want to simply skip the user instead of using attributes the user doesn't want to use.

Contributor Author

@atulsaurav atulsaurav Jun 20, 2022

I have the else part in place to retain existing behavior and not introduce any breaking changes.. I have taken care of adding warning when this happens.

Collaborator

Got it - thanks for explanation
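
The behavior discussed in this thread (try the configured attribute, warn and fall back to the legacy attributes, otherwise return `None`) could look roughly like the sketch below. It is a hedged illustration, not the code merged in this PR: `guess_user_id` and the `employeeId` attribute are hypothetical, the standard `logging` module stands in for whatever reporting the source actually uses, and values are assumed to be lists of bytes.

```python
import logging
from typing import Any, Dict, Optional

logger = logging.getLogger(__name__)


def guess_user_id(attrs: Dict[str, Any], urn_key: str) -> Optional[str]:
    """Hypothetical variant: try the configured key, then the legacy keys."""
    if urn_key in attrs:
        return attrs[urn_key][0].decode()
    # Backward-compatible fallback, with a warning so the mismatch is visible.
    for legacy_key in ("sAMAccountName", "uid"):
        if legacy_key in attrs:
            logger.warning(
                "Attribute %r not found; falling back to %r", urn_key, legacy_key
            )
            return attrs[legacy_key][0].decode()
    return None


print(guess_user_id({"employeeId": [b"e123"]}, "employeeId"))  # e123
print(guess_user_id({"uid": [b"hsimpson"]}, "employeeId"))     # hsimpson
print(guess_user_id({"cn": [b"printer"]}, "employeeId"))       # None
```

The second call also emits a warning through the logger. Skipping the user instead of falling back, as raised in the review, would amount to returning `None` as soon as the configured key is missing.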

if "sAMAccountName" in attrs:
return attrs["sAMAccountName"][0].decode()
if "uid" in attrs:
return attrs["uid"][0].decode()
return None


@dataclasses.dataclass
class LDAPSourceReport(SourceReport):
@@ -116,6 +144,11 @@ def __init__(self, ctx: PipelineContext, config: LDAPSourceConfig):
"""Constructor."""
super().__init__(ctx)
self.config = config
# ensure prior defaults are in place
for k in attrs_mapping:
Collaborator

nice!

if k not in self.config.attrs_mapping:
self.config.attrs_mapping[k] = attrs_mapping[k]

self.report = LDAPSourceReport()

ldap.set_option(ldap.OPT_X_TLS_REQUIRE_CERT, ldap.OPT_X_TLS_ALLOW)
@@ -198,17 +231,17 @@ def handle_user(self, dn: str, attrs: Dict[str, Any]) -> Iterable[MetadataWorkUn
work unit based on the information.
"""
manager_ldap = None
if "manager" in attrs:
if self.config.attrs_mapping["manager"] in attrs:
try:
m_cn = attrs["manager"][0].decode()
m_cn = attrs[self.config.attrs_mapping["manager"]][0].decode()
manager_msgid = self.ldap_client.search_ext(
m_cn,
ldap.SCOPE_BASE,
self.config.filter,
serverctrls=[self.lc],
)
_m_dn, m_attrs = self.ldap_client.result3(manager_msgid)[1][0]
manager_ldap = guess_person_ldap(m_attrs)
manager_ldap = guess_person_ldap(m_attrs, self.config)
except ldap.LDAPError as e:
self.report.report_warning(
dn, "manager LDAP search failed: {}".format(e)
@@ -241,26 +274,37 @@ def build_corp_user_mce(
"""
Create the MetadataChangeEvent via DN and attributes.
"""
ldap_user = guess_person_ldap(attrs)
ldap_user = guess_person_ldap(attrs, self.config)

if self.config.drop_missing_first_last_name and (
"givenName" not in attrs or "sn" not in attrs
self.config.attrs_mapping["firstName"] not in attrs
or self.config.attrs_mapping["lastName"] not in attrs
):
return None
full_name = attrs["cn"][0].decode()
first_name = attrs["givenName"][0].decode()
last_name = attrs["sn"][0].decode()

email = (attrs["mail"][0]).decode() if "mail" in attrs else ldap_user
full_name = attrs[self.config.attrs_mapping["fullName"]][0].decode()
first_name = attrs[self.config.attrs_mapping["firstName"]][0].decode()
last_name = attrs[self.config.attrs_mapping["lastName"]][0].decode()

email = (
(attrs[self.config.attrs_mapping["mail"]][0]).decode()
if self.config.attrs_mapping["mail"] in attrs
else ldap_user
)
display_name = (
(attrs["displayName"][0]).decode() if "displayName" in attrs else full_name
(attrs[self.config.attrs_mapping["displayName"]][0]).decode()
if self.config.attrs_mapping["displayName"] in attrs
else full_name
)
department = (
(attrs["departmentNumber"][0]).decode()
if "departmentNumber" in attrs
(attrs[self.config.attrs_mapping["departmentNumber"]][0]).decode()
if self.config.attrs_mapping["departmentNumber"] in attrs
else None
)
title = (
attrs[self.config.attrs_mapping["title"]][0].decode()
if self.config.attrs_mapping["title"] in attrs
else None
)
title = attrs["title"][0].decode() if "title" in attrs else None
manager_urn = f"urn:li:corpuser:{manager_ldap}" if manager_ldap else None

return MetadataChangeEvent(
@@ -284,12 +328,16 @@ def build_corp_user_mce(

def build_corp_group_mce(self, attrs: dict) -> Optional[MetadataChangeEvent]:
"""Creates a MetadataChangeEvent for LDAP groups."""
cn = attrs.get("cn")
cn = attrs.get(self.config.attrs_mapping["group_urn"])
if cn:
full_name = cn[0].decode()
owners = parse_from_attrs(attrs, "owner")
members = parse_from_attrs(attrs, "uniqueMember")
email = attrs["mail"][0].decode() if "mail" in attrs else full_name
owners = parse_from_attrs(attrs, self.config.attrs_mapping["owner"])
members = parse_from_attrs(attrs, self.config.attrs_mapping["uniqueMember"])
Collaborator

Remember -> this members field is deprecated. instead, we should be populating the "GroupMembership" aspect of the user object.. Were you intending to do this as a followup?

Contributor Author

yes, as a part of #3335

Collaborator

Wonderful! Thanks for the update

email = (
attrs[self.config.attrs_mapping["mail"]][0].decode()
if self.config.attrs_mapping["mail"] in attrs
else full_name
)

return MetadataChangeEvent(
proposedSnapshot=CorpGroupSnapshotClass(
@@ -10,7 +10,10 @@
"displayName": null,
"email": "simpons-group",
"admins": [],
"members": [],
"members": [
"urn:li:corpuser:hsimpson",
"urn:li:corpuser:lsimpson"
],
"groups": [],
"description": null
}
@@ -35,6 +35,8 @@ cn: simpons-group
gidnumber: 500
objectclass: posixGroup
objectclass: top
memberUid: hsimpson
memberUid: lsimpson

# Entry 4: ou=people,dc=example,dc=org
dn: ou=people,dc=example,dc=org
3 changes: 3 additions & 0 deletions metadata-ingestion/tests/integration/ldap/test_ldap.py
@@ -29,6 +29,9 @@ def test_ldap_ingest(docker_compose_runner, pytestconfig, tmp_path, mock_time):
"ldap_user": "cn=admin,dc=example,dc=org",
"ldap_password": "admin",
"base_dn": "dc=example,dc=org",
"attrs_mapping": {
"uniqueMember": "memberUid",
},
},
},
"sink": {