fix(policy): Use search to fetch all policies #4713

Merged

@dexter-mh-lee dexter-mh-lee commented Apr 21, 2022

Currently, we use the listUrns function to fetch all policies. We realized this function does not scale well once a large number of entities have been ingested.

Instead, we will use search to fetch all policies. One caveat: no policy fields carry a Searchable annotation, so the policy search index is currently empty. To handle this, I modified ingestPoliciesStep to send MCLs for the existing policies when the search index is empty, so that the index is populated before any fetching happens.

As an aside, this adds a new searchable field called lastUpdatedTimestamp and sets it on every update. List policies sorts on this field, so newly upserted policies rank first.
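
To illustrate the intended ordering, here is a minimal in-memory sketch, not the actual DataHub search call. The `PolicyDoc` class and `listPolicyUrns` method are hypothetical names, the urns are made-up values, and placing policies without a timestamp last is an assumption made only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class PolicyListingSketch {
    /** Hypothetical stand-in for a policy document in the search index. */
    static final class PolicyDoc {
        final String urn;
        // Optional field: null for policies written before the field existed.
        final Long lastUpdatedTimestamp;

        PolicyDoc(String urn, Long lastUpdatedTimestamp) {
            this.urn = urn;
            this.lastUpdatedTimestamp = lastUpdatedTimestamp;
        }
    }

    /** Returns one page of urns, most recently updated first (assumed: nulls last). */
    static List<String> listPolicyUrns(List<PolicyDoc> index, int start, int count) {
        List<PolicyDoc> sorted = new ArrayList<>(index);
        sorted.sort(Comparator.comparing(
            (PolicyDoc doc) -> doc.lastUpdatedTimestamp,
            Comparator.nullsLast(Comparator.<Long>reverseOrder())));

        List<String> page = new ArrayList<>();
        for (int i = Math.max(start, 0); i < sorted.size() && page.size() < count; i++) {
            page.add(sorted.get(i).urn);
        }
        return page;
    }

    public static void main(String[] args) {
        List<PolicyDoc> index = Arrays.asList(
            new PolicyDoc("urn:li:dataHubPolicy:a", 100L),
            new PolicyDoc("urn:li:dataHubPolicy:b", null),
            new PolicyDoc("urn:li:dataHubPolicy:c", 300L),
            new PolicyDoc("urn:li:dataHubPolicy:d", 200L));
        System.out.println(listPolicyUrns(index, 0, 2));
        System.out.println(listPolicyUrns(index, 2, 10));
    }
}
```

In the PR itself the sorting is done by the search index rather than in application code; the sketch only shows why a policy that was just upserted would appear on the first page.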

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

github-actions bot commented Apr 21, 2022

Unit Test Results (build & test)

  97 files  ±0    97 suites  ±0   17m 56s ⏱️ - 7m 3s
701 tests ±0  642 ✔️ ±0  59 💤 ±0  0 ±0 

Results for commit 98f17c0. ± Comparison against base commit bbfc902.

♻️ This comment has been updated with latest results.

github-actions bot commented Apr 21, 2022

Unit Test Results (metadata ingestion)

       5 files  ±0         5 suites  ±0   59m 57s ⏱️ -17s
   433 tests ±0     433 ✔️ ±  0    0 💤 ±  0  0 ±0 
2 090 runs  ±0  2 024 ✔️  - 20  66 💤 +20  0 ±0 

Results for commit 98f17c0. ± Comparison against base commit bbfc902.

♻️ This comment has been updated with latest results.

@@ -244,6 +237,9 @@ export const PoliciesPage = () => {
content: `Are you sure you want to remove policy?`,
onOk() {
deletePolicy({ variables: { urn: policy?.urn as string } }); // There must be a focus policy urn.
setTimeout(function () {
policiesRefetch();
Collaborator

nice!

@@ -282,6 +278,9 @@ export const PoliciesPage = () => {
createPolicy({ variables: { input: toPolicyInput(savePolicy) } });
}
message.success('Successfully saved policy.');
setTimeout(function () {
Collaborator

nice!

@Searchable = {
"fieldType": "DATETIME"
}
lastUpdatedTimestamp: optional long
Collaborator

Thoughts on lastModifiedMs? Only because "timestamp" can mean a few formats, and also to align with startTimeMs in the ExecutionRequestResult model that we use for sorting ingestion runs. Not a huge deal, but let me know what you think.

Contributor Author

Oh I just followed the one added to Operation

Contributor Author

Found it: https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/Operation.pdl#L18 is where I got it from. I think we've been using "last updated" in the UI, so I'm going to keep it like this.

}
}

private void addPolicyToCache(final Map<String, List<DataHubPolicyInfo>> cache, final EntityResponse entityResponse) {
Collaborator

Nice! I love the idea of simplifying this.

_entityClient.batchGetV2(POLICY_ENTITY_NAME, new HashSet<>(policyUrns), null, authentication);
return new PolicyFetchResult(policyUrns.stream()
.map(policyEntities::get)
.filter(Objects::nonNull)
Collaborator

qq: do we need the double non-null filter?

Collaborator

What if we just filtered for where the aspect map contains the key?

Collaborator

(minor)

Contributor Author

So just one filter that checks whether the urn exists in the map and the entity response has the aspect? I feel like efficiency-wise this should be exactly the same.
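
For reference, the two forms discussed in this thread can be compared side by side. This is a simplified sketch: plain maps stand in for the batch response, and the class name, method names, and sample data are hypothetical, with the aspect key used only for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.stream.Collectors;

class PolicyFilterSketch {
    // Aspect key used for illustration only.
    static final String POLICY_INFO_ASPECT = "dataHubPolicyInfo";

    /** Two-step form: drop urns missing from the response, then drop entries lacking the aspect. */
    static List<Map<String, String>> twoFilters(List<String> urns,
                                                Map<String, Map<String, String>> responses) {
        return urns.stream()
            .map(responses::get)
            .filter(Objects::nonNull)
            .filter(aspects -> aspects.containsKey(POLICY_INFO_ASPECT))
            .collect(Collectors.toList());
    }

    /** Single-predicate form: one filter that performs both checks. */
    static List<Map<String, String>> oneFilter(List<String> urns,
                                               Map<String, Map<String, String>> responses) {
        return urns.stream()
            .map(responses::get)
            .filter(aspects -> aspects != null && aspects.containsKey(POLICY_INFO_ASPECT))
            .collect(Collectors.toList());
    }

    /** Made-up response: one policy with the aspect, one without it. */
    static Map<String, Map<String, String>> sampleResponses() {
        Map<String, Map<String, String>> responses = new HashMap<>();
        responses.put("urn:li:dataHubPolicy:a",
            Collections.singletonMap(POLICY_INFO_ASPECT, "policy A"));
        responses.put("urn:li:dataHubPolicy:b",
            Collections.singletonMap("otherAspect", "no policy info"));
        return responses;
    }

    public static void main(String[] args) {
        List<String> urns = Arrays.asList(
            "urn:li:dataHubPolicy:a", "urn:li:dataHubPolicy:b", "urn:li:dataHubPolicy:missing");
        System.out.println(twoFilters(urns, sampleResponses()));
        System.out.println(oneFilter(urns, sampleResponses()));
    }
}
```

Both forms visit each element once and keep the same entries, so the choice is about readability rather than cost.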

* Send MCLs for each policy to refill the policy search index
*/
private void sendMCL() throws URISyntaxException {
log.info("Pushing MCLs for all policies");
Collaborator

what policies? bootstrap policies? do they already exist inside the document store?

Contributor Author

Leaving this for the record: the current policy index is empty because there are no searchable fields, so this step re-ingests the existing policies once to get them into the index.
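
The one-time backfill described here can be sketched roughly as follows. This is an in-memory illustration under stated assumptions: the maps stand in for the primary store and the search index, and `emitChangeEvent` and `backfillIfIndexEmpty` are hypothetical names, not the methods in the PR.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PolicyIndexBackfillSketch {
    // Hypothetical stand-ins: urn -> serialized policy info.
    final Map<String, String> primaryStore = new LinkedHashMap<>();
    final Map<String, String> searchIndex = new LinkedHashMap<>();
    // Records every change event emitted, so repeated runs can be inspected.
    final List<String> emittedEvents = new ArrayList<>();

    /** Simulates a change event being emitted and then consumed by the indexer. */
    void emitChangeEvent(String urn, String policyInfo) {
        emittedEvents.add(urn);
        searchIndex.put(urn, policyInfo);
    }

    /** Re-emits events for all stored policies, but only when the index is empty. */
    int backfillIfIndexEmpty() {
        if (!searchIndex.isEmpty()) {
            return 0; // Index already populated: nothing to do.
        }
        int sent = 0;
        for (Map.Entry<String, String> entry : primaryStore.entrySet()) {
            emitChangeEvent(entry.getKey(), entry.getValue());
            sent++;
        }
        return sent;
    }

    public static void main(String[] args) {
        PolicyIndexBackfillSketch step = new PolicyIndexBackfillSketch();
        step.primaryStore.put("urn:li:dataHubPolicy:a", "policy A");
        step.primaryStore.put("urn:li:dataHubPolicy:b", "policy B");
        System.out.println("first run sent: " + step.backfillIfIndexEmpty());
        System.out.println("second run sent: " + step.backfillIfIndexEmpty());
    }
}
```

Guarding on an empty index keeps the step idempotent: later restarts skip the re-emit once the policies are searchable.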

* @param start start offset for search results
* @param count max number of search results requested
* @return Snapshot key
* @throws RemoteInvocationException
*/
@Nonnull
public SearchResult search(@Nonnull String entity, @Nonnull String input, @Nullable Filter filter, int start,
int count, @Nonnull Authentication authentication) throws RemoteInvocationException;
public SearchResult search(@Nonnull String entity, @Nonnull String input, @Nullable Filter filter,
Collaborator

Nice! We also have a filter method, which I've used for the same purpose.

Collaborator

@jjoyce0510 jjoyce0510 left a comment

overall LGTM - let's test on demo extensively before we cut a release :p

@dexter-mh-lee dexter-mh-lee merged commit 8185ba4 into datahub-project:master Apr 22, 2022
@dexter-mh-lee dexter-mh-lee deleted the dl--fix-list-policies branch April 22, 2022 05:15
dexter-mh-lee pushed a commit that referenced this pull request Apr 22, 2022
dexter-mh-lee pushed a commit to dexter-mh-lee/datahub that referenced this pull request Apr 22, 2022
maggiehays pushed a commit to maggiehays/datahub that referenced this pull request Aug 1, 2022
* fix(policy): Use search to fetch all policies

* Add updated timestamp

* Change refetching logic and add timeout

* Increase wait on smoke test