feat(search): Add SearchScore annotation to use fields for search ranking #4596

Merged
dexter-mh-lee merged 6 commits into datahub-project:master from dl--add-scores
Apr 7, 2022

dexter-mh-lee
Contributor

Add a search score annotation which supports the following fields

  1. Weight - weight to apply to the value before averaging the scores
  2. Default value - value to use if the feature is missing
  3. Modifier - function to apply to the value before averaging (i.e. log, ln, sqrt, square, reciprocal)

Adding this annotation to a field indexes its numeric value. At query time, a function_score query is constructed that combines the score from query matching (how closely the field matches the query) with the score from these annotated values.

Currently, we won't have any direct support for SearchScores, but we will add them soon.
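
As a rough illustration of the three annotation fields described above, here is a minimal plain-Java sketch. It is a simplified model, not the code in this PR and not Elasticsearch's exact function_score formula: the class `SearchScoreSketch`, the nested `ScoreField` type, and the `combine` method are hypothetical names, and applying the weight after the modifier is an assumption.

```java
import java.util.List;
import java.util.Optional;

// Simplified, hypothetical model of the scoring described in the PR text.
final class SearchScoreSketch {

  enum Modifier { LOG, LN, SQRT, SQUARE, RECIPROCAL }

  // One annotated field: weight, default value and optional modifier.
  static final class ScoreField {
    final double weight;
    final double defaultValue;
    final Optional<Modifier> modifier;

    ScoreField(double weight, double defaultValue, Optional<Modifier> modifier) {
      this.weight = weight;
      this.defaultValue = defaultValue;
      this.modifier = modifier;
    }

    // rawValue is null when the feature is missing, so the default is used.
    double contribution(Double rawValue) {
      double value = rawValue != null ? rawValue : defaultValue;
      double modified = modifier.map(m -> apply(m, value)).orElse(value);
      return weight * modified; // assumption: weight is applied after the modifier
    }
  }

  static double apply(Modifier modifier, double value) {
    switch (modifier) {
      case LOG: return Math.log10(value);
      case LN: return Math.log(value);
      case SQRT: return Math.sqrt(value);
      case SQUARE: return value * value;
      case RECIPROCAL: return 1.0 / value;
      default: throw new IllegalArgumentException("Unknown modifier: " + modifier);
    }
  }

  // Average the per-field contributions, then multiply by the query match score.
  static double combine(double queryScore, List<Double> contributions) {
    double average = contributions.stream()
        .mapToDouble(Double::doubleValue)
        .average()
        .orElse(1.0); // no annotated fields: leave the query score unchanged
    return queryScore * average;
  }
}
```

For example, a field with weight 2.0 and a SQRT modifier holding the value 16 contributes 8.0, a missing field with weight 1.0 and default 4.0 contributes 4.0, and a query score of 1.5 then becomes 1.5 * 6.0 = 9.0 in this model.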

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.

github-actions bot commented Apr 6, 2022

Unit Test Results (build & test)

96 files, 96 suites, 16m 12s ⏱️
689 tests: 630 ✔️ passed, 59 💤 skipped, 0 failed

Results for commit 1b72d3f.

♻️ This comment has been updated with latest results.

github-actions bot commented Apr 6, 2022

Unit Test Results (metadata ingestion)

5 files ±0, 5 suites ±0, 59m 44s ⏱️ (- 6m 54s)
394 tests ±0: 394 ✔️ passed ±0, 0 💤 skipped ±0, 0 failed ±0
1 814 runs ±0: 1 783 ✔️ passed ±0, 31 💤 skipped ±0, 0 failed ±0

Results for commit 1b72d3f. ± Comparison against base commit df1d8ad.

♻️ This comment has been updated with latest results.

// Modifier to apply to the value. None if empty.
Optional<Modifier> modifier;

public enum Modifier {
Collaborator

Question - why not expect that the modifier had been applied prior to ingesting the search score?

Contributor Author

The good thing about applying modifiers on the read path is that you can experiment with various modifiers without having to reingest the aspects with each specific modifier. Also, raw counts are more objective by nature; otherwise it is hard to reason about whether or not a feature has already been modified on the write path.
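
To make the read-path argument concrete, here is a small hedged sketch. The class `ReadPathModifierSketch`, the constant `STORED_USAGE_COUNT`, and the method `scoreAtQueryTime` are hypothetical and only illustrate the idea: the raw value is stored once, and the modifier is chosen when the query runs.

```java
import java.util.function.DoubleUnaryOperator;

// Hypothetical sketch: the index keeps the raw value, the modifier is picked per query.
final class ReadPathModifierSketch {

  // Raw value as ingested; it is never rewritten when the modifier changes.
  static final double STORED_USAGE_COUNT = 10_000.0;

  // Apply whichever modifier is being tried out at query time.
  static double scoreAtQueryTime(double storedValue, DoubleUnaryOperator modifier) {
    return modifier.applyAsDouble(storedValue);
  }

  public static void main(String[] args) {
    // Two experiments against the same stored value, with no reingestion in between.
    System.out.println(scoreAtQueryTime(STORED_USAGE_COUNT, Math::log10));
    System.out.println(scoreAtQueryTime(STORED_USAGE_COUNT, Math::sqrt));
  }
}
```

Had the logarithm been applied on the write path instead, switching to a square root would require reingesting every document.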

Collaborator

ohhh yes!! makes sense

return entitySpec.getSearchableFieldSpecs()
private static FunctionScoreQueryBuilder.FilterFunctionBuilder[] buildScoreFunctions(@Nonnull EntitySpec entitySpec) {
List<FunctionScoreQueryBuilder.FilterFunctionBuilder> finalScoreFunctions = new ArrayList<>();
// Add a default weight of 1.0 to make sure the score function is larger than 1
Collaborator

why tho?

Contributor Author

This is in case all features are missing, in which case the function score will be 0. Since we multiply the function score with the query score, a function score of 0 would make all the scores the same regardless of how well the query matches the field values.
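
The effect of the baseline can be sketched in plain Java. This is a simplified model under stated assumptions: feature scores are summed here (the actual combination mode in the PR may differ), and `BaselineWeightSketch` with its methods is a hypothetical name.

```java
import java.util.List;

// Hypothetical sketch of why a constant baseline of 1.0 is added to the function score.
final class BaselineWeightSketch {

  // Assumption: feature scores are summed; the baseline contributes a constant 1.0.
  static double functionScore(List<Double> featureScores, boolean withBaseline) {
    double total = withBaseline ? 1.0 : 0.0;
    for (double score : featureScores) {
      total += score;
    }
    return total;
  }

  // The function score is multiplied with the query match score.
  static double finalScore(double queryScore, List<Double> featureScores, boolean withBaseline) {
    return queryScore * functionScore(featureScores, withBaseline);
  }
}
```

With no features present, a strong match (query score 3.0) and a weak match (query score 0.5) both end up at 0.0 without the baseline. With it, they keep their original scores of 3.0 and 0.5, so the match quality still decides the order.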

Collaborator

Okay nice!

private static FunctionScoreQueryBuilder.FilterFunctionBuilder buildScoreFunctionFromSearchScoreAnnotation(
@Nonnull SearchScoreAnnotation annotation) {
FieldValueFactorFunctionBuilder scoreFunction =
ScoreFunctionBuilders.fieldValueFactorFunction(annotation.getFieldName());
Collaborator

what is a factor function...?

Contributor Author

@@ -32,7 +33,12 @@
* Rank the input list of entities
*/
public List<SearchEntity> rank(List<SearchEntity> originalList) {
return Streams.zip(originalList.stream(), fetchFeatures(originalList).stream(), this::updateFeatures)
List<SearchEntity> entitiesToRank = originalList;
if (!getFeatureExtractors().isEmpty()) {
Collaborator

So we can still invoke feature extractors? I believe these are useful in search snippets.

Contributor Author

Yes. This is for OSS, where I am removing all feature extractors since they are never used. There is no need to update the entities when there are no feature extractors.

Collaborator

OK

private SearchEntity updateFeatures(SearchEntity originalEntity, Features features) {
return new SearchEntity().setEntity(originalEntity.getEntity())
.setMatchedFields(originalEntity.getMatchedFields())
return originalEntity.clone()
Collaborator

fancy

Contributor Author

I got mad having this build fail every time I have to add a new field :(

@@ -28,7 +28,7 @@ public SimpleRanker() {
@Override
public Pair<Double, Double> score(SearchEntity searchEntity) {
Features features = Features.from(searchEntity.getFeatures());
return Pair.of(-features.getNumericFeature(Features.Name.RANK_WITHIN_TYPE, 0.0),
features.getNumericFeature(Features.Name.NUM_ENTITIES_PER_TYPE, 0.0));
return Pair.of(Optional.ofNullable(searchEntity.getScore()).orElse(0.0),
Collaborator

Do you mind explaining what impact this has? We are now using the search score (if it exists) to rank instead of the number of entities per type?

Remind me why we have a Pair<Double, Double> here? I don't quite get the data structure

Contributor Author

Pair is simple: it means entities are ranked in order. The first element is used to order first. If the first elements are exactly the same, the second element is used.

Here, we may as well just use the score, because the probability of two scores being exactly the same is very low.

So the overall impact is that we will stop mixing entity types in an artificial way. Before, we would interleave dataset - chart - dashboard - dataset - chart - dashboard. Now the order will be based on the match score, which seemed reasonable from trying out various queries. I don't think there is a compelling reason to artificially mix in different entity types.
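
The pair ordering described above can be sketched with a two-level comparator. This is an illustration only: `PairRankingSketch` and `ScoredEntity` are hypothetical stand-ins for the ranker and `SearchEntity`, and ranking higher values first is an assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of ranking by a (first, second) score pair.
final class PairRankingSketch {

  // Stand-in for an entity together with its score pair.
  static final class ScoredEntity {
    final String name;
    final double first;
    final double second;

    ScoredEntity(String name, double first, double second) {
      this.name = name;
      this.first = first;
      this.second = second;
    }
  }

  // Order by the first element and break exact ties with the second.
  // Assumption: higher values rank earlier.
  static List<String> rank(List<ScoredEntity> entities) {
    List<ScoredEntity> sorted = new ArrayList<>(entities);
    sorted.sort(Comparator.comparingDouble((ScoredEntity e) -> e.first)
        .thenComparingDouble(e -> e.second)
        .reversed());
    List<String> names = new ArrayList<>();
    for (ScoredEntity entity : sorted) {
      names.add(entity.name);
    }
    return names;
  }
}
```

The second element only matters when two first elements are exactly equal, which is why using the match score as the first element makes the tie-breaker rarely relevant.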

Collaborator

makes sense to me - thanks

final Boolean forDelete
) {
final Map<SearchableFieldSpec, List<Object>> extractedFields =
public Optional<String> transformSnapshot(final RecordTemplate snapshot, final EntitySpec entitySpec,
Collaborator

Question - does this function really accept snapshots?

Contributor Author

Yeah, legacy. It's why we have two functions here: one for snapshot-based entities and one for the rest. Not sure where it is still used, though.

Collaborator

got it - ty

@@ -119,6 +119,40 @@ public void setValue(final SearchableFieldSpec fieldSpec, final List<Object> fie
}
}

public void setSearchScoreValue(final SearchScoreFieldSpec fieldSpec, final List<Object> fieldValues,
final ObjectNode searchDocument, final Boolean forDelete) {
Collaborator

What does "forDelete" mean?

Contributor Author

This is something Gabe added before. It supports deleting aspects: it sends empty JSON for the fields being deleted so that they get deleted on the Elasticsearch side.

Collaborator

ah... okay

@@ -27,4 +27,6 @@ record SearchEntity {
}] = []

features: optional map[string, double]

score: optional double
Collaborator

So if we have this new score, it doesn't mean we won't have features right?

Contributor Author

This score is the score from Elasticsearch. We still have features in this PR for future use when doing a second-pass ranking.

Collaborator

okay nice!!

@dexter-mh-lee dexter-mh-lee merged commit 55f0412 into datahub-project:master Apr 7, 2022
@dexter-mh-lee dexter-mh-lee deleted the dl--add-scores branch April 7, 2022 18:07
maggiehays pushed a commit to maggiehays/datahub that referenced this pull request Aug 1, 2022
…king (datahub-project#4596)

* Add SearchScore annotation

* Add back test-model

* Remove search features

* Fix to John's comments

* simplify ranker

* Fix checkstyle