feat(search): Add SearchScore annotation to use fields for search ranking #4596

dexter-mh-lee · 2022-04-06T15:51:21Z

Add a SearchScore annotation that supports the following fields:

Weight - weight to apply to the value before averaging the scores
Default value - value to use if the feature is missing
Modifier - function to apply to the value before averaging (i.e. log, ln, sqrt, square, reciprocal)

Adding these annotations writes the numeric value into the index. At query time, a function_score query is constructed that combines the score from query matching (how closely the field matches the query) with the score from these annotated values.
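As a rough illustration of how the three fields might interact for a single annotated value, here is a minimal sketch in plain Java. `SearchScoreSketch` and `contribution` are hypothetical names, not the classes added in this PR, and the mapping of `log` to base 10 and `ln` to the natural logarithm is an assumption based on how Elasticsearch names its modifiers.

```java
import java.util.Optional;
import java.util.function.DoubleUnaryOperator;

// Hypothetical stand-in for the annotation's three fields; not the PR's actual class.
class SearchScoreSketch {
  // Modifiers named in the PR description (assumed: log is base 10, ln is natural log).
  enum Modifier {
    LOG(Math::log10),
    LN(Math::log),
    SQRT(Math::sqrt),
    SQUARE(v -> v * v),
    RECIPROCAL(v -> 1.0 / v);

    private final DoubleUnaryOperator fn;

    Modifier(DoubleUnaryOperator fn) {
      this.fn = fn;
    }

    double apply(double value) {
      return fn.applyAsDouble(value);
    }
  }

  final double weight;
  final double defaultValue;
  final Optional<Modifier> modifier;

  SearchScoreSketch(double weight, double defaultValue, Optional<Modifier> modifier) {
    this.weight = weight;
    this.defaultValue = defaultValue;
    this.modifier = modifier;
  }

  // Weighted contribution of one annotated field for one document.
  double contribution(Optional<Double> indexedValue) {
    // Fall back to the default value when the feature is missing.
    double raw = indexedValue.orElse(defaultValue);
    // Apply the modifier, if any, before weighting.
    double modified = modifier.map(m -> m.apply(raw)).orElse(raw);
    return weight * modified;
  }

  public static void main(String[] args) {
    SearchScoreSketch usage = new SearchScoreSketch(2.0, 1.0, Optional.of(Modifier.SQRT));
    System.out.println(usage.contribution(Optional.of(16.0))); // weight * sqrt(16)
    System.out.println(usage.contribution(Optional.empty()));  // uses the default value
  }
}
```

How the per-field contributions are then averaged and combined with the query score is left to Elasticsearch in the actual change; this sketch only covers a single field.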

Currently, we won't have any direct support for SearchScores, but we will add them soon.

Checklist

The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
Links to related issues (if applicable)
Tests for the changes have been added/updated (if applicable)
Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.

github-actions · 2022-04-06T16:28:26Z

Unit Test Results (build & test)

  96 files   96 suites 16m 12s ⏱️
689 tests 630 ✔️ 59 💤 0 ❌

Results for commit 1b72d3f.

♻️ This comment has been updated with latest results.

github-actions · 2022-04-06T16:34:57Z

Unit Test Results (metadata ingestion)

      5 files ±0       5 suites ±0 59m 44s ⏱️ - 6m 54s
  394 tests ±0   394 ✔️ ±0   0 💤 ±0 0 ❌ ±0
1 814 runs ±0 1 783 ✔️ ±0 31 💤 ±0 0 ❌ ±0

Results for commit 1b72d3f. ± Comparison against base commit df1d8ad.

♻️ This comment has been updated with latest results.

entity-registry/src/main/java/com/linkedin/metadata/models/EntitySpecBuilder.java

jjoyce0510 · 2022-04-07T02:16:31Z

...ty-registry/src/main/java/com/linkedin/metadata/models/annotation/SearchScoreAnnotation.java

+  // Modifier to apply to the value. None if empty.
+  Optional<Modifier> modifier;
+
+  public enum Modifier {


Question - why not expect that the modifier had been applied prior to ingesting the search score?

The good thing about applying modifiers on the read path is that you can experiment with various modifiers without having to reingest the aspects with the specific modifiers. Also, raw counts are more objective by nature: it is hard to reason about whether a feature has already been modified on the write path.

ohhh yes!! makes sense

jjoyce0510 · 2022-04-07T02:18:58Z

...c/main/java/com/linkedin/metadata/search/elasticsearch/query/request/SearchQueryBuilder.java

-    return entitySpec.getSearchableFieldSpecs()
+  private static FunctionScoreQueryBuilder.FilterFunctionBuilder[] buildScoreFunctions(@Nonnull EntitySpec entitySpec) {
+    List<FunctionScoreQueryBuilder.FilterFunctionBuilder> finalScoreFunctions = new ArrayList<>();
+    // Add a default weight of 1.0 to make sure the score function is larger than 1


This is in case all features are missing, in which case the function score will be 0. Since we multiply the function score with the query score, a function score of 0 would make all the scores the same regardless of how well the query matches the field values.
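The arithmetic behind this can be sketched in plain Java. The sketch assumes the function scores are combined as a weighted average and then multiplied with the query score, as described in this thread and the PR description; `FunctionScoreSketch` and its methods are hypothetical names for illustration and do not call Elasticsearch.

```java
// Hypothetical illustration of why a constant function with weight 1.0 is added.
class FunctionScoreSketch {
  // Weighted average of function values: sum(w_i * f_i) / sum(w_i).
  static double weightedAverage(double[] weights, double[] values) {
    double weighted = 0.0;
    double total = 0.0;
    for (int i = 0; i < weights.length; i++) {
      weighted += weights[i] * values[i];
      total += weights[i];
    }
    return total == 0.0 ? 0.0 : weighted / total;
  }

  // Final score, assuming the function score is multiplied with the query score.
  static double finalScore(double queryScore, double functionScore) {
    return queryScore * functionScore;
  }

  public static void main(String[] args) {
    // One annotated field with weight 2.0 whose value is missing and defaults to 0.
    double withoutConstant = weightedAverage(new double[] {2.0}, new double[] {0.0});
    // Same field plus a constant function of value 1.0 with weight 1.0.
    double withConstant = weightedAverage(new double[] {1.0, 2.0}, new double[] {1.0, 0.0});

    // Without the constant, a strong match and a weak match both collapse to 0.
    System.out.println(finalScore(5.0, withoutConstant) + " vs " + finalScore(0.5, withoutConstant));
    // With the constant, the query score still separates the two documents.
    System.out.println(finalScore(5.0, withConstant) + " vs " + finalScore(0.5, withConstant));
  }
}
```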

jjoyce0510 · 2022-04-07T02:19:57Z

...c/main/java/com/linkedin/metadata/search/elasticsearch/query/request/SearchQueryBuilder.java

+  private static FunctionScoreQueryBuilder.FilterFunctionBuilder buildScoreFunctionFromSearchScoreAnnotation(
+      @Nonnull SearchScoreAnnotation annotation) {
+    FieldValueFactorFunctionBuilder scoreFunction =
+        ScoreFunctionBuilders.fieldValueFactorFunction(annotation.getFieldName());


what is a factor function...?

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html#function-field-value-factor

jjoyce0510 · 2022-04-07T02:21:46Z

metadata-io/src/main/java/com/linkedin/metadata/search/ranker/SearchRanker.java

@@ -32,7 +33,12 @@
   * Rank the input list of entities
   */
  public List<SearchEntity> rank(List<SearchEntity> originalList) {
-    return Streams.zip(originalList.stream(), fetchFeatures(originalList).stream(), this::updateFeatures)
+    List<SearchEntity> entitiesToRank = originalList;
+    if (!getFeatureExtractors().isEmpty()) {


So we can still invoke feature extractors? I believe these are useful in search snippets

yes. this is for oss where I am removing all feature extractors since they are never used. There is no need to update the entities when there are no feature extractors

jjoyce0510 · 2022-04-07T02:22:04Z

metadata-io/src/main/java/com/linkedin/metadata/search/ranker/SearchRanker.java

  private SearchEntity updateFeatures(SearchEntity originalEntity, Features features) {
-    return new SearchEntity().setEntity(originalEntity.getEntity())
-        .setMatchedFields(originalEntity.getMatchedFields())
+    return originalEntity.clone()


I got mad having this build fail every time I have to add a new field :(

jjoyce0510 · 2022-04-07T02:23:30Z

metadata-io/src/main/java/com/linkedin/metadata/search/ranker/SimpleRanker.java

@@ -28,7 +28,7 @@ public SimpleRanker() {
  @Override
  public Pair<Double, Double> score(SearchEntity searchEntity) {
    Features features = Features.from(searchEntity.getFeatures());
-    return Pair.of(-features.getNumericFeature(Features.Name.RANK_WITHIN_TYPE, 0.0),
-        features.getNumericFeature(Features.Name.NUM_ENTITIES_PER_TYPE, 0.0));
+    return Pair.of(Optional.ofNullable(searchEntity.getScore()).orElse(0.0),


Do you mind explaining what impact this has? We are now using the search score (if it exists) to rank instead of the number of entities per type?

Remind me why we have a Pair<Double, Double> here? I don't quite get the data structure

Pair is simple. it means that it will rank in order. The first element is used to order first. If the first element is the exact same, it will use the second element.

I mean here, we may as well just use the score bc the probability of that being exactly the same is very low.

So overall impact is that we will stop mixing entity types in an artificial way. Before we would do dataset - chart - dashboard - dataset - chart - dashboard. Now, it will be based on the match score, which from trying out various queries seemed reasonable. I don't think there is a compelling reason to artificially mix in different entity types.

makes sense to me - thanks
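The Pair ordering described in this thread can be sketched as a two-key comparison: the first element decides, and the second only breaks exact ties. The `Scored` class and `rank` method below are hypothetical, and sorting in descending order (highest score first) is an assumption; the PR's ranker uses its own Pair type and sort.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class PairOrderingSketch {
  // Hypothetical minimal holder for an entity and its (first, second) score pair.
  static final class Scored {
    final String urn;
    final double first;
    final double second;

    Scored(String urn, double first, double second) {
      this.urn = urn;
      this.first = first;
      this.second = second;
    }
  }

  // Order by the first element, then by the second on exact ties; highest first (assumed).
  static List<String> rank(List<Scored> entities) {
    List<Scored> sorted = new ArrayList<>(entities);
    sorted.sort(Comparator.comparingDouble((Scored s) -> s.first)
        .thenComparingDouble(s -> s.second)
        .reversed());
    List<String> urns = new ArrayList<>();
    for (Scored s : sorted) {
      urns.add(s.urn);
    }
    return urns;
  }

  public static void main(String[] args) {
    List<Scored> entities = new ArrayList<>();
    entities.add(new Scored("a", 1.0, 0.0));
    entities.add(new Scored("b", 2.0, 0.0));
    entities.add(new Scored("c", 2.0, 5.0));
    System.out.println(rank(entities)); // prints [c, b, a]
  }
}
```

Here "b" and "c" tie on the first element, so the second element places "c" ahead. With a real-valued match score as the first element, such exact ties should be rare, which is the point made above.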

jjoyce0510 · 2022-04-07T02:24:06Z

...ata-io/src/main/java/com/linkedin/metadata/search/transformer/SearchDocumentTransformer.java

-      final Boolean forDelete
-  ) {
-    final Map<SearchableFieldSpec, List<Object>> extractedFields =
+  public Optional<String> transformSnapshot(final RecordTemplate snapshot, final EntitySpec entitySpec,


Question - does this function really accept snapshots?

Yeah legacy. It's why we have two functions here. one for snapshot based entities and one for not. Not sure where it is still used tho.

got it - ty

jjoyce0510 · 2022-04-07T02:24:36Z

...ata-io/src/main/java/com/linkedin/metadata/search/transformer/SearchDocumentTransformer.java

@@ -119,6 +119,40 @@ public void setValue(final SearchableFieldSpec fieldSpec, final List<Object> fie
    }
  }

+  public void setSearchScoreValue(final SearchScoreFieldSpec fieldSpec, final List<Object> fieldValues,
+      final ObjectNode searchDocument, final Boolean forDelete) {


What does "forDelete" mean?

This is something gabe added before. It is to support deleting aspects, so it sends over empty jsons for the fields that are being deleted so it gets deleted on elasticsearch side.
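A minimal sketch of the idea, under the assumption that "empty" means writing a null value for the field so that it is cleared on the Elasticsearch side. It uses a plain `Map` as a stand-in for the search document; `ForDeleteSketch` and `setScoreField` are hypothetical names and not the PR's implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ForDeleteSketch {
  // Writes a score field into the document, or an explicit null when the aspect is being deleted.
  static void setScoreField(Map<String, Object> searchDocument, String fieldName,
      Double value, boolean forDelete) {
    if (forDelete) {
      // Assumption: an explicit null tells the index to clear the field.
      searchDocument.put(fieldName, null);
      return;
    }
    if (value != null) {
      searchDocument.put(fieldName, value);
    }
  }

  public static void main(String[] args) {
    Map<String, Object> document = new LinkedHashMap<>();
    setScoreField(document, "usageCount", 42.0, false);  // normal upsert
    setScoreField(document, "queryCount", null, true);   // aspect deleted
    System.out.println(document);
  }
}
```

The field names `usageCount` and `queryCount` are made up for the example.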

jjoyce0510 · 2022-04-07T02:28:41Z

metadata-models/src/main/pegasus/com/linkedin/metadata/search/SearchEntity.pdl

@@ -27,4 +27,6 @@ record SearchEntity {
  }] = []

  features: optional map[string, double]
+
+  score: optional double


So if we have this new score, it doesn't mean we won't have features right?

This score is the score from elasticsearch. we still have features in this pr for future usage when doing a second pass ranking

okay nice!!

Dexter Lee added 2 commits April 6, 2022 03:14

Add SearchScore annotation

f5f1fd2

Add back test-model

9c0223e

Remove search features

0f3ac91

Dexter Lee added 3 commits April 6, 2022 22:20

Fix to John's comments

4abafd3

simplify ranker

5f6cf27

Fix checkstyle

1b72d3f

jjoyce0510 approved these changes Apr 7, 2022

dexter-mh-lee merged commit 55f0412 into datahub-project:master Apr 7, 2022

dexter-mh-lee deleted the dl--add-scores branch April 7, 2022 18:07
