Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Optional Arguments for Highlight #112

Merged
merged 1 commit into from
Sep 13, 2022

Conversation

forestmvey
Copy link

Signed-off-by: forestmvey [email protected]

Description

Add support for optional arguments pre_tags and post_tags for highlight.

Issues Resolved

issue: 636

Check List

  • New functionality includes testing.
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented.
    • New functionality has javadoc added
    • New functionality has user manual doc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@codecov
Copy link

codecov bot commented Sep 2, 2022

Codecov Report

Merging #112 (df15f37) into integ-highlight-in-ppl (43b841d) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@                     Coverage Diff                      @@
##             integ-highlight-in-ppl     #112      +/-   ##
============================================================
+ Coverage                     94.88%   94.89%   +0.01%     
- Complexity                     2925     2931       +6     
============================================================
  Files                           288      288              
  Lines                          7875     7900      +25     
  Branches                        575      580       +5     
============================================================
+ Hits                           7472     7497      +25     
  Misses                          349      349              
  Partials                         54       54              
Flag Coverage Δ
query-workbench 62.76% <ø> (ø)
sql-engine 97.81% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ain/java/org/opensearch/sql/analysis/Analyzer.java 100.00% <100.00%> (ø)
...rg/opensearch/sql/analysis/ExpressionAnalyzer.java 100.00% <100.00%> (ø)
...org/opensearch/sql/analysis/HighlightAnalyzer.java 100.00% <100.00%> (ø)
...ensearch/sql/planner/logical/LogicalHighlight.java 100.00% <100.00%> (ø)
...opensearch/sql/planner/logical/LogicalPlanDSL.java 100.00% <100.00%> (ø)
...search/sql/planner/physical/HighlightOperator.java 100.00% <100.00%> (ø)
...ecutor/protector/OpenSearchExecutionProtector.java 100.00% <100.00%> (ø)
...l/opensearch/request/OpenSearchRequestBuilder.java 100.00% <100.00%> (ø)
...search/sql/opensearch/storage/OpenSearchIndex.java 100.00% <100.00%> (ø)
...java/org/opensearch/sql/ppl/parser/AstBuilder.java 100.00% <100.00%> (ø)
... and 2 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@forestmvey forestmvey force-pushed the dev-highlight-arguments branch 2 times, most recently from a457574 to cad53f1 Compare September 2, 2022 20:55
@MitchellGale MitchellGale self-requested a review September 2, 2022 20:56
@forestmvey forestmvey marked this pull request as ready for review September 2, 2022 20:58
@forestmvey forestmvey force-pushed the dev-highlight-arguments branch 2 times, most recently from 4c6c56a to 15f6e94 Compare September 6, 2022 18:01
@forestmvey forestmvey force-pushed the dev-highlight-in-ppl branch 4 times, most recently from ca08438 to 43b841d Compare September 7, 2022 16:25
@forestmvey forestmvey force-pushed the dev-highlight-arguments branch from 15f6e94 to 1abbbee Compare September 7, 2022 21:48
Copy link

@MaxKsyunz MaxKsyunz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think are the results of a query like this would be SELECT highlight(Tags, pre_tags='<b>', ...), highlight(Title), highlight(Tags, pre_tags='<i>', ..) FROM ...?

At first, I though Title will be tagged with <i> but now I wonder if Tags will be tagged twice. I came up with it while looking at OpenSearchRequestBuilder.

@MaxKsyunz
Copy link

MaxKsyunz commented Sep 8, 2022

@forestmvey also, highlight has a tonne of options (all parse fields can be passed in JSON request AFAIK).
image

I'm happy to leave them out, but let's note that somewhere.

@forestmvey forestmvey changed the base branch from dev-highlight-in-ppl to integ-highlight-in-ppl September 9, 2022 15:44
@forestmvey forestmvey force-pushed the dev-highlight-arguments branch from 1abbbee to ce55ac3 Compare September 9, 2022 15:44
@forestmvey
Copy link
Author

What do you think are the results of a query like this would be SELECT highlight(Tags, pre_tags='<b>', ...), highlight(Title), highlight(Tags, pre_tags='<i>', ..) FROM ...?

At first, I though Title will be tagged with <i> but now I wonder if Tags will be tagged twice. I came up with it while looking at OpenSearchRequestBuilder.

Not a valid use-case for OS as it doesn't support repeating highlight fields. Though when I tested this the SQL plugin outputs incorrect data. I have added a SemanticCheckException to handle the case of repeating fields.
OpenSearchRequestBuilder

@forestmvey
Copy link
Author

@forestmvey also, highlight has a tonne of options (all parse fields can be passed in JSON request AFAIK). image

I'm happy to leave them out, but let's note that somewhere.

You would like a follow up documentation PR?

@forestmvey forestmvey force-pushed the dev-highlight-arguments branch 2 times, most recently from 929e539 to d1abeb2 Compare September 9, 2022 16:48
@forestmvey forestmvey requested a review from MaxKsyunz September 9, 2022 17:10

assertEquals(1, response.getInt("total"));

assertEquals("What are the differences between an <em>IPA</em> and its variants?",
Copy link

@MaxKsyunz MaxKsyunz Sep 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to use verifyDataRows in these tests?

I think there's enough repeated code here to create a matcher that simplifies these tests.

For SQL, the following already works:

verifyDataRows(response, 
   rows(new JSONArray(List.of("alcohol-level <mark>yeast</mark> home-brew champagne")));

@Yury-Fridlyand
Copy link

SELECT Id, highlight('Body'), highlight('*', pre_tags='<mark>', post_tags='</mark>') FROM beer.stackexchange WHERE simple_query_string(['*'], 'yeast') LIMIT 5;
returns 2 highlights with default tags both.

@MaxKsyunz
Copy link

@forestmvey also, highlight has a tonne of options (all parse fields can be passed in JSON request AFAIK). image
I'm happy to leave them out, but let's note that somewhere.

You would like a follow up documentation PR?

The upstream PR needs to be clear that it adds pre_tags and post_tags.

@forestmvey
Copy link
Author

SELECT Id, highlight('Body'), highlight('*', pre_tags='<mark>', post_tags='</mark>') FROM beer.stackexchange WHERE simple_query_string(['*'], 'yeast') LIMIT 5; returns 2 highlights with default tags both.

OS highlighting does not support duplicate fields with different tags. Using the wildcard in this way does seem to be a way to work around not returning an error message from OS. The query

--data-raw '{
      "query": {
        "simple_query_string": {
            "query": "yeast",
            "fields": [
                "*^1.0"
            ],
            "flags": -1,
            "default_operator": "or",
            "analyze_wildcard": false,
            "auto_generate_synonyms_phrase_query": true,
            "fuzzy_prefix_length": 0,
            "fuzzy_max_expansions": 50,
            "fuzzy_transpositions": true,
            "boost": 1.0
        }
    },
    "sort": [
        {
            "_doc": {
                "order": "asc"
            }
        }
    ],
    "highlight": {
        "fields": {
            "Body": {
                "pre_tags": [
                    "<mark>"
                ],
                "post_tags": [
                    "</mark>"
                ]
            },
            "Body": {}
        }
    }
}'

returns

{"error":{"root_cause":[{"type":"x_content_parse_exception","reason":"[34:13] [highlight] failed to parse field [fields]"}],"type":"x_content_parse_exception","reason":"[34:13] [highlight] failed to parse field [fields]","caused_by":{"type":"json_parse_exception","reason":"Duplicate field 'Body'\n at [Source: (org.opensearch.common.io.stream.InputStreamStreamInput); line: 35, column: 19]"}},"status":400}

While the query

      "query": {
        "simple_query_string": {
            "query": "yeast",
            "fields": [
                "*^1.0"
            ],
            "flags": -1,
            "default_operator": "or",
            "analyze_wildcard": false,
            "auto_generate_synonyms_phrase_query": true,
            "fuzzy_prefix_length": 0,
            "fuzzy_max_expansions": 50,
            "fuzzy_transpositions": true,
            "boost": 1.0
        }
    },
    "sort": [
        {
            "_doc": {
                "order": "asc"
            }
        }
    ],
    "highlight": {
        "fields": {
            "*": {
                "pre_tags": [
                    "<mark>"
                ],
                "post_tags": [
                    "</mark>"
                ]
            },
            "Body": {}
        }
    }
}'

returns data for only one field

        "hits": [
            {
                "_index": "beer.stackexchange",
                "_id": "IvN9IYIBLnrXlu0CoKH4",
                "_score": null,
                "_source": {
                    "Id": "17",
                    "PostTypeId": "2",
                    "ParentId": "8",
                    "CreationDate": "2014-01-21T20:45:57.743",
                    "Score": "6",
                    "Body": "<p>Ales and lagers are brewed with different types of yeast.  Ale yeast ferments at the top of the brewing vat at a comfortable room temperature while lager yeast ferments at the bottom of the vat at a lower temperature.  The \"low and slow\" lager fermentation brings out more complex flavors.</p>\n",
                    "OwnerUserId": "5",
                    "LastActivityDate": "2014-01-21T20:45:57.743",
                    "CommentCount": "0",
                    "ContentLicense": "CC BY-SA 3.0"
                },
                "highlight": {
                    "Body": [
                        "<p>Ales and lagers are brewed with different types of <em>yeast</em>.",
                        "Ale <em>yeast</em> ferments at the top of the brewing vat at a comfortable room temperature while lager <em>yeast</em>"
                    ]
                },
                "sort": [
                    15
                ]
            }...

Not sure if this would be a bug because having two fields in OS with different pre_tags/post_tags is not a valid use-case. From the SQL plugin perspective if no error is returned from OS, I don't see how we could have a good solution when wildcard is involved.

@forestmvey forestmvey force-pushed the dev-highlight-arguments branch from d1abeb2 to df15f37 Compare September 13, 2022 16:13
@forestmvey forestmvey merged commit fbf286c into integ-highlight-in-ppl Sep 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants