Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deprecate legacy query processing code #787

Open
dai-chen opened this issue Aug 24, 2022 · 8 comments
Open

Deprecate legacy query processing code #787

dai-chen opened this issue Aug 24, 2022 · 8 comments
Labels
enhancement New feature or request legacy Issues related to legacy query engine to be deprecated SQL

Comments

@dai-chen
Copy link
Collaborator

dai-chen commented Aug 24, 2022

History

Nowadays most of the user queries are processed by our enhanced SQL engine v2. Initially, as part of our migration plan, we added a plugin setting in case users want to opt out. As more and more features migrated to v2, the setting was removed to avoid confusion in OpenSearch 1.0.

Problem

Recently we found there is still minor discrepancy in certain case that user will see inconsistent query behavior. This may caused by special index name, field name or statement/functions yet to migrate. For reference, see recent issues labeled legacy. To improve customer experience, I think we need to deprecate legacy code as soon as possible.

Approach

Fortunately SQL has good test coverage of both unit test and integration test. All integration test runs against both SQL engine v2 and legacy. It is easy to identify the gaps by failing the test directly without fallback.

Known Gaps and Priority

Any gap in fundamental elements should be prioritized:

  • Identifier name: index and field
  • Data type and literal format: ex. date format
  • Basic SQL functions
  • OpenSearch data types: ip, geo point etc
  • OpenSearch aggregate functions

Query and other statement unsupported in v2 can migrate later:

  • Nested field query by PartiQL
  • JOIN query: nested loop and hash join
  • Set operation: ex. intersect/union/union all
  • Delete statement
  • Pagination

Issues related:

@dai-chen dai-chen added enhancement New feature or request SQL legacy Issues related to legacy query engine to be deprecated labels Aug 24, 2022
@MaxKsyunz
Copy link
Collaborator

MaxKsyunz commented Sep 7, 2022

Here are the test results if v1 engine is disabled -- test-results-no-v1-sql.txt.
It's an xml file but github doesn't accept xml attachments.

There are 442 failing tests but 383 of those request JSON format which forces the plugin to use v1 engine. They will need to be updated to see if they fail in the v2 engine.

@dai-chen
Copy link
Collaborator Author

dai-chen commented Sep 8, 2022

Here are the test results if v1 engine is disabled -- test-results-no-v1-sql.txt. It's an xml file but github doesn't accept xml attachments.

There are 442 failing tests but 383 of those request JSON format which forces the plugin to use v1 engine. They will need to be updated to see if they fail in the v2 engine.

Thanks for the info! JSON formatter was the default long time ago and gave user original SearchHits response. We may need to discuss or collect feedback if we should support it in v2 engine for backward compatibility. For now, I'm thinking we can tweak fallback code to allow v2 engine process those queries. We just need to check if any syntax/semantic/execution error thrown and ignore assert failure in those IT. Thanks!

Code:

private boolean isSupportedFormat() {

@acarbonetto
Copy link
Collaborator

@dai-chen we should update V2 engine to output in JSON so that we don't fall back. That should be one of the items in "Query and other statement unsupported in v2 can migrate later:" section I'm thinking.

@dai-chen
Copy link
Collaborator Author

dai-chen commented Sep 9, 2022

@dai-chen we should update V2 engine to output in JSON so that we don't fall back. That should be one of the items in "Query and other statement unsupported in v2 can migrate later:" section I'm thinking.

Sure. The "JSON" formatter means OpenSearch DSL response format actually. As S3, Prometheus and other data sources added, probably we need to re-define it if we want to migrate it too.

@MaxKsyunz
Copy link
Collaborator

MaxKsyunz commented Sep 13, 2022

@dai-chen this is a list of missing features based on the test results I posted earlier.

Bugs

  • Function names can't be used as aliases -- SELECT (<exr>) AS sin FROM ...
  • *20 treated as identifier
  • Allow . in index names
  • Auto cast int to string -- org.opensearch.sql.legacy.SQLFunctionsIT.concat_ws_field_and_string

Missing Scalar Functions

  • E()
  • EXPM1()
  • nested()
  • reverse_nested

Missing Aggregate Functions

  • histogram
  • date_histogram
  • percentiles
  • topHits
  • geo_polygon
  • geo_bounding_box
  • geo_distance
  • geo_intersects
  • stats
  • extended_stats
  • ... GROUP BY terms(...
  • ... GROUP BY range(...

Relevance Functions

  • regexp_query
  • wildcard_query / wildcardquery
  • v1 syntax for mmatch_query, match_phrase, mutli_match -- ... WHERE fieldA = MATCH_QUERY(... and ... WHERE MATCH_QUERY(query='...', ...
  • in_terms
  • term
  • query
  • ... WHERE SCORE(...

Time/date

  • Curly TIMESTAMP -- ... WHERE time < {ts '2015-03-15 00:00:00.000'}
  • Missing functions -- year, day_of_week, now, day_of_year, maketime, month_of_year, hour_of_day, dayofmonth, week_of_year, month, monthname , day_of_month, minute_of_day, datetime, minute_of_hour, curdate, second_of_minute
  • date_format with three parameters -- timestamp, date format, and time zone
  • Date-only for TIMESTAMP -- SELECT insert_time FROM index WHERE insert_time < '2014-08-18' (See org.opensearch.sql.legacy.QueryIT.dateSearch)

Missing SQL features

  • Engine hints -- /*!
  • include() and exclude()
  • ... WHERE ... IN ( SELECT .... )
  • INNER JOIN
  • LEFT JOIN
  • Implicit JOIN -- SELECT ... FROM indexA, indexB ...
  • SELECT ... FROM ... MINUS SELECT ... FROM
  • UNION
  • ... WHERE [NOT] EXISTS (SELECT ...)
  • ... WHERE field BETWEEN ... AND ...
  • ... WHERE... IS [NOT] MISSING
  • ... WHERE field IS TRUE ... WHERE field IS FALSE
  • Optional ORDER BY -- SELECT ... FROM ... GROUP BY field DESC
  • Pagination / non-zero fetch_size
  • DELETE
  • DELETE cluster setting
  • Resolve nested fields -- SELECT field.property FROM index -- org.opensearch.sql.legacy.SQLFunctionsIT.caseChangeTestWithLocale
  • Combined SQL and REST queries:
{
   "query": "SELECT * FROM index LIMIT 1000",
   "filter": { "range": { "age": {"gt":24} } } 
}

@MaxKsyunz
Copy link
Collaborator

MaxKsyunz commented Sep 26, 2022

For reference, the information above is summarized in this project.

@dai-chen
Copy link
Collaborator Author

Sharing my thought on the priority for any feedback.

  • High priority
    • Bug, Scalar Function, Time/Date
    • Missing SQL Feature regarding WHERE
  • High priority (but requires major changes)
    • JOIN
    • Nested field
  • Medium priority
    • Relevance function
    • Aggregate function (depends on register function dynamically)
  • Low priority
    • Remaining Missing SQL Feature: UNION/MINUS, Pagination, DELETE, Subquery

@GumpacG
Copy link
Collaborator

GumpacG commented Nov 24, 2022

sanity_integration_tests.txt results:

There are more test cases that fail that are not listed below. These cases are giving the same results as the results of the database being compared to, the only difference would be the function signature or data types. Therefore, those tests are not failures due to lack of support and won’t be listed below.

Failures for both engines:

  • SELECT MONTH_OF_YEAR(timestamp) FROM opensearch_dashboards_sample_data_flights
  • SELECT WEEK_OF_YEAR(timestamp) FROM opensearch_dashboards_sample_data_flights
  • SELECT DAY_OF_YEAR(timestamp) FROM opensearch_dashboards_sample_data_flights
  • SELECT DAY_OF_MONTH(timestamp) FROM opensearch_dashboards_sample_data_flights
  • SELECT DAY_OF_WEEK(timestamp) FROM opensearch_dashboards_sample_data_flights
  • SELECT HOUR_OF_DAY(timestamp) FROM opensearch_dashboards_sample_data_flights
  • SELECT MINUTE_OF_DAY(timestamp) FROM opensearch_dashboards_sample_data_flights
  • SELECT MINUTE_OF_HOUR(timestamp) FROM opensearch_dashboards_sample_data_flights
  • SELECT SECOND_OF_MINUTE(timestamp) FROM opensearch_dashboards_sample_data_flights
  • SELECT AvgTicketPrice, Carrier FROM opensearch_dashboards_sample_data_flights WHERE timestamp > '2019-12-23 10:00:00'

Failures for v2 and not legacy:

  • SELECT AvgTicketPrice AS AvgTicketPrice, DestCountry AS DestCountry FROM opensearch_dashboards_sample_data_ecommerce e JOIN opensearch_dashboards_sample_data_flights f ON (e.day_of_week_i = f.dayOfWeek) LIMIT 1000
  • "SELECT AvgTicketPrice AS AvgTicketPrice, DestCountry AS DestCountry FROM opensearch_dashboards_sample_data_ecommerce e LEFT JOIN opensearch_dashboards_sample_data_flights f ON (e.day_of_week_i = f.dayOfWeek) LIMIT 1000

tableau_integration_tests.txt results:

v2 failures: 141
legacy failures: 139

Failures for v2 and not legacy:

  • SELECT AvgTicketPrice AS AvgTicketPrice, Cancelled AS Cancelled, Carrier AS Carrier FROM opensearch_dashboards_sample_data_ecommerce INNER JOIN opensearch_dashboards_sample_data_flights ON (opensearch_dashboards_sample_data_ecommerce.day_of_week_i = dayOfWeek) LIMIT 1000
  • SELECT AvgTicketPrice AS AvgTicketPrice, Cancelled AS Cancelled, Carrier AS Carrier FROM opensearch_dashboards_sample_data_ecommerce LEFT JOIN opensearch_dashboards_sample_data_flights ON (opensearch_dashboards_sample_data_ecommerce.day_of_week_i = dayOfWeek) LIMIT 1000

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request legacy Issues related to legacy query engine to be deprecated SQL
Projects
None yet
Development

No branches or pull requests

4 participants