[sqllab] Fix sqllab limit regex issue with sqlparse #5295

timifasubaa · 2018-06-27T00:10:17Z

This PR fixes the sqllab regex issues with Offset and tricky limits that regex doesn't handle correctly. It replaces the regex approach with sqlparse.

The queries below failed before but now they are successful. Fixes #5272

select * from table where a=True limit 1, 2
select * from table where a=True limit 1 offset 10
select 'limit 10'

@conglei @john-bodley @mistercrunch

john-bodley · 2018-06-27T14:58:32Z

superset/db_engine_specs.py


    @classmethod
    def get_query_without_limit(cls, sql):
-        return re.sub(r"""
+        before_limit = re.sub(r"""


You probably should rename the function given the change in return values. Also can we make these staticmethods.

john-bodley · 2018-06-27T15:00:12Z

superset/db_engine_specs.py

-            sql_without_limit = cls.get_query_without_limit(sql)
-            return '{sql_without_limit} LIMIT {limit}'.format(**locals())
+            sql_before_limit, sql_after_limit = cls.get_query_without_limit(sql)
+            return '{sql_before_limit} LIMIT {limit}{sql_after_limit}'.format(**locals())


What happens if one explicitly specifies a limit in the SQL which is less than the override?

It does not override

john-bodley · 2018-06-27T15:00:38Z

superset/db_engine_specs.py

-                LIMIT\s+\d+  # LIMIT $ROWS
-                ;?           # optional semi-colon
-                (\s|;)*$     # remove trailing spaces tabs or semicolons
+                LIMIT\s+(\d+)  # LIMIT $ROWS


Can we align the comments?

john-bodley · 2018-06-27T15:01:03Z

superset/db_engine_specs.py

                """, '', sql)

+        after_limit_pattern = re.compile(r"""
+                (?ix)          # case insensitive, verbose


Similarly can we align the comments?

john-bodley · 2018-06-27T15:01:28Z

superset/db_engine_specs.py


    @classmethod
    def get_query_without_limit(cls, sql):
-        return re.sub(r"""


Can we make this a staticmethod?

john-bodley · 2018-06-27T15:19:51Z

superset/db_engine_specs.py

@@ -114,23 +114,31 @@ def get_limit_from_sql(cls, sql):
                (?ix)          # case insensitive, verbose
                \s+            # whitespace
                LIMIT\s+(\d+)  # LIMIT $ROWS
-                ;?             # optional semi-colon
-                (\s|;)*$       # remove trailing spaces tabs or semicolons
+                .*$            # everything else


I think this regex will fail for queries like (albeit being somewhat unusual):

SELECT ‘LIMIT 10’

conglei

Thanks @timifasubaa for continuing working on this.

In general, there are two issues to be fix.

LIMIT inside Quote. This can be done using something similar to https://robrich.org/archive/2007/11/29/regex-match-content-unless-it-is-inside-quotes.aspx
The new logic will break the tests

john-bodley · 2018-06-27T18:38:33Z

@conglei @timifasubaa sqlparse could probably be leveraged to deal with the word LIMIT outside of the LIMIT clause.

timifasubaa · 2018-06-29T21:35:40Z

@john-bodley @conglei I tried out sqlparse and it worked fine for our concerns. Feel free to suggest any extra cases you want me to cover with tests.

john-bodley · 2018-06-30T04:31:58Z

superset/db_engine_specs.py

+                cls.get_substrings_before_and_after_limit(sql)
+            )
+            sql = '{sql_before_limit} LIMIT {limit} {sql_after_limit}'.format(**locals())
+            re.sub('\s+', ' ', sql).strip()


I don't think you want to be stripping additional whitespace in free-from SQL as the author may have included this on purpose, i.e.,

SELECT * FROM my_table WHERE paragraph LIKE '%. The%'

john-bodley · 2018-06-30T04:38:02Z

tests/db_engine_specs_test.py

+                'LIMIT 777' AS a
+                , b
+            FROM
+            table LIMIT 1000, 999999"""),


Note I believe the form here is LIMIT [<offset>,] <row-count> per this and it seems this is flipped.

john-bodley · 2018-06-30T04:50:27Z

superset/sql_parse.py

+        limit_pos = None
+
+        # Add all items to before_str until there is a limit
+        for pos, item in enumerate(self._parsed[0].tokens):


Given you're now using sqlparse rather than trying to extract the before and after portions of the query, why don't you simply replace the relevant token in-place, i.e.,

>>> sqlparse.parse('SELECT * FROM foo LIMIT 1000')[0].tokens [..., <Keyword 'LIMIT' at 0x10DE741F0>, <Whitespace ' ' at 0x10DE74258>, <Integer '1000' at 0x10DE742C0>]

and

>>> sqlparse.parse('SELECT * FROM foo LIMIT 10, 1000')[0].tokens [..., <Keyword 'LIMIT' at 0x10DE74668>, <Whitespace ' ' at 0x10DE746D0>, <IdentifierList '10, 10...' at 0x10DE751D0>]

It seems once you find the LIMIT keyword just jump two tokens which will contain either an IdentifierList or Integer and update the token accordingly, i.e., for the first case (example code):

>>> s = sqlparse.parse('SELECT * FROM foo LIMIT 1000')[0] >>> s.tokens[-1].value = '999' >>> str(s) 'SELECT * FROM foo LIMIT 999'

codecov-io · 2018-07-03T18:04:25Z

Codecov Report

Merging #5295 into master will decrease coverage by 15.72%.
The diff coverage is 95.12%.

@@             Coverage Diff             @@
##           master    #5295       +/-   ##
===========================================
- Coverage   77.15%   61.43%   -15.73%     
===========================================
  Files          44      373      +329     
  Lines        8892    23554    +14662     
  Branches        0     2725     +2725     
===========================================
+ Hits         6861    14471     +7610     
- Misses       2031     9070     +7039     
- Partials        0       13       +13

Impacted Files	Coverage Δ
superset/sql_parse.py	`99.13% <100%> (+0.34%)`	⬆️
superset/db_engine_specs.py	`53.6% <75%> (-0.23%)`	⬇️
...erset/assets/src/explore/components/ControlRow.jsx	`100% <0%> (ø)`
superset/assets/src/visualizations/sunburst.js	`8.37% <0%> (ø)`
.../src/dashboard/components/menu/WithPopoverMenu.jsx	`82.14% <0%> (ø)`
...ssets/src/dashboard/util/backgroundStyleOptions.js	`100% <0%> (ø)`
.../dashboard/util/logging/findNonTabChildChartIds.js	`0% <0%> (ø)`
superset/assets/src/welcome/DashboardTable.jsx	`83.33% <0%> (ø)`
...t/assets/src/SqlLab/components/QueryStateLabel.jsx	`100% <0%> (ø)`
superset/assets/src/visualizations/country_map.js	`4.9% <0%> (ø)`
... and 321 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 22b7c2d...dca0bd0. Read the comment docs.

john-bodley · 2018-07-05T17:36:42Z

superset/db_engine_specs.py

        return sql

    @classmethod
    def get_limit_from_sql(cls, sql):
-        limit_pattern = re.compile(r"""


Is the get_limit_from_sql method now obsolete?

john-bodley · 2018-07-05T17:37:31Z

superset/sql_parse.py

+                '{}, {}'.format(next(limit.get_identifiers()), new_limit)
+            )
+        flattened = self._parsed[0].tokens
+        str_res = ''


I think you can simply do str(flattened) or flattened.value.

john-bodley · 2018-07-05T17:45:14Z

superset/sql_parse.py

+                break
+        if not limit_token:
+            return limit_token
+        return self._get_limit_from_token(limit_token)


Why not have this logic on like #149 and thus there's no need for the break? This also means that limit_token only needs to be scoped if there exists a limit keyword. Functions return None by default.

def _extract_limit_from_outermost_layer(self, statement): for pos, item in enumerate(statement.tokens): if item.ttype in Keyword and item.value.lower() == 'limit': return statement.tokens[pos + 2]

Also the term outermost may be misleading. You can argue that iterating backwards would find the outermost LIMIT condition.

john-bodley · 2018-07-05T17:47:28Z

superset/sql_parse.py

+            return self.sql + ' LIMIT ' + str(new_limit)
+        limit_pos = None
+        # Add all items to before_str until there is a limit
+        for pos, item in enumerate(self._parsed[0].tokens):


Why are we repeating the logic here for searching for the limit keyword. Can't we do this all in one place, i.e., I think the logic can further be simplified, by finding and replacing in the same function.

john-bodley · 2018-07-05T17:49:46Z

superset/sql_parse.py

+                break
+        limit = self._parsed[0].tokens[limit_pos + 2]
+        if limit.ttype == sqlparse.tokens.Literal.Number.Integer:
+            self._parsed[0].tokens[limit_pos + 2].value = new_limit


The self._parsed[0].tokens array contains a list of references to the Token objects and thus you can mutate the token in-place (the reference remains unchanged) so this can simply be

limit.value = new_limit

john-bodley · 2018-07-05T17:52:23Z

tests/db_engine_specs_test.py

+
+    def test_limit_with_explicit_offset(self):
+        self.sql_limit_regex(
+            textwrap.dedent("""\


Why the need for textwrap and the \ after the tripple-quotes?

john-bodley · 2018-07-05T17:55:52Z

@timifasubaa this is generally the right approach but I sense there could be some additional simplification, i.e., I don't think we need to first i) find (and store) the limit and then go through and then ii) re-find the limit and replace it. It seems that this could all happen in the same function which would reduce the need to repeat code.

timifasubaa · 2018-07-05T22:23:19Z

I will reexamine the usage of limit in the query table in another PR and potentially further simplify the code.

* include items after limit to the modified query * use sqlparse

timifasubaa force-pushed the fix_sqllab_limit_regex branch 2 times, most recently from 2bb3290 to e596ff2 Compare June 27, 2018 00:20

john-bodley reviewed Jun 27, 2018

View reviewed changes

john-bodley mentioned this pull request Jun 27, 2018

[SQL lab] Fixed limit issue #5281

Closed

john-bodley reviewed Jun 27, 2018

View reviewed changes

conglei reviewed Jun 27, 2018

View reviewed changes

include items after limit to the modified query

de31886

timifasubaa force-pushed the fix_sqllab_limit_regex branch 6 times, most recently from 77d17d0 to 9675b17 Compare June 29, 2018 21:30

timifasubaa changed the title ~~[WIP] Fix sqllab limit regex~~ [sqllab] Fix sqllab limit regex issue with sqlparse Jun 29, 2018

timifasubaa force-pushed the fix_sqllab_limit_regex branch 5 times, most recently from 0ec90f5 to f753df7 Compare June 30, 2018 00:32

john-bodley requested changes Jun 30, 2018

View reviewed changes

timifasubaa force-pushed the fix_sqllab_limit_regex branch 8 times, most recently from b5ebb70 to 5e90d6a Compare July 2, 2018 22:01

timifasubaa force-pushed the fix_sqllab_limit_regex branch 7 times, most recently from 78029be to c5512c6 Compare July 3, 2018 17:38

john-bodley reviewed Jul 5, 2018

View reviewed changes

timifasubaa force-pushed the fix_sqllab_limit_regex branch 2 times, most recently from 8206588 to f599919 Compare July 5, 2018 22:07

timifasubaa force-pushed the fix_sqllab_limit_regex branch 7 times, most recently from 0c10221 to 8cad9ed Compare July 13, 2018 20:14

use sqlparse

dca0bd0

timifasubaa force-pushed the fix_sqllab_limit_regex branch from 8cad9ed to dca0bd0 Compare July 13, 2018 23:14

john-bodley approved these changes Jul 16, 2018

View reviewed changes

timifasubaa merged commit f8a6e09 into apache:master Jul 16, 2018

timifasubaa mentioned this pull request Jul 17, 2018

[sqllab] remove limiting at the display level #5413

Merged

timifasubaa added a commit to airbnb/superset-fork that referenced this pull request Jul 25, 2018

[sqllab] Fix sqllab limit regex issue with sqlparse (apache#5295)

376a9c8

* include items after limit to the modified query * use sqlparse

wenchma pushed a commit to wenchma/incubator-superset that referenced this pull request Nov 16, 2018

[sqllab] Fix sqllab limit regex issue with sqlparse (apache#5295)

9dbe665

* include items after limit to the modified query * use sqlparse

mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.28.0 labels Feb 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[sqllab] Fix sqllab limit regex issue with sqlparse #5295

[sqllab] Fix sqllab limit regex issue with sqlparse #5295

timifasubaa commented Jun 27, 2018 •

edited

Loading

john-bodley Jun 27, 2018

john-bodley Jun 27, 2018

timifasubaa Jun 28, 2018

john-bodley Jun 27, 2018

john-bodley Jun 27, 2018

john-bodley Jun 27, 2018

john-bodley Jun 27, 2018

conglei left a comment

john-bodley commented Jun 27, 2018

timifasubaa commented Jun 29, 2018

john-bodley Jun 30, 2018

john-bodley Jun 30, 2018

john-bodley Jun 30, 2018

codecov-io commented Jul 3, 2018 •

edited

Loading

john-bodley Jul 5, 2018

john-bodley Jul 5, 2018

john-bodley Jul 5, 2018 •

edited

Loading

john-bodley Jul 5, 2018

john-bodley Jul 5, 2018

john-bodley Jul 5, 2018

john-bodley Jul 5, 2018

john-bodley commented Jul 5, 2018

timifasubaa commented Jul 5, 2018

[sqllab] Fix sqllab limit regex issue with sqlparse #5295

[sqllab] Fix sqllab limit regex issue with sqlparse #5295

Conversation

timifasubaa commented Jun 27, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

conglei left a comment

Choose a reason for hiding this comment

john-bodley commented Jun 27, 2018

timifasubaa commented Jun 29, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov-io commented Jul 3, 2018 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

john-bodley Jul 5, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

john-bodley commented Jul 5, 2018

timifasubaa commented Jul 5, 2018

timifasubaa commented Jun 27, 2018 •

edited

Loading

codecov-io commented Jul 3, 2018 •

edited

Loading

john-bodley Jul 5, 2018 •

edited

Loading