[sqllab] Fix sqllab limit regex issue with sqlparse #5295
2bb3290 to e596ff2
superset/db_engine_specs.py
Outdated
|
||
@classmethod | ||
def get_query_without_limit(cls, sql): | ||
return re.sub(r""" | ||
before_limit = re.sub(r""" |
You probably should rename the function given the change in return values. Also can we make these staticmethods.
superset/db_engine_specs.py
Outdated
sql_without_limit = cls.get_query_without_limit(sql) | ||
return '{sql_without_limit} LIMIT {limit}'.format(**locals()) | ||
sql_before_limit, sql_after_limit = cls.get_query_without_limit(sql) | ||
return '{sql_before_limit} LIMIT {limit}{sql_after_limit}'.format(**locals()) |
What happens if one explicitly specifies a limit in the SQL which is less than the override?
It does not override
superset/db_engine_specs.py
Outdated
LIMIT\s+\d+ # LIMIT $ROWS | ||
;? # optional semi-colon | ||
(\s|;)*$ # remove trailing spaces tabs or semicolons | ||
LIMIT\s+(\d+) # LIMIT $ROWS |
Can we align the comments?
superset/db_engine_specs.py
Outdated
""", '', sql) | ||
|
||
after_limit_pattern = re.compile(r""" | ||
(?ix) # case insensitive, verbose |
Similarly can we align the comments?
superset/db_engine_specs.py
Outdated
|
||
@classmethod | ||
def get_query_without_limit(cls, sql): | ||
return re.sub(r""" |
Can we make this a staticmethod?
superset/db_engine_specs.py
Outdated
@@ -114,23 +114,31 @@ def get_limit_from_sql(cls, sql): | |||
(?ix) # case insensitive, verbose | |||
\s+ # whitespace | |||
LIMIT\s+(\d+) # LIMIT $ROWS | |||
;? # optional semi-colon | |||
(\s|;)*$ # remove trailing spaces tabs or semicolons | |||
.*$ # everything else |
I think this regex will fail for queries like (albeit being somewhat unusual):
SELECT 'LIMIT 10'
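To make the concern concrete, here is a minimal sketch that reuses the pattern from the diff above, with the flags passed as arguments instead of inline. The query is a hypothetical variant of the example: the pattern requires whitespace before the keyword, so the string literal here contains some.

```python
import re

# Pattern taken from the diff under review; flags are passed explicitly.
LIMIT_PATTERN = re.compile(
    r"""
    \s+             # whitespace
    LIMIT\s+(\d+)   # LIMIT $ROWS
    .*$             # everything else
    """,
    re.IGNORECASE | re.VERBOSE,
)

# Hypothetical query: the only "LIMIT" sits inside a string literal.
sql = "SELECT 'no LIMIT 10 here' AS note"

match = LIMIT_PATTERN.search(sql)
# The regex cannot tell a quoted literal from a real clause, so it matches.
found_limit = match.group(1) if match else None
print(found_limit)
```

The query has no LIMIT clause, yet the pattern reports one, which is the kind of false positive a tokenizing parser avoids.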
Thanks @timifasubaa for continuing to work on this.
In general, there are two issues to be fixed:
- LIMIT inside quotes. This can be done using something similar to https://robrich.org/archive/2007/11/29/regex-match-content-unless-it-is-inside-quotes.aspx
- The new logic will break the tests
@conglei @timifasubaa
77d17d0 to 9675b17
@john-bodley @conglei I tried out sqlparse and it worked fine for our concerns. Feel free to suggest any extra cases you want me to cover with tests.
0ec90f5 to f753df7
superset/db_engine_specs.py
Outdated
cls.get_substrings_before_and_after_limit(sql) | ||
) | ||
sql = '{sql_before_limit} LIMIT {limit} {sql_after_limit}'.format(**locals()) | ||
re.sub('\s+', ' ', sql).strip() |
I don't think you want to be stripping additional whitespace in free-form SQL, as the author may have included it on purpose, e.g.,
SELECT * FROM my_table WHERE paragraph LIKE '%. The%'
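A small sketch of the problem, assuming a hypothetical query whose LIKE pattern deliberately contains two consecutive spaces. The substitution is the one from the diff above.

```python
import re

# Hypothetical query: the string literal contains two consecutive spaces.
sql = "SELECT * FROM my_table WHERE paragraph LIKE '%.  The%'"

# Same whitespace collapsing as in the diff under review.
collapsed = re.sub(r"\s+", " ", sql).strip()

# The literal inside the quotes is rewritten too, so the query's
# meaning changes even though no LIMIT was involved.
print(collapsed == sql)
print(collapsed)
```

The collapsed query searches for a different pattern than the one the author wrote.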
tests/db_engine_specs_test.py
Outdated
'LIMIT 777' AS a | ||
, b | ||
FROM | ||
table LIMIT 1000, 999999"""), |
Note I believe the form here is LIMIT [<offset>,] <row-count> per this and it seems this is flipped.
superset/sql_parse.py
Outdated
limit_pos = None | ||
|
||
# Add all items to before_str until there is a limit | ||
for pos, item in enumerate(self._parsed[0].tokens): |
Given you're now using sqlparse rather than trying to extract the before and after portions of the query, why don't you simply replace the relevant token in-place, i.e.,
>>> sqlparse.parse('SELECT * FROM foo LIMIT 1000')[0].tokens
[..., <Keyword 'LIMIT' at 0x10DE741F0>, <Whitespace ' ' at 0x10DE74258>, <Integer '1000' at 0x10DE742C0>]
and
>>> sqlparse.parse('SELECT * FROM foo LIMIT 10, 1000')[0].tokens
[..., <Keyword 'LIMIT' at 0x10DE74668>, <Whitespace ' ' at 0x10DE746D0>, <IdentifierList '10, 10...' at 0x10DE751D0>]
It seems that once you find the LIMIT keyword you can just jump two tokens ahead, to a token that will be either an IdentifierList or an Integer, and update it accordingly, i.e., for the first case (example code):
>>> s = sqlparse.parse('SELECT * FROM foo LIMIT 1000')[0]
>>> s.tokens[-1].value = '999'
>>> str(s)
'SELECT * FROM foo LIMIT 999'
b5ebb70 to 5e90d6a
78029be to c5512c6
Codecov Report
@@ Coverage Diff @@
## master #5295 +/- ##
===========================================
- Coverage 77.15% 61.43% -15.73%
===========================================
Files 44 373 +329
Lines 8892 23554 +14662
Branches 0 2725 +2725
===========================================
+ Hits 6861 14471 +7610
- Misses 2031 9070 +7039
- Partials 0 13 +13
|
return sql | ||
|
||
@classmethod | ||
def get_limit_from_sql(cls, sql): | ||
limit_pattern = re.compile(r""" |
Is the get_limit_from_sql method now obsolete?
superset/sql_parse.py
Outdated
'{}, {}'.format(next(limit.get_identifiers()), new_limit) | ||
) | ||
flattened = self._parsed[0].tokens | ||
str_res = '' |
I think you can simply do str(flattened) or flattened.value.
superset/sql_parse.py
Outdated
break | ||
if not limit_token: | ||
return limit_token | ||
return self._get_limit_from_token(limit_token) |
Why not have this logic on line #149 and thus there's no need for the break? This also means that limit_token only needs to be scoped if there exists a limit keyword. Functions return None by default.
def _extract_limit_from_outermost_layer(self, statement):
for pos, item in enumerate(statement.tokens):
if item.ttype in Keyword and item.value.lower() == 'limit':
return statement.tokens[pos + 2]
Also the term outermost may be misleading. You can argue that iterating backwards would find the outermost LIMIT condition.
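Both points (early return instead of break, and forward versus backward iteration) can be sketched without sqlparse. The Token class and the find_limit_value helper below are hypothetical stand-ins for illustration, not part of Superset or sqlparse, and the token list is a hand-built flat list with two LIMIT keywords.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """Hypothetical stand-in for a parsed SQL token (not the sqlparse class)."""
    kind: str   # e.g. "keyword", "whitespace", "integer", "other"
    value: str


def find_limit_value(tokens: List[Token], backwards: bool = False) -> Optional[Token]:
    """Return the token two positions after a LIMIT keyword, or None."""
    positions = range(len(tokens))
    if backwards:
        positions = reversed(positions)
    for pos in positions:
        item = tokens[pos]
        if item.kind == "keyword" and item.value.lower() == "limit":
            # Skip the whitespace token; returning here replaces `break`.
            return tokens[pos + 2]
    # Falling off the end implicitly returns None.


# Hand-built token list containing two LIMIT keywords.
tokens = [
    Token("keyword", "LIMIT"), Token("whitespace", " "), Token("integer", "5"),
    Token("other", ")"), Token("whitespace", " "),
    Token("keyword", "limit"), Token("whitespace", " "), Token("integer", "10"),
]

first = find_limit_value(tokens)
last = find_limit_value(tokens, backwards=True)
print(first.value, last.value)
```

Scanning forward stops at the first LIMIT and scanning backward stops at the last one, so the direction decides which clause the name "outermost" would refer to.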
superset/sql_parse.py
Outdated
return self.sql + ' LIMIT ' + str(new_limit) | ||
limit_pos = None | ||
# Add all items to before_str until there is a limit | ||
for pos, item in enumerate(self._parsed[0].tokens): |
Why are we repeating the logic here for searching for the limit keyword? Can't we do this all in one place, i.e., I think the logic can be further simplified by finding and replacing in the same function.
superset/sql_parse.py
Outdated
break | ||
limit = self._parsed[0].tokens[limit_pos + 2] | ||
if limit.ttype == sqlparse.tokens.Literal.Number.Integer: | ||
self._parsed[0].tokens[limit_pos + 2].value = new_limit |
The self._parsed[0].tokens array contains a list of references to the Token objects and thus you can mutate the token in-place (the reference remains unchanged) so this can simply be
limit.value = new_limit
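The reference semantics can be checked with plain Python objects. The Token class here is a hypothetical stand-in with only a value attribute, not the sqlparse class.

```python
class Token:
    """Hypothetical stand-in for a parsed token; only `value` matters here."""

    def __init__(self, value):
        self.value = value


tokens = [Token("LIMIT"), Token(" "), Token("1000")]
limit_pos = 0

# `limit` is a reference to the same object the list holds, not a copy.
limit = tokens[limit_pos + 2]
limit.value = "999"

rebuilt = "".join(t.value for t in tokens)
print(tokens[limit_pos + 2] is limit)
print(rebuilt)
```

Because the list and the local name point at the same object, assigning through the local name is enough and the list does not need to be indexed again.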
tests/db_engine_specs_test.py
Outdated
|
||
def test_limit_with_explicit_offset(self): | ||
self.sql_limit_regex( | ||
textwrap.dedent("""\ |
Why the need for textwrap and the \ after the triple-quotes?
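For reference, a short sketch of what the two constructs do, using a hypothetical two-line query rather than the one in the test.

```python
import textwrap

# Without the backslash the literal starts with a newline, and the
# indentation used to keep the test readable ends up inside the SQL.
plain = """
    SELECT *
    FROM t"""

# The backslash suppresses the newline after the opening quotes, and
# dedent() strips the indentation common to all lines.
dedented = textwrap.dedent("""\
    SELECT *
    FROM t""")

print(repr(plain))
print(repr(dedented))
```

Whether the extra newline and indentation matter depends on how the test compares the expected and actual SQL.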
@timifasubaa this is generally the right approach, but I sense there could be some additional simplification: I don't think we need to i) find (and store) the limit and then ii) re-find the limit and replace it. It seems this could all happen in the same function, which would reduce the need to repeat code.
8206588 to f599919
I will reexamine the usage of limit in the query table in another PR and potentially further simplify the code.
0c10221 to 8cad9ed
8cad9ed to dca0bd0
* include items after limit to the modified query * use sqlparse
This PR fixes the sqllab regex issues with Offset and tricky limits that regex doesn't handle correctly. It replaces the regex approach with sqlparse.
The queries below failed before but now they are successful. Fixes #5272
select * from table where a=True limit 1, 2
select * from table where a=True limit 1 offset 10
select 'limit 10'
@conglei @john-bodley @mistercrunch