ParserError when use OVER for Hive and Spark dialect #2043

eiphy · 2023-08-13T01:16:25Z

Before you file an issue

Make sure you specify the "read" dialect eg. parse_one(sql, read="spark") | Yes I use read="hive"
Check if the issue still exists on main | Yes, I tested on the main branch.

Fully reproducible code snippet

from sqlglot import parse_one
parse_one("select ROW_NUMBER() OVER(DISTRIBUTE BY id) FROM t", read="hive")
parse_one("select ROW_NUMBER() OVER(DISTRIBUTE BY id) FROM t", read="spark")

Official Documentation
https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-window.html
https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/language_manual/ptf-window.html

tobymao · 2023-08-13T01:43:28Z

what’s the difference between partition by and distribute by or are they the same?

tobymao · 2023-08-13T01:44:16Z

the second link you posted has no reference to distribute by

eiphy · 2023-08-13T02:17:03Z

> what’s the difference between partition by and distribute by or are they the same?

They are different. DISTRIBUTE BY controls how rows are allocated to reducers, e.g., rows with the same column value are sent to the same reducer.

The link for distribute by:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy
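
To illustrate the reducer allocation described above, here is a toy sketch in plain Python. It is not how Hive implements this: the function name `distribute_by`, the CRC32 hash, and the sample rows are all assumptions chosen for illustration. It only shows the property that rows sharing a key value end up on the same reducer.

```python
from collections import defaultdict
from zlib import crc32


def distribute_by(rows, key, num_reducers):
    """Toy model: pick a reducer for each row by hashing its key column.

    Hypothetical helper for illustration only; Hive's real hashing differs.
    """
    reducers = defaultdict(list)
    for row in rows:
        # Equal key values always hash to the same reducer id.
        reducer_id = crc32(str(row[key]).encode()) % num_reducers
        reducers[reducer_id].append(row)
    return dict(reducers)


rows = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 1, "v": "c"}]
assignment = distribute_by(rows, "id", num_reducers=4)

for reducer_id, assigned in sorted(assignment.items()):
    print(reducer_id, assigned)
```

Both rows with `id` 1 land in the same bucket. Which bucket that is depends on the hash function, so the sketch makes no claim about specific reducer numbers.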

tobymao · 2023-08-13T02:40:11Z

i don't think that's right, that link is for select * from x SORT BY y, that's not the same as a window function

eiphy · 2023-08-13T02:48:59Z

> i don't think that's right, that link is for select * from x SORT BY y, that's not the same as a window function

That's true. But I tested it on Hive 3.1.0 and the expression executed correctly, so I assume it is still a valid expression?

BTW, does over(distribute by) for spark make sense to you?

tobymao · 2023-08-13T02:50:32Z

yes, databricks documentation confirms it's simply an alias

eiphy · 2023-08-19T01:38:59Z

Hi @tobymao, is this issue not resolved? I just tried on the master branch, but it seems:

tobymao · 2023-08-19T01:55:12Z

you need to specify the dialect


    Make sure you specify the "read" dialect eg. parse_one(sql, read="spark") | Yes I use read="hive"

eiphy · 2023-08-19T02:08:40Z

> you need to specify the dialect
>
> Make sure you specify the "read" dialect eg. parse_one(sql, read="spark") | Yes I use read="hive"

@tobymao
Sorry for that.

Another issue is that the current parser does not support:
SELECT ROW() OVER (DISTRIBUTE BY x SORT BY y) FROM tb

SORT BY is documented as an alias for ORDER BY inside a window specification in:
https://docs.databricks.com/en/sql/language-manual/sql-ref-window-functions.html
So I believe this should also be supported.
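
For a parser version that rejects these spellings, one possible stopgap is to rewrite the aliases before parsing. The sketch below is a hypothetical helper (`normalize_window_aliases` is not part of sqlglot) and assumes, per the Databricks page linked above, that DISTRIBUTE BY and SORT BY inside OVER (...) are plain aliases for PARTITION BY and ORDER BY. It is regex-based, so it skips window specifications that contain nested parentheses and does not recognize string literals.

```python
import re

# Alias keyword -> standard keyword inside a window specification.
_ALIASES = {"DISTRIBUTE": "PARTITION", "SORT": "ORDER"}

# Matches OVER ( ... ) only when the body has no nested parentheses.
_OVER_RE = re.compile(r"(\bOVER\s*\()([^()]*)\)", re.IGNORECASE)
_ALIAS_RE = re.compile(r"\b(DISTRIBUTE|SORT)(\s+BY)\b", re.IGNORECASE)


def normalize_window_aliases(sql: str) -> str:
    """Rewrite DISTRIBUTE BY / SORT BY to PARTITION BY / ORDER BY in OVER (...).

    Hypothetical pre-processing step; clauses outside OVER (...) are untouched.
    """

    def fix_window(match: re.Match) -> str:
        body = _ALIAS_RE.sub(
            lambda m: _ALIASES[m.group(1).upper()] + m.group(2),
            match.group(2),
        )
        return match.group(1) + body + ")"

    return _OVER_RE.sub(fix_window, sql)


print(normalize_window_aliases("SELECT ROW() OVER (DISTRIBUTE BY x SORT BY y) FROM tb"))
```

The rewritten string can then be passed to parse_one with the usual read dialect. A query-level clause such as `SELECT * FROM t DISTRIBUTE BY id` is left alone because the rewrite only applies inside OVER (...).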

tobymao closed this as completed in 689956b Aug 13, 2023
