ParserError when use OVER for Hive and Spark dialect #2043

Closed
eiphy opened this issue Aug 13, 2023 · 9 comments

Comments

eiphy commented Aug 13, 2023

Before you file an issue

  • Make sure you specify the "read" dialect eg. parse_one(sql, read="spark") | Yes I use read="hive"
  • Check if the issue still exists on main | Yes, I tested on the main branch.

Fully reproducible code snippet

from sqlglot import parse_one

# Both calls are reported to fail on the OVER(DISTRIBUTE BY ...) window spec.
parse_one("select ROW_NUMBER() OVER(DISTRIBUTE BY id) FROM t", read="hive")
parse_one("select ROW_NUMBER() OVER(DISTRIBUTE BY id) FROM t", read="spark")

Official Documentation
https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-window.html
https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/language_manual/ptf-window.html

tobymao (Owner) commented Aug 13, 2023

what’s the difference between partition by and distribute by or are they the same?

tobymao (Owner) commented Aug 13, 2023

the second link you posted has no reference to distribute by

eiphy (Author) commented Aug 13, 2023

> what’s the difference between partition by and distribute by or are they the same?

They are different. DISTRIBUTE BY controls how rows are allocated to reducers, e.g., rows with the same column value are sent to the same reducer.

The link for DISTRIBUTE BY:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy

tobymao (Owner) commented Aug 13, 2023

i don't think that's right, that link is for select * from x SORT BY y, that's not the same as a window function

eiphy (Author) commented Aug 13, 2023

> i don't think that's right, that link is for select * from x SORT BY y, that's not the same as a window function

That's true. But I have tested on Hive 3.1.0 and the expression is executed correctly. So I guess this means this is still a valid expression?

BTW, does over(distribute by) for spark make sense to you?

tobymao (Owner) commented Aug 13, 2023

yes, databricks documentation confirms it's simply an alias

eiphy (Author) commented Aug 19, 2023

Hi @tobymao, is this issue not resolved yet? I just tried on the master branch, but it still seems to fail:
[screenshot]

tobymao (Owner) commented Aug 19, 2023

you need to specify the dialect

> Make sure you specify the "read" dialect eg. parse_one(sql, read="spark") | Yes I use read="hive"

eiphy (Author) commented Aug 19, 2023

> you need to specify the dialect
>
> > Make sure you specify the "read" dialect eg. parse_one(sql, read="spark") | Yes I use read="hive"

@tobymao Sorry about that.

Another issue is that the current parser does not support:
SELECT ROW() OVER (DISTRIBUTE BY x SORT BY y) FROM tb

Inside a window spec, the SORT BY clause is documented as an alias for ORDER BY:
https://docs.databricks.com/en/sql/language-manual/sql-ref-window-functions.html
So I believe this should also be supported.
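
For anyone on a version where the parser still rejects these spellings, here is a rough workaround sketch. It relies on the alias equivalence discussed in this thread (DISTRIBUTE BY for PARTITION BY and SORT BY for ORDER BY inside a window spec). `normalize_window_aliases` is a hypothetical helper, not part of sqlglot. It only rewrites text inside `OVER ( ... )` and is not aware of string literals or comments, so treat it as an illustration rather than a robust fix.

```python
import re

# Hypothetical pre-processing helper (NOT part of sqlglot): rewrite the
# Hive/Spark window-spec aliases to their standard spellings.
_ALIASES = {"DISTRIBUTE": "PARTITION", "SORT": "ORDER"}
_OVER = re.compile(r"\bOVER\s*\(", re.IGNORECASE)
_ALIAS = re.compile(r"\b(DISTRIBUTE|SORT)(\s+BY)\b", re.IGNORECASE)


def _matching_paren(sql: str, open_idx: int) -> int:
    """Return the index of the ')' closing the '(' at open_idx, or -1."""
    depth = 0
    for i in range(open_idx, len(sql)):
        if sql[i] == "(":
            depth += 1
        elif sql[i] == ")":
            depth -= 1
            if depth == 0:
                return i
    return -1


def normalize_window_aliases(sql: str) -> str:
    """Rewrite DISTRIBUTE BY / SORT BY only inside OVER ( ... ) specs."""
    out = []
    pos = 0
    for m in _OVER.finditer(sql):
        if m.start() < pos:
            continue  # already covered by a previously handled window spec
        open_idx = m.end() - 1
        close_idx = _matching_paren(sql, open_idx)
        if close_idx == -1:
            break  # unbalanced parentheses: leave the remainder untouched
        out.append(sql[pos:open_idx])
        spec = sql[open_idx:close_idx + 1]
        out.append(
            _ALIAS.sub(lambda a: _ALIASES[a.group(1).upper()] + a.group(2), spec)
        )
        pos = close_idx + 1
    out.append(sql[pos:])
    return "".join(out)


print(normalize_window_aliases(
    "SELECT ROW() OVER (DISTRIBUTE BY x SORT BY y) FROM tb"
))
# prints: SELECT ROW() OVER (PARTITION BY x ORDER BY y) FROM tb
```

The rewritten string can then be passed to parse_one(..., read="hive") as usual. A statement-level `DISTRIBUTE BY` / `SORT BY` (outside a window spec) is deliberately left alone, since that is the different, reducer-level feature mentioned earlier in the thread.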
