Feat(spark): new Spark2 dialect, improve DATEDIFF sql generation BREAKING #1529

Merged (3 commits) on May 3, 2023

georgesittas
Collaborator

This PR originally aimed to improve the transpilation of DATEDIFF for Spark and Databricks, both of which inherit from Hive. It turns out that, in Spark v3.3.1 at least, this function can also take a "unit" argument, as in other dialects. This isn't documented in Spark's docs (it is in Databricks'), but it's easy to verify, for example with spark-sql:

spark-sql> SELECT DATEDIFF(MONTH, '2020-01-01', '2020-03-05'); -- v3.3.1
timestampdiff(MONTH, 2020-01-01, 2020-03-05)
2
Time taken: 0.143 seconds, Fetched 1 row(s)

Note that this is NOT supported in v2 Spark. So, if we just parsed the unit correctly, without changing the generation logic of DATEDIFF, we'd get the following query after transpiling to the current Hive / Spark / Databricks dialects:

SELECT MONTHS_BETWEEN(TO_DATE('2020-03-05'), TO_DATE('2020-01-01'));

However, this transformation changes the semantics: instead of the whole number of months returned by the previous query, MONTHS_BETWEEN returns a fractional value:

spark-sql> SELECT MONTHS_BETWEEN(TO_DATE('2020-03-05'), TO_DATE('2020-01-01')); -- v3.3.1
months_between(to_date(2020-03-05), to_date(2020-01-01), true)
2.12903226
Time taken: 0.169 seconds, Fetched 1 row(s)
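
For reference, the gap between the two results can be reproduced with a small Python sketch. It assumes that DATEDIFF(MONTH, ...) counts complete months and that MONTHS_BETWEEN follows the 31-days-per-month convention with rounding to 8 digits described in Spark's function docs. Both helper names are made up for illustration, and the sketch ignores time of day and the "both dates are the last day of the month" special case.

```python
from datetime import date


def whole_months_between(start: date, end: date) -> int:
    """Hypothetical helper: complete months from start to end (assumes start <= end)."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    # The last month only counts once the day of month has been reached.
    if end.day < start.day:
        months -= 1
    return months


def fractional_months_between(end: date, start: date) -> float:
    """Hypothetical helper: month difference with leftover days divided by 31.

    Assumption: mirrors MONTHS_BETWEEN with the default roundOff=true,
    without the last-day-of-month and time-of-day handling.
    """
    months = (end.year - start.year) * 12 + (end.month - start.month)
    return round(months + (end.day - start.day) / 31, 8)


start, end = date(2020, 1, 1), date(2020, 3, 5)
print(whole_months_between(start, end))       # 2
print(fractional_months_between(end, start))  # 2.12903226
```

The leftover 4 days account for the extra 4 / 31 ≈ 0.12903226, which is why the two functions are not interchangeable.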

I discussed this with Toby and we decided to introduce a Spark2 dialect that is equivalent to our current Spark dialect. The Spark dialect, in turn, was changed to 1) inherit from Spark2 and 2) override the SQL generation of DATEDIFF.
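
The shape of that change can be sketched roughly as follows. This is a toy model, not sqlglot's actual classes or generator API: `Spark2Sketch`, `SparkSketch` and `datediff_sql` are hypothetical names, the no-unit output is an assumption, and only the MONTH unit from the example above is handled.

```python
from typing import Optional


class Spark2Sketch:
    """Hypothetical stand-in for the Spark2 dialect (the previous Spark behaviour)."""

    def datediff_sql(self, start: str, end: str, unit: Optional[str] = None) -> str:
        if unit is None:
            # Assumption: without a unit, a plain two-argument day difference is emitted.
            return f"DATEDIFF(TO_DATE({end}), TO_DATE({start}))"
        if unit.upper() == "MONTH":
            # The rewrite quoted above; it yields fractional months.
            return f"MONTHS_BETWEEN(TO_DATE({end}), TO_DATE({start}))"
        raise NotImplementedError(f"unit {unit!r} is not covered by this sketch")


class SparkSketch(Spark2Sketch):
    """Hypothetical stand-in for the Spark dialect: inherits everything, overrides DATEDIFF."""

    def datediff_sql(self, start: str, end: str, unit: Optional[str] = None) -> str:
        if unit is None:
            return super().datediff_sql(start, end)
        # Spark 3 accepts the unit directly, so it is passed through unchanged.
        return f"DATEDIFF({unit.upper()}, {start}, {end})"


print(Spark2Sketch().datediff_sql("'2020-01-01'", "'2020-03-05'", "MONTH"))
print(SparkSketch().datediff_sql("'2020-01-01'", "'2020-03-05'", "MONTH"))
```

Keeping the old behaviour in the base class means Spark 2 users can opt into it explicitly, while the Spark dialect only has to carry the one override.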

xref: duneanalytics/harmonizer#39

@vegarsti
Contributor

vegarsti commented May 3, 2023

Nice!

@tobymao tobymao merged commit 00b4779 into main May 3, 2023
@tobymao tobymao deleted the jo/spark_date_diff_fix branch May 3, 2023 19:47
adrianisk pushed a commit to adrianisk/sqlglot that referenced this pull request Jun 21, 2023
…KING (tobymao#1529)

* Feat(spark): new Spark2 dialect, improve DATEDIFF sql generation BREAKING

* Docstring fixups

* Add annotations import