Feat(spark): new Spark2 dialect, improve DATEDIFF sql generation BREAKING #1529

georgesittas · 2023-05-03T17:07:13Z

This PR originally aimed to improve the transpilation of DATEDIFF for Spark and Databricks, both of which inherit from Hive. It turns out that this function can also receive a "unit" argument in (at least) Spark v3.3.1, similar to other dialects. This isn't documented in Spark's docs (it is in Databricks'), but it is easy to verify using, for example, spark-sql:

spark-sql> SELECT DATEDIFF(MONTH, '2020-01-01', '2020-03-05'); -- v3.3.1
timestampdiff(MONTH, 2020-01-01, 2020-03-05)
2
Time taken: 0.143 seconds, Fetched 1 row(s)

Note that the unit argument is NOT supported in Spark v2. So, if we only parsed the unit correctly, without changing the generation logic of DATEDIFF, transpiling to the current Hive / Spark / Databricks dialects would produce the following query:

SELECT MONTHS_BETWEEN(TO_DATE('2020-03-05'), TO_DATE('2020-01-01'));

However, this transformation changes the semantics relative to the original query: MONTHS_BETWEEN returns a fractional number of months, not a whole number:

spark-sql> SELECT MONTHS_BETWEEN(TO_DATE('2020-03-05'), TO_DATE('2020-01-01')); -- v3.3.1
months_between(to_date(2020-03-05), to_date(2020-01-01), true)
2.12903226
Time taken: 0.169 seconds, Fetched 1 row(s)
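To make the gap between the two results concrete, here is a minimal Python sketch of both calculations for plain dates. `months_between` follows the rule documented for Spark's MONTHS_BETWEEN (31-day months, rounded to 8 digits). `whole_months` is only an assumption about how the unit form behaves (dropping the partial month); the output above shows just the single result `2`. Both function names are illustrative and not part of sqlglot or Spark.

```python
from datetime import date, timedelta


def is_last_day_of_month(d: date) -> bool:
    # The next calendar day falls in a different month.
    return (d + timedelta(days=1)).month != d.month


def months_between(end: date, start: date) -> float:
    """Fractional months between two dates, MONTHS_BETWEEN-style."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    same_day = end.day == start.day
    both_last = is_last_day_of_month(end) and is_last_day_of_month(start)
    if same_day or both_last:
        return float(months)
    # Otherwise the day difference is counted against a 31-day month.
    return round(months + (end.day - start.day) / 31, 8)


def whole_months(start: date, end: date) -> int:
    """Whole months between two dates (assumed unit-form behaviour)."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    # Drop the partial month, moving the count toward zero.
    if months > 0 and end.day < start.day:
        months -= 1
    elif months < 0 and end.day > start.day:
        months += 1
    return months


start, end = date(2020, 1, 1), date(2020, 3, 5)
print(whole_months(start, end))    # 2
print(months_between(end, start))  # 2.12903226
```

The fractional part is (5 - 1) / 31 ≈ 0.12903226, which is why a plain rewrite to MONTHS_BETWEEN cannot stand in for the unit form.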

I discussed this with Toby and we decided to introduce a Spark2 dialect that is equivalent to our current Spark dialect. The Spark dialect, in turn, now 1) inherits from Spark2 and 2) overrides the SQL generation of DATEDIFF.
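The inheritance arrangement can be sketched roughly as follows. This is a toy illustration, not the code in `sqlglot/dialects/spark.py`: the class names `Spark2Generator` and `SparkGenerator`, the `datediff_sql` method and its signature are all hypothetical. The MONTH branch mirrors the MONTHS_BETWEEN query shown earlier; the unit-less branch is an assumption.

```python
from typing import Optional


class Spark2Generator:
    """Hypothetical stand-in for the Spark v2 generator (no unit argument)."""

    def datediff_sql(self, start: str, end: str, unit: Optional[str] = None) -> str:
        if unit is None or unit.upper() == "DAY":
            # Two-argument form: DATEDIFF(end, start) counts days.
            return f"DATEDIFF({end}, {start})"
        if unit.upper() == "MONTH":
            # Fallback for v2; note that this yields fractional months.
            return f"MONTHS_BETWEEN(TO_DATE({end}), TO_DATE({start}))"
        raise ValueError(f"Unsupported unit for this sketch: {unit}")


class SparkGenerator(Spark2Generator):
    """Hypothetical stand-in for the v3 generator: inherits, then overrides."""

    def datediff_sql(self, start: str, end: str, unit: Optional[str] = None) -> str:
        if unit is None:
            # No unit given: the inherited behaviour is still correct.
            return super().datediff_sql(start, end)
        # v3 accepts the unit directly, so it is passed through unchanged.
        return f"DATEDIFF({unit.upper()}, {start}, {end})"


args = ("'2020-01-01'", "'2020-03-05'", "month")
print(Spark2Generator().datediff_sql(*args))
print(SparkGenerator().datediff_sql(*args))
```

Keeping the old behaviour in the parent class means anything targeting Spark v2 is unaffected, while only the unit-aware path differs in the child.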

xref: duneanalytics/harmonizer#39

vegarsti · 2023-05-03T17:26:16Z

Nice!

Feat(spark): new Spark2 dialect, improve DATEDIFF sql generation BREAKING

92f5d53

georgesittas requested review from eakmanrq, tobymao and barakalon May 3, 2023 17:07

Docstring fixups

685c0ad

tobymao reviewed May 3, 2023

Add annotations import

1a477cd

tobymao approved these changes May 3, 2023

tobymao merged commit 00b4779 into main May 3, 2023

tobymao deleted the jo/spark_date_diff_fix branch May 3, 2023 19:47
