Feat(duckdb): support TO_JSON #1803
Conversation
sqlglot/dialects/duckdb.py (outdated):
if not this.type:
    from sqlglot.optimizer.annotate_types import annotate_types

    annotate_types(this)
What case triggered the need to annotate types here? Can't we always just cast to TEXT?
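For reference, the "always cast to TEXT" alternative might look roughly like this in the generator; this is a minimal sketch using sqlglot's exp.cast helper, and the function name _to_json_sql is hypothetical:

    from sqlglot import exp, generator

    def _to_json_sql(self: generator.Generator, expression: exp.JSONFormat) -> str:
        # Unconditionally cast the argument to TEXT instead of running
        # annotate_types to inspect its type first.
        return self.func("TO_JSON", exp.cast(expression.this, "text"))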
This should only be done in the annotate_types module, let's delete this.
I assume this whole section was copy-pasted from Hive?
sqlglot/dialects/hive.py, lines 87 to 96 in 4da37aa:
def _json_format_sql(self: generator.Generator, expression: exp.JSONFormat) -> str:
    this = expression.this
    if not this.type:
        from sqlglot.optimizer.annotate_types import annotate_types

        annotate_types(this)

    if this.type.is_type("json"):
        return self.sql(this)
    return self.func("TO_JSON", this, expression.args.get("options"))
IIRC the Hive approach used annotate_types because there was an issue with transpiling Trino's JSON_FORMAT, where we'd get back the invalid TO_JSON(TO_JSON(..)) SQL in Spark (see aef9cfa9c for reference). FWIW, it looks like we're still not transpiling Trino correctly to Spark, cc: @tobymao.
For example:
>>> import sqlglot
>>> sqlglot.transpile("SELECT JSON_FORMAT(JSON '[1, 2, 3]')", "trino", "spark")
["SELECT TO_JSON('[1, 2, 3]')"]
The resulting SQL here is not valid Spark:
spark-sql (default)> SELECT TO_JSON('[1, 2, 3]');
[DATATYPE_MISMATCH.INVALID_JSON_SCHEMA] Cannot resolve "to_json([1, 2, 3])" due to data type mismatch: Input schema "STRING" must be a struct, an array or a map.; line 1 pos 7;
I'm not sure if we can always transpile this function correctly without having a JSON parser, but for cases where we have nested (*) data structures represented as JSON strings, maybe we can alter the logic of _json_format_sql in Hive to output something like the following:
>>> import sqlglot
>>> sqlglot.transpile("SELECT JSON_FORMAT(JSON '[1, 2, 3]')", "trino", "spark")
["SELECT TO_JSON(FROM_JSON('[1, 2, 3]', SCHEMA_OF_JSON('[1, 2, 3]')))"]
This is now valid Spark:
spark-sql (default)> SELECT TO_JSON(FROM_JSON('[1, 2, 3]', SCHEMA_OF_JSON('[1, 2, 3]')));
to_json(from_json([1, 2, 3]))
[1,2,3]
Time taken: 0.261 seconds, Fetched 1 row(s)
Similarly, but for a JSON string that encodes a map structure:
spark-sql (default)> SELECT TO_JSON(FROM_JSON('{"a": 1, "b": "c"}', SCHEMA_OF_JSON('{"a": 1, "b": "c"}')));
to_json(from_json({"a": 1, "b": "c"}))
{"a":1,"b":"c"}
Time taken: 4.434 seconds, Fetched 1 row(s)
Note that the outer TO_JSON call was added to cast it into a string again, which is what Trino's JSON_FORMAT function does.
>>> from pyspark.sql import SQLContext
>>> from pyspark.context import SparkContext
>>> SQLContext(SparkContext()).sql("""SELECT TO_JSON(FROM_JSON('{"a": 1, "b": "c"}', SCHEMA_OF_JSON('{"a": 1, "b": "c"}')))""")
DataFrame[to_json(from_json({"a": 1, "b": "c"})): string]
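Concretely, a minimal sketch of that altered _json_format_sql, reusing the Hive version quoted above; the string-literal special-casing is an assumption about how this could be wired up, not tested code:

    from sqlglot import exp, generator

    def _json_format_sql(self: generator.Generator, expression: exp.JSONFormat) -> str:
        this = expression.this
        if isinstance(this, exp.Literal) and this.is_string:
            # Spark's TO_JSON rejects plain strings, so parse the JSON string
            # literal first and serialize the result back to a string.
            literal = self.sql(this)
            return f"TO_JSON(FROM_JSON({literal}, SCHEMA_OF_JSON({literal})))"
        return self.func("TO_JSON", this, expression.args.get("options"))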
I can try to implement this in another PR.
(*) I'm saying "nested" because e.g. for '"a"' this conversion fails:
spark-sql (default)> SELECT TO_JSON(FROM_JSON('"a"', SCHEMA_OF_JSON('"a"')));
[DATATYPE_MISMATCH.INVALID_JSON_SCHEMA] Cannot resolve "from_json("a")" due to data type mismatch: Input schema "STRING" must be a struct, an array or a map.; line 1 pos 15;
sounds good
Btw thanks for the PRs @r1b!
Only added write=duckdb. I think we would need something like an exp.JSONParse to support read=duckdb, since TO_JSON returns a JSON type instead of TEXT. Ref: DuckDB - JSON Creation Functions
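For what it's worth, a rough sketch of what that read-side support could look like, assuming sqlglot's usual pattern of a Func subclass plus a FUNCTIONS override in the dialect's parser (JSONParse and MyDuckDB are hypothetical names):

    from sqlglot import exp
    from sqlglot.dialects.duckdb import DuckDB

    class JSONParse(exp.Func):
        """Hypothetical node whose return type would be JSON, not TEXT."""

    class MyDuckDB(DuckDB):
        class Parser(DuckDB.Parser):
            FUNCTIONS = {
                **DuckDB.Parser.FUNCTIONS,
                # Parse DuckDB's TO_JSON into the new node so round-tripping
                # doesn't assume a TEXT result.
                "TO_JSON": JSONParse.from_arg_list,
            }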