Feat(duckdb): support TO_JSON #1803

Merged 2 commits on Jun 19, 2023

Conversation

@r1b (Contributor) commented Jun 19, 2023

Only added write=duckdb. I think we would need something like an exp.JSONParse to support read=duckdb, since TO_JSON returns a JSON type instead of TEXT.

Ref: DuckDB - JSON Creation Functions
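For illustration, a minimal usage sketch of what the write-side support enables, assuming this PR's exp.JSONFormat -> TO_JSON mapping (JSON_FORMAT is the Trino function that parses to exp.JSONFormat; the exact output is an expectation, not taken from the PR):

>>> import sqlglot
>>> sqlglot.transpile("SELECT JSON_FORMAT(col)", read="trino", write="duckdb")
["SELECT TO_JSON(col)"]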

Comment on lines 90 to 93:

    if not this.type:
        from sqlglot.optimizer.annotate_types import annotate_types

        annotate_types(this)
Collaborator:

What case triggered the need to annotate types here? Can't we always just cast to TEXT?

Owner:

This should only be done in the annotate_types module; let's delete this.

Collaborator:

I assume this whole section was copy-pasted from Hive?

def _json_format_sql(self: generator.Generator, expression: exp.JSONFormat) -> str:
    this = expression.this
    if not this.type:
        from sqlglot.optimizer.annotate_types import annotate_types

        annotate_types(this)

    if this.type.is_type("json"):
        return self.sql(this)

    return self.func("TO_JSON", this, expression.args.get("options"))

@r1b (Contributor, Author):

Removed in 347d340. Note the updated test case: the thought here was that TO_JSON('"a"'::JSON) is redundant in DuckDB. Copied this from _to_json_format in hive.py.
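To make the redundancy concrete, a hypothetical DuckDB session via the duckdb Python client (the behavior shown is an assumption, namely that TO_JSON over a value that is already JSON is a no-op):

>>> import duckdb
>>> # Wrapping an existing JSON value in TO_JSON should change nothing,
>>> # so the generator can emit the inner expression alone.
>>> duckdb.sql("""SELECT TO_JSON('"a"'::JSON)::VARCHAR = '"a"'::JSON::VARCHAR""").fetchone()
(True,)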

@georgesittas (Collaborator) commented Jun 19, 2023:

IIRC the Hive approach used annotate_types because there was an issue with transpiling Trino's JSON_FORMAT, where we'd get back the invalid TO_JSON(TO_JSON(..)) SQL in Spark (see aef9cfa9c for reference).

FWIW, it looks like we're still not transpiling Trino correctly to Spark, cc: @tobymao.

A detailed explanation follows.

For example:

>>> import sqlglot
>>> sqlglot.transpile("SELECT JSON_FORMAT(JSON '[1, 2, 3]')", "trino", "spark")
["SELECT TO_JSON('[1, 2, 3]')"]

The resulting SQL here is not valid Spark:

spark-sql (default)> SELECT TO_JSON('[1, 2, 3]');
[DATATYPE_MISMATCH.INVALID_JSON_SCHEMA] Cannot resolve "to_json([1, 2, 3])" due to data type mismatch: Input schema "STRING" must be a struct, an array or a map.; line 1 pos 7;

I'm not sure if we can always transpile this function correctly without having a JSON parser, but for cases where we have nested (*) data structures represented as JSON strings, maybe we can alter the logic of _json_format_sql in Hive to output something like the following:

>>> import sqlglot
>>> sqlglot.transpile("SELECT JSON_FORMAT(JSON '[1, 2, 3]')", "trino", "spark")
["SELECT TO_JSON(FROM_JSON('[1, 2, 3]', SCHEMA_OF_JSON('[1, 2, 3]')))"]

This is now valid Spark:

spark-sql (default)> SELECT TO_JSON(FROM_JSON('[1, 2, 3]', SCHEMA_OF_JSON('[1, 2, 3]')));
to_json(from_json([1, 2, 3]))
[1,2,3]
Time taken: 0.261 seconds, Fetched 1 row(s)

Similarly, but for a JSON string that encodes a map structure:

spark-sql (default)> SELECT TO_JSON(FROM_JSON('{"a": 1, "b": "c"}', SCHEMA_OF_JSON('{"a": 1, "b": "c"}')));
to_json(from_json({"a": 1, "b": "c"}))
{"a":1,"b":"c"}
Time taken: 4.434 seconds, Fetched 1 row(s)

Note that the outer TO_JSON call was added to cast it into a string again, which is what Trino's JSON_FORMAT function does.

>>> from pyspark.sql import SQLContext
>>> from pyspark.context import SparkContext
>>> SQLContext(SparkContext()).sql("""SELECT TO_JSON(FROM_JSON('{"a": 1, "b": "c"}', SCHEMA_OF_JSON('{"a": 1, "b": "c"}')))""")
DataFrame[to_json(from_json({"a": 1, "b": "c"})): string]

I can try to implement this in another PR.

(*) I'm saying "nested" because e.g. for '"a"' this conversion fails:

spark-sql (default)> SELECT TO_JSON(FROM_JSON('"a"', SCHEMA_OF_JSON('"a"')));
[DATATYPE_MISMATCH.INVALID_JSON_SCHEMA] Cannot resolve "from_json("a")" due to data type mismatch: Input schema "STRING" must be a struct, an array or a map.; line 1 pos 15;

Owner:

sounds good

@georgesittas (Collaborator):

Btw thanks for the PRs @r1b!
