Feat(duckdb): support TO_JSON #1803

Merged 2 commits on Jun 19, 2023

Conversation

@r1b (Contributor) commented Jun 19, 2023

Only added write=duckdb. I think we would need something like an exp.JSONParse to support read=duckdb, since TO_JSON returns a JSON type instead of TEXT.

Ref: DuckDB - JSON Creation Functions
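For illustration, a minimal usage sketch of what the write-side support enables, assuming this PR's exp.JSONFormat -> TO_JSON mapping (JSON_FORMAT is the Trino function that parses to exp.JSONFormat; the exact output is an expectation, not taken from the PR):

>>> import sqlglot
>>> sqlglot.transpile("SELECT JSON_FORMAT(col)", read="trino", write="duckdb")
["SELECT TO_JSON(col)"]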

Comment on lines 90 to 93:

    if not this.type:
        from sqlglot.optimizer.annotate_types import annotate_types

        annotate_types(this)
Collaborator:

What case triggered the need to annotate types here? Can't we always just cast to TEXT?

Owner:

This should only be done in the annotate_types module; let's delete this.

Collaborator:

I assume this whole section was copy-pasted from Hive?

def _json_format_sql(self: generator.Generator, expression: exp.JSONFormat) -> str:
    this = expression.this
    if not this.type:
        from sqlglot.optimizer.annotate_types import annotate_types

        annotate_types(this)

    if this.type.is_type("json"):
        return self.sql(this)

    return self.func("TO_JSON", this, expression.args.get("options"))

@r1b (Contributor, Author):

Removed in 347d340. Note the updated test case: the thought here was that TO_JSON('"a"'::JSON) is redundant in DuckDB. Copied this from _to_json_format in hive.py.
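To make the redundancy concrete, a hypothetical DuckDB session via the duckdb Python client (the behavior shown is an assumption, namely that TO_JSON over a value that is already JSON is a no-op):

>>> import duckdb
>>> # Wrapping an existing JSON value in TO_JSON should change nothing,
>>> # so the generator can emit the inner expression alone.
>>> duckdb.sql("""SELECT TO_JSON('"a"'::JSON)::VARCHAR = '"a"'::JSON::VARCHAR""").fetchone()
(True,)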

@georgesittas (Collaborator) commented Jun 19, 2023:

IIRC the Hive approach used annotate_types because there was an issue with transpiling Trino's JSON_FORMAT, where we'd get back the invalid TO_JSON(TO_JSON(..)) SQL in Spark (see aef9cfa9c for reference).

FWIW, it looks like we're still not transpiling Trino correctly to Spark, cc: @tobymao.

A detailed explanation follows.

For example:

>>> import sqlglot
>>> sqlglot.transpile("SELECT JSON_FORMAT(JSON '[1, 2, 3]')", "trino", "spark")
["SELECT TO_JSON('[1, 2, 3]')"]

The resulting SQL here is not valid Spark:

spark-sql (default)> SELECT TO_JSON('[1, 2, 3]');
[DATATYPE_MISMATCH.INVALID_JSON_SCHEMA] Cannot resolve "to_json([1, 2, 3])" due to data type mismatch: Input schema "STRING" must be a struct, an array or a map.; line 1 pos 7;

I'm not sure if we can always transpile this function correctly without having a JSON parser, but for cases where we have nested (*) data structures represented as JSON strings, maybe we can alter the logic of _json_format_sql in Hive to output something like the following:

>>> import sqlglot
>>> sqlglot.transpile("SELECT JSON_FORMAT(JSON '[1, 2, 3]')", "trino", "spark")
["SELECT TO_JSON(FROM_JSON('[1, 2, 3]', SCHEMA_OF_JSON('[1, 2, 3]')))"]

This is now valid Spark:

spark-sql (default)> SELECT TO_JSON(FROM_JSON('[1, 2, 3]', SCHEMA_OF_JSON('[1, 2, 3]')));
to_json(from_json([1, 2, 3]))
[1,2,3]
Time taken: 0.261 seconds, Fetched 1 row(s)

Similarly, but for a JSON string that encodes a map structure:

spark-sql (default)> SELECT TO_JSON(FROM_JSON('{"a": 1, "b": "c"}', SCHEMA_OF_JSON('{"a": 1, "b": "c"}')));
to_json(from_json({"a": 1, "b": "c"}))
{"a":1,"b":"c"}
Time taken: 4.434 seconds, Fetched 1 row(s)

Note that the outer TO_JSON call was added to cast it into a string again, which is what Trino's JSON_FORMAT function does.

>>> from pyspark.sql import SQLContext
>>> from pyspark.context import SparkContext
>>> SQLContext(SparkContext()).sql("""SELECT TO_JSON(FROM_JSON('{"a": 1, "b": "c"}', SCHEMA_OF_JSON('{"a": 1, "b": "c"}')))""")
DataFrame[to_json(from_json({"a": 1, "b": "c"})): string]

I can try to implement this in another PR.

(*) I'm saying "nested" because e.g. for '"a"' this conversion fails:

spark-sql (default)> SELECT TO_JSON(FROM_JSON('"a"', SCHEMA_OF_JSON('"a"')));
[DATATYPE_MISMATCH.INVALID_JSON_SCHEMA] Cannot resolve "from_json("a")" due to data type mismatch: Input schema "STRING" must be a struct, an array or a map.; line 1 pos 15;

Owner:

sounds good

@georgesittas (Collaborator):

Btw thanks for the PRs @r1b!
