
[SPARK-49992][SQL] Default collation resolution for DDL and DML queries #48844

Draft · stefankandic wants to merge 32 commits into master

Conversation

@stefankandic (Contributor) commented Nov 14, 2024

What changes were proposed in this pull request?

This PR proposes not using session-level collation in DDL commands (create/alter view/table, add/replace columns).

Also, resolution of the default collation should happen in the analyzer, not in the parser. However, because we detect the default string type with a reference-equality check against the StringType object, we cannot simply replace that object with StringType("UTF8_BINARY"): the two compare as equal, so the tree node framework would just return the old plan. Instead, the resolution is performed in two passes: the default StringType object is first replaced with a TemporaryStringType, which is then replaced with StringType("UTF8_BINARY"), no longer considered a default string type.
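
A minimal sketch of this two-pass idea, shown over a schema rather than a full logical plan; the object and helper names are illustrative, not the PR's actual code, and it assumes TemporaryStringType is declared somewhere with access to StringType's constructor:

```scala
import org.apache.spark.sql.types.{DataType, StringType, StructType}

// Illustrative sketch only; the real rule runs in the analyzer over logical plans.
object DefaultStringTypeResolutionSketch {

  // Hypothetical marker type: a distinct class, so swapping it in and out is seen as a
  // real change by the tree framework. 0 is assumed to be the UTF8_BINARY collation id,
  // and this declaration assumes StringType's constructor is accessible here.
  case class TemporaryStringType() extends StringType(0)

  // The unresolved default string type is detected by reference equality
  // with the singleton StringType object.
  private def isDefaultStringType(dt: DataType): Boolean = dt eq StringType

  def resolve(schema: StructType): StructType = {
    // Pass 1: default StringType -> TemporaryStringType. Replacing it directly with
    // StringType("UTF8_BINARY") would compare as equal, and the old node would be kept.
    val pass1 = mapTopLevelTypes(schema) {
      case dt if isDefaultStringType(dt) => TemporaryStringType()
      case dt => dt
    }
    // Pass 2: TemporaryStringType -> StringType("UTF8_BINARY"),
    // which is no longer treated as a default string type.
    mapTopLevelTypes(pass1) {
      case _: TemporaryStringType => StringType("UTF8_BINARY")
      case dt => dt
    }
  }

  // Top-level fields only, for brevity; nested types would need recursion.
  private def mapTopLevelTypes(schema: StructType)(f: DataType => DataType): StructType =
    StructType(schema.fields.map(field => field.copy(dataType = f(field.dataType))))
}
```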

Why are the changes needed?

The default collation for DDL commands should be tied to the object being created or altered (e.g., table, view, schema) rather than the session-level setting. Since object-level collations are not yet supported, we will assume the UTF8_BINARY collation by default for now.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added new unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Nov 14, 2024
@@ -30,7 +30,7 @@ import org.apache.spark.sql.catalyst.util.CollationFactory
  * The id of collation for this StringType.
  */
 @Stable
-class StringType private (val collationId: Int) extends AtomicType with Serializable {
+class StringType private[sql] (val collationId: Int) extends AtomicType with Serializable {
Reviewer (Contributor) commented:

No need to do this, as we already have def apply(collationId: Int) in object StringType.
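
For illustration, the existing factory the comment refers to means callers can already obtain a collated StringType without touching the constructor's visibility; the collation id below is an assumed value, not verified against CollationFactory:

```scala
import org.apache.spark.sql.types.StringType

// Existing companion-object factory (per the comment above):
//   def apply(collationId: Int): StringType
// so widening the constructor to private[sql] is unnecessary.
val collated: StringType = StringType(collationId = 0) // 0 assumed to be UTF8_BINARY
```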

@@ -43,7 +43,7 @@ import org.apache.spark.sql.types._
 case class CreateTable(
     tableDesc: CatalogTable,
     mode: SaveMode,
-    query: Option[LogicalPlan]) extends LogicalPlan {
+    query: Option[LogicalPlan]) extends LogicalPlan with V1DDLCommand {
Reviewer (Contributor) commented:

I don't think this works. We need to resolve string collation in tableDesc.schema, and the rule must match CreateTable directly to do it.
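
A rough sketch of what this suggestion could look like; the rule name and the resolveDefaultStrings helper are assumptions rather than the PR's actual code, and nested types are ignored for brevity:

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.datasources.CreateTable
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Sketch: match CreateTable directly so the default collation is applied to
// tableDesc.schema, rather than relying on a shared V1DDLCommand trait.
object ResolveDDLDefaultCollation extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperators {
    case ct: CreateTable =>
      ct.copy(tableDesc = ct.tableDesc.copy(schema = resolveDefaultStrings(ct.tableDesc.schema)))
  }

  // Replace the default StringType (detected by reference equality) with an explicit
  // UTF8_BINARY collation; top-level fields only, for brevity.
  private def resolveDefaultStrings(schema: StructType): StructType =
    StructType(schema.fields.map {
      case f @ StructField(_, dt, _, _) if dt eq StringType =>
        f.copy(dataType = StringType("UTF8_BINARY"))
      case f => f
    })
}
```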

import org.apache.spark.sql.internal.SqlApiConf
import org.apache.spark.sql.types.StringType

class DefaultCollationTestSuite extends DatasourceV2SQLBase {
Reviewer (Contributor) commented:

Does this test cover v1 commands as well?
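
One illustrative way to exercise both paths inside the suite; the table names and the use of "testcat" as the v2 catalog are assumptions about the test setup, not the suite's actual contents:

```scala
// Sketch: run the same assertion against a v1 (session catalog) table
// and a v2 (testcat) table.
test("unqualified STRING columns default to UTF8_BINARY") {
  Seq("spark_catalog.default.tbl_v1", "testcat.ns.tbl_v2").foreach { tbl =>
    withTable(tbl) {
      sql(s"CREATE TABLE $tbl (c STRING) USING parquet")
      assert(spark.table(tbl).schema("c").dataType === StringType("UTF8_BINARY"))
    }
  }
}
```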

@stefankandic changed the title from "Fix session collation dmlddl" to "[SPARK-49992][SQL] Default collation resolution for DDL and DML queries" on Nov 14, 2024