
[SPARK-49992][SQL] Default collation resolution for DDL and DML queries #48844

Draft · stefankandic wants to merge 32 commits into master

Conversation

@stefankandic (Contributor) commented Nov 14, 2024

What changes were proposed in this pull request?

This PR proposes not using session-level collation in DDL commands (create/alter view/table, add/replace columns).

Also, resolution of the default collation should happen in the analyzer, not in the parser. However, because we detect the default string type with a reference-equality check against the StringType object, we cannot simply replace that object with StringType("UTF8_BINARY"): the two compare as equal, so the tree node framework would just return the old plan. Instead, the resolution is performed in two passes: the default StringType object is first replaced with a TemporaryStringType, which is then replaced with StringType("UTF8_BINARY"), no longer considered a default string type.
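
A minimal sketch of this two-pass idea, shown over a schema rather than a full logical plan; the object and helper names are illustrative, not the PR's actual code, and it assumes TemporaryStringType is declared somewhere with access to StringType's constructor:

```scala
import org.apache.spark.sql.types.{DataType, StringType, StructType}

// Illustrative sketch only; the real rule runs in the analyzer over logical plans.
object DefaultStringTypeResolutionSketch {

  // Hypothetical marker type: a distinct class, so swapping it in and out is seen as a
  // real change by the tree framework. 0 is assumed to be the UTF8_BINARY collation id,
  // and this declaration assumes StringType's constructor is accessible here.
  case class TemporaryStringType() extends StringType(0)

  // The unresolved default string type is detected by reference equality
  // with the singleton StringType object.
  private def isDefaultStringType(dt: DataType): Boolean = dt eq StringType

  def resolve(schema: StructType): StructType = {
    // Pass 1: default StringType -> TemporaryStringType. Replacing it directly with
    // StringType("UTF8_BINARY") would compare as equal, and the old node would be kept.
    val pass1 = mapTopLevelTypes(schema) {
      case dt if isDefaultStringType(dt) => TemporaryStringType()
      case dt => dt
    }
    // Pass 2: TemporaryStringType -> StringType("UTF8_BINARY"),
    // which is no longer treated as a default string type.
    mapTopLevelTypes(pass1) {
      case _: TemporaryStringType => StringType("UTF8_BINARY")
      case dt => dt
    }
  }

  // Top-level fields only, for brevity; nested types would need recursion.
  private def mapTopLevelTypes(schema: StructType)(f: DataType => DataType): StructType =
    StructType(schema.fields.map(field => field.copy(dataType = f(field.dataType))))
}
```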

Why are the changes needed?

The default collation for DDL commands should be tied to the object being created or altered (e.g., table, view, schema) rather than the session-level setting. Since object-level collations are not yet supported, we will assume the UTF8_BINARY collation by default for now.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added new unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Nov 14, 2024
@@ -30,7 +30,7 @@ import org.apache.spark.sql.catalyst.util.CollationFactory
  * The id of collation for this StringType.
  */
 @Stable
-class StringType private (val collationId: Int) extends AtomicType with Serializable {
+class StringType private[sql] (val collationId: Int) extends AtomicType with Serializable {
Reviewer (Contributor) commented:

No need to do this, as we already have def apply(collationId: Int) in object StringType.
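
For illustration, the existing factory the comment refers to means callers can already obtain a collated StringType without touching the constructor's visibility; the collation id below is an assumed value, not verified against CollationFactory:

```scala
import org.apache.spark.sql.types.StringType

// Existing companion-object factory (per the comment above):
//   def apply(collationId: Int): StringType
// so widening the constructor to private[sql] is unnecessary.
val collated: StringType = StringType(collationId = 0) // 0 assumed to be UTF8_BINARY
```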

@@ -43,7 +43,7 @@ import org.apache.spark.sql.types._
 case class CreateTable(
     tableDesc: CatalogTable,
     mode: SaveMode,
-    query: Option[LogicalPlan]) extends LogicalPlan {
+    query: Option[LogicalPlan]) extends LogicalPlan with V1DDLCommand {
Reviewer (Contributor) commented:

I don't think this works. We need to resolve string collation in tableDesc.schema, and the rule must match CreateTable directly to do it.
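
A rough sketch of what this suggestion could look like; the rule name and the resolveDefaultStrings helper are assumptions rather than the PR's actual code, and nested types are ignored for brevity:

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.datasources.CreateTable
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Sketch: match CreateTable directly so the default collation is applied to
// tableDesc.schema, rather than relying on a shared V1DDLCommand trait.
object ResolveDDLDefaultCollation extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperators {
    case ct: CreateTable =>
      ct.copy(tableDesc = ct.tableDesc.copy(schema = resolveDefaultStrings(ct.tableDesc.schema)))
  }

  // Replace the default StringType (detected by reference equality) with an explicit
  // UTF8_BINARY collation; top-level fields only, for brevity.
  private def resolveDefaultStrings(schema: StructType): StructType =
    StructType(schema.fields.map {
      case f @ StructField(_, dt, _, _) if dt eq StringType =>
        f.copy(dataType = StringType("UTF8_BINARY"))
      case f => f
    })
}
```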

import org.apache.spark.sql.internal.SqlApiConf
import org.apache.spark.sql.types.StringType

class DefaultCollationTestSuite extends DatasourceV2SQLBase {
Reviewer (Contributor) commented:

Does this test cover v1 commands as well?
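
One illustrative way to exercise both paths inside the suite; the table names and the use of "testcat" as the v2 catalog are assumptions about the test setup, not the suite's actual contents:

```scala
// Sketch: run the same assertion against a v1 (session catalog) table
// and a v2 (testcat) table.
test("unqualified STRING columns default to UTF8_BINARY") {
  Seq("spark_catalog.default.tbl_v1", "testcat.ns.tbl_v2").foreach { tbl =>
    withTable(tbl) {
      sql(s"CREATE TABLE $tbl (c STRING) USING parquet")
      assert(spark.table(tbl).schema("c").dataType === StringType("UTF8_BINARY"))
    }
  }
}
```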

@stefankandic changed the title from "Fix session collation dmlddl" to "[SPARK-49992][SQL] Default collation resolution for DDL and DML queries" on Nov 14, 2024