
[SPARK-50286][SQL] Correctly propagate SQL options to WriteBuilder #48822

Open
wants to merge 3 commits into base: master

Conversation

pan3793 (Member) commented Nov 12, 2024

What changes were proposed in this pull request?

SPARK-49098 introduced a SQL syntax that allows users to set table options in DSv2 write cases, but unfortunately, the options set via SQL are not propagated correctly to the underlying DSv2 WriteBuilder:

INSERT INTO $t1 WITH (`write.split-size` = 10) SELECT ...
df.writeTo(t1).option("write.split-size", "10").append()

From the user's perspective, the above two are equivalent, but the internal implementations differ slightly. Both of them construct an

AppendData(r: DataSourceV2Relation, ..., writeOptions, ...)

but the SQL options are carried by r.options, while the DataFrame API options are carried by writeOptions. Currently, only the latter is propagated to the WriteBuilder; the former is silently dropped. This PR fixes the issue by merging the two option maps.
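As a rough, illustrative sketch (not the actual patch), the merge could look like the following; note that with Scala's Map ++ operator the right-hand operand wins on duplicate keys:

import scala.jdk.CollectionConverters._
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Illustrative sketch only: combine the SQL-provided relation options with the
// DataFrame API write options before they reach the DSv2 WriteBuilder.
def mergedWriteOptions(
    relationOptions: CaseInsensitiveStringMap, // carried by r.options (SQL WITH (...))
    writeOptions: Map[String, String] // carried by AppendData's writeOptions
  ): Map[String, String] = {
  // Values from writeOptions override relationOptions on duplicate keys.
  relationOptions.asScala.toMap ++ writeOptions
}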

An additional question: if the user only uses SQL or the DataFrame API to construct the query, only one of the two "options" maps will be filled. But if the user assembles the LogicalPlan directly, there is a chance that r.options and writeOptions contain duplicate keys; which one should take effect?

Why are the changes needed?

Correctly propagate SQL options to WriteBuilder, to complete the feature added in SPARK-49098, so that DSv2 implementations like Iceberg can benefit.

Does this PR introduce any user-facing change?

No, it's an unreleased feature.

How was this patch tested?

UTs added by SPARK-36680 and SPARK-49098 are also updated to check that SQL options are correctly propagated to the physical plan.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Nov 12, 2024
@pan3793 pan3793 marked this pull request as ready for review November 13, 2024 07:30
pan3793 (Member, Author) commented Nov 13, 2024

@@ -44,7 +46,7 @@ object V2Writes extends Rule[LogicalPlan] with PredicateHelper {

   override def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
     case a @ AppendData(r: DataSourceV2Relation, query, options, _, None, _) =>
-      val writeBuilder = newWriteBuilder(r.table, options, query.schema)
+      val writeBuilder = newWriteBuilder(r.table, r.options.asScala.toMap ++ options, query.schema)
Contributor

Can we add an assert that only one of them can be non-empty?
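For illustration, such a guard might look roughly like this self-contained sketch; the parameter names merely stand in for r.options (the SQL WITH (...) options) and the AppendData write options, and this is not code from the PR:

// Hypothetical sketch of the suggested assert; it assumes the SQL path fills the
// relation options and the DataFrame API path fills the write options, never both.
def assertAtMostOneOptionMapSet(
    relationOptions: Map[String, String],
    writeOptions: Map[String, String]): Unit = {
  assert(relationOptions.isEmpty || writeOptions.isEmpty,
    "Expected at most one of relation options and write options to be non-empty")
}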

import org.apache.spark.sql.execution.CommandResultExec
import org.apache.spark.sql.execution.datasources.v2._

class DataSourceV2OptionSuite extends DatasourceV2SQLBase {
Contributor

Suggested change
-class DataSourceV2OptionSuite extends DatasourceV2SQLBase {
+class DataSourceV2OptionSQLSuite extends DatasourceV2SQLBase {

Contributor

since this is testing SQL API only.

@cloud-fan (Contributor) left a comment

This is a good catch!

@dongjoon-hyun (Member) left a comment

+1, LGTM. Thank you for the fix, @pan3793 .

@szehon-ho (Contributor) left a comment

Thanks, I did not realize this. It looks, from @cloud-fan's comment, like only one can be set?

}
}

test("SPARK-36680, SPARK-50286: Supports Dynamic Table Options for SQL Insert Overwrite") {
Contributor

This is my fault, but we can optionally change the first JIRA in these tests to SPARK-49098, as it's the one that added the support for inserts?

pan3793 (Member, Author) commented Nov 14, 2024

... that only one can be set

@szehon-ho yes, as mentioned in the description, DataFrame API options go to writeOptions while SQL options go to r.options. I think they won't be set together in normal cases, but it would be great if someone could double-check that.

pan3793 (Member, Author) commented Nov 14, 2024

Wait, I forgot about SessionConfigSupport.

In fact, I submitted a PR to Iceberg to support this feature, but unfortunately, that patch doesn't seem to be getting attention. @szehon-ho, do you think we can re-open it and get it in? If so, the assumption would not hold.

... that only one can be set

and we should define the priority; I think it should be:

  1. options from SQL
  2. options from DataFrame API
  3. options from session configuration

Currently, if there are duplicated options, 2 overrides 3; see:

val finalOptions = sessionOptions.filter { case (k, _) => !optionsWithPath.contains(k) } ++
  optionsWithPath.originalMap
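For illustration, the proposed precedence could be expressed with plain Scala maps (a minimal sketch with made-up values, not the actual Spark code); Scala's Map ++ operator keeps the right-hand value on duplicate keys:

// Illustrative only: session configuration < DataFrame API options < SQL options.
val sessionOptions = Map("write.split-size" -> "5")
val dataFrameOptions = Map("write.split-size" -> "10")
val sqlOptions = Map("write.split-size" -> "20")

val effectiveOptions = sessionOptions ++ dataFrameOptions ++ sqlOptions
assert(effectiveOptions("write.split-size") == "20") // the SQL option takes effect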

@cloud-fan, do you think the proposed priority makes sense, or do you have other ideas?

@cloud-fan (Contributor) commented

Yeah, 3 should have lower priority.
