[SPARK-50130][SQL][FOLLOWUP] Simplify the resolution of LazyOuterReference #48820

Open · wants to merge 2 commits into master
Conversation

@cloud-fan (Contributor) commented on Nov 12, 2024:

What changes were proposed in this pull request?

This is a followup of #48664 to simplify the code. The new workflow is:

  1. The Column API creates LazyOuterReference
  2. QueryExecution does lazy analysis if its main query contains LazyOuterReference. Eager analysis is still performed if only subquery expressions contain LazyOuterReference.
  3. The column resolution framework is updated to resolve LazyOuterReference

After this simplification, we no longer need the special logic to strip LazyOuterReference on the DataFrame side, nor the extra flag in the subquery expressions.
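For illustration, a minimal sketch of this flow in PySpark, using the .outer() and .scalar() APIs that appear in the discussion below (the DataFrames and column names here are made up for the example, not taken from the PR):

from pyspark.sql import functions as sf

# Illustrative data; `spark` is an existing SparkSession, as in the examples below.
l = spark.createDataFrame([(1, 1.0), (2, 2.0)], ["a", "b"])
r = spark.createDataFrame([(1, 3.0), (2, 4.0)], ["c", "d"])

# sf.col("a").outer() creates a LazyOuterReference, so the intermediate DataFrame
# below is analyzed lazily: its main query contains the LazyOuterReference.
sub = r.where(sf.col("c") == sf.col("a").outer()).select(sf.sum("d")).scalar()

# The outer query contains the LazyOuterReference only inside a subquery
# expression, so it is still analyzed eagerly and `a` resolves against l.
l.select("a", sub).show()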

Why are the changes needed?

cleanup

Does this PR introduce any user-facing change?

no

How was this patch tested?

existing tests

Was this patch authored or co-authored using generative AI tooling?

no

The github-actions bot added the SQL label on Nov 12, 2024.
@ueshin (Member) left a comment:


@cloud-fan Thanks for taking a look!

I forgot to add one test case in #48664; we also want this one to fail (#48828):

from pyspark.sql import functions as sf

l = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 1.0), (2, 2.0), (3, 3.0), (None, None), (None, 5.0), (6, None)],
    ["a", "b"],
)

r = spark.createDataFrame(
    [(2, 3.0), (2, 3.0), (3, 2.0), (4, 1.0), (None, None), (None, 5.0), (6, None)],
    ["c", "d"],
)

l.select(
    "a",
    (
        r
        .where(sf.col("b") == sf.col("a").outer())
        .select(sf.sum("d"))
        .scalar()
    ),
).show()

This query should fail because sf.col("b") needs .outer(), but it passes with this change. cc @allisonwang-db

- lazy val isLazyAnalysis: Boolean = logical.containsAnyPattern(LAZY_ANALYSIS_EXPRESSION)
+ lazy val isLazyAnalysis: Boolean = {
+   // Only check the main query as we can resolve LazyOuterReference inside subquery expressions.
+   logical.exists(_.expressions.exists(_.exists(_.isInstanceOf[LazyOuterReference])))
+ }
Member commented:

oh, logical.containsAnyPattern(LAZY_ANALYSIS_EXPRESSION) will check subqueries?
I also intended to check the main query. Thanks for the correction!

Member commented:

nit: it should be LazyAnalysisExpression, although currently only LazyOuterReference?

- def name: String =
-   nameParts.map(n => if (n.contains(".")) s"`$n`" else n).mkString(".")
+ def name: String = nameParts.map(quoteIfNeeded).mkString(".")
Member commented:

👍

@cloud-fan (Contributor, author) commented:

@ueshin After thinking more about it, I think the new test case should pass instead of fail. My opinion is:

  1. In SQL, users can just write unqualified column names to reference the outer plan (see the SQL sketch after this list); we should allow the same in the DataFrame API.
  2. Eventually, Spark Connect will be the main API and DataFrame analysis will always be lazy. Then .outer() is only needed to avoid ambiguity when there are column name conflicts, i.e. to explicitly reference the outer plan.
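To make point 1 concrete, here is a hypothetical SQL example (tables and data invented for illustration, not from this PR): the correlated scalar subquery references the outer column a with no qualification and no special syntax, which is the SQL behavior the DataFrame API would mirror:

# Hypothetical SQL illustration of point 1 (tables and data are made up): the
# unqualified `a` inside the correlated subquery is not found in r, so it
# resolves against the outer table l, with no extra syntax required.
spark.createDataFrame([(1, 1.0), (2, 2.0)], ["a", "b"]).createOrReplaceTempView("l")
spark.createDataFrame([(1, 3.0), (2, 4.0)], ["c", "d"]).createOrReplaceTempView("r")

spark.sql("""
    SELECT a, (SELECT sum(d) FROM r WHERE c = a) AS total
    FROM l
""").show()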

  case u: UnresolvedAttribute =>
-   resolveOuterReference(u.nameParts, outerPlan.get).getOrElse(u)
+   resolve(u.nameParts).getOrElse(u)
  case u: LazyOuterReference =>
@cloud-fan (Contributor, author) commented:

This is the new change; the other changes just revert the previous changes.

@ueshin (Member) commented on Nov 13, 2024:

@cloud-fan

In SQL, users can just write un-qualified column names to reference the outer plan. We should allow the same in DataFrame API.

In that case, sf.col("a") should also be allowed instead of sf.col("a").outer() in the above example?

l.select(
    "a",
    (
        r
        .where(sf.col("b") == sf.col("a"))
        .select(sf.sum("d"))
        .scalar()
    ),
).show()

Otherwise, users may not know why .outer() is necessary for a but not for b.

As long as we need at least one .outer() to make the analysis lazy, it should have a consistent meaning; otherwise we need another way to trigger lazy analysis.
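As an illustration of why at least one .outer() is currently required (a sketch reusing r and sf from the example above, not a test from this PR): with no .outer() at all there is no LazyOuterReference, so Classic Spark analyzes the intermediate DataFrame eagerly and fails, because neither a nor b is a column of r.

# Sketch, not a test from this PR: with no .outer() at all there is no
# LazyOuterReference, so Classic Spark analyzes r.where(...) eagerly and raises
# an unresolved-column error right here, because r only has columns c and d.
r.where(sf.col("b") == sf.col("a")).select(sf.sum("d")).scalar()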

cc @allisonwang-db who suggested the current outer() behavior.

@cloud-fan (Contributor, author) commented:

@ueshin I think .where(sf.col("b") == sf.col("a")) should be allowed in Spark Connect. For now we need at least one .outer() to trigger lazy analysis in Classic Spark, but it's not really a problem for Spark Connect.

@ueshin (Member) commented on Nov 14, 2024:

@cloud-fan Yes, it's easy for Spark Connect to support that; it doesn't even need LazyOuterReference or UnresolvedOuterReference. I prototyped it with Spark Connect.
I'm just wondering whether we can introduce this difference between Classic and Connect. I think not; they should have the same behavior.

@cloud-fan (Contributor, author) commented:

I think it's fine for Classic Spark to have more limitations. We are moving users from Classic Spark to Spark Connect, not the other direction. This also makes DataFrame more consistent with SQL.
