Schema error: No field named mytable."A". when table is named `"MyTable"` #6790

alamb · 2023-06-28T21:08:30Z

Describe the bug

A query is failing that it can't find a column which exists

This was reported by a customer of IOx

To Reproduce

DataFusion CLI v27.0.0

❯
create table "MyTable" (
"A"    VARCHAR,
"B"    VARCHAR,
"time" timestamp
);
0 rows in set. Query took 0.002 seconds.

❯ 
SELECT "A", COUNT(DISTINCT("B"))
FROM "MyTable"
WHERE time >= now() - interval '1 month'
GROUP BY "A"
LIMIT 10;

Optimizer rule 'simplify_expressions' failed
caused by
Schema error: No field named mytable."A". Valid fields are "MyTable.A", "COUNT(DISTINCT MyTable.B)".

Expected behavior

The query should plan successfully (but return no rows)

+-----------+---------------------------+
| MyTable.A | COUNT(DISTINCT MyTable.B) |
+-----------+---------------------------+
+-----------+---------------------------+

Additional context

The query works fine in version 26.0.0 👍

DataFusion CLI v26.0.0
❯ create table "MyTable" (
"A"    VARCHAR,
"B"    VARCHAR,
"time" timestamp
);

0 rows in set. Query took 0.003 seconds.
❯ SELECT "A", COUNT(DISTINCT("B"))
FROM "MyTable"
WHERE time >= now() - interval '1 month'
GROUP BY "A"
LIMIT 10;

+-----------+---------------------------+
| MyTable.A | COUNT(DISTINCT MyTable.B) |
+-----------+---------------------------+
+-----------+---------------------------+

Also if we remove the distinct the query also works:

❯ SELECT "A", COUNT("B")
FROM "MyTable"
WHERE time >= now() - interval '1 month'
GROUP BY "A"
LIMIT 10;
+---+------------------+
| A | COUNT(MyTable.B) |
+---+------------------+
+---+------------------+

The text was updated successfully, but these errors were encountered:

jackwener · 2023-06-29T04:24:22Z

this is related with #6681.

During the process of solving this problem, I encountered some difficulties, but I am working hard to resolve them.

alamb · 2023-07-03T10:13:13Z

I also hit this same error with some of the clickbench queries.

I think this was one:

SELECT "RegionID", COUNT(DISTINCT "UserID") AS u FROM hits GROUP BY "RegionID" ORDER BY u DESC LIMIT 10;

I hope to find time to work on this issue later in the week if you haven't had a chance @jackwener

alamb · 2023-07-03T19:18:47Z

Thank you @jackwener

jackwener · 2023-07-06T12:14:09Z

I have been quite busy lately, and today a newcomer @YjyJeff in the community encountered the same issue and submitted a PR #6862 related to it. So I took some time to systematically analyze this problem and help others in the community better understand it.

This problem is caused by #6595 directly. Although the idea of #6595 is right, but we cannot change so for the time being. Because it will expose the current design problem ( schema will be change in optimization).

Let me have explain this problem:

There is a optimizated rule: simplify expr:
project: t1.a + 1 + 2
its schema is Schema: Field [qualifier: None, name:`t1.a + 1 + 2`]

after simplify expr:
project: t1.a + 3

if we use original schema: schema still is
Schema: Field [qualifier: None, name:t1.a + 1 + 2]
if we recompute schema, its schema is Schema: Field [qualifier: None, name:t1.a + 3]

so, if we recompute new schema, it will cause mismatch schema error
Of course, we can add alias to resolve this problem, it will make field keep same name after recomputation.

BUT, alias can't avoid this corner case: #6595 (comment)
such as: alias('t1.a'), field is [qualifier: none, name: t1.a], we hope field is [qualifier: t1, name: a].

so, I try to resolve it by add a field in Alias to keep schema same.
I try to use #6681 #6826 to resolve it, but I think it will bring much complexity, I think it isn't a good idea and give up it.

so I think a better temporary method is to revert #6595

Let me talk about more.

In my opinion, it is unnecessary to keep Schema unchanged at the Optimization stage, because in the process of optimization, it indeed may change.

qualifier/name change: such as simplify_expr
nullable change: such as infer not null from predicate
type change: such as fix: from_plan generate incorrect schema when it is Aggregate #6862

So, keep Schema isn't a good design, it will cause some trouble.

Why we need to keep Schema unchanged?

Based on my guess, the reason may be that we need to ensure the references are correct.
Plan is based on Schema. if we change schema of child plan, plan may can't get right column from child plan.

This involves a design problem. we shouldn't use name as a references to get a column from child plan, we should use id to get a column from child plan.
Because we should use a immutable/unquie item as a references.
By the way, if we use this design, we also can avoid problem like #6543.

Many systems in the industry have adopted this design, like spark, impala, doris...

such as spark

case class AttributeReference(
    name: String,
    datatype: DataType,
    nullable: Boolean = true,
    override val metadata: Metadata = Metadata.empty)(
    val exprId: ExprId = NamedExpression.newExprId,
    val qualifier: Seq[String] = Seq.empty[String])
  extends Attribute with Unevaluable

  /**
   * Returns true iff the expression id is the same for both attributes.
   */
  def sameRef(other: AttributeReference): Boolean = this.exprId == other.exprId

alamb · 2023-07-06T19:16:48Z

This involves a design problem. we shouldn't use name as a references to get a column from child plan, we should use id to get a column from child plan.

I believe this is also what Postgres does (refers to columns by id, rather than a name)

Also, interestingly, this is what DataFusion PhyicalExpr do as well.

In my opinion, it is unnecessary to keep Schema unchanged at the Optimization stage, because in the process of optimization, it indeed may change.

qualifier/name change: such as simplify_expr

I somehow feel this should be fixable with Aliases but as you point out this apparently is tricky

nullable change: such as infer not null from predicate

This is a good point

type change: such as #6862

In my opinion, the type (other than nullable) should never change after we do the Analysis phase. Otherwise we could end up getting to the physical expressions and not having the types line up

alamb · 2023-07-06T19:19:55Z

so, I try to resolve it by add a field in Alias to keep schema same.
I try to use #6681 #6826 to resolve it, but I think it will bring much complexity, I think it isn't a good idea and give up it.

🤔 yeah it is the approach that seems to make the most sense from an incremental sense, but I trust you if you say we can't make it work in practice...

@jackwener do you think there is a good ticket that covers this issue adequately ? Your description in #6790 (comment) is really nicely done, and I think it will likely get lost in this PR if we don't pull it into a ticket

* Field not found, releated apache/datafusion#6790

alamb added the bug Something isn't working label Jun 28, 2023

alamb mentioned this issue Jun 28, 2023

Allow multiple statements in datafusion-cli #6793

Closed

alamb mentioned this issue Jun 30, 2023

FieldNotFound but field is shown in valid_fields #6699

Open

jackwener self-assigned this Jul 2, 2023

This was referenced Jul 3, 2023

revert #6595 #6820 #6827

Merged

add field in Alias #6826

Closed

alamb mentioned this issue Jul 3, 2023

Add clickbench queries to sqllogictest coverage #6836

Merged

alamb closed this as completed in #6836 Jul 5, 2023

jackwener mentioned this issue Jul 6, 2023

fix: from_plan generate incorrect schema when it is Aggregate #6862

Closed

yukkit added a commit to yukkit/cnosdb that referenced this issue Jul 10, 2023

update datafusion dep commit id

e8f87c5

* Field not found, releated apache/datafusion#6790

yukkit mentioned this issue Jul 10, 2023

update datafusion dep commit id cnosdb/cnosdb#1343

Merged

4 tasks

yukkit added a commit to yukkit/cnosdb that referenced this issue Jul 10, 2023

update datafusion dep commit id

f16b27f

* Field not found, releated apache/datafusion#6790

alamb mentioned this issue Jul 10, 2023

Fix for issue #6595 has broken existing working queries ("Schema error: No field named ...") #6897

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Schema error: No field named mytable."A". when table is named `"MyTable"` #6790

Schema error: No field named mytable."A". when table is named `"MyTable"` #6790

alamb commented Jun 28, 2023

jackwener commented Jun 29, 2023

alamb commented Jul 3, 2023

alamb commented Jul 3, 2023

jackwener commented Jul 6, 2023 •

edited

Loading

alamb commented Jul 6, 2023

alamb commented Jul 6, 2023 •

edited by jackwener

Loading

Schema error: No field named mytable."A". when table is named "MyTable" #6790

Schema error: No field named mytable."A". when table is named "MyTable" #6790

Comments

alamb commented Jun 28, 2023

Describe the bug

To Reproduce

Expected behavior

Additional context

jackwener commented Jun 29, 2023

alamb commented Jul 3, 2023

alamb commented Jul 3, 2023

jackwener commented Jul 6, 2023 • edited Loading

alamb commented Jul 6, 2023

alamb commented Jul 6, 2023 • edited by jackwener Loading

Schema error: No field named mytable."A". when table is named `"MyTable"` #6790

Schema error: No field named mytable."A". when table is named `"MyTable"` #6790

jackwener commented Jul 6, 2023 •

edited

Loading

alamb commented Jul 6, 2023 •

edited by jackwener

Loading