Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Schema error: No field named mytable."A". when table is named "MyTable" #6790

Closed
alamb opened this issue Jun 28, 2023 · 6 comments · Fixed by #6836
Closed

Schema error: No field named mytable."A". when table is named "MyTable" #6790

alamb opened this issue Jun 28, 2023 · 6 comments · Fixed by #6836
Assignees
Labels
bug Something isn't working

Comments

@alamb
Copy link
Contributor

alamb commented Jun 28, 2023

Describe the bug

A query is failing that it can't find a column which exists

This was reported by a customer of IOx

To Reproduce

DataFusion CLI v27.0.0

create table "MyTable" (
"A"    VARCHAR,
"B"    VARCHAR,
"time" timestamp
);
0 rows in set. Query took 0.002 seconds.

❯ 
SELECT "A", COUNT(DISTINCT("B"))
FROM "MyTable"
WHERE time >= now() - interval '1 month'
GROUP BY "A"
LIMIT 10;

Optimizer rule 'simplify_expressions' failed
caused by
Schema error: No field named mytable."A". Valid fields are "MyTable.A", "COUNT(DISTINCT MyTable.B)".

Expected behavior

The query should plan successfully (but return no rows)

+-----------+---------------------------+
| MyTable.A | COUNT(DISTINCT MyTable.B) |
+-----------+---------------------------+
+-----------+---------------------------+

Additional context

The query works fine in version 26.0.0 👍

DataFusion CLI v26.0.0
❯ create table "MyTable" (
"A"    VARCHAR,
"B"    VARCHAR,
"time" timestamp
);

0 rows in set. Query took 0.003 seconds.
❯ SELECT "A", COUNT(DISTINCT("B"))
FROM "MyTable"
WHERE time >= now() - interval '1 month'
GROUP BY "A"
LIMIT 10;

+-----------+---------------------------+
| MyTable.A | COUNT(DISTINCT MyTable.B) |
+-----------+---------------------------+
+-----------+---------------------------+

Also if we remove the distinct the query also works:

SELECT "A", COUNT("B")
FROM "MyTable"
WHERE time >= now() - interval '1 month'
GROUP BY "A"
LIMIT 10;
+---+------------------+
| A | COUNT(MyTable.B) |
+---+------------------+
+---+------------------+
@alamb alamb added the bug Something isn't working label Jun 28, 2023
@jackwener
Copy link
Member

this is related with #6681.

During the process of solving this problem, I encountered some difficulties, but I am working hard to resolve them.

@alamb
Copy link
Contributor Author

alamb commented Jul 3, 2023

I also hit this same error with some of the clickbench queries.

I think this was one:

SELECT "RegionID", COUNT(DISTINCT "UserID") AS u FROM hits GROUP BY "RegionID" ORDER BY u DESC LIMIT 10;

I hope to find time to work on this issue later in the week if you haven't had a chance @jackwener

This was referenced Jul 3, 2023
@alamb
Copy link
Contributor Author

alamb commented Jul 3, 2023

Thank you @jackwener

@jackwener
Copy link
Member

jackwener commented Jul 6, 2023

I have been quite busy lately, and today a newcomer @YjyJeff in the community encountered the same issue and submitted a PR #6862 related to it. So I took some time to systematically analyze this problem and help others in the community better understand it.

This problem is caused by #6595 directly. Although the idea of #6595 is right, but we cannot change so for the time being. Because it will expose the current design problem ( schema will be change in optimization).

Let me have explain this problem:

There is a optimizated rule: simplify expr:
project: t1.a + 1 + 2
its schema is Schema: Field [qualifier: None, name:`t1.a + 1 + 2`]

after simplify expr:
project: t1.a + 3

if we use original schema: schema still is
Schema: Field [qualifier: None, name:t1.a + 1 + 2]
if we recompute schema, its schema is Schema: Field [qualifier: None, name:t1.a + 3]

so, if we recompute new schema, it will cause mismatch schema error
Of course, we can add alias to resolve this problem, it will make field keep same name after recomputation.

BUT, alias can't avoid this corner case: #6595 (comment)
such as: alias('t1.a'), field is [qualifier: none, name: t1.a], we hope field is [qualifier: t1, name: a].

so, I try to resolve it by add a field in Alias to keep schema same.
I try to use #6681 #6826 to resolve it, but I think it will bring much complexity, I think it isn't a good idea and give up it.

so I think a better temporary method is to revert #6595


Let me talk about more.

In my opinion, it is unnecessary to keep Schema unchanged at the Optimization stage, because in the process of optimization, it indeed may change.

So, keep Schema isn't a good design, it will cause some trouble.


Why we need to keep Schema unchanged?

Based on my guess, the reason may be that we need to ensure the references are correct.
Plan is based on Schema. if we change schema of child plan, plan may can't get right column from child plan.

This involves a design problem. we shouldn't use name as a references to get a column from child plan, we should use id to get a column from child plan.
Because we should use a immutable/unquie item as a references.
By the way, if we use this design, we also can avoid problem like #6543.

Many systems in the industry have adopted this design, like spark, impala, doris...

such as spark

case class AttributeReference(
    name: String,
    datatype: DataType,
    nullable: Boolean = true,
    override val metadata: Metadata = Metadata.empty)(
    val exprId: ExprId = NamedExpression.newExprId,
    val qualifier: Seq[String] = Seq.empty[String])
  extends Attribute with Unevaluable

  /**
   * Returns true iff the expression id is the same for both attributes.
   */
  def sameRef(other: AttributeReference): Boolean = this.exprId == other.exprId

@alamb
Copy link
Contributor Author

alamb commented Jul 6, 2023

This involves a design problem. we shouldn't use name as a references to get a column from child plan, we should use id to get a column from child plan.

I believe this is also what Postgres does (refers to columns by id, rather than a name)

Also, interestingly, this is what DataFusion PhyicalExpr do as well.

In my opinion, it is unnecessary to keep Schema unchanged at the Optimization stage, because in the process of optimization, it indeed may change.

qualifier/name change: such as simplify_expr

I somehow feel this should be fixable with Aliases but as you point out this apparently is tricky

nullable change: such as infer not null from predicate

This is a good point

type change: such as #6862

In my opinion, the type (other than nullable) should never change after we do the Analysis phase. Otherwise we could end up getting to the physical expressions and not having the types line up

@alamb
Copy link
Contributor Author

alamb commented Jul 6, 2023

so, I try to resolve it by add a field in Alias to keep schema same.
I try to use #6681 #6826 to resolve it, but I think it will bring much complexity, I think it isn't a good idea and give up it.

🤔 yeah it is the approach that seems to make the most sense from an incremental sense, but I trust you if you say we can't make it work in practice...

@jackwener do you think there is a good ticket that covers this issue adequately ? Your description in #6790 (comment) is really nicely done, and I think it will likely get lost in this PR if we don't pull it into a ticket

yukkit added a commit to yukkit/cnosdb that referenced this issue Jul 10, 2023
yukkit added a commit to yukkit/cnosdb that referenced this issue Jul 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants