-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix null comparison for Parquet pruning predicate #1595
Conversation
Thank you @viirya -- I will try to review this carefully, but likely won't be able to do so until tomorrow |
Thank you @alamb |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @viirya for the quick fix!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BLUF: I am fairly sure this change is ok, but I am not sure why it is needed; I have outlined my confusions below.
Note I ran this change against the IOx test suite in https://github.com/influxdata/influxdb_iox/pull/3479 and it was good.
Confusion 1: Doesn't follow definition of a pruning predicate:
As a reminder, the pruning predicate definition is
/// A pruning predicate is one that has been rewritten in terms of
/// the min and max values of column references and that evaluates
/// to FALSE if the filter predicate would evaluate FALSE *for
/// every row* whose values fell within the min / max ranges (aka
/// could be pruned).
///
/// The pruning predicate evaluates to TRUE or NULL
/// if the filter predicate *might* evaluate to TRUE for at least
/// one row whose vaules fell within the min/max ranges (in other
/// words they might pass the predicate)
Thus, a "TRUE" or "NULL" result for a predicate means the row group must be kept. This is the safe behavior -- only if it is 100% certain that the predicate will evaluate to FALSE should the row group be removed
In this case, x = null
doesn't seem to satisfy the stated conditions in PruningPredicate
. x = null
evaluates to null
for all values (both null and non null) as can be seen in postgres:
alamb=# select x, x = null from foo;
x | ?column?
---+----------
1 |
|
2 |
(3 rows)
alamb=#
Thus there is something wrong. Either:
- the pruning predicate definition should be updated to say that a pruning predicate will return false if all rows will evaluate to FALSE OR NULL (which seems reasonable as only rows that evaluate to
TRUE
pass a predicate, not row that returnnull
) - this is not a correct transformation
Confusion 2: Why are we treating =
specially?
If we go with this PR, I don't see any reason to handle =
specially, as same argument applies to other operators such as !=
, >
, etc (though it does not apply to IS DISTINCT
/ IS NOT DISTINCT
).
@@ -421,6 +444,13 @@ impl<'a> PruningExpressionBuilder<'a> { | |||
&self.scalar_expr | |||
} | |||
|
|||
fn scalar_expr_value(&self) -> Result<&ScalarValue> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fn scalar_expr_value(&self) -> Result<&ScalarValue> { | |
fn scalar_expr_value(&self) -> Option<&ScalarValue> { |
Would save a string creation on error (not that it really matters)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
|
||
fn null_count_column_expr(&mut self) -> Result<Expr> { | ||
let null_count_field = &Field::new(self.field.name(), DataType::Int64, false); | ||
self.required_columns.null_count_column_expr( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
{ | ||
// column = null => null_count > 0 | ||
let null_count_column_expr = expr_builder.null_count_column_expr()?; | ||
null_count_column_expr.gt(lit::<i64>(0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am curious why we use a i64
here rather than u64
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, you're right. This should be u64
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to u64
.
// because the null values propagate to the end result, making the predicate result undefined | ||
assert_eq!(row_group_filter, vec![true, true]); | ||
// First row group was filtered out because it contains no null value on "c2". | ||
assert_eq!(row_group_filter, vec![false, true]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually think this could be vec![false, false]
as the predicate can never be true (int > 1 AND bool = NULL
is always NULL
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure about the expression semantics in datafusion. In Spark, the predicate should be IsNull
that checks the null value. Here I follow the original expression bool = NULL
.
I see there is also IsNull
predicate expression, but I don't see IsNull
is handled in predicate pushdown. I don't know if this is intentional (i.e. using =
to do null predicate pushdown) or a bug.
I can fix it if you agree that IsNull
is correct way to handle null predicate here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is related to the "Confusion 1 and 2". I guess this is also why you feel confused about treating =
specially.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In sql IsNull
is the correct way to test a column for null as well 👍
It would make a lot of sense to me to rewrite x IS NULL
--> 0 > x_null_count
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, I'm surprised when I looked at the bool = NULL
and confused too. I guess this is how datafusion works but seems not :). Let me fix it together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you like me to fix it here or in a following PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated to use IsNull for predicate pruning.
|
||
/// return the number of null values for the named column. | ||
/// Note: the returned array must contain `num_containers()` rows. | ||
fn null_counts(&self, column: &Column) -> Option<ArrayRef>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// return the number of null values for the named column. | |
/// Note: the returned array must contain `num_containers()` rows. | |
fn null_counts(&self, column: &Column) -> Option<ArrayRef>; | |
/// return the number of null values for the named column as an | |
/// `Option<Int64Array>`. | |
/// | |
/// Note: the returned array must contain `num_containers()` rows. | |
fn null_counts(&self, column: &Column) -> Option<ArrayRef>; |
I had to look this up to figure out what type this was required
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @houqp ! |
Which issue does this PR close?
Closes #1591.
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?