-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Correct null_count in describe() #10260
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank yoU @Weijun-H -- this is a nice improvement
Would it be possible to add / update a test that shows an output with null count greater than zero as well?
I also left some comments about potential improvements
datafusion/expr/src/expr_fn.rs
Outdated
@@ -215,6 +215,18 @@ pub fn count(expr: Expr) -> Expr { | |||
)) | |||
} | |||
|
|||
/// Create an expression to represent the count_null() aggregate function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this function is creating a function like COUNT(IS NULL x)
rather than a count_null
function
datafusion/core/src/dataframe/mod.rs
Outdated
@@ -534,7 +534,7 @@ impl DataFrame { | |||
vec![], | |||
original_schema_fields | |||
.clone() | |||
.map(|f| count(is_null(col(f.name()))).alias(f.name())) | |||
.map(|f| count_null(col(f.name())).alias(f.name())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of a count_null()
new function, what do you think about using the existing count
function?
Something like count(expr.is_null())
.map(|f| count_null(col(f.name())).alias(f.name())) | |
.map(|f| count(col(f.name()).is_null()).alias(f.name()))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When using count
, it sums all non-null
Expr while is_null
converts Expr
to a boolean expression, which is why it doesn't work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Make sense to me. Another classic trick is to use a CASE expression, something like
SELECT SUM(CASE WHEN x IS NULL THEN 1 ELSE 0 END) as null_count ...
@@ -39,7 +39,7 @@ async fn describe() -> Result<()> { | |||
"| describe | id | bool_col | tinyint_col | smallint_col | int_col | bigint_col | float_col | double_col | date_string_col | string_col | timestamp_col | year | month |", | |||
"+------------+-------------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----------------+------------+-------------------------+--------------------+-------------------+", | |||
"| count | 7300.0 | 7300 | 7300.0 | 7300.0 | 7300.0 | 7300.0 | 7300.0 | 7300.0 | 7300 | 7300 | 7300 | 7300.0 | 7300.0 |", | |||
"| null_count | 7300.0 | 7300 | 7300.0 | 7300.0 | 7300.0 | 7300.0 | 7300.0 | 7300.0 | 7300 | 7300 | 7300 | 7300.0 | 7300.0 |", | |||
"| null_count | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0.0 |", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that certainly looks better
a3f58c4
to
3a22207
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Weijun-H - I think this PR needs a test for data that has actual nulls in it
I checked the file the test is using and I don't think it has any null values in it
> select * from './parquet-testing/data/alltypes_tiny_pages.parquet';
+-----+----------+-------------+--------------+---------+------------+-----------+--------------------+-----------------+------------+-------------------------+------+-------+
| id | bool_col | tinyint_col | smallint_col | int_col | bigint_col | float_col | double_col | date_string_col | string_col | timestamp_col | year | month |
+-----+----------+-------------+--------------+---------+------------+-----------+--------------------+-----------------+------------+-------------------------+------+-------+
| 122 | true | 2 | 2 | 2 | 20 | 2.2 | 20.2 | 01/13/09 | 2 | 2009-01-13T01:02:05.410 | 2009 | 1 |
| 123 | false | 3 | 3 | 3 | 30 | 3.3 | 30.299999999999997 | 01/13/09 | 3 | 2009-01-13T01:03:05.430 | 2009 | 1 |
| 124 | true | 4 | 4 | 4 | 40 | 4.4 | 40.4 | 01/13/09 | 4 | 2009-01-13T01:04:05.460 | 2009 | 1 |
| 125 | false | 5 | 5 | 5 | 50 | 5.5 | 50.5 | 01/13/09 | 5 | 2009-01-13T01:05:05.500 | 2009 | 1 |
| 126 | true | 6 | 6 | 6 | 60 | 6.6 | 60.599999999999994 | 01/13/09 | 6 | 2009-01-13T01:06:05.550 | 2009 | 1 |
| 127 | false | 7 | 7 | 7 | 70 | 7.7 | 70.7 | 01/13/09 | 7 | 2009-01-13T01:07:05.610 | 2009 | 1 |
| 128 | true | 8 | 8 | 8 | 80 | 8.8 | 80.8 | 01/13/09 | 8 | 2009-01-13T01:08:05.680 | 2009 | 1 |
| 129 | false | 9 | 9 | 9 | 90 | 9.9 | 90.89999999999999 | 01/13/09 | 9 | 2009-01-13T01:09:05.760 | 2009 | 1 |
| 130 | true | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 01/14/09 | 0 | 2009-01-14T01:10:05.850 | 2009 | 1 |
| 131 | false | 1 | 1 | 1 | 10 | 1.1 | 10.1 | 01/14/09 | 1 | 2009-01-14T01:11:05.850 | 2009 | 1 |
| 132 | true | 2 | 2 | 2 | 20 | 2.2 | 20.2 | 01/14/09 | 2 | 2009-01-14T01:12:05.860 | 2009 | 1 |
| 133 | false | 3 | 3 | 3 | 30 | 3.3 | 30.299999999999997 | 01/14/09 | 3 | 2009-01-14T01:13:05.880 | 2009 | 1 |
| 134 | true | 4 | 4 | 4 | 40 | 4.4 | 40.4 | 01/14/09 | 4 | 2009-01-14T01:14:05.910 | 2009 | 1 |
| 135 | false | 5 | 5 | 5 | 50 | 5.5 | 50.5 | 01/14/09 | 5 | 2009-01-14T01:15:05.950 | 2009 | 1 |
| 136 | true | 6 | 6 | 6 | 60 | 6.6 | 60.599999999999994 | 01/14/09 | 6 | 2009-01-14T01:16:06 | 2009 | 1 |
| 137 | false | 7 | 7 | 7 | 70 | 7.7 | 70.7 | 01/14/09 | 7 | 2009-01-14T01:17:06.600 | 2009 | 1 |
| 138 | true | 8 | 8 | 8 | 80 | 8.8 | 80.8 | 01/14/09 | 8 | 2009-01-14T01:18:06.130 | 2009 | 1 |
| 139 | false | 9 | 9 | 9 | 90 | 9.9 | 90.89999999999999 | 01/14/09 | 9 | 2009-01-14T01:19:06.210 | 2009 | 1 |
| 140 | true | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 01/15/09 | 0 | 2009-01-15T01:20:06.300 | 2009 | 1 |
| 141 | false | 1 | 1 | 1 | 10 | 1.1 | 10.1 | 01/15/09 | 1 | 2009-01-15T01:21:06.300 | 2009 | 1 |
| 142 | true | 2 | 2 | 2 | 20 | 2.2 | 20.2 | 01/15/09 | 2 | 2009-01-15T01:22:06.310 | 2009 | 1 |
| 143 | false | 3 | 3 | 3 | 30 | 3.3 | 30.299999999999997 | 01/15/09 | 3 | 2009-01-15T01:23:06.330 | 2009 | 1 |
| 144 | true | 4 | 4 | 4 | 40 | 4.4 | 40.4 | 01/15/09 | 4 | 2009-01-15T01:24:06.360 | 2009 | 1 |
| 145 | false | 5 | 5 | 5 | 50 | 5.5 | 50.5 | 01/15/09 | 5 | 2009-01-15T01:25:06.400 | 2009 | 1 |
| 98 | true | 8 | 8 | 8 | 80 | 8.8 | 80.8 | 01/10/09 | 8 | 2009-01-10T00:38:04.330 | 2009 | 1 |
| 99 | false | 9 | 9 | 9 | 90 | 9.9 | 90.89999999999999 | 01/10/09 | 9 | 2009-01-10T00:39:04.410 | 2009 | 1 |
| 100 | true | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 01/11/09 | 0 | 2009-01-11T00:40:04.500 | 2009 | 1 |
| 101 | false | 1 | 1 | 1 | 10 | 1.1 | 10.1 | 01/11/09 | 1 | 2009-01-11T00:41:04.500 | 2009 | 1 |
| 102 | true | 2 | 2 | 2 | 20 | 2.2 | 20.2 | 01/11/09 | 2 | 2009-01-11T00:42:04.510 | 2009 | 1 |
| 103 | false | 3 | 3 | 3 | 30 | 3.3 | 30.299999999999997 | 01/11/09 | 3 | 2009-01-11T00:43:04.530 | 2009 | 1 |
| 104 | true | 4 | 4 | 4 | 40 | 4.4 | 40.4 | 01/11/09 | 4 | 2009-01-11T00:44:04.560 | 2009 | 1 |
| 105 | false | 5 | 5 | 5 | 50 | 5.5 | 50.5 | 01/11/09 | 5 | 2009-01-11T00:45:04.600 | 2009 | 1 |
| 106 | true | 6 | 6 | 6 | 60 | 6.6 | 60.599999999999994 | 01/11/09 | 6 | 2009-01-11T00:46:04.650 | 2009 | 1 |
| 107 | false | 7 | 7 | 7 | 70 | 7.7 | 70.7 | 01/11/09 | 7 | 2009-01-11T00:47:04.710 | 2009 | 1 |
| 108 | true | 8 | 8 | 8 | 80 | 8.8 | 80.8 | 01/11/09 | 8 | 2009-01-11T00:48:04.780 | 2009 | 1 |
| 109 | false | 9 | 9 | 9 | 90 | 9.9 | 90.89999999999999 | 01/11/09 | 9 | 2009-01-11T00:49:04.860 | 2009 | 1 |
| 110 | true | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 01/12/09 | 0 | 2009-01-12T00:50:04.950 | 2009 | 1 |
| 111 | false | 1 | 1 | 1 | 10 | 1.1 | 10.1 | 01/12/09 | 1 | 2009-01-12T00:51:04.950 | 2009 | 1 |
| 112 | true | 2 | 2 | 2 | 20 | 2.2 | 20.2 | 01/12/09 | 2 | 2009-01-12T00:52:04.960 | 2009 | 1 |
| 113 | false | 3 | 3 | 3 | 30 | 3.3 | 30.299999999999997 | 01/12/09 | 3 | 2009-01-12T00:53:04.980 | 2009 | 1 |
| . |
| . |
| . |
+-----+----------+-------------+--------------+---------+------------+-----------+--------------------+-----------------+------------+-------------------------+------+-------+
7300 row(s) fetched. (First 40 displayed. Use --maxrows to adjust)
Elapsed 0.015 seconds.
datafusion/core/src/dataframe/mod.rs
Outdated
@@ -534,7 +534,7 @@ impl DataFrame { | |||
vec![], | |||
original_schema_fields | |||
.clone() | |||
.map(|f| count(is_null(col(f.name()))).alias(f.name())) | |||
.map(|f| count_null(col(f.name())).alias(f.name())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Make sense to me. Another classic trick is to use a CASE expression, something like
SELECT SUM(CASE WHEN x IS NULL THEN 1 ELSE 0 END) as null_count ...
datafusion/expr/src/expr_fn.rs
Outdated
@@ -215,6 +215,18 @@ pub fn count(expr: Expr) -> Expr { | |||
)) | |||
} | |||
|
|||
/// Create an expression to represent the COUNT(IS NULL x) aggregate function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found this confusing as first because it isn't COUNT(x IS NULL)
-- I think a more accurate description is COUNT(x) FILTER (WHERE x IS NULL)
or something like that
/// Create an expression to represent the COUNT(IS NULL x) aggregate function | |
/// Create an aggregate that returns the number `NULL` values in a column |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks really nice to me now -- thank you @Weijun-H
"| describe | a | b |", | ||
"+------------+------+------+", | ||
"| count | 1 | 0 |", | ||
"| null_count | 0 | 1 |", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️ -- thank you @Weijun-H
`null_count` was fixed upstream. Ref apache/datafusion#10260
* chore: upgrade datafusion Deps Ref #690 * update concat and concat_ws to use datafusion_functions Moved in apache/datafusion#10089 * feat: upgrade functions.rs Upstream is continuing it's migration to UDFs. Ref apache/datafusion#10098 Ref apache/datafusion#10372 * fix ScalarUDF import * feat: remove deprecated suppors_filter_pushdown and impl supports_filters_pushdown Deprecated function removed in apache/datafusion#9923 * use `unnest_columns_with_options` instead of deprecated `unnest_column_with_option` * remove ScalarFunction wrappers These relied on upstream BuiltinScalarFunction, which are now removed. Ref apache/datafusion#10098 * update dataframe `test_describe` `null_count` was fixed upstream. Ref apache/datafusion#10260 * remove PyDFField and related methods DFField was removed upstream. Ref: apache/datafusion#9595 * bump `datafusion-python` package version to 38.0.0 * re-implement `PyExpr::column_name` The previous implementation relied on `DFField` which was removed upstream. Ref: apache/datafusion#9595
Which issue does this PR close?
Closes #10231
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?