fix: Correct null_count in describe() #10260

Merged
6 commits merged into apache:main from the fix-null-count-in-describe branch
Apr 30, 2024

Conversation

Weijun-H
Member

Which issue does this PR close?

Closes #10231

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the logical-expr (Logical plan and expressions) and core (Core DataFusion crate) labels Apr 27, 2024
@Weijun-H Weijun-H marked this pull request as ready for review April 27, 2024 02:23
Contributor

@alamb alamb left a comment

Thank you @Weijun-H -- this is a nice improvement

Would it be possible to add / update a test that shows an output with null count greater than zero as well?

I also left some comments about potential improvements

@@ -215,6 +215,18 @@ pub fn count(expr: Expr) -> Expr {
))
}

/// Create an expression to represent the count_null() aggregate function
Contributor

I think this function creates an expression like COUNT(IS NULL x) rather than a count_null function

@@ -534,7 +534,7 @@ impl DataFrame {
vec![],
original_schema_fields
.clone()
- .map(|f| count(is_null(col(f.name()))).alias(f.name()))
+ .map(|f| count_null(col(f.name())).alias(f.name()))
Contributor

Instead of a new count_null() function, what do you think about using the existing count function?

Something like count(expr.is_null())

Suggested change
- .map(|f| count_null(col(f.name())).alias(f.name()))
+ .map(|f| count(col(f.name()).is_null()).alias(f.name()))

Member Author

When using count, it counts all the non-null values, while is_null converts the Expr into a boolean expression that is never null, so count over it counts every row -- which is why it doesn't work.
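
To make that concrete, here is a minimal sketch (hypothetical column name `x`; the prelude re-exports of count/is_null/col are an assumption about this DataFusion version): is_null yields a non-null boolean for every row, so count over it counts every row instead of only the nulls.

use datafusion::prelude::*;

fn main() {
    // `is_null` maps every row to true/false and never to NULL, so
    // COUNT over it counts every row, not just the null ones.
    let counts_every_row = count(is_null(col("x")));

    // COUNT skips NULL inputs, so this counts the *non-null* values,
    // which is the opposite of what null_count needs.
    let counts_non_null = count(col("x"));

    println!("{counts_every_row}");
    println!("{counts_non_null}");
}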

Contributor

Got it. Makes sense to me. Another classic trick is to use a CASE expression, something like

SELECT  SUM(CASE WHEN x IS NULL THEN 1 ELSE 0 END) as null_count ...
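
For reference, a rough sketch of the same trick with DataFusion's expression builders (the when/otherwise case builder and sum are assumed to be available from the prelude; null_count_expr is a made-up helper name for illustration):

use datafusion::error::Result;
use datafusion::prelude::*;

// Builds SUM(CASE WHEN <column> IS NULL THEN 1 ELSE 0 END) AS null_count
fn null_count_expr(column: &str) -> Result<Expr> {
    let one_if_null = when(col(column).is_null(), lit(1)).otherwise(lit(0))?;
    Ok(sum(one_if_null).alias("null_count"))
}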

@@ -39,7 +39,7 @@ async fn describe() -> Result<()> {
"| describe | id | bool_col | tinyint_col | smallint_col | int_col | bigint_col | float_col | double_col | date_string_col | string_col | timestamp_col | year | month |",
"+------------+-------------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----------------+------------+-------------------------+--------------------+-------------------+",
"| count | 7300.0 | 7300 | 7300.0 | 7300.0 | 7300.0 | 7300.0 | 7300.0 | 7300.0 | 7300 | 7300 | 7300 | 7300.0 | 7300.0 |",
"| null_count | 7300.0 | 7300 | 7300.0 | 7300.0 | 7300.0 | 7300.0 | 7300.0 | 7300.0 | 7300 | 7300 | 7300 | 7300.0 | 7300.0 |",
"| null_count | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0.0 |",
Contributor

that certainly looks better

@Weijun-H Weijun-H force-pushed the fix-null-count-in-describe branch from a3f58c4 to 3a22207 on April 27, 2024 13:08
Contributor

@alamb alamb left a comment

Thanks @Weijun-H - I think this PR needs a test for data that has actual nulls in it

I checked the file the test is using and I don't think it has any null values in it

> select * from './parquet-testing/data/alltypes_tiny_pages.parquet';
+-----+----------+-------------+--------------+---------+------------+-----------+--------------------+-----------------+------------+-------------------------+------+-------+
| id  | bool_col | tinyint_col | smallint_col | int_col | bigint_col | float_col | double_col         | date_string_col | string_col | timestamp_col           | year | month |
+-----+----------+-------------+--------------+---------+------------+-----------+--------------------+-----------------+------------+-------------------------+------+-------+
| 122 | true     | 2           | 2            | 2       | 20         | 2.2       | 20.2               | 01/13/09        | 2          | 2009-01-13T01:02:05.410 | 2009 | 1     |
| 123 | false    | 3           | 3            | 3       | 30         | 3.3       | 30.299999999999997 | 01/13/09        | 3          | 2009-01-13T01:03:05.430 | 2009 | 1     |
| 124 | true     | 4           | 4            | 4       | 40         | 4.4       | 40.4               | 01/13/09        | 4          | 2009-01-13T01:04:05.460 | 2009 | 1     |
| 125 | false    | 5           | 5            | 5       | 50         | 5.5       | 50.5               | 01/13/09        | 5          | 2009-01-13T01:05:05.500 | 2009 | 1     |
| 126 | true     | 6           | 6            | 6       | 60         | 6.6       | 60.599999999999994 | 01/13/09        | 6          | 2009-01-13T01:06:05.550 | 2009 | 1     |
| 127 | false    | 7           | 7            | 7       | 70         | 7.7       | 70.7               | 01/13/09        | 7          | 2009-01-13T01:07:05.610 | 2009 | 1     |
| 128 | true     | 8           | 8            | 8       | 80         | 8.8       | 80.8               | 01/13/09        | 8          | 2009-01-13T01:08:05.680 | 2009 | 1     |
| 129 | false    | 9           | 9            | 9       | 90         | 9.9       | 90.89999999999999  | 01/13/09        | 9          | 2009-01-13T01:09:05.760 | 2009 | 1     |
| 130 | true     | 0           | 0            | 0       | 0          | 0.0       | 0.0                | 01/14/09        | 0          | 2009-01-14T01:10:05.850 | 2009 | 1     |
| 131 | false    | 1           | 1            | 1       | 10         | 1.1       | 10.1               | 01/14/09        | 1          | 2009-01-14T01:11:05.850 | 2009 | 1     |
| 132 | true     | 2           | 2            | 2       | 20         | 2.2       | 20.2               | 01/14/09        | 2          | 2009-01-14T01:12:05.860 | 2009 | 1     |
| 133 | false    | 3           | 3            | 3       | 30         | 3.3       | 30.299999999999997 | 01/14/09        | 3          | 2009-01-14T01:13:05.880 | 2009 | 1     |
| 134 | true     | 4           | 4            | 4       | 40         | 4.4       | 40.4               | 01/14/09        | 4          | 2009-01-14T01:14:05.910 | 2009 | 1     |
| 135 | false    | 5           | 5            | 5       | 50         | 5.5       | 50.5               | 01/14/09        | 5          | 2009-01-14T01:15:05.950 | 2009 | 1     |
| 136 | true     | 6           | 6            | 6       | 60         | 6.6       | 60.599999999999994 | 01/14/09        | 6          | 2009-01-14T01:16:06     | 2009 | 1     |
| 137 | false    | 7           | 7            | 7       | 70         | 7.7       | 70.7               | 01/14/09        | 7          | 2009-01-14T01:17:06.600 | 2009 | 1     |
| 138 | true     | 8           | 8            | 8       | 80         | 8.8       | 80.8               | 01/14/09        | 8          | 2009-01-14T01:18:06.130 | 2009 | 1     |
| 139 | false    | 9           | 9            | 9       | 90         | 9.9       | 90.89999999999999  | 01/14/09        | 9          | 2009-01-14T01:19:06.210 | 2009 | 1     |
| 140 | true     | 0           | 0            | 0       | 0          | 0.0       | 0.0                | 01/15/09        | 0          | 2009-01-15T01:20:06.300 | 2009 | 1     |
| 141 | false    | 1           | 1            | 1       | 10         | 1.1       | 10.1               | 01/15/09        | 1          | 2009-01-15T01:21:06.300 | 2009 | 1     |
| 142 | true     | 2           | 2            | 2       | 20         | 2.2       | 20.2               | 01/15/09        | 2          | 2009-01-15T01:22:06.310 | 2009 | 1     |
| 143 | false    | 3           | 3            | 3       | 30         | 3.3       | 30.299999999999997 | 01/15/09        | 3          | 2009-01-15T01:23:06.330 | 2009 | 1     |
| 144 | true     | 4           | 4            | 4       | 40         | 4.4       | 40.4               | 01/15/09        | 4          | 2009-01-15T01:24:06.360 | 2009 | 1     |
| 145 | false    | 5           | 5            | 5       | 50         | 5.5       | 50.5               | 01/15/09        | 5          | 2009-01-15T01:25:06.400 | 2009 | 1     |
| 98  | true     | 8           | 8            | 8       | 80         | 8.8       | 80.8               | 01/10/09        | 8          | 2009-01-10T00:38:04.330 | 2009 | 1     |
| 99  | false    | 9           | 9            | 9       | 90         | 9.9       | 90.89999999999999  | 01/10/09        | 9          | 2009-01-10T00:39:04.410 | 2009 | 1     |
| 100 | true     | 0           | 0            | 0       | 0          | 0.0       | 0.0                | 01/11/09        | 0          | 2009-01-11T00:40:04.500 | 2009 | 1     |
| 101 | false    | 1           | 1            | 1       | 10         | 1.1       | 10.1               | 01/11/09        | 1          | 2009-01-11T00:41:04.500 | 2009 | 1     |
| 102 | true     | 2           | 2            | 2       | 20         | 2.2       | 20.2               | 01/11/09        | 2          | 2009-01-11T00:42:04.510 | 2009 | 1     |
| 103 | false    | 3           | 3            | 3       | 30         | 3.3       | 30.299999999999997 | 01/11/09        | 3          | 2009-01-11T00:43:04.530 | 2009 | 1     |
| 104 | true     | 4           | 4            | 4       | 40         | 4.4       | 40.4               | 01/11/09        | 4          | 2009-01-11T00:44:04.560 | 2009 | 1     |
| 105 | false    | 5           | 5            | 5       | 50         | 5.5       | 50.5               | 01/11/09        | 5          | 2009-01-11T00:45:04.600 | 2009 | 1     |
| 106 | true     | 6           | 6            | 6       | 60         | 6.6       | 60.599999999999994 | 01/11/09        | 6          | 2009-01-11T00:46:04.650 | 2009 | 1     |
| 107 | false    | 7           | 7            | 7       | 70         | 7.7       | 70.7               | 01/11/09        | 7          | 2009-01-11T00:47:04.710 | 2009 | 1     |
| 108 | true     | 8           | 8            | 8       | 80         | 8.8       | 80.8               | 01/11/09        | 8          | 2009-01-11T00:48:04.780 | 2009 | 1     |
| 109 | false    | 9           | 9            | 9       | 90         | 9.9       | 90.89999999999999  | 01/11/09        | 9          | 2009-01-11T00:49:04.860 | 2009 | 1     |
| 110 | true     | 0           | 0            | 0       | 0          | 0.0       | 0.0                | 01/12/09        | 0          | 2009-01-12T00:50:04.950 | 2009 | 1     |
| 111 | false    | 1           | 1            | 1       | 10         | 1.1       | 10.1               | 01/12/09        | 1          | 2009-01-12T00:51:04.950 | 2009 | 1     |
| 112 | true     | 2           | 2            | 2       | 20         | 2.2       | 20.2               | 01/12/09        | 2          | 2009-01-12T00:52:04.960 | 2009 | 1     |
| 113 | false    | 3           | 3            | 3       | 30         | 3.3       | 30.299999999999997 | 01/12/09        | 3          | 2009-01-12T00:53:04.980 | 2009 | 1     |
| .                                                                                                                                                                           |
| .                                                                                                                                                                           |
| .                                                                                                                                                                           |
+-----+----------+-------------+--------------+---------+------------+-----------+--------------------+-----------------+------------+-------------------------+------+-------+
7300 row(s) fetched. (First 40 displayed. Use --maxrows to adjust)
Elapsed 0.015 seconds.

@@ -215,6 +215,18 @@ pub fn count(expr: Expr) -> Expr {
))
}

/// Create an expression to represent the COUNT(IS NULL x) aggregate function
Contributor

I found this confusing at first because it isn't COUNT(x IS NULL) -- I think a more accurate description is COUNT(x) FILTER (WHERE x IS NULL) or something like that

Suggested change
- /// Create an expression to represent the COUNT(IS NULL x) aggregate function
+ /// Create an aggregate that returns the number of `NULL` values in a column

@github-actions github-actions bot removed the logical-expr (Logical plan and expressions) label Apr 30, 2024
Contributor

@alamb alamb left a comment

Looks really nice to me now -- thank you @Weijun-H

"| describe | a | b |",
"+------------+------+------+",
"| count | 1 | 0 |",
"| null_count | 0 | 1 |",
Contributor

❤️ -- thank you @Weijun-H
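
For context, a minimal sketch of how a test over data containing an actual null could be set up (the column names, types, and single-row data are assumptions for illustration, not the PR's exact test code):

use std::sync::Arc;

use datafusion::arrow::array::{ArrayRef, Float64Array, StringArray};
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::test]
async fn describe_with_nulls() -> Result<()> {
    // Column `a` holds one non-null string; column `b` holds a single null.
    let schema = Arc::new(Schema::new(vec![
        Field::new("a", DataType::Utf8, true),
        Field::new("b", DataType::Float64, true),
    ]));
    let batch = RecordBatch::try_new(
        schema,
        vec![
            Arc::new(StringArray::from(vec![Some("a")])) as ArrayRef,
            Arc::new(Float64Array::from(vec![None::<f64>])) as ArrayRef,
        ],
    )?;

    let ctx = SessionContext::new();
    let df = ctx.read_batch(batch)?;

    // With this fix, describe() reports null_count 0 for `a` and 1 for `b`.
    df.describe().await?.show().await?;
    Ok(())
}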

@alamb alamb merged commit 2fd14a4 into apache:main Apr 30, 2024
23 checks passed
Michael-J-Ward added a commit to Michael-J-Ward/datafusion-python that referenced this pull request May 13, 2024
`null_count` was fixed upstream.

Ref apache/datafusion#10260
andygrove pushed a commit to apache/datafusion-python that referenced this pull request May 14, 2024
* chore: upgrade datafusion Deps

Ref #690

* update concat and concat_ws to use datafusion_functions

Moved in apache/datafusion#10089

* feat: upgrade functions.rs

Upstream is continuing its migration to UDFs.

Ref apache/datafusion#10098
Ref apache/datafusion#10372

* fix ScalarUDF import

* feat: remove deprecated supports_filter_pushdown and impl supports_filters_pushdown

Deprecated function removed in apache/datafusion#9923

* use `unnest_columns_with_options` instead of deprecated `unnest_column_with_option`

* remove ScalarFunction wrappers

These relied on upstream BuiltinScalarFunction, which has now been removed.

Ref apache/datafusion#10098

* update dataframe `test_describe`

`null_count` was fixed upstream.

Ref apache/datafusion#10260

* remove PyDFField and related methods

DFField was removed upstream.

Ref: apache/datafusion#9595

* bump `datafusion-python` package version to 38.0.0

* re-implement `PyExpr::column_name`

The previous implementation relied on `DFField` which was removed upstream.

Ref: apache/datafusion#9595
Successfully merging this pull request may close these issues.

null_count in describe() does not show zero when there is no null value in the array