null values cast from or to pl.Utf8 (and only pl.Utf8) always evaluate to False when compared to other nulls #4718
Comments
I wrote a little script to compare all main dtypes to see which combinations result in True versus False. The general process is this:
It turns out nulls only compare as False with

    import polars as pl

    pl.Config.set_tbl_cols(10)
    pl.Config.set_tbl_rows(10)

    dtypes = [pl.UInt8, pl.Int8, pl.UInt16, pl.Int16, pl.UInt32, pl.Int32, pl.UInt64, pl.Int64, pl.Utf8]
    dtypes_str = ["dtype", "UInt8", "Int8", "UInt16", "Int16", "UInt32", "Int32", "UInt64", "Int64", "Utf8"]
    num_dtypes = len(dtypes)
    series = [dtypes_str[1:]]
    series.extend(([['']*num_dtypes]*num_dtypes))
    cast_matrix = pl.DataFrame(dict(zip(dtypes_str,series)))
    values = [1, 2, None, 4, 5]
    for col_idx, type1 in enumerate(dtypes, start=1):
        s1 = pl.Series(values, dtype=type1)
        for row_idx, type2 in enumerate(dtypes):
            s2 = pl.Series(values, dtype=type2).cast(type1)
            result = {True: "", False: "X"}[(s1 == s2).all()]
            cast_matrix[row_idx, col_idx] = result  # is there a more idiomatic way of doing this assignment?
    print(cast_matrix)

Output
Each column refers to the "target data type". The row indicates the dtype that the second series was cast from. An
The following table shows that if either
Sorry for all the edits, I switched my rows/columns in the script and had an offset error that has since been fixed, which gave an indication for the wrong dtype, hence the title change. The script now shows that
Thanks for the report. Can you do my brain a favor and don't call
Ahh geez I usually don't make that mistake. I'll edit it when I'm at my puter. Edit: fixed.
This seems to come from arrow2. The None value, when cast to utf8, has a value of "0" instead of the empty string "" and the value is compared for a utf8 data type.

    use arrow2::array::*;
    use arrow2::datatypes::*;
    use arrow2::compute::{cast::*, comparison::*};

    fn main() {
        let array_int = Int32Array::from_iter(vec![Some(1), None, Some(10)]);
        let array_casted = cast(&array_int, &DataType::Utf8, Default::default()).unwrap();
        let array_casted = array_casted.as_any().downcast_ref::<Utf8Array<i32>>().unwrap();
        let array_utf8 = Utf8Array::<i32>::from_iter(vec![Some("1"), None, Some("10")]);
        println!("casted None value is: <{:?}>, is null: {}",
            array_casted.value(1),
            array_casted.is_null(1));
        println!("utf8 None value is: <{:?}>, is null: {}",
            array_utf8.value(1),
            array_utf8.is_null(1));
        println!("equal {:?}", eq_and_validity(&array_utf8, array_casted));
    }

This generates the same wrong output as shown in your example
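To make the explanation above concrete, here is a toy model in plain Python. It is not arrow2's code and does not claim to show how the bug was actually fixed. The names `ToyUtf8Array`, `toy_cast_int_to_utf8`, `eq_values_only` and `eq_with_validity` are hypothetical, and the idea that a null integer slot is stored as 0 and formatted to "0" by the cast is an assumption chosen to match the behaviour reported in this thread. The sketch only shows why comparing the raw slot values of two null entries can report a mismatch, while a comparison that consults the validity mask first does not.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyUtf8Array:
    """Toy stand-in for an Arrow-style string array (hypothetical, not arrow2)."""
    values: List[str]     # raw slot contents; a null slot still holds some string
    validity: List[bool]  # True = slot is valid, False = slot is null

    @classmethod
    def from_optional(cls, items: List[Optional[str]]) -> "ToyUtf8Array":
        # Null slots get the empty string as their placeholder value.
        return cls(
            ["" if x is None else x for x in items],
            [x is not None for x in items],
        )


def toy_cast_int_to_utf8(items: List[Optional[int]]) -> ToyUtf8Array:
    # Assumption: a null integer slot is stored as 0, and this toy cast formats
    # the slot regardless of validity, so the null slot ends up holding "0".
    return ToyUtf8Array(
        [str(0 if x is None else x) for x in items],
        [x is not None for x in items],
    )


def eq_values_only(a: ToyUtf8Array, b: ToyUtf8Array) -> List[bool]:
    # Compares raw slots and ignores validity, so null placeholders leak through.
    return [x == y for x, y in zip(a.values, b.values)]


def eq_with_validity(a: ToyUtf8Array, b: ToyUtf8Array) -> List[bool]:
    # Two nulls are equal, a null never equals a valid value,
    # and two valid slots are compared by value.
    out = []
    for x, y, vx, vy in zip(a.values, b.values, a.validity, b.validity):
        if vx and vy:
            out.append(x == y)
        else:
            out.append(vx == vy)
    return out


casted = toy_cast_int_to_utf8([1, None, 10])
native = ToyUtf8Array.from_optional(["1", None, "10"])

print(eq_values_only(native, casted))    # [True, False, True]
print(eq_with_validity(native, casted))  # [True, True, True]
```

In this model the two arrays agree on which slots are null, so any difference in the first result comes purely from the placeholder strings ("" versus "0") sitting behind the null slots.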
Fixed by #4685
Polars version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of polars.
Issue Description
Typically, in a polars series, None == None and null == null. However, if we create two series of different dtype with a None value, and cast so they have the same dtype, the resulting None values are not treated as equal.
Reproducible Example
Output
Expected Behavior
All elements are equal.
Installed Versions
Polars: 0.14.8
Index type: UInt32
Platform: Windows-10-10.0.19044-SP0
Python: 3.9.5 (tags/v3.9.5:0a7dcbd, May 3 2021, 17:27:52) [MSC v.1928 64 bit (AMD64)]
---Optional dependencies---
pyarrow: 7.0.0
pandas: 1.4.0
numpy: 1.22.2
fsspec:
connectorx: 0.3.0
xlsx2csv: 0.8
pytz: 2021.3