null values cast from or to pl.Utf8 (and only pl.Utf8) always evaluate to False when compared to other nulls #4718

mcrumiller · 2022-09-03T18:40:03Z

Polars version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of polars.

Issue Description

Typically, in a polars Series, None == None and null == null. However, if we create two Series of different dtypes that each contain a None value, and cast one so that both have the same dtype, the resulting nulls are not treated as equal.

Reproducible Example

import polars as pl

s1 = pl.Series(['1', '2', None, '4', '5'])
s2 = pl.Series(['1', '2', None, '4', '5'])

print(s1 == s2) # all elements are equal

s2 = pl.Series([1, 2, None, 4, 5]).cast(pl.Utf8)

print(s1 == s2) # 3rd element is not equal

Output

shape: (5,)
Series: '' [bool]
[
        true
        true
        true
        true
        true
]
shape: (5,)
Series: '' [bool]
[
        true
        true
        false
        true
        true
]

Expected Behavior

All elements are equal.

Installed Versions

Polars: 0.14.8
Index type: UInt32
Platform: Windows-10-10.0.19044-SP0
Python: 3.9.5 (tags/v3.9.5:0a7dcbd, May 3 2021, 17:27:52) [MSC v.1928 64 bit (AMD64)]
---Optional dependencies---
pyarrow: 7.0.0
pandas: 1.4.0
numpy: 1.22.2
fsspec:
connectorx: 0.3.0
xlsx2csv: 0.8
pytz: 2021.3


mcrumiller · 2022-09-03T19:16:16Z

I wrote a little script to compare all main dtypes to see which combinations result in True versus False. The general process is this:

create s1 with dtype 1
create s2 with dtype 2
cast s2 to dtype 1
compare results.

It turns out that nulls only compare as False when pl.Utf8 is involved, and with no other dtype. Here's the script:

import polars as pl

pl.Config.set_tbl_cols(10)
pl.Config.set_tbl_rows(10)

dtypes = [pl.UInt8, pl.Int8, pl.UInt16, pl.Int16, pl.UInt32, pl.Int32, pl.UInt64, pl.Int64, pl.Utf8]
dtypes_str = ["dtype", "UInt8", "Int8", "UInt16", "Int16", "UInt32", "Int32", "UInt64", "Int64", "Utf8"]
num_dtypes = len(dtypes)
series = [dtypes_str[1:]]
series.extend(([['']*num_dtypes]*num_dtypes))
cast_matrix = pl.DataFrame(dict(zip(dtypes_str,series)))

values = [1, 2, None, 4, 5]
for col_idx, type1 in enumerate(dtypes, start=1):
    s1 = pl.Series(values, dtype=type1)
    for row_idx, type2 in enumerate(dtypes):
        s2 = pl.Series(values, dtype=type2).cast(type1)
        result = {True: "", False: "X"}[(s1 == s2).all()]

        cast_matrix[row_idx, col_idx] = result # is there a more idiomatic way of doing this assignment?

print(cast_matrix)
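On the question in the comment above about a more idiomatic assignment: one option is to build the whole matrix as a plain column-oriented dict and construct the DataFrame once at the end, instead of writing into cells one by one. The following is only a sketch. `build_matrix` and `utf8_involved` are hypothetical names, and `utf8_involved` is a stand-in for the real polars comparison (which would compare `s1` against the cast `s2` as in the script) so that the example runs without polars.

```python
from typing import Callable, Dict, List


def build_matrix(
    names: List[str], differs: Callable[[str, str], bool]
) -> Dict[str, List[str]]:
    """Build a column-oriented dict: a 'dtype' label column plus one
    column per target dtype, with one row per source dtype."""
    matrix = {"dtype": list(names)}
    for target in names:
        # "X" marks a source/target pair whose comparison was not all True.
        matrix[target] = ["X" if differs(source, target) else "" for source in names]
    return matrix


def utf8_involved(source: str, target: str) -> bool:
    """Hypothetical stand-in for the polars check: reports a mismatch
    when exactly one side of the cast is Utf8."""
    return (source == "Utf8") != (target == "Utf8")


names = ["Int64", "Utf8"]
matrix = build_matrix(names, utf8_involved)
print(matrix)
```

The resulting dict can be passed to `pl.DataFrame(...)` in a single call, the same constructor form the script already uses, which avoids per-cell assignment.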

Output

Each column is the target dtype. Each row is the dtype that the second Series was cast from. An X indicates that the comparison evaluated to False.

The following table shows that if pl.Utf8 is cast to any other dtype, or any other dtype is cast to pl.Utf8, the null comparison evaluates to False. In every other case, it evaluates to True.

shape: (9, 10)
┌────────┬───────┬──────┬────────┬───────┬────────┬───────┬────────┬───────┬──────┐
│ dtype  ┆ UInt8 ┆ Int8 ┆ UInt16 ┆ Int16 ┆ UInt32 ┆ Int32 ┆ UInt64 ┆ Int64 ┆ Utf8 │
│ ---    ┆ ---   ┆ ---  ┆ ---    ┆ ---   ┆ ---    ┆ ---   ┆ ---    ┆ ---   ┆ ---  │
│ str    ┆ str   ┆ str  ┆ str    ┆ str   ┆ str    ┆ str   ┆ str    ┆ str   ┆ str  │
╞════════╪═══════╪══════╪════════╪═══════╪════════╪═══════╪════════╪═══════╪══════╡
│ UInt8  ┆       ┆      ┆        ┆       ┆        ┆       ┆        ┆       ┆ X    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ Int8   ┆       ┆      ┆        ┆       ┆        ┆       ┆        ┆       ┆ X    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ UInt16 ┆       ┆      ┆        ┆       ┆        ┆       ┆        ┆       ┆ X    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ Int16  ┆       ┆      ┆        ┆       ┆        ┆       ┆        ┆       ┆ X    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ UInt32 ┆       ┆      ┆        ┆       ┆        ┆       ┆        ┆       ┆ X    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ Int32  ┆       ┆      ┆        ┆       ┆        ┆       ┆        ┆       ┆ X    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ UInt64 ┆       ┆      ┆        ┆       ┆        ┆       ┆        ┆       ┆ X    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ Int64  ┆       ┆      ┆        ┆       ┆        ┆       ┆        ┆       ┆ X    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ Utf8   ┆ X     ┆ X    ┆ X      ┆ X     ┆ X      ┆ X     ┆ X      ┆ X     ┆      │
└────────┴───────┴──────┴────────┴───────┴────────┴───────┴────────┴───────┴──────┘

mcrumiller · 2022-09-03T19:26:58Z

Sorry for all the edits. I had switched my rows and columns in the script and had an offset error, which pointed to the wrong dtype, hence the title change. Both have since been fixed, and the script now shows that Utf8 is the culprit.

ritchie46 · 2022-09-04T09:00:42Z

Thanks for the report. Can you do my brain a favor and not call Series variables df? 😅

mcrumiller · 2022-09-04T13:41:14Z

Ahh geez I usually don't make that mistake. I'll edit it when I'm at my puter. Edit: fixed.

cannero · 2022-09-06T20:03:02Z

This seems to come from arrow2. When a None value is cast to utf8, the null slot holds the value "0" instead of the empty string "", and the comparison for the utf8 data type appears to take that underlying value into account even though the slot is null.
I made a short script to reproduce it:

use arrow2::array::*;
use arrow2::datatypes::*;
use arrow2::compute::{cast::*, comparison::*};

fn main() {
    let array_int = Int32Array::from_iter(vec![Some(1), None, Some(10)]);
    let array_casted = cast(&array_int, &DataType::Utf8, Default::default()).unwrap();
    let array_casted = array_casted.as_any().downcast_ref::<Utf8Array<i32>>().unwrap();
    
    let array_utf8 = Utf8Array::<i32>::from_iter(vec![Some("1"), None, Some("10")]);
    
    println!("casted None value is: <{:?}>, is null: {}",
             array_casted.value(1),
             array_casted.is_null(1));
    println!("utf8 None value is: <{:?}>, is null: {}",
             array_utf8.value(1),
             array_utf8.is_null(1));
    println!("equal {:?}", eq_and_validity(&array_utf8, array_casted));
}

This generates the same wrong output as shown in your example:

casted None value is: <"0">, is null: true
utf8 None value is: <"">, is null: true
equal BooleanArray[true, false, true]
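To illustrate why the leftover "0" matters, here is a toy model in plain Python. It is not arrow2's implementation. `ToyUtf8Array` and both comparison functions are hypothetical, and the first function assumes that the comparison combines raw value equality with validity equality, which is consistent with the output above but is not confirmed here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyUtf8Array:
    """Toy stand-in for a nullable string array: backing values plus a
    validity mask (True = valid, False = null)."""
    values: List[str]
    validity: List[bool]


def eq_values_and_validity(lhs: ToyUtf8Array, rhs: ToyUtf8Array) -> List[bool]:
    """Assumed buggy behavior: backing values must match even in null slots."""
    return [
        a == b and va == vb
        for a, b, va, vb in zip(lhs.values, rhs.values, lhs.validity, rhs.validity)
    ]


def eq_null_aware(lhs: ToyUtf8Array, rhs: ToyUtf8Array) -> List[bool]:
    """Expected behavior: two nulls are equal regardless of backing values."""
    out = []
    for a, b, va, vb in zip(lhs.values, rhs.values, lhs.validity, rhs.validity):
        if not va and not vb:
            out.append(True)      # null == null
        elif va != vb:
            out.append(False)     # null vs. valid
        else:
            out.append(a == b)    # both valid: compare values
    return out


# Null slot holds "0" after the int -> utf8 cast, "" when built as utf8 directly.
casted = ToyUtf8Array(["1", "0", "10"], [True, False, True])
native = ToyUtf8Array(["1", "", "10"], [True, False, True])

print(eq_values_and_validity(native, casted))  # [True, False, True]
print(eq_null_aware(native, casted))           # [True, True, True]
```

In this model the mismatch disappears either when the comparison ignores backing values in null slots, or when the cast writes the same placeholder into null slots that a directly built utf8 array would hold.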

ritchie46 · 2022-09-07T08:03:40Z

Fixed by #4685

mcrumiller added the bug (Something isn't working) and python (Related to Python Polars) labels Sep 3, 2022

mcrumiller changed the title ~~null values that originated from different dtypes are not equal~~ null values cast from or to Int64 (and only Int64) always evaluate to False Sep 3, 2022

mcrumiller changed the title ~~null values cast from or to Int64 (and only Int64) always evaluate to False~~ null values cast from or to pl.Utf8 (and only pl.Utf8) always evaluate to False Sep 3, 2022

mcrumiller changed the title ~~null values cast from or to pl.Utf8 (and only pl.Utf8) always evaluate to False~~ null values cast from or to pl.Utf8 (and only pl.Utf8) always evaluate to False when compared to other nulls Sep 3, 2022

ritchie46 closed this as completed Sep 7, 2022
