null values cast from or to pl.Utf8 (and only pl.Utf8) always evaluate to False when compared to other nulls #4718

Closed
2 tasks done
mcrumiller opened this issue Sep 3, 2022 · 6 comments
Labels
bug (Something isn't working), python (Related to Python Polars)

Comments

@mcrumiller
Contributor

mcrumiller commented Sep 3, 2022

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of polars.

Issue Description

Typically, in a Polars Series, nulls compare as equal (None == None). However, if we create two Series of different dtypes, each containing a None, and cast one so that both have the same dtype, the resulting nulls are not treated as equal.

Reproducible Example

import polars as pl

s1 = pl.Series(['1', '2', None, '4', '5'])
s2 = pl.Series(['1', '2', None, '4', '5'])

print(s1 == s2) # all elements are equal

s2 = pl.Series([1, 2, None, 4, 5]).cast(pl.Utf8)

print(s1 == s2) # 3rd element is not equal

Output

shape: (5,)
Series: '' [bool]
[
        true
        true
        true
        true
        true
]
shape: (5,)
Series: '' [bool]
[
        true
        true
        false
        true
        true
]

Expected Behavior

All elements are equal.

Installed Versions

Polars: 0.14.8
Index type: UInt32
Platform: Windows-10-10.0.19044-SP0
Python: 3.9.5 (tags/v3.9.5:0a7dcbd, May 3 2021, 17:27:52) [MSC v.1928 64 bit (AMD64)]
---Optional dependencies---
pyarrow: 7.0.0
pandas: 1.4.0
numpy: 1.22.2
fsspec:
connectorx: 0.3.0
xlsx2csv: 0.8
pytz: 2021.3

@mcrumiller mcrumiller added the bug and python labels Sep 3, 2022
@mcrumiller
Contributor Author

mcrumiller commented Sep 3, 2022

I wrote a little script to compare all main dtypes to see which combinations result in True versus False. The general process is this:

  1. create s1 with dtype 1
  2. create s2 with dtype 2
  3. cast s2 to dtype 1
  4. compare results.

It turns out that nulls compare as False only when pl.Utf8 is involved, and with no other dtype. Here's the script:

import polars as pl

pl.Config.set_tbl_cols(10)
pl.Config.set_tbl_rows(10)

dtypes = [pl.UInt8, pl.Int8, pl.UInt16, pl.Int16, pl.UInt32, pl.Int32, pl.UInt64, pl.Int64, pl.Utf8]
dtypes_str = ["dtype", "UInt8", "Int8", "UInt16", "Int16", "UInt32", "Int32", "UInt64", "Int64", "Utf8"]
num_dtypes = len(dtypes)
series = [dtypes_str[1:]]
series.extend(([['']*num_dtypes]*num_dtypes))
cast_matrix = pl.DataFrame(dict(zip(dtypes_str,series)))

values = [1, 2, None, 4, 5]
for col_idx, type1 in enumerate(dtypes, start=1):
    s1 = pl.Series(values, dtype=type1)
    for row_idx, type2 in enumerate(dtypes):
        s2 = pl.Series(values, dtype=type2).cast(type1)
        result = {True: "", False: "X"}[(s1 == s2).all()]

        cast_matrix[row_idx, col_idx] = result # is there a more idiomatic way of doing this assignment?

print(cast_matrix)

Output

Each column refers to the target dtype. Each row indicates the dtype that the second Series was cast from. An X indicates that the comparison evaluated to False.

The following table shows that if pl.Utf8 is cast to any other dtype, or any other dtype is cast to pl.Utf8, the null comparison evaluates to False. In every other case, it evaluates to True.

shape: (9, 10)
┌────────┬───────┬──────┬────────┬───────┬────────┬───────┬────────┬───────┬──────┐
│ dtype  ┆ UInt8 ┆ Int8 ┆ UInt16 ┆ Int16 ┆ UInt32 ┆ Int32 ┆ UInt64 ┆ Int64 ┆ Utf8 │
│ ---    ┆ ---   ┆ ---  ┆ ---    ┆ ---   ┆ ---    ┆ ---   ┆ ---    ┆ ---   ┆ ---  │
│ str    ┆ str   ┆ str  ┆ str    ┆ str   ┆ str    ┆ str   ┆ str    ┆ str   ┆ str  │
╞════════╪═══════╪══════╪════════╪═══════╪════════╪═══════╪════════╪═══════╪══════╡
│ UInt8  ┆       ┆      ┆        ┆       ┆        ┆       ┆        ┆       ┆ X    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ Int8   ┆       ┆      ┆        ┆       ┆        ┆       ┆        ┆       ┆ X    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ UInt16 ┆       ┆      ┆        ┆       ┆        ┆       ┆        ┆       ┆ X    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ Int16  ┆       ┆      ┆        ┆       ┆        ┆       ┆        ┆       ┆ X    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ UInt32 ┆       ┆      ┆        ┆       ┆        ┆       ┆        ┆       ┆ X    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ Int32  ┆       ┆      ┆        ┆       ┆        ┆       ┆        ┆       ┆ X    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ UInt64 ┆       ┆      ┆        ┆       ┆        ┆       ┆        ┆       ┆ X    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ Int64  ┆       ┆      ┆        ┆       ┆        ┆       ┆        ┆       ┆ X    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ Utf8   ┆ X     ┆ X    ┆ X      ┆ X     ┆ X      ┆ X     ┆ X      ┆ X     ┆      │
└────────┴───────┴──────┴────────┴───────┴────────┴───────┴────────┴───────┴──────┘
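On the script's inline question about a more idiomatic assignment: one option is to compute every cell up front and build the table in a single step, rather than writing cells one at a time. The sketch below is plain Python and does not call Polars. `nulls_compare_equal` is a hypothetical stand-in that simply mirrors the table reported above; in the real script it would be replaced by the `(s1 == s2).all()` check.

```python
from typing import Callable, Dict, List

DTYPE_NAMES = ["UInt8", "Int8", "UInt16", "Int16", "UInt32", "Int32", "UInt64", "Int64", "Utf8"]


def nulls_compare_equal(source: str, target: str) -> bool:
    # Hypothetical stand-in for the real check: it only reproduces the
    # reported pattern (a mismatch whenever exactly one side is Utf8).
    return (source == "Utf8") == (target == "Utf8")


def build_cast_matrix(
    names: List[str], check: Callable[[str, str], bool]
) -> Dict[str, List[str]]:
    # One column per target dtype, one row per source dtype.
    matrix: Dict[str, List[str]] = {"dtype": list(names)}
    for target in names:
        matrix[target] = ["" if check(source, target) else "X" for source in names]
    return matrix


matrix = build_cast_matrix(DTYPE_NAMES, nulls_compare_equal)
for row_idx, source in enumerate(DTYPE_NAMES):
    print(source.ljust(7), [matrix[target][row_idx] for target in DTYPE_NAMES])
```

The resulting dict of equal-length lists has the same shape as the `dict(zip(...))` the original script passes to `pl.DataFrame`, so it could be handed to the constructor once instead of assigning cell by cell.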

@mcrumiller mcrumiller changed the title null values that originated from different dtypes are not equal null values cast from or to Int64 (and only Int64) always evaluate to False Sep 3, 2022
@mcrumiller mcrumiller changed the title null values cast from or to Int64 (and only Int64) always evaluate to False null values cast from or to pl.Utf8 (and only pl.Utf8) always evaluate to False Sep 3, 2022
@mcrumiller
Contributor Author

mcrumiller commented Sep 3, 2022

Sorry for all the edits. I had switched the rows and columns in the script and had an offset error, which pointed at the wrong dtype, hence the title change. Both are fixed now, and the script shows that Utf8 is the culprit.

@mcrumiller mcrumiller changed the title null values cast from or to pl.Utf8 (and only pl.Utf8) always evaluate to False null values cast from or to pl.Utf8 (and only pl.Utf8) always evaluate to False when compared to other nulls Sep 3, 2022
@ritchie46
Member

Thanks for the report. Can you do my brain a favor and not call Series variables df? 😅

@mcrumiller
Contributor Author

mcrumiller commented Sep 4, 2022

Ahh geez I usually don't make that mistake. I'll edit it when I'm at my puter. Edit: fixed.

@cannero
Contributor

cannero commented Sep 6, 2022

This seems to come from arrow2. When a None value is cast to utf8, the slot behind the null holds the value "0" instead of the empty string "", and the comparison for the utf8 data type compares that underlying value even though the slot is null.
I made a short script to reproduce it:

use arrow2::array::*;
use arrow2::datatypes::*;
use arrow2::compute::{cast::*, comparison::*};

fn main() {
    let array_int = Int32Array::from_iter(vec![Some(1), None, Some(10)]);
    let array_casted = cast(&array_int, &DataType::Utf8, Default::default()).unwrap();
    let array_casted = array_casted.as_any().downcast_ref::<Utf8Array<i32>>().unwrap();
    
    let array_utf8 = Utf8Array::<i32>::from_iter(vec![Some("1"), None, Some("10")]);
    
    println!("casted None value is: <{:?}>, is null: {}",
             array_casted.value(1),
             array_casted.is_null(1));
    println!("utf8 None value is: <{:?}>, is null: {}",
             array_utf8.value(1),
             array_utf8.is_null(1));
    println!("equal {:?}", eq_and_validity(&array_utf8, array_casted));
}

This generates the same wrong output as shown in your example:

casted None value is: <"0">, is null: true
utf8 None value is: <"">, is null: true
equal BooleanArray[true, false, true]
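To make the mechanism described above concrete, here is a small pure-Python model of a string column stored as a values buffer plus a validity mask. It is only a sketch of the idea, not arrow2's actual implementation: `Utf8Column`, `eq_values_only`, and `eq_null_aware` are hypothetical names, and the `"0"` filler is taken from the output above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Utf8Column:
    """Toy model of a string column: a values buffer plus a validity mask."""

    values: List[str]     # a slot is stored even when the entry is null
    validity: List[bool]  # False marks a null entry

    @classmethod
    def from_optional(
        cls, items: List[Optional[str]], null_filler: str = ""
    ) -> "Utf8Column":
        # null_filler is whatever ends up in the values buffer behind a null.
        return cls(
            values=[null_filler if item is None else item for item in items],
            validity=[item is not None for item in items],
        )


def eq_values_only(a: Utf8Column, b: Utf8Column) -> List[bool]:
    # Compares raw slots, so whatever sits behind a null leaks into the result.
    return [x == y for x, y in zip(a.values, b.values)]


def eq_null_aware(a: Utf8Column, b: Utf8Column) -> List[bool]:
    # Two nulls are equal, a null never equals a value, values compare normally.
    out = []
    for x, y, vx, vy in zip(a.values, b.values, a.validity, b.validity):
        out.append(x == y if (vx and vy) else vx == vy)
    return out


native = Utf8Column.from_optional(["1", None, "10"])                   # null slot holds ""
casted = Utf8Column.from_optional(["1", None, "10"], null_filler="0")  # null slot holds "0"

print(eq_values_only(native, casted))  # [True, False, True]
print(eq_null_aware(native, casted))   # [True, True, True]
```

The value-only comparison reproduces the `[true, false, true]` pattern from the report, while the variant that consults the validity mask first gives the result the issue expects.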

@ritchie46
Member

Fixed by #4685
