Filter gets mis-optimized and doesn't filter out null values #1954

Closed
TheDan64 opened this issue Dec 2, 2021 · 1 comment · Fixed by #1966
Comments

@TheDan64
Contributor

TheDan64 commented Dec 2, 2021

Are you using Python or Rust?

Rust

Which feature gates did you use?

"lazy", "random", "temporal", "object"

What version of polars are you using?

d4c9e26, master Nov 30th

What operating system are you using polars on?

Windows

Describe your bug.

Filtering on this boolean column removes the rows where it is false but keeps the rows where it is null.

What are the steps to reproduce the behavior?

use chrono::NaiveDate;
use polars::prelude::*;

fn foo(lf: LazyFrame, column_name: &str) -> LazyFrame {
    // Compare each value with the previous one: true where previous < current.
    let shift_col_1 = col(column_name).shift_and_fill(1, lit(true)).lt(col(column_name));
    // Compare each value with the next one: true where next < current.
    // The last row has no next value, so the comparison is null there.
    let shift_col_neg_1 = col(column_name).shift(-1).lt(col(column_name));
    lf
        .with_columns(vec![
            shift_col_1.alias("shift_1"),
            shift_col_neg_1.alias("shift_neg_1"),
        ])
        .with_column(col("shift_1").and(col("shift_neg_1")).alias("diff"))
        // Expected to keep only the rows where diff is true.
        .filter(col("diff"))
}

pub fn test() -> Result<()> {
    let dts = vec![
        NaiveDate::from_ymd(2021, 11, 11).and_hms(11, 0, 0),
        NaiveDate::from_ymd(2021, 11, 11).and_hms(11, 1, 0),
        NaiveDate::from_ymd(2021, 11, 11).and_hms(11, 2, 0),
        NaiveDate::from_ymd(2021, 11, 11).and_hms(11, 3, 0),
        NaiveDate::from_ymd(2021, 11, 11).and_hms(11, 4, 0),
        NaiveDate::from_ymd(2021, 11, 11).and_hms(11, 5, 0),
        NaiveDate::from_ymd(2021, 11, 11).and_hms(11, 6, 0),
        NaiveDate::from_ymd(2021, 11, 11).and_hms(11, 7, 0),
        NaiveDate::from_ymd(2021, 11, 11).and_hms(11, 8, 0),
        NaiveDate::from_ymd(2021, 11, 11).and_hms(11, 9, 0),
        NaiveDate::from_ymd(2021, 11, 11).and_hms(11, 10, 0),
        NaiveDate::from_ymd(2021, 11, 11).and_hms(11, 11, 0),
        NaiveDate::from_ymd(2021, 11, 11).and_hms(11, 12, 0),
        NaiveDate::from_ymd(2021, 11, 11).and_hms(11, 13, 0),
        NaiveDate::from_ymd(2021, 11, 11).and_hms(11, 14, 0),
        NaiveDate::from_ymd(2021, 11, 11).and_hms(11, 15, 0),
        NaiveDate::from_ymd(2021, 11, 11).and_hms(11, 16, 0),
        NaiveDate::from_ymd(2021, 11, 11).and_hms(11, 17, 0),
    ];
    let data = vec![
        None,
        None,
        None,
        None,
        Some(374.227),
        Some(375.922),
        Some(375.971),
        Some(375.334),
        Some(375.406),
        Some(372.806),
        Some(372.87),
        Some(369.83),
        Some(369.974),
        Some(369.736),
        Some(369.754),
        Some(369.234),
        Some(369.259),
        Some(369.789),
    ];
    let series = Series::new("data", data);
    let df = DataFrame::new(vec![
        DatetimeChunked::new_from_naive_datetime("date", &*dts).into(),
        series,
    ])?;

    dbg!(foo(df.lazy(), "data").collect());

    Ok(())
}

What is the actual behavior?

The result includes a row whose diff value is null (the last row):

    shape: (6, 5)
    +---------------------+---------+---------+-------------+------+
    | date                | data    | shift_1 | shift_neg_1 | diff |
    | ---                 | ---     | ---     | ---         | ---  |
    | datetime            | f64     | bool    | bool        | bool |
    +=====================+=========+=========+=============+======+
    | 2021-11-11 11:06:00 | 375.971 | true    | true        | true |
    +---------------------+---------+---------+-------------+------+
    | 2021-11-11 11:08:00 | 375.406 | true    | true        | true |
    +---------------------+---------+---------+-------------+------+
    | 2021-11-11 11:10:00 | 372.87  | true    | true        | true |
    +---------------------+---------+---------+-------------+------+
    | 2021-11-11 11:12:00 | 369.974 | true    | true        | true |
    +---------------------+---------+---------+-------------+------+
    | 2021-11-11 11:14:00 | 369.754 | true    | true        | true |
    +---------------------+---------+---------+-------------+------+
    | 2021-11-11 11:17:00 | 369.789 | true    | null        | null |
    +---------------------+---------+---------+-------------+------+

What is the expected behavior?

Rows where the filter column is null should be filtered out, just like rows where it is false.
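To make the expected semantics concrete, here is a minimal sketch in plain Rust, independent of polars. `None` stands in for null, and the helper names `kleene_and` and `filter_rows` are hypothetical, introduced only for illustration. Kleene (three-valued) AND is an assumption here; the only case the output above actually shows is true AND null = null.

```rust
/// Three-valued (Kleene) AND over nullable booleans; `None` stands for null.
/// Assumed semantics for illustration, not taken from the polars source.
fn kleene_and(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        (Some(false), _) | (_, Some(false)) => Some(false),
        (Some(true), Some(true)) => Some(true),
        _ => None,
    }
}

/// Keep only the rows whose mask entry is exactly `Some(true)`.
/// Both `Some(false)` and `None` (null) rows are dropped.
fn filter_rows<T: Clone>(rows: &[T], mask: &[Option<bool>]) -> Vec<T> {
    rows.iter()
        .zip(mask)
        .filter(|(_, keep)| **keep == Some(true))
        .map(|(row, _)| row.clone())
        .collect()
}

fn main() {
    // Stand-ins for the shift_1 and shift_neg_1 columns; the last
    // shift_neg_1 entry is null because there is no next value.
    let shift_1 = [Some(true), Some(true), Some(false), Some(true)];
    let shift_neg_1 = [Some(true), Some(false), Some(true), None];

    let diff: Vec<Option<bool>> = shift_1
        .iter()
        .zip(&shift_neg_1)
        .map(|(a, b)| kleene_and(*a, *b))
        .collect();

    let rows = ["row0", "row1", "row2", "row3"];
    let kept = filter_rows(&rows, &diff);
    println!("{:?}", kept);
}
```

Under these semantics only the first row survives; the last row, whose mask entry is null, is dropped, which is the behavior this report expects from the filter.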

@ritchie46
Member

It's not the optimizer. I fixed this upstream in jorgecarleitao/arrow2#653.

For now, you can work around it in your code with:

    .filter(col("diff").fill_null(lit(false)))
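How the fill value interacts with the filter can be sketched in plain Rust, without polars. The helpers `fill_null_mask` and `filter_by_mask` are hypothetical names used only for this illustration, and `None` stands in for null. Filling nulls with false drops those rows, while filling with true keeps them.

```rust
/// Replace every null (`None`) entry of a boolean mask with `fill`.
fn fill_null_mask(mask: &[Option<bool>], fill: bool) -> Vec<bool> {
    mask.iter().map(|&v| v.unwrap_or(fill)).collect()
}

/// Keep the rows whose (non-null) mask entry is true.
fn filter_by_mask<T: Clone>(rows: &[T], mask: &[bool]) -> Vec<T> {
    rows.iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(row, _)| row.clone())
        .collect()
}

fn main() {
    // Three sample values with a true, a false, and a null mask entry.
    let data = [375.971, 369.754, 369.789];
    let diff = [Some(true), Some(false), None];

    // Nulls become false: the null row is removed together with the false row.
    let nulls_dropped = filter_by_mask(&data, &fill_null_mask(&diff, false));
    // Nulls become true: the null row is kept.
    let nulls_kept = filter_by_mask(&data, &fill_null_mask(&diff, true));

    println!("fill with false: {:?}", nulls_dropped);
    println!("fill with true:  {:?}", nulls_kept);
}
```

Because the mask contains no nulls after filling, the result no longer depends on how the filter itself treats a null entry.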
