Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to query by partition column #1445

Closed
ChewingGlass opened this issue Jun 7, 2023 · 12 comments
Closed

Unable to query by partition column #1445

ChewingGlass opened this issue Jun 7, 2023 · 12 comments
Labels
bug Something isn't working

Comments

@ChewingGlass
Copy link
Contributor

Environment

Max
Delta-rs version: latest(deltalake = { git = "https://github.com/delta-io/delta-rs", branch = "main", features = ["s3", "datafusion"] })

Binding:

Environment: OsX

  • Cloud provider: AWS
  • OS: Mac
  • Other:

Bug

What happened:

I create a table with a "date" partition column,

SchemaField::new(
            "date".to_string(),
            SchemaDataType::primitive("date".to_string()),
            false,
            HashMap::new(),
        );
DeltaOps(table)
                .create()
                .with_columns(delta_schema)
                .with_partition_columns(vec!["date"])
                .await?

If I run any query that either selects date or filters by date, I get:

Error: External error: External error: Arrow error: External error: Arrow error: Invalid argument error: column types must match schema types, expected Dictionary(UInt16, Date32) but found Dictionary(UInt16, Null) at column index 1

Caused by:
    0: External error: Arrow error: External error: Arrow error: Invalid argument error: column types must match schema types, expected Dictionary(UInt16, Date32) but found Dictionary(UInt16, Null) at column index 1
    1: Arrow error: External error: Arrow error: Invalid argument error: column types must match schema types, expected Dictionary(UInt16, Date32) but found Dictionary(UInt16, Null) at column index 1
    2: External error: Arrow error: Invalid argument error: column types must match schema types, expected Dictionary(UInt16, Date32) but found Dictionary(UInt16, Null) at column index 1
    3: Arrow error: Invalid argument error: column types must match schema types, expected Dictionary(UInt16, Date32) but found Dictionary(UInt16, Null) at column index 1
    4: Invalid argument error: column types must match schema types, expected Dictionary(UInt16, Date32) but found Dictionary(UInt16, Null) at column index 1

Removing the partition from the table and querying no longer results in an error, but querying on the date field is not working as expected (any query on the field results in no data).

What you expected to happen:
I expect the queries to run and filter by date

How to reproduce it:
See above

@ChewingGlass ChewingGlass added the bug Something isn't working label Jun 7, 2023
@cmackenzie1
Copy link
Contributor

Thanks for the report! I put up a draft PR demonstrating the issue as a test case.

cmackenzie1 added a commit to cmackenzie1-contrib/delta-rs that referenced this issue Jun 7, 2023
cmackenzie1 added a commit to cmackenzie1-contrib/delta-rs that referenced this issue Jun 7, 2023
cmackenzie1 added a commit to cmackenzie1-contrib/delta-rs that referenced this issue Jun 7, 2023
@ChewingGlass
Copy link
Contributor Author

I also noticed that with no partition, in order to get date filters to work you have to do arrow_cast(date, 'Utf8') >= arrow_cast('{}', 'Utf8')

@Blajda
Copy link
Collaborator

Blajda commented Jun 9, 2023

I played around with this. I think the issue is that Datafusion / arrow cannot cast a Date32 into a Dictionary(Int32, Date32) and it silently fails. If you explicitly try to do this it will return an error and mention it is not supported.

When I prevent Date32 from being dictionary encoded here my tests don't error out.

   #[tokio::test]
    async fn test_date() -> Result<(), Box<dyn Error>>{
        let ctx = SessionContext::new();
        let schema: Schema = serde_json::from_value(json!({
            "type": "struct",
            "fields": [
                {"name": "id", "type": "string", "nullable": true, "metadata": {}},
                {"name": "my_date", "type": "date", "nullable": true, "metadata": {}},
            ]
        }))
        .unwrap();
        let table = DeltaOps::new_in_memory()
            .create()
            .with_save_mode(SaveMode::ErrorIfExists)
            .with_columns(schema.get_fields().clone())
            .with_partition_columns(["my_date"])
            .await
            .unwrap();
        assert_eq!(table.version(), 0);

        let data = ctx.sql("select 1 as id, now() as my_date").await?.collect().await?;

        let table = DeltaOps(table)
            .write(data)
            .await
            .unwrap();
        
        ctx.register_table("test", Arc::new(table)).unwrap();
        println!("test1");
        let data = ctx.sql("select * from test where my_date <=  arrow_cast(now(), 'Date32')").await?;
        data.show().await?;

        println!("test2");
        let data = ctx.sql("select * from test where my_date > '2023-06-07'").await?;
        data.show().await?;

        println!("test3");
        ctx.table("test").await?
            .filter(col("my_date").gt(lit("2023-06-05")))?
            .show().await?;

        Ok(())
    }

@cmackenzie1
Copy link
Contributor

cmackenzie1 commented Jun 9, 2023

Looking at a few of my tables, using the type string for partition values avoids this issue as well.

SchemaField::new(
            "date".to_string(),
            SchemaDataType::primitive("string".to_string()),
            false,
            HashMap::new(),
        );

@wjones127
Copy link
Collaborator

I think the issue is that Datafusion / arrow cannot cast a Date32 into a Dictionary(Int32, Date32)

That's very odd. Arrow should be able to cast any type into it's dictionary form.
https://github.com/apache/arrow-rs/blob/ab56693985826bb8caea30558b8c25db286a5e37/arrow-cast/src/cast.rs#LL127C16-L127C16
https://github.com/apache/arrow-rs/blob/ab56693985826bb8caea30558b8c25db286a5e37/arrow-cast/src/cast.rs#L216

If it's the only quick fix we can find, we can remove the dictionary encoding on partition columns. But we should add it back when we can, since it's much more performant for those to be dictionary.

@Blajda
Copy link
Collaborator

Blajda commented Jun 9, 2023

This is the query I used and the resulting error

let data = ctx.sql("select * from test where my_date <=  arrow_cast(now(), 'Dictionary(UInt16, Date32)')").await?;

Error

Error: Context("Optimizer rule 'simplify_expressions' failed", ArrowError(CastError("Unsupported output type for dictionary packing: Date32")))

which traces to here
https://github.com/apache/arrow-rs/blob/ab56693985826bb8caea30558b8c25db286a5e37/arrow-cast/src/cast.rs#L3572

@wjones127
Copy link
Collaborator

Reported this upstream: apache/arrow-rs#4390

Sounds like we should remove the dictionary encoding for now then. And we can try bringing it back once we fix that upstream issue.

@tustvold
Copy link

tustvold commented Jun 9, 2023

remove the dictionary encoding

FWIW Dictionary<Int32, Date32> will be larger, slower to process, and less well supported than Date32. I don't see a reason to ever dictionary encode primitives

@wjones127
Copy link
Collaborator

Ah okay. If Arrow won't support that well, then let's not plan on using dictionary columns for small values.

ChewingGlass pushed a commit to helium/delta-rs that referenced this issue Jun 12, 2023
@ChewingGlass
Copy link
Contributor Author

Thanks for the quick responses guys!

FWIW this fixed it (ae7d2d2)

Not sure if the right fix for the repo at large, but it at least unblocked me.

watfordkcf added a commit to watfordkcf/delta-rs that referenced this issue Jun 19, 2023
@watfordkcf
Copy link
Contributor

I can confirm that @ChewingGlass's fix works for our projects as well. I saw a PR was started but is hanging, happy to push a combined PR (ae7d2d2 and #1447).

I tried building locally without success (deltalake-python / pyo3 don't seem to like my M2 Max arch), but working through that.

@ChewingGlass
Copy link
Contributor Author

Just made a PR #1481

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
6 participants