DeltaTable DataFusion scan does not prune files properly #1063

Closed
alexwilcoxson-rel opened this issue Jan 13, 2023 · 0 comments · Fixed by #1071
Labels
bug Something isn't working

alexwilcoxson-rel commented Jan 13, 2023

Environment

Delta-rs version: 0.6

Binding: Rust

Environment:

  • Cloud provider: Azure
  • OS: macOS
  • Other: M1

Bug

What happened:
When filters are passed directly to the DeltaTable scan implementation of the DataFusion TableProvider, files are not pruned correctly based on their statistics. The files that do NOT match the predicate are the ones passed to the parquet plan.

What you expected to happen:
Only files that may match the predicate are passed to the parquet plan.

How to reproduce it:
Run the following snippet against the attached table (data.zip).

    let table = Box::new(open_table("./data/table").await?);
    let ctx = SessionContext::new();
    // Only rows with n == 1 are wanted, so only files whose stats can contain 1 should be scanned.
    let filters = [
        col("n").eq(lit(1_i64))
    ];
    let plan = table.scan(&ctx.state(), None, &filters, None).await?;
    let children = plan.children();
    let parquet_plan = children.first().unwrap().as_any().downcast_ref::<ParquetExec>().unwrap();
    let file_scan = parquet_plan.base_config();
    // Print every file that survived pruning and will be read by the parquet plan.
    for f in file_scan.file_groups.iter().flatten() {
        println!("{:?}", f.object_meta.location);
    }

Record 1 is in partition 0, which is missing from the file scan, while all the other files, which do not match the predicate, are present.

More details:
The pruning logic at https://github.com/delta-io/delta-rs/blob/main/rust/src/delta_datafusion.rs#L354 should be updated to the following, so that a file is added to the scan only when its entry in the pruning result is true:

            let pruning_predicate = PruningPredicate::try_new(predicate, schema.clone())?;
            let files_to_prune = pruning_predicate.prune(self)?;
            self.get_state()
                .files()
                .iter()
                .zip(files_to_prune.into_iter())
                .for_each(|(action, keep_file)| {
                    if keep_file {
                        let part = partitioned_file_from_action(action, &schema);
                        file_groups
                            .entry(part.partition_values.clone())
                            .or_default()
                            .push(part);
                    };
                });
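
To make the intended semantics concrete: DataFusion's `PruningPredicate::prune` returns one boolean per container, where `true` means the container might hold matching rows and has to be kept. The snippet above zips that mask with the file list and keeps only the `true` entries. Below is a minimal standalone sketch of the same pattern using only the standard library. `FileStats`, `keep_mask`, `files_to_scan`, `sample_files`, and the file paths are hypothetical names made up for illustration; they are not delta-rs or DataFusion APIs, and the min/max check stands in for the real statistics-based predicate evaluation.

```rust
/// Hypothetical per-file statistics for an integer column `n`.
/// This is an illustrative stand-in, not a delta-rs type.
#[derive(Debug, Clone)]
struct FileStats {
    path: &'static str,
    min_n: i64,
    max_n: i64,
}

/// Stand-in for the pruning step: one bool per file, in file order.
/// `true` means the file might contain rows with `n == target` and must be kept.
fn keep_mask(files: &[FileStats], target: i64) -> Vec<bool> {
    files
        .iter()
        .map(|f| f.min_n <= target && target <= f.max_n)
        .collect()
}

/// Zip the files with the mask and keep only the entries flagged `true`.
/// Keeping the `false` entries instead would reproduce the bug described above.
fn files_to_scan(files: &[FileStats], target: i64) -> Vec<&'static str> {
    files
        .iter()
        .zip(keep_mask(files, target))
        .filter(|(_, keep)| *keep)
        .map(|(f, _)| f.path)
        .collect()
}

/// Hypothetical table layout: one file per partition, one distinct value of `n` each.
fn sample_files() -> Vec<FileStats> {
    vec![
        FileStats { path: "part-0.parquet", min_n: 1, max_n: 1 },
        FileStats { path: "part-1.parquet", min_n: 2, max_n: 2 },
        FileStats { path: "part-2.parquet", min_n: 3, max_n: 3 },
    ]
}

fn main() {
    let files = sample_files();
    // With the filter n == 1, only the first file should survive pruning.
    for path in files_to_scan(&files, 1) {
        println!("{path}");
    }
}
```

With the sample data, filtering on `n == 1` leaves only `part-0.parquet`. Inverting the condition in the `filter` call would instead return the two files that cannot match, which is the behavior reported in this issue.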