You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
What happened:
When using Datafusion, statistics from the delta log are used to prune which files are scanned.
Delta logs do not contain statistics for partition columns in the regular stats columns but store the partition values in a separate key.
What you expected to happen:
Partition information should be used while pruning.
How to reproduce it:
This test which currently fails demonstrates that the scan visits all files for a partition that does not exist in the table.
The implementation also needs to consider null partitions are handled correctly.
#[tokio::test]asyncfntest_datafusion_prune_partitioned() -> Result<()>{use datafusion::prelude::*;// Validate that partition information in include in table statisticslet table = deltalake::open_table("./test/data/delta-2.2.0-partitioned-types").await.unwrap();let statistics = table.datafusion_table_statistics();assert_eq!(statistics.num_rows,Some(3),);assert_eq!(statistics.total_byte_size,Some(452*3));asyncfnget_scan_metrics(table:&DeltaTable,state:&SessionState,e:Expr) -> Result<ExecutionMetricsCollector>{letmut metrics = ExecutionMetricsCollector::default();let scan = table.scan(state,None,&[e],None).await?;if scan.output_partitioning().partition_count() > 0{let plan = CoalescePartitionsExec::new(scan);let task_ctx = Arc::new(TaskContext::from(state));let _result = collect(plan.execute(0, task_ctx)?).await?;visit_execution_plan(&plan,&mut metrics).unwrap();}returnOk(metrics);}let ctx = SessionContext::new();let state = ctx.state();let e = col("c1").eq(lit(1));let metrics = get_scan_metrics(&table,&state, e).await?;println!("{:?}", metrics.scanned_files);assert!(metrics.num_scanned_files() == 0);let e = col("c1").eq(lit(4));let metrics = get_scan_metrics(&table,&state, e).await?;println!("{:?}", metrics.scanned_files);assert!(metrics.num_scanned_files() == 1);let e = col("c3").eq(lit(4));let metrics = get_scan_metrics(&table,&state, e).await?;println!("{:?}", metrics.scanned_files);assert!(metrics.num_scanned_files() == 1);let e = col("c3").eq(lit(0));let metrics = get_scan_metrics(&table,&state, e).await?;println!("{:?}", metrics.scanned_files);assert!(metrics.num_scanned_files() == 0);Ok(())}
The text was updated successfully, but these errors were encountered:
In this particular case it can be fixed by updating the PruningStatics implementation to check if a column is partition column and then handle it. I think this would be a shallow fix and maybe a better involves updating get_stats directly on the DeltaTable
# Description
Exposes partition columns in Datafusion's `PruningStatistics` which will
reduce the number of files scanned when the table is queried.
This also resolves another partition issues where involving `null`
partitions. Previously `ScalarValue::Null` was used which would cause an
error when the actual datatype was obtained from the physical parquet
files.
# Related Issue(s)
- closes#1175
# Description
Exposes partition columns in Datafusion's `PruningStatistics` which will
reduce the number of files scanned when the table is queried.
This also resolves another partition issues where involving `null`
partitions. Previously `ScalarValue::Null` was used which would cause an
error when the actual datatype was obtained from the physical parquet
files.
# Related Issue(s)
- closesdelta-io#1175
Environment
Delta-rs version: Main
Binding: rust
Bug
What happened:
When using Datafusion, statistics from the delta log are used to prune which files are scanned.
Delta logs do not contain statistics for partition columns in the regular stats columns but store the partition values in a separate key.
What you expected to happen:
Partition information should be used while pruning.
How to reproduce it:
This test which currently fails demonstrates that the scan visits all files for a partition that does not exist in the table.
The implementation also needs to consider null partitions are handled correctly.
The text was updated successfully, but these errors were encountered: