fix: include stats for all columns (#1223) #1342
Conversation
Thanks for working on this. I'm not sure which, but I'm thinking for columns that don't have any statistics, we should either:
My 2 cents would be to exclude them: since Delta allows configuring how many columns to collect metrics for (which delta-rs does not yet honor :D), my expectation would be to only get metrics for those columns. This defaults to the first 32 columns. https://learn.microsoft.com/en-us/azure/databricks/delta/table-properties (Update: should have read that you were referencing that very property in the description 😆.)
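For reference, a minimal sketch (using the deltalake Python bindings, and assuming the table created in the snippet below at /tmp/delta/people10m) of how the configured limit can be read back from the table properties; the key is only present if it was explicitly set, otherwise the engine default of 32 applies:
from deltalake import DeltaTable
dt = DeltaTable("/tmp/delta/people10m")
# Table properties are exposed via the metadata configuration map.
num_indexed = dt.metadata().configuration.get("delta.dataSkippingNumIndexedCols")
print(num_indexed if num_indexed is not None else "not set (default: 32)")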
Updated the PR to not include cols without any stats. Code to verify:
import pyspark
from delta import *
builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
spark = configure_spark_with_delta_pip(builder).getOrCreate()
location = "/tmp/delta/people10m"
DeltaTable.createIfNotExists(spark) \
    .addColumn("firstName", "STRING") \
    .addColumn("lastName", "STRING") \
    .addColumn("gender", "STRING") \
    .property("description", "table with people data") \
    .location(location) \
    .execute()
# create stats for firstName, lastName
spark.sql(f"ALTER TABLE delta.`{location}` SET TBLPROPERTIES(delta.dataSkippingNumIndexedCols = 2);")
columns = ["firstName", "lastName", "gender"]
data = [("Maria", "Dimas", "female"), ("John", "Francis", "male"), ("Frank", "Sinatra", "male")]
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF(columns)
df.coalesce(1).write.format("delta").mode("append").save(location)
# This adds a new update to the table, but this time all three columns will have stats.
spark.sql(f"ALTER TABLE delta.`{location}` SET TBLPROPERTIES(delta.dataSkippingNumIndexedCols = 3);")
data = [("Martin", "Johnson", "male"), ("Victoria", "Neal", "female")]
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF(columns)
df.coalesce(1).write.format("delta").mode("append").save(location)
And the Rust code to inspect the add actions afterwards:
#[tokio::main(flavor = "current_thread")]
async fn main() -> Result<(), deltalake::DeltaTableError> {
let table_path = "/tmp/delta/people10m";
let table = deltalake::open_table(table_path).await?;
println!("{table}");
let acts = table.get_state().add_actions_table(true).unwrap();
dbg!(acts);
Ok(())
}
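A Python counterpart to the Rust check above, assuming a deltalake version that exposes get_add_actions: the flattened add actions contain per-column stats fields, and with this PR columns without collected stats should surface as nulls rather than being dropped.
from deltalake import DeltaTable
dt = DeltaTable("/tmp/delta/people10m")
# Flattened add actions include per-file stats columns (e.g. min/max/null_count per field).
actions = dt.get_add_actions(flatten=True)
print(actions.schema.names)
print(actions.to_pydict())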
Looks good. Thanks!
Description
This is a proposal for how #1223 could be fixed.
Related Issue(s)
Documentation
The current implementation excludes all columns that lack statistical information. The proposed fix generates entries for all columns, with missing statistical values replaced by null values. However, it is unclear whether this is the correct behavior, since the stats_as_batch function lacks documentation.
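To make the proposed behavior concrete, a small pyarrow sketch of the expected shape of the stats batch (hypothetical column names and values, not the actual stats_as_batch output): every column appears, and columns without collected statistics carry nulls.
import pyarrow as pa
# Hypothetical stats batch: firstName has collected stats, gender does not,
# so its min/max entries are null instead of the column being omitted.
stats = pa.RecordBatch.from_pydict({
    "min.firstName": pa.array(["Frank"]),
    "max.firstName": pa.array(["Maria"]),
    "min.gender": pa.array([None], type=pa.string()),
    "max.gender": pa.array([None], type=pa.string()),
})
print(stats)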