Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: include stats for all columns (#1223) #1342

Merged
merged 2 commits into from
May 13, 2023
Merged

fix: include stats for all columns (#1223) #1342

merged 2 commits into from
May 13, 2023

Conversation

mrjoe7
Copy link
Contributor

@mrjoe7 mrjoe7 commented May 7, 2023

Description

This is a proposal for how #1223 could be fixed.

Related Issue(s)

Documentation

The current implementation excludes all columns that lack statistical information. The proposed fix will generate information for all columns, with missing statistical values being replaced by 'null' values. However, it is unclear if this is the correct behavior since the stats_as_batch function lacks documentation.

@github-actions github-actions bot added binding/rust Issues for the Rust crate rust labels May 7, 2023
@wjones127
Copy link
Collaborator

wjones127 commented May 8, 2023

Thanks for working on this.

I'm not sure which, but I'm thinking for columns that don't have any statistics, we should either:

  1. exclude them from the output
  2. use NullArray for their stats, to minimize memory overhead.

@mrjoe7 mrjoe7 marked this pull request as draft May 8, 2023 08:36
@mrjoe7 mrjoe7 marked this pull request as ready for review May 8, 2023 14:02
@roeap
Copy link
Collaborator

roeap commented May 8, 2023

My 2 cents would be to exclude them, since delta allows for a config how many columns to collect metrics for (which delta-rs does not yet honor :D), my expectation would be to only get metrics for these columns. This defaults to the first 32 columns.

https://learn.microsoft.com/en-us/azure/databricks/delta/table-properties (delta.dataSkippingNumIndexedCols)

Update: should have read that you were referencing that very property in the description 😆.

@mrjoe7
Copy link
Contributor Author

mrjoe7 commented May 8, 2023

Updated the PR to not include cols without any stats.

code to verify:

import pyspark
from delta import *

builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = configure_spark_with_delta_pip(builder).getOrCreate()

location = "/tmp/delta/people10m"

DeltaTable.createIfNotExists(spark).addColumn("firstName", "STRING").addColumn("lastName", "STRING").addColumn("gender", "STRING").property("description", "table with people data").location(location).execute()

# create stats for firstName, lastName
spark.sql(f"ALTER TABLE delta.`{location}` SET TBLPROPERTIES(delta.dataSkippingNumIndexedCols = 2);")

columns = ["firstName", "lastName", "gender"]

data = [("Maria", "Dimas", "female"), ("John", "Francis", "male"), ("Frank", "Sinatra", "male")]
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF(columns)
df.coalesce(1).write.format("delta").mode("append").save(location)

# This adds a new update to the table, but this time all three columns will have stats.
spark.sql(f"ALTER TABLE delta.`{location}` SET TBLPROPERTIES(delta.dataSkippingNumIndexedCols = 3);")
data = [("Martin", "Johnson", "male"), ("Victoria", "Neal", "female")]
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF(columns)
df.coalesce(1).write.format("delta").mode("append").save(location)
#[tokio::main(flavor = "current_thread")]
async fn main() -> Result<(), deltalake::DeltaTableError> {
    let table_path = "/tmp/delta/people10m";
    let table = deltalake::open_table(table_path).await?;
    println!("{table}");
    let acts = table.get_state().add_actions_table(true).unwrap();
    dbg!(acts);
    Ok(())
}

@mrjoe7 mrjoe7 requested a review from wjones127 May 11, 2023 18:05
Copy link
Collaborator

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks!

@wjones127 wjones127 enabled auto-merge (squash) May 13, 2023 18:18
@wjones127 wjones127 merged commit 8a4b2b8 into delta-io:main May 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate rust
Projects
None yet
3 participants