fix: include stats for all columns (#1223) #1342
Conversation
Thanks for working on this. I'm not sure which, but I'm thinking for columns that don't have any statistics, we should either:
My 2 cents would be to exclude them: since Delta allows configuring how many columns to collect metrics for (which delta-rs does not yet honor :D), my expectation would be to only get metrics for those columns. This defaults to the first 32 columns. https://learn.microsoft.com/en-us/azure/databricks/delta/table-properties (Update: should have read that you were referencing that very property in the description 😆.)
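For reference, a minimal sketch (using the deltalake Python bindings, and assuming the table created in the snippet below at /tmp/delta/people10m) of how the configured limit can be read back from the table properties; the key is only present if it was explicitly set, otherwise the engine default of 32 applies:
from deltalake import DeltaTable
dt = DeltaTable("/tmp/delta/people10m")
# Table properties are exposed via the metadata configuration map.
num_indexed = dt.metadata().configuration.get("delta.dataSkippingNumIndexedCols")
print(num_indexed if num_indexed is not None else "not set (default: 32)")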
Updated the PR to not include cols without any stats. Code to verify:
import pyspark
from delta import *
builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
spark = configure_spark_with_delta_pip(builder).getOrCreate()
location = "/tmp/delta/people10m"
DeltaTable.createIfNotExists(spark) \
    .addColumn("firstName", "STRING") \
    .addColumn("lastName", "STRING") \
    .addColumn("gender", "STRING") \
    .property("description", "table with people data") \
    .location(location) \
    .execute()
# create stats for firstName, lastName
spark.sql(f"ALTER TABLE delta.`{location}` SET TBLPROPERTIES(delta.dataSkippingNumIndexedCols = 2);")
columns = ["firstName", "lastName", "gender"]
data = [("Maria", "Dimas", "female"), ("John", "Francis", "male"), ("Frank", "Sinatra", "male")]
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF(columns)
df.coalesce(1).write.format("delta").mode("append").save(location)
# This adds a new update to the table, but this time all three columns will have stats.
spark.sql(f"ALTER TABLE delta.`{location}` SET TBLPROPERTIES(delta.dataSkippingNumIndexedCols = 3);")
data = [("Martin", "Johnson", "male"), ("Victoria", "Neal", "female")]
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF(columns)
df.coalesce(1).write.format("delta").mode("append").save(location)
And the Rust code to inspect the add actions afterwards:
#[tokio::main(flavor = "current_thread")]
async fn main() -> Result<(), deltalake::DeltaTableError> {
let table_path = "/tmp/delta/people10m";
let table = deltalake::open_table(table_path).await?;
println!("{table}");
let acts = table.get_state().add_actions_table(true).unwrap();
dbg!(acts);
Ok(())
}
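A Python counterpart to the Rust check above, assuming a deltalake version that exposes get_add_actions: the flattened add actions contain per-column stats fields, and with this PR columns without collected stats should surface as nulls rather than being dropped.
from deltalake import DeltaTable
dt = DeltaTable("/tmp/delta/people10m")
# Flattened add actions include per-file stats columns (e.g. min/max/null_count per field).
actions = dt.get_add_actions(flatten=True)
print(actions.schema.names)
print(actions.to_pydict())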
Looks good. Thanks!
Description
This is a proposal for how #1223 could be fixed.
Related Issue(s)
Documentation
The current implementation excludes all columns that lack statistical information. The proposed fix generates entries for all columns, with missing statistical values replaced by null values. However, it is unclear whether this is the correct behavior, since the stats_as_batch function lacks documentation.
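To make the proposed behavior concrete, a small pyarrow sketch of the expected shape of the stats batch (hypothetical column names and values, not the actual stats_as_batch output): every column appears, and columns without collected statistics carry nulls.
import pyarrow as pa
# Hypothetical stats batch: firstName has collected stats, gender does not,
# so its min/max entries are null instead of the column being omitted.
stats = pa.RecordBatch.from_pydict({
    "min.firstName": pa.array(["Frank"]),
    "max.firstName": pa.array(["Maria"]),
    "min.gender": pa.array([None], type=pa.string()),
    "max.gender": pa.array([None], type=pa.string()),
})
print(stats)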