Environment
Delta-rs version: 0.17.0 (not fixed in master)
Binding: rust
Environment: N/A
Bug
What happened: When using `WriteMode::MergeSchema` on `RecordBatchWriter::write_with_mode`, I encountered a scenario where the commit had an attached `metaData` action that a) removed the partition columns from the metadata, and b) removed those columns entirely from the schema (even though the schemas matched).
What you expected to happen: The schemas to match, and the partition column information to be left intact.
How to reproduce it: Write a batch with `WriteMode::MergeSchema` against a table with partition columns.
More details:
This looks like an oversight in the original schema-merging PR. The code that handles this case has a TODO for setting the partition columns: https://github.com/delta-io/delta-rs/blob/main/crates/core/src/writer/record_batch.rs#L242
This explains why the partition columns get zeroed on write: the code was never written with them in mind.
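To illustrate the missing step, here is a minimal sketch of carrying the partition columns over when a schema-merging commit rebuilds the table metadata. The `Metadata` struct and `metadata_with_merged_schema` function are hypothetical stand-ins invented for this example; they are not the delta-rs types or API.

```rust
// Hypothetical, simplified model of a metaData action (not the delta-rs type).
#[derive(Debug, Clone, PartialEq)]
struct Metadata {
    // Column names of the table schema, partition columns included.
    schema_columns: Vec<String>,
    // Names of the columns the table is partitioned by.
    partition_columns: Vec<String>,
}

/// Builds the metadata for a schema-merging commit, copying the partition
/// columns from the current table metadata instead of leaving them empty.
fn metadata_with_merged_schema(current: &Metadata, merged_columns: Vec<String>) -> Metadata {
    Metadata {
        schema_columns: merged_columns,
        partition_columns: current.partition_columns.clone(),
    }
}
```

In this model, a table partitioned by `date` keeps `date` in `partition_columns` after the merge, which is the behaviour the report expects.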
The reason the schema gets updated is that, in the presence of partition columns, `self.arrow_schema_ref` and `self.original_schema_ref` will never match. This is because `original_schema_ref` is the schema of the table, while `arrow_schema_ref` is the schema of the written Parquet file, which has the partition columns stripped.
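The mismatch described above can be modelled with a small, self-contained sketch. It uses a simplified `Field` type rather than the real Arrow schema, and every name in it (`Field`, `field`, `strip_partition_columns`, `naive_schema_changed`, `partition_aware_schema_changed`) is hypothetical; it only illustrates the comparison logic and is not the code in `record_batch.rs`.

```rust
// Simplified stand-in for an Arrow field (not the arrow or delta-rs type).
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: String,
}

// Convenience constructor for the sketch.
fn field(name: &str, data_type: &str) -> Field {
    Field { name: name.to_string(), data_type: data_type.to_string() }
}

/// Returns the schema without the partition columns, which is the shape
/// the written Parquet file has according to the report.
fn strip_partition_columns(schema: &[Field], partition_columns: &[&str]) -> Vec<Field> {
    schema
        .iter()
        .filter(|f| !partition_columns.contains(&f.name.as_str()))
        .cloned()
        .collect()
}

/// Naive check: compares the file schema against the full table schema.
/// With at least one partition column this always reports a change.
fn naive_schema_changed(table_schema: &[Field], file_schema: &[Field]) -> bool {
    table_schema != file_schema
}

/// Partition-aware check: strips the partition columns from the table
/// schema first, so both sides describe the same set of columns.
fn partition_aware_schema_changed(
    table_schema: &[Field],
    file_schema: &[Field],
    partition_columns: &[&str],
) -> bool {
    strip_partition_columns(table_schema, partition_columns) != file_schema
}
```

For a table with columns `id` and `date` partitioned by `date`, the file schema contains only `id`. The naive check therefore reports a schema change even though nothing was added, while the partition-aware check does not.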