-
Notifications
You must be signed in to change notification settings - Fork 867
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Panic instead of discarding nulls converting StructArray to RecordBatch - (#3951) #3953
Conversation
0bd4138
to
36168c8
Compare
@@ -89,6 +89,14 @@ impl StructArray { | |||
} | |||
} | |||
|
|||
/// Returns the [`Field`] of this [`StructArray`] | |||
pub fn fields(&self) -> &[Field] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this needs to be updated to avoid a logical conflict with FieldList
arrow-array/src/record_batch.rs
Outdated
assert_eq!( | ||
value.null_count(), | ||
0, | ||
"Cannot convert nullable StructArray to RecordBatch" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why this is an error
Wouldn't it make more sense to apply the parent StructArray
validity mask to each child's mask to form a new mask that is only non null if both the struct array and the child array were non null 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be a possibility, but then what happens if the children are not nullable - either because of their type, e.g. RunArray or UnionArray, or because the field is labelled not nullable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 then it seems like we would have to update the schema to not simply be the schema of the Struct array itself
- I am not saying we have to implement this logic, but a hard panic seems like it may be overly drastic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to not simply be the schema of the Struct array itself
Not all child types are nullable, as a result of their encoding. This would not just be a schema-only operation. Taking a step back though, silently changing the schema based on the presence of nulls seems like a very unexpected behaviour? DataFusion would get very confused if one of its RecordBatch suddenly had a different schema?
TBC this conversion from RecordBatch to StructArray should only be being attempted at a top-level where you can statically guarantee there won't be any nulls present - e.g. the top-level of a parquet schema, panicking if you violate that invariant seems perfectly reasonable to me...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
panicking if you violate that invariant seems perfectly reasonable to me...
Is this invariant documented anywhere? I am just thinking of some case where the code "used to work" ( I realize not correctly) and now it still compiles but panic's at runtime.
Maybe we can just document the invariant better -- and I suppose as long as there is a workaround it should be fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this invariant documented anywhere
Well yes, a RecordBatch can't contain nulls? I think I'm missing something as to why this is a controversial change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm missing something as to why this is a controversial change?
My understanding is that it makes something that used to work panic. Maybe my problem is that I don't really understand the problem this is fixing (#3952 just says that something should work, not the why). It is probably obvious to you but sadly it is not obvious to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following our synchronous discussion I've added some more docs in a165ddc PTAL
/// code that needs to handle both will typically share an implementation in terms of | ||
/// [`StructArray`] and convert to/from [`RecordBatch`] as necessary. | ||
/// | ||
/// [`From`] implementations are provided to facilitate this conversion, however, converting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 thank you -- this is much clearer now
Which issue does this PR close?
Closes #3952
Rationale for this change
Discarding the null buffer could potentially unmask nulls in non-nullable children, which would be ill-formed.
What changes are included in this PR?
Are there any user-facing changes?
Technically this may change the behaviour of code that was relying on the null buffers silently being discarded, in practice this code was ill-formed.