Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Panic instead of discarding nulls converting StructArray to RecordBatch - (#3951) #3953

Merged
merged 3 commits into from
Apr 1, 2023

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Mar 26, 2023

Which issue does this PR close?

Closes #3952

Rationale for this change

Discarding the null buffer could potentially unmask nulls in non-nullable children, which would be ill-formed.

What changes are included in this PR?

Are there any user-facing changes?

Technically this may change the behaviour of code that was relying on the null buffers silently being discarded, in practice this code was ill-formed.

@tustvold tustvold added the api-change Changes to the arrow API label Mar 26, 2023
@github-actions github-actions bot added arrow Changes to the arrow crate parquet Changes to the parquet crate labels Mar 26, 2023
@tustvold tustvold force-pushed the record-batch-struct-array-nulls branch from 0bd4138 to 36168c8 Compare March 26, 2023 15:28
@@ -89,6 +89,14 @@ impl StructArray {
}
}

/// Returns the [`Field`] of this [`StructArray`]
pub fn fields(&self) -> &[Field] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this needs to be updated to avoid a logical conflict with FieldList

assert_eq!(
value.null_count(),
0,
"Cannot convert nullable StructArray to RecordBatch"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why this is an error

Wouldn't it make more sense to apply the parent StructArray validity mask to each child's mask to form a new mask that is only non null if both the struct array and the child array were non null 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be a possibility, but then what happens if the children are not nullable - either because of their type, e.g. RunArray or UnionArray, or because the field is labelled not nullable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 then it seems like we would have to update the schema to not simply be the schema of the Struct array itself

  • I am not saying we have to implement this logic, but a hard panic seems like it may be overly drastic

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to not simply be the schema of the Struct array itself

Not all child types are nullable, as a result of their encoding. This would not just be a schema-only operation. Taking a step back though, silently changing the schema based on the presence of nulls seems like a very unexpected behaviour? DataFusion would get very confused if one of its RecordBatch suddenly had a different schema?

TBC this conversion from RecordBatch to StructArray should only be being attempted at a top-level where you can statically guarantee there won't be any nulls present - e.g. the top-level of a parquet schema, panicking if you violate that invariant seems perfectly reasonable to me...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

panicking if you violate that invariant seems perfectly reasonable to me...

Is this invariant documented anywhere? I am just thinking of some case where the code "used to work" ( I realize not correctly) and now it still compiles but panic's at runtime.

Maybe we can just document the invariant better -- and I suppose as long as there is a workaround it should be fine

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this invariant documented anywhere

Well yes, a RecordBatch can't contain nulls? I think I'm missing something as to why this is a controversial change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm missing something as to why this is a controversial change?

My understanding is that it makes something that used to work panic. Maybe my problem is that I don't really understand the problem this is fixing (#3952 just says that something should work, not the why). It is probably obvious to you but sadly it is not obvious to me

Copy link
Contributor Author

@tustvold tustvold Mar 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following our synchronous discussion I've added some more docs in a165ddc PTAL

/// code that needs to handle both will typically share an implementation in terms of
/// [`StructArray`] and convert to/from [`RecordBatch`] as necessary.
///
/// [`From`] implementations are provided to facilitate this conversion, however, converting
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 thank you -- this is much clearer now

@alamb alamb changed the title Don't discard nulls converting StructArray to RecordBatch (#3951) Panic instead of discarding nulls converting StructArray to RecordBatch - (#3951) Apr 1, 2023
@alamb alamb merged commit 8bac91d into apache:master Apr 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RecordBatch From StructArray Silently Discards Nulls
2 participants