Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor ipc reading code into methods on ArrayReader #7006

Merged
merged 2 commits into from
Jan 27, 2025

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Jan 21, 2025

Note this PR includes a bunch of whitespace changes that make it look like a much larger change than it is

Viewing the diff by ignoring whitespace makes the structure more clear I think

Which issue does this PR close?

Rationale for this change

@totoroyyb and I are working on adding another flag to the ipc reader code that allows disabling validation during read (see #6938). If the flag is true it will be unsafe.

The current structure of the code as a bunch of free functions makes it impossible to pass the parameter down without having to mark all the inner functions unsafe, which is not correct.

What changes are included in this PR?

This PR refactors the code that currently takes an ArrayReader as a parameter
and makes it a method on the function. This has the nice property that
we don't have to pass require_alignment as a flag anymore and it sets us up very nicely
to be able to add disable_validation in a follow on PR

Are there any user-facing changes?

There are no intended changes in this PR -- all changes are to internal structures only.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jan 21, 2025
.map(|_| reader.next_buffer())
.collect::<Result<Vec<_>, _>>()?;
create_primitive_array(
impl ArrayReader<'_> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This illustrates the point of this PR: remove the need to pass require_alignment as an argument.

It does this by this code into a method on ArrayReader which means require_alignment is now passed as a field.

require_alignment is not a problem, but skip_validation is, for the reasons @tustvold states on #6938 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should rename ArrayReader to RecordBatchDecoder (there is a trait already called RecordBatchReader), and this isn't actually doing anything to do with io::Read

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea -- I will do so in a follow on PR

field: &Field,
variadic_counts: &mut VecDeque<i64>,
) -> Result<ArrayRef, ArrowError> {
let reader = self;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I aliased self --> reader to minimize the diff. I can remove this rename as a follow on PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would probably also be good to move it to the rest of the impl at the same time

Copy link
Contributor Author

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are still a few other methods that need to be moved into ArrayReader (e.g. anything that has a require_alignment flag

}

impl<'a> ArrayReader<'a> {
/// Create a reader for decoding arrays from an encoded [`RecordBatch`]
fn try_new(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is refactored code from read_record_batch_impl

metadata: &MetadataVersion,
require_alignment: bool,
) -> Result<RecordBatch, ArrowError> {
let buffers = batch.buffers().ok_or_else(|| {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the code in this method was split across ArrayReader::try_new and ArrayReader::read_record_batch

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, I did have a thought that this might make it harder to "safely" encapsulate the unsafe unchecked flag, but I think we could do something like

mod private {
    /// A boolean flag that cannot be mutated outside of unsafe code
    pub struct UnsafeFlag(bool);

    impl UnsafeFlag {
        #[inline]
        pub unsafe fn set(&mut self, val: bool) {
            self.0 = val;
        }

        #[inline]
        pub fn get(&self) -> bool {
             return self.0
        }
    }
}

.map(|_| reader.next_buffer())
.collect::<Result<Vec<_>, _>>()?;
create_primitive_array(
impl ArrayReader<'_> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should rename ArrayReader to RecordBatchDecoder (there is a trait already called RecordBatchReader), and this isn't actually doing anything to do with io::Read

field: &Field,
variadic_counts: &mut VecDeque<i64>,
) -> Result<ArrayRef, ArrowError> {
let reader = self;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would probably also be good to move it to the rest of the impl at the same time


let options = RecordBatchOptions::new().with_row_count(Some(self.batch.length() as usize));

let schema = Arc::clone(&self.schema);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW this will now always clone the schema even if it is projected, this is almost certainly irrelevant but something I noticed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good point -- this Arc clone comes from the fact that self.create_array below takes &mut self so the code can't also have self.schema borrowed immutably as well

As it happens once per RecordBatch I agree with your assesment that this will be irrelevant in practice

@alamb
Copy link
Contributor Author

alamb commented Jan 27, 2025

This looks good to me, I did have a thought that this might make it harder to "safely" encapsulate the unsafe unchecked flag, but I think we could do something like

This is a great idea; I tried it out in ArrayDataBuilder and I think it looks quite good 👌

@alamb
Copy link
Contributor Author

alamb commented Jan 27, 2025

@alamb alamb merged commit 15249bf into apache:main Jan 27, 2025
26 checks passed
@alamb alamb deleted the alamb/ipc_reader_factor branch January 27, 2025 10:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants