Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make it easier to treat Rows as bytes #6063

Closed
bkirwi opened this issue Jul 15, 2024 · 5 comments · Fixed by #6096
Closed

Make it easier to treat Rows as bytes #6063

bkirwi opened this issue Jul 15, 2024 · 5 comments · Fixed by #6096
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@bkirwi
Copy link
Contributor

bkirwi commented Jul 15, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I'm trying to convert back and forth between Rows data and raw bytes. (Think an external sort, for example: converting to rows and then shifting the data to off-heap storage.)

Describe the solution you'd like

Arrow's APIs for this are already pretty good, but there are two things that would make my life easier.

*** A data accessor on Row ***

Right now it's impossible to write this function:

fn bytes<'a>(row: &Row<'a>) -> &'a [u8] { ... }

While Row has an AsRef implementation, that will only give you access to the bytes with the lifetime of the Row, not the underlying Rows buffer.

Something like this would allow it:

impl<'a> Row<'a> {
    /// The row's bytes, with the lifetime of the underlying data.
    pub fn data(&self) -> &'a [u8] {
        self.data
    }
}

*** Rows to BinaryArray

I would love to have something like:

impl Rows {
    fn into_array(self) -> BinaryArray { ... }
}

This should be fairly straightforward to implement - it can at least reuse the allocation for the binary data - and lets the caller take advantage of all the functionality on array. (For example, if I want to copy the data, I can do a single memcpy instead of working row-by-row.)

Describe alternatives you've considered

Both of these things have workarounds - they just take extra compute or allocations. For example, I can emulate rows.into_array() with rows.iter() and an BinaryBuilder, though that costs an extra Vec allocation and a bunch of compute to do all the small copies.

A more invasive change would be to change Rows to be actually backed by an array, probably also adding a RowsBuilder that was backed by an ArrayBuilder. That has a few other advantages, but it's a breaking change and significantly more work AFAICT.

@bkirwi bkirwi added the enhancement Any new improvement worthy of a entry in the changelog label Jul 15, 2024
@alamb
Copy link
Contributor

alamb commented Jul 15, 2024

Thanks @bkirwi -- both of these APIs make sense to me

Just out of curiosity, how do you go the other way (bytes --> Rows)? In other words, I wonder why you aren't also proposing ways to crate Rows out of [u8] or out of BinaryArray?

@alamb
Copy link
Contributor

alamb commented Jul 15, 2024

A more invasive change would be to change Rows to be actually backed by an array, probably also adding a RowsBuilder that was backed by an ArrayBuilder. That has a few other advantages, but it's a breaking change and significantly more work AFAICT.

I agree with the assesment it is more work. What other advantages do you see?

@bkirwi
Copy link
Contributor Author

bkirwi commented Jul 16, 2024

Just out of curiosity, how do you go the other way?

Currently, the idea is to go from of &[u8] to an iterator of Row via RowParser::parse and then feed that to convert_rows. At a glance this doesn't seem to cost more than converting a BinaryArray to Rows and then passing that to convert_rows... and may be better if the binary array is shared, since Rows wouldn't be able to reuse the allocation.

I do think it makes sense to have an API that goes the other way - it seems "natural" and easy to implement - but IIUC it's less important for performance.

What other advantages do you see?

API consistency and code reuse, I suppose... you can imagine having an API like impl Rows { fn binary(&self) -> &BinaryArray; }, and then doing things like finding the maximum row by passing that to arrow::compute::max_binary or using any of the other functionality the ecosystem has on the array type. And the Rows impl itself may get to reuse some code - since it's structurally very similar to BinaryArray, and that code's been very heavily optimized.

Might be worthwhile! But to me it feels a bit murkier than the other APIs under discussion.

@alamb
Copy link
Contributor

alamb commented Jul 19, 2024

Just out of curiosity, how do you go the other way?
Currently, the idea is to go from of &[u8] to an iterator of Row via RowParser::parse and then feed that to convert_rows.

Makes sense. I think the piece I missed is that RowParser is currently not a pub struct.

https://docs.rs/arrow/latest/arrow/index.html?search=RowParser

impl RowParser {

Making it pub is effectively the public API I was wondering about.

Make sense. THank you

FYI @tustvold in case you have thoughts

@bkirwi
Copy link
Contributor Author

bkirwi commented Jul 19, 2024

Thanks! Drafted a version of this - totally appreciate it is under discussion still and may need to change, but what I have is up at #6096.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants