Make it easier to treat Rows as bytes #6063
Comments
Thanks @bkirwi -- both of these APIs make sense to me. Just out of curiosity, how do you go the other way (bytes -->
I agree with the assessment that it is more work. What other advantages do you see?
Currently, the idea is to go from of
I do think it makes sense to have an API that goes the other way - it seems "natural" and easy to implement - but IIUC it's less important for performance.
API consistency and code reuse, I suppose... you can imagine having an API like
Might be worthwhile! But to me it feels a bit murkier than the other APIs under discussion.
Makes sense. I think the piece I missed is that RowParser is currently not a pub struct. https://docs.rs/arrow/latest/arrow/index.html?search=RowParser Line 780 in 8a5be13
Making it pub is effectively the public API I was wondering about. Makes sense. Thank you. FYI @tustvold in case you have thoughts.
Thanks! Drafted a version of this - totally appreciate it is under discussion still and may need to change, but what I have is up at #6096.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I'm trying to convert back and forth between Rows data and raw bytes. (Think an external sort, for example: converting to rows and then shifting the data to off-heap storage.)
Describe the solution you'd like
Arrow's APIs for this are already pretty good, but there are two things that would make my life easier.
*** A data accessor on Row ***
Right now it's impossible to write this function:
While Row has an AsRef implementation, that will only give you access to the bytes with the lifetime of the Row, not the underlying Rows buffer.
Something like this would allow it:
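To make the lifetime point concrete, here is a minimal std-only sketch. The Rows and Row types below are toy stand-ins with an assumed layout (one byte buffer plus offsets), not arrow's real definitions, and row_bytes is a hypothetical helper used only to show which signature borrow-checks.

```rust
/// Toy stand-in for arrow's `Rows`: one contiguous buffer plus row offsets.
/// (Assumed layout, for illustration only.)
pub struct Rows {
    pub buffer: Vec<u8>,
    pub offsets: Vec<usize>, // offsets[i]..offsets[i + 1] delimits row i
}

/// Toy stand-in for `Row<'a>`: a view borrowed from a `Rows` buffer.
#[derive(Clone, Copy)]
pub struct Row<'a> {
    data: &'a [u8],
}

impl AsRef<[u8]> for Row<'_> {
    // The returned slice is tied to the borrow of `self`, not to `'a`.
    fn as_ref(&self) -> &[u8] {
        self.data
    }
}

impl<'a> Row<'a> {
    /// Accessor in the spirit of the request: the slice keeps the
    /// lifetime `'a` of the underlying buffer.
    pub fn data(&self) -> &'a [u8] {
        self.data
    }
}

impl Rows {
    /// Packs the given byte slices into one buffer, recording offsets.
    pub fn from_slices(rows: &[&[u8]]) -> Self {
        let mut buffer = Vec::new();
        let mut offsets = vec![0];
        for r in rows {
            buffer.extend_from_slice(r);
            offsets.push(buffer.len());
        }
        Rows { buffer, offsets }
    }

    /// Returns a view of row `i`, borrowed from `self`.
    pub fn row(&self, i: usize) -> Row<'_> {
        Row { data: &self.buffer[self.offsets[i]..self.offsets[i + 1]] }
    }
}

/// Hypothetical helper. It compiles because `data()` returns `&'a [u8]`.
/// Writing `rows.row(i).as_ref()` here instead is rejected by the borrow
/// checker, since that slice cannot outlive the temporary `Row`.
pub fn row_bytes(rows: &Rows, i: usize) -> &[u8] {
    rows.row(i).data()
}
```

The only difference between as_ref and data in this sketch is the lifetime in the return type, which is why the change is cheap to make.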
*** Rows to BinaryArray ***
I would love to have something like:
This should be fairly straightforward to implement - it can at least reuse the allocation for the binary data - and lets the caller take advantage of all the functionality on array. (For example, if I want to copy the data, I can do a single memcpy instead of working row-by-row.)
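A rough std-only sketch of why the allocation can be reused. Both types here are toy stand-ins, and into_array is the hypothetical name from this issue, not an existing arrow method. The sketch assumes the binary array stores i32 offsets, so only the offsets are converted and the data buffer is moved as-is.

```rust
/// Toy stand-in for arrow's `Rows` (assumed layout: bytes plus offsets).
pub struct Rows {
    pub buffer: Vec<u8>,
    pub offsets: Vec<usize>,
}

/// Toy stand-in for a binary array: a values buffer plus `i32` offsets.
pub struct BinaryArray {
    pub values: Vec<u8>,
    pub offsets: Vec<i32>,
}

impl Rows {
    /// Hypothetical `into_array`: moves the data buffer without copying it
    /// and converts only the offsets. Fails if an offset does not fit in
    /// `i32`, which a 32-bit-offset binary array cannot represent.
    pub fn into_array(self) -> Result<BinaryArray, String> {
        let offsets = self
            .offsets
            .iter()
            .map(|&o| i32::try_from(o).map_err(|_| format!("offset {o} overflows i32")))
            .collect::<Result<Vec<i32>, String>>()?;
        Ok(BinaryArray { values: self.buffer, offsets })
    }
}

impl BinaryArray {
    /// Number of values in the array.
    pub fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Bytes of value `i`.
    pub fn value(&self, i: usize) -> &[u8] {
        &self.values[self.offsets[i] as usize..self.offsets[i + 1] as usize]
    }
}
```

Once the data sits in a single array-like buffer, copying it elsewhere is one contiguous copy of values rather than one copy per row.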
Describe alternatives you've considered
Both of these things have workarounds - they just take extra compute or allocations. For example, I can emulate rows.into_array() with rows.iter() and a BinaryBuilder, though that costs an extra Vec allocation and a bunch of compute to do all the small copies.
A more invasive change would be to change Rows to be actually backed by an array, probably also adding a RowsBuilder that was backed by an ArrayBuilder. That has a few other advantages, but it's a breaking change and significantly more work AFAICT.
. That has a few other advantages, but it's a breaking change and significantly more work AFAICT.The text was updated successfully, but these errors were encountered: