-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Towards #75: Further refactor for using GATs + clippy fixes for rust 1.65 #76
Conversation
ac90962
to
f6c9617
Compare
f6c9617
to
783fe65
Compare
arrow2_convert/src/deserialize.rs
Outdated
@@ -48,23 +48,36 @@ pub trait ArrowDeserialize: ArrowField + Sized { | |||
/// something like for<'a> &'a T::ArrayType: IntoIterator<Item=Option<E>>, | |||
/// However, the E parameter seems to confuse the borrow checker if it's a reference. | |||
fn arrow_deserialize_internal( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aldanor Cleaning this up will need a bit more thought. Elements in arrow2 arrays are optional, since they can be nullable. Therefore when converting from Option<ArrowType>
-> T (where T is non-nullable), we should check that the arrow array doesn't have a validity bitmap set in this case (otherwise the conversion should just fail upfront). If we can make the deserialize of required fields take ArrowType
instead of Option<ArrowType>
, that may be much cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Option/internal stuff aside, being able to take ArrowType
instead of Option<ArrowType>
would be a great move and make API a lot cleaner in general, without having to mess with options and manual .map()
all over the place (also making it more accessible for people who just want to write their own serializers/deserializers).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback @aldanor. I started this discussion in arrow2 (jorgecarleitao/arrow2#1273). Lets see where that lands. I don't think this should hold up wrapping up GATs and your generics change. We can make another pass after those changes before v0.4.0 release.
Codecov Report
@@ Coverage Diff @@
## feature/v0.4.0 #76 +/- ##
=================================================
Coverage ? 95.32%
=================================================
Files ? 8
Lines ? 1454
Branches ? 0
=================================================
Hits ? 1386
Misses ? 68
Partials ? 0 Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
arrow2_convert/src/deserialize.rs
Outdated
/// Iterator type. | ||
type Iter<'a>: Iterator | ||
type Iterator<'a>: Iterator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason this was Iter
is because this is the common naming e.g. in the standard library (this way you can infer what kind of a symbol it is without going to hunt for its definition):
std::option::Iter
,std::result::Iter
, etc, always a*Iter
if it's a type*Iterator
if it's a trait, likeDoubleEndedIterator
Definitely +1 on RefIntoIterator
, but I think Iter<'_>
is more consistent with the commonly used naming scheme?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it is something we can potentially tackle as well in the GAT rework?
Yes, absolutely. I'll rework this PR based on your feedback, and either include this change in this PR or next one.
arrow2_convert/src/deserialize.rs
Outdated
/// TODO: this can be removed up by using arrow2::array::StructArray and | ||
/// arrow2::array::UnionArray to perform the iteration for unions and structs | ||
/// which should be possible if structs and unions are deserialized via scalars. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ncpenke Could you please expand on this? E.g. what needs to be done and what exactly does "deserialized via scalars" mean/imply?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aldanor The arrow2::array::StructArray
iterator iterates over arrow2 scalars (https://github.com/jorgecarleitao/arrow2/blob/main/src/array/struct_/iterator.rs#L27). The concept of arrow scalars is described in the C++ API (https://arrow.apache.org/docs/cpp/api/scalar.html). I don't think they're an official part of the arrow spec, but an implementation choice to hold results of aggregate operations (for example sum of an array).
However, they're also used to represent a single Struct
element when iterating over a StructArray
. I'm backtracking on this TODO though. I think existing approach is fine if the unimplemented
into_iter hack is cleaned up.
arrow2_convert/src/deserialize.rs
Outdated
array: &dyn Array, | ||
) -> Option<<Self::ArrayType as RefIntoIterator>::Iterator<'_>> | ||
where | ||
Self::ArrayType: 'static, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this have to be 'static? Or just within the lifetime of &dyn?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried with a named lifetime. I'm not fully sure, but I think 'static
is needed for the downcast_ref
to facilitate the vtable generation (since in that case it would have to be a type that would live for the lifetime of the program).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohh, right, because Any: 'static
bound. Nevermind then!
Might even add a comment to avoid future confusion if you'd like:
// because of `Any: 'static` bound
arrow2_convert/src/deserialize.rs
Outdated
/// | ||
/// Overridden by struct and enum arrays generated by the derive macro, to | ||
/// downcast to the arrow2 array type. | ||
fn arrow_array_ref_into_iter( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merging ArrowArray
into ArrowDeserialize
has pros and cons - I guess now you have a "standalone" overridable method that doesn't depend on any of the deserialize methods which is a bit weird -- however, this removes an extra trait which is always a plus... maybe it's the right move.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing it seemed simpler, but I think I'll have to bring it back in a different way to cleanly iterate through structs and enums.
I would definitely bench it first, on small examples, like |
Btw, a bit of an offtopic, but related to GATs :) You also mention in the README:
Do you think it is something we can potentially tackle as well in the GAT rework? |
@ncpenke Not to rush, but wonder if you have any particular plans for this PR? I'd love to get back working on the generics part but it seems like we need to figure out and stabilize everything else first, otherwise it will be much harder to refactor it... |
@aldanor Thanks for following up. I was waiting for the discussion in arrow2 to be resolved. I will proceed forward on this PR this week so that you’re unblocked for the generics changes. |
@aldanor I’ll reference @jorgecarleitao’s change once it merges to arrow2. I have a change already, but have one thing left to figure out related to using non-null elements. I’ll update this PR soon as that’s done. |
@ncpenke Looks like jorgecarleitao/arrow2#1270 has been merged into master, wonder if we can temporarily switch to git-version of arrow2 here so as not to have to wait for the version bump? |
Also, if there's anything to bikeshed, discuss or brainstorm (re: figuring things out), would always be glad to help or participate :) |
Thanks @aldanor. I'm planning to point to the latest arrow2 main to use the change. I'll keep you posted if I'm stuck, and ask for feedback. The last couple of weeks have been a bit busy, but I will try to get this wrapped up asap, so that you can move forward with your changes. |
@aldanor Just following up, I did a bit of a major refactor and also setup the building blocks to address some of the issues you filed. I'm just wrapping up the serialize path then will post an update. |
That's awesome, looking forward to any incoming updates and further bikeshedding :) |
@ncpenke Hey, I've been out for a few weeks myself, but wondering what's the current story? Any roadblocks or maybe anything to discuss? Would really want to jump in and help out with anything if I could :) (Since it's been quite a bit of time already, I'd actually have to re-read my own local branch with generics implementation since I've fell a little bit out of context...) Plus, GATs have landed already, so we can just continue working on stable. |
Hey @aldanor, thanks for following up. I've been slammed the past few weeks, so I apologize for the slowness in wrapping this change. I very much appreciate your help, and really enjoying our discussion and collaboration.
Besides that and the GAT you're working on, there are a few issues that came in that would be good to fix. We can discuss those once both these changes are done, in case it's something you want help out with. I'm actively looking for an opportunity to carve out some time to wrap up this change, will do it asap. |
@aldanor Going to close this PR and rebase off main. |
@aldanor This cleans up your change further by removing
ArrowArray
. I think the next change that would be worth investigating/doing before generics is using arrow2'sStructIter
andUnionIter
which would remove the need to autogenerate our ownStructArray
andEnumArray
and the unimplementedinto_iter
. I need to double check the downsides of that. I think I was concerned about having adyn
lookup in an iterator, but perhaps a cleaner API/implementation justifies it for now.