Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Towards #75: Further refactor for using GATs + clippy fixes for rust 1.65 #76

Closed
wants to merge 7 commits into from

Conversation

ncpenke
Copy link
Collaborator

@ncpenke ncpenke commented Oct 10, 2022

@aldanor This cleans up your change further by removing ArrowArray. I think the next change that would be worth investigating/doing before generics is using arrow2's StructIter and UnionIter which would remove the need to autogenerate our own StructArray and EnumArray and the unimplemented into_iter. I need to double check the downsides of that. I think I was concerned about having a dyn lookup in an iterator, but perhaps a cleaner API/implementation justifies it for now.

@ncpenke ncpenke changed the base branch from main to feature/v0.4.0 October 10, 2022 20:54
@ncpenke ncpenke changed the title Towards #75: Further refactor for using GATs + clippy fixes for rust 1.64 Towards #75: Further refactor for using GATs + clippy fixes for rust 1.65 Oct 10, 2022
@@ -48,23 +48,36 @@ pub trait ArrowDeserialize: ArrowField + Sized {
/// something like for<'a> &'a T::ArrayType: IntoIterator<Item=Option<E>>,
/// However, the E parameter seems to confuse the borrow checker if it's a reference.
fn arrow_deserialize_internal(
Copy link
Collaborator Author

@ncpenke ncpenke Oct 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aldanor Cleaning this up will need a bit more thought. Elements in arrow2 arrays are optional, since they can be nullable. Therefore when converting from Option<ArrowType> -> T (where T is non-nullable), we should check that the arrow array doesn't have a validity bitmap set in this case (otherwise the conversion should just fail upfront). If we can make the deserialize of required fields take ArrowType instead of Option<ArrowType>, that may be much cleaner.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Option/internal stuff aside, being able to take ArrowType instead of Option<ArrowType> would be a great move and make API a lot cleaner in general, without having to mess with options and manual .map() all over the place (also making it more accessible for people who just want to write their own serializers/deserializers).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback @aldanor. I started this discussion in arrow2 (jorgecarleitao/arrow2#1273). Lets see where that lands. I don't think this should hold up wrapping up GATs and your generics change. We can make another pass after those changes before v0.4.0 release.

@codecov-commenter
Copy link

Codecov Report

❗ No coverage uploaded for pull request base (feature/v0.4.0@37f5500). Click here to learn what that means.
The diff coverage is n/a.

@@                Coverage Diff                @@
##             feature/v0.4.0      #76   +/-   ##
=================================================
  Coverage                  ?   95.32%           
=================================================
  Files                     ?        8           
  Lines                     ?     1454           
  Branches                  ?        0           
=================================================
  Hits                      ?     1386           
  Misses                    ?       68           
  Partials                  ?        0           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

/// Iterator type.
type Iter<'a>: Iterator
type Iterator<'a>: Iterator
Copy link

@aldanor aldanor Oct 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason this was Iter is because this is the common naming e.g. in the standard library (this way you can infer what kind of a symbol it is without going to hunt for its definition):

  • std::option::Iter, std::result::Iter, etc, always a *Iter if it's a type
  • *Iterator if it's a trait, like DoubleEndedIterator

Definitely +1 on RefIntoIterator, but I think Iter<'_> is more consistent with the commonly used naming scheme?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it is something we can potentially tackle as well in the GAT rework?

Yes, absolutely. I'll rework this PR based on your feedback, and either include this change in this PR or next one.

Comment on lines 60 to 62
/// TODO: this can be removed up by using arrow2::array::StructArray and
/// arrow2::array::UnionArray to perform the iteration for unions and structs
/// which should be possible if structs and unions are deserialized via scalars.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ncpenke Could you please expand on this? E.g. what needs to be done and what exactly does "deserialized via scalars" mean/imply?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aldanor The arrow2::array::StructArray iterator iterates over arrow2 scalars (https://github.com/jorgecarleitao/arrow2/blob/main/src/array/struct_/iterator.rs#L27). The concept of arrow scalars is described in the C++ API (https://arrow.apache.org/docs/cpp/api/scalar.html). I don't think they're an official part of the arrow spec, but an implementation choice to hold results of aggregate operations (for example sum of an array).

However, they're also used to represent a single Struct element when iterating over a StructArray. I'm backtracking on this TODO though. I think existing approach is fine if the unimplemented into_iter hack is cleaned up.

array: &dyn Array,
) -> Option<<Self::ArrayType as RefIntoIterator>::Iterator<'_>>
where
Self::ArrayType: 'static,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this have to be 'static? Or just within the lifetime of &dyn?

Copy link
Collaborator Author

@ncpenke ncpenke Oct 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried with a named lifetime. I'm not fully sure, but I think 'static is needed for the downcast_ref to facilitate the vtable generation (since in that case it would have to be a type that would live for the lifetime of the program).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh, right, because Any: 'static bound. Nevermind then!

Might even add a comment to avoid future confusion if you'd like:

// because of `Any: 'static` bound

///
/// Overridden by struct and enum arrays generated by the derive macro, to
/// downcast to the arrow2 array type.
fn arrow_array_ref_into_iter(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merging ArrowArray into ArrowDeserialize has pros and cons - I guess now you have a "standalone" overridable method that doesn't depend on any of the deserialize methods which is a bit weird -- however, this removes an extra trait which is always a plus... maybe it's the right move.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing it seemed simpler, but I think I'll have to bring it back in a different way to cleanly iterate through structs and enums.

@aldanor
Copy link

aldanor commented Oct 10, 2022

dyn lookup in an iterator, but perhaps a cleaner API/implementation justifies it for now

I would definitely bench it first, on small examples, like Foo(i32, i32) - how invasive would be dyn lookup there? (I have a feeling it would be noticeable).

@aldanor
Copy link

aldanor commented Oct 11, 2022

Btw, a bit of an offtopic, but related to GATs :) You also mention in the README:

Availability of generaic associated types would simplify the implementation for large and fixed types, since a generic MutableArray can be defined. Ideally for code reusability, we wouldn’t have to reimplement ArrowSerialize and ArrowDeserialize for large and fixed size types since the primitive types are the same. However, this requires the trait functions to take a generic bounded mutable array as an argument instead of a single array type. This requires the ArrowSerialize and ArrowDeserialize implementations to be able to specify the bounds as part of the associated type, which is not possible without generic associated types.

Do you think it is something we can potentially tackle as well in the GAT rework?

@aldanor
Copy link

aldanor commented Oct 16, 2022

@ncpenke Not to rush, but wonder if you have any particular plans for this PR? I'd love to get back working on the generics part but it seems like we need to figure out and stabilize everything else first, otherwise it will be much harder to refactor it...

@ncpenke
Copy link
Collaborator Author

ncpenke commented Oct 17, 2022

@aldanor Thanks for following up. I was waiting for the discussion in arrow2 to be resolved. I will proceed forward on this PR this week so that you’re unblocked for the generics changes.

@ncpenke
Copy link
Collaborator Author

ncpenke commented Oct 18, 2022

@aldanor I’ll reference @jorgecarleitao’s change once it merges to arrow2. I have a change already, but have one thing left to figure out related to using non-null elements. I’ll update this PR soon as that’s done.

@aldanor
Copy link

aldanor commented Oct 18, 2022

@ncpenke Looks like jorgecarleitao/arrow2#1270 has been merged into master, wonder if we can temporarily switch to git-version of arrow2 here so as not to have to wait for the version bump?

@aldanor
Copy link

aldanor commented Oct 18, 2022

Also, if there's anything to bikeshed, discuss or brainstorm (re: figuring things out), would always be glad to help or participate :)

@ncpenke
Copy link
Collaborator Author

ncpenke commented Oct 18, 2022

Thanks @aldanor. I'm planning to point to the latest arrow2 main to use the change. I'll keep you posted if I'm stuck, and ask for feedback. The last couple of weeks have been a bit busy, but I will try to get this wrapped up asap, so that you can move forward with your changes.

@ncpenke
Copy link
Collaborator Author

ncpenke commented Oct 26, 2022

@aldanor Just following up, I did a bit of a major refactor and also setup the building blocks to address some of the issues you filed. I'm just wrapping up the serialize path then will post an update.

@aldanor
Copy link

aldanor commented Oct 26, 2022

That's awesome, looking forward to any incoming updates and further bikeshedding :)

@aldanor
Copy link

aldanor commented Nov 13, 2022

@ncpenke Hey, I've been out for a few weeks myself, but wondering what's the current story? Any roadblocks or maybe anything to discuss? Would really want to jump in and help out with anything if I could :)

(Since it's been quite a bit of time already, I'd actually have to re-read my own local branch with generics implementation since I've fell a little bit out of context...)

Plus, GATs have landed already, so we can just continue working on stable.

@ncpenke ncpenke marked this pull request as draft November 13, 2022 15:29
@ncpenke
Copy link
Collaborator Author

ncpenke commented Nov 13, 2022

Hey @aldanor, thanks for following up. I've been slammed the past few weeks, so I apologize for the slowness in wrapping this change. I very much appreciate your help, and really enjoying our discussion and collaboration.

  • I agree we can move to stable now, let's do that once this PR is merged.
  • I pushed up my latest changes with rough notes (please note what I pushed up doesn't build).
  • I'm pretty happy with deserialize. We handle null and non-null now, and a much cleaner external interface. I'm not so happy with serialize, and that's where this change has been stuck. I need to give it a good think, but basically am trying to see if we can have serialize follow conversion semantics rather than inserting into a mutable array. If we had impl returns in trait it would be easy, so maybe what we have is the best we can do. Please lmk your thoughts.

Besides that and the GAT you're working on, there are a few issues that came in that would be good to fix. We can discuss those once both these changes are done, in case it's something you want help out with. I'm actively looking for an opportunity to carve out some time to wrap up this change, will do it asap.

@ncpenke
Copy link
Collaborator Author

ncpenke commented Nov 13, 2022

@aldanor just a heads up the approach for deserialize also sets up the framework for #79 that you filed via the new GenericList type.

@ncpenke
Copy link
Collaborator Author

ncpenke commented Jan 3, 2023

@aldanor Going to close this PR and rebase off main.

@ncpenke ncpenke closed this Jan 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants