Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Towards #75: Further refactor for using GATs + clippy fixes for rust 1.65 #76

Closed
wants to merge 7 commits into from
Closed
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions .github/workflows/test.yml
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@ jobs:
with:
toolchain: beta # FIXME: revert to stable when 1.65 lands
components: clippy
override: true
- name: "clippy --all"
run: cargo clippy --all --tests -- -D warnings

Expand All @@ -28,6 +29,7 @@ jobs:
- uses: actions-rs/toolchain@v1
with:
toolchain: beta # FIXME: revert to stable when 1.65 lands
override: true
components: rustfmt
- name: Run
run: cargo fmt --all -- --check
Expand All @@ -40,6 +42,7 @@ jobs:
- uses: actions-rs/toolchain@v1
with:
toolchain: beta # FIXME: revert to stable when 1.65 lands
override: true
- uses: Swatinem/rust-cache@v1
- uses: taiki-e/install-action@nextest
- name: Run
Expand All @@ -54,6 +57,7 @@ jobs:
- uses: actions-rs/toolchain@v1
with:
toolchain: beta # FIXME: revert to stable when 1.65 lands
override: true
components: llvm-tools-preview
- name: Install cargo-llvm-cov
uses: taiki-e/install-action@cargo-llvm-cov
Expand Down
137 changes: 66 additions & 71 deletions arrow2_convert/src/deserialize.rs
Original file line number Diff line number Diff line change
Expand Up @@ -7,36 +7,36 @@ use crate::field::*;

#[doc(hidden)]
/// Type whose reference can be used to create an iterator.
pub trait IterRef {
pub trait RefIntoIterator: Sized {
/// Iterator type.
type Iter<'a>: Iterator
type Iterator<'a>: Iterator
Copy link

@aldanor aldanor Oct 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason this was Iter is because this is the common naming e.g. in the standard library (this way you can infer what kind of a symbol it is without going to hunt for its definition):

  • std::option::Iter, std::result::Iter, etc, always a *Iter if it's a type
  • *Iterator if it's a trait, like DoubleEndedIterator

Definitely +1 on RefIntoIterator, but I think Iter<'_> is more consistent with the commonly used naming scheme?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it is something we can potentially tackle as well in the GAT rework?

Yes, absolutely. I'll rework this PR based on your feedback, and either include this change in this PR or next one.

where
Self: 'a;

/// Converts `&self` into an iterator.
fn iter_ref(&self) -> Self::Iter<'_>;
fn ref_into_iter(&self) -> Self::Iterator<'_>;
}

impl<T> IterRef for T
impl<T> RefIntoIterator for T
where
for<'a> &'a T: IntoIterator,
{
type Iter<'a> = <&'a T as IntoIterator>::IntoIter where Self: 'a;
type Iterator<'a> = <&'a T as IntoIterator>::IntoIter<> where Self: 'a;

#[inline]
fn iter_ref(&self) -> Self::Iter<'_> {
fn ref_into_iter(&self) -> Self::Iterator<'_> {
self.into_iter()
}
}

/// Implemented by [`ArrowField`] that can be deserialized from arrow
pub trait ArrowDeserialize: ArrowField + Sized {
/// The `arrow2::Array` type corresponding to this field
type ArrayType: ArrowArray;
type ArrayType: RefIntoIterator;

/// Deserialize this field from arrow
fn arrow_deserialize(
v: <<Self::ArrayType as IterRef>::Iter<'_> as Iterator>::Item,
v: <<Self::ArrayType as RefIntoIterator>::Iterator<'_> as Iterator>::Item,
) -> Option<<Self as ArrowField>::Type>;

#[inline]
Expand All @@ -48,23 +48,36 @@ pub trait ArrowDeserialize: ArrowField + Sized {
/// something like for<'a> &'a T::ArrayType: IntoIterator<Item=Option<E>>,
/// However, the E parameter seems to confuse the borrow checker if it's a reference.
fn arrow_deserialize_internal(
Copy link
Collaborator Author

@ncpenke ncpenke Oct 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aldanor Cleaning this up will need a bit more thought. Elements in arrow2 arrays are optional, since they can be nullable. Therefore when converting from Option<ArrowType> -> T (where T is non-nullable), we should check that the arrow array doesn't have a validity bitmap set in this case (otherwise the conversion should just fail upfront). If we can make the deserialize of required fields take ArrowType instead of Option<ArrowType>, that may be much cleaner.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Option/internal stuff aside, being able to take ArrowType instead of Option<ArrowType> would be a great move and make API a lot cleaner in general, without having to mess with options and manual .map() all over the place (also making it more accessible for people who just want to write their own serializers/deserializers).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback @aldanor. I started this discussion in arrow2 (jorgecarleitao/arrow2#1273). Lets see where that lands. I don't think this should hold up wrapping up GATs and your generics change. We can make another pass after those changes before v0.4.0 release.

v: <<Self::ArrayType as IterRef>::Iter<'_> as Iterator>::Item,
v: <<Self::ArrayType as RefIntoIterator>::Iterator<'_> as Iterator>::Item,
) -> <Self as ArrowField>::Type {
Self::arrow_deserialize(v).unwrap()
}
}

/// Internal trait used to support deserialization and iteration of structs, and nested struct lists
///
/// Trivial pass-thru implementations are provided for arrow2 arrays that auto-implement IterRef.
///
/// The derive macro generates implementations for typed struct arrays.
#[doc(hidden)]
pub trait ArrowArray: IterRef {
type BaseArrayType: Array;

// Returns a typed iterator to the underlying elements of the array from an untyped Array reference.
fn iter_from_array_ref(b: &dyn Array) -> <Self as IterRef>::Iter<'_>;
#[inline]
#[doc(hidden)]
/// For internal use only
///
/// TODO: this can be removed up by using arrow2::array::StructArray and
/// arrow2::array::UnionArray to perform the iteration for unions and structs
/// which should be possible if structs and unions are deserialized via scalars.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ncpenke Could you please expand on this? E.g. what needs to be done and what exactly does "deserialized via scalars" mean/imply?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aldanor The arrow2::array::StructArray iterator iterates over arrow2 scalars (https://github.com/jorgecarleitao/arrow2/blob/main/src/array/struct_/iterator.rs#L27). The concept of arrow scalars is described in the C++ API (https://arrow.apache.org/docs/cpp/api/scalar.html). I don't think they're an official part of the arrow spec, but an implementation choice to hold results of aggregate operations (for example sum of an array).

However, they're also used to represent a single Struct element when iterating over a StructArray. I'm backtracking on this TODO though. I think existing approach is fine if the unimplemented into_iter hack is cleaned up.

///
/// Helper to return an iterator for elements from a [`arrow2::array::Array`].
///
/// Overridden by struct and enum arrays generated by the derive macro, to
/// downcast to the arrow2 array type.
fn arrow_array_ref_into_iter(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merging ArrowArray into ArrowDeserialize has pros and cons - I guess now you have a "standalone" overridable method that doesn't depend on any of the deserialize methods which is a bit weird -- however, this removes an extra trait which is always a plus... maybe it's the right move.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing it seemed simpler, but I think I'll have to bring it back in a different way to cleanly iterate through structs and enums.

array: &dyn Array,
) -> Option<<Self::ArrayType as RefIntoIterator>::Iterator<'_>>
where
Self::ArrayType: 'static,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this have to be 'static? Or just within the lifetime of &dyn?

Copy link
Collaborator Author

@ncpenke ncpenke Oct 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried with a named lifetime. I'm not fully sure, but I think 'static is needed for the downcast_ref to facilitate the vtable generation (since in that case it would have to be a type that would live for the lifetime of the program).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh, right, because Any: 'static bound. Nevermind then!

Might even add a comment to avoid future confusion if you'd like:

// because of `Any: 'static` bound

{
Some(
array
.as_any()
.downcast_ref::<Self::ArrayType>()?
.ref_into_iter(),
)
}
}

// Macro to facilitate implementation for numeric types and numeric arrays.
Expand All @@ -78,23 +91,6 @@ macro_rules! impl_arrow_deserialize_primitive {
v.map(|t| *t)
}
}

impl_arrow_array!(PrimitiveArray<$physical_type>);
};
}

macro_rules! impl_arrow_array {
($array:ty) => {
impl ArrowArray for $array {
type BaseArrayType = Self;

fn iter_from_array_ref(b: &dyn Array) -> <Self as IterRef>::Iter<'_> {
b.as_any()
.downcast_ref::<Self::BaseArrayType>()
.unwrap()
.iter_ref()
}
}
};
}

Expand All @@ -107,17 +103,26 @@ where

#[inline]
fn arrow_deserialize(
v: <<Self::ArrayType as IterRef>::Iter<'_> as Iterator>::Item,
v: <<Self::ArrayType as RefIntoIterator>::Iterator<'_> as Iterator>::Item,
) -> Option<<Self as ArrowField>::Type> {
Self::arrow_deserialize_internal(v).map(Some)
}

#[inline]
fn arrow_deserialize_internal(
v: <<Self::ArrayType as IterRef>::Iter<'_> as Iterator>::Item,
v: <<Self::ArrayType as RefIntoIterator>::Iterator<'_> as Iterator>::Item,
) -> <Self as ArrowField>::Type {
<T as ArrowDeserialize>::arrow_deserialize(v)
}

fn arrow_array_ref_into_iter(
array: &dyn Array,
) -> Option<<Self::ArrayType as RefIntoIterator>::Iterator<'_>>
where
Self::ArrayType: 'static,
{
<T as ArrowDeserialize>::arrow_array_ref_into_iter(array)
}
}

impl_arrow_deserialize_primitive!(u8);
Expand All @@ -140,8 +145,6 @@ impl<const PRECISION: usize, const SCALE: usize> ArrowDeserialize for I128<PRECI
}
}

impl_arrow_array!(PrimitiveArray<i128>);

impl ArrowDeserialize for String {
type ArrayType = Utf8Array<i32>;

Expand Down Expand Up @@ -221,10 +224,10 @@ where
T: ArrowDeserialize + ArrowEnableVecForType + 'static,
{
use std::ops::Deref;
v.map(|t| {
arrow_array_deserialize_iterator_internal::<<T as ArrowField>::Type, T>(t.deref())
.collect::<Vec<<T as ArrowField>::Type>>()
})
Some(
arrow_array_deserialize_iterator_internal::<<T as ArrowField>::Type, T>(v?.deref())?
.collect::<Vec<<T as ArrowField>::Type>>(),
)
}

// Blanket implementation for Vec
Expand Down Expand Up @@ -261,16 +264,6 @@ where
}
}

impl_arrow_array!(BooleanArray);
impl_arrow_array!(Utf8Array<i32>);
impl_arrow_array!(Utf8Array<i64>);
impl_arrow_array!(BinaryArray<i32>);
impl_arrow_array!(BinaryArray<i64>);
impl_arrow_array!(FixedSizeBinaryArray);
impl_arrow_array!(ListArray<i32>);
impl_arrow_array!(ListArray<i64>);
impl_arrow_array!(FixedSizeListArray);

/// Top-level API to deserialize from Arrow
pub trait TryIntoCollection<Collection, Element>
where
Expand All @@ -288,40 +281,42 @@ where
}

/// Helper to return an iterator for elements from a [`arrow2::array::Array`].
fn arrow_array_deserialize_iterator_internal<'a, Element, Field>(
b: &'a dyn Array,
) -> impl Iterator<Item = Element> + 'a
fn arrow_array_deserialize_iterator_internal<Element, Field>(
b: &dyn Array,
) -> Option<impl Iterator<Item = Element> + '_>
where
Field: ArrowDeserialize + ArrowField<Type = Element> + 'static,
{
<<Field as ArrowDeserialize>::ArrayType as ArrowArray>::iter_from_array_ref(b)
.map(<Field as ArrowDeserialize>::arrow_deserialize_internal)
Some(
<Field as ArrowDeserialize>::arrow_array_ref_into_iter(b)?
.map(<Field as ArrowDeserialize>::arrow_deserialize_internal),
)
}

/// Returns a typed iterator to a target type from an `arrow2::Array`
pub fn arrow_array_deserialize_iterator_as_type<'a, Element, ArrowType>(
arr: &'a dyn Array,
) -> arrow2::error::Result<impl Iterator<Item = Element> + 'a>
pub fn arrow_array_deserialize_iterator_as_type<Element, ArrowType>(
arr: &dyn Array,
) -> arrow2::error::Result<impl Iterator<Item = Element> + '_>
where
Element: 'static,
ArrowType: ArrowDeserialize + ArrowField<Type = Element> + 'static,
{
if &<ArrowType as ArrowField>::data_type() != arr.data_type() {
// TODO: use arrow2_convert error type here and include more detail
Err(arrow2::error::Error::InvalidArgumentError(
"Data type mismatch".to_string(),
))
} else {
Ok(arrow_array_deserialize_iterator_internal::<
Element,
ArrowType,
>(arr))
arrow_array_deserialize_iterator_internal::<Element, ArrowType>(arr).ok_or_else(||
// TODO: use arrow2_convert error type here and include more detail
arrow2::error::Error::InvalidArgumentError("Schema mismatch".to_string()))
}
}

/// Return an iterator that deserializes an [`Array`] to an element of type T
pub fn arrow_array_deserialize_iterator<'a, T>(
arr: &'a dyn Array,
) -> arrow2::error::Result<impl Iterator<Item = T> + 'a>
pub fn arrow_array_deserialize_iterator<T>(
arr: &dyn Array,
) -> arrow2::error::Result<impl Iterator<Item = T> + '_>
where
T: ArrowDeserialize + ArrowField<Type = T> + 'static,
{
Expand Down
33 changes: 0 additions & 33 deletions arrow2_convert/tests/test_round_trip.rs
Original file line number Diff line number Diff line change
Expand Up @@ -11,39 +11,6 @@ use arrow2_convert::{
use std::borrow::Borrow;
use std::sync::Arc;

#[test]
fn test_nested_optional_struct_array() {
#[derive(Debug, Clone, ArrowField, PartialEq)]
struct Top {
child_array: Vec<Option<Child>>,
}
#[derive(Debug, Clone, ArrowField, PartialEq)]
struct Child {
a1: i64,
}

let original_array = vec![
Top {
child_array: vec![
Some(Child { a1: 10 }),
None,
Some(Child { a1: 12 }),
Some(Child { a1: 14 }),
],
},
Top {
child_array: vec![None, None, None, None],
},
Top {
child_array: vec![None, None, Some(Child { a1: 12 }), None],
},
];

let b: Box<dyn Array> = original_array.try_into_arrow().unwrap();
let round_trip: Vec<Top> = b.try_into_collection().unwrap();
assert_eq!(original_array, round_trip);
}

#[test]
fn test_large_string() {
let strs = vec!["1".to_string(), "2".to_string()];
Expand Down
14 changes: 14 additions & 0 deletions arrow2_convert/tests/test_struct.rs
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,20 @@ use arrow2_convert::deserialize::*;
use arrow2_convert::serialize::*;
use arrow2_convert::ArrowField;

#[test]
fn test_optional_struct_array() {
#[derive(Debug, Clone, ArrowField, PartialEq)]
struct Foo {
field: i32,
}

let original_array = vec![Some(Foo { field: 0 }), None, Some(Foo { field: 10 })];
let b: Box<dyn Array> = original_array.try_into_arrow().unwrap();
println!("{:?}", b.data_type());
let round_trip: Vec<Option<Foo>> = b.try_into_collection().unwrap();
assert_eq!(original_array, round_trip);
}

#[test]
fn test_nested_optional_struct_array() {
#[derive(Debug, Clone, ArrowField, PartialEq)]
Expand Down
40 changes: 17 additions & 23 deletions arrow2_convert_derive/src/derive_enum.rs
Original file line number Diff line number Diff line change
Expand Up @@ -446,28 +446,6 @@ pub fn expand(input: DeriveEnum) -> TokenStream {
{}
};

let array_impl = quote! {
impl arrow2_convert::deserialize::ArrowArray for #array_name
{
type BaseArrayType = arrow2::array::UnionArray;

#[inline]
fn iter_from_array_ref<'a>(b: &'a dyn arrow2::array::Array) -> <&'a Self as IntoIterator>::IntoIter
{
use core::ops::Deref;
let arr = b.as_any().downcast_ref::<arrow2::array::UnionArray>().unwrap();
let fields = arr.fields();

#iterator_name {
#(
#variant_names: <<#variant_types as arrow2_convert::deserialize::ArrowDeserialize>::ArrayType as arrow2_convert::deserialize::ArrowArray>::iter_from_array_ref(fields[#variant_indices].deref()),
)*
types_iter: arr.types().iter(),
}
}
}
};

let array_into_iterator_impl = quote! {
impl<'a> IntoIterator for &'a #array_name
{
Expand Down Expand Up @@ -517,12 +495,28 @@ pub fn expand(input: DeriveEnum) -> TokenStream {
fn arrow_deserialize<'a>(v: Option<Self>) -> Option<Self> {
v
}

#[inline]
fn arrow_array_ref_into_iter(
array: &dyn arrow2::array::Array
) -> Option<<#array_name as arrow2_convert::deserialize::RefIntoIterator>::Iterator<'_>>
{
use core::ops::Deref;
let arr = array.as_any().downcast_ref::<arrow2::array::UnionArray>()?;
let fields = arr.fields();

Some(#iterator_name {
#(
#variant_names: <#variant_types as arrow2_convert::deserialize::ArrowDeserialize>::arrow_array_ref_into_iter(fields[#variant_indices].deref())?,
)*
types_iter: arr.types().iter(),
})
}
}
};

generated.extend([
array_decl,
array_impl,
array_into_iterator_impl,
array_iterator_decl,
array_iterator_iterator_impl,
Expand Down
Loading