Question about design #15

Closed
jorgecarleitao opened this issue Aug 21, 2021 · 2 comments

Comments

@jorgecarleitao

Hey!

This is a really interesting approach; I tried this before working on "arrow2", but hit some walls when working with nested data at IO boundaries, so I am excited to see someone else trying!

A couple of questions, since I am very curious about it:

  • If I understand correctly, the validity of the arrays is known at compile time, via the const bool. Why this decision? Is there a major performance difference between this and it only being known at runtime?
  • Nested arrays seem to have compile-time children. How do we interact with IPC and FFI if those are only known at runtime? For example, the nestedness of an IPC file is only known after reading the schema in the file's footer. The FFI case is even stronger, since e.g. people in Python may create a nested array at any point and pass it to Rust.

The latter was what ultimately led me to abandon the static design and continue to work under a dyn design.

@mbrobbel
Owner

mbrobbel commented Aug 22, 2021

Hi!

If I understand correctly, the validity of the arrays is known at compile time, via the const bool. Why this decision?

The const generic bool argument indicates whether or not the array can contain null values i.e. if a validity bitmap is allocated. I considered the following alternative methods to expose this:

  • Handle this at runtime with an Option<Bitmap> field in the array. I discarded this method because I assume the nullability of an array type to be known at compile time in the context of this crate. This method prevents (given that we don't have specialization yet) idiomatic implementations of for example the FromIterator trait for both nullable and non-nullable arrays (as mentioned in Allow collecting from Values as well as Option<value> apache/arrow-rs#655).
  • Define different product types for nullable and non-nullable arrays, with an additional bitmap field for the nullable array type e.g. BooleanArray and NonNullableBooleanArray. This method potentially results in code duplication for methods that are invariant to the nullability of arrays.
  • Define a Nullable<T: Array> that wraps an array with a validity bitmap. This works fine, but it results in an API that is not very ergonomic.
  • Use the Nullable wrapper type internally, but expose it as a const generic argument of the array. This involves some tricks in the implementation, but results in a more ergonomic API, e.g. List<BooleanArray<true>, true> vs. Nullable<List<Nullable<BooleanArray>>>. Having distinct types plus the validity abstraction has additional benefits for the implementation: methods that are invariant to nullability can be written generically; the nullable and non-nullable array types can have different trait implementations, e.g. FromIterator; and converting a non-nullable array type to a nullable one (which allocates a validity bitmap) is idiomatic, e.g. via impl From<BooleanArray<false>> for BooleanArray<true>.
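
To make the last option more concrete, here is a minimal sketch of how a const generic flag can select the validity storage on stable Rust. It is not this crate's actual implementation: the `Flag` marker, the `Validity` trait, and the use of a `Vec<bool>` in place of a packed bitmap are all assumptions made for illustration.

```rust
/// Marker type carrying the nullability flag (hypothetical name).
pub struct Flag<const NULLABLE: bool>;

/// Maps the flag to the validity storage of an array (hypothetical name).
pub trait Validity {
    type Bitmap;
}
impl Validity for Flag<false> {
    type Bitmap = (); // non-nullable: nothing is allocated
}
impl Validity for Flag<true> {
    type Bitmap = Vec<bool>; // nullable: a Vec<bool> stands in for a packed bitmap
}

pub struct BooleanArray<const NULLABLE: bool>
where
    Flag<NULLABLE>: Validity,
{
    values: Vec<bool>,
    validity: <Flag<NULLABLE> as Validity>::Bitmap,
}

// Methods that are invariant to nullability are written once.
impl<const NULLABLE: bool> BooleanArray<NULLABLE>
where
    Flag<NULLABLE>: Validity,
{
    pub fn len(&self) -> usize {
        self.values.len()
    }
}

// Methods that only make sense for nullable arrays live on that type alone.
impl BooleanArray<true> {
    pub fn null_count(&self) -> usize {
        self.validity.iter().filter(|&&valid| !valid).count()
    }
}

// Two distinct types allow two FromIterator impls without specialization.
impl FromIterator<bool> for BooleanArray<false> {
    fn from_iter<I: IntoIterator<Item = bool>>(iter: I) -> Self {
        Self { values: iter.into_iter().collect(), validity: () }
    }
}

impl FromIterator<Option<bool>> for BooleanArray<true> {
    fn from_iter<I: IntoIterator<Item = Option<bool>>>(iter: I) -> Self {
        let mut values = Vec::new();
        let mut validity = Vec::new();
        for item in iter {
            validity.push(item.is_some());
            values.push(item.unwrap_or_default()); // placeholder value for nulls
        }
        Self { values, validity }
    }
}

// Widening a non-nullable array allocates an all-valid bitmap.
impl From<BooleanArray<false>> for BooleanArray<true> {
    fn from(array: BooleanArray<false>) -> Self {
        let validity = vec![true; array.values.len()];
        Self { values: array.values, validity }
    }
}
```

With this sketch, `vec![true, false].into_iter().collect::<BooleanArray<false>>()` and `vec![Some(true), None].into_iter().collect::<BooleanArray<true>>()` both type-check, while collecting `Option<bool>` items into a `BooleanArray<false>` is a compile error.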

Is there a major performance difference between this and only being known at runtime?

I haven't measured this, but I'm assuming that the performance benefit of this is negligible.

Nested arrays seem to have compile-time children. How do we interact with IPC and FFI if those are only known at runtime? For example, the nestedness of an IPC file is only known after reading the schema in the file's footer. The FFI case is even stronger, since e.g. people in Python may create a nested array at any point and pass it to Rust.

I'm building this crate to support the narrower use case where array types are known at compile time. Whenever data enters the application (either through IO or FFI), the expected array type is specified and the schema that was read is used to validate that it is compatible. An incompatible schema results in a runtime error.

#[derive(Array, ...)]
pub struct Foo {
  ...
}

fn read_ipc<T, const N: bool>(...) -> Result<StructArray<T, N>, Error> {
  ...
}

fn main() -> Result<(), Error> {
  // Read a non-nullable Foo array from an ipc file.
  // Returns an error when schema of ipc file is not compatible.
  let foo_array = read_ipc::<Foo, false>(...)?;

  ...

  Ok(())
}
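
The validation step itself could look roughly like the sketch below. Everything in it is hypothetical: the `DataType`, `Field`, `ExpectedSchema`, and `SchemaError` names, the hand-written impl that stands in for what a derive macro might generate, and the compatibility rule (same names and types in order, and a nullable field in the file is rejected when a non-nullable one is expected).

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Boolean,
    Int32,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
    pub nullable: bool,
}

/// Schema a statically typed struct expects (hypothetical trait).
pub trait ExpectedSchema {
    fn fields() -> Vec<Field>;
}

/// Hand-written stand-in for what a derive macro might generate.
pub struct Foo;

impl ExpectedSchema for Foo {
    fn fields() -> Vec<Field> {
        vec![
            Field { name: "a".to_string(), data_type: DataType::Int32, nullable: false },
            Field { name: "b".to_string(), data_type: DataType::Boolean, nullable: true },
        ]
    }
}

#[derive(Debug, PartialEq)]
pub enum SchemaError {
    FieldCount { expected: usize, found: usize },
    Mismatch { field: String },
}

/// Checks the schema read at runtime against the compile-time expectation.
pub fn validate<T: ExpectedSchema>(read: &[Field]) -> Result<(), SchemaError> {
    let expected = T::fields();
    if expected.len() != read.len() {
        return Err(SchemaError::FieldCount { expected: expected.len(), found: read.len() });
    }
    for (want, got) in expected.iter().zip(read) {
        let same_shape = want.name == got.name && want.data_type == got.data_type;
        // Assumption: nulls in the file are only acceptable for nullable fields.
        let nullability_ok = want.nullable || !got.nullable;
        if !(same_shape && nullability_ok) {
            return Err(SchemaError::Mismatch { field: want.name.clone() });
        }
    }
    Ok(())
}
```

A reader function would call something like `validate::<Foo>(&schema_from_footer)?` before decoding any buffers, so an incompatible file fails early with a runtime error.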

To more precisely answer your question:

How do we interact with IPC and ffi if those are only known at runtime?

When types are not known at compile time this crate is not useful and I would use arrow2. A compatibility layer can be added to make use of methods from other arrow implementations e.g.:

  • Fn(&Self) -> Box<dyn arrow2::array::Array> in the array trait of this crate.
  • impl TryFrom<Box<dyn arrow2::array::Array>> for BooleanArray<false> for array types in this crate.
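
As a rough illustration of such a compatibility layer, the sketch below uses a hypothetical `DynArray` trait and `DynBooleanArray` type as stand-ins for the dynamically typed side, so it does not depend on (or claim to match) arrow2's real API. `StaticBooleanArray` stands in for `BooleanArray<false>`.

```rust
use std::any::Any;
use std::convert::TryFrom;

/// Hypothetical stand-in for a dynamically typed array trait.
pub trait DynArray {
    fn as_any(&self) -> &dyn Any;
    fn len(&self) -> usize;
}

/// Hypothetical dynamically typed boolean array with optional validity.
pub struct DynBooleanArray {
    pub values: Vec<bool>,
    pub validity: Option<Vec<bool>>,
}

impl DynArray for DynBooleanArray {
    fn as_any(&self) -> &dyn Any {
        self
    }
    fn len(&self) -> usize {
        self.values.len()
    }
}

/// Hypothetical dynamically typed array of another type.
pub struct DynInt32Array {
    pub values: Vec<i32>,
}

impl DynArray for DynInt32Array {
    fn as_any(&self) -> &dyn Any {
        self
    }
    fn len(&self) -> usize {
        self.values.len()
    }
}

/// Stand-in for the statically typed, non-nullable `BooleanArray<false>`.
pub struct StaticBooleanArray {
    values: Vec<bool>,
}

impl StaticBooleanArray {
    pub fn values(&self) -> &[bool] {
        &self.values
    }

    /// Static to dynamic: always succeeds.
    pub fn to_dyn(&self) -> Box<dyn DynArray> {
        Box::new(DynBooleanArray { values: self.values.clone(), validity: None })
    }
}

#[derive(Debug, PartialEq)]
pub enum ConversionError {
    WrongType,
    ContainsNulls,
}

/// Dynamic to static: fails at runtime if the type or nullability differs.
impl TryFrom<Box<dyn DynArray>> for StaticBooleanArray {
    type Error = ConversionError;

    fn try_from(array: Box<dyn DynArray>) -> Result<Self, Self::Error> {
        let boolean = array
            .as_any()
            .downcast_ref::<DynBooleanArray>()
            .ok_or(ConversionError::WrongType)?;
        match &boolean.validity {
            Some(bits) if bits.iter().any(|&valid| !valid) => {
                Err(ConversionError::ContainsNulls)
            }
            _ => Ok(Self { values: boolean.values.clone() }),
        }
    }
}
```

The conversion is where the runtime check happens: once a `StaticBooleanArray` exists, the rest of the program can rely on its type and nullability without further checks.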

@jorgecarleitao
Author

So sorry, for some reason your answer slipped under my radar.

Thank you very much for this summary! I think it makes sense. If you think there is anything from arrow2 that you would benefit from, please let me know and we can try to move it out of arrow2 into a common crate.

Also, an idea to further simplify things here is jorgecarleitao/arrow2#385, which changes arrow2 to use Rust's default Vec (since alignments are not really needed in Arrow).
