Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Proposal: use Box<dyn Array> instead of Arc<dyn Array> #1036

Closed
jorgecarleitao opened this issue Jun 1, 2022 · 4 comments
Closed

Proposal: use Box<dyn Array> instead of Arc<dyn Array> #1036

jorgecarleitao opened this issue Jun 1, 2022 · 4 comments
Labels
investigation Issues or PRs that are investigations. Prs may or may not be merged. no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog

Comments

@jorgecarleitao
Copy link
Owner

I am trying to improve the clone-on-write aspect of this crate.

One important aspect in this journey is how to handle Arc<dyn Array>. Let's go through this problem in detail. Consider Int32Array, whose in-memory repr is

struct Int32Array {
    data_type: DataType,
    values: Buffer<i32>,
    validity: Option<Bitmap>
}

the size of this struct is quite small - e.g. values and validity are already Arced.

I.e. when we use Arc<dyn Array>, we do not gain much vs Box<dyn Array> - what we are wrapping into the Arc is basically DataType.

For nested types, we have things like

struct ListArray {
    data_type: DataType,
    offsets: Buffer<i32>,
    values: Arc<dyn Array>,
    validity: Option<Bitmap>
}

again, the Arc<dyn Array> is not very helpful here, since the struct itself is quite small.

The biggest advantage of Arc<dyn Array> is that it implements Clone. However, I recently found that Box<dyn Array> can also be clone via the dyn-clone crate.

The biggest advantage of Box<dyn Array> is that it would enable as_any_mut - i.e. we can essentially do stuff like

// This example demos how to operate on arrays in-place, i.e. without allocating an extra buffer.
use arrow2::{
    array::{Array, PrimitiveArray},
    types::NativeType,
};

// this function will clone-on-write the array and apply `f` to its values
fn cow_apply<T: NativeType, F: Fn(&mut [T])>(mut array: Box<dyn Array>, f: F) -> Box<dyn Array> {
    // 1. downcast the array to its concrete type
    let array = array
        .as_any_mut()
        .downcast_mut::<PrimitiveArray<T>>()
        .unwrap();

    // 2. empty the mut reference and create a new array on the stack
    let array = array.take();

    // 3. deconstruct the array into its parts
    let (dt, values, validity) = array.into_inner();

    // 4. clone-on-write the values
    let mut values = values.make_mut();

    // 5. apply the function over the values
    f(&mut values);

    // 6. construct the array back
    Box::new(PrimitiveArray::try_new(dt, values.into(), validity).unwrap())
}

fn main() {
    // say we have
    let a = PrimitiveArray::from_vec(vec![1i32, 2]);
    let array = Box::new(a) as Box<dyn Array>;

    // say we have an operation over values
    let f = |values: &mut [i32]| values.iter_mut().for_each(|x| *x *= 10);

    // let's apply the transform in place:
    let result = cow_apply(array, f);
    assert_eq!(result.as_ref(), PrimitiveArray::from_vec(vec![10i32, 20]));
}

in this case bypassing the MutableArray API altogether. I.e. it saves us a Arc::get_mut to deref Arc<dyn Array>, and imo makes the API easier to use and follow.

Any thoughts?

@jorgecarleitao jorgecarleitao added the investigation Issues or PRs that are investigations. Prs may or may not be merged. label Jun 1, 2022
@ritchie46
Copy link
Collaborator

#951 touches the same topic.

I do think we can do this within Arc. We can implement a trait for Arc<dyn Array> which implements as_any_mut() -> Option<&mut dyn Any>. This will return None if the ref count of the Arc is > 1.

This makes the clone still ergonomic. But maybe we can get the same ergonomics with the dyn clone crate, (though that is another dependency. :/)

@jorgecarleitao
Copy link
Owner Author

The main issue for me in the current design is that we use Arc for no reason other than "we have been using it" (it came from arrow-rs and I never really thought about it). Now that I think about it, it is making COW more challenging / less idiomatic as we have two levels of Arcs to unwrap, and the first Arc is not adding anything,

It seems to be a limitation of Rust that does not auto-implement Clone for Box<dyn T> when T: Clone. So, I see dyn-clone as fixing an issue in std (it has 100 LOC), a bit like bytesmuck. fwiw the author is pretty big in Rust - also author of syn, anyhow, cxx and thiserror.

I think the main pain is the migration - e.g. if Polars is using Arc<dyn Array>, we would need to arrow::array::clone(&*array) to get the Box.

@ritchie46
Copy link
Collaborator

I think the main pain is the migration - e.g. if Polars is using Arc, we would need to arrow::array::clone(&*array) to get the Box.

I don't think this is too painful. The list iterators already return Box<dyn Array>. In polars I call Arc::from on these.

With regard to Clone, we already have this defined on dyn Array.

    /// Clone a `&dyn Array` to an owned `Box<dyn Array>`.
    fn to_boxed(&self) -> Box<dyn Array>;

Given that the data values and validity is still Arced. The &mut Any might not be very useful. It only allows mutating the datatype, which is something you probably rarely want/ need?

To get mutable access on the underlying buffers, we still need to check the ref counts of the Arcs and give an Optional<Mutable> back (or what we currently do with either).

struct Int32Array {
    data_type: DataType,
    values: Buffer<i32>,
    validity: Option<Bitmap>
}

I do think we should implement this logic in arrow2. Currently I do this in polars, but I feel it should be part of dyn Array.

@jorgecarleitao
Copy link
Owner Author

Closed by #1042

@jorgecarleitao jorgecarleitao added the no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog label Jun 13, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
investigation Issues or PRs that are investigations. Prs may or may not be merged. no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog
Projects
None yet
Development

No branches or pull requests

2 participants