Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace Arc with Box in ArrowArray for FFI structs #1432

Closed
wants to merge 8 commits into from

Conversation

viirya
Copy link
Member

@viirya viirya commented Mar 12, 2022

Which issue does this PR close?

Closes #1425.

Rationale for this change

What changes are included in this PR?

  • Remove Clone from FFI_ArrowArray and FFI_ArrowSchema
  • Change array and schema in ArrowArray from Arc to Box type

Are there any user-facing changes?

TryFrom<ffi::ArrowArray> for ArrayData is changed to TryFrom<&ffi::ArrowArray> for ArrayData.

pub unsafe fn from_unowned(
    ptr: NonNull<u8>,
    len: usize,
    data: Arc<ffi::FFI_ArrowArray>,
) -> Self

is changed to

pub unsafe fn from_unowned(
    ptr: NonNull<u8>,
    len: usize,
) -> Self

fn owner(&self) -> &Arc<FFI_ArrowArray>; in trait ArrowArrayRef is changed to fn owner(&self) -> Arc<&FFI_ArrowArray>;.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Mar 12, 2022
@codecov-commenter
Copy link

codecov-commenter commented Mar 12, 2022

Codecov Report

Merging #1432 (e10b1d6) into master (729934c) will decrease coverage by 0.00%.
The diff coverage is 71.79%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1432      +/-   ##
==========================================
- Coverage   82.67%   82.67%   -0.01%     
==========================================
  Files         185      185              
  Lines       53822    53820       -2     
==========================================
- Hits        44500    44498       -2     
  Misses       9322     9322              
Impacted Files Coverage Δ
arrow/src/array/array.rs 85.18% <0.00%> (-0.23%) ⬇️
arrow/src/bytes.rs 56.25% <0.00%> (ø)
arrow/src/pyarrow.rs 0.00% <0.00%> (ø)
arrow/src/ffi.rs 85.91% <82.75%> (+0.13%) ⬆️
arrow/src/array/ffi.rs 100.00% <100.00%> (ø)
arrow/src/buffer/immutable.rs 99.45% <100.00%> (ø)
parquet/src/encodings/encoding.rs 93.52% <0.00%> (-0.20%) ⬇️
arrow/src/array/transform/mod.rs 86.20% <0.00%> (-0.12%) ⬇️
parquet_derive/src/parquet_field.rs 65.98% <0.00%> (ø)
arrow/src/datatypes/field.rs 54.10% <0.00%> (+0.30%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 729934c...e10b1d6. Read the comment docs.

@viirya
Copy link
Member Author

viirya commented Mar 12, 2022

Copy link
Contributor

@wangfenjin wangfenjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your effort, the change looks good to me. Only one comment in the code.

let data = ArrayData::try_from(array)?;
let data = ArrayData::try_from(&array)?;
// Avoid dropping the `Box` pointers and trigger the `release` mechanism.
let _ = ffi::ArrowArray::into_raw(array);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we need to change ArrayData::try_from() to have &? This line seems very tricky.

And as we change the ArrayData::try_from() API, all other users may also need to change the code and also add this tricky line.

Copy link
Member Author

@viirya viirya Mar 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since ArrayData::try_from() moves ffi::ArrowArray, it will drop the ArrowArray and trigger release for the structs (as they are just Box pointers now). Users cannot prevent it happened. For example, without this change, our internal usecase got a SIGSEGV.

So I change it to a borrowed reference to avoid dropping/releasing there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @viirya , understand the issue here.

I'm not sure if #1441 this can be the solution? Or even we can add a ArrowArray::try_from_box() to do the similar thing

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so.

Arc::from(Box::from_raw(array as *mut FFI_ArrowArray)) is Arc<Box<FFI_ArrowArray>>. Then the raw pointer is *Box<FFI_ArrowArray>, but you treat it as *FFI_ArrowArray.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I'm wrong. I'm not aware that there is from(v: Box<T>) API in Arc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously you don't need it. As Arc is kept in the created Buffer of Array data, you can rely on deallocation of the Buffer to call release of such ffi structs.

But Box cannot give us such benefit. So it makes the management more explicit and relying on users. We need to keep these structs so release won't be called, before we don't need the Array data (Buffer).

The code is at ArrowArrayRef.to_data. It is to create an ArrayData from an ArrowArray(Ref). And you can follow buffers -> create_buffer -> Buffer::from_unowned.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks for the explanation.

I'm wondering if we can still use Arc but drop the "envelop memory" allocated for the struct holding the actual pointers, for the input array and schema. For example:

    pub unsafe fn try_from_raw(
        array: *const FFI_ArrowArray,
        schema: *const FFI_ArrowSchema,
    ) -> Result<Self> {
        if array.is_null() || schema.is_null() {
            return Err(ArrowError::MemoryError(
                "At least one of the pointers passed to `try_from_raw` is null"
                    .to_string(),
            ));
        };

        let array_mut = array as *mut FFI_ArrowArray;
        let schema_mut = schema as *mut FFI_ArrowSchema;

        let array_data = std::ptr::replace(array_mut, FFI_ArrowArray::empty());
        let schema_data = std::ptr::replace(schema_mut, FFI_ArrowSchema::empty());

        std::ptr::drop_in_place(array_mut);
        std::ptr::drop_in_place(schema_mut);

        Ok(Self {
            array: Arc::new(array_data),
            schema: Arc::new(schema_data),
        })
    }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks close to previous suggestion Arc::from as it also copies bytes from source structs. But it causes SIGSEGV. I will try to test this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried it. It seems okay. Arc::from previously not work, I think, is because it calls allocator to deallocate the memory allocation. As it is allocated by Java in our case, we cannot let Rust to deallocate it.

std::ptr::drop_in_place seems only trigger dropping. As we make it as empty structs, it won't trigger release. I think this is close to #1436 which cleans up release field of source structs after cloning it. Here we in fact still clone it, but just internally and don't expose clone.

Looks good to me. Thanks @sunchao .

cc @alamb @wangfenjin WDYT? Are you agreed with this approach?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The suggested approach is at #1449.

@viirya
Copy link
Member Author

viirya commented Mar 13, 2022

Note that because the struct pointers are now Box instead of Arc, it requires the users of these structs to be more careful on managing them. For example, ArrayData::try_from. Users need to keep these Box/raw pointers (in order not to release them too early) and do release after using the array data.

That's why I personally prefer alternative Arc + clone + removing source structs' release approach (see #1436). But I'm fine with Box approach too if this is preferred by the community.

@alamb
Copy link
Contributor

alamb commented Mar 14, 2022

Note that because the struct pointers are now Box instead of Arc, it requires the users of these structs to be more careful on managing them. For example, ArrayData::try_from. Users need to keep these Box/raw pointers (in order not to release them too early) and do release after using the array data.

That's why I personally prefer alternative Arc + clone + removing source structs' release approach (see #1436). But I'm fine with Box approach too if this is preferred by the community.

I think the "be explicit about memory management" is the typical Rust mantra in this case, so I would prefer the Box approach.

Perhaps in your use of the FFI structs, you could use std::mem::forget if you don't mind not releasing the memory over time and don't want to track the allocations individually.

@alamb
Copy link
Contributor

alamb commented Mar 14, 2022

#1442 tracks the patch releases

@viirya
Copy link
Member Author

viirya commented Mar 14, 2022

@alamb Thanks. It makes sense. Then I think the Box approach is good.

@viirya
Copy link
Member Author

viirya commented Mar 23, 2022

This can be closed now.

@viirya viirya closed this Mar 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

FFI: ArrowArray::try_from_raw shouldn't clone
5 participants