Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arrow safe refactoring #3

Merged
merged 3 commits into from
Aug 10, 2024
Merged

Arrow safe refactoring #3

merged 3 commits into from
Aug 10, 2024

Conversation

Hennzau
Copy link
Collaborator

@Hennzau Hennzau commented Aug 8, 2024

Objective

Hi! I was adding other datatypes in the other_datatype branch, and after discussing with Marc Perlade https://github.com/Ar3sion https://git.eleves.ens.fr/mperlade, a student at ENS PSL, we found a safer way to retrieve data from the arrow::UnionArray than the previous method. So, I rewrote the Arrow part and achieved a really neat result!

Changes

If you want to convert an Arrow object into an Image, you can simply get the UnionArray, convert it into a HashMap, and then extract the fields as Vec<T>.

pub fn from_arrow(array: arrow::array::UnionArray) -> Result<Self> {
    let mut map = arrow_union_into_map(array)?;

    let width =
        get_primitive_array_from_map::<u32, arrow::datatypes::UInt32Type>("width", &mut map)?
            .first();
    let name = get_utf8_array_from_map("name", &mut map)?
        .first();
}

Code

Everything is safe!

// Convert an ArrowUnion to a HashMap
pub fn arrow_union_into_map(
    array: arrow::array::UnionArray,
) -> Result<HashMap<String, arrow::array::ArrayRef>> {
    let mut result = HashMap::new();

    let (union_fields, _, _, children) = array.into_parts();

    for (a, b) in union_fields.iter() {
        let child = children
            .get(a as usize)
            .ok_or_eyre(Report::msg(
                "Invalid union array field index. Must be >= 0 and correspond to the children index in the array.",
            ))?
            .clone();

        result.insert(b.name().to_string(), child);
    }

    Ok(result)
}

// Extract a Primitive Array without copying by consuming the `Arc<dyn Array>`
pub fn get_primitive_array_from_map<
    T: arrow::datatypes::ArrowNativeType,
    G: arrow::datatypes::ArrowPrimitiveType,
>(
    field: &str,
    map: &mut HashMap<String, arrow::array::ArrayRef>,
) -> Result<Vec<T>> {
    use arrow::array::Array;

    let array_data = map
        .remove(field)
        .ok_or_eyre(Report::msg("Invalid field for this map."))?
        .into_data();

    let array = arrow::array::PrimitiveArray::<G>::from(array_data);
    let (_, buffer, _) = array.into_parts();
    let buffer = buffer.into_inner();

    match buffer.into_vec::<T>() {
        Ok(vec) => Ok(vec),
        Err(e) => Err(Report::msg(format!(
            "T is not a valid type for this buffer. It must have the same layout as the buffer (this usually occurs when the type is incorrect or when another reference exists). Error: {:?}", e
        ))),
    }
}

// Extract a UTF8 String Array with only ONE copy per string, by consuming the `Arc<dyn Array>`
pub fn get_utf8_array_from_map(
    field: &str,
    map: &mut HashMap<String, arrow::array::ArrayRef>,
) -> Result<Vec<String>> {
    use arrow::array::Array;

    let array_data = map
        .remove(field)
        .ok_or_eyre(Report::msg("Invalid field for this map."))?
        .into_data();

    let array = arrow::array::StringArray::from(array_data);
    let (offsets, buffer, _) = array.into_parts();

    let slice = buffer.as_slice();
    let mut last_offset = 0;
    let mut iterator = offsets.iter();
    iterator.next();

    iterator
        .map(|&offset| {
            let offset = offset as usize;
            let slice = &slice[last_offset..offset];
            last_offset = offset;

            String::from_utf8(slice.to_vec()).wrap_err(Report::msg("Array is not UTF-8 encoded."))
        })
        .collect::<Result<Vec<String>>>()
}

Acknowledgments

I really want to thank Marc Perlade for his help. He’s not into sharing his work on GitHub/GitLab, but he is fantastic.

@Hennzau Hennzau requested a review from haixuanTao August 8, 2024 18:16
@Hennzau Hennzau self-assigned this Aug 8, 2024
@Hennzau Hennzau added the enhancement New feature or request label Aug 8, 2024
Copy link
Contributor

@haixuanTao haixuanTao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I think this should work, but it should also add couple of heap calls.

I'm not against it, but it could be interesting to add some roundtrip benchmark if you have time.

Comment on lines +75 to +95
Encoding::RGB8 => {
let data = get_primitive_array_from_map::<u8, arrow::datatypes::UInt8Type>(
"data", &mut map,
)?;

DataContainer::from_u8(data)
}
Encoding::BGR8 => {
let data = get_primitive_array_from_map::<u8, arrow::datatypes::UInt8Type>(
"data", &mut map,
)?;

DataContainer::from_u8(data)
}
Encoding::GRAY8 => {
let data = get_primitive_array_from_map::<u8, arrow::datatypes::UInt8Type>(
"data", &mut map,
)?;

DataContainer::from_u8(data)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Encoding::RGB8 => {
let data = get_primitive_array_from_map::<u8, arrow::datatypes::UInt8Type>(
"data", &mut map,
)?;
DataContainer::from_u8(data)
}
Encoding::BGR8 => {
let data = get_primitive_array_from_map::<u8, arrow::datatypes::UInt8Type>(
"data", &mut map,
)?;
DataContainer::from_u8(data)
}
Encoding::GRAY8 => {
let data = get_primitive_array_from_map::<u8, arrow::datatypes::UInt8Type>(
"data", &mut map,
)?;
DataContainer::from_u8(data)
}
Encoding::RGB8 | Encoding::BGR8 | Encoding::GRAY8 => {
let data = get_primitive_array_from_map::<u8, arrow::datatypes::UInt8Type>(
"data", &mut map,
)?;
DataContainer::from_u8(data)
}

@Hennzau Hennzau merged commit 53ac2ba into other_datatype Aug 10, 2024
24 checks passed
@Hennzau Hennzau deleted the arrow_refactor branch August 10, 2024 09:53
Hennzau added a commit that referenced this pull request Sep 4, 2024
Hennzau added a commit that referenced this pull request Sep 9, 2024
# Objective

This PR adds a new datatype for **BBox** to ensure that everything work
well and it's easy to add a new datatype.

# Datatypes & Usage

- [x] Image

```Rust
use crate::image::Image;

let flat_image = (0..27).collect::<Vec<u8>>();
let image = Image::new_rgb8(flat_image, 3, 3, Some("camera.test")).unwrap();

let final_image = image.into_bgr8().unwrap();
let final_image_data = final_image.data.as_u8().unwrap();

let expected_image = vec![
    2, 1, 0, 5, 4, 3, 8, 7, 6, 11, 10, 9, 14, 13, 12, 17, 16, 15, 20, 19, 18, 23, 22, 21,
    26, 25, 24,
];

assert_eq!(&expected_image, final_image_data);

use crate::image::Image;

let flat_image = vec![0; 27];
let original_buffer_address = flat_image.as_ptr();

let bgr8_image = Image::new_bgr8(flat_image, 3, 3, None).unwrap();
let image_buffer_address = bgr8_image.as_ptr();

let arrow_image = bgr8_image.into_arrow().unwrap();

let new_image = Image::from_arrow(arrow_image).unwrap();
let final_image_buffer = new_image.as_ptr();

assert_eq!(original_buffer_address, image_buffer_address);
assert_eq!(image_buffer_address, final_image_buffer);
```

- [x] BBox

```Rust
use crate::bbox::BBox;

let flat_bbox = vec![1.0, 1.0, 2.0, 2.0];
let confidence = vec![0.98];
let label = vec!["cat".to_string()];

let bbox = BBox::new_xyxy(flat_bbox, confidence, label).unwrap();
let final_bbox = bbox.into_xywh().unwrap();
let final_bbox_data = final_bbox.data;

let expected_bbox = vec![1.0, 1.0, 1.0, 1.0];

assert_eq!(expected_bbox, final_bbox_data);

use crate::bbox::BBox;

let flat_bbox = vec![1.0, 1.0, 2.0, 2.0];
let original_buffer_address = flat_bbox.as_ptr();

let confidence = vec![0.98];
let label = vec!["cat".to_string()];

let xyxy_bbox = BBox::new_xyxy(flat_bbox, confidence, label).unwrap();
let bbox_buffer_address = xyxy_bbox.data.as_ptr();

let arrow_bbox = xyxy_bbox.into_arrow().unwrap();

let new_bbox = BBox::from_arrow(arrow_bbox).unwrap();
let final_bbox_buffer = new_bbox.data.as_ptr();

assert_eq!(original_buffer_address, bbox_buffer_address);
assert_eq!(bbox_buffer_address, final_bbox_buffer);
```

# Quick Fixes

- I also improved readability and consistency with Rust formatting,
following the guidelines mentioned in [this
comment](#1 (comment)).
- Fix Arrow Array extraction from Union inside #3 
- Fix Dora compatibility (by viewing objects when it's not possible to
own them) inside #5
- I improved the structure of the library, with separated packages and
with some `features` as one might want to use `fastformat` only with
ndarray/arrow. #6
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants