This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Increased API consistency for COW and respective docs #833

Merged
merged 1 commit into from
Feb 14, 2022

18 changes: 9 additions & 9 deletions guide/src/high_level.md
@@ -6,7 +6,7 @@ Contrarily to `Arc<Vec<Option<T>>`, arrays in this crate are represented in such
that they can be zero-copied to any other Arrow implementation via foreign interfaces (FFI).

Probably the simplest `Array` in this crate is the `PrimitiveArray<T>`. It can be
constructed as from a slice of option values,
constructed from a slice of option values,

```rust
# use arrow2::array::{Array, PrimitiveArray};
@@ -36,13 +36,13 @@ assert_eq!(array.len(), 3)
# }
```

A `PrimitiveArray` has 3 components:
A `PrimitiveArray` (and every `Array` implemented in this crate) has 3 components:

1. A physical type (e.g. `i32`)
2. A logical type (e.g. `DataType::Int32`)
3. Data
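
As a rough illustration of the three components listed above, the sketch below models them with plain standard-library types. `ToyDataType` and `ToyPrimitiveArray` are hypothetical names made up for this example and are not arrow2's API. The validity is kept as a `Vec<bool>` for brevity, whereas the crate uses a bit-packed `Bitmap`. The only point is that the same physical `i32` values read differently under different logical types.

```rust
/// Logical types for this sketch only (hypothetical, far smaller than arrow2's `DataType`).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ToyDataType {
    Int32,
    /// Days since the UNIX epoch, physically stored as `i32`.
    Date32,
}

/// Toy array with the three components: physical type, logical type, data.
pub struct ToyPrimitiveArray {
    data_type: ToyDataType,      // 2. logical type
    values: Vec<i32>,            // 1. physical type is `i32`; 3. data
    validity: Option<Vec<bool>>, // 3. data (`None` means "no nulls")
}

impl ToyPrimitiveArray {
    pub fn new(data_type: ToyDataType, slots: &[Option<i32>]) -> Self {
        // Null slots still occupy a value; here they are filled with 0.
        let values = slots.iter().map(|slot| slot.unwrap_or(0)).collect();
        let validity = if slots.iter().any(|slot| slot.is_none()) {
            Some(slots.iter().map(|slot| slot.is_some()).collect())
        } else {
            None
        };
        Self { data_type, values, validity }
    }

    pub fn get(&self, i: usize) -> Option<i32> {
        let valid = self.validity.as_ref().map_or(true, |v| v[i]);
        if valid { Some(self.values[i]) } else { None }
    }

    /// Renders slot `i` according to the logical type.
    pub fn render(&self, i: usize) -> String {
        match (self.get(i), self.data_type) {
            (None, _) => "null".to_string(),
            (Some(v), ToyDataType::Int32) => v.to_string(),
            (Some(v), ToyDataType::Date32) => format!("{} days since 1970-01-01", v),
        }
    }
}
```

Building two such arrays from the same `&[Some(1), None, Some(10)]`, one tagged `Int32` and one tagged `Date32`, stores identical values and differs only in how `render` interprets them.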

The main differences from a `Vec<Option<T>>` are:
The main differences from an `Arc<Vec<Option<T>>>` are:

* Its data is laid out in memory as a `Buffer<T>` and an `Option<Bitmap>` (see [../low_level.md])
* It has an associated logical type (`DataType`).
@@ -84,16 +84,16 @@ The following arrays are supported:
* `Utf8Array<i32>` and `Utf8Array<i64>` (for strings)
* `BinaryArray<i32>` and `BinaryArray<i64>` (for opaque binaries)
* `FixedSizeBinaryArray` (like `BinaryArray`, but fixed size)
* `ListArray<i32>` and `ListArray<i64>` (nested arrays)
* `FixedSizeListArray` (nested arrays of fixed size)
* `StructArray` (every row has multiple logical types)
* `ListArray<i32>` and `ListArray<i64>` (array of arrays)
* `FixedSizeListArray` (array of arrays of a fixed size)
* `StructArray` (multiple named arrays where each row has one element from each array)
* `UnionArray` (every row has a different logical type)
* `DictionaryArray<K>` (nested array with encoded values)

## Array as a trait object

`Array` is object safe, and all implementations of `Array` can be cast
to `&dyn Array`, which enables run-time nesting.
to `&dyn Array`, which enables dynamic casting and run-time nesting.

```rust
# use arrow2::array::{Array, PrimitiveArray};
@@ -177,8 +177,8 @@ This crate's APIs are generally split into two patterns: whether an operation le
contiguous memory regions or whether it does not.

What this means is that certain operations can be performed irrespectively of whether a value
is "null" or not (e.g. `PrimitiveArray<i32> + i32` can be applied to _all_ values via SIMD and
only copy the validity bitmap independently).
is "null" or not (e.g. `PrimitiveArray<i32> + i32` can be applied to _all_ values
via SIMD and only copy the validity bitmap independently).

When an operation benefits from such arrangement, it is advantageous to use

12 changes: 10 additions & 2 deletions guide/src/low_level.md
@@ -5,7 +5,7 @@ The starting point of this crate is the idea that data is stored in memory in a
The most important design aspect of this crate is that contiguous regions are shared via an
`Arc`. In this context, the operation of slicing a memory region is `O(1)` because it
corresponds to changing an offset and length. The tradeoff is that once under
an `Arc`, memory regions are immutable.
an `Arc`, memory regions are immutable. See note below on how to overcome this.

The second most important aspect is that Arrow has two main types of data buffers: bitmaps,
whose offsets are measured in bits, and byte types (such as `i32`), whose offsets are
@@ -55,7 +55,8 @@ interoperable in-memory format.
## Bitmaps

Arrow's in-memory arrangement of boolean values is different from `Vec<bool>`. Specifically,
arrow uses individual bits to represent a boolean, as opposed to the usual byte that `bool` holds.
arrow uses individual bits to represent a boolean, as opposed to the usual byte
that `bool` holds.
Besides the 8x compression, this makes the validity particularly useful for
[AVX512](https://en.wikipedia.org/wiki/AVX-512) masks.
One tradeoff is that an Arrow bitmap is not represented as a Rust slice, as Rust slices use
@@ -86,3 +87,10 @@ x.set(1, true);
assert_eq!(x.get(1), true);
# }
```
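
To make the bit-versus-byte distinction in this section concrete, here is a minimal sketch of a bit-packed bitmap written against the standard library only. `ToyBitmap` is a hypothetical type for illustration and is not the crate's `Bitmap` or `MutableBitmap`. It assumes the least-significant-bit-first order that the Arrow format uses within each byte.

```rust
/// A toy bit-packed bitmap: 8 booleans per byte, least-significant bit first.
/// Hypothetical type for illustration; not arrow2's implementation.
#[derive(Debug, Default)]
pub struct ToyBitmap {
    bytes: Vec<u8>,
    len: usize, // number of bits, not bytes
}

impl ToyBitmap {
    pub fn from_bools(values: &[bool]) -> Self {
        let mut bitmap = Self::default();
        for &value in values {
            bitmap.push(value);
        }
        bitmap
    }

    pub fn push(&mut self, value: bool) {
        // Start a new zeroed byte every 8 bits.
        if self.len % 8 == 0 {
            self.bytes.push(0);
        }
        if value {
            let last = self.bytes.len() - 1;
            self.bytes[last] |= 1u8 << (self.len % 8);
        }
        self.len += 1;
    }

    pub fn get(&self, i: usize) -> bool {
        assert!(i < self.len, "index out of bounds");
        // Byte `i / 8` holds the bit; `i % 8` is its position inside that byte.
        (self.bytes[i / 8] & (1u8 << (i % 8))) != 0
    }

    pub fn set(&mut self, i: usize, value: bool) {
        assert!(i < self.len, "index out of bounds");
        let mask = 1u8 << (i % 8);
        if value {
            self.bytes[i / 8] |= mask;
        } else {
            self.bytes[i / 8] &= !mask;
        }
    }

    /// Number of bits stored.
    pub fn len(&self) -> usize {
        self.len
    }

    /// Number of bytes used, i.e. `len` divided by 8 and rounded up.
    pub fn byte_len(&self) -> usize {
        self.bytes.len()
    }
}
```

Ten booleans fit in two bytes here, where a `Vec<bool>` would use ten. Because a single bit has no address, `get` has to return a `bool` by value instead of a reference, which is the reason such a bitmap cannot be exposed as a `&[bool]`.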

## Copy on write (COW) semantics

Both `Buffer` and `Bitmap` support copy on write semantics via `into_mut`, which may convert
them to a `Vec` or `MutableBitmap` respectively.

This allows re-using them to e.g. perform multiple operations without allocations.
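
The idea can be sketched with the standard library alone. Everything below is an illustrative model under stated assumptions: `ToyBuffer`, the local `Either` enum and `add_scalar` are hypothetical names, and the sketch uses `Arc::try_unwrap` where the doc comments in this PR refer to `Arc::get_mut`. It is not the crate's implementation, and FFI-imported memory is not modelled.

```rust
use std::sync::Arc;

/// Minimal stand-in for an `Either` type (hypothetical, defined locally for this sketch).
#[derive(Debug)]
pub enum Either<L, R> {
    Left(L),
    Right(R),
}

/// Toy immutable buffer: a shared allocation plus an offset and a length.
#[derive(Debug, Clone)]
pub struct ToyBuffer {
    data: Arc<Vec<i32>>,
    offset: usize,
    length: usize,
}

impl ToyBuffer {
    pub fn from_vec(values: Vec<i32>) -> Self {
        let length = values.len();
        Self { data: Arc::new(values), offset: 0, length }
    }

    /// O(1) slice: only the offset and length change, the allocation is shared.
    pub fn slice(&self, offset: usize, length: usize) -> Self {
        assert!(offset + length <= self.length);
        Self { data: self.data.clone(), offset: self.offset + offset, length }
    }

    pub fn as_slice(&self) -> &[i32] {
        &self.data[self.offset..self.offset + self.length]
    }

    /// Returns the inner `Vec` if this buffer has no offset and is the only
    /// owner of the allocation; otherwise hands the buffer back unchanged.
    pub fn into_mut(self) -> Either<Self, Vec<i32>> {
        if self.offset != 0 {
            return Either::Left(self);
        }
        let ToyBuffer { data, offset, length } = self;
        match Arc::try_unwrap(data) {
            Ok(mut values) => {
                // Toy choice: drop any tail hidden by a shorter length.
                values.truncate(length);
                Either::Right(values)
            }
            Err(data) => Either::Left(ToyBuffer { data, offset, length }),
        }
    }
}

/// Adds `k` to every value, mutating in place when the buffer is uniquely
/// owned and falling back to a fresh allocation when it is shared or sliced.
pub fn add_scalar(buffer: ToyBuffer, k: i32) -> ToyBuffer {
    let values = match buffer.into_mut() {
        Either::Right(mut values) => {
            values.iter_mut().for_each(|v| *v += k);
            values
        }
        Either::Left(shared) => shared.as_slice().iter().map(|v| v + k).collect(),
    };
    ToyBuffer::from_vec(values)
}
```

Chaining `add_scalar` calls on a buffer that nobody else holds keeps working on the same values allocation. Once the buffer has been cloned or sliced with a non-zero offset, the call falls back to copying, so other holders never observe the mutation.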
4 changes: 2 additions & 2 deletions src/array/primitive/mod.rs
@@ -196,7 +196,7 @@ impl<T: NativeType> PrimitiveArray<T> {
self.values,
Some(bitmap),
)),
Right(mutable_bitmap) => match self.values.get_vec() {
Right(mutable_bitmap) => match self.values.into_mut() {
Left(buffer) => Left(PrimitiveArray::from_data(
self.data_type,
buffer,
@@ -210,7 +210,7 @@ impl<T: NativeType> PrimitiveArray<T> {
},
}
} else {
match self.values.get_vec() {
match self.values.into_mut() {
Left(buffer) => Left(PrimitiveArray::from_data(self.data_type, buffer, None)),
Right(values) => Right(MutablePrimitiveArray::from_data(
self.data_type,
4 changes: 2 additions & 2 deletions src/array/utf8/mod.rs
@@ -218,7 +218,7 @@ impl<O: Offset> Utf8Array<O> {
self.values,
Some(bitmap),
)),
Right(mutable_bitmap) => match (self.values.get_vec(), self.offsets.get_vec()) {
Right(mutable_bitmap) => match (self.values.into_mut(), self.offsets.into_mut()) {
(Left(immutable_values), Left(immutable_offsets)) => {
Left(Utf8Array::from_data(
self.data_type,
@@ -250,7 +250,7 @@ impl<O: Offset> Utf8Array<O> {
},
}
} else {
match (self.values.get_vec(), self.offsets.get_vec()) {
match (self.values.into_mut(), self.offsets.into_mut()) {
(Left(immutable_values), Left(immutable_offsets)) => Left(Utf8Array::from_data(
self.data_type,
immutable_offsets,
8 changes: 7 additions & 1 deletion src/bitmap/immutable.rs
@@ -177,7 +177,13 @@ impl Bitmap {
self.offset
}

/// Try to convert this `Bitmap` to a `MutableBitmap`
/// Converts this [`Bitmap`] to [`MutableBitmap`], returning itself if the conversion
/// is not possible
///
/// This operation returns a [`MutableBitmap`] iff:
/// * this [`Bitmap`] is not an offsetted slice of another [`Bitmap`]
/// * this [`Bitmap`] has not been cloned (i.e. [`Arc`]`::get_mut` yields [`Some`])
/// * this [`Bitmap`] was not imported from the c data interface (FFI)
pub fn into_mut(mut self) -> Either<Self, MutableBitmap> {
match (
self.offset,
14 changes: 8 additions & 6 deletions src/buffer/immutable.rs
@@ -131,12 +131,14 @@ impl<T: NativeType> Buffer<T> {
self.offset
}

/// Try to get the inner data as a mutable [`Vec<T>`].
/// This succeeds iff:
/// * This data was allocated by Rust (i.e. it does not come from the C data interface)
/// * This region is not being shared any other struct.
/// * This buffer has no offset
pub fn get_vec(mut self) -> Either<Self, Vec<T>> {
/// Converts this [`Buffer`] to [`Vec`], returning itself if the conversion
/// is not possible
///
/// This operation returns a [`Vec`] iff this [`Buffer`]:
/// * is not an offsetted slice of another [`Buffer`]
/// * has not been cloned (i.e. [`Arc`]`::get_mut` yields [`Some`])
/// * has not been imported from the c data interface (FFI)
pub fn into_mut(mut self) -> Either<Self, Vec<T>> {
if self.offset != 0 {
Either::Left(self)
} else {