diff --git a/guide/src/README.md b/guide/src/README.md index 3029531bd31..3042526a2e8 100644 --- a/guide/src/README.md +++ b/guide/src/README.md @@ -10,22 +10,3 @@ Arrow2 is divided into three main parts: * a [low-level API](./low_level.md) to efficiently operate with contiguous memory regions; * a [high-level API](./high_level.md) to operate with arrow arrays; * a [metadata API](./metadata.md) to declare and operate with logical types and metadata. - -## Cargo features - -This crate has a significant number of cargo features to reduce compilation -time and number of dependencies. The feature `"full"` activates most -functionality, such as: - -* `io_ipc`: to interact with the IPC format -* `io_ipc_compression`: to read and write compressed IPC (v2) -* `io_csv` to read and write CSV -* `io_json` to read and write JSON -* `io_parquet` to read and write parquet -* `io_parquet_compression` to read and write compressed parquet -* `io_print` to write batches to formatted ASCII tables -* `compute` to operate on arrays (addition, sum, sort, etc.) - -The feature `simd` (not part of `full`) produces more explicit SIMD instructions -via [`packed_simd`](https://github.com/rust-lang/packed_simd), but requires the -nightly channel. diff --git a/guide/src/arrow.md b/guide/src/arrow.md index fa99a5b38d1..d59166205fb 100644 --- a/guide/src/arrow.md +++ b/guide/src/arrow.md @@ -1,6 +1,6 @@ # Introduction -Welcome to the Arrow2 guide for the Rust programming language. This guide was +Welcome to the Arrow2 guide for the Rust programming language. This guide was created to help you become familiar with the Arrow2 crate and its functionalities. diff --git a/guide/src/compute.md b/guide/src/compute.md index 68948149de0..21944d988a7 100644 --- a/guide/src/compute.md +++ b/guide/src/compute.md @@ -1,12 +1,18 @@ # Compute API -When compiled with the feature `compute`, this crate offers a wide range of functions to perform both vertical (e.g. add two arrays) and horizontal (compute the sum of an array) operations. +When compiled with the feature `compute`, this crate offers a wide range of functions +to perform both vertical (e.g. add two arrays) and horizontal +(compute the sum of an array) operations. -```rust -{{#include ../../examples/arithmetics.rs}} -``` +The overall design of the `compute` module is that it offers two APIs: + +* statically typed, such as `sum_primitive(&PrimitiveArray) -> Option` +* dynamically typed, such as `sum(&dyn Array) -> Box` + +the dynamically typed API usually has a function `can_*(&DataType) -> bool` denoting whether +the operation is defined for the particular logical type. -An overview of the implemented functionality. +Overview of the implemented functionality: * arithmetics, checked, saturating, etc. * `sum`, `min` and `max` @@ -17,7 +23,13 @@ An overview of the implemented functionality. * `sort`, `hash`, `merge-sort` * `if-then-else` * `nullif` -* `lenght` (of string) -* `hour`, `year` (of temporal logical types) +* `length` (of string) +* `hour`, `year`, `month`, `iso_week` (of temporal logical types) * `regex` * (list) `contains` + +and an example of how to use them: + +```rust +{{#include ../../examples/arithmetics.rs}} +``` diff --git a/guide/src/extension.md b/guide/src/extension.md index a514701036e..ee60e6ec1c4 100644 --- a/guide/src/extension.md +++ b/guide/src/extension.md @@ -1,7 +1,12 @@ # Extension types This crate supports Arrows' ["extension type"](https://arrow.apache.org/docs/format/Columnar.html#extension-types), to declare, use, and share custom logical types. -The follow example shows how to declare one: + +An extension type is just a `DataType` with a name and some metadata. +In particular, its physical representation is equal to its inner `DataType`, which implies +that all functionality in this crate works as if it was the inner `DataType`. + +The following example shows how to declare one: ```rust {{#include ../../examples/extension.rs}} diff --git a/guide/src/ffi.md b/guide/src/ffi.md index 40d01984c38..4b8267266a2 100644 --- a/guide/src/ffi.md +++ b/guide/src/ffi.md @@ -5,8 +5,8 @@ has a specification, which allows languages to share data structures via foreign interfaces at zero cost (i.e. via pointers). This is known as the [C Data interface](https://arrow.apache.org/docs/format/CDataInterface.html). -This crate supports importing from and exporting to all `DataType`s. Follow the -example below to learn how to import and export: +This crate supports importing from and exporting to all its physical types. The +example below demonstrates how to use the API: ```rust {{#include ../../examples/ffi.rs}} diff --git a/guide/src/high_level.md b/guide/src/high_level.md index 3bac96c4c43..c5887e7c02d 100644 --- a/guide/src/high_level.md +++ b/guide/src/high_level.md @@ -1,10 +1,12 @@ # High-level API -The simplest way to think about an arrow `Array` is that it represents -`Vec>` and has a logical type (see [metadata](../metadata.md))) associated with it. +Arrow core trait the `Array`, which you can think of as representing `Arc>` +with associated metadata (see [metadata](../metadata.md))). +Contrarily to `Arc>`, arrays in this crate are represented in such a way +that they can be zero-copied to any other Arrow implementation via foreign interfaces (FFI). -Probably the simplest array in this crate is `PrimitiveArray`. It can be constructed -from a slice as follows: +Probably the simplest `Array` in this crate is the `PrimitiveArray`. It can be +constructed as from a slice of option values, ```rust # use arrow2::array::{Array, PrimitiveArray}; @@ -72,7 +74,7 @@ let array = PrimitiveArray::::from_slice([1, 0, 123]); assert_eq!(array.data_type(), &DataType::Int32); # } ``` -they can be cheaply converted to via `.to(DataType)`. +they can be cheaply (`O(1)`) converted to via `.to(DataType)`. The following arrays are supported: @@ -88,11 +90,10 @@ The following arrays are supported: * `UnionArray` (every row has a different logical type) * `DictionaryArray` (nested array with encoded values) -## Dynamic Array +## Array as a trait object -There is a more powerful aspect of arrow arrays, and that is that they all -implement the trait `Array` and can be cast to `&dyn Array`, i.e. they can be turned into -a trait object. This enables arrays to have types that are dynamic in nature. +`Array` is object safe, and all implementations of `Array` and can be casted +to `&dyn Array`, which enables run-time nesting. ```rust # use arrow2::array::{Array, PrimitiveArray}; @@ -106,7 +107,7 @@ let a: &dyn Array = &a; Given a trait object `array: &dyn Array`, we know its physical type via `PhysicalType: array.data_type().to_physical_type()`, which we use to downcast the array -to its concrete type: +to its concrete physical type: ```rust # use arrow2::array::{Array, PrimitiveArray}; @@ -135,6 +136,7 @@ an each implementation of `Array` (a struct): | `FixedSizeList` | `FixedSizeListArray` | | `Struct` | `StructArray` | | `Union` | `UnionArray` | +| `Map` | `MapArray` | | `Dictionary(_)` | `DictionaryArray<_>` | where `_` represents each of the variants (e.g. `PrimitiveType::Int32 <-> i32`). @@ -174,13 +176,18 @@ and how to make them even more efficient. This crate's APIs are generally split into two patterns: whether an operation leverages contiguous memory regions or whether it does not. -If yes, then use: +What this means is that certain operations can be performed irrespectively of whether a value +is "null" or not (e.g. `PrimitiveArray + i32` can be applied to _all_ values via SIMD and +only copy the validity bitmap independently). + +When an operation benefits from such arrangement, it is advantageous to use * `Buffer::from_iter` * `Buffer::from_trusted_len_iter` * `Buffer::try_from_trusted_len_iter` -If not, then use the builder API, such as `MutablePrimitiveArray`, `MutableUtf8Array` or `MutableListArray`. +If not, then use the `MutableArray` API, such as +`MutablePrimitiveArray`, `MutableUtf8Array` or `MutableListArray`. We have seen examples where the latter API was used. In the last example of this page you will be introduced to an example of using the former for SIMD. @@ -209,14 +216,23 @@ Like `FromIterator`, this crate contains two sets of APIs to iterate over data. an array `array: &PrimitiveArray`, the following applies: 1. If you need to iterate over `Option<&T>`, use `array.iter()` -2. If you can operate over the values and validity independently, use `array.values() -> &Buffer` and `array.validity() -> Option<&Bitmap>` +2. If you can operate over the values and validity independently, + use `array.values() -> &Buffer` and `array.validity() -> Option<&Bitmap>` -Note that case 1 is useful when e.g. you want to perform an operation that depends on both validity and values, while the latter is suitable for SIMD and copies, as they return contiguous memory regions (buffers and bitmaps). We will see below how to leverage these APIs. +Note that case 1 is useful when e.g. you want to perform an operation that depends on both +validity and values, while the latter is suitable for SIMD and copies, as they return +contiguous memory regions (buffers and bitmaps). We will see below how to leverage these APIs. -This idea holds more generally in this crate's arrays: `values()` returns something that has a contiguous in-memory representation, while `iter()` returns items taking validity into account. To get an iterator over contiguous values, use `array.values().iter()`. +This idea holds more generally in this crate's arrays: `values()` returns something that has +a contiguous in-memory representation, while `iter()` returns items taking validity into account. +To get an iterator over contiguous values, use `array.values().iter()`. There is one last API that is worth mentioning, and that is `Bitmap::chunks`. When performing -bitwise operations, it is often more performant to operate on chunks of bits instead of single bits. `chunks` offers a `TrustedLen` of `u64` with the bits + an extra `u64` remainder. We expose two functions, `unary(Bitmap, Fn) -> Bitmap` and `binary(Bitmap, Bitmap, Fn) -> Bitmap` that use this API to efficiently perform bitmap operations. +bitwise operations, it is often more performant to operate on chunks of bits +instead of single bits. `chunks` offers a `TrustedLen` of `u64` with the bits ++ an extra `u64` remainder. We expose two functions, `unary(Bitmap, Fn) -> Bitmap` +and `binary(Bitmap, Bitmap, Fn) -> Bitmap` that use this API to efficiently +perform bitmap operations. ## Vectorized operations @@ -238,18 +254,23 @@ where O: NativeType, F: Fn(I) -> O, { + // create the iterator over _all_ values let values = array.values().iter().map(|v| op(*v)); let values = Buffer::from_trusted_len_iter(values); + // create the new array, cloning its validity PrimitiveArray::::from_data(data_type.clone(), values, array.validity().cloned()) } ``` Some notes: -1. We used `array.values()`, as described above: this operation leverages a contiguous memory region. +1. We used `array.values()`, as described above: this operation leverages a + contiguous memory region. 2. We leveraged normal rust iterators for the operation. 3. We used `op` on the array's values irrespectively of their validity, -and cloned its validity. This approach is suitable for operations whose branching off is more expensive than operating over all values. If the operation is expensive, then using `PrimitiveArray::::from_trusted_len_iter` is likely faster. + and cloned its validity. This approach is suitable for operations whose branching off + is more expensive than operating over all values. If the operation is expensive, + then using `PrimitiveArray::::from_trusted_len_iter` is likely faster. diff --git a/guide/src/low_level.md b/guide/src/low_level.md index e4c69069fe3..9f1bc332e1c 100644 --- a/guide/src/low_level.md +++ b/guide/src/low_level.md @@ -1,17 +1,24 @@ # Low-level API -The starting point of this crate is the idea that data must be stored in memory in a specific arrangement to be interoperable with Arrow's ecosystem. With this in mind, this crate does not use `Vec` but instead has its own containers to store data, including sharing and consuming it via FFI. +The starting point of this crate is the idea that data is stored in memory in a specific arrangement to be interoperable with Arrow's ecosystem. -The most important design decision of this crate is that contiguous regions are shared via an `Arc`. In this context, the operation of slicing a memory region is `O(1)` because it corresponds to changing an offset and length. The tradeoff is that once under an `Arc`, memory regions are immutable. +The most important design aspect of this crate is that contiguous regions are shared via an +`Arc`. In this context, the operation of slicing a memory region is `O(1)` because it +corresponds to changing an offset and length. The tradeoff is that once under +an `Arc`, memory regions are immutable. -The second important aspect is that Arrow has two main types of data buffers: bitmaps, whose offsets are measured in bits, and byte types (such as `i32`), whose offsets are measured in bytes. With this in mind, this crate has 2 main types of containers of contiguous memory regions: +The second most important aspect is that Arrow has two main types of data buffers: bitmaps, +whose offsets are measured in bits, and byte types (such as `i32`), whose offsets are +measured in bytes. With this in mind, this crate has 2 main types of containers of +contiguous memory regions: * `Buffer`: handle contiguous memory regions of type T whose offsets are measured in items * `Bitmap`: handle contiguous memory regions of bits whose offsets are measured in bits These hold _all_ data-related memory in this crate. -Due to their intrinsic immutability, each container has a corresponding mutable (and non-shareable) variant: +Due to their intrinsic immutability, each container has a corresponding mutable +(and non-shareable) variant: * `MutableBuffer` * `MutableBitmap` @@ -44,7 +51,8 @@ assert_eq!(x.as_slice(), &[0, 5, 2, 10]) ``` The following demonstrates how to efficiently -perform an operation from an iterator of [TrustedLen](https://doc.rust-lang.org/std/iter/trait.TrustedLen.html): +perform an operation from an iterator of +[TrustedLen](https://doc.rust-lang.org/std/iter/trait.TrustedLen.html): ```rust # use arrow2::buffer::MutableBuffer; @@ -65,6 +73,7 @@ the following physical types: * `u8-u64` * `f32` and `f64` * `arrow2::types::days_ms` +* `arrow2::types::months_days_ns` This is because the arrow specification only supports the above Rust types; all other complex types supported by arrow are built on top of these types, which enables Arrow to be a highly diff --git a/guide/src/metadata.md b/guide/src/metadata.md index 9ced2cdfeb1..7a78d82edac 100644 --- a/guide/src/metadata.md +++ b/guide/src/metadata.md @@ -13,7 +13,7 @@ In Arrow2, logical types are declared as variants of the `enum` `arrow2::datatyp For example, `DataType::Int32` represents a signed integer of 32 bits. Each `DataType` has an associated `enum PhysicalType` (many-to-one) representing the -particular in-memory representation, and is associated to specific semantics. +particular in-memory representation, and is associated to a specific semantics. For example, both `DataType::Date32` and `DataType::Int32` have the same `PhysicalType` (`PhysicalType::Primitive(PrimitiveType::Int32)`) but `Date32` represents the number of days since UNIX epoch. @@ -23,10 +23,9 @@ Logical types are metadata: they annotate physical types with extra information ## `Field` (column metadata) Besides logical types, the arrow format supports other relevant metadata to the format. -All this information is stored in `arrow2::datatypes::Field`. - -A `Field` is arrow's metadata associated to a column in the context of a columnar format. -It has a name, a logical type `DataType`, whether the column is nullable, etc. +An important one is `Field` broadly corresponding to a column in traditional columnar formats. +A `Field` is composed by a name (`String`), a logical type (`DataType`), whether it is +nullable (`bool`), and optional metadata. ## `Schema` (table metadata) diff --git a/src/compute/merge_sort/mod.rs b/src/compute/merge_sort/mod.rs index c619229a005..238e06d7447 100644 --- a/src/compute/merge_sort/mod.rs +++ b/src/compute/merge_sort/mod.rs @@ -1,4 +1,4 @@ -//! This module exposes functions to perform merge-sorts. +//! Functions to perform merge-sorts. //! //! The goal of merge-sort is to merge two sorted arrays, `[a0, a1]`, `merge_sort(a0, a1)`, //! so that the resulting array is sorted, i.e. the following invariant upholds: diff --git a/src/compute/partition.rs b/src/compute/partition.rs index c1cd4fb6c7d..15d46fd5de5 100644 --- a/src/compute/partition.rs +++ b/src/compute/partition.rs @@ -15,7 +15,7 @@ // specific language governing permissions and limitations // under the License. -//! Defines partition kernel for `ArrayRef` +//! Defines partition kernel for [`crate::array::Array`] use crate::compute::sort::{build_compare, Compare, SortColumn}; use crate::error::{ArrowError, Result}; diff --git a/src/doc/lib.md b/src/doc/lib.md index aa905384731..d37adce76ca 100644 --- a/src/doc/lib.md +++ b/src/doc/lib.md @@ -1,6 +1,6 @@ Welcome to arrow2's documentation. Thanks for checking it out! -This is a library for efficient in-memory data operations using +This is a library for efficient in-memory data operations with [Arrow in-memory format](https://arrow.apache.org/docs/format/Columnar.html). It is a re-write from the bottom up of the official `arrow` crate with soundness and type safety in mind. @@ -66,3 +66,25 @@ fn main() -> Result<()> { Ok(()) } ``` + +## Cargo features + +This crate has a significant number of cargo features to reduce compilation +time and number of dependencies. The feature `"full"` activates most +functionality, such as: + +* `io_ipc`: to interact with the Arrow IPC format +* `io_ipc_compression`: to read and write compressed Arrow IPC (v2) +* `io_csv` to read and write CSV +* `io_json` to read and write JSON +* `io_parquet` to read and write parquet +* `io_parquet_compression` to read and write compressed parquet +* `io_print` to write batches to formatted ASCII tables +* `compute` to operate on arrays (addition, sum, sort, etc.) + +The feature `simd` (not part of `full`) produces more explicit SIMD instructions +via [`packed_simd`](https://github.com/rust-lang/packed_simd), but requires the +nightly channel. + +The feature `cache_aligned` uses a custom allocator instead of `Vec`, which may be +more performant but is not interoperable with `Vec`.