diff --git a/guide/src/README.md b/guide/src/README.md
index 3029531bd31..3042526a2e8 100644
--- a/guide/src/README.md
+++ b/guide/src/README.md
@@ -10,22 +10,3 @@ Arrow2 is divided into three main parts:
 * a [low-level API](./low_level.md) to efficiently operate with contiguous memory regions;
 * a [high-level API](./high_level.md) to operate with arrow arrays;
 * a [metadata API](./metadata.md) to declare and operate with logical types and metadata.
-
-## Cargo features
-
-This crate has a significant number of cargo features to reduce compilation
-time and number of dependencies. The feature `"full"` activates most
-functionality, such as:
-
-* `io_ipc`: to interact with the IPC format
-* `io_ipc_compression`: to read and write compressed IPC (v2)
-* `io_csv` to read and write CSV
-* `io_json` to read and write JSON
-* `io_parquet` to read and write parquet
-* `io_parquet_compression` to read and write compressed parquet
-* `io_print` to write batches to formatted ASCII tables
-* `compute` to operate on arrays (addition, sum, sort, etc.)
-
-The feature `simd` (not part of `full`) produces more explicit SIMD instructions
-via [`packed_simd`](https://github.com/rust-lang/packed_simd), but requires the
-nightly channel.
diff --git a/guide/src/arrow.md b/guide/src/arrow.md
index fa99a5b38d1..d59166205fb 100644
--- a/guide/src/arrow.md
+++ b/guide/src/arrow.md
@@ -1,6 +1,6 @@
 # Introduction
 
-Welcome to the Arrow2 guide for the Rust programming language.  This guide was
+Welcome to the Arrow2 guide for the Rust programming language. This guide was
 created to help you become familiar with the Arrow2 crate and its
 functionalities.
diff --git a/guide/src/compute.md b/guide/src/compute.md
index 68948149de0..21944d988a7 100644
--- a/guide/src/compute.md
+++ b/guide/src/compute.md
@@ -1,12 +1,18 @@
 # Compute API
 
-When compiled with the feature `compute`, this crate offers a wide range of functions to perform both vertical (e.g. add two arrays) and horizontal (compute the sum of an array) operations.
+When compiled with the feature `compute`, this crate offers a wide range of functions
+to perform both vertical (e.g. add two arrays) and horizontal
+(compute the sum of an array) operations.
 
-```rust
-{{#include ../../examples/arithmetics.rs}}
-```
+The overall design of the `compute` module is that it offers two APIs:
+
+* statically typed, such as `sum_primitive(&PrimitiveArray<T>) -> Option<T>`
+* dynamically typed, such as `sum(&dyn Array) -> Box<dyn Scalar>`
+
+The dynamically typed API usually has a function `can_*(&DataType) -> bool` denoting whether
+the operation is defined for the particular logical type (see the sketch at the end of this section).
 
-An overview of the implemented functionality.
+Overview of the implemented functionality:
 
 * arithmetics, checked, saturating, etc.
 * `sum`, `min` and `max`
@@ -17,7 +23,13 @@ An overview of the implemented functionality.
 * `sort`, `hash`, `merge-sort`
 * `if-then-else`
 * `nullif`
-* `lenght` (of string)
-* `hour`, `year` (of temporal logical types)
+* `length` (of string)
+* `hour`, `year`, `month`, `iso_week` (of temporal logical types)
 * `regex`
 * (list) `contains`
+
+and an example of how to use them:
+
+```rust
+{{#include ../../examples/arithmetics.rs}}
+```
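+
+In addition to the example above, here is a hedged sketch of the two APIs for `sum`.
+It assumes these functions live in `arrow2::compute::aggregate` and that the dynamically
+typed `sum` returns a `Result`; treat the exact paths and signatures as illustrative,
+not normative:
+
+```rust
+use arrow2::array::{Array, PrimitiveArray};
+use arrow2::compute::aggregate::{sum, sum_primitive};
+use arrow2::datatypes::DataType;
+use arrow2::error::Result;
+
+fn main() -> Result<()> {
+    let array = PrimitiveArray::<i32>::from(&[Some(1), None, Some(3)]);
+
+    // statically typed: concrete array in, concrete value out
+    assert_eq!(sum_primitive(&array), Some(4));
+
+    // dynamically typed: `&dyn Array` in, boxed `Scalar` out
+    let array: &dyn Array = &array;
+    let scalar = sum(array)?;
+    assert_eq!(scalar.data_type(), &DataType::Int32);
+    Ok(())
+}
+```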
diff --git a/guide/src/extension.md b/guide/src/extension.md
index a514701036e..ee60e6ec1c4 100644
--- a/guide/src/extension.md
+++ b/guide/src/extension.md
@@ -1,7 +1,12 @@
 # Extension types
 
 This crate supports Arrow's ["extension type"](https://arrow.apache.org/docs/format/Columnar.html#extension-types), to declare, use, and share custom logical types.
-The follow example shows how to declare one:
+
+An extension type is just a `DataType` with a name and some metadata.
+In particular, its physical representation is equal to its inner `DataType`, which implies
+that all functionality in this crate works as if it were the inner `DataType`.
+
+The following example shows how to declare one:
 
 ```rust
 {{#include ../../examples/extension.rs}}
diff --git a/guide/src/ffi.md b/guide/src/ffi.md
index 40d01984c38..4b8267266a2 100644
--- a/guide/src/ffi.md
+++ b/guide/src/ffi.md
@@ -5,8 +5,8 @@ has a specification, which allows languages to share data structures via
 foreign interfaces at zero cost (i.e. via pointers). This is known as the
 [C Data interface](https://arrow.apache.org/docs/format/CDataInterface.html).
 
-This crate supports importing from and exporting to all `DataType`s. Follow the
-example below to learn how to import and export:
+This crate supports importing from and exporting to all its physical types. The
+example below demonstrates how to use the API:
 
 ```rust
 {{#include ../../examples/ffi.rs}}
diff --git a/guide/src/high_level.md b/guide/src/high_level.md
index 3bac96c4c43..c5887e7c02d 100644
--- a/guide/src/high_level.md
+++ b/guide/src/high_level.md
@@ -1,10 +1,12 @@
 # High-level API
 
-The simplest way to think about an arrow `Array` is that it represents
-`Vec<Option<T>>` and has a logical type (see [metadata](../metadata.md))) associated with it.
+Arrow's core trait is the `Array`, which you can think of as representing `Arc<Vec<Option<T>>>`
+with associated metadata (see [metadata](../metadata.md)).
+Contrary to `Arc<Vec<Option<T>>>`, arrays in this crate are represented in such a way
+that they can be zero-copied to any other Arrow implementation via foreign interfaces (FFI).
 
-Probably the simplest array in this crate is `PrimitiveArray`. It can be constructed
-from a slice as follows:
+Probably the simplest `Array` in this crate is the `PrimitiveArray`. It can be
+constructed from a slice of option values:
 
 ```rust
 # use arrow2::array::{Array, PrimitiveArray};
@@ -72,7 +74,7 @@ let array = PrimitiveArray::<i32>::from_slice([1, 0, 123]);
 assert_eq!(array.data_type(), &DataType::Int32);
 # }
 ```
-they can be cheaply converted to via `.to(DataType)`.
+they can be cheaply (`O(1)`) converted to another logical type via `.to(DataType)`.
 
 The following arrays are supported:
 
@@ -88,11 +90,10 @@ The following arrays are supported:
 * `UnionArray` (every row has a different logical type)
 * `DictionaryArray` (nested array with encoded values)
 
-## Dynamic Array
+## Array as a trait object
 
-There is a more powerful aspect of arrow arrays, and that is that they all
-implement the trait `Array` and can be cast to `&dyn Array`, i.e. they can be turned into
-a trait object. This enables arrays to have types that are dynamic in nature.
+`Array` is object safe: all implementations of `Array` can be cast
+to `&dyn Array`, which enables run-time nesting.
 
 ```rust
 # use arrow2::array::{Array, PrimitiveArray};
@@ -106,7 +107,7 @@ let a: &dyn Array = &a;
 
 Given a trait object `array: &dyn Array`, we know its physical type via
 `PhysicalType: array.data_type().to_physical_type()`, which we use to downcast the array
-to its concrete type:
+to its concrete physical type:
 
 ```rust
 # use arrow2::array::{Array, PrimitiveArray};
@@ -135,6 +136,7 @@ an each implementation of `Array` (a struct):
 | `FixedSizeList` | `FixedSizeListArray` |
 | `Struct` | `StructArray` |
 | `Union` | `UnionArray` |
+| `Map` | `MapArray` |
 | `Dictionary(_)` | `DictionaryArray<_>` |
 
 where `_` represents each of the variants (e.g. `PrimitiveType::Int32 <-> i32`).
@@ -174,13 +176,18 @@ and how to make them even more efficient.
 
 This crate's APIs are generally split into two patterns: whether an operation leverages
 contiguous memory regions or whether it does not.
 
-If yes, then use:
+What this means is that certain operations can be performed irrespective of whether a value
+is "null" or not (e.g. `PrimitiveArray<i32> + i32` can be applied to _all_ values via SIMD and
+only copy the validity bitmap independently).
+
+When an operation benefits from such an arrangement, it is advantageous to use:
 
 * `Buffer::from_iter`
 * `Buffer::from_trusted_len_iter`
 * `Buffer::try_from_trusted_len_iter`
 
-If not, then use the builder API, such as `MutablePrimitiveArray`, `MutableUtf8Array` or `MutableListArray`.
+If not, then use the `MutableArray` API, such as
+`MutablePrimitiveArray`, `MutableUtf8Array` or `MutableListArray`.
 
 We have seen examples where the latter API was used. In the last example of this page
 you will be introduced to an example of using the former for SIMD.
@@ -209,14 +216,23 @@ Like `FromIterator`, this crate contains two sets of APIs to iterate over data.
 an array `array: &PrimitiveArray<T>`, the following applies:
 
 1. If you need to iterate over `Option<&T>`, use `array.iter()`
-2. If you can operate over the values and validity independently, use `array.values() -> &Buffer<T>` and `array.validity() -> Option<&Bitmap>`
+2. If you can operate over the values and validity independently,
+   use `array.values() -> &Buffer<T>` and `array.validity() -> Option<&Bitmap>`
 
-Note that case 1 is useful when e.g. you want to perform an operation that depends on both validity and values, while the latter is suitable for SIMD and copies, as they return contiguous memory regions (buffers and bitmaps). We will see below how to leverage these APIs.
+Note that case 1 is useful when e.g. you want to perform an operation that depends on both
+validity and values, while the latter is suitable for SIMD and copies, as they return
+contiguous memory regions (buffers and bitmaps). We will see below how to leverage these APIs.
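+
+For example, a hedged sketch of the two cases (it assumes `PrimitiveArray::<i32>::from`
+accepts a slice of options, consistent with the examples above):
+
+```rust
+use arrow2::array::PrimitiveArray;
+
+let array = PrimitiveArray::<i32>::from(&[Some(1), None, Some(3)]);
+
+// case 1: validity-aware iteration over `Option<&i32>`
+let doubled: Vec<Option<i32>> = array.iter().map(|x| x.map(|x| *x * 2)).collect();
+assert_eq!(doubled, vec![Some(2), None, Some(6)]);
+
+// case 2: iterate over _all_ values, ignoring validity;
+// the value under a null slot is whatever is physically stored there
+let all_doubled: Vec<i32> = array.values().iter().map(|x| *x * 2).collect();
+assert_eq!(all_doubled.len(), 3);
+```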
-This idea holds more generally in this crate's arrays: `values()` returns something that has a contiguous in-memory representation, while `iter()` returns items taking validity into account. To get an iterator over contiguous values, use `array.values().iter()`.
+This idea holds more generally in this crate's arrays: `values()` returns something that has
+a contiguous in-memory representation, while `iter()` returns items taking validity into account.
+To get an iterator over contiguous values, use `array.values().iter()`.
 
 There is one last API that is worth mentioning, and that is `Bitmap::chunks`. When performing
-bitwise operations, it is often more performant to operate on chunks of bits instead of single bits. `chunks` offers a `TrustedLen` of `u64` with the bits + an extra `u64` remainder. We expose two functions, `unary(Bitmap, Fn) -> Bitmap` and `binary(Bitmap, Bitmap, Fn) -> Bitmap` that use this API to efficiently perform bitmap operations.
+bitwise operations, it is often more performant to operate on chunks of bits
+instead of single bits. `chunks` offers a `TrustedLen` of `u64` with the bits
++ an extra `u64` remainder. We expose two functions, `unary(Bitmap, Fn) -> Bitmap`
+and `binary(Bitmap, Bitmap, Fn) -> Bitmap` that use this API to efficiently
+perform bitmap operations.
 
 ## Vectorized operations
 
@@ -238,18 +254,23 @@ where
     O: NativeType,
     F: Fn(I) -> O,
 {
+    // create the iterator over _all_ values
     let values = array.values().iter().map(|v| op(*v));
     let values = Buffer::from_trusted_len_iter(values);
 
+    // create the new array, cloning its validity
    PrimitiveArray::<O>::from_data(data_type.clone(), values, array.validity().cloned())
 }
 ```
 
 Some notes:
 
-1. We used `array.values()`, as described above: this operation leverages a contiguous memory region.
+1. We used `array.values()`, as described above: this operation leverages a
+   contiguous memory region.
 
 2. We leveraged normal rust iterators for the operation.
 
 3. We used `op` on the array's values irrespectively of their validity,
-and cloned its validity. This approach is suitable for operations whose branching off is more expensive than operating over all values. If the operation is expensive, then using `PrimitiveArray::<O>::from_trusted_len_iter` is likely faster.
+   and cloned its validity. This approach is suitable for operations whose branching off
+   is more expensive than operating over all values. If the operation is expensive,
+   then using `PrimitiveArray::<O>::from_trusted_len_iter` is likely faster.
diff --git a/guide/src/low_level.md b/guide/src/low_level.md
index e4c69069fe3..9f1bc332e1c 100644
--- a/guide/src/low_level.md
+++ b/guide/src/low_level.md
@@ -1,17 +1,24 @@
 # Low-level API
 
-The starting point of this crate is the idea that data must be stored in memory in a specific arrangement to be interoperable with Arrow's ecosystem. With this in mind, this crate does not use `Vec` but instead has its own containers to store data, including sharing and consuming it via FFI.
+The starting point of this crate is the idea that data is stored in memory in a specific arrangement to be interoperable with Arrow's ecosystem.
 
-The most important design decision of this crate is that contiguous regions are shared via an `Arc`. In this context, the operation of slicing a memory region is `O(1)` because it corresponds to changing an offset and length. The tradeoff is that once under an `Arc`, memory regions are immutable.
+The most important design aspect of this crate is that contiguous regions are shared via an
+`Arc`. In this context, the operation of slicing a memory region is `O(1)` because it
+corresponds to changing an offset and length. The tradeoff is that once under
+an `Arc`, memory regions are immutable.
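+
+For example, a hedged sketch of `O(1)` slicing (it assumes `Buffer<i32>: From<Vec<i32>>`
+and a `slice(offset, length)` method, consistent with the description above):
+
+```rust
+use arrow2::buffer::Buffer;
+
+let buffer: Buffer<i32> = Buffer::from(vec![1, 2, 3, 4, 5]);
+
+// slicing is `O(1)`: it only adjusts an offset and a length;
+// the underlying memory region is shared (no copy)
+let sliced = buffer.slice(1, 3);
+assert_eq!(sliced.as_slice(), &[2, 3, 4]);
+```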
-The second important aspect is that Arrow has two main types of data buffers: bitmaps, whose offsets are measured in bits, and byte types (such as `i32`), whose offsets are measured in bytes. With this in mind, this crate has 2 main types of containers of contiguous memory regions:
+The second most important aspect is that Arrow has two main types of data buffers: bitmaps,
+whose offsets are measured in bits, and byte types (such as `i32`), whose offsets are
+measured in bytes. With this in mind, this crate has 2 main types of containers of
+contiguous memory regions:
 
 * `Buffer<T>`: handle contiguous memory regions of type T whose offsets are measured in items
 * `Bitmap`: handle contiguous memory regions of bits whose offsets are measured in bits
 
 These hold _all_ data-related memory in this crate.
 
-Due to their intrinsic immutability, each container has a corresponding mutable (and non-shareable) variant:
+Due to their intrinsic immutability, each container has a corresponding mutable
+(and non-shareable) variant:
 
 * `MutableBuffer<T>`
 * `MutableBitmap`
@@ -44,7 +51,8 @@ assert_eq!(x.as_slice(), &[0, 5, 2, 10])
 ```
 
 The following demonstrates how to efficiently
-perform an operation from an iterator of [TrustedLen](https://doc.rust-lang.org/std/iter/trait.TrustedLen.html):
+perform an operation from an iterator of
+[TrustedLen](https://doc.rust-lang.org/std/iter/trait.TrustedLen.html):
 
 ```rust
 # use arrow2::buffer::MutableBuffer;
@@ -65,6 +73,7 @@ the following physical types:
 * `u8-u64`
 * `f32` and `f64`
 * `arrow2::types::days_ms`
+* `arrow2::types::months_days_ns`
 
 This is because the arrow specification only supports the above Rust types; all
 other complex types supported by arrow are built on top of these types, which enables Arrow to be a highly
diff --git a/guide/src/metadata.md b/guide/src/metadata.md
index 9ced2cdfeb1..7a78d82edac 100644
--- a/guide/src/metadata.md
+++ b/guide/src/metadata.md
@@ -13,7 +13,7 @@ In Arrow2, logical types are declared as variants of the `enum` `arrow2::datatyp
 For example, `DataType::Int32` represents a signed integer of 32 bits.
 
 Each `DataType` has an associated `enum PhysicalType` (many-to-one) representing the
-particular in-memory representation, and is associated to specific semantics.
+particular in-memory representation, and is associated with specific semantics.
 For example, both `DataType::Date32` and `DataType::Int32` have the same
 `PhysicalType` (`PhysicalType::Primitive(PrimitiveType::Int32)`) but `Date32`
 represents the number of days since UNIX epoch.
@@ -23,10 +23,9 @@ Logical types are metadata: they annotate physical types with extra information
 ## `Field` (column metadata)
 
 Besides logical types, the arrow format supports other metadata relevant to the format.
-All this information is stored in `arrow2::datatypes::Field`.
-
-A `Field` is arrow's metadata associated to a column in the context of a columnar format.
-It has a name, a logical type `DataType`, whether the column is nullable, etc.
+An important one is `Field`, broadly corresponding to a column in traditional columnar formats.
+A `Field` is composed of a name (`String`), a logical type (`DataType`), whether it is
+nullable (`bool`), and optional metadata; see the example below.
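+
+For example, a short sketch of declaring a `Field` (it assumes the
+`Field::new(name, data_type, is_nullable)` constructor):
+
+```rust
+use arrow2::datatypes::{DataType, Field};
+
+// a nullable column "c1" of 32-bit integers
+let field = Field::new("c1", DataType::Int32, true);
+assert_eq!(field.data_type(), &DataType::Int32);
+```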
 
 ## `Schema` (table metadata)
diff --git a/src/array/fixed_size_binary/mutable.rs b/src/array/fixed_size_binary/mutable.rs
index 7717e594dda..2e5b728035f 100644
--- a/src/array/fixed_size_binary/mutable.rs
+++ b/src/array/fixed_size_binary/mutable.rs
@@ -150,7 +150,7 @@ impl MutableFixedSizeBinaryArray {
         std::slice::from_raw_parts(self.values.as_ptr().add(i * self.size), self.size)
     }
 
-    /// Shrinks the capacity of the [`MutablePrimitive`] to fit its current length.
+    /// Shrinks the capacity of the [`MutableFixedSizeBinaryArray`] to fit its current length.
     pub fn shrink_to_fit(&mut self) {
         self.values.shrink_to_fit();
         if let Some(validity) = &mut self.validity {
diff --git a/src/array/fixed_size_list/mutable.rs b/src/array/fixed_size_list/mutable.rs
index 13924106f8e..cf0850ba92b 100644
--- a/src/array/fixed_size_list/mutable.rs
+++ b/src/array/fixed_size_list/mutable.rs
@@ -74,7 +74,7 @@ impl MutableFixedSizeListArray {
             None => self.init_validity(),
         }
     }
-    /// Shrinks the capacity of the [`MutableFixedSizeList`] to fit its current length.
+    /// Shrinks the capacity of the [`MutableFixedSizeListArray`] to fit its current length.
     pub fn shrink_to_fit(&mut self) {
         self.values.shrink_to_fit();
         if let Some(validity) = &mut self.validity {
diff --git a/src/array/growable/fixed_size_list.rs b/src/array/growable/fixed_size_list.rs
index cabd0b75928..09b72a88dc5 100644
--- a/src/array/growable/fixed_size_list.rs
+++ b/src/array/growable/fixed_size_list.rs
@@ -22,7 +22,7 @@ pub struct GrowableFixedSizeList<'a> {
 }
 
 impl<'a> GrowableFixedSizeList<'a> {
-    /// Creates a new [`GrowableList`] bound to `arrays` with a pre-allocated `capacity`.
+    /// Creates a new [`GrowableFixedSizeList`] bound to `arrays` with a pre-allocated `capacity`.
     /// # Panics
     /// If `arrays` is empty.
     pub fn new(
diff --git a/src/array/list/mutable.rs b/src/array/list/mutable.rs
index 1892dc50955..fab320655c0 100644
--- a/src/array/list/mutable.rs
+++ b/src/array/list/mutable.rs
@@ -43,7 +43,7 @@ impl MutableListArray {
         }
     }
 
-    /// Shrinks the capacity of the [`MutableList`] to fit its current length.
+    /// Shrinks the capacity of the [`MutableListArray`] to fit its current length.
     pub fn shrink_to_fit(&mut self) {
         self.values.shrink_to_fit();
         self.offsets.shrink_to_fit();
diff --git a/src/array/primitive/mutable.rs b/src/array/primitive/mutable.rs
index e8d764fddc3..7c4f7d7c518 100644
--- a/src/array/primitive/mutable.rs
+++ b/src/array/primitive/mutable.rs
@@ -238,7 +238,7 @@ impl MutablePrimitiveArray {
         Arc::new(a)
     }
 
-    /// Shrinks the capacity of the [`MutablePrimitive`] to fit its current length.
+    /// Shrinks the capacity of the [`MutablePrimitiveArray`] to fit its current length.
     pub fn shrink_to_fit(&mut self) {
         self.values.shrink_to_fit();
         if let Some(validity) = &mut self.validity {
diff --git a/src/array/utf8/mutable.rs b/src/array/utf8/mutable.rs
index 25d4521d883..bf7c7f9d68a 100644
--- a/src/array/utf8/mutable.rs
+++ b/src/array/utf8/mutable.rs
@@ -165,7 +165,7 @@ impl MutableUtf8Array {
         Arc::new(a)
     }
 
-    /// Shrinks the capacity of the [`MutableUtf8`] to fit its current length.
+    /// Shrinks the capacity of the [`MutableUtf8Array`] to fit its current length.
     pub fn shrink_to_fit(&mut self) {
         self.values.shrink_to_fit();
         self.offsets.shrink_to_fit();
diff --git a/src/compute/merge_sort/mod.rs b/src/compute/merge_sort/mod.rs
index c619229a005..238e06d7447 100644
--- a/src/compute/merge_sort/mod.rs
+++ b/src/compute/merge_sort/mod.rs
@@ -1,4 +1,4 @@
-//! This module exposes functions to perform merge-sorts.
+//! Functions to perform merge-sorts.
 //!
 //! The goal of merge-sort is to merge two sorted arrays, `[a0, a1]`, `merge_sort(a0, a1)`,
 //! so that the resulting array is sorted, i.e. the following invariant holds:
diff --git a/src/compute/partition.rs b/src/compute/partition.rs
index c1cd4fb6c7d..15d46fd5de5 100644
--- a/src/compute/partition.rs
+++ b/src/compute/partition.rs
@@ -15,7 +15,7 @@
 // specific language governing permissions and limitations
 // under the License.
 
-//! Defines partition kernel for `ArrayRef`
+//! Defines partition kernel for [`crate::array::Array`]
 
 use crate::compute::sort::{build_compare, Compare, SortColumn};
 use crate::error::{ArrowError, Result};
diff --git a/src/doc/lib.md b/src/doc/lib.md
index aa905384731..d37adce76ca 100644
--- a/src/doc/lib.md
+++ b/src/doc/lib.md
@@ -1,6 +1,6 @@
 Welcome to arrow2's documentation. Thanks for checking it out!
 
-This is a library for efficient in-memory data operations using
+This is a library for efficient in-memory data operations with the
 [Arrow in-memory format](https://arrow.apache.org/docs/format/Columnar.html).
 It is a re-write from the bottom up of the official `arrow` crate with soundness
 and type safety in mind.
@@ -66,3 +66,25 @@ fn main() -> Result<()> {
     Ok(())
 }
 ```
+
+## Cargo features
+
+This crate has a significant number of cargo features to reduce compilation
+time and number of dependencies. The feature `"full"` activates most
+functionality, such as:
+
+* `io_ipc`: to interact with the Arrow IPC format
+* `io_ipc_compression`: to read and write compressed Arrow IPC (v2)
+* `io_csv`: to read and write CSV
+* `io_json`: to read and write JSON
+* `io_parquet`: to read and write parquet
+* `io_parquet_compression`: to read and write compressed parquet
+* `io_print`: to write batches to formatted ASCII tables
+* `compute`: to operate on arrays (addition, sum, sort, etc.)
+
+The feature `simd` (not part of `full`) produces more explicit SIMD instructions
+via [`packed_simd`](https://github.com/rust-lang/packed_simd), but requires the
+nightly channel.
+
+The feature `cache_aligned` uses a custom allocator instead of `Vec`, which may be
+more performant but is not interoperable with `Vec`.
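+
+For example, a hedged `Cargo.toml` sketch enabling only the IPC and compute
+functionality (the version below is illustrative only):
+
+```toml
+[dependencies]
+arrow2 = { version = "0.5", default-features = false, features = ["io_ipc", "compute"] }
+```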
diff --git a/src/io/README.md b/src/io/README.md
index 5c97d1a6ee1..26be0b337a6 100644
--- a/src/io/README.md
+++ b/src/io/README.md
@@ -6,17 +6,19 @@ This document describes the overall design of this module.
 
 * Each directory in this module corresponds to a specific format such as `csv` and `json`.
 * directories that depend on external dependencies MUST be feature gated, with a feature named with a prefix `io_`.
-* modules MUST re-export any API of external dependencies they require as part of their public API. E.g.
+* modules MUST re-export any API of external dependencies they require as part of their public API.
+  E.g.
   * if a module has an API `write(writer: &mut csv::Writer, ...)`, it MUST contain `pub use csv::Writer;`.
 
 The rationale is that adding this crate to `cargo.toml` must be sufficient to use it.
 
-* Each directory SHOULD contain two directories, `read` and `write`, corresponding to functionality about
-reading from the format and writing to the format respectively.
+* Each directory SHOULD contain two directories, `read` and `write`, corresponding
+  to functionality about reading from the format and writing to the format respectively.
* The base module SHOULD contain `pub use read;` and `pub use write;`.
 * Implementations SHOULD separate reading of "data" from reading of "metadata". Examples:
   * schema read or inference SHOULD be a separate function
   * functions that read "data" SHOULD consume a schema typically pre-read.
-* Implementations SHOULD separate IO-bounded operations from CPU-bounded operations. I.e. implementations SHOULD:
+* Implementations SHOULD separate IO-bounded operations from CPU-bounded operations.
+  I.e. implementations SHOULD (see the sketch below):
   * contain functions that consume a `Read` implementor and output a "raw" struct, i.e. a struct that is e.g. compressed and serialized
   * contain functions that consume a "raw" struct and convert it into Arrow.
   * offer each of these functions as independent public APIs, so that consumers can decide how to balance CPU-bounds and IO-bounds.
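+
+A minimal sketch of this separation (all names and types below are illustrative,
+not this crate's actual API):
+
+```rust
+use std::io::Read;
+
+use arrow2::array::PrimitiveArray;
+
+/// A "raw" struct: bytes as read from IO, still serialized (and possibly compressed).
+pub struct RawChunk(Vec<u8>);
+
+/// IO-bounded: only reads bytes; no CPU-intensive work happens here.
+pub fn read_chunk<R: Read>(reader: &mut R, len: usize) -> std::io::Result<RawChunk> {
+    let mut buf = vec![0u8; len];
+    reader.read_exact(&mut buf)?;
+    Ok(RawChunk(buf))
+}
+
+/// CPU-bounded: deserializes the raw bytes into an Arrow array.
+pub fn deserialize_chunk(chunk: &RawChunk) -> PrimitiveArray<u8> {
+    PrimitiveArray::from_slice(&chunk.0)
+}
+```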
diff --git a/src/io/csv/read/deserialize.rs b/src/io/csv/read/deserialize.rs
index 14172e09a6b..7a6bcaf2bb9 100644
--- a/src/io/csv/read/deserialize.rs
+++ b/src/io/csv/read/deserialize.rs
@@ -198,7 +198,7 @@ pub fn deserialize_column(
 
 /// Deserializes rows [`ByteRecord`] into a [`RecordBatch`].
 /// Note that this is a convenience function: column deserialization
-///is trivially parallelizable (e.g. rayon).
+/// is trivially parallelizable (e.g. rayon).
 pub fn deserialize_batch(
     rows: &[ByteRecord],
     fields: &[Field],
diff --git a/src/io/csv/read/infer_schema.rs b/src/io/csv/read/infer_schema.rs
index 7618ab7cb27..1b8fd060d0b 100644
--- a/src/io/csv/read/infer_schema.rs
+++ b/src/io/csv/read/infer_schema.rs
@@ -50,6 +50,16 @@ fn is_datetime(string: &str) -> Option {
 }
 
 /// Infers [`DataType`] from `bytes`
+/// # Implementation
+/// * case-insensitive "true" or "false" are mapped to [`DataType::Boolean`]
+/// * parsable to integer is mapped to [`DataType::Int64`]
+/// * parsable to float is mapped to [`DataType::Float64`]
+/// * parsable to date is mapped to [`DataType::Date32`]
+/// * parsable to time is mapped to [`DataType::Time32(TimeUnit::Millisecond)`]
+/// * parsable to naive datetime is mapped to [`DataType::Timestamp(TimeUnit::Millisecond, None)`]
+/// * parsable to time-aware datetime is mapped to [`DataType::Timestamp`] of milliseconds and parsed offset.
+/// * other utf8 is mapped to [`DataType::Utf8`]
+/// * invalid utf8 is mapped to [`DataType::Binary`]
 pub fn infer(bytes: &[u8]) -> DataType {
     if is_boolean(bytes) {
         DataType::Boolean
@@ -75,9 +85,8 @@ pub fn infer(bytes: &[u8]) -> DataType {
     }
 }
 
-/// Infer the schema of a CSV file by reading through the first n records up to `max_rows`.
-///
-/// Return infered schema and number of records used for inference.
+/// Infers a [`Schema`] of a CSV file by reading through the first n records up to `max_rows`.
+/// Seeks back to the beginning of the file _after_ the header.
 pub fn infer_schema<R: Read + Seek, F: Fn(&[u8]) -> DataType>(
     reader: &mut Reader<R>,
     max_rows: Option<usize>,
diff --git a/src/io/csv/read/reader.rs b/src/io/csv/read/reader.rs
index 14ddd8d07a1..e2b95a8e405 100644
--- a/src/io/csv/read/reader.rs
+++ b/src/io/csv/read/reader.rs
@@ -20,9 +20,9 @@ pub fn projected_schema(schema: &Schema, projection: Option<&[usize]>) -> Schema
     }
 }
 
-/// Reads `len` rows from the CSV into Bytes, skiping `skip`
+/// Reads `len` rows from `reader` into `rows`, skipping `skip`.
 /// This operation has minimal CPU work and is thus the fastest way to read through a CSV
-/// without deserializing the contents to arrow.
+/// without deserializing the contents to Arrow.
 pub fn read_rows<R: Read>(
     reader: &mut Reader<R>,
     skip: usize,
diff --git a/src/record_batch.rs b/src/record_batch.rs
index b7a53f4e1f3..02182af41df 100644
--- a/src/record_batch.rs
+++ b/src/record_batch.rs
@@ -51,7 +51,7 @@ impl RecordBatch {
     /// Creates a [`RecordBatch`] from a schema and columns, with additional options,
     /// such as whether to strictly validate field names.
     ///
-    /// See [`fn@try_new`] for the expected conditions.
+    /// See [`Self::try_new()`] for the expected conditions.
     pub fn try_new_with_options(
         schema: Arc<Schema>,
         columns: Vec<Arc<dyn Array>>,
diff --git a/src/scalar/binary.rs b/src/scalar/binary.rs
index f847e8f0c88..c5e1a40b978 100644
--- a/src/scalar/binary.rs
+++ b/src/scalar/binary.rs
@@ -2,6 +2,7 @@ use crate::{array::*, buffer::Buffer, datatypes::DataType};
 
 use super::Scalar;
 
+/// The [`Scalar`] implementation of binary (`Vec<u8>`).
 #[derive(Debug, Clone)]
 pub struct BinaryScalar<O: Offset> {
     value: Buffer<u8>,
@@ -16,6 +17,7 @@ impl PartialEq for BinaryScalar {
 }
 
 impl<O: Offset> BinaryScalar<O> {
+    /// Returns a new [`BinaryScalar`].
     #[inline]
     pub fn new<P: Into<Buffer<u8>>>(v: Option<P>) -> Self {
         let is_valid = v.is_some();
@@ -28,6 +30,7 @@ impl BinaryScalar {
         }
     }
 
+    /// Its value, irrespective of the validity.
     #[inline]
     pub fn value(&self) -> &[u8] {
         self.value.as_slice()
diff --git a/src/scalar/boolean.rs b/src/scalar/boolean.rs
index 7f0ffdaa0de..74df27534f3 100644
--- a/src/scalar/boolean.rs
+++ b/src/scalar/boolean.rs
@@ -2,6 +2,7 @@ use crate::datatypes::DataType;
 
 use super::Scalar;
 
+/// The [`Scalar`] implementation of a boolean.
 #[derive(Debug, Clone)]
 pub struct BooleanScalar {
     value: bool,
@@ -15,6 +16,7 @@ impl PartialEq for BooleanScalar {
 }
 
 impl BooleanScalar {
+    /// Returns a new [`BooleanScalar`]
     #[inline]
     pub fn new(v: Option<bool>) -> Self {
         let is_valid = v.is_some();
@@ -24,6 +26,7 @@ impl BooleanScalar {
         }
     }
 
+    /// The value, irrespective of the validity
     #[inline]
     pub fn value(&self) -> bool {
         self.value
diff --git a/src/scalar/list.rs b/src/scalar/list.rs
index 4623711a5b2..c8c505ef503 100644
--- a/src/scalar/list.rs
+++ b/src/scalar/list.rs
@@ -24,6 +24,7 @@ impl PartialEq for ListScalar {
 }
 
 impl<O: Offset> ListScalar<O> {
+    /// Returns a new [`ListScalar`]
     /// # Panics
     /// iff
     /// * the `data_type` is not `List` or `LargeList` (depending on this scalar's offset `O`)
@@ -46,6 +47,7 @@ impl ListScalar {
         }
     }
 
+    /// The values of the [`ListScalar`]
     pub fn values(&self) -> &Arc<dyn Array> {
         &self.values
     }
diff --git a/src/scalar/mod.rs b/src/scalar/mod.rs
index 7d1e9ae9b54..e0909438979 100644
--- a/src/scalar/mod.rs
+++ b/src/scalar/mod.rs
@@ -1,3 +1,4 @@
+#![warn(missing_docs)]
 //! contains the [`Scalar`] trait object representing individual items of [`Array`](crate::array::Array)s,
 //! as well as concrete implementations such as [`BooleanScalar`].
 use std::any::Any;
@@ -20,12 +21,16 @@ pub use null::*;
 mod struct_;
 pub use struct_::*;
 
-/// Trait object declaring an optional value with a logical type.
+/// Trait object declaring an optional value with a [`DataType`].
+/// This trait is often used in APIs that accept multiple scalar types.
 pub trait Scalar: std::fmt::Debug {
+    /// converts itself to `&dyn Any`, enabling downcasting to concrete implementations
    fn as_any(&self) -> &dyn Any;
 
+    /// whether it is valid
    fn is_valid(&self) -> bool;
 
+    /// the logical type.
    fn data_type(&self) -> &DataType;
 }
diff --git a/src/scalar/null.rs b/src/scalar/null.rs
index 3751c6cfbd6..8957a150e95 100644
--- a/src/scalar/null.rs
+++ b/src/scalar/null.rs
@@ -2,10 +2,12 @@ use crate::datatypes::DataType;
 
 use super::Scalar;
 
-#[derive(Debug, Clone, PartialEq)]
+/// The representation of a single entry of a [`crate::array::NullArray`].
+#[derive(Debug, Clone, PartialEq, Eq)]
 pub struct NullScalar {}
 
 impl NullScalar {
+    /// Returns a new [`NullScalar`]
     #[inline]
     pub fn new() -> Self {
         Self {}
diff --git a/src/scalar/primitive.rs b/src/scalar/primitive.rs
index 90265bf7cb0..f5796fffbb7 100644
--- a/src/scalar/primitive.rs
+++ b/src/scalar/primitive.rs
@@ -1,10 +1,13 @@
 use crate::{
     datatypes::DataType,
     types::{NativeType, NaturalDataType},
+    error::ArrowError,
 };
 
 use super::Scalar;
 
+/// The implementation of [`Scalar`] for primitive types, semantically equivalent to [`Option<T>`]
+/// with [`DataType`].
 #[derive(Debug, Clone)]
 pub struct PrimitiveScalar<T: NativeType> {
     // Not Option<T> because this offers a stabler pointer offset on the struct
@@ -22,8 +25,17 @@ impl PartialEq for PrimitiveScalar {
 }
 
 impl<T: NativeType> PrimitiveScalar<T> {
+    /// Returns a new [`PrimitiveScalar`].
     #[inline]
     pub fn new(data_type: DataType, v: Option<T>) -> Self {
+        if !T::is_valid(&data_type) {
+            Err(ArrowError::InvalidArgumentError(format!(
+                "Type {} does not support logical type {}",
+                std::any::type_name::<T>(),
+                data_type
+            )))
+            .unwrap()
+        }
         let is_valid = v.is_some();
         Self {
             value: v.unwrap_or_default(),
@@ -32,6 +44,7 @@ impl PrimitiveScalar {
         }
     }
 
+    /// Returns the value, irrespective of its validity.
     #[inline]
     pub fn value(&self) -> T {
         self.value
diff --git a/src/scalar/struct_.rs b/src/scalar/struct_.rs
index eab4671f1dc..b7822b3ae02 100644
--- a/src/scalar/struct_.rs
+++ b/src/scalar/struct_.rs
@@ -4,6 +4,7 @@ use crate::datatypes::DataType;
 
 use super::Scalar;
 
+/// A single entry of a [`crate::array::StructArray`].
 #[derive(Debug, Clone)]
 pub struct StructScalar {
     values: Vec<Arc<dyn Scalar>>,
@@ -20,6 +21,7 @@ impl PartialEq for StructScalar {
 }
 
 impl StructScalar {
+    /// Returns a new [`StructScalar`]
     #[inline]
     pub fn new(data_type: DataType, values: Option<Vec<Arc<dyn Scalar>>>) -> Self {
         let is_valid = values.is_some();
@@ -30,6 +32,7 @@ impl StructScalar {
         }
     }
 
+    /// Returns the values, irrespective of the validity.
     #[inline]
     pub fn values(&self) -> &[Arc<dyn Scalar>] {
         &self.values
diff --git a/src/scalar/utf8.rs b/src/scalar/utf8.rs
index 207fe084f0c..60417cb7813 100644
--- a/src/scalar/utf8.rs
+++ b/src/scalar/utf8.rs
@@ -2,9 +2,10 @@ use crate::{array::*, buffer::Buffer, datatypes::DataType};
 
 use super::Scalar;
 
+/// The implementation of [`Scalar`] for utf8, semantically equivalent to [`Option<&str>`].
 #[derive(Debug, Clone)]
 pub struct Utf8Scalar<O: Offset> {
-    value: Buffer<u8>,
+    value: Buffer<u8>, // safety: valid utf8
     is_valid: bool,
     phantom: std::marker::PhantomData<O>,
 }
@@ -16,6 +17,7 @@ impl PartialEq for Utf8Scalar {
 }
 
 impl<O: Offset> Utf8Scalar<O> {
+    /// Returns a new [`Utf8Scalar`]
     #[inline]
     pub fn new<P: Into<Buffer<u8>>>(v: Option<P>) -> Self {
         let is_valid = v.is_some();
@@ -28,8 +30,10 @@ impl Utf8Scalar {
         }
     }
 
+    /// Returns the value, irrespective of the validity.
     #[inline]
     pub fn value(&self) -> &str {
+        // Safety: invariant of the struct
         unsafe { std::str::from_utf8_unchecked(self.value.as_slice()) }
     }
 }