Merge branch 'master' into comfy-table
Chojan Shang authored Aug 10, 2021
2 parents db770af + fa5acd9 commit 8b81a33
Showing 10 changed files with 295 additions and 36 deletions.
4 changes: 3 additions & 1 deletion .github/workflows/rust.yml
@@ -327,12 +327,14 @@ jobs:
rustup override set ${{ matrix.rust }}
rustup component add rustfmt
rustup target add wasm32-unknown-unknown
rustup target add wasm32-wasi
- name: Build arrow crate
run: |
export CARGO_HOME="/github/home/.cargo"
export CARGO_TARGET_DIR="/github/home/target"
cd arrow
cargo build --features=js --target wasm32-unknown-unknown
cargo build --no-default-features --features=csv,ipc,simd --target wasm32-unknown-unknown
cargo build --no-default-features --features=csv,ipc,simd --target wasm32-wasi
# test builds with various feature flags
default-build:
13 changes: 6 additions & 7 deletions arrow/Cargo.toml
@@ -40,10 +40,7 @@ serde = { version = "1.0", features = ["rc"] }
serde_derive = "1.0"
serde_json = { version = "1.0", features = ["preserve_order"] }
indexmap = "1.6"
rand = { version = "0.8", default-features = false }
# getrandom is a dependency of rand, not (directly) of arrow
# need to specify `js` feature to build on wasm
getrandom = { version = "0.2", optional = true }
rand = { version = "0.8", optional = true }
num = "0.4"
csv_crate = { version = "1.1", optional = true, package="csv" }
regex = "1.3"
@@ -64,16 +61,18 @@ csv = ["csv_crate"]
ipc = ["flatbuffers"]
simd = ["packed_simd"]
prettyprint = ["comfy-table"]
js = ["getrandom/js"]
# The test utils feature enables code used in benchmarks and tests but
# not the core arrow code itself
test_utils = ["rand/std", "rand/std_rng"]
# not the core arrow code itself. Be aware that `rand` must be kept as
# an optional dependency to support compiling to the wasm32-unknown-unknown
# target without assuming an environment containing JavaScript.
test_utils = ["rand"]
# this is only intended to be used in single-threaded programs: it verifies that
# all allocated memory is being released (no memory leaks).
# See README for details
memory-check = []

[dev-dependencies]
rand = "0.8"
criterion = "0.3"
flate2 = "1"
tempfile = "3"
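A minimal sketch of how a crate might keep `rand` out of its core code by gating the helpers that need it behind the `test_utils` feature. The module name `test_util` and both functions are hypothetical and are not taken from this commit; the `rand` calls assume the 0.8 API.

```rust
/// Core code: always compiled and free of any `rand` dependency, so it
/// builds for targets such as wasm32-unknown-unknown.
pub fn sum_values(values: &[i64]) -> i64 {
    values.iter().sum()
}

/// Helpers for tests and benchmarks. This module is compiled only when the
/// crate is built with `--features test_utils`, which is also what enables
/// the optional `rand` dependency.
#[cfg(feature = "test_utils")]
pub mod test_util {
    use rand::Rng;

    /// Returns `n` random values in `0..100` (hypothetical helper).
    pub fn random_values(n: usize) -> Vec<i64> {
        let mut rng = rand::thread_rng();
        (0..n).map(|_| rng.gen_range(0..100)).collect()
    }
}
```

Without the feature, the gated module is stripped before name resolution, so `rand` is never required; the `[dev-dependencies]` entry keeps `rand` available to the crate's own tests and benchmarks.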
30 changes: 27 additions & 3 deletions arrow/README.md
@@ -23,9 +23,13 @@

This crate contains the official Native Rust implementation of the [Apache Arrow](https://arrow.apache.org/) in-memory format. Please see the API documents for additional details.

## Versioning / Releases

Unlike many other crates in the Rust ecosystem which spend extended time in "pre 1.0.0" state, releasing versions 0.x, the arrow-rs crate follows the versioning scheme of the overall [Apache Arrow](https://arrow.apache.org/) project in an effort to signal which language implementations have been integration tested with each other.

## Features

The arrow crate provides the following optional features:
The arrow crate provides the following features which may be enabled:

- `csv` (default) - support for reading and writing Arrow arrays to/from csv files
- `ipc` (default) - support for the [arrow-flight](https://crates.io/crates/arrow-flight) IPC and wire format
@@ -35,13 +39,33 @@ The arrow crate provides the following optional features:
implementations of some [compute](https://github.com/apache/arrow/tree/master/rust/arrow/src/compute)
kernels using explicit SIMD processor intrinsics.

## Safety

TLDR: You should avoid using the `alloc`, `buffer`, and `bitmap` modules if at all possible. These modules contain `unsafe` code and are easy to misuse.

As with all open source code, you should carefully evaluate the suitability of `arrow` for your project, taking into consideration your needs and risk tolerance prior to use.

_Background_: There are various parts of the `arrow` crate which use `unsafe` and `transmute` code internally. We are actively working as a community to minimize undefined behavior and remove `unsafe` usage to align more with Rust's core principles of safety (e.g. the arrow2 project).

As `arrow` exists today, it is fairly easy to misuse the APIs, leading to undefined behavior, and it is especially easy to misuse code in the modules named above. For example, as described in [the arrow2 crate](https://github.com/jorgecarleitao/arrow2#why), the following code compiles, does not panic, but results in undefined behavior:

```rust
let buffer = Buffer::from_slice_ref(&[0i32, 2i32]);
let data = ArrayData::new(DataType::Int64, 10, 0, None, 0, vec![buffer], vec![]);
let array = Float64Array::from(Arc::new(data));

println!("{:?}", array.value(1));
```
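One way to see the problem is to compare the bytes the buffer actually holds with the bytes the declared array would read: two `i32` values occupy 8 bytes, while 10 elements of an 8-byte type need 80. The sketch below is a hypothetical, standard-library-only check of that arithmetic; `check_buffer_len` is an illustrative name and not part of the `arrow` API.

```rust
use std::mem::size_of;

/// Checks that a buffer of `buffer_len` bytes can back `len` elements of
/// type `T` starting at element `offset`. Hypothetical helper for
/// illustration only; it is not an `arrow` function.
pub fn check_buffer_len<T>(
    buffer_len: usize,
    len: usize,
    offset: usize,
) -> Result<(), String> {
    // Bytes needed = (offset + len) * size_of::<T>(), guarding against overflow.
    let required = len
        .checked_add(offset)
        .and_then(|n| n.checked_mul(size_of::<T>()))
        .ok_or_else(|| "length overflow".to_string())?;
    if required > buffer_len {
        return Err(format!(
            "buffer holds {} bytes but {} are required",
            buffer_len, required
        ));
    }
    Ok(())
}
```

Calling `check_buffer_len::<i64>(8, 10, 0)` with the sizes from the example above returns an error, which is the kind of validation the snippet bypasses.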

## Building for WASM

In order to compile Arrow for Web Assembly (the `wasm32-unknown-unknown` WASM target), you will likely need to turn off this crate's default features and use the `js` feature.
Arrow can compile to WebAssembly using the `wasm32-unknown-unknown` and `wasm32-wasi` targets.

In order to compile Arrow for `wasm32-unknown-unknown` you will need to disable default features, then include the desired features, but exclude test dependencies (the `test_utils` feature). For example, use this snippet in your `Cargo.toml`:

```toml
[dependencies]
arrow = { version = "5.0", default-features = false, features = ["js"] }
arrow = { version = "5.0", default-features = false, features = ["csv", "ipc", "simd"] }
```

## Examples
30 changes: 30 additions & 0 deletions arrow/src/array/array_dictionary.rs
@@ -153,6 +153,22 @@ impl<T: ArrowPrimitiveType> From<ArrayData> for DictionaryArray<T> {
}

/// Constructs a `DictionaryArray` from an iterator of optional strings.
///
/// # Example:
/// ```
/// use arrow::array::{DictionaryArray, PrimitiveArray, StringArray};
/// use arrow::datatypes::Int8Type;
///
/// let test = vec!["a", "a", "b", "c"];
/// let array: DictionaryArray<Int8Type> = test
/// .iter()
/// .map(|&x| if x == "b" { None } else { Some(x) })
/// .collect();
/// assert_eq!(
/// "DictionaryArray {keys: PrimitiveArray<Int8>\n[\n 0,\n 0,\n null,\n 1,\n] values: StringArray\n[\n \"a\",\n \"c\",\n]}\n",
/// format!("{:?}", array)
/// );
/// ```
impl<'a, T: ArrowPrimitiveType + ArrowDictionaryKeyType> FromIterator<Option<&'a str>>
for DictionaryArray<T>
{
@@ -181,6 +197,20 @@ impl<'a, T: ArrowPrimitiveType + ArrowDictionaryKeyType> FromIterator<Option<&'a
}

/// Constructs a `DictionaryArray` from an iterator of strings.
///
/// # Example:
///
/// ```
/// use arrow::array::{DictionaryArray, PrimitiveArray, StringArray};
/// use arrow::datatypes::Int8Type;
///
/// let test = vec!["a", "a", "b", "c"];
/// let array: DictionaryArray<Int8Type> = test.into_iter().collect();
/// assert_eq!(
/// "DictionaryArray {keys: PrimitiveArray<Int8>\n[\n 0,\n 0,\n 1,\n 2,\n] values: StringArray\n[\n \"a\",\n \"b\",\n \"c\",\n]}\n",
/// format!("{:?}", array)
/// );
/// ```
impl<'a, T: ArrowPrimitiveType + ArrowDictionaryKeyType> FromIterator<&'a str>
for DictionaryArray<T>
{
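The key and value layout asserted in the two doc examples above can be reproduced with a small standard-library sketch of dictionary encoding. This is an illustration of the idea only, not the `DictionaryArray` implementation; `dictionary_encode` is a hypothetical name, and nulls are modelled as `None` keys.

```rust
use std::collections::HashMap;

/// Dictionary-encodes optional strings: each distinct value is stored once
/// in `values` (in first-seen order), and every input slot becomes a key
/// into `values`, or `None` for a null. Hypothetical stand-in, not arrow code.
pub fn dictionary_encode<'a, I>(iter: I) -> (Vec<Option<i8>>, Vec<&'a str>)
where
    I: IntoIterator<Item = Option<&'a str>>,
{
    let mut lookup: HashMap<&'a str, i8> = HashMap::new();
    let mut values: Vec<&'a str> = Vec::new();
    let mut keys: Vec<Option<i8>> = Vec::new();

    for item in iter {
        match item {
            // Nulls occupy a slot in `keys` but add nothing to `values`.
            None => keys.push(None),
            Some(s) => {
                let next = values.len();
                let key = *lookup.entry(s).or_insert_with(|| {
                    // A new distinct value gets the next free key.
                    assert!(next <= i8::MAX as usize, "key type overflow");
                    values.push(s);
                    next as i8
                });
                keys.push(Some(key));
            }
        }
    }
    (keys, values)
}
```

Feeding it `"a", "a", null, "c"` yields keys `0, 0, null, 1` over the values `"a", "c"`, the same shape as the first doc example.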
39 changes: 39 additions & 0 deletions arrow/src/array/builder.rs
@@ -1062,6 +1062,11 @@ pub struct FixedSizeBinaryBuilder {
builder: FixedSizeListBuilder<UInt8Builder>,
}

///
/// Array Builder for [`DecimalArray`]
///
/// See [`DecimalArray`] for example.
///
#[derive(Debug)]
pub struct DecimalBuilder {
builder: FixedSizeListBuilder<UInt8Builder>,
@@ -2095,6 +2100,40 @@ impl UnionBuilder {
/// Array builder for `DictionaryArray`. For example to map a set of byte indices
/// to f32 values. Note that the use of a `HashMap` here will not scale to very large
/// arrays or result in an ordered dictionary.
///
/// # Example:
///
/// ```
/// use arrow::array::{
/// Array, PrimitiveBuilder, PrimitiveDictionaryBuilder,
/// UInt8Array, UInt32Array,
/// };
/// use arrow::datatypes::{UInt8Type, UInt32Type};
///
/// let key_builder = PrimitiveBuilder::<UInt8Type>::new(3);
/// let value_builder = PrimitiveBuilder::<UInt32Type>::new(2);
/// let mut builder = PrimitiveDictionaryBuilder::new(key_builder, value_builder);
/// builder.append(12345678).unwrap();
/// builder.append_null().unwrap();
/// builder.append(22345678).unwrap();
/// let array = builder.finish();
///
/// assert_eq!(
/// array.keys(),
/// &UInt8Array::from(vec![Some(0), None, Some(1)])
/// );
///
/// // Values are polymorphic and so require a downcast.
/// let av = array.values();
/// let ava: &UInt32Array = av.as_any().downcast_ref::<UInt32Array>().unwrap();
/// let avs: &[u32] = ava.values();
///
/// assert!(!array.is_null(0));
/// assert!(array.is_null(1));
/// assert!(!array.is_null(2));
///
/// assert_eq!(avs, &[12345678, 22345678]);
/// ```
#[derive(Debug)]
pub struct PrimitiveDictionaryBuilder<K, V>
where
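The doc comment's caveats about the `HashMap` (the dictionary is not ordered, and the key type bounds how many distinct values fit) can be illustrated with a standard-library sketch. `SketchDictionaryBuilder` is a hypothetical type that only mirrors the shape of the builder with `u8` keys and `u32` values; it is not the arrow implementation.

```rust
use std::collections::HashMap;

/// Toy dictionary builder with `u8` keys and `u32` values (hypothetical).
#[derive(Debug, Default)]
pub struct SketchDictionaryBuilder {
    lookup: HashMap<u32, u8>,
    keys: Vec<Option<u8>>,
    values: Vec<u32>,
}

impl SketchDictionaryBuilder {
    pub fn new() -> Self {
        Self::default()
    }

    /// Appends a value, reusing its key if it was seen before.
    /// Fails once all 256 `u8` keys are taken.
    pub fn append(&mut self, value: u32) -> Result<u8, String> {
        if let Some(&key) = self.lookup.get(&value) {
            self.keys.push(Some(key));
            return Ok(key);
        }
        if self.values.len() > u8::MAX as usize {
            return Err("dictionary key overflow".to_string());
        }
        let key = self.values.len() as u8;
        self.lookup.insert(value, key);
        self.values.push(value);
        self.keys.push(Some(key));
        Ok(key)
    }

    /// Appends a null slot; the dictionary values are unchanged.
    pub fn append_null(&mut self) {
        self.keys.push(None);
    }

    /// Returns the keys and the dictionary values in first-seen order.
    pub fn finish(self) -> (Vec<Option<u8>>, Vec<u32>) {
        (self.keys, self.values)
    }
}
```

Values come out in insertion order rather than sorted order, so appending a larger value before a smaller one leaves the dictionary unsorted, which is what "not an ordered dictionary" means here.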
4 changes: 2 additions & 2 deletions dev/release/README.md
@@ -260,15 +260,15 @@ For example, to backport `b2de5446cc1e45a0559fb39039d0545df1ac0d26` to active_re
```shell
git clone git@github.com:apache/arrow-rs.git /tmp/arrow-rs

ARROW_GITHUB_API_TOKEN=$ARROW_GITHUB_API_TOKEN CHECKOUT_ROOT=/tmp/arrow-rs CHERRY_PICK_SHA=b2de5446cc1e45a0559fb39039d0545df1ac0d26 python3 dev/release/cherry-pick-pr.py
CHERRY_PICK_SHA=b2de5446cc1e45a0559fb39039d0545df1ac0d26 ARROW_GITHUB_API_TOKEN=$ARROW_GITHUB_API_TOKEN CHECKOUT_ROOT=/tmp/arrow-rs python3 dev/release/cherry-pick-pr.py
```

## Labels

There are two labels that help keep track of backporting:

1. [`cherry-picked`](https://github.com/apache/arrow-rs/labels/cherry-picked) for PRs that have been cherry-picked/backported to `active_release`
2. [`release-cherry-pick`](https://github.com/apache/arrow-rs/labels/release-cherry-pick) for the PRs that are the cherry pick
2. [`release-cherry-pick`](https://github.com/apache/arrow-rs/labels/release-cherry-pick) for the PRs that are the cherry pick to `active_release`

You can find candidates to cherry pick using [this filter](https://github.com/apache/arrow-rs/pulls?q=is%3Apr+is%3Aclosed+-label%3Arelease-cherry-pick+-label%3Acherry-picked)

28 changes: 27 additions & 1 deletion parquet/src/arrow/arrow_writer.rs
@@ -227,7 +227,7 @@ fn write_leaves(
ArrowDataType::FixedSizeList(_, _) | ArrowDataType::Union(_) => {
Err(ParquetError::NYI(
format!(
"Attempting to write an Arrow type {:?} to parquet that is not yet implemented",
"Attempting to write an Arrow type {:?} to parquet that is not yet implemented",
array.data_type()
)
))
@@ -1199,6 +1199,32 @@ mod tests {
);
}

#[test]
fn bool_large_single_column() {
let values = Arc::new(
[None, Some(true), Some(false)]
.iter()
.cycle()
.copied()
.take(200_000)
.collect::<BooleanArray>(),
);
let schema =
Schema::new(vec![Field::new("col", values.data_type().clone(), true)]);
let expected_batch =
RecordBatch::try_new(Arc::new(schema), vec![values]).unwrap();
let file = get_temp_file("bool_large_single_column", &[]);

let mut writer = ArrowWriter::try_new(
file.try_clone().unwrap(),
expected_batch.schema(),
None,
)
.expect("Unable to write file");
writer.write(&expected_batch).unwrap();
writer.close().unwrap();
}

#[test]
fn i8_single_column() {
required_and_optional::<Int8Array, _>(0..SMALL_SIZE as i8, "i8_single_column");
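The input for `bool_large_single_column` is built by cycling a three-element pattern and truncating it. A standard-library sketch of just that iterator chain, collecting into a `Vec` instead of a `BooleanArray`, shows what the column contains; `cycled_bools` is a hypothetical helper name.

```rust
/// Repeats `None, Some(true), Some(false)` until `n` items are produced,
/// mirroring the iterator chain used to build the test column.
pub fn cycled_bools(n: usize) -> Vec<Option<bool>> {
    [None, Some(true), Some(false)]
        .iter()
        .cycle()
        .copied()
        .take(n)
        .collect()
}
```

Because 200,000 is not a multiple of three, the column ends partway through the pattern, so nulls and `true` values each appear once more than `false` values.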