Commit

Auto merge of #111475 - workingjubilee:sync-simd-2023-may-10, r=workingjubilee

 Sync portable-simd to 2023 May 10

Take 2.

r? `@ghost`
bors committed May 12, 2023
2 parents 5b24e12 + e4cecc1 commit 26e0c57
Showing 50 changed files with 2,199 additions and 763 deletions.
4 changes: 4 additions & 0 deletions library/portable-simd/.github/workflows/ci.yml
@@ -241,6 +241,10 @@ jobs:
- "--features std"
- "--features generic_const_exprs"
- "--features std --features generic_const_exprs"
- "--features all_lane_counts"
- "--features all_lane_counts --features std"
- "--features all_lane_counts --features generic_const_exprs"
- "--features all_lane_counts --features std --features generic_const_exprs"

steps:
- uses: actions/checkout@v2
34 changes: 12 additions & 22 deletions library/portable-simd/README.md
@@ -24,44 +24,34 @@ or by setting up `rustup default nightly` or else with `cargo +nightly {build,te
```bash
cargo new hellosimd
```
to create a new crate. Edit `hellosimd/Cargo.toml` to be
```toml
[package]
name = "hellosimd"
version = "0.1.0"
edition = "2018"
[dependencies]
core_simd = { git = "https://github.com/rust-lang/portable-simd" }
```

and finally write this in `src/main.rs`:
to create a new crate. Finally write this in `src/main.rs`:
```rust
use core_simd::*;
#![feature(portable_simd)]
use std::simd::f32x4;
fn main() {
let a = f32x4::splat(10.0);
let b = f32x4::from_array([1.0, 2.0, 3.0, 4.0]);
println!("{:?}", a + b);
}
```

Explanation: We import all the bindings from the crate with the first line. Then, we construct our SIMD vectors with methods like `splat` or `from_array`. Finally, we can use operators on them like `+` and the appropriate SIMD instructions will be carried out. When we run `cargo run` you should get `[11.0, 12.0, 13.0, 14.0]`.

## Code Organization
Explanation: We construct our SIMD vectors with methods like `splat` or `from_array`. Next, we can use operators like `+` on them, and the appropriate SIMD instructions will be carried out. When we run `cargo run` you should get `[11.0, 12.0, 13.0, 14.0]`.

Currently the crate is organized so that each element type is a file, and then the 64-bit, 128-bit, 256-bit, and 512-bit vectors using those types are contained in said file.
## Supported vectors

All types are then exported as a single, flat module.
Currently, vectors may have up to 64 elements, but aliases are provided only up to 512-bit vectors.

Depending on the size of the primitive type, the number of lanes the vector will have varies. For example, 128-bit vectors have four `f32` lanes and two `f64` lanes.

The supported element types are as follows:
* **Floating Point:** `f32`, `f64`
* **Signed Integers:** `i8`, `i16`, `i32`, `i64`, `i128`, `isize`
* **Unsigned Integers:** `u8`, `u16`, `u32`, `u64`, `u128`, `usize`
* **Masks:** `mask8`, `mask16`, `mask32`, `mask64`, `mask128`, `masksize`
* **Signed Integers:** `i8`, `i16`, `i32`, `i64`, `isize` (`i128` excluded)
* **Unsigned Integers:** `u8`, `u16`, `u32`, `u64`, `usize` (`u128` excluded)
* **Pointers:** `*const T` and `*mut T` (zero-sized metadata only)
* **Masks:** 8-bit, 16-bit, 32-bit, 64-bit, and `usize`-sized masks

Floating point, signed integers, and unsigned integers are the [primitive types](https://doc.rust-lang.org/core/primitive/index.html) you're already used to.
The `mask` types are "truthy" values, but they use the number of bits in their name instead of just 1 bit like a normal `bool` uses.
Floating point, signed integers, unsigned integers, and pointers are the [primitive types](https://doc.rust-lang.org/core/primitive/index.html) you're already used to.
The mask types have elements that are "truthy" values, like `bool`, but have an unspecified layout because different architectures prefer different layouts for mask types.
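
The relationship between vector width and lane count described above can be sketched in plain stable Rust. This is only an illustration: `lanes_for` is a hypothetical helper, not something provided by `std::simd` or `core_simd`.

```rust
use std::mem::size_of;

/// Number of lanes of element type `T` that fit in a vector `bits` wide.
/// Hypothetical helper for illustration only; it is not part of `std::simd`.
pub fn lanes_for<T>(bits: usize) -> usize {
    // A lane occupies `size_of::<T>()` bytes, i.e. 8 times that many bits.
    bits / (8 * size_of::<T>())
}
```

For example, `lanes_for::<f32>(128)` gives 4 and `lanes_for::<f64>(128)` gives 2, matching the 128-bit case mentioned above, while `lanes_for::<u8>(512)` gives 64, the largest element count currently allowed.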

[simd-guide]: ./beginners-guide.md
[zulip-project-portable-simd]: https://rust-lang.zulipchat.com/#narrow/stream/257879-project-portable-simd
9 changes: 4 additions & 5 deletions library/portable-simd/crates/core_simd/Cargo.toml
@@ -13,12 +13,11 @@ default = ["as_crate"]
as_crate = []
std = []
generic_const_exprs = []
all_lane_counts = []

[target.'cfg(target_arch = "wasm32")'.dev-dependencies.wasm-bindgen]
version = "0.2"

[dev-dependencies.wasm-bindgen-test]
version = "0.3"
[target.'cfg(target_arch = "wasm32")'.dev-dependencies]
wasm-bindgen = "0.2"
wasm-bindgen-test = "0.3"

[dev-dependencies.proptest]
version = "0.10"
13 changes: 13 additions & 0 deletions library/portable-simd/crates/core_simd/examples/README.md
@@ -0,0 +1,13 @@
### `stdsimd` examples

This crate is a port of example uses of `stdsimd`, mostly taken from the `packed_simd` crate.

The examples contain, as in the case of `dot_product.rs`, multiple ways of solving the same problem, in order to show idiomatic uses of SIMD and how a design can be iterated on for performance.

Run the tests with the command

```
cargo test --example dot_product
```

and verify the code for `dot_product.rs` on your machine.
169 changes: 169 additions & 0 deletions library/portable-simd/crates/core_simd/examples/dot_product.rs
@@ -0,0 +1,169 @@
// Code taken from the `packed_simd` crate
// Run this code with `cargo test --example dot_product`
//use std::iter::zip;

#![feature(array_chunks)]
#![feature(slice_as_chunks)]
// Add these imports to use the stdsimd library
#![feature(portable_simd)]
use core_simd::simd::*;

// This is your barebones dot product implementation:
// Take 2 vectors, multiply them element wise and *then*
// go along the resulting array and add up the result.
// In the next example we will see if there
// is any difference to adding and multiplying in tandem.
pub fn dot_prod_scalar_0(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());

    a.iter().zip(b.iter()).map(|(a, b)| a * b).sum()
}

// When dealing with SIMD, it is very important to think about the amount
// of data movement and when it happens. We're going over simple computation examples here, and yet
// it is not trivial to understand what may or may not contribute to performance
// changes. Eventually, you will need tools to inspect the generated assembly and confirm your
// hypothesis and benchmarks - we will mention them later on.
// With the use of `fold`, we're doing a multiplication,
// and then adding it to the sum, one element from both vectors at a time.
pub fn dot_prod_scalar_1(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    a.iter()
        .zip(b.iter())
        .fold(0.0, |a, zipped| a + zipped.0 * zipped.1)
}

// We now move on to the SIMD implementations. Notice the following constructs:
// `array_chunks::<4>`: mapping over these chunks lets us construct SIMD vectors
// `f32x4::from_array`: construct the SIMD vector from a 4-element array
// `(a * b).reduce_sum()`: multiply both f32x4 vectors together, and then reduce them.
// This approach essentially uses SIMD to produce N/4 partial sums, one `reduce_sum()` per chunk,
// and then adds those with `sum()`. This is suboptimal.
// TODO: ASCII diagrams
pub fn dot_prod_simd_0(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    // TODO handle remainder when a.len() % 4 != 0
    a.array_chunks::<4>()
        .map(|&a| f32x4::from_array(a))
        .zip(b.array_chunks::<4>().map(|&b| f32x4::from_array(b)))
        .map(|(a, b)| (a * b).reduce_sum())
        .sum()
}

// There are some simple ways to improve the previous code:
// 1. Make a `zero` `f32x4` SIMD vector that we will be accumulating into,
//    so that there is only one `reduce_sum()` reduction when the last `f32x4` has been processed.
// 2. Exploit Fused Multiply Add so that the multiplication, addition and sinking into the reduction
//    happen in the same step.
// If the arrays are large, minimizing the data shuffling will lead to great perf.
// If the arrays are small, handling the remainder elements when the length isn't a multiple of 4
// can become a problem.
pub fn dot_prod_simd_1(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    // TODO handle remainder when a.len() % 4 != 0
    a.array_chunks::<4>()
        .map(|&a| f32x4::from_array(a))
        .zip(b.array_chunks::<4>().map(|&b| f32x4::from_array(b)))
        .fold(f32x4::splat(0.0), |acc, zipped| acc + zipped.0 * zipped.1)
        .reduce_sum()
}

// A lot of knowledgeable use of SIMD comes from knowing specific instructions that are
// available - let's try to use the `mul_add` instruction, which is the fused-multiply-add we were looking for.
use std_float::StdFloat;
pub fn dot_prod_simd_2(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    // TODO handle remainder when a.len() % 4 != 0
    let mut res = f32x4::splat(0.0);
    a.array_chunks::<4>()
        .map(|&a| f32x4::from_array(a))
        .zip(b.array_chunks::<4>().map(|&b| f32x4::from_array(b)))
        .for_each(|(a, b)| {
            res = a.mul_add(b, res);
        });
    res.reduce_sum()
}

// Next, we write the same operation while also handling the loop remainder.
const LANES: usize = 4;
pub fn dot_prod_simd_3(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());

    let (a_extra, a_chunks) = a.as_rchunks();
    let (b_extra, b_chunks) = b.as_rchunks();

    // These are always true, but for emphasis:
    assert_eq!(a_chunks.len(), b_chunks.len());
    assert_eq!(a_extra.len(), b_extra.len());

    let mut sums = [0.0; LANES];
    for ((x, y), d) in std::iter::zip(a_extra, b_extra).zip(&mut sums) {
        *d = x * y;
    }

    let mut sums = f32x4::from_array(sums);
    std::iter::zip(a_chunks, b_chunks).for_each(|(x, y)| {
        sums += f32x4::from_array(*x) * f32x4::from_array(*y);
    });

    sums.reduce_sum()
}

// Finally, we present an iterator version for handling remainders in a scalar fashion at the end of the loop.
// Unfortunately, this is allocating 1 `XMM` register on the order of `~len(a)` - we'll see how we can get around it in the
// next example.
pub fn dot_prod_simd_4(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = a
        .array_chunks::<4>()
        .map(|&a| f32x4::from_array(a))
        .zip(b.array_chunks::<4>().map(|&b| f32x4::from_array(b)))
        .map(|(a, b)| a * b)
        .fold(f32x4::splat(0.0), std::ops::Add::add)
        .reduce_sum();
    let remain = a.len() - (a.len() % 4);
    sum += a[remain..]
        .iter()
        .zip(&b[remain..])
        .map(|(a, b)| a * b)
        .sum::<f32>();
    sum
}

// This version allocates a single `XMM` register for accumulation, and the folds don't allocate on top of that.
// Notice the use of `mul_add`, which performs a multiply and an add in a single operation per iteration.
pub fn dot_prod_simd_5(a: &[f32], b: &[f32]) -> f32 {
    a.array_chunks::<4>()
        .map(|&a| f32x4::from_array(a))
        .zip(b.array_chunks::<4>().map(|&b| f32x4::from_array(b)))
        .fold(f32x4::splat(0.), |acc, (a, b)| a.mul_add(b, acc))
        .reduce_sum()
}

fn main() {
    // Empty main to make cargo happy
}

#[cfg(test)]
mod tests {
    #[test]
    fn smoke_test() {
        use super::*;
        let a: Vec<f32> = vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
        let b: Vec<f32> = vec![-8.0, -7.0, -6.0, -5.0, 4.0, 3.0, 2.0, 1.0];
        let x: Vec<f32> = [0.5; 1003].to_vec();
        let y: Vec<f32> = [2.0; 1003].to_vec();

        // Basic check
        assert_eq!(0.0, dot_prod_scalar_0(&a, &b));
        assert_eq!(0.0, dot_prod_scalar_1(&a, &b));
        assert_eq!(0.0, dot_prod_simd_0(&a, &b));
        assert_eq!(0.0, dot_prod_simd_1(&a, &b));
        assert_eq!(0.0, dot_prod_simd_2(&a, &b));
        assert_eq!(0.0, dot_prod_simd_3(&a, &b));
        assert_eq!(0.0, dot_prod_simd_4(&a, &b));
        assert_eq!(0.0, dot_prod_simd_5(&a, &b));

        // We can handle vectors that are non-multiples of 4
        assert_eq!(1003.0, dot_prod_simd_3(&x, &y));
    }
}
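
The accumulate-then-reduce pattern that the SIMD versions above build towards can also be sketched on stable Rust, which may help when reading the example without a nightly toolchain. This is a rough illustration under stated assumptions, not code from the commit: a plain `[f32; 4]` stands in for `f32x4`, `chunks_exact` stands in for `array_chunks`, and the name `dot_prod_lanes` is hypothetical. Whether the compiler vectorizes the inner loop is not guaranteed.

```rust
const LANES: usize = 4;

/// Dot product that keeps one partial sum per lane and reduces once at the end.
/// Hypothetical stand-in for the `f32x4` versions; `[f32; LANES]` plays the role of the vector.
pub fn dot_prod_lanes(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());

    let a_chunks = a.chunks_exact(LANES);
    let b_chunks = b.chunks_exact(LANES);
    // Grab the leftover elements (fewer than LANES) before the iterators are consumed.
    let (a_rem, b_rem) = (a_chunks.remainder(), b_chunks.remainder());

    // One accumulator per lane, updated once per chunk.
    let mut acc = [0.0f32; LANES];
    for (x, y) in a_chunks.zip(b_chunks) {
        for i in 0..LANES {
            acc[i] += x[i] * y[i];
        }
    }

    // Scalar handling of the remainder, then a single horizontal reduction.
    let tail: f32 = a_rem.iter().zip(b_rem).map(|(x, y)| x * y).sum();
    acc.iter().sum::<f32>() + tail
}
```

Called with the inputs from `smoke_test`, this sketch agrees with the values asserted there: `0.0` for the 8-element vectors and `1003.0` for the 1003-element ones, whose length is not a multiple of 4.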