Consider migrating to portable-simd #580
Comments
Asked for some guidance on the Zulip channel: https://rust-lang.zulipchat.com/#narrow/stream/257879-project-portable-simd/topic/is.20it.20mature.20to.20switch.20to.20portable-simd.3F/near/261018409 |
IMO this feels like a super cool issue. If anyone would like to pick this one, go ahead. |
FWIW, a quick perf benchmark shows benefits on the sum of non-nulls, with:
#![feature(portable_simd)]
use criterion::{criterion_group, criterion_main, Criterion};
use std::convert::TryInto;

use core_simd::f32x16;
use packed_simd::f32x16 as p_f32x16;

const LANES: usize = 16;

pub fn packed_simd_sum(values: &[f32]) -> f32 {
    let chunks = values.chunks_exact(LANES);
    let remainder = chunks.remainder();
    let sum = chunks.fold(p_f32x16::default(), |acc, chunk| {
        let chunk: [f32; 16] = chunk.try_into().unwrap();
        let chunk: p_f32x16 = p_f32x16::from_slice_unaligned(&chunk);
        acc + chunk
    });
    let remainder: f32 = remainder.iter().copied().sum();
    sum.sum() + remainder
}

pub fn core_simd_sum(values: &[f32]) -> f32 {
    let chunks = values.chunks_exact(LANES);
    let remainder = chunks.remainder();
    let sum = chunks.fold(f32x16::default(), |acc, chunk| {
        let chunk: [f32; 16] = chunk.try_into().unwrap();
        let chunk: f32x16 = f32x16::from_array(chunk);
        acc + chunk
    });
    let remainder: f32 = remainder.iter().copied().sum();
    let mut reduced = 0.0f32;
    for i in 0..LANES {
        reduced += sum[i];
    }
    reduced + remainder
}

pub fn nonsimd_sum(values: &[f32]) -> f32 {
    let chunks = values.chunks_exact(LANES);
    let remainder = chunks.remainder();
    let sum = chunks.fold([0.0f32; LANES], |mut acc, chunk| {
        let chunk: [f32; LANES] = chunk.try_into().unwrap();
        for i in 0..LANES {
            acc[i] += chunk[i];
        }
        acc
    });
    let remainder: f32 = remainder.iter().copied().sum();
    let mut reduced = 0.0f32;
    (0..LANES).for_each(|i| {
        reduced += sum[i];
    });
    reduced + remainder
}

pub fn naive_sum(values: &[f32]) -> f32 {
    values.iter().sum()
}

fn add_benchmark(c: &mut Criterion) {
    (10..=20).step_by(2).for_each(|log2_size| {
        let size = 2usize.pow(log2_size);
        let array = (0..size)
            .map(|x| std::f32::consts::PI * x as f32 * x as f32 - std::f32::consts::PI * x as f32)
            .collect::<Vec<_>>();

        c.bench_function(&format!("core_simd_sum 2^{} f32", log2_size), |b| {
            b.iter(|| core_simd_sum(&array))
        });
        c.bench_function(&format!("packed_simd_sum 2^{} f32", log2_size), |b| {
            b.iter(|| packed_simd_sum(&array))
        });
        c.bench_function(&format!("nonsimd_sum 2^{} f32", log2_size), |b| {
            b.iter(|| nonsimd_sum(&array))
        });
        c.bench_function(&format!("naive_sum null 2^{} f32", log2_size), |b| {
            b.iter(|| naive_sum(&array))
        });
    });
}

criterion_group!(benches, add_benchmark);
criterion_main!(benches);

and

[package]
name = "test"
version = "0.1.0"
edition = "2018"

[dependencies]
core_simd = { git = "https://github.com/rust-lang/portable-simd" }
packed_simd = { version = "0.3", package = "packed_simd_2" }

[dev-dependencies]
criterion = "0.3"

[[bench]]
name = "sum"
harness = false |
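Before reading too much into the timings, it may be worth checking that the variants agree on the result, since the chunked versions add the values in a different order than the naive one and f32 addition is not associative. A minimal sketch on stable Rust, covering only the two non-SIMD variants from the benchmark above; the `agree` helper and its tolerance are assumptions of this sketch, not part of the benchmark:

```rust
const LANES: usize = 16;

/// Lane-wise chunked sum, same shape as `nonsimd_sum` in the benchmark.
pub fn nonsimd_sum(values: &[f32]) -> f32 {
    let chunks = values.chunks_exact(LANES);
    let remainder = chunks.remainder();
    // Accumulate each lane separately, as the SIMD variants do.
    let acc = chunks.fold([0.0f32; LANES], |mut acc, chunk| {
        for (a, v) in acc.iter_mut().zip(chunk) {
            *a += *v;
        }
        acc
    });
    let tail: f32 = remainder.iter().copied().sum();
    // Horizontal reduction of the lanes, then the leftover elements.
    acc.iter().sum::<f32>() + tail
}

/// Plain left-to-right sum.
pub fn naive_sum(values: &[f32]) -> f32 {
    values.iter().sum()
}

/// Hypothetical helper: true when `a` and `b` agree within a relative tolerance.
pub fn agree(a: f32, b: f32, rel_tol: f32) -> bool {
    let scale = a.abs().max(b.abs()).max(1.0);
    (a - b).abs() <= rel_tol * scale
}
```

The same comparison could be extended to `core_simd_sum` and `packed_simd_sum` on nightly; a loose relative tolerance is probably more appropriate than exact equality for the benchmark's input.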
See also https://github.com/DataEngineeringLabs/simd-benches, where I am benchmarking the algorithms. |
A cool thing is support for |
Waiting for rust-lang/portable-simd#197 |
Nota bene: this prevents compiling datafusion on stable with simd enabled when using arrow2. |
Maybe we can have two simd implementations separated by feature flags? |
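For illustration, a rough sketch of what such a split could look like. The feature names `simd_portable` and `simd_packed` are hypothetical, and the backend bodies are stable-Rust stand-ins: a real build would call into core_simd or packed_simd there, both of which need nightly.

```rust
// Hypothetical Cargo features: `simd_portable` and `simd_packed`.
// Exactly one `backend` module is compiled, chosen by the enabled features.

/// Stand-in kernel shared by both "SIMD" backends in this sketch.
#[allow(dead_code)]
fn lanewise_sum(values: &[f32]) -> f32 {
    const LANES: usize = 16;
    let chunks = values.chunks_exact(LANES);
    let tail: f32 = chunks.remainder().iter().copied().sum();
    let acc = chunks.fold([0.0f32; LANES], |mut acc, chunk| {
        for (a, v) in acc.iter_mut().zip(chunk) {
            *a += *v;
        }
        acc
    });
    acc.iter().sum::<f32>() + tail
}

#[cfg(feature = "simd_portable")]
mod backend {
    // A real implementation would use core_simd here (assumption).
    pub fn sum(values: &[f32]) -> f32 {
        super::lanewise_sum(values)
    }
}

#[cfg(all(feature = "simd_packed", not(feature = "simd_portable")))]
mod backend {
    // A real implementation would use packed_simd here (assumption).
    pub fn sum(values: &[f32]) -> f32 {
        super::lanewise_sum(values)
    }
}

#[cfg(not(any(feature = "simd_portable", feature = "simd_packed")))]
mod backend {
    // Scalar fallback that compiles on stable with no features enabled.
    pub fn sum(values: &[f32]) -> f32 {
        values.iter().sum()
    }
}

/// Single entry point; callers do not see which backend was selected.
pub fn sum_f32(values: &[f32]) -> f32 {
    backend::sum(values)
}
```

Callers only ever use `sum_f32`, so the choice of backend stays a build-time detail.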
Features tied to a particular rustc channel make things very clunky... I think we'll have to limit datafusion on arrow2 to Rust nightly for now. |
does datafusion compile on stable with simd enabled? - I think it depends on arrow, which depends on My understanding is that currently |
My bad, it is in fact only available on nightly. |
It seems packed_simd2 is no longer being developed.
https://github.com/rust-lang/portable-simd/