Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up the substring kernel by about 2x #1512

Merged
merged 6 commits into from
Apr 7, 2022
Merged
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
96 changes: 57 additions & 39 deletions arrow/src/compute/kernels/substring.rs
Original file line number Diff line number Diff line change
Expand Up @@ -24,56 +24,74 @@ use crate::{
error::{ArrowError, Result},
};

#[allow(clippy::unnecessary_wraps)]
fn generic_substring<OffsetSize: StringOffsetSizeTrait>(
array: &GenericStringArray<OffsetSize>,
start: OffsetSize,
length: &Option<OffsetSize>,
) -> Result<ArrayRef> {
// compute current offsets
let offsets = array.data_ref().clone().buffers()[0].clone();
let offsets: &[OffsetSize] = unsafe { offsets.typed_data::<OffsetSize>() };

// compute null bitmap (copy)
let offsets = array.value_offsets();
let null_bit_buffer = array.data_ref().null_buffer().cloned();

// compute values
let values = &array.data_ref().buffers()[1];
let values = array.value_data();
let data = values.as_slice();
let zero = OffsetSize::zero();

let mut new_values = MutableBuffer::new(0); // we have no way to estimate how much this will be.
let mut new_offsets: Vec<OffsetSize> = Vec::with_capacity(array.len() + 1);

let mut length_so_far = OffsetSize::zero();
new_offsets.push(length_so_far);
(0..array.len()).for_each(|i| {
// the length of this entry
let length_i: OffsetSize = offsets[i + 1] - offsets[i];
// compute where we should start slicing this entry
let start = offsets[i]
+ if start >= OffsetSize::zero() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to compare start with zero multiple time. So I move this outside the for loop

start
} else {
length_i + start
};

let start = start.max(offsets[i]).min(offsets[i + 1]);
// compute the length of the slice
let length: OffsetSize = length
.unwrap_or(length_i)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same reason! we don't need to unwrap length in each loop. So move this outside the loop.

// .max(0) is not needed as it is guaranteed
.min(offsets[i + 1] - start); // so we do not go beyond this entry

length_so_far += length;
// calculate the start offset for each substring
// if `start` >= 0
// then, count from the start of each string
// else, count from the end of each string
let new_starts: Vec<OffsetSize> = if start >= zero {
offsets
.windows(2)
.map(|pair| (pair[0] + start).min(pair[1]))
.collect()
} else {
offsets
.windows(2)
.map(|pair| (pair[1] + start).max(pair[0]))
.collect()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be better to do new_starts/new_length in one iteration over offsets and use unzip to materialize into two Vecs?

Copy link
Contributor

@Dandandan Dandandan Mar 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it, I think actually materializing new_length to a Vec is not needed. It probably is faster (and saves allocating this Vec) to compute the length again when calculating new_values.

Copy link
Contributor Author

@HaoYang670 HaoYang670 Mar 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to remove .collect, actually. But you know, closure values in if-else branch can get no two closures, even if identical, have the same type error. And I did not find a good solution except adding .collect
rust-lang/rust#87961

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is not possible in Rust (I think only by Boxing or manually inlining / macros).

Calculating the length in two places with something like let length = *end - *start within both the new_offsets new_values doesn´t seem too bad in terms of duplication?

Copy link
Contributor Author

@HaoYang670 HaoYang670 Apr 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is not possible in Rust (I think only by Boxing or manually inlining / macros).

Calculating the length in two places with something like let length = *end - *start within both the new_offsets new_values doesn´t seem too bad in terms of duplication?

One way is to allow the closure not to capture the environment by adding more parameters, and the compiler will downcast closure to function. This works for calculating new_starts, but doesn't work for new_length where we must capture Some(length) = length

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My later comment mentioned not allocating the new vec but not allocating the array at all (by moving the calculation to the iteration in new_values). Is that something you want to try?

Thinking about it, I think actually materializing new_length to a Vec is not needed. It probably is faster (and saves allocating this Vec) to compute the length again when calculating new_values.

};

new_offsets.push(length_so_far);
// count the length of each substring
// if `length` is given
// then, use it
// else, length is `string[new_start..].len()`
let new_length: Vec<OffsetSize> = if let Some(length) = length {
offsets[1..]
.iter()
.zip(new_starts.iter())
.map(|(end, start)| *(length.min(&(*end - *start))))
.collect()
} else {
offsets[1..]
.iter()
.zip(new_starts.iter())
.map(|(end, start)| *end - *start)
.collect()
};

// we need usize for ranges
let start = start.to_usize().unwrap();
let length = length.to_usize().unwrap();
let new_offsets: Vec<OffsetSize> = [zero]
.iter()
.copied()
.chain(new_length.iter().scan(zero, |len_so_far, &len| {
*len_so_far += len;
Some(*len_so_far)
}))
.collect();

new_values.extend_from_slice(&data[start..start + length]);
});
// concatenate substrings into a buffer
let new_values = {
let mut new_values =
MutableBuffer::new(new_offsets.last().unwrap().to_usize().unwrap());
new_starts
.iter()
.zip(new_length.iter())
.map(|(start, length)| {
(start.to_usize().unwrap(), length.to_usize().unwrap())
})
.map(|(start, length)| &data[start..start + length])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.map(|(start, length)| {
(start.to_usize().unwrap(), length.to_usize().unwrap())
})
.map(|(start, length)| &data[start..start + length])
.map(|(start, length)| {
let start = start.to_usize().unwrap();
let length = length.to_usize().unwrap();
&data[start..start + length]
})

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

.for_each(|slice| new_values.extend_from_slice(slice));
new_values
};

let data = unsafe {
ArrayData::new_unchecked(
Expand Down