Use SlicesIterator for ArrayData Equality #3198
eca457d to 24788ad
Perhaps this could be simplified by using try_for_each_valid_idx?
Tried with
Could we possibly back out the changes to SlicesIterator and just use BitSliceIterator, then I think this is good to go 👍
arrow-data/src/slices_iterator.rs
Outdated
///
/// Only performant for filters that copy across long contiguous runs
#[derive(Debug)]
pub struct SlicesIterator<'a>(BitSliceIterator<'a>);
Moving this is a breaking change, as it is exposed publicly - https://docs.rs/arrow-select/latest/arrow_select/filter/struct.SlicesIterator.html
arrow-data/src/slices_iterator.rs
Outdated
impl<'a> SlicesIterator<'a> {
    pub fn new_from_buffer(values: &'a Buffer, offset: usize, len: usize) -> Self {
        Self(BitSliceIterator::new(values, offset, len))
At this point why not just use BitSliceIterator?
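For readers unfamiliar with slice-style bitmap iteration, here is a minimal sketch of the idea: walk a packed validity bitmap and report each run of consecutive set bits as a half-open `(start, end)` range. The function `set_bit_runs` is a hypothetical stand-in written for illustration, not the arrow implementation, and it assumes least-significant-bit-first packing.

```rust
/// Hypothetical illustration (not the arrow API): collect half-open
/// `(start, end)` ranges of consecutive set bits from a packed bitmap.
/// Assumes LSB-first bit order; positions are relative to `offset`.
fn set_bit_runs(bytes: &[u8], offset: usize, len: usize) -> Vec<(usize, usize)> {
    // Read a single bit at logical position `i`.
    let is_set = |i: usize| -> bool {
        let bit = offset + i;
        (bytes[bit / 8] >> (bit % 8)) & 1u8 == 1u8
    };

    let mut runs = Vec::new();
    let mut start: Option<usize> = None;
    for i in 0..len {
        match (is_set(i), start) {
            // A run of set bits begins here.
            (true, None) => start = Some(i),
            // The current run ends just before this unset bit.
            (false, Some(s)) => {
                runs.push((s, i));
                start = None;
            }
            _ => {}
        }
    }
    // Close a run that extends to the end of the range.
    if let Some(s) = start {
        runs.push((s, len));
    }
    runs
}
```

A caller can then compare or copy `values[start..end]` once per run instead of testing every index, which is why this pays off mainly when the runs are long.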
use super::utils::equal_len;

pub(crate) const NULL_SLICES_SELECTIVITY_THRESHOLD: f64 = 0.4;
How did you come up with 0.4? Not saying it is a bad choice, just curious.
It came from benchmarking. equal_nulls_512 has a null density of 0.5, and this change doesn't improve it, so I picked 0.4 as the threshold.
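To make the trade-off concrete, here is a rough sketch of dispatching on null density. Everything in it is an assumption for illustration: the helper `equal_valid_values`, the `&[bool]` validity mask, and the strict `<` comparison against the threshold are not taken from the PR; only the constant's name and value are.

```rust
/// Name and value taken from the discussion above; how it is applied
/// below is an assumption.
const NULL_SLICES_SELECTIVITY_THRESHOLD: f64 = 0.4;

/// Hypothetical helper: compare two value slices at the positions marked
/// valid. Null positions are ignored.
fn equal_valid_values(lhs: &[i32], rhs: &[i32], valid: &[bool]) -> bool {
    let len = valid.len();
    assert!(lhs.len() >= len && rhs.len() >= len);
    if len == 0 {
        return true;
    }

    let null_count = valid.iter().filter(|v| !**v).count();
    let null_density = null_count as f64 / len as f64;

    if null_density < NULL_SLICES_SELECTIVITY_THRESHOLD {
        // Few nulls: compare whole contiguous runs of valid values at once.
        let mut i = 0;
        while i < len {
            if !valid[i] {
                i += 1;
                continue;
            }
            let start = i;
            while i < len && valid[i] {
                i += 1;
            }
            if lhs[start..i] != rhs[start..i] {
                return false;
            }
        }
        true
    } else {
        // Many nulls: runs are short, so check valid indices one by one.
        (0..len).all(|i| !valid[i] || lhs[i] == rhs[i])
    }
}
```

With a null density of 0.5, as in equal_nulls_512, this sketch takes the per-index branch, matching the observation that the slice path gives no improvement there.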
Benchmark runs are scheduled for baseline = 1daf7d3 and contender = f985818. f985818 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Closes #1600.
Rationale for this change
This mostly improves the performance of long arrays with few nulls, e.g. the added benchmark:
For arrays with many nulls, using SlicesIterator actually causes a regression, so I only use SlicesIterator for the cases with fewer nulls.
What changes are included in this PR?
Are there any user-facing changes?