Support statistics pruning for formats other than parquet #380
Conversation
```rust
/// Note this function takes a slice of statistics as a parameter
/// to amortize the cost of the evaluation of the predicate
/// against a single record batch.
pub fn prune<S: PruningStatistics>(&self, statistics: &[S]) -> Result<Vec<bool>> {
```
Here is the heart of the change -- rather than building up Arrays directly from `ParquetMetadata`, this PR now builds the arrays up from `ScalarValue`s provided by the `PruningStatistics` trait.

I also tried to improve the comments to make it easier to follow what is going on.
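To make the shape of that concrete, here is a minimal hypothetical sketch (not code from this PR) of gathering one statistics column from the trait; the null handling, the Int32 type, and the `iter_to_array` signature are all assumptions:

```rust
use arrow::array::ArrayRef;
use datafusion::error::Result;
use datafusion::scalar::ScalarValue;

// Hypothetical helper; assumes the `PruningStatistics` trait added
// by this PR is in scope (module path omitted).
// Gathers the per-container minimums of one column into a single
// Arrow array, which the predicate expression is then evaluated against.
fn min_column<S: PruningStatistics>(statistics: &[S], column: &str) -> Result<ArrayRef> {
    let mins = statistics.iter().map(|s| {
        // A missing statistic becomes a null; Int32 is assumed here
        // purely for illustration.
        s.min_value(column).unwrap_or(ScalarValue::Int32(None))
    });
    // `ScalarValue::iter_to_array` is the conversion added in #381;
    // the exact signature (owned values vs. references) is assumed.
    ScalarValue::iter_to_array(mins)
}
```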
```rust
// here the first max value is None and not the Some(10) value which was actually set
// because the min value is None
assert_eq!(int32_vec, vec![None, Some(20), Some(30)]);

fn test_build_statistics_record_batch() {
```
this shows how creating the statistics record batch works
```
@@ -748,4 +821,40 @@ mod tests {
        Ok(())
    }

    #[test]
    fn prune_api() {
```
This shows end-to-end how to use the prune API (what I want to be able to do in IOx)
datafusion/src/scalar.rs:

```
@@ -283,6 +283,155 @@ impl ScalarValue {
        self.to_array_of_size(1)
    }

    /// Converts an iterator of references [`ScalarValue`] into an [`ArrayRef`]
```
I couldn't find any other way to take a bunch of `ScalarValue`s and turn them back into an `Array`. Perhaps I missed something.
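For reference, a usage sketch of the conversion (hypothetical values; whether it takes owned `ScalarValue`s or references is an assumption):

```rust
use datafusion::scalar::ScalarValue;

// Turn a handful of scalars back into a single Arrow array;
// nulls are preserved as nulls in the resulting array.
let scalars = vec![
    ScalarValue::Int32(Some(1)),
    ScalarValue::Int32(None),
    ScalarValue::Int32(Some(3)),
];
let array = ScalarValue::iter_to_array(scalars).unwrap();
assert_eq!(array.len(), 3);
```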
Codecov Report

```
@@            Coverage Diff             @@
##           master     #380      +/-   ##
==========================================
+ Coverage   74.94%   74.98%   +0.04%
==========================================
  Files         146      146
  Lines       24314    24448     +134
==========================================
+ Hits        18221    18332     +111
- Misses       6093     6116      +23
```

Continue to review full report at Codecov.
I think this is really cool. I think it would also be great to have this for in-memory tables.

I agree -- I think all that is needed is to calculate the min/max statistics for each partition (or maybe even each record batch), though we might have to be careful not to slow down queries where it wouldn't help. Maybe it could be opt-in. Or perhaps we could compute the statistics "on demand" (after we have created a `PruningPredicate`).
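As a thought experiment, here is a minimal hypothetical sketch (not part of this PR) of computing such per-batch min/max statistics for an in-memory Int32 column using arrow's aggregate kernels; the column index and type are assumptions:

```rust
use arrow::array::Int32Array;
use arrow::compute::kernels::aggregate::{max, min};
use arrow::record_batch::RecordBatch;
use datafusion::scalar::ScalarValue;

/// Hypothetical helper: compute the min/max of one Int32 column of a
/// RecordBatch as ScalarValues, suitable for feeding a
/// PruningStatistics implementation.
fn batch_min_max(batch: &RecordBatch, column_index: usize) -> (ScalarValue, ScalarValue) {
    let array = batch
        .column(column_index)
        .as_any()
        .downcast_ref::<Int32Array>()
        .expect("Int32 column assumed for this sketch");
    // The kernels return None for all-null (or empty) input, which
    // maps naturally onto a null ScalarValue.
    (
        ScalarValue::Int32(min(array)),
        ScalarValue::Int32(max(array)),
    )
}
```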
LGTM
````rust
/// ```text
/// s1_min | s2_maxx
/// -------+--------
///   5    |  10
````
Do you mean s2_max = 1000?
oh, yes -- that is a great catch
```rust
let p = PruningPredicate::try_new(&expr, schema).unwrap();
let result = p.prune(&[stats1, stats2, stats3, stats4]).unwrap();
let expected = vec![false, true, true, true];
```
Nice
Yes, I think we might also want to look into supporting something like that.
@Dandandan I think this PR is now ready for review.
```rust
/// Interface to pass statistics information to [`PruningPredicates`]
pub trait PruningStatistics {
    /// return the minimum value for the named column, if known
    fn min_value(&self, column: &str) -> Option<ScalarValue>;
```
One concern I have (in general with the current state of DataFusion) is that we use `ScalarValue` a lot in code, which can be detrimental to performance in some cases (compared to a typed array).
Also in this case, if we had a large dataset with 1000s of statistics values, calculating the statistics might be slower than it could be if they were stored in contiguous memory.
Just a thought - not something we should block this PR for.
what would be a better alternative in the long run? generic?
In the longer run I would say we should start pushing more towards typed contiguous arrays (`Array`s or `Vec`s), indeed using generics. For example, here the min and max values per group could be stored in two arrays of the corresponding types, which would be faster and use less memory.
Vectorized processing is what we try to use Arrow for already, so I would say it is good to try to use it in more places where it makes sense.
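To illustrate the contrast with a toy example (hypothetical, not proposed code): the minimums for many containers can either be held one `ScalarValue` at a time or kept in a single typed buffer:

```rust
use arrow::array::Int32Array;
use datafusion::scalar::ScalarValue;

// One enum value per statistic; every use must match on the variant.
let as_scalars = vec![ScalarValue::Int32(Some(5)), ScalarValue::Int32(Some(7))];

// One contiguous, typed buffer; aggregate and comparison kernels
// can scan it directly without per-value dispatch.
let as_array = Int32Array::from(vec![Some(5), Some(7)]);
```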
I very much agree with you, @Dandandan. I have a PR on arrow2 exactly about this: jorgecarleitao/arrow2#56
I am waiting for the experimental repos to be available so that we can discuss it further in apache/*
Ok, I can take a shot at trying to do this.
@Dandandan are you thinking something like:

```rust
pub trait PruningStatistics {
    fn min_values(&self, column: &str) -> Option<ArrayRef>;
    fn max_values(&self, column: &str) -> Option<ArrayRef>;
}
```

Which would be constrained to return `Array`s with the same number of rows?
@alamb yes - something along those lines.
I'll give it a go and see what it looks like
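For a sense of what an implementation of the array-returning variant might look like, a hypothetical sketch (the container, the field names, and the Int32 type are all invented for illustration):

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, Int32Array};

// Hypothetical container: pre-computed per-row-group minimums and
// maximums for a single Int32 column named "s1".
struct InMemoryStats {
    s1_mins: Vec<Option<i32>>,
    s1_maxes: Vec<Option<i32>>,
}

impl PruningStatistics for InMemoryStats {
    fn min_values(&self, column: &str) -> Option<ArrayRef> {
        if column == "s1" {
            // One array entry per row group; unknown minimums are nulls.
            Some(Arc::new(Int32Array::from(self.s1_mins.clone())) as ArrayRef)
        } else {
            None
        }
    }

    fn max_values(&self, column: &str) -> Option<ArrayRef> {
        if column == "s1" {
            Some(Arc::new(Int32Array::from(self.s1_maxes.clone())) as ArrayRef)
        } else {
            None
        }
    }
}
```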
Here is what the alternate API looks like: #426
I would say it is now harder to implement the `PruningStatistics` API in the parquet reader, but the implementation in `pruning.rs` is simpler.
#426 appears to be the more popular option; closing in favor of that one.
Closes #363
Edit: Note: There is an alternate API in #426
Rationale
As explained on #363, the high-level goal is to make the parquet row group pruning logic generic over any type of min/max statistics (not just parquet metadata).
Changes:
- Add a `PruningStatistics` trait
- Change `PruningPredicateBuilder` to be generic in terms of `PruningStatistics`
Example
Here is a brief snippet of one of the tests that shows the new API:
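A sketch reconstructed from the fragments quoted earlier in the review (the schema, expression, statistics values, and import paths are assumptions):

```rust
use std::sync::Arc;
use arrow::datatypes::{DataType, Field, Schema};
use datafusion::logical_plan::{col, lit};
use datafusion::physical_optimizer::pruning::PruningPredicate; // module path assumed

// Containers can be pruned using only their min/max statistics:
// `s1 > 15` can only hold in a container whose max(s1) > 15.
let schema = Arc::new(Schema::new(vec![Field::new("s1", DataType::Int32, true)]));
let expr = col("s1").gt(lit(15));

// stats1..stats4: per-container statistics implementing the new
// `PruningStatistics` trait (definitions omitted in this sketch).
let p = PruningPredicate::try_new(&expr, schema).unwrap();
let result = p.prune(&[stats1, stats2, stats3, stats4]).unwrap();

// One boolean per container: `false` means the container cannot
// match and can be skipped entirely.
let expected = vec![false, true, true, true];
assert_eq!(result, expected);
```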
Sequence:
I am trying to do this in a few small PRs to reduce review burden; here is how they connect together.
Planned changes:
- `Fn` (#370)
- `ScalarValue::iter_to_array` (Function to create `ArrayRef` from an iterator of ScalarValues #381)
- `PruningStatistics` trait (this PR)