-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ref(metrics): Implement more efficient sorted distributions #979
Conversation
We still need to perform a benchmark to decide between the following two approaches:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks ok, some ideas:
- if singles ever gets too large we may want to consider rounding values such that we can have more duplicates
- singles could be some sort of bloom filter contraption
one unsolved question:
- may want to add some tests for NaN since those can have different bit reprs and I am not sure if float_ord gets equality right. The code looks like it does... something but there are no testcases on what happens with different nans. Maybe I am missing something and nans are normalized somewhere else. The other option is to drop nans on the floor.
relay-metrics/src/aggregation.rs
Outdated
#[derive(Clone, Default, PartialEq)] | ||
pub struct DistributionValue { | ||
singles: BTreeSet<FloatOrd<f64>>, | ||
duplicates: BTreeMap<FloatOrd<f64>, usize>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we ever want to use platform-specific integer sizes here? seems like almost all uses of usize in this file should be u64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. I chose usize
since I thought it doesn't matter and since we can't handle more than usize individual values in one bucket anyway. Will change to u64
or maybe even u32
to clarify intent, however.
/// This struct is created by the [`iter`](DistributionValue::iter) method on | ||
/// `DistributionValue`. See its documentation for more. | ||
#[derive(Clone)] | ||
pub struct DistributionIter<'a> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we declare the iterators before DistributionValue
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review, @untitaker!
- if singles ever gets too large we may want to consider rounding values such that we can have more duplicates
We can consider of lowering precision, although I'm not sure what reasonable thresholds should be. At the time, we're committed to keeping full precision in Relay.
- singles could be some sort of bloom filter contraption
This is the form of lossy aggregation we currently leave to storage. Potentially, we add this to Relay in future, too, although then using the same sketching algorithm also used in storage.
may want to add some tests for NaN since those can have different bit reprs and I am not sure if float_ord gets equality right.
Good point, reporting NaN
doesn't make sense on any of our metrics. @jjbayer this is something we can catch at an earlier stage before merging into the aggregation.
We removed the split between
In benchmarking, we noticed that there is only a marginal gain for sparse values with no duplication, but a massive penalty for dense distributions. Using these maps, we can beat a raw BTreeMap by roughly 20% in uncommon cases, but not without side effects. Therefore, keeping it simple is better here. |
Implements a dedicated type for representing distributions in memory. Internally, it uses a B-Tree map to store values deduplicated with counters for how often they occur. In the serialized payloads, it is still a flat list of values.