
ref(metrics): Implement more efficient sorted distributions #979

Merged 5 commits into master from feat/metrics-sorted-distributions on Apr 22, 2021

Conversation

@jan-auer (Member) commented on Apr 20, 2021

Implements a dedicated type for representing distributions in memory. Internally, it uses a B-Tree map to store deduplicated values along with counters for how often each occurs. In serialized payloads, the distribution remains a flat list of values.
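For illustration, a minimal sketch of this layout, assuming the float_ord crate for ordered keys and a u32 counter (field names and the counter width are illustrative, not necessarily the merged code):

use std::collections::BTreeMap;

use float_ord::FloatOrd;

/// Sketch: each distinct value maps to how many times it was recorded.
#[derive(Clone, Default, PartialEq)]
pub struct DistributionValue {
    values: BTreeMap<FloatOrd<f64>, u32>,
}

impl DistributionValue {
    /// Records a value, bumping its counter if it is already present.
    pub fn insert(&mut self, value: f64) {
        *self.values.entry(FloatOrd(value)).or_insert(0) += 1;
    }

    /// Expands the deduplicated map back into the flat, sorted list of
    /// values used in serialized payloads.
    pub fn iter_flat(&self) -> impl Iterator<Item = f64> + '_ {
        self.values
            .iter()
            .flat_map(|(&FloatOrd(value), &count)| std::iter::repeat(value).take(count as usize))
    }
}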



jan-auer requested a review from jjbayer on April 20, 2021 10:34
jan-auer self-assigned this on Apr 20, 2021
jan-auer requested a review from a team on April 20, 2021 10:34
@jan-auer (Member, Author) commented:

We still need to perform a benchmark to decide between the following two approaches:

  1. (current) Keep duplicates also in the set. This makes lookups slightly easier and avoids rebalancing the trees when removing values. However, it creates a theoretical worst case when a large number of values is inserted exactly twice, since each such value then needs 3 slots (key, key, count). See the sketch after this list.

  2. Remove duplicates from the set. This can make the lookup code more straightforward, but requires shrinking the set on insert.
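For reference, a rough sketch of what the insert for approach 1 could look like (a hypothetical free function; singles and duplicates named as in the code under review):

use std::collections::{BTreeMap, BTreeSet};

use float_ord::FloatOrd;

/// Approach 1: duplicates stay in `singles` and are additionally counted in
/// `duplicates`, so nothing ever needs to be removed from the set.
fn insert(
    singles: &mut BTreeSet<FloatOrd<f64>>,
    duplicates: &mut BTreeMap<FloatOrd<f64>, usize>,
    value: f64,
) {
    let key = FloatOrd(value);
    if let Some(count) = duplicates.get_mut(&key) {
        // Seen at least twice before: just bump the counter.
        *count += 1;
    } else if singles.contains(&key) {
        // Second occurrence: the value remains in the set and gains a counter of 2.
        duplicates.insert(key, 2);
    } else {
        // First occurrence: store it in the set only.
        singles.insert(key);
    }
}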

@untitaker (Member) left a comment


this looks ok, some ideas:

  • if singles ever gets too large we may want to consider rounding values such that we can have more duplicates
  • singles could be some sort of bloom filter contraption

one unsolved question:

  • may want to add some tests for NaN since those can have different bit reprs and I am not sure if float_ord gets equality right. The code looks like it does... something, but there are no test cases for what happens with different NaNs. Maybe I am missing something and NaNs are normalized somewhere else. The other option is to drop NaNs on the floor.
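As a quick illustration of the concern (not from the PR): NaN has many bit patterns, and IEEE comparison never considers NaNs equal, which is exactly why a bit-based Ord/Eq wrapper needs explicit tests:

#[test]
fn nan_bit_patterns_differ() {
    let quiet_nan = f64::NAN;
    // Flip the lowest mantissa bit: still a NaN, but a different bit pattern.
    let other_nan = f64::from_bits(quiet_nan.to_bits() ^ 1);

    assert!(quiet_nan.is_nan() && other_nan.is_nan());
    assert_ne!(quiet_nan.to_bits(), other_nan.to_bits());
    // IEEE equality never holds for NaN, even against itself.
    assert_ne!(quiet_nan, quiet_nan);
}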

#[derive(Clone, Default, PartialEq)]
pub struct DistributionValue {
    singles: BTreeSet<FloatOrd<f64>>,
    duplicates: BTreeMap<FloatOrd<f64>, usize>,
@untitaker (Member) commented on Apr 21, 2021


do we ever want to use platform-specific integer sizes here? seems like almost all uses of usize in this file should be u64

@jan-auer (Member, Author) replied:


You're right. I chose usize because I thought it wouldn't matter, and because we can't handle more than usize individual values in one bucket anyway. Will change to u64 or maybe even u32 to clarify intent, however.

/// This struct is created by the [`iter`](DistributionValue::iter) method on
/// `DistributionValue`. See its documentation for more.
#[derive(Clone)]
pub struct DistributionIter<'a> {
A reviewer (Member) commented:

Should we declare the iterators before DistributionValue?

@jan-auer (Member, Author) replied:

Thanks for the review, @untitaker!

  • if singles ever gets too large we may want to consider rounding values such that we can have more duplicates

We can consider lowering precision, although I'm not sure what reasonable thresholds would be. For the time being, we're committed to keeping full precision in Relay.

  • singles could be some sort of bloom filter contraption

This is the kind of lossy aggregation we currently leave to storage. We may add it to Relay in the future, too, but then using the same sketching algorithm that storage uses.

may want to add some tests for NaN since those can have different bit reprs and I am not sure if float_ord gets equality right.

Good point; reporting NaN doesn't make sense for any of our metrics. @jjbayer this is something we can catch at an earlier stage, before merging values into the aggregation.
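A hypothetical guard at that earlier stage could be as simple as rejecting non-finite values before they ever reach the aggregation (illustrative only, not the merged code):

/// Drops NaN and infinite values before they can become keys in a distribution.
fn validate_metric_value(value: f64) -> Option<f64> {
    if value.is_finite() {
        Some(value)
    } else {
        // NaN and +/- infinity are rejected up front.
        None
    }
}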

@jan-auer (Member, Author) commented:

We have now removed the split between singles and duplicates. The original implementation was documented with this:

Internally, it uses a B-Tree set and a B-Tree map for storing values:

  • Single values are stored in the set. This optimizes for cases where values are spread out far and there is a low duplication ratio.
  • Duplicate values are stored in the map with a counter. The second time a value is inserted, it remains in the set but is added to the map with a counter of 2. From there on, the counter is incremented.

Runtime complexity for inserts and lookups is slightly worse than for a plain B-Tree map, at 2 * log n compared to log n. Space complexity is better on average, significantly so for sparse distributions, with the worst case being that every element is inserted exactly twice.

In benchmarking, we noticed only a marginal gain for sparse values with no duplication, but a massive penalty for dense distributions. Using these two maps, we can beat a raw BTreeMap by roughly 20% in uncommon cases, but not without side effects. Therefore, keeping it simple is better here.

* master:
  feat(protocol): Add frame.stack_start for async stack traces (#981)
  release: 21.4.1
  Propagate the Relay logo to the rest of the docs (#978)
jan-auer enabled auto-merge (squash) on April 22, 2021 13:43
jan-auer merged commit fa40a77 into master on Apr 22, 2021
jan-auer deleted the feat/metrics-sorted-distributions branch on April 22, 2021 13:46