
ref(metrics): Implement more efficient sorted distributions #979

Merged 5 commits into master from feat/metrics-sorted-distributions on Apr 22, 2021

Conversation

@jan-auer (Member) commented on Apr 20, 2021

Implements a dedicated type for representing distributions in memory. Internally, it uses a B-Tree map to store deduplicated values along with counters for how often each occurs. In serialized payloads, the distribution remains a flat list of values.
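For illustration, a minimal sketch of this layout, assuming the float_ord crate for ordered keys and a u32 counter (field names and the counter width are illustrative, not necessarily the merged code):

use std::collections::BTreeMap;

use float_ord::FloatOrd;

/// Sketch: each distinct value maps to how many times it was recorded.
#[derive(Clone, Default, PartialEq)]
pub struct DistributionValue {
    values: BTreeMap<FloatOrd<f64>, u32>,
}

impl DistributionValue {
    /// Records a value, bumping its counter if it is already present.
    pub fn insert(&mut self, value: f64) {
        *self.values.entry(FloatOrd(value)).or_insert(0) += 1;
    }

    /// Expands the deduplicated map back into the flat, sorted list of
    /// values used in serialized payloads.
    pub fn iter_flat(&self) -> impl Iterator<Item = f64> + '_ {
        self.values
            .iter()
            .flat_map(|(&FloatOrd(value), &count)| std::iter::repeat(value).take(count as usize))
    }
}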



jan-auer requested a review from jjbayer on April 20, 2021 10:34
jan-auer self-assigned this on Apr 20, 2021
jan-auer requested a review from a team on April 20, 2021 10:34
@jan-auer (Member, Author) commented:

We still need to perform a benchmark to decide between the following two approaches:

  1. (current) Keep duplicates also in the set. This makes lookups slightly easier and avoids rebalancing the trees when removing values. However, it creates a theoretical worst case when a large number of values is inserted exactly twice, since each such value then needs 3 slots (key, key, count). See the sketch after this list.

  2. Remove duplicates from the set. This can make the lookup code more straightforward, but requires shrinking the set on insert.
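For reference, a rough sketch of what the insert for approach 1 could look like (a hypothetical free function; singles and duplicates named as in the code under review):

use std::collections::{BTreeMap, BTreeSet};

use float_ord::FloatOrd;

/// Approach 1: duplicates stay in `singles` and are additionally counted in
/// `duplicates`, so nothing ever needs to be removed from the set.
fn insert(
    singles: &mut BTreeSet<FloatOrd<f64>>,
    duplicates: &mut BTreeMap<FloatOrd<f64>, usize>,
    value: f64,
) {
    let key = FloatOrd(value);
    if let Some(count) = duplicates.get_mut(&key) {
        // Seen at least twice before: just bump the counter.
        *count += 1;
    } else if singles.contains(&key) {
        // Second occurrence: the value remains in the set and gains a counter of 2.
        duplicates.insert(key, 2);
    } else {
        // First occurrence: store it in the set only.
        singles.insert(key);
    }
}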

@untitaker (Member) left a comment


this looks ok, some ideas:

  • if singles ever gets too large we may want to consider rounding values such that we can have more duplicates
  • singles could be some sort of bloom filter contraption

one unsolved question:

  • may want to add some tests for NaN since those can have different bit reprs and I am not sure if float_ord gets equality right. The code looks like it does... something, but there are no test cases for what happens with different NaNs. Maybe I am missing something and NaNs are normalized somewhere else. The other option is to drop NaNs on the floor.
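As a quick illustration of the concern (not from the PR): NaN has many bit patterns, and IEEE comparison never considers NaNs equal, which is exactly why a bit-based Ord/Eq wrapper needs explicit tests:

#[test]
fn nan_bit_patterns_differ() {
    let quiet_nan = f64::NAN;
    // Flip the lowest mantissa bit: still a NaN, but a different bit pattern.
    let other_nan = f64::from_bits(quiet_nan.to_bits() ^ 1);

    assert!(quiet_nan.is_nan() && other_nan.is_nan());
    assert_ne!(quiet_nan.to_bits(), other_nan.to_bits());
    // IEEE equality never holds for NaN, even against itself.
    assert_ne!(quiet_nan, quiet_nan);
}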

#[derive(Clone, Default, PartialEq)]
pub struct DistributionValue {
    singles: BTreeSet<FloatOrd<f64>>,
    duplicates: BTreeMap<FloatOrd<f64>, usize>,
@untitaker (Member) commented on Apr 21, 2021


do we ever want to use platform-specific integer sizes here? seems like almost all uses of usize in this file should be u64

@jan-auer (Member, Author) replied:


You're right. I chose usize because I thought it wouldn't matter, and because we can't handle more than usize individual values in one bucket anyway. Will change to u64 or maybe even u32 to clarify intent, however.

/// This struct is created by the [`iter`](DistributionValue::iter) method on
/// `DistributionValue`. See its documentation for more.
#[derive(Clone)]
pub struct DistributionIter<'a> {
A reviewer (Member) commented:

Should we declare the iterators before DistributionValue?

@jan-auer (Member, Author) replied:

Thanks for the review, @untitaker!

  • if singles ever gets too large we may want to consider rounding values such that we can have more duplicates

We can consider lowering precision, although I'm not sure what reasonable thresholds would be. For the time being, we're committed to keeping full precision in Relay.

  • singles could be some sort of bloom filter contraption

This is the kind of lossy aggregation we currently leave to storage. We may add it to Relay in the future, too, but then using the same sketching algorithm that storage uses.

may want to add some tests for NaN since those can have different bit reprs and I am not sure if float_ord gets equality right.

Good point; reporting NaN doesn't make sense for any of our metrics. @jjbayer this is something we can catch at an earlier stage, before merging values into the aggregation.
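A hypothetical guard at that earlier stage could be as simple as rejecting non-finite values before they ever reach the aggregation (illustrative only, not the merged code):

/// Drops NaN and infinite values before they can become keys in a distribution.
fn validate_metric_value(value: f64) -> Option<f64> {
    if value.is_finite() {
        Some(value)
    } else {
        // NaN and +/- infinity are rejected up front.
        None
    }
}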

@jan-auer (Member, Author) commented:

We have now removed the split between singles and duplicates. The original implementation was documented with this:

Internally, it uses a B-Tree set and a B-Tree map for storing values:

  • Single values are stored in the set. This optimizes for cases where values are spread out far and there is a low duplication ratio.
  • Duplicate values are stored in the map with a counter. The second time a value is inserted, it remains in the set but is added to the map with a counter of 2. From there on, the counter is incremented.

Runtime complexity for inserts and lookups is slightly worse than for a plain B-Tree map, at 2 * log n compared to log n. Space complexity is better on average, significantly so for sparse distributions, with the worst case being that every element is inserted exactly twice.

In benchmarking, we noticed only a marginal gain for sparse values with no duplication, but a massive penalty for dense distributions. Using these two maps, we can beat a raw BTreeMap by roughly 20% in uncommon cases, but not without side effects. Therefore, keeping it simple is better here.

* master:
  feat(protocol): Add frame.stack_start for async stack traces (#981)
  release: 21.4.1
  Propagate the Relay logo to the rest of the docs (#978)
jan-auer enabled auto-merge (squash) on April 22, 2021 13:43
jan-auer merged commit fa40a77 into master on Apr 22, 2021
jan-auer deleted the feat/metrics-sorted-distributions branch on April 22, 2021 13:46