
Precalculated attribute set hashes #1407

Merged: 4 commits merged into open-telemetry:main on Nov 28, 2023

Conversation

KallDrexx (Contributor) commented Nov 27, 2023

The hash of an `AttributeSet` is expensive to compute, as every key and value in the attribute set must be hashed. This hash is used by the `ValueMap` to look up whether we are already aggregating a time series for this set of attributes. Since this hashmap lookup occurs inside a mutex lock, no other counters can execute their `add()` calls while the hash is being calculated, which causes contention in high-throughput scenarios.

This PR calculates and caches the hash at creation time. This improves throughput because the hash is computed by the thread creating the `AttributeSet`, outside of any mutex locks, meaning hashes can be computed in parallel and the time spent holding the mutex is reduced. The benefit of the shorter lock times should grow as larger attribute sets are used for time series.
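As a rough illustration of the pattern described above (a minimal sketch, not the actual SDK code — the struct layout, field names, and use of `DefaultHasher` are assumptions for the example), the hash can be computed once in the constructor and replayed in the `Hash` impl, so map lookups under the mutex no longer walk every key/value pair:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical attribute set with a precomputed hash.
#[derive(Debug, Clone, PartialEq, Eq)]
struct AttributeSet {
    // Sorted (key, value) pairs so equal sets hash identically.
    entries: Vec<(String, String)>,
    // Cached hash, computed once at creation (8 extra bytes per time series).
    hash: u64,
}

impl AttributeSet {
    fn new(mut entries: Vec<(String, String)>) -> Self {
        entries.sort(); // canonical order: insertion order must not change the hash
        let mut hasher = DefaultHasher::new();
        entries.hash(&mut hasher);
        let hash = hasher.finish();
        AttributeSet { entries, hash }
    }
}

// Hashing now just writes the cached value instead of re-hashing
// every key and value while the ValueMap's mutex is held.
impl Hash for AttributeSet {
    fn hash<H: Hasher>(&self, state: &mut H) {
        state.write_u64(self.hash);
    }
}

fn main() {
    // The same attributes in a different order yield the same cached hash.
    let a = AttributeSet::new(vec![
        ("service".into(), "api".into()),
        ("region".into(), "us-east".into()),
    ]);
    let b = AttributeSet::new(vec![
        ("region".into(), "us-east".into()),
        ("service".into(), "api".into()),
    ]);
    assert_eq!(a.hash, b.hash);
    println!("cached hash: {}", a.hash);
}
```

The key design point is that the expensive per-entry hashing moves to the caller's thread, before the `ValueMap` lock is taken; inside the lock, hashing degenerates to a single `u64` write.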

The stress test results of this change for different thread counts are:

| Thread Count | Main      | PR        |
| ------------ | --------- | --------- |
| 2            | 3,376,040 | 3,310,920 |
| 3            | 5,908,640 | 5,807,240 |
| 4            | 3,382,040 | 8,094,960 |
| 5            | 1,212,640 | 9,086,520 |
| 6            | 1,225,280 | 6,595,600 |

The non-precomputed version starts seeing contention at 4 threads and drops substantially after that, while the precomputed version doesn't see contention until 6 threads, and even then it retains 5-6x more throughput thanks to the reduced lock hold times.

While these benchmarks may not be "realistic" (most applications do more work between counter updates), they do show better parallelism and an opportunity to reduce lock contention, at a cost of only 8 bytes per time series (a total of 16 KB of additional memory at maximum cardinality).

Benchmark results:

main:

```
Counter_Add_Sorted      time:   [704.41 ns 710.47 ns 717.13 ns]
Found 14 outliers among 100 measurements (14.00%)
  10 (10.00%) high mild
  4 (4.00%) high severe

Counter_Add_Unsorted    time:   [723.78 ns 749.82 ns 781.35 ns]
Found 11 outliers among 100 measurements (11.00%)
  9 (9.00%) high mild
  2 (2.00%) high severe
```

PR:

```
Counter_Add_Sorted      time:   [713.96 ns 717.19 ns 721.08 ns]
                        change: [+1.6730% +3.0990% +4.6842%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe

Counter_Add_Unsorted    time:   [713.50 ns 716.58 ns 720.07 ns]
                        change: [-10.737% -7.1220% -3.8036%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe
```


Merge requirement checklist

  • CONTRIBUTING guidelines followed
  • Unit tests added/updated (if applicable)
  • Appropriate CHANGELOG.md files updated for non-trivial, user-facing changes
  • Changes in public API reviewed (if applicable)

KallDrexx requested a review from a team November 27, 2023 20:37
KallDrexx (Contributor, Author)

Also note that precomputing hashes has the opportunity for further benefits if bounded instruments are implemented, or if #1387 is done.

lalitb (Member) commented Nov 27, 2023

Thanks for the PR. Can you also add the benchmark result from main branch (without these changes) for comparison?

KallDrexx (Contributor, Author)

> Thanks for the PR. Can you also add the benchmark result from main branch (without these changes) for comparison?

Can you be more specific about which benchmarks you are looking for? I posted stress test results comparing main and this PR, as well as the Counter_Add benchmarks for both.

codecov bot commented Nov 27, 2023

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (897e70a) 57.3% compared to head (43b69eb) 57.3%.

| Files | Patch % | Lines |
| ----- | ------- | ----- |
| opentelemetry-sdk/src/attributes/set.rs | 88.8% | 2 Missing ⚠️ |
Additional details and impacted files:

```
@@          Coverage Diff          @@
##            main   #1407   +/-   ##
=====================================
  Coverage   57.3%   57.3%
=====================================
  Files        146     146
  Lines      18179   18190   +11
=====================================
+ Hits       10422   10434   +12
+ Misses      7757    7756    -1
```

☔ View full report in Codecov by Sentry.

KallDrexx (Contributor, Author)

I added the benchmark run I had done previously from main before switching back to the PR for a run.

lalitb (Member) commented Nov 27, 2023

> I added the benchmark run I had done previously from main before switching back to the PR for a run.

Yeah, I meant the benchmark from main. Probably I missed it if it was already there. Thanks :)

cijothomas (Member)

#1405 might be skewing some results. When we think we use 2 threads due to 2 CPUs, we are actually running 1 less!

cijothomas (Member)

> I added the benchmark run I had done previously from main before switching back to the PR for a run.

> Yeah, I meant the benchmark from main. Probably I missed it if it was already there. Thanks :)

Benchmarks are probably not the right tool to show gains from this change (or anything related to improving contention)! The stress test is proving to be helpful, despite some limitations!

lalitb (Member) commented Nov 28, 2023

> Benchmarks are probably not the right tool to show gains from this change (or anything related to improving contention)! The stress test is proving to be helpful, despite some limitations!

Not to see the gain, but to ensure that there is no adverse effect on benchmark results. Not related to this PR, but optimizing for concurrency can sometimes introduce additional overhead in single-threaded or low-load scenarios. So good to compare both :)

cijothomas (Member)

> Not to see the gain, but to ensure that there is no adverse effect on benchmark results. Not related to this PR, but optimizing for concurrency can sometimes introduce additional overhead in single-threaded or low-load scenarios. So good to compare both :)

Good point! thanks

KallDrexx (Contributor, Author) commented Nov 28, 2023

> #1405 might be skewing some results.. When we think we use 2 threads due to 2 cpus, we are doing 1 less actually!

I had noticed this the other day when I tried setting thread count to 1 and got 0 throughput.

That being said, even if you subtract 1 from the thread count values in my table, you get the same end result, since the point is when contention happens and how contention affects progress between main and this PR.

shaun-cox (Contributor) left a comment


Thanks!

hdost merged commit c0104d3 into open-telemetry:main on Nov 28, 2023. 15 checks passed.
KallDrexx deleted the precached_attribute_set_hashes branch November 28, 2023 17:09