
Precalculated attribute set hashes #1407

Merged: 4 commits merged into open-telemetry:main on Nov 28, 2023

Conversation

KallDrexx (Contributor) commented Nov 27, 2023

The hash of an `AttributeSet` is expensive to compute, as every key and value in the attribute set must be hashed. This hash is used by the `ValueMap` to look up whether we are already aggregating a time series for this set of attributes. Since this hashmap lookup occurs inside a mutex lock, no other counters can execute their `add()` calls while the hash is being calculated, which causes contention in high-throughput scenarios.

This PR calculates and caches the hash at creation time. This improves throughput because the hash is computed by the thread creating the `AttributeSet`, outside of any mutex locks, meaning hashes can be computed in parallel and the time spent holding the mutex is reduced. The benefit of the shorter lock times should grow as larger attribute sets are used for time series.
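As a rough illustration of the pattern described above (a minimal sketch, not the actual SDK code — the struct layout, field names, and use of `DefaultHasher` are assumptions for the example), the hash can be computed once in the constructor and replayed in the `Hash` impl, so map lookups under the mutex no longer walk every key/value pair:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical attribute set with a precomputed hash.
#[derive(Debug, Clone, PartialEq, Eq)]
struct AttributeSet {
    // Sorted (key, value) pairs so equal sets hash identically.
    entries: Vec<(String, String)>,
    // Cached hash, computed once at creation (8 extra bytes per time series).
    hash: u64,
}

impl AttributeSet {
    fn new(mut entries: Vec<(String, String)>) -> Self {
        entries.sort(); // canonical order: insertion order must not change the hash
        let mut hasher = DefaultHasher::new();
        entries.hash(&mut hasher);
        let hash = hasher.finish();
        AttributeSet { entries, hash }
    }
}

// Hashing now just writes the cached value instead of re-hashing
// every key and value while the ValueMap's mutex is held.
impl Hash for AttributeSet {
    fn hash<H: Hasher>(&self, state: &mut H) {
        state.write_u64(self.hash);
    }
}

fn main() {
    // The same attributes in a different order yield the same cached hash.
    let a = AttributeSet::new(vec![
        ("service".into(), "api".into()),
        ("region".into(), "us-east".into()),
    ]);
    let b = AttributeSet::new(vec![
        ("region".into(), "us-east".into()),
        ("service".into(), "api".into()),
    ]);
    assert_eq!(a.hash, b.hash);
    println!("cached hash: {}", a.hash);
}
```

The key design point is that the expensive per-entry hashing moves to the caller's thread, before the `ValueMap` lock is taken; inside the lock, hashing degenerates to a single `u64` write.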

The stress test results of this change for different thread counts are:

| Thread Count | Main      | PR        |
| ------------ | --------- | --------- |
| 2            | 3,376,040 | 3,310,920 |
| 3            | 5,908,640 | 5,807,240 |
| 4            | 3,382,040 | 8,094,960 |
| 5            | 1,212,640 | 9,086,520 |
| 6            | 1,225,280 | 6,595,600 |

The non-precomputed version starts seeing contention at 4 threads and drops substantially after that, while the precomputed version doesn't see contention until 6 threads, and even then it retains 5-6x more throughput thanks to the reduced lock hold times.

While these benchmarks may not be "realistic" (most applications do more work between counter updates), they do show better parallelism and an opportunity to reduce lock contention, at a cost of only 8 bytes per time series (a total of 16 KB of additional memory at maximum cardinality).

Benchmark results:

main:

```
Counter_Add_Sorted      time:   [704.41 ns 710.47 ns 717.13 ns]
Found 14 outliers among 100 measurements (14.00%)
  10 (10.00%) high mild
  4 (4.00%) high severe

Counter_Add_Unsorted    time:   [723.78 ns 749.82 ns 781.35 ns]
Found 11 outliers among 100 measurements (11.00%)
  9 (9.00%) high mild
  2 (2.00%) high severe
```

PR:

```
Counter_Add_Sorted      time:   [713.96 ns 717.19 ns 721.08 ns]
                        change: [+1.6730% +3.0990% +4.6842%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe

Counter_Add_Unsorted    time:   [713.50 ns 716.58 ns 720.07 ns]
                        change: [-10.737% -7.1220% -3.8036%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe
```


Merge requirement checklist

  • CONTRIBUTING guidelines followed
  • Unit tests added/updated (if applicable)
  • Appropriate CHANGELOG.md files updated for non-trivial, user-facing changes
  • Changes in public API reviewed (if applicable)

KallDrexx requested a review from a team November 27, 2023 20:37
KallDrexx (Contributor, Author)

Also note that precomputing hashes has the opportunity for further benefits if bounded instruments are implemented, or if #1387 is done.

lalitb (Member) commented Nov 27, 2023

Thanks for the PR. Can you also add the benchmark result from main branch (without these changes) for comparison?

KallDrexx (Contributor, Author)

> Thanks for the PR. Can you also add the benchmark result from main branch (without these changes) for comparison?

Can you be more specific about which benchmarks you are looking for? I posted stress test results comparing main and this PR, as well as the Counter_Add benchmarks for both.

codecov bot commented Nov 27, 2023

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (897e70a) 57.3% compared to head (43b69eb) 57.3%.

| Files | Patch % | Lines |
| ----- | ------- | ----- |
| opentelemetry-sdk/src/attributes/set.rs | 88.8% | 2 Missing ⚠️ |
Additional details and impacted files:

```
@@          Coverage Diff          @@
##            main   #1407   +/-   ##
=====================================
  Coverage   57.3%   57.3%
=====================================
  Files        146     146
  Lines      18179   18190   +11
=====================================
+ Hits       10422   10434   +12
+ Misses      7757    7756    -1
```

☔ View full report in Codecov by Sentry.

KallDrexx (Contributor, Author)

I added the benchmark run I had done previously from main before switching back to the PR for a run.

lalitb (Member) commented Nov 27, 2023

> I added the benchmark run I had done previously from main before switching back to the PR for a run.

Yeah, I meant the benchmark from main. Probably I missed it if it was already there. Thanks :)

cijothomas (Member)

#1405 might be skewing some results. When we think we use 2 threads due to 2 CPUs, we are actually running 1 less!

cijothomas (Member)

> I added the benchmark run I had done previously from main before switching back to the PR for a run.

> Yeah, I meant the benchmark from main. Probably I missed it if it was already there. Thanks :)

Benchmarks are probably not the right tool to show gains from this change (or anything related to improving contention)! The stress test is proving to be helpful, despite some limitations!

lalitb (Member) commented Nov 28, 2023

> Benchmarks are probably not the right tool to show gains from this change (or anything related to improving contention)! The stress test is proving to be helpful, despite some limitations!

Not to see the gain, but to ensure that there is no adverse effect on benchmark results. Not related to this PR, but optimizing for concurrency can sometimes introduce additional overhead in single-threaded or low-load scenarios. So good to compare both :)

cijothomas (Member)

> Not to see the gain, but to ensure that there is no adverse effect on benchmark results. Not related to this PR, but optimizing for concurrency can sometimes introduce additional overhead in single-threaded or low-load scenarios. So good to compare both :)

Good point! thanks

KallDrexx (Contributor, Author) commented Nov 28, 2023

> #1405 might be skewing some results.. When we think we use 2 threads due to 2 cpus, we are doing 1 less actually!

I had noticed this the other day when I tried setting thread count to 1 and got 0 throughput.

That being said, even if you subtract 1 from the thread count values in my table, you get the same end result, since the point is when contention happens and how contention affects progress between main and this PR.

shaun-cox (Contributor) left a comment


Thanks!

hdost merged commit c0104d3 into open-telemetry:main on Nov 28, 2023. 15 checks passed.
KallDrexx deleted the precached_attribute_set_hashes branch November 28, 2023 17:09