Precalculated attribute set hashes #1407
Conversation
The hash of an `AttributeSet` is expensive to compute, as it has to cover each key and value in the attribute set. This hash is used by the `ValueMap` to look up whether we are already aggregating a time series for this set of attributes. Since this hashmap lookup occurs inside a mutex lock, no other counters can execute their `add()` calls while the hash is being calculated, which creates contention in high-throughput scenarios.

This PR calculates and caches the hash at creation time. This improves throughput because the hash is computed by the thread creating the `AttributeSet`, outside of any mutex locks, meaning hashes can be computed in parallel and the time spent within a mutex lock is reduced. As larger sets of attributes are used for time series, the benefit of the reduced lock time should be greater.

The stress test results of this change for different thread counts are:

| Thread Count | Main      | PR        |
| ------------ | --------- | --------- |
| 2            | 3,376,040 | 3,310,920 |
| 3            | 5,908,640 | 5,807,240 |
| 4            | 3,382,040 | 8,094,960 |
| 5            | 1,212,640 | 9,086,520 |
| 6            | 1,225,280 | 6,595,600 |

The non-precomputed hashes start seeing contention at 4 threads, and throughput drops substantially after that, while the precomputed hashes don't start seeing contention until 6 threads, and even then we still have 5-6x more throughput after contention due to the reduced locking times.

While these benchmarks may not be "realistic" (since most applications will be doing more work in between counter updates), they do show a benefit of better parallelism and the opportunity to reduce lock contention, at the cost of only 8 bytes per time series (a total of 16 KB of additional memory at maximum cardinality).

Benchmark results:

```
Counter_Add_Sorted      time:   [713.96 ns 717.19 ns 721.08 ns]
                        change: [+1.6730% +3.0990% +4.6842%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe

Counter_Add_Unsorted    time:   [713.50 ns 716.58 ns 720.07 ns]
                        change: [-10.737% -7.1220% -3.8036%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe
```
Also note that precomputing hashes opens the opportunity for further benefits if bounded instruments are implemented, or if #1387 is done.
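To illustrate the caching pattern described above, here is a minimal, hypothetical sketch (not the actual opentelemetry-rust code): the sorted key/value pairs are hashed once at construction time, and the `Hash` impl then replays the cached `u64`, so the `HashMap` lookup inside the mutex costs a single integer write regardless of how many attributes the set holds.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical simplified AttributeSet: attributes are plain
// string key/value pairs, and the hash is precomputed at creation.
#[derive(Debug)]
struct AttributeSet {
    attributes: Vec<(String, String)>,
    hash: u64, // cached at creation, outside any mutex
}

impl AttributeSet {
    fn new(mut attributes: Vec<(String, String)>) -> Self {
        // Sort so that logically equal sets produce equal hashes.
        attributes.sort();
        let mut hasher = DefaultHasher::new();
        attributes.hash(&mut hasher); // the expensive, per-key/value walk
        let hash = hasher.finish();
        AttributeSet { attributes, hash }
    }
}

// Hashing is now O(1): just replay the cached value.
impl Hash for AttributeSet {
    fn hash<H: Hasher>(&self, state: &mut H) {
        state.write_u64(self.hash);
    }
}

// Equality still compares the full attribute list, so HashMap
// collisions are resolved correctly.
impl PartialEq for AttributeSet {
    fn eq(&self, other: &Self) -> bool {
        self.attributes == other.attributes
    }
}
impl Eq for AttributeSet {}
```

The design choice to sort at construction also means two sets built from the same attributes in different orders share one cached hash, which is what lets the time-series lookup treat them as the same series.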
Thanks for the PR. Can you also add the benchmark results from the main branch (without these changes) for comparison?
Can you be more specific about which benchmarks you are looking for? I posted stress test results between main and this PR above.
Codecov Report

```diff
@@           Coverage Diff           @@
##            main   #1407    +/-   ##
=====================================
  Coverage   57.3%   57.3%
=====================================
  Files        146     146
  Lines      18179   18190    +11
=====================================
+ Hits       10422   10434    +12
+ Misses      7757    7756     -1
```
I added the benchmark run I had done previously on main, before switching back to the PR for a run.
Yeah, I meant the benchmark from main. Probably I missed it if it was already there. Thanks :)
#1405 might be skewing some results. When we think we are using 2 threads because there are 2 CPUs, we are actually using one fewer!
Benchmarks are probably not the right tool to show gains from this change (or anything related to improving contention)! The stress test is proving to be helpful, despite some limitations!
Not to see the gain, but to ensure that there is no adverse effect on the benchmark results. Not related to this PR, but optimizing for concurrency can sometimes introduce additional overhead in single-threaded or low-load scenarios. So it's good to compare both :)
Good point! Thanks.
I had noticed this the other day when I tried setting the thread count to 1 and got 0 throughput. That being said, even if you take 1 away from the
Thanks!
Fixes #
Design discussion issue (if applicable) #
Changes

Please provide a brief description of the changes here.

Merge requirement checklist

- [ ] `CHANGELOG.md` files updated for non-trivial, user-facing changes