Replace mix of atomics and rwmutex with mutex around key material #58

Open

raggi wants to merge 2 commits into master from raggi/no-atomic-next
Conversation

raggi (Contributor) commented Sep 13, 2022

This reduces the overhead of various code paths, as there are fewer pipeline delays and more code can be inlined. The mutex fast path is a single atomic operation, and this mutex is never held for very long.

Microbenchmark results in each commit show no micro-scale difference; macro-scale tests demonstrate a slight speed-up, though well within the noise of other external effects: an arm64 test on a t4g.nano improved from 642 Mbps to 646 Mbps, and an x64 test on an m6i.xlarge from 975 Mbps to 1.02 Gbps.
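
As an illustration only, here is a minimal Go sketch of the kind of change described; the type and field names are simplified stand-ins, not the actual wireguard-go definitions:

package sketch

import (
	"sync"
	"sync/atomic"
)

// Keypair stands in for the real key material in this sketch.
type Keypair struct{}

// Before (illustrative): related fields are guarded by a mix of an RWMutex
// and an atomic pointer, so readers combine two synchronization mechanisms.
type keypairsBefore struct {
	sync.RWMutex
	current  *Keypair
	previous *Keypair
	next     atomic.Pointer[Keypair] // loaded and stored outside the lock
}

// After (illustrative): a single Mutex guards all of the key material. The
// critical sections stay short, so the Mutex fast path (one CAS) is enough.
type keypairsAfter struct {
	sync.Mutex
	current  *Keypair
	previous *Keypair
	next     *Keypair
}

// Current returns the current keypair under the lock.
func (kp *keypairsAfter) Current() *Keypair {
	kp.Lock()
	defer kp.Unlock()
	return kp.current
}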

zx2c4 (Member) commented Feb 7, 2023

Why do you suppose the micro tests show no performance improvement but the macro tests do?

Review comment on device/device.go (outdated, resolved)
raggi added 2 commits February 9, 2023 15:08
name                              old time/op      new time/op      delta
TrieIPv4Peers100Addresses1000-32      83.5ns ± 0%      89.0ns ± 0%   ~     (p=1.000 n=1+1)
TrieIPv4Peers10Addresses10-32         33.6ns ± 0%      33.3ns ± 0%   ~     (p=1.000 n=1+1)
TrieIPv6Peers100Addresses1000-32      83.3ns ± 0%      82.7ns ± 0%   ~     (p=1.000 n=1+1)
TrieIPv6Peers10Addresses10-32         33.5ns ± 0%      33.4ns ± 0%   ~     (p=1.000 n=1+1)
Latency-32                             216µs ± 0%       211µs ± 0%   ~     (p=1.000 n=1+1)
Throughput-32                         2.31µs ± 0%      2.25µs ± 0%   ~     (p=1.000 n=1+1)
UAPIGet-32                            2.28µs ± 0%      2.12µs ± 0%   ~     (p=1.000 n=1+1)
WaitPool-32                           4.18µs ± 0%      4.06µs ± 0%   ~     (p=1.000 n=1+1)

name                              old packet-loss  new packet-loss  delta
Throughput-32                           0.00 ± 0%        0.00 ± 0%   ~     (p=1.000 n=1+1)

name                              old alloc/op     new alloc/op     delta
UAPIGet-32                              224B ± 0%        224B ± 0%   ~     (all equal)

name                              old allocs/op    new allocs/op    delta
UAPIGet-32                              17.0 ± 0%        17.0 ± 0%   ~     (all equal)

name                              old time/op      new time/op      delta
TrieIPv4Peers100Addresses1000-32      83.5ns ± 0%      83.3ns ± 0%   ~     (p=1.000 n=1+1)
TrieIPv4Peers10Addresses10-32         33.6ns ± 0%      33.4ns ± 0%   ~     (p=1.000 n=1+1)
TrieIPv6Peers100Addresses1000-32      83.3ns ± 0%      83.2ns ± 0%   ~     (p=1.000 n=1+1)
TrieIPv6Peers10Addresses10-32         33.5ns ± 0%      33.2ns ± 0%   ~     (p=1.000 n=1+1)
Latency-32                             216µs ± 0%       216µs ± 0%   ~     (p=1.000 n=1+1)
Throughput-32                         2.31µs ± 0%      2.28µs ± 0%   ~     (p=1.000 n=1+1)
UAPIGet-32                            2.28µs ± 0%      2.13µs ± 0%   ~     (p=1.000 n=1+1)
WaitPool-32                           4.18µs ± 0%      4.14µs ± 0%   ~     (p=1.000 n=1+1)

name                              old packet-loss  new packet-loss  delta
Throughput-32                           0.00 ± 0%        0.01 ± 0%   ~     (p=1.000 n=1+1)

name                              old alloc/op     new alloc/op     delta
UAPIGet-32                              224B ± 0%        224B ± 0%   ~     (all equal)

name                              old allocs/op    new allocs/op    delta
UAPIGet-32                              17.0 ± 0%        17.0 ± 0%   ~     (all equal)
raggi force-pushed the raggi/no-atomic-next branch from 5290a10 to 0cdb15c on February 9, 2023 23:45
raggi (Contributor, Author) commented Feb 9, 2023

> Why do you suppose the micro tests show no performance improvement but the macro tests do?

The short summary: my guess is that the micro-benchmarks speculate and pipeline better. RWMutex is a lot more expensive than Mutex even on the happy path: not only does it do more work (several loads, one from TLS, and at least two of its own atomics), but it also has more branch paths and more instruction pressure. The fast path for Mutex is a single CAS operation that can probably be inlined.
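
As a rough, hedged illustration of that happy-path difference (not a benchmark from this PR), the two uncontended fast paths can be compared directly with the standard testing package:

package locksketch

import (
	"sync"
	"testing"
)

// Uncontended fast paths only: each iteration acquires and releases the lock
// with no other goroutine competing for it.

func BenchmarkMutexUncontended(b *testing.B) {
	var mu sync.Mutex
	for i := 0; i < b.N; i++ {
		mu.Lock()
		mu.Unlock()
	}
}

func BenchmarkRWMutexRLockUncontended(b *testing.B) {
	var mu sync.RWMutex
	for i := 0; i < b.N; i++ {
		mu.RLock()
		mu.RUnlock()
	}
}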

Personally, I think the simplification is probably of more value. I tested the performance primarily to be sure it didn't get slower, but found that it got materially faster for some system topologies. RWLocks are, as a general rule, probably not great to put in the per-packet path, at least not general-purpose implementations of them.
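
To illustrate the per-packet-path concern, a parallel read-only variant can be added to the same sketch file as above (again, only an illustration): every RLock/RUnlock pair still performs atomic updates on the RWMutex's shared reader count, so that state is contended across cores even when no writer is present.

// BenchmarkRWMutexRLockParallel belongs in the same file as the sketch above.
// Read-only locking from many goroutines still updates shared RWMutex state,
// so the cache line holding it bounces between cores.
func BenchmarkRWMutexRLockParallel(b *testing.B) {
	var mu sync.RWMutex
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			mu.RLock()
			mu.RUnlock()
		}
	})
}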

zx2c4 (Member) commented Mar 3, 2023

I remain sort of skeptical about this and it'd be nice to have some more real data on it. If RWLocks are a bottleneck here, wouldn't that mostly already be hidden by the much heavier usage of an RWLock in IndexTable's Lookup function?

raggi (Contributor, Author) commented Mar 15, 2023

To highlight again:

> Personally, I think the simplification is probably of more value. I tested the performance primarily to be sure it didn't get slower, but found that it got materially faster for some system topologies.

What are you skeptical about, performance concerns? As stated, they're not all that significant.

What I experienced, and the reason I split this PR out early in our work, is that I ran into this code while chasing performance concerns: it has intricate behaviors that matter for both performance and correctness, and it is in a form that is harder to study than necessary. This approach is far simpler and benchmarks slightly faster. The primary motivation for landing this PR should be that it is simpler; it does not regress performance, and in fact improves it slightly.

Could you expand on some specific concerns that can be addressed?
