Replace mix of atomics and rwmutex with mutex around key material #58
base: master
Conversation
Why do you suppose the micro tests show no performance improvement but the macro tests do?
name                              old time/op      new time/op      delta
TrieIPv4Peers100Addresses1000-32    83.5ns ± 0%      89.0ns ± 0%     ~    (p=1.000 n=1+1)
TrieIPv4Peers10Addresses10-32       33.6ns ± 0%      33.3ns ± 0%     ~    (p=1.000 n=1+1)
TrieIPv6Peers100Addresses1000-32    83.3ns ± 0%      82.7ns ± 0%     ~    (p=1.000 n=1+1)
TrieIPv6Peers10Addresses10-32       33.5ns ± 0%      33.4ns ± 0%     ~    (p=1.000 n=1+1)
Latency-32                           216µs ± 0%       211µs ± 0%     ~    (p=1.000 n=1+1)
Throughput-32                       2.31µs ± 0%      2.25µs ± 0%     ~    (p=1.000 n=1+1)
UAPIGet-32                          2.28µs ± 0%      2.12µs ± 0%     ~    (p=1.000 n=1+1)
WaitPool-32                         4.18µs ± 0%      4.06µs ± 0%     ~    (p=1.000 n=1+1)

name                              old packet-loss  new packet-loss  delta
Throughput-32                         0.00 ± 0%        0.00 ± 0%     ~    (p=1.000 n=1+1)

name                              old alloc/op     new alloc/op     delta
UAPIGet-32                            224B ± 0%        224B ± 0%     ~    (all equal)

name                              old allocs/op    new allocs/op    delta
UAPIGet-32                            17.0 ± 0%        17.0 ± 0%     ~    (all equal)
name                              old time/op      new time/op      delta
TrieIPv4Peers100Addresses1000-32    83.5ns ± 0%      83.3ns ± 0%     ~    (p=1.000 n=1+1)
TrieIPv4Peers10Addresses10-32       33.6ns ± 0%      33.4ns ± 0%     ~    (p=1.000 n=1+1)
TrieIPv6Peers100Addresses1000-32    83.3ns ± 0%      83.2ns ± 0%     ~    (p=1.000 n=1+1)
TrieIPv6Peers10Addresses10-32       33.5ns ± 0%      33.2ns ± 0%     ~    (p=1.000 n=1+1)
Latency-32                           216µs ± 0%       216µs ± 0%     ~    (p=1.000 n=1+1)
Throughput-32                       2.31µs ± 0%      2.28µs ± 0%     ~    (p=1.000 n=1+1)
UAPIGet-32                          2.28µs ± 0%      2.13µs ± 0%     ~    (p=1.000 n=1+1)
WaitPool-32                         4.18µs ± 0%      4.14µs ± 0%     ~    (p=1.000 n=1+1)

name                              old packet-loss  new packet-loss  delta
Throughput-32                         0.00 ± 0%        0.01 ± 0%     ~    (p=1.000 n=1+1)

name                              old alloc/op     new alloc/op     delta
UAPIGet-32                            224B ± 0%        224B ± 0%     ~    (all equal)

name                              old allocs/op    new allocs/op    delta
UAPIGet-32                            17.0 ± 0%        17.0 ± 0%     ~    (all equal)
5290a10 to 0cdb15c
The shortest summary is a guess: the micro-benchmarks speculate and pipeline better. RWMutex is a lot more expensive than Mutex even on the happy path. Not only does it do more work (several loads, one from TLS, and at least two atomics of its own), it also has more branch paths and instruction pressure. The fast path for Mutex is a single CAS operation that can probably be inlined. Personally, I think the simplification is of more value; I tested the performance primarily to be sure it didn't get slower, but found that it got materially faster for some system topologies. As a general rule, RWLocks are probably not great to put in the per-packet path, at least not general-purpose implementations of them.
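A hedged micro-benchmark sketch of the two uncontended fast paths (illustrative, not from this repository) makes the point concrete; in a tight loop like this, speculation and branch prediction can hide most of the RWMutex overhead, which would explain the flat micro numbers:

```go
package lock_test

import (
	"sync"
	"testing"
)

// Uncontended Mutex fast path: effectively a single CAS per Lock/Unlock pair.
func BenchmarkMutexUncontended(b *testing.B) {
	var mu sync.Mutex
	for i := 0; i < b.N; i++ {
		mu.Lock()
		mu.Unlock()
	}
}

// Uncontended RWMutex read path: more loads, atomics, and branches, but a
// tight loop pipelines well enough that the delta can disappear.
func BenchmarkRWMutexUncontended(b *testing.B) {
	var mu sync.RWMutex
	for i := 0; i < b.N; i++ {
		mu.RLock()
		mu.RUnlock()
	}
}
```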
I remain somewhat skeptical about this, and it would be nice to have more real data on it. If RWLocks were a bottleneck here, wouldn't that mostly already be hidden by the much heavier use of an RWLock in IndexTable's Lookup function?
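For reference, the lookup in question has roughly this shape (an approximation of wireguard-go's IndexTable with the entry type stubbed out, not a verbatim copy):

```go
package sketch

import "sync"

// IndexTableEntry stands in for the real entry type.
type IndexTableEntry struct{}

// IndexTable approximates the receive-path index table: a map guarded by
// an RWMutex that is read-locked for every inbound packet.
type IndexTable struct {
	mutex sync.RWMutex
	table map[uint32]IndexTableEntry
}

// Lookup takes the read lock on the per-packet path.
func (t *IndexTable) Lookup(id uint32) IndexTableEntry {
	t.mutex.RLock()
	defer t.mutex.RUnlock()
	return t.table[id]
}
```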
To highlight again:
What are you skeptical about, the performance concerns? As stated, they're not all that significant. The reason I split this PR out early in our work is that I ran into this code while chasing performance concerns: it has intricate behaviors that affect both performance and correctness, and it is in a form that is harder to study than necessary. This approach is far simpler and benchmarks slightly faster. The primary motivation for landing this PR should be that it is simpler; it does not regress performance, and in fact improves it slightly. Could you expand on specific concerns that can be addressed?
This reduces the overhead of various code paths, as there are fewer pipeline stalls and more code can be inlined. The mutex fast path is a single atomic operation, and this mutex is never held for long.
Microbenchmark results in each commit show no micro-scale difference. Macro-scale tests demonstrate a slight speed-up, though well within the noise of other external effects: a small improvement in the arm64 test on a t4g.nano from 642 Mbps to 646 Mbps, and in the x64 test on an m6i.xlarge from 975 Mbps to 1.02 Gbps.
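For illustration, a minimal sketch of the shape of the change, using hypothetical field names rather than the repository's actual definitions:

```go
package sketch

import (
	"sync"
	"time"
)

// Before (sketch): key-material state guarded by an RWMutex, with some
// fields additionally managed through atomics outside the lock.
//
// After (sketch): one plain Mutex guards all related state. The lock is
// held only for a few loads and stores, so the single-CAS fast path and
// the simpler reasoning about invariants both win.
type keyMaterial struct {
	mu        sync.Mutex // guards everything below
	sendNonce uint64
	created   time.Time
	key       [32]byte
}

// nextNonce shows the pattern: briefly take the lock instead of keeping
// a separate atomic counter alongside RWMutex-guarded fields.
func (k *keyMaterial) nextNonce() uint64 {
	k.mu.Lock()
	defer k.mu.Unlock()
	n := k.sendNonce
	k.sendNonce++
	return n
}
```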