Replace mix of atomics and rwmutex with mutex around key material #58
base: master
Conversation
Why do you suppose the micro tests show no performance improvement but the macro tests do?
name                              old time/op      new time/op      delta
TrieIPv4Peers100Addresses1000-32    83.5ns ± 0%      89.0ns ± 0%     ~    (p=1.000 n=1+1)
TrieIPv4Peers10Addresses10-32       33.6ns ± 0%      33.3ns ± 0%     ~    (p=1.000 n=1+1)
TrieIPv6Peers100Addresses1000-32    83.3ns ± 0%      82.7ns ± 0%     ~    (p=1.000 n=1+1)
TrieIPv6Peers10Addresses10-32       33.5ns ± 0%      33.4ns ± 0%     ~    (p=1.000 n=1+1)
Latency-32                           216µs ± 0%       211µs ± 0%     ~    (p=1.000 n=1+1)
Throughput-32                       2.31µs ± 0%      2.25µs ± 0%     ~    (p=1.000 n=1+1)
UAPIGet-32                          2.28µs ± 0%      2.12µs ± 0%     ~    (p=1.000 n=1+1)
WaitPool-32                         4.18µs ± 0%      4.06µs ± 0%     ~    (p=1.000 n=1+1)

name                              old packet-loss  new packet-loss  delta
Throughput-32                         0.00 ± 0%        0.00 ± 0%     ~    (p=1.000 n=1+1)

name                              old alloc/op     new alloc/op     delta
UAPIGet-32                            224B ± 0%        224B ± 0%     ~    (all equal)

name                              old allocs/op    new allocs/op    delta
UAPIGet-32                            17.0 ± 0%        17.0 ± 0%     ~    (all equal)
name                              old time/op      new time/op      delta
TrieIPv4Peers100Addresses1000-32    83.5ns ± 0%      83.3ns ± 0%     ~    (p=1.000 n=1+1)
TrieIPv4Peers10Addresses10-32       33.6ns ± 0%      33.4ns ± 0%     ~    (p=1.000 n=1+1)
TrieIPv6Peers100Addresses1000-32    83.3ns ± 0%      83.2ns ± 0%     ~    (p=1.000 n=1+1)
TrieIPv6Peers10Addresses10-32       33.5ns ± 0%      33.2ns ± 0%     ~    (p=1.000 n=1+1)
Latency-32                           216µs ± 0%       216µs ± 0%     ~    (p=1.000 n=1+1)
Throughput-32                       2.31µs ± 0%      2.28µs ± 0%     ~    (p=1.000 n=1+1)
UAPIGet-32                          2.28µs ± 0%      2.13µs ± 0%     ~    (p=1.000 n=1+1)
WaitPool-32                         4.18µs ± 0%      4.14µs ± 0%     ~    (p=1.000 n=1+1)

name                              old packet-loss  new packet-loss  delta
Throughput-32                         0.00 ± 0%        0.01 ± 0%     ~    (p=1.000 n=1+1)

name                              old alloc/op     new alloc/op     delta
UAPIGet-32                            224B ± 0%        224B ± 0%     ~    (all equal)

name                              old allocs/op    new allocs/op    delta
UAPIGet-32                            17.0 ± 0%        17.0 ± 0%     ~    (all equal)
5290a10 to 0cdb15c
The shortest summary is a guess: the micro-benchmarks speculate and pipeline better. RWMutex is a lot more expensive than Mutex even on the happy path. Not only does it do more work (several loads, one from TLS, and at least two atomics of its own), it also has more branch paths and instruction pressure. The fast path for Mutex is a single CAS operation that can probably be inlined. Personally, I think the simplification is of more value; I tested the performance primarily to be sure it didn't get slower, but found that it got materially faster for some system topologies. As a general rule, RWLocks are probably not great to put in the per-packet path, at least not general-purpose implementations of them.
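A hedged micro-benchmark sketch of the two uncontended fast paths (illustrative, not from this repository) makes the point concrete; in a tight loop like this, speculation and branch prediction can hide most of the RWMutex overhead, which would explain the flat micro numbers:

```go
package lock_test

import (
	"sync"
	"testing"
)

// Uncontended Mutex fast path: effectively a single CAS per Lock/Unlock pair.
func BenchmarkMutexUncontended(b *testing.B) {
	var mu sync.Mutex
	for i := 0; i < b.N; i++ {
		mu.Lock()
		mu.Unlock()
	}
}

// Uncontended RWMutex read path: more loads, atomics, and branches, but a
// tight loop pipelines well enough that the delta can disappear.
func BenchmarkRWMutexUncontended(b *testing.B) {
	var mu sync.RWMutex
	for i := 0; i < b.N; i++ {
		mu.RLock()
		mu.RUnlock()
	}
}
```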
I remain somewhat skeptical about this, and it would be nice to have more real data on it. If RWLocks were a bottleneck here, wouldn't that mostly already be hidden by the much heavier use of an RWLock in IndexTable's Lookup function?
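For reference, the lookup in question has roughly this shape (an approximation of wireguard-go's IndexTable with the entry type stubbed out, not a verbatim copy):

```go
package sketch

import "sync"

// IndexTableEntry stands in for the real entry type.
type IndexTableEntry struct{}

// IndexTable approximates the receive-path index table: a map guarded by
// an RWMutex that is read-locked for every inbound packet.
type IndexTable struct {
	mutex sync.RWMutex
	table map[uint32]IndexTableEntry
}

// Lookup takes the read lock on the per-packet path.
func (t *IndexTable) Lookup(id uint32) IndexTableEntry {
	t.mutex.RLock()
	defer t.mutex.RUnlock()
	return t.table[id]
}
```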
To highlight again:
What are you skeptical about, the performance concerns? As stated, they're not all that significant. The reason I split this PR out early in our work is that I ran into this code while chasing performance concerns: it has intricate behaviors that affect both performance and correctness, and it is in a form that is harder to study than necessary. This approach is far simpler and benchmarks slightly faster. The primary motivation for landing this PR should be that it is simpler; it does not regress performance, and in fact improves it slightly. Could you expand on specific concerns that can be addressed?
This reduces the overhead of various code paths, as there are fewer pipeline stalls and more code can be inlined. The mutex fast path is a single atomic operation, and this mutex is never held for long.
Microbenchmark results in each commit show no micro-scale difference. Macro-scale tests demonstrate a slight speed-up, though well within the noise of other external effects: a small improvement in the arm64 test on a t4g.nano from 642 Mbps to 646 Mbps, and in the x64 test on an m6i.xlarge from 975 Mbps to 1.02 Gbps.
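For illustration, a minimal sketch of the shape of the change, using hypothetical field names rather than the repository's actual definitions:

```go
package sketch

import (
	"sync"
	"time"
)

// Before (sketch): key-material state guarded by an RWMutex, with some
// fields additionally managed through atomics outside the lock.
//
// After (sketch): one plain Mutex guards all related state. The lock is
// held only for a few loads and stores, so the single-CAS fast path and
// the simpler reasoning about invariants both win.
type keyMaterial struct {
	mu        sync.Mutex // guards everything below
	sendNonce uint64
	created   time.Time
	key       [32]byte
}

// nextNonce shows the pattern: briefly take the lock instead of keeping
// a separate atomic counter alongside RWMutex-guarded fields.
func (k *keyMaterial) nextNonce() uint64 {
	k.mu.Lock()
	defer k.mu.Unlock()
	n := k.sendNonce
	k.sendNonce++
	return n
}
```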