tcp_conn_tuner: Fix cost calculation errors #139
base: main
Conversation
We should use 'rate_delivered' instead of 'm->max_rate_delivered' to calculate the rate cost.
Signed-off-by: Feng Yang <[email protected]>
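For context, here is a minimal sketch of the kind of cost term the description refers to. The struct layout, field names, and scaling constant below are hypothetical illustrations, not the actual tcp_conn_tuner sources:

```c
#include <linux/types.h>

#define RATE_SCALE (1ULL << 20)		/* hypothetical scaling constant */

/* Hypothetical per-congestion-algorithm state; real field names may differ. */
struct cong_metric {
	__u64 max_rate_delivered;	/* best delivery rate seen so far */
};

/* Rate-cost term: a higher delivery rate means a lower cost (the +1 avoids
 * division by zero).  Previously the per-algorithm maximum was used:
 *   RATE_SCALE / (m->max_rate_delivered + 1)
 * The proposed fix scores each sample by the rate the connection itself
 * delivered:
 */
static __u64 rate_cost(const struct cong_metric *m, __u64 rate_delivered)
{
	(void)m;	/* kept only to show what the old calculation referenced */
	return RATE_SCALE / (rate_delivered + 1);
}
```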
Thank you for your pull request and welcome to our community! To contribute, please sign the Oracle Contributor Agreement (OCA).
To sign the OCA, please create an Oracle account and sign the OCA in Oracle's Contributor Agreement Application. When signing the OCA, please provide your GitHub username. After signing the OCA and getting an OCA approval from Oracle, this PR will be automatically updated. If you are an Oracle employee, please make sure that you are a member of the main Oracle GitHub organization, and your membership in this organization is public.
If I remember correctly, this was done deliberately because delivery rate across connections varies too much; I discussed this in https://blogs.oracle.com/linux/post/bpftune-using-reinforcement-learning-in-bpf . Do you have some observations on why using the per-connection delivery rate works better than using the per-congestion-control delivery rate? Thanks!
I think of it this way: if the congestion control algorithm is seen as a strategy, and individual connections are seen as actions taken under that strategy, then the maximum delivery rate is like a short-term reward and the per-sample delivery rate is like a long-term reward. In most cases the maximum delivery rate is a good proxy for what the congestion control actually delivers, i.e. the short-term reward largely reflects the long-term one, so using m->max_rate_delivered converges quickly to a good choice. But if the delivery rate is usually low and only occasionally spikes, using m->max_rate_delivered can fail to converge to the optimal solution.
It's a good idea, though I'd suggest trying something like TCP does, where we update the average using a portion of the current rate delivered, bit-shifted to avoid the need for division.
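For concreteness, here is a minimal sketch of the shift-based averaging being suggested, in the style of TCP's srtt update (a gain of 1/8 per new sample). The names are illustrative, not bpftune's actual code:

```c
#include <linux/types.h>

/* Exponentially weighted moving average with a power-of-two weight,
 * i.e. avg += (sample - avg) / 8, done with a shift so no division is needed.
 */
#define RATE_AVG_SHIFT	3	/* weight of 1/8 for each new sample */

static __u64 update_avg_rate(__u64 avg_rate, __u64 rate_delivered)
{
	if (!avg_rate)
		return rate_delivered;	/* first sample seeds the average */
	if (rate_delivered > avg_rate)
		avg_rate += (rate_delivered - avg_rate) >> RATE_AVG_SHIFT;
	else
		avg_rate -= (avg_rate - rate_delivered) >> RATE_AVG_SHIFT;
	return avg_rate;
}
```

The two branches keep the arithmetic in unsigned types without underflow; the cost calculation could then use this smoothed rate instead of a single noisy sample.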
On my 6-way Proxmox cluster I've been seeing the tcp_conn_tuner making... a lot of changes... and I'm not confident it's doing the right thing:
root@px-m-43:~# for face in tuners summary; do bpftune -q ${face};done;
sysctl active
net_buffer active
neigh_table active
tcp_conn active
tcp_buffer active
route_table active
ip_frag active
netns active
Summary of changes made across all tuners:
Summary: tcp_conn_tuner: CongAlg Count
Summary: tcp_conn_tuner: cubic 108148
Summary: tcp_conn_tuner: bbr 100966
Summary: tcp_conn_tuner: htcp 116662
Summary: tcp_conn_tuner: dctcp 157340
Summary: scenario 'need to increase TCP buffer size(s)' occurred 1 times for tunable 'net.ipv4.tcp_rmem' in global ns. Need to increase buffer size(s) to maximize throughput
sysctl 'net.ipv4.tcp_rmem' changed from (8388608 67108864 2147483647 ) -> (4096 131072 17301504 )
root@px-m-43:~# uptime;echo;grep ExecStart /lib/systemd/system/bpftune.service
22:13:15 up 22:40, 1 user, load average: 2.76, 3.49, 3.13

ExecStart=/usr/sbin/bpftune -r1

(I also find the tcp_rmem adjustment shown there a bit... uh... odd? ;) but that's not relevant to this issue.) For shiggles, I'll try and build this change on one of the nodes and see what impact it has.
Thanks for the report @wolfspyre ! The above looks like the reinforcement learning is saying that dctcp has a slight edge for throughput+latency, reflected in the ~50% more uses of that algorithm for connections versus the rest. I've seen this too in low-latency network environments that do not experience large amounts of packet loss (i.e. within datacenters). It would be interesting if we could simulate your workloads and verify that this is indeed the right answer for the kind of TCP traffic you are dealing with.

More generally, the challenge is that we need to both explore and exploit as the environment may change, but it may be necessary to fine-tune things such that we explore less as we learn more. Feel free to file an issue on these behaviours of course and we can investigate further. If you're not using the latest bpftune, I've added explicit selection of BBR where significant TCP packet loss is observed, as that algorithm does a much better job of estimating bandwidth in such cases.

The rmem tuning seems to stem from the fact that you're at INT_MAX for rmem max; in such cases we should just stay put rather than adjusting things. On that topic, it's likely we are adjusting rmem too aggressively too, so that may require some fixing up as well.
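To illustrate the explore/exploit trade-off mentioned above, here is a sketch of epsilon-greedy selection whose exploration rate decays as more connections are scored. All names and constants are hypothetical; this is not how bpftune necessarily implements its selection:

```c
#include <linux/types.h>
#include <stdlib.h>

#define NUM_CONG_ALGS		4	/* e.g. cubic, bbr, htcp, dctcp */
#define EPSILON_START_PCT	25	/* explore 25% of the time initially */
#define EPSILON_MIN_PCT		5	/* never explore less than 5% */
#define EPSILON_DECAY_CONNS	10000	/* halve exploration every N connections */

/* Epsilon-greedy choice: mostly pick the algorithm with the lowest observed
 * cost, but occasionally try another at random so the tuner can still react
 * if the environment changes.  Exploration shrinks as experience accumulates.
 */
static int choose_cong_alg(const __u64 cost[NUM_CONG_ALGS], __u64 conns_seen)
{
	__u64 halvings = conns_seen / EPSILON_DECAY_CONNS;
	unsigned int eps = EPSILON_START_PCT;
	int best = 0, i;

	while (halvings-- && eps > EPSILON_MIN_PCT)
		eps >>= 1;
	if (eps < EPSILON_MIN_PCT)
		eps = EPSILON_MIN_PCT;

	if ((unsigned int)(rand() % 100) < eps)
		return rand() % NUM_CONG_ALGS;	/* explore */

	for (i = 1; i < NUM_CONG_ALGS; i++)
		if (cost[i] < cost[best])
			best = i;
	return best;				/* exploit */
}
```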
Okay so... running since 22:30 last night, both hosts have the latest commit (as of 22:30 central 01.12.25)... px-m-42 has this patch applied.

. /usr/local/bin/build_util.sh; systemctl stop bpftune; /etc/rc.local >/dev/null 2>&1; info "$(date)"; systemctl start bpftune; sleep 300; info "$(date)";for face in tuners summary; do bpftune -q ${face};done;

px-m-41:
[00:05:05] INFO: Sun Jan 12 10:40:52 PM CST 2025
route_table active
netns active
net_buffer active
neigh_table active
ip_frag active
tcp_conn active
sysctl active
tcp_buffer active
Summary of changes made across all tuners:
Summary: tcp_conn_tuner: CongAlg Count
Summary: tcp_conn_tuner: cubic 286
Summary: tcp_conn_tuner: bbr 435
Summary: tcp_conn_tuner: htcp 565
Summary: tcp_conn_tuner: dctcp 358

px-m-42:
[00:05:05] INFO: Sun Jan 12 10:40:52 PM CST 2025
netns active
neigh_table active
net_buffer active
sysctl active
route_table active
tcp_buffer active
ip_frag active
tcp_conn active
Summary of changes made across all tuners:
Summary: tcp_conn_tuner: CongAlg Count
Summary: tcp_conn_tuner: cubic 229
Summary: tcp_conn_tuner: bbr 210
Summary: tcp_conn_tuner: htcp 166
Summary: tcp_conn_tuner: dctcp 268

As of right now:
root@px-m-41:~# info "$(date)";for face in tuners summary; do bpftune -q ${face};done;
[18:25:10] INFO: Mon Jan 13 05:00:57 PM CST 2025
route_table active
netns active
net_buffer active
neigh_table active
ip_frag active
tcp_conn active
sysctl active
tcp_buffer active
Summary of changes made across all tuners:
Summary: tcp_conn_tuner: CongAlg Count
Summary: tcp_conn_tuner: cubic 86379
Summary: tcp_conn_tuner: bbr 93786
Summary: tcp_conn_tuner: htcp 87268
Summary: tcp_conn_tuner: dctcp 80164
root@px-m-41:~#

root@px-m-42:/usr/src/bpftune# info "$(date)";for face in tuners summary; do bpftune -q ${face};done;
[18:25:10] INFO: Mon Jan 13 05:00:57 PM CST 2025
netns active
neigh_table active
net_buffer active
sysctl active
route_table active
tcp_buffer active
ip_frag active
tcp_conn active
Summary of changes made across all tuners:
Summary: tcp_conn_tuner: CongAlg Count
Summary: tcp_conn_tuner: cubic 46930
Summary: tcp_conn_tuner: bbr 51356
Summary: tcp_conn_tuner: htcp 44095
Summary: tcp_conn_tuner: dctcp 61755
root@px-m-42:/usr/src/bpftune#
Certainly! Happy to provide any info that would be legitimately helpful... something I think would help is a log level which emits its decision once without being super noisy... we see when it has made changes with summary, but we don't see something like:
ALTERNATIVELY, some way of querying bpftune to glean more insight might be sufficient... It's a really neat bit of kit you've got here, and it's REALLY well positioned to help peeps grok the WHY behind the WHAT... so my intention with the comment isn't to criticize in any way, but rather to highlight the value of a mode that's slightly more verbose than silent/unobserved... This stuff's complex and often poorly understood by even the most smartypants greybeards.