tcp_conn_tuner: Fix cost calculation errors #139
base: main
Conversation
We should use 'rate_delivered' instead of 'm->max_rate_delivered' to calculate the rate cost.
Signed-off-by: Feng Yang <[email protected]>
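For context, here is a minimal sketch of the kind of cost term the description refers to. The struct layout, field names, and scaling constant below are hypothetical illustrations, not the actual tcp_conn_tuner sources:

```c
#include <linux/types.h>

#define RATE_SCALE (1ULL << 20)		/* hypothetical scaling constant */

/* Hypothetical per-congestion-algorithm state; real field names may differ. */
struct cong_metric {
	__u64 max_rate_delivered;	/* best delivery rate seen so far */
};

/* Rate-cost term: a higher delivery rate means a lower cost (the +1 avoids
 * division by zero).  Previously the per-algorithm maximum was used:
 *   RATE_SCALE / (m->max_rate_delivered + 1)
 * The proposed fix scores each sample by the rate the connection itself
 * delivered:
 */
static __u64 rate_cost(const struct cong_metric *m, __u64 rate_delivered)
{
	(void)m;	/* kept only to show what the old calculation referenced */
	return RATE_SCALE / (rate_delivered + 1);
}
```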
Thank you for your pull request and welcome to our community! To contribute, please sign the Oracle Contributor Agreement (OCA).
To sign the OCA, please create an Oracle account and sign the OCA in Oracle's Contributor Agreement Application. When signing the OCA, please provide your GitHub username. After signing the OCA and getting an OCA approval from Oracle, this PR will be automatically updated. If you are an Oracle employee, please make sure that you are a member of the main Oracle GitHub organization, and your membership in this organization is public.
If I remember correctly, this was done deliberately because delivery rate across connections varies too much; I discussed this in https://blogs.oracle.com/linux/post/bpftune-using-reinforcement-learning-in-bpf . Do you have some observations on why using the per-connection delivery rate works better than using the per-congestion-control delivery rate? Thanks!
I think of it this way: if the congestion control algorithm is seen as a strategy, and individual connections are seen as actions taken under that strategy, then the maximum delivery rate is like a short-term reward and the per-sample delivery rate is like a long-term reward. In most cases the maximum delivery rate is a good proxy for what the congestion control actually delivers, i.e. the short-term reward largely reflects the long-term one, so using m->max_rate_delivered converges quickly to a good choice. But if the delivery rate is usually low and only occasionally spikes, using m->max_rate_delivered can fail to converge to the optimal solution.
It's a good idea, though I'd suggest trying something like TCP does, where we update the average using a portion of the current rate delivered, bit-shifted to avoid the need for division.
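For concreteness, here is a minimal sketch of the shift-based averaging being suggested, in the style of TCP's srtt update (a gain of 1/8 per new sample). The names are illustrative, not bpftune's actual code:

```c
#include <linux/types.h>

/* Exponentially weighted moving average with a power-of-two weight,
 * i.e. avg += (sample - avg) / 8, done with a shift so no division is needed.
 */
#define RATE_AVG_SHIFT	3	/* weight of 1/8 for each new sample */

static __u64 update_avg_rate(__u64 avg_rate, __u64 rate_delivered)
{
	if (!avg_rate)
		return rate_delivered;	/* first sample seeds the average */
	if (rate_delivered > avg_rate)
		avg_rate += (rate_delivered - avg_rate) >> RATE_AVG_SHIFT;
	else
		avg_rate -= (avg_rate - rate_delivered) >> RATE_AVG_SHIFT;
	return avg_rate;
}
```

The two branches keep the arithmetic in unsigned types without underflow; the cost calculation could then use this smoothed rate instead of a single noisy sample.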
On my 6-way Proxmox cluster I've been seeing the tcp_conn_tuner making... a lot of changes... and I'm not confident it's doing the right thing:
root@px-m-43:~# for face in tuners summary; do bpftune -q ${face};done;
sysctl active
net_buffer active
neigh_table active
tcp_conn active
tcp_buffer active
route_table active
ip_frag active
netns active
Summary of changes made across all tuners:
Summary: tcp_conn_tuner: CongAlg Count
Summary: tcp_conn_tuner: cubic 108148
Summary: tcp_conn_tuner: bbr 100966
Summary: tcp_conn_tuner: htcp 116662
Summary: tcp_conn_tuner: dctcp 157340
Summary: scenario 'need to increase TCP buffer size(s)' occurred 1 times for tunable 'net.ipv4.tcp_rmem' in global ns. Need to increase buffer size(s) to maximize throughput
sysctl 'net.ipv4.tcp_rmem' changed from (8388608 67108864 2147483647 ) -> (4096 131072 17301504 )
root@px-m-43:~# uptime;echo;grep ExecStart /lib/systemd/system/bpftune.service
22:13:15 up 22:40, 1 user, load average: 2.76, 3.49, 3.13

ExecStart=/usr/sbin/bpftune -r1

(I also find the tcp_rmem adjustment shown there a bit... uh... odd? ;) but that's not relevant to this issue.) For shiggles, I'll try and build this change on one of the nodes and see what impact it has.
Thanks for the report @wolfspyre ! The above looks like the reinforcement learning is saying that dctcp has a slight edge for throughput+latency, reflected in the ~50% more uses of that algorithm for connections versus the rest. I've seen this too in low-latency network environments that do not experience large amounts of packet loss (i.e. within datacenters). It would be interesting if we could simulate your workloads and verify that this is indeed the right answer for the kind of TCP traffic you are dealing with.

More generally, the challenge is that we need to both explore and exploit as the environment may change, but it may be necessary to fine-tune things such that we explore less as we learn more. Feel free to file an issue on these behaviours of course and we can investigate further. If you're not using the latest bpftune, I've added explicit selection of BBR where significant TCP packet loss is observed, as that algorithm does a much better job of estimating bandwidth in such cases.

The rmem tuning seems to stem from the fact that you're at INT_MAX for rmem max; in such cases we should just stay put rather than adjusting things. On that topic, it's likely we are adjusting rmem too aggressively too, so that may require some fixing up as well.
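To illustrate the explore/exploit trade-off mentioned above, here is a sketch of epsilon-greedy selection whose exploration rate decays as more connections are scored. All names and constants are hypothetical; this is not how bpftune necessarily implements its selection:

```c
#include <linux/types.h>
#include <stdlib.h>

#define NUM_CONG_ALGS		4	/* e.g. cubic, bbr, htcp, dctcp */
#define EPSILON_START_PCT	25	/* explore 25% of the time initially */
#define EPSILON_MIN_PCT		5	/* never explore less than 5% */
#define EPSILON_DECAY_CONNS	10000	/* halve exploration every N connections */

/* Epsilon-greedy choice: mostly pick the algorithm with the lowest observed
 * cost, but occasionally try another at random so the tuner can still react
 * if the environment changes.  Exploration shrinks as experience accumulates.
 */
static int choose_cong_alg(const __u64 cost[NUM_CONG_ALGS], __u64 conns_seen)
{
	__u64 halvings = conns_seen / EPSILON_DECAY_CONNS;
	unsigned int eps = EPSILON_START_PCT;
	int best = 0, i;

	while (halvings-- && eps > EPSILON_MIN_PCT)
		eps >>= 1;
	if (eps < EPSILON_MIN_PCT)
		eps = EPSILON_MIN_PCT;

	if ((unsigned int)(rand() % 100) < eps)
		return rand() % NUM_CONG_ALGS;	/* explore */

	for (i = 1; i < NUM_CONG_ALGS; i++)
		if (cost[i] < cost[best])
			best = i;
	return best;				/* exploit */
}
```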
Okay so... running since 22:30 last night, both hosts have the latest commit (as of 22:30 central 01.12.25)... px-m-42 has this patch applied.

. /usr/local/bin/build_util.sh; systemctl stop bpftune; /etc/rc.local >/dev/null 2>&1; info "$(date)"; systemctl start bpftune; sleep 300; info "$(date)";for face in tuners summary; do bpftune -q ${face};done;

px-m-41:
[00:05:05] INFO: Sun Jan 12 10:40:52 PM CST 2025
route_table active
netns active
net_buffer active
neigh_table active
ip_frag active
tcp_conn active
sysctl active
tcp_buffer active
Summary of changes made across all tuners:
Summary: tcp_conn_tuner: CongAlg Count
Summary: tcp_conn_tuner: cubic 286
Summary: tcp_conn_tuner: bbr 435
Summary: tcp_conn_tuner: htcp 565
Summary: tcp_conn_tuner: dctcp 358

px-m-42:
[00:05:05] INFO: Sun Jan 12 10:40:52 PM CST 2025
netns active
neigh_table active
net_buffer active
sysctl active
route_table active
tcp_buffer active
ip_frag active
tcp_conn active
Summary of changes made across all tuners:
Summary: tcp_conn_tuner: CongAlg Count
Summary: tcp_conn_tuner: cubic 229
Summary: tcp_conn_tuner: bbr 210
Summary: tcp_conn_tuner: htcp 166
Summary: tcp_conn_tuner: dctcp 268

As of right now:
root@px-m-41:~# info "$(date)";for face in tuners summary; do bpftune -q ${face};done;
[18:25:10] INFO: Mon Jan 13 05:00:57 PM CST 2025
route_table active
netns active
net_buffer active
neigh_table active
ip_frag active
tcp_conn active
sysctl active
tcp_buffer active
Summary of changes made across all tuners:
Summary: tcp_conn_tuner: CongAlg Count
Summary: tcp_conn_tuner: cubic 86379
Summary: tcp_conn_tuner: bbr 93786
Summary: tcp_conn_tuner: htcp 87268
Summary: tcp_conn_tuner: dctcp 80164
root@px-m-41:~#

root@px-m-42:/usr/src/bpftune# info "$(date)";for face in tuners summary; do bpftune -q ${face};done;
[18:25:10] INFO: Mon Jan 13 05:00:57 PM CST 2025
netns active
neigh_table active
net_buffer active
sysctl active
route_table active
tcp_buffer active
ip_frag active
tcp_conn active
Summary of changes made across all tuners:
Summary: tcp_conn_tuner: CongAlg Count
Summary: tcp_conn_tuner: cubic 46930
Summary: tcp_conn_tuner: bbr 51356
Summary: tcp_conn_tuner: htcp 44095
Summary: tcp_conn_tuner: dctcp 61755
root@px-m-42:/usr/src/bpftune#
Certainly! Happy to provide any info that would be legitimately helpful... something I think would help is a log level which emits its decision once without being super noisy... we see when it has made changes with summary, but we don't see something like:
ALTERNATIVELY, some way of querying bpftune to glean more insight might be sufficient... It's a really neat bit of kit you've got here, and it's REALLY well positioned to help peeps grok the WHY behind the WHAT... so my intention with the comment isn't to criticize in any way, but rather to highlight the value of a mode that's slightly more verbose than silent/unobserved... This stuff's complex and often poorly understood by even the most smartypants greybeards.