
Poor TCP performance #23302

Closed · xhpohanka opened this issue Mar 5, 2020 · 77 comments

Labels
area: Networking · bug (The issue is a bug, or the PR is fixing a bug) · priority: low (Low impact/importance bug)

Comments

@xhpohanka (Contributor)

I continue playing with the Zephyr network stack on STM32, and I have unfortunately found another issue. With the nucleo_f429zi board and the big_http_download sample I got a very slow download speed. This pushed me to check the network performance with zperf.

For UDP transfers I got around 10 Mbps, but for TCP the result was only 10 kbps, which is really bad.

I checked whether some older versions of Zephyr behave better - fortunately, v2.0.0 also gave me around 10 Mbps for TCP in zperf. By bisecting I found that this issue starts with d88f25b.

I hoped that reverting it would also fix the slow big_http_download, but surprisingly the download speed is still suspiciously low. I will continue to investigate this tomorrow.

I do not know if these issues are specific to the STM32 platform. I have tried the mentioned nucleo_f429zi and a custom board with an STM32F750, which has a slightly different Ethernet peripheral and whose driver is also written using the HAL. Both behave in the same way.

The issues I have met so far with the Zephyr networking stack make me question whether it is mature enough for production.

xhpohanka added the bug label (The issue is a bug, or the PR is fixing a bug) on Mar 5, 2020
@jukkar (Member) commented Mar 5, 2020

Indeed, there is a regression in TCP throughput. The problem with TCP has been that we currently have no proper tests that would catch these kinds of regressions. This is one of the reasons we have been building a new version of the TCP stack that supports proper testing and can be integrated into sanitycheck. TCP2 has been cooking for quite a long time and we are slowly getting to the point where it is useful, but we are not there yet.
Obviously, 10 kb/sec is not good and this needs to be fixed.

tbursztyka removed their assignment on Mar 6, 2020
@jukkar (Member) commented Mar 6, 2020

I just compiled samples/net/sockets/dumb_http_server_mt for the Atmel sam-e70 board and got the following numbers:

ab -n100 http://192.0.2.1:8080/
This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.0.2.1 (be patient).....done


Server Software:        
Server Hostname:        192.0.2.1
Server Port:            8080

Document Path:          /
Document Length:        2084 bytes

Concurrency Level:      1
Time taken for tests:   0.381 seconds
Complete requests:      100
Failed requests:        0
Total transferred:      214000 bytes
HTML transferred:       208400 bytes
Requests per second:    262.67 [#/sec] (mean)
Time per request:       3.807 [ms] (mean)
Time per request:       3.807 [ms] (mean, across all concurrent requests)
Transfer rate:          548.95 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       1
Processing:     1    3  20.7      1     208
Waiting:        1    3  20.7      1     208
Total:          1    4  20.6      2     208

Percentage of the requests served within a certain time (ms)
  50%      2
  66%      2
  75%      2
  80%      2
  90%      2
  95%      2
  98%      3
  99%    208
 100%    208 (longest request)

So definitely more than 10 kb/sec. Then, using wrk:

./wrk -d 20 -t 24 -c 500 --latency http://192.0.2.1:8080
Running 20s test @ http://192.0.2.1:8080
  24 threads and 500 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    10.74ms   62.79ms 616.12ms   97.41%
    Req/Sec     1.74      3.83    20.00     89.80%
  Latency Distribution
     50%    1.29ms
     75%    1.35ms
     90%    4.46ms
     99%  207.79ms
  116 requests in 20.10s, 242.42KB read
  Socket errors: connect 0, read 134, write 0, timeout 0
Requests/sec:      5.77
Transfer/sec:     12.06KB

Now the transfer rate is very poor.

I then increased CONFIG_NET_TCP_BACKLOG_SIZE to 2 and got considerably better results:

./wrk -d 10 -t 2 -c 100 --latency http://192.0.2.1:8080
Running 10s test @ http://192.0.2.1:8080
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.61ms  421.92us   9.90ms   95.03%
    Req/Sec   375.32    153.56   610.00     72.80%
  Latency Distribution
     50%    1.41ms
     75%    1.88ms
     90%    1.98ms
     99%    2.17ms
  4768 requests in 10.01s, 9.73MB read
  Socket errors: connect 0, read 4774, write 0, timeout 0
Requests/sec:    476.10
Transfer/sec:      0.97MB

@xhpohanka (Contributor, Author)

I see this behavior on the STM32F429ZI:

CONFIG_NET_TCP_BACKLOG_SIZE=1

$ wrk -d 20 -t 24 -c 500 --latency http://192.0.2.1:8080
Running 20s test @ http://192.0.2.1:8080
  24 threads and 500 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.51ms    1.79ms   8.77ms   89.86%
    Req/Sec     1.86      3.43    10.00     86.96%
  Latency Distribution
     50%    1.93ms
     75%    1.95ms
     90%    6.70ms
     99%    8.77ms
  69 requests in 20.10s, 144.20KB read
  Socket errors: connect 0, read 69, write 0, timeout 0
Requests/sec:      3.43
Transfer/sec:      7.17KB

CONFIG_NET_TCP_BACKLOG_SIZE=2

$ wrk -d 20 -t 24 -c 500 --latency http://192.0.2.1:8080
Running 20s test @ http://192.0.2.1:8080
  24 threads and 500 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   196.18ms   36.93ms 407.71ms   96.32%
    Req/Sec     8.92      4.18    35.00     85.63%
  Latency Distribution
     50%  202.17ms
     75%  202.23ms
     90%  202.32ms
     99%  209.31ms
  896 requests in 20.09s, 1.84MB read
  Socket errors: connect 0, read 898, write 0, timeout 0
Requests/sec:     44.60
Transfer/sec:     93.92KB

I do not understand why the latency is so high.

The previous result was for v2.2.0-rc3; for v2.1.0-rc3 it is even worse...
CONFIG_NET_TCP_BACKLOG_SIZE=2

$ wrk -d 20 -t 24 -c 500 --latency http://192.0.2.1:8080
Running 20s test @ http://10.42.0.192:8080
  24 threads and 500 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   168.84ms   74.40ms 220.12ms   82.07%
    Req/Sec     5.78     10.02   100.00     98.40%
  Latency Distribution
     50%  202.34ms
     75%  202.38ms
     90%  202.58ms
     99%  218.44ms
  145 requests in 20.03s, 305.86KB read
  Socket errors: connect 0, read 789, write 0, timeout 0
Requests/sec:      7.24
Transfer/sec:     15.27KB

nashif added the priority: medium label (Medium impact/importance bug) on Mar 6, 2020
@xhpohanka (Contributor, Author) commented Mar 6, 2020

It also seems to me that the issue with the big_http_download sample is a bit different. The download is very slow for me regardless of CONFIG_NET_TCP_BACKLOG_SIZE or of reverting commit d88f25b. Downloading a 52 kB file takes 8.5 seconds. You can check my Wireshark log at https://drive.google.com/open?id=163-v3MlK3Hgc4F47X05GSSJRo3EFpX-s; it is full of [TCP Window Full] packets.

I also checked zperf again on v2.2.0-rc3, and this is what I get with the upstream code:

$ iperf -c 192.0.2.1
------------------------------------------------------------
Client connecting to 192.0.2.1, TCP port 5001
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[  3] local 192.0.2.2 port 51802 connected with 192.0.2.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.2 sec  13.6 KBytes  10.9 Kbits/sec

and this is with d88f25b reverted; the backlog size has no impact here:

$ iperf -c 192.0.2.1
------------------------------------------------------------
Client connecting to 192.0.2.1, TCP port 5001
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[  3] local 192.0.2.2 port 51778 connected with 192.0.2.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.1 sec  16.1 MBytes  13.4 Mbits/sec

I'm very interested in getting this to work better. I can do more tests and even development if I get some guidance...

@jukkar (Member) commented Mar 8, 2020

Some fixes for the memory leaks are in #23334.
Issue #23246 is related to this one. The low performance numbers are probably caused by lots of packets being dropped. Anyway, the memory leaks need to be fixed first, and then we can see how the performance numbers behave.

@pfalcon (Contributor) commented Mar 10, 2020

For UDP transfers I got around 10 Mbps, but for TCP the result was only 10 kbps, which is really bad.

Indeed, there is a regression in TCP throughput.

I personally never saw different TCP speed figures from Zephyr with frdm_k64f or any of the QEMU networking drivers. (Well, perhaps I saw 15 KBytes/s, but you get the point.) I actually wanted to add dumping of the download speed to big_http_download, but decided against it ;-).

@pfalcon (Contributor) commented Apr 2, 2020

I'm currently working on switching big_http_download over to localhost downloads (yes, to put it into CI and not rely on external hosts, which would add extra disturbance to the testing).

And I decided to share what I see with an asciinema recording: https://asciinema.org/a/BXbcuYTQsrsPxDmdCnbfUz7Qz (as I'm not sure how soon they garbage-collect uploads, I'm also attaching it here:
big_http_download-localhost.zip ).

So, as you can see, it starts slow. Then at ~70 KB (~00:19 on the cast) it finally starts to work "as it should", then at ~140 KB (~00:21) it breaks again. You've already read my explanation of what happens (based on hours of peering into Wireshark output, though months ago): it starts mis-synchronized with Linux TCP, sending rexmits and running into a large backoff delay, then manages to synchronize with Linux (this is literally the first time I have seen that), then loses the sync again and goes back into the swamp for the rest of the download.

@pfalcon (Contributor) commented Apr 2, 2020

(Which makes me think I should maybe add a calculation of the "momentary" speed over, e.g., the last second or two, and print each value on a separate line, not with "\r".)

@jukkar (Member) commented May 25, 2020

I tested this with nucleo_f767zi and see wildly different numbers depending on the tool used. ApacheBench gave around 470 kb/sec, which is quite reasonable. The wrk tool, on the other hand, pushes so much traffic to Zephyr that it easily runs out of memory and starts to drop packets, which then affects the performance numbers a lot. I am lowering the priority of this one until we figure out what the correct numbers are.

jukkar added the priority: low label and removed the priority: medium label on May 25, 2020
@github-actions (bot)

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed; otherwise this issue will automatically be closed in 14 days. Note that you can always re-open a closed issue at any time.

github-actions bot added the Stale label on Jul 25, 2020
github-actions bot closed this as completed on Aug 8, 2020
dleach02 reopened this on Sep 2, 2020
@pfalcon (Contributor) commented Sep 3, 2020

Recently, Zephyr switched to the newer "TCP2" implementation by default. I'm triaging it for the 2.4 release (#27876, #27982), and would also like to record some quick performance data:

dumb_http_server sample, qemu_x86, standard SLIP-based QEMU networking setup, ab -n1000 http://192.0.2.1:8080/: 60.78 req/s.

For reference, with TCP1 (CONFIG_NET_TCP1=y), it's 4.18 req/s.

So, there's a noticeable improvement in connection latency.
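(For anyone reproducing the comparison: the TCP1 reference above was built with the legacy stack selected via the Kconfig symbol already mentioned, i.e. a prj.conf overlay along these lines; whether the symbol is still available depends on the Zephyr version in use.)

CONFIG_NET_TCP1=y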

@pfalcon (Contributor) commented Sep 3, 2020

But it is not so bright with data transfer throughput. The sample is big_http_download, the rest is as above.

First, the reference with CONFIG_NET_TCP1=y: 3108 b/s. That's pretty slow; I remember speeds around 10 KB/s. Dunno what's up now, my internet connection is not ideal either.

So, with the default TCP2, it's 6154 b/s (mind #27982: the transfer doesn't complete successfully, so an implied end time was used to calculate the speed).

@pfalcon (Contributor) commented Sep 3, 2020

For reference, trying @nzmichaelh's hack from #26330 (comment) (setting NET_TCP_BUF_MAX_LEN to (4*1280)) didn't make a statistically significant difference for me (on qemu_x86 with SLIP, again).

@hakehuang (Collaborator)

@rlubos so do we have a golden platform for zperf?

@rlubos (Contributor) commented Mar 15, 2022

@rlubos so do we have a golden platform for zperf?

I'm not aware of any "golden" platform for tests; I think we should aim to get decent performance on any (non-emulated) Ethernet board. I've ordered a few boards myself to perform some measurements.

@hakehuang (Collaborator)

I'm not aware of any "golden" platform for tests; I think we should aim to get decent performance on any (non-emulated) Ethernet board. I've ordered a few boards myself to perform some measurements.

Without a golden platform, we will have to take driver impacts into account, which may not be a good thing.

@pfalcon (Contributor) commented Mar 16, 2022

Without a golden platform, we will have to take driver impacts into account, which may not be a good thing.

At Linaro, we standardized on frdm_k64f as the default networking platform from Zephyr's start. My records of Zephyr's networking testing against it (and QEMU, which is everyone's favorite platform) are at https://docs.google.com/spreadsheets/d/1_8CsACPEXqrMIbxBKxPAds091tNAwnwdWkMKr3994QY/edit#gid=0 (it's pretty spotty, as maintaining such a spreadsheet is my personal initiative, done on a best-effort basis). When I have a chance, I'll test current Zephyr using the test cases in the spreadsheet (I'm working on other projects now).

@mdkf commented Mar 21, 2022

Why is there a sleep within a send loop (https://github.com/mdkf/ZephyrTCPSlow/blob/main/src/socketTest.c#L159)? A 2 ms sleep between each datagram sent doesn't sound like the best idea for a throughput-measuring test.

I inserted the sleep when testing the F746ZG. It died otherwise. It was not included in the other measurements.

Additionally, a 90-byte payload size per datagram seems pretty small for throughput measurement; did you send such small packets with Mbed as well?

I used the same test in Mbed and FreeRTOS. I also tried with a 1400-byte packet. It did not improve the performance in Zephyr. After the first 90-byte packet was sent, Wireshark shows the subsequent packets sent were closer to 1480 bytes. Basically, the 90-byte packets accumulated and were sent together.

@AndreyDodonov-EH (Contributor)

@hakehuang What test code did you use on the Zephyr side, zperf? Note, there's currently a bug, fixed in #43379; without it you won't get decent results (the recv window gets filled and the communication stalls).

Also, please note that increasing the window size means that you also need to increase the RX pkt/buf count in the system, otherwise the effort is futile (the net driver will start dropping packets, forcing retransmissions from the server, which has a large negative impact on throughput). Increasing the TX pkt/buf count a bit also makes sense, as we need to acknowledge each TCP packet received (I wonder if there's room for improvement here; in theory it should be enough to acknowledge once with a larger ACK value).

@rlubos
Great that you mentioned the ACK issue.
Yes, there is room for improvement in terms of ACKing multiple packets with a single ACK; there is even a (somewhat stale) issue for that: #30366
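To make the buffer scaling concrete, something like the following minimal prj.conf sketch is what is being discussed, using only the Kconfig symbols already named in this thread; the counts are illustrative placeholders, not tuned recommendations:

# Illustrative values only: scale RX packets/buffers so a larger RX window can actually be filled
CONFIG_NET_PKT_RX_COUNT=40
CONFIG_NET_BUF_RX_COUNT=40
# Some TX headroom as well, since every received TCP segment currently gets acknowledged
CONFIG_NET_PKT_TX_COUNT=20
CONFIG_NET_BUF_TX_COUNT=20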

@AndreyDodonov-EH (Contributor)

@jukkar
Probably the wrong thread to ask, but is there a reason behind the magic constant 3?
https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/net/ip/tcp.c#L1801

I ask because I observed that the maximum window size is quite critical.
To avoid data buffer allocation errors, I actually had to change it to 4, or define a custom CONFIG_NET_TCP_MAX_SEND_WINDOW_SIZE.

Sorry if I'm missing something here.
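For reference, the CONFIG_NET_TCP_MAX_SEND_WINDOW_SIZE workaround can be expressed as a plain prj.conf overlay; the value below is only an example picked for illustration, not a recommendation (changing the divisor itself requires editing tcp.c):

# Example value only: overrides the send window otherwise derived from the buffer pool and the magic divisor
CONFIG_NET_TCP_MAX_SEND_WINDOW_SIZE=8192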

AndreyDodonov-EH referenced this issue in endresshauser-lp/sdk-zephyr Apr 22, 2022
@jukkar (Member) commented Apr 26, 2022

Probably the wrong thread to ask, but is there a reason behind the magic constant 3?

No specific reason, just a somewhat reasonable value when the code was written. If you find value 4 more suitable, please send a PR that changes it.

@AndreyDodonov-EH (Contributor)

Probably the wrong thread to ask, but is there a reason behind the magic constant 3?

No specific reason, just a somewhat reasonable value when the code was written. If you find value 4 more suitable, please send a PR that changes it.

I don't think it makes sense to open a PR with just another magic constant; it should at the very least be a Kconfig option.

It worked for me, yes, but I'd like to understand the meaning behind it. Ideally this coefficient should be calculated.

@ssharks (Collaborator) commented May 3, 2022

With the latest patches in, in the field I see the cloud application dropping larger transfers because the transfer rate drops too low. This happens when fewer than 240 bytes are transferred in the last 5 seconds.

Looking at the test results, I can see something interesting happening.

Based on the qemu_cortex_a9 target (no different for qemu_x86), transferring 60 kByte:

With preemptive scheduling:

Without packet loss:
===================================================================
START - test_v4_send_recv_large
 PASS - test_v4_send_recv_large in 19.84 seconds

With packet loss:

===================================================================
START - test_v4_send_recv_large
 PASS - test_v4_send_recv_large in 25.102 seconds
===================================================================

With cooperative scheduling

Without packet loss:

===================================================================
START - test_v4_send_recv_large
 PASS - test_v4_send_recv_large in 10.751 seconds

With packet loss:

===================================================================
START - test_v4_send_recv_large
 PASS - test_v4_send_recv_large in 22.12 seconds
===================================================================

In the case of no packet loss, I would expect the elapsed time to be less than a second, as no timeouts need to occur. It is also interesting to see cooperative scheduling being almost 2 times faster. In the case of packet loss, timeouts may need to occur for retransmission, so it could take longer. Nevertheless, there is little difference in runtime between the cases with and without packet loss.

@rlubos mentioned that increasing the buffers

CONFIG_NET_BUF_RX_COUNT=64
CONFIG_NET_BUF_TX_COUNT=64

and putting a k_yield() after the send call helps significantly to accelerate the test. Nevertheless, in a zero-delay, no-packet-loss test case, it should also be possible to get fast throughput with a smaller number of buffers.

@ssharks (Collaborator) commented May 5, 2022

I attempted to dive a little deeper into this in issue #45367.

@rlubos (Contributor) commented May 31, 2022

I've finally managed to do some throughput tests on actual hardware. I had mimxrt1020_evk and nucleo_h723zg on the table.

TL;DR The results for nucleo_h723zg are good, but for mimxrt1020_evk they're rather poor.

For testing, I've used iperf on the Linux host side and the zperf sample on the Zephyr side. Let's focus on nucleo_h723zg first. As a reference, I've used the UDP throughput, since in that case we avoid protocol-specific constraints (like the TX/RX window size with TCP).
Initial results for zperf running in the default configuration are not bad, but not great either. When analyzing the eth_stm32_hal.c driver I noticed, though, that on the TX path the driver blocks during transmission, effectively negating any positive performance effect of using DMA. As a result, the total transmission time of a single frame consists not only of the time needed to actually transmit the frame, but also of the time needed to process UDP/IP. Since in the default configuration Zephyr does the L4/L3/L2 and driver processing in a single thread, all of the processing times add up, affecting the final throughput.

I've managed to increase the throughput by enabling the TX queue (CONFIG_NET_TC_TX_COUNT=1). As a result, a packet, instead of being passed to L2 directly, is queued, and the actual L2 and driver processing is done in a separate thread. This allows for increased throughput, because while the driver blocks during transmission, the other thread, which does the L4/L3 processing, is able to proceed with the next frame. I think this should be the default configuration in the zperf sample.

Another small throughput improvement can be achieved by setting the net buffer size to the actual network MTU (CONFIG_NET_BUF_DATA_SIZE=1500). In this case, L3/L4 processing takes less time, as the packet consists of a single buffer instead of a chain of buffers that the net stack needs to process. This also increases the default TCP TX/RX window size, which improves TCP throughput in both directions.

Finally, TCP throughput can be further increased by maximizing the TCP window sizes. I've achieved that by increasing the net_pkt/net_buf count and relying on the default window size set by Zephyr.

The overall results are presented in the table below (the measurements were taken on the receiving node, i.e. iperf for upload, zperf for download):

| Configuration | TCP RX/TX window | UDP upload | TCP upload | UDP download | TCP download |
|---|---|---|---|---|---|
| Default | 1194 | 51.2 Mbits/sec | 670 Kbits/sec | 88.12 Mbits/sec | 7.71 Mbits/sec |
| CONFIG_NET_TC_TX_COUNT=1 | 1194 | 73.6 Mbits/sec | 670 Kbits/sec | 88.03 Mbits/sec | 7.73 Mbits/sec |
| CONFIG_NET_TC_TX_COUNT=1, CONFIG_NET_BUF_DATA_SIZE=1500 | 14000 | 78.3 Mbits/sec | 69.8 Mbits/sec | 88.01 Mbits/sec | 75.03 Mbits/sec |
| CONFIG_NET_TC_TX_COUNT=1, CONFIG_NET_BUF_DATA_SIZE=1500, CONFIG_NET_PKT_RX/TX_COUNT=80, CONFIG_NET_BUF_RX/TX_COUNT=80 | 40000 | 77.9 Mbits/sec | 75.0 Mbits/sec | 88.12 Mbits/sec | 79.56 Mbits/sec |
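For clarity, the last table row corresponds roughly to the following prj.conf fragment (the RX/TX shorthand expanded into the individual Kconfig symbols; treat this as a sketch of the configuration used rather than a verified drop-in overlay):

CONFIG_NET_TC_TX_COUNT=1
CONFIG_NET_BUF_DATA_SIZE=1500
CONFIG_NET_PKT_RX_COUNT=80
CONFIG_NET_PKT_TX_COUNT=80
CONFIG_NET_BUF_RX_COUNT=80
CONFIG_NET_BUF_TX_COUNT=80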

As a side note, I was able to improve the UDP TX throughput even further by modifying eth_stm32_hal.c to block not after handing the packet to the HAL, but before (i.e. to block if the previous transfer hasn't finished yet). This allowed reaching ~87 Mbits/sec; however, I'm not confident enough to push those changes upstream, as there are other aspects to consider (for instance, PTP is processed after the packet is transmitted, and I'm not sure that change wouldn't break it). I'll leave it to the driver maintainers to decide whether to improve this or not.

Now when it comes to mimxrt1020_evk, the results are presented below:

| Configuration | TCP RX/TX window | UDP upload | TCP upload | UDP download | TCP download |
|---|---|---|---|---|---|
| CONFIG_NET_TC_TX_COUNT=1, CONFIG_NET_BUF_DATA_SIZE=1500, CONFIG_NET_PKT_RX/TX_COUNT=80, CONFIG_NET_BUF_RX/TX_COUNT=80 | 40000 | 17.2 Mbits/sec | 7.88 Mbits/sec | 16.23 Mbits/sec | 456 Kbits/sec |

I've investigated this platform a bit, and the conclusion for the poor performance is as follows:

  • (minor) The eth_mcux.c driver does the same thing as eth_stm32_hal.c, i.e. it blocks during transfer. In this case, however, there is an additional thread within the driver involved in unblocking, which adds extra overhead due to scheduling.
  • (major) When measuring the time needed to process individual frames, I noticed that this platform is much slower than nucleo_h723zg (it took ~4 times longer to do the L4/L3 processing). This is a bit surprising to me, as both platforms appear to be running a Cortex-M7 with similar CPU speeds (500 MHz vs 550 MHz). @dleach02, do you perhaps know what the reason for this could be?
  • (major) When downloading at full speed, the driver reports lots of errors (<err> eth_mcux: ENET_GetRxFrameSize return: 4001). I don't know the reason, but it could be a side effect of the previous point.

To summarize, I think the results achieved on nucleo_h723zg prove that it is possible to achieve competitive throughput with Zephyr, given proper configuration and a well-written Ethernet driver. Ideally it would be good to test other platforms as well, but due to the limited availability of development kits in general I couldn't get some obvious choices like the super-popular frdm_k64f. I therefore suggest closing this general issue, as it might be misleading given the above results, and opening board/driver-specific issues instead.

@rlubos (Contributor) commented May 31, 2022

As for zperf, the sample uses the net_context API directly, which gave me a bit of a headache due to some issues with TCP handling in the sample (the TCP context was freed too early because the sample did not add an extra ref to the net_context, and it does not take EAGAIN/ENOBUFS returned by the TCP layer into consideration). I'm thinking, however, that instead of fixing those issues it would be worthwhile to rewrite the sample to use the socket API instead, which is a more realistic scenario for actual apps. I plan to work on this in the near future.

@ssharks (Collaborator) commented May 31, 2022

Very interesting results. This clearly shows that in a happy-flow situation the performance can be pretty decent. You are using a point-to-point wired link, I assume.
The polling implementation definitely helps to improve throughput quite a bit.

You increased the window by increasing CONFIG_NET_BUF_DATA_SIZE to 1500 bytes over the default 128; do you know if this has the same effect as increasing CONFIG_NET_BUF_RX/TX_COUNT by a factor of 12? Apart from maybe some processing overhead I would expect it to have the same effect, except that small packets will consume considerably less space.

On a wireless network (cellular or WiFi) that introduces some packet loss, with high latency (to the other side of the world), things will start to look quite different. First of all, there is no congestion avoidance, so the fairness to other network traffic is pretty bad. Secondly, if one packet is lost along the way, the stack will start retransmitting the complete transmit buffer. A fast retransmit triggered by a triple duplicate ACK would help here.

@rlubos (Contributor) commented Jun 1, 2022

You are using a point-to-point wired link, I assume.

Yes, the whole point of this experiment was to see how the actual throughput compares to the theoretical maximum over 100 Mbit Ethernet, and it seems we're pretty close to the limit.

You increased the window by increasing CONFIG_NET_BUF_DATA_SIZE to 1500 bytes over the default 128; do you know if this has the same effect as increasing CONFIG_NET_BUF_RX/TX_COUNT by a factor of 12? Apart from maybe some processing overhead I would expect it to have the same effect, except that small packets will consume considerably less space.

Yes, the default window size is calculated based on the buffer size and buffer count, i.e. the overall size of all of the buffers, so you could reach the same effect by increasing the buffer count. The sole reason to increase the buffer size here was to reduce the processing time of an individual frame. I would say, however, that this is only recommended if you really need to maximize your throughput; usually it's better to increase the buffer count, as you don't waste space on small packets.
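(As a back-of-the-envelope illustration, and only my reading of it rather than a verified formula: 80 buffers of 1500 bytes give a 120000-byte pool, and dividing by the constant 3 discussed earlier in this thread yields 40000, which matches the window size in the last table row above.)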

On a wireless network (cellular or WiFi) that introduces some packet loss, with high latency (to the other side of the world), things will start to look quite different. First of all, there is no congestion avoidance, so the fairness to other network traffic is pretty bad. Secondly, if one packet is lost along the way, the stack will start retransmitting the complete transmit buffer. A fast retransmit triggered by a triple duplicate ACK would help here.

Well, yes, it is expected that throughput will be worse on lossy networks. If there are mechanisms specified for TCP that could help improve performance in such cases, we should consider implementing them. I think, though, that those should be considered enhancements, not reported as "bugs" like this issue is.

@carlescufi (Member)

On a wireless network (cellular or WiFi) that introduces some packet loss, with high latency (to the other side of the world), things will start to look quite different. First of all, there is no congestion avoidance, so the fairness to other network traffic is pretty bad. Secondly, if one packet is lost along the way, the stack will start retransmitting the complete transmit buffer. A fast retransmit triggered by a triple duplicate ACK would help here.

@rlubos and @ssharks can we create an enhancement issue for this?

@ssharks (Collaborator) commented Jun 23, 2022

@rlubos: Could you redo the upload tests with the small window from #23302 (comment), with the fix from #46584 in? I believe the figures will look very different.

@xhpohanka: PR #46584 was recently merged and I think it solves the issue you described. Are you in a position to check whether your problem has been fixed? If so, this issue can be closed. In fact, issue #45844 looks pretty similar to your description.

@xhpohanka (Contributor, Author)

Hello @ssharks,
I have not done zperf testing for a long time, but I have checked the recent updates to the TCP stack, including #46584. In our application the performance really improved a lot. From my POV this issue can be closed :)

@rlubos (Contributor) commented Jun 24, 2022

@ssharks Hmm, but the Silly Window fix shouldn't affect the upload, as it's related to the RX window size? Did you mean download?

Anyway, I've run the test again: no difference on the upload side, and the download throughput improved slightly (in the low-window scenario) to 8.68 Mbps. When I tested the solution, the most significant performance boost happened in the case where we reported a zero window to the peer, as that no longer takes place with #46584. That didn't happen, though, in the initial test I performed here.

@rlubos (Contributor) commented Jun 24, 2022

Hello @ssharks,
I have not done zperf testing for a long time, but I have checked the recent updates to the TCP stack, including #46584. In our application the performance really improved a lot. From my POV this issue can be closed :)

I suggest we thereby close this long-open issue.
