Poor TCP performance #23302
Indeed, there is a regression in TCP throughput. The problem with TCP has been that we currently have no proper tests that would catch this kind of regression. This has been one of the reasons we have been building a new version of the TCP stack that supports proper testing which could be integrated into sanitycheck. TCP2 has been cooking for quite a long time and we are slowly getting to the point where it would be useful, but we are not there yet.
I just compiled
So definitely more than 10kb/sec. Then using wrk
Now the transfer speed is very poor. I then increased the
I have this behavior for STM32F429ZI
I do not understand why the latency is so high. The previous result was for v2.2.0-rc3; for 2.1.0-rc3 it is even worse...
It also seems to me that issue with
I also checked
and this with d88f25b reverted, backlog size has no impact here
I'm very interested in getting this to work better. I can do more tests and even development if I get some guidance...
I personally never saw other TCP speed figures from Zephyr with frdm_k64f or any of the QEMU networking drivers. (Well, perhaps I saw 15 KBytes/s, but you get the point.) I actually wanted to add dumping of the download speed to
I'm currently working on switching big_http_download over to local host downloads (yep, to put it into CI and not rely on external hosts, which would add extra disturbance to testing). And I decided to share what I see with just an asciinema recording: https://asciinema.org/a/BXbcuYTQsrsPxDmdCnbfUz7Qz (as I'm not sure how soon they garbage-collect uploads, also attaching here

So, as you can see, it starts slow. Then at ~70KB (~00:19 on the cast) it finally starts to work "as it should", then at ~140KB (~00:21) it breaks again. You've already read my explanation of what happens (based on hours of peering into wireshark output, though months ago): it starts mis-synchronized with Linux, with TCP sending rexmits, and bumps to a large backoff delay, then manages to synchronize with Linux (this is literally the first time I see it), then loses the sync again and goes back to the swamp for the rest of the download.
(Which makes me think: maybe add a calculation of "momentary" speed over e.g. the last second or two, and print each one out on a separate line, not with "\r".)
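For what it's worth, a minimal sketch of how such a per-interval printout could look in the sample's C code. The helper name, the one-second interval, and the use of k_uptime_get()/printk are my own illustration under stated assumptions, not the sample's actual code:

```c
#include <zephyr/kernel.h>      /* k_uptime_get(); <kernel.h> on older trees */
#include <zephyr/sys/printk.h>  /* printk(); <sys/printk.h> on older trees */

/* Call after every successful recv(): prints a "momentary" speed roughly
 * once per second, each value on its own line instead of overwriting
 * the previous one with "\r". */
static void report_momentary_speed(size_t bytes_just_received)
{
	static int64_t window_start;  /* ms timestamp of current window */
	static size_t window_bytes;   /* bytes received in current window */
	int64_t now = k_uptime_get();

	if (window_start == 0) {
		window_start = now;
	}

	window_bytes += bytes_just_received;

	if (now - window_start >= 1000) {
		printk("momentary speed: %u B/s\n",
		       (unsigned int)((window_bytes * 1000U) /
				      (uint64_t)(now - window_start)));
		window_start = now;
		window_bytes = 0;
	}
}
```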
I tested this with nucleo_f767zi and see wildly different numbers which depend on the tool used. ApacheBench gave around 470kb/sec, which is quite reasonable. The wrk tool on the other hand pushes so much stuff to zephyr that it easily runs out of memory and starts to drop packets. This then affects the performance numbers a lot. I am lowering the priority of this one until we figure out what the correct numbers are.
Recently, Zephyr switched to the newer "TCP2" implementation by default. I'm doing its triaging for the 2.4 release (#27876, #27982), and also would like to record some quick performance data: dumb_http_server sample, qemu_x86, standard SLIP-based QEMU networking setup,
For reference, with TCP1 (
So, there's a noticeable improvement in connection latency.
But not so bright with data transfer throughput. The sample is big_http_download, the rest is as above. First, the reference with CONFIG_NET_TCP1=y: 3108 b/s. That's pretty slow. I remember speeds around 10KB/s. Dunno what's up now, my inet connection is not ideal either. So, with default TCP2, it's 6154 b/s (mind #27982, the transfer doesn't complete successfully; an implied end time was used to calculate the speed).
For reference, trying @nzmichaelh's hack from #26330 (comment) (set NET_TCP_BUF_MAX_LEN to (4*1280)) didn't make a statistically significant difference for me (on qemu_x86 with SLIP, again).
@rlubos so do we have a golden platform for zperf?
I'm not aware of any "golden" platform for tests; I think we should aim to get decent performance on any (non-emulated) Ethernet board. I've ordered a few on my own to perform some measurements.
Without a golden platform, we will have to take driver impacts into account, which may not be a good thing.
At Linaro, we standardized on frdm_k64f as the default networking platform from Zephyr's start. My records of Zephyr's networking testing against it (and qemu, which is everyone's favorite platform) are at https://docs.google.com/spreadsheets/d/1_8CsACPEXqrMIbxBKxPAds091tNAwnwdWkMKr3994QY/edit#gid=0 (it's pretty spotty, as it's my personal initiative to maintain such a spreadsheet, so it was done on a best-effort basis). When I have a chance, I'll test current Zephyr using the testcases in the spreadsheet (I'm working on other projects now).
I inserted the sleep when testing the F746ZG. It died otherwise. It was not included in the other measurements.
I used the same test in Mbed and FreeRTOS. I also tried with a 1400-byte packet. It did not improve the performance in Zephyr. After the first 90-byte packet was sent, Wireshark shows that the subsequent packets were closer to 1480 bytes. Basically the 90-byte packets accumulated and were sent together.
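As a side note, the coalescing of small writes described above is what Nagle-style batching of segments looks like on the wire. Purely as an illustration: on stacks that implement the standard TCP_NODELAY socket option it can be switched off as below. Whether the Zephyr version used in these measurements supports that option is an assumption on my part.

```c
#include <zephyr/net/socket.h>  /* <net/socket.h> on older Zephyr trees */

/* Illustration only: ask the stack to send small segments immediately
 * instead of coalescing them. Uses the standard TCP_NODELAY option;
 * availability in a given Zephyr version is assumed, not verified. */
static int disable_coalescing(int sock)
{
	int one = 1;

	return setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}
```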
@rlubos
@jukkar Because I observed that the max window size is quite critical. Sorry if I'm missing something here.
No specific reason, just a somewhat reasonable value when the code was written. If you find the value 4 more suitable, please send a PR that changes it.
I don't think it makes sense to open a PR with just another magic constant; at the very least it should be a Kconfig option. It worked for me, yes, but I'd like to understand the meaning behind it. Ideally this coefficient should be calculated.
With the latest patches in, in the field I get the cloud application dropping larger transfers as the transfer rate drops too low. This happens when less than 240 bytes are transferred in the last 5 seconds. Looking at the test results I can see something interesting happening. Based on the qemu_cortex_a9 target (not different for qemu_x86), transferring 60 kByte with preemptive scheduling:
With packet loss:
With cooperative scheduling Without packet loss:
With packet loss:
In the case of no packet loss, I would expect the elapsed time to be less than a second, as no timeouts need to occur. It is also interesting to see the cooperative scheduling being almost 2 times faster. In case of packet loss, timeouts may need to occur for re-transmission, so it could take longer. Nevertheless there is little difference in runtime between the case with and without packet loss. @rlubos mentioned that increasing the buffers
And putting a k_yield after the send call helps significantly to accelerate the test. But nevertheless, in a zero-delay, no-packet-loss test case, it should also be possible to get fast throughput with just a smaller number of buffers.
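For reference, a rough sketch of the workaround described above: a plain blocking send loop with a k_yield() after each send(). The function and variable names are illustrative and it assumes the POSIX-style socket names (send(), errno) are enabled; this is not the actual test code.

```c
#include <errno.h>
#include <zephyr/kernel.h>      /* k_yield(); <kernel.h> on older trees */
#include <zephyr/net/socket.h>  /* <net/socket.h> on older Zephyr trees */

/* Send `len` bytes over a connected TCP socket, yielding after every
 * send() so same-priority (cooperative) threads such as the net RX/TX
 * threads get a chance to run between calls. This mirrors the workaround
 * discussed above, not a general recommendation. */
static int send_all_yielding(int sock, const uint8_t *buf, size_t len)
{
	while (len > 0) {
		ssize_t out = send(sock, buf, len, 0);

		if (out < 0) {
			return -errno;
		}

		buf += out;
		len -= (size_t)out;

		k_yield();
	}

	return 0;
}
```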
I attempted to dive a little bit deeper in issue #45367.
I've finally managed to do some throughput tests on actual hardware. I had
TL;DR The results for
For testing, I've used
I've managed to increase the throughput by enabling the TX queue (
Another small throughput improvement can be achieved by setting the net buffer size to the actual network MTU (
Finally, TCP throughput can be further increased by maximizing the TCP window sizes. I've achieved that by increasing the
The overall results are presented in the table below (the measurements were taken on the receiving node, i.e.
A side note, I was able to improve the UDP TX throughput even further by modifying the
Now when it comes to
I've investigated this platform a bit, and the conclusion for the poor performance is as follows:
To summarize, I think that the results achieved on
As for the
Very interesting results. This clearly shows that in a happy-flow situation the performance can be pretty decent. You are using a point-to-point wired link, I assume. You increased the window by increasing CONFIG_NET_BUF_DATA_SIZE to 1500 bytes over the default 128; do you know if this has the same effect as increasing CONFIG_NET_BUF_RX/TX_COUNT by a factor of 12? Apart from maybe some processing overhead I would expect it to have the same effect. Only small packets will consume considerably less space. On a wireless network (cellular or WiFi) that introduces some packet loss, with big latency (to the other side of the world), things will start to look quite different. First of all there is no congestion avoidance, so the fairness to other network traffic is pretty bad. Secondly, if there is one packet lost along the way, the stack will start re-transmitting the complete transmit buffer. A triple duplicate-ACK triggered fast-retransmit will help here.
Yes, the whole point of this experiment was to see how the actual throughput compares to the theoretical maximum over 100 Mbit Ethernet, and it seems we're pretty close to the limit.
Yes, the default window size is calculated based on the buffer size and buffer count, i.e. the overall size of all of the buffers, so you could reach the same effect by increasing the buffer count. The sole reason to increase the buffer size here was to reduce the processing time of an individual frame. I would say however that this is only recommended if you really need to maximize your throughputs; usually it's better to increase the buffer count, as you don't waste space on small packets.
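To illustrate with round numbers (the buffer count here is hypothetical, and the exact window formula in the stack may differ slightly): if the window roughly equals buffer size × buffer count, then 40 buffers × 128 B ≈ 5 KB of window, while the same 40 buffers × 1500 B ≈ 60 KB. To reach the same window with 128-byte buffers you would need roughly 12 times as many buffers (~480), which is the factor-12 equivalence mentioned above.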
Well yes, it is expected that the throughput will be worse on lossy networks. If there are mechanisms specified in TCP that could help to improve performance in such cases, we should consider implementing them. I think though that those should be considered enhancements, not reported as "bugs" like this issue is.
@rlubos and @ssharks can we create an enhancement issue for this?
@rlubos: Could you redo the upload tests with the small window of #23302 (comment), with the fix of #46584 in? The figures will look very different, I believe.
@xhpohanka: PR #46584 was recently merged and I think it solves the issue you described. Are you in a position to check if your problem has been fixed? If so, this issue can be closed. In fact, issue #45844 looks pretty similar to your description.
@ssharks Hmm, but the Silly Window shouldn't affect the upload, as it's related to the RX window size? Did you mean download? Anyway, I've run the test again: no difference on the upload side, and the download throughput is slightly improved (in the low-window scenario) to 8.68 Mbps. When I tested the solution, the most significant performance boost happened in the case where we reported a zero window to the peer, as this didn't take place anymore with #46584. This didn't happen though in the initial test I performed here.
I continue playing with the Zephyr network stack and STM32, and I unfortunately found the next issue. With the nucleo_f429zi board and the big_http_download sample I got a very slow download speed. This pushed me to check network performance with zperf. For UDP transfers I got around 10 Mbps, but for TCP the result was only 10 kbps, which is really bad.

I tried whether some older versions of Zephyr behave better - fortunately v2.0.0 also got me around 10 Mbps for TCP in zperf. With bisecting I found that this issue starts with d88f25b. I hoped that reverting it would also solve the slow big_http_download, but surprisingly the download speed is still suspiciously low. I will continue to investigate this tomorrow.

I do not know if these issues are related just to the STM32 platform. I have just mentioned nucleo_f429zi and a custom board with STM32F750, which has a slightly different Ethernet peripheral, and my driver is also written using HAL. Both behave in the same way.

The issues I met so far with the Zephyr networking stack pose a question to me: is it mature enough for production?