net: tx_bufs are not freed when NET_TCP_BACKLOG_SIZE is too high #23246
Comments
Tried this with nucleo_f767zi and I could reproduce the issue with wrk. I did not see a similar issue with the Apache Bench (ab) tool; wrk seems to stress the target more.
In fact I was also able to reproduce it with Chrome, but it was quite difficult to press F5 fast enough :)
I suspect that when the number of incoming connections increases, we eventually run out of network buffers in some part of the code. When that happens, the code notices it but does not release some already-allocated buffers, which leads to a buffer leak. I will be investigating this.
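For anyone trying to reproduce this, one way to watch the buffer pools drain on the target is the Zephyr net shell. This is only a sketch, assuming CONFIG_NET_SHELL=y and a net_pkt allocation-tracking debug option (such as CONFIG_NET_DEBUG_NET_PKT_ALLOC, name from memory, check your tree's Kconfig) are enabled in prj.conf; the exact command output differs between Zephyr versions:

```
uart:~$ net mem      # prints the RX/TX packet and data buffer pools and how many are free
uart:~$ net allocs   # with allocation tracking enabled, lists where each buffer was allocated
```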
@xhpohanka I hopefully managed to fix the memory leak. Can you try #23334 and report whether it works for you?
Hello @jukkar. Thank you for the support. With your patches and
Tried the PR with this config on an Atmel SAM-E70 (300 MHz, 2 MB flash, 384 KB SRAM):
Tried this multiple times and the results were consistent; I did not see any memory leaks.
Please try to increase it. For example, my config (I had to decrease the buffer counts due to SRAM limitations; see the illustrative sketch below)
results in
Or is increasing
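The actual option name and values are lost from this excerpt. Purely as an illustration, an SRAM-constrained prj.conf of this kind usually touches the following options; the numbers are made up, not the ones from this comment:

```
# Illustrative only; pick counts that fit the board's SRAM
CONFIG_NET_PKT_RX_COUNT=14
CONFIG_NET_PKT_TX_COUNT=14
CONFIG_NET_BUF_RX_COUNT=36
CONFIG_NET_BUF_TX_COUNT=36
CONFIG_NET_TCP_BACKLOG_SIZE=10
```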
Increasing the number of handlers to 8 gave slightly better results, but not by much. The best result out of 5 runs was this one:
No leaks were seen. I will try with the Nucleo F767ZI next.
There is something fishy with my Nucleo board; the transfer rates are not that great.
Slightly better with a smaller number of connections:
I did not see any memory leaks. I wonder why the numbers are so much smaller with this STM board.
It looks like there may be some additional issue on the STM32 platform. Unfortunately I do not have any other board to test with right now.
I did some testing and the STM numbers are not good. I used identical configuration options on all the boards.
Config in
Host side:
Just FYI, I could replicate the memory leak on stm32f767zi by setting the net buf/pkt counts low.
Hopefully I managed to nail the nasty net buf leak. Basically, if the device is flooded with incoming packets, the TX timer in TCP could trigger and we then tried to send the packet a second time even though the previous message was still pending in the TX queue. This caused a memory leak that was only seen when the device was very busy. @xhpohanka, can you try the latest version of the PR and report the status?
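For context, the shape of the fix (as described in the commit message further below) is roughly the following. This is a simplified sketch, not the actual patch: the net_pkt_sent()/net_pkt_set_sent() accessor names are taken from the commit message, while pending_packet() and net_tcp_queue_pkt() are hypothetical stand-ins used only so the sketch is self-contained.

```c
#include <kernel.h>
#include <net/net_pkt.h>

/* Hypothetical helpers, declared here only to keep the sketch compilable. */
struct net_pkt *pending_packet(struct k_work *work);
void net_tcp_queue_pkt(struct net_pkt *pkt);

/* Simplified sketch of the retransmit path; the real change lives in
 * subsys/net/ip/tcp.c.
 */
static void tcp_retry_expired(struct k_work *work)
{
	struct net_pkt *pkt = pending_packet(work);

	if (net_pkt_sent(pkt)) {
		/* The previous copy of this segment is still queued in the TX
		 * path. Queuing it again would leak a buffer once the first
		 * send completes; checking the flag here, and only here in
		 * tcp.c, removes the race between the TX thread and this timer.
		 */
		return;
	}

	net_pkt_set_sent(pkt, true);
	net_tcp_queue_pkt(pkt);
}
```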
I have tried all the previous tests and they work correctly with the latest PR. I was able to stress my targets so that there were no free buffers,
and they recover from such a state without any problem.
Let's hope the leaks are gone :)
The very bad performance could be explained (at least partially) by the small number of network buffers available. If the device just cannot handle the request, the data is dropped. The wrk tool really stress tests the device, as it floods the system with a constant stream of HTTP requests. I got somewhat better performance values with the Apache Bench tool; I am not sure which tool is right or how the values are calculated in each of them. Anyway, if we have really fixed all the leaks, then we can probably start to investigate why the perf numbers look like that.
I wouldn't agree that the number of network buffers we have by default (now) is small. To remind, I advocated for another approach: to test (and get working) the stack with as small a number of buffers as possible. That would expose many of the deadlocks/softlocks we have. Then we could indeed throw in more buffers, and everything would "just work" (more performantly, apparently).
I proposed to standardize on Apache Bench because it is a well-known tool, readily available in any distro (so reports can be reproduced between you and other people). I so far found
Mind that there are many "moving parts":
So, as usual, to investigate something we need to describe/standardize the setup to test against, and that has always been a problem.
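To make the Apache Bench proposal above concrete, an invocation along these lines would give numbers that are easy to reproduce between setups; the target address is taken from the wrk command in the original report, and the request count and concurrency here are arbitrary:

```
ab -n 1000 -c 50 http://192.1.1.145:8080/
```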
Just another note on speed. Even if
One config option that might help (not tested) is to make the
In my investigations (dated by now), I came to the conclusion that any retransmissions are controlled by the Linux side, and we indeed interact badly with the Linux TCP stack, causing these retransmissions to start and then increase in a positive feedback loop. I believe where I stopped was trying to change some retransmit timeout parameters on the Linux side to test this hypothesis, and IIRC I found out these parameters are effectively hardcoded (or cannot be controlled precisely). Again, this is all from somewhat vague memory. Definitely +1 for trying to reproduce these matters, just a word of notice that it may not be that easy to see any changes.
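For anyone who wants to retest that hypothesis, these are the Linux-side knobs that do exist (illustrative only; the initial retransmission timeout from RFC 6298 is a compile-time constant in the kernel, which matches the "effectively hardcoded" observation above):

```
# Global limit on retransmission attempts before a connection is dropped
sysctl net.ipv4.tcp_retries2
# Per-route minimum RTO; the subnet and interface here are just examples
sudo ip route change 192.1.1.0/24 dev eth0 rto_min 200ms
```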
The code was leaking memory in TX side when there was lot of incoming packets. The reason was that the net_pkt_sent() flag was manipulated in two threads which caused races. The solution is to move the sent flag check only to tcp.c. Fixes zephyrproject-rtos#23246 Signed-off-by: Jukka Rissanen <[email protected]>
Hi, using nucleo_f767zi with Zephyr 2.3.0 and just net/dhcpv4 and MQTT, I still see "Failed to obtain RX buffer". Info below.

prj.conf:

```
# C Library
# CONFIG_MINIMAL_LIBC is not set
CONFIG_NEWLIB_LIBC=y
# end of C Library

# Enable networking
# Enable IPv4 support
CONFIG_NET_IPV4=y
CONFIG_NET_CONFIG_NEED_IPV4=y
CONFIG_NET_IPV6=n
# Enable DHCPv4 support
CONFIG_NET_DHCPV4=y
# Network connection manager to receive L4 events (e.g. IP address notification)
CONFIG_NET_CONNECTION_MANAGER=y
CONFIG_NET_MGMT=y
CONFIG_NET_SOCKETS=y
# end of Networking

# Enable logging
# end of logging

# Needed for sys_rand32_get
CONFIG_ENTROPY_GENERATOR=y
# Enable the DNS resolver
CONFIG_DNS_RESOLVER=y
# Enable the MQTT lib
CONFIG_MQTT_LIB=y
```

Log (colour codes stripped; the "evt->type 2" lines repeat roughly every 0.4 s and are elided):

```
*** Booting Zephyr OS build zephyr-v2.3.0 ***
Hello new board definition on nucleo_f767zi !
[00:00:01.510,000] app_net: starting DHCPv4
[00:00:02.912,000] app_net: dhcpv4_ip_retrieved
[00:00:03.176,000] app_mqtt: MQTT client connected!
[00:00:03.404,000] app_mqtt: SUBACK packet id: 1
[00:00:10.620,000] app_mqtt: evt->type 2
...
[00:00:26.497,000] app_mqtt: evt->type 2
[00:00:26.904,000] eth_stm32_hal: Failed to obtain RX buffer
[00:00:27.207,000] eth_stm32_hal: Failed to obtain RX buffer
[00:00:27.513,000] eth_stm32_hal: Failed to obtain RX buffer
[00:00:27.818,000] eth_stm32_hal: Failed to obtain RX buffer
[00:00:28.539,000] app_mqtt: evt->type 2
[00:00:28.946,000] app_mqtt: evt->type 2
[00:00:29.353,000] app_mqtt: evt->type 2
```

Any idea? I checked and the above commits are in the Zephyr tree.
Original issue description:
I'm playing with the samples/net/sockets/dumb_http_server_mt sample on a nucleo_f429zi board. I have tried several network stack configurations and found out that when I increase NET_TCP_BACKLOG_SIZE (the number of simultaneous incoming TCP connections) to something higher than 6 and try to flood my board with many requests, some tx buffers are lost and never freed.
For network testing I use the wrk benchmarking tool with the command:
wrk -d 20 -t 24 -c 500 --latency http://192.1.1.145:8080
With NET_TCP_BACKLOG_SIZE=10 I have the following log:
When I run out of buffers completely I get a Zephyr usage fault:
Is there any reasonable limit for the backlog size? Does it depend on other settings?
My current environment is v2.1.0-rc3, Linux, zephyr-sdk. The .config is the following: