-
Notifications
You must be signed in to change notification settings - Fork 6.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Possible race condition in TCP connection establishment #44186
Comments
The following
If someone with a bit more expertise on Zephyr's network stack implementation can sanity check my analysis in the description, I can provide this ☝️ as a PR. |
@berendo Looks good, just send a PR in order to do a proper review. |
@carlescufi (and @jukkar) yes, I will do so before end of day today - thank you for your initial sanity-check! |
When connect() is called on a TCP socket, tcp_in() is called with a NULL packet to start establishing a connection. That in turn leads to a SYN packet being produced which, depending on the Ethernet driver, may result in a synchronous transmit of that packet. After that, the connect() implementation, which at this point is executing net_tcp_connect() starts waiting to take a semaphore until the connection timeout is reached. However, if the transmit of the SYN packet results in a RST packet being returned from the connection destination (due to there being no listening socket) very quickly on a local network, the device driver may deliver an interrupt which can cause the receive path of the network stack to run, resulting in the tcp_in() of the RST packet via the network RX thread. That can cause tcp_conn_unref() to be called before the connecting thread has gotten to the point of acquiring (or failing to) the semaphore, which results in a deinitialized semaphore being accessed. This commit fixes the possible race condition by ensuring that the connection lock mutex is held until after the connection state moves to "in connect." Fixes zephyrproject-rtos#44186 Signed-off-by: Berend Ozceri <[email protected]>
@carlescufi (and @jukkar) PR submitted #44384 |
When connect() is called on a TCP socket, tcp_in() is called with a NULL packet to start establishing a connection. That in turn leads to a SYN packet being produced which, depending on the Ethernet driver, may result in a synchronous transmit of that packet. After that, the connect() implementation, which at this point is executing net_tcp_connect() starts waiting to take a semaphore until the connection timeout is reached. However, if the transmit of the SYN packet results in a RST packet being returned from the connection destination (due to there being no listening socket) very quickly on a local network, the device driver may deliver an interrupt which can cause the receive path of the network stack to run, resulting in the tcp_in() of the RST packet via the network RX thread. That can cause tcp_conn_unref() to be called before the connecting thread has gotten to the point of acquiring (or failing to) the semaphore, which results in a deinitialized semaphore being accessed. This commit fixes the possible race condition by ensuring that the connection lock mutex is held until after the connection state moves to "in connect." Fixes #44186 Signed-off-by: Berend Ozceri <[email protected]>
When connect() is called on a TCP socket, tcp_in() is called with a NULL packet to start establishing a connection. That in turn leads to a SYN packet being produced which, depending on the Ethernet driver, may result in a synchronous transmit of that packet. After that, the connect() implementation, which at this point is executing net_tcp_connect() starts waiting to take a semaphore until the connection timeout is reached. However, if the transmit of the SYN packet results in a RST packet being returned from the connection destination (due to there being no listening socket) very quickly on a local network, the device driver may deliver an interrupt which can cause the receive path of the network stack to run, resulting in the tcp_in() of the RST packet via the network RX thread. That can cause tcp_conn_unref() to be called before the connecting thread has gotten to the point of acquiring (or failing to) the semaphore, which results in a deinitialized semaphore being accessed. This commit fixes the possible race condition by ensuring that the connection lock mutex is held until after the connection state moves to "in connect." Fixes zephyrproject-rtos#44186 Signed-off-by: Berend Ozceri <[email protected]>
When connect() is called on a TCP socket, tcp_in() is called with a NULL packet to start establishing a connection. That in turn leads to a SYN packet being produced which, depending on the Ethernet driver, may result in a synchronous transmit of that packet. After that, the connect() implementation, which at this point is executing net_tcp_connect() starts waiting to take a semaphore until the connection timeout is reached. However, if the transmit of the SYN packet results in a RST packet being returned from the connection destination (due to there being no listening socket) very quickly on a local network, the device driver may deliver an interrupt which can cause the receive path of the network stack to run, resulting in the tcp_in() of the RST packet via the network RX thread. That can cause tcp_conn_unref() to be called before the connecting thread has gotten to the point of acquiring (or failing to) the semaphore, which results in a deinitialized semaphore being accessed. This commit fixes the possible race condition by ensuring that the connection lock mutex is held until after the connection state moves to "in connect." Fixes zephyrproject-rtos#44186 Signed-off-by: Berend Ozceri <[email protected]>
Describe the bug
At this point I have not tied together all the conditions that contribute to this suspected issue, but I have a hunch that there may be a race condition during TCP/IP connection establishment depending on network stack and device driver event ordering. When an application calls
connect()
on a TCP/IP socket,tcp_in()
is called with aNULL
packet to start establishing a connection. That in turn will lead to aSYN
packet being produced which, depending on the Ethernet driver, may result in a synchronous transmit of that packet. After that, theconnect()
implementation, which at this point is executingnet_tcp_connect()
starts waiting to take a semaphore until the connection timeout is reached.However, I suspect that if the transmit of the
SYN
packet results in aRST
packet being returned from the connection destination (due to there being no listening socket) very quickly on a local network, the device driver may deliver an interrupt which can cause the receive path of the network stack to run, resulting in thetcp_in()
of theRST
packet via the network RX thread. That can causetcp_conn_unref()
to be called before the application thread has gotten to the point of acquiring (or failing to) the semaphore.tl;dr I suspect that there's a race condition between the point where the application thread calling
net_tcp_connect()
gives up the connection lock mutex (e.g.&conn->lock
) and the subsequent processing of the incomingRST
packet that prematurely the cleans up and frees the TCP connection structure before it is safe to do so from the application thread's point of view. I am able to "hide" this issue by either making sure that there's always a listening socket at the other end (such that aRST
is never sent back during the TCP connection establishment) or by simulating additional latency such that theSYN
->RST
time is longer.To Reproduce
It's difficult to provide exact reproduction steps right now, though I suspect it should be possible to reproduce this with QEMU and/or the POSIX emulation layer.
Expected behavior
There should be no crashes due to attempting to
connect()
to TCP/IP servers where no listening socket is present.Impact
Pretty much a show-stopper or at least a very big detriment to reliability in under unknown network conditions and latencies.
Logs and console output
N/A
Environment (please complete the following information):
N/A
Additional context
N/A
The text was updated successfully, but these errors were encountered: