Possible race condition in TCP connection establishment #44186

berendo · 2022-03-24T17:36:29Z

Describe the bug
At this point I have not tied together all the conditions that contribute to this suspected issue, but I have a hunch that there may be a race condition during TCP/IP connection establishment depending on network stack and device driver event ordering. When an application calls connect() on a TCP/IP socket, tcp_in() is called with a NULL packet to start establishing a connection. That in turn will lead to a SYN packet being produced which, depending on the Ethernet driver, may result in a synchronous transmit of that packet. After that, the connect() implementation, which at this point is executing net_tcp_connect() starts waiting to take a semaphore until the connection timeout is reached.

However, I suspect that if the transmit of the SYN packet results in a RST packet being returned from the connection destination (due to there being no listening socket) very quickly on a local network, the device driver may deliver an interrupt which can cause the receive path of the network stack to run, resulting in the tcp_in() of the RST packet via the network RX thread. That can cause tcp_conn_unref() to be called before the application thread has gotten to the point of acquiring (or failing to) the semaphore.

tl;dr I suspect that there's a race condition between the point where the application thread calling net_tcp_connect() gives up the connection lock mutex (e.g. &conn->lock) and the subsequent processing of the incoming RST packet that prematurely the cleans up and frees the TCP connection structure before it is safe to do so from the application thread's point of view. I am able to "hide" this issue by either making sure that there's always a listening socket at the other end (such that a RST is never sent back during the TCP connection establishment) or by simulating additional latency such that the SYN->RST time is longer.

To Reproduce
It's difficult to provide exact reproduction steps right now, though I suspect it should be possible to reproduce this with QEMU and/or the POSIX emulation layer.

Expected behavior
There should be no crashes due to attempting to connect() to TCP/IP servers where no listening socket is present.

Impact
Pretty much a show-stopper or at least a very big detriment to reliability in under unknown network conditions and latencies.

Logs and console output
N/A

Environment (please complete the following information):
N/A

Additional context
N/A

The text was updated successfully, but these errors were encountered:

berendo · 2022-03-24T19:14:07Z

The following diff that sets in_connect to true (if applicable) before the tcp_in() to start transmitting the initial SYN packets appears to fix the issues:

diff --git a/subsys/net/ip/tcp2.c b/subsys/net/ip/tcp2.c
index b2729f2704..03de5b2898 100644
--- a/subsys/net/ip/tcp2.c
+++ b/subsys/net/ip/tcp2.c
@@ -2272,11 +2272,10 @@ int net_tcp_connect(struct net_context *context,
        /* Input of a (nonexistent) packet with no flags set will cause
         * a TCP connection to be established
         */
+       conn->in_connect = !IS_ENABLED(CONFIG_NET_TEST_PROTOCOL);
        tcp_in(conn, NULL);
 
        if (!IS_ENABLED(CONFIG_NET_TEST_PROTOCOL)) {
-               conn->in_connect = true;
-
                if (k_sem_take(&conn->connect_sem, timeout) != 0 &&
                    conn->state != TCP_ESTABLISHED) {
                        conn->in_connect = false;

If someone with a bit more expertise on Zephyr's network stack implementation can sanity check my analysis in the description, I can provide this ☝️ as a PR.

jukkar · 2022-03-29T06:27:48Z

@berendo Looks good, just send a PR in order to do a proper review.

carlescufi · 2022-03-30T09:28:08Z

@berendo Looks good, just send a PR in order to do a proper review.

@berendo will you be able to post a Pull Request with the change you suggested?

berendo · 2022-03-30T20:41:26Z

@carlescufi (and @jukkar) yes, I will do so before end of day today - thank you for your initial sanity-check!

When connect() is called on a TCP socket, tcp_in() is called with a NULL packet to start establishing a connection. That in turn leads to a SYN packet being produced which, depending on the Ethernet driver, may result in a synchronous transmit of that packet. After that, the connect() implementation, which at this point is executing net_tcp_connect() starts waiting to take a semaphore until the connection timeout is reached. However, if the transmit of the SYN packet results in a RST packet being returned from the connection destination (due to there being no listening socket) very quickly on a local network, the device driver may deliver an interrupt which can cause the receive path of the network stack to run, resulting in the tcp_in() of the RST packet via the network RX thread. That can cause tcp_conn_unref() to be called before the connecting thread has gotten to the point of acquiring (or failing to) the semaphore, which results in a deinitialized semaphore being accessed. This commit fixes the possible race condition by ensuring that the connection lock mutex is held until after the connection state moves to "in connect." Fixes zephyrproject-rtos#44186 Signed-off-by: Berend Ozceri <[email protected]>

berendo · 2022-03-30T22:29:45Z

@carlescufi (and @jukkar) PR submitted #44384

When connect() is called on a TCP socket, tcp_in() is called with a NULL packet to start establishing a connection. That in turn leads to a SYN packet being produced which, depending on the Ethernet driver, may result in a synchronous transmit of that packet. After that, the connect() implementation, which at this point is executing net_tcp_connect() starts waiting to take a semaphore until the connection timeout is reached. However, if the transmit of the SYN packet results in a RST packet being returned from the connection destination (due to there being no listening socket) very quickly on a local network, the device driver may deliver an interrupt which can cause the receive path of the network stack to run, resulting in the tcp_in() of the RST packet via the network RX thread. That can cause tcp_conn_unref() to be called before the connecting thread has gotten to the point of acquiring (or failing to) the semaphore, which results in a deinitialized semaphore being accessed. This commit fixes the possible race condition by ensuring that the connection lock mutex is held until after the connection state moves to "in connect." Fixes #44186 Signed-off-by: Berend Ozceri <[email protected]>

When connect() is called on a TCP socket, tcp_in() is called with a NULL packet to start establishing a connection. That in turn leads to a SYN packet being produced which, depending on the Ethernet driver, may result in a synchronous transmit of that packet. After that, the connect() implementation, which at this point is executing net_tcp_connect() starts waiting to take a semaphore until the connection timeout is reached. However, if the transmit of the SYN packet results in a RST packet being returned from the connection destination (due to there being no listening socket) very quickly on a local network, the device driver may deliver an interrupt which can cause the receive path of the network stack to run, resulting in the tcp_in() of the RST packet via the network RX thread. That can cause tcp_conn_unref() to be called before the connecting thread has gotten to the point of acquiring (or failing to) the semaphore, which results in a deinitialized semaphore being accessed. This commit fixes the possible race condition by ensuring that the connection lock mutex is held until after the connection state moves to "in connect." Fixes zephyrproject-rtos#44186 Signed-off-by: Berend Ozceri <[email protected]>

berendo added the bug The issue is a bug, or the PR is fixing a bug label Mar 24, 2022

jukkar assigned rlubos Mar 29, 2022

jukkar added the area: Networking label Mar 29, 2022

carlescufi added the priority: medium Medium impact/importance bug label Mar 29, 2022

berendo mentioned this issue Mar 30, 2022

net: tcp: Fix possible race condition in connection establishment #44384

Merged

MaureenHelm closed this as completed in #44384 Mar 31, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Possible race condition in TCP connection establishment #44186

Possible race condition in TCP connection establishment #44186

berendo commented Mar 24, 2022 •

edited

Loading

berendo commented Mar 24, 2022 •

edited

Loading

jukkar commented Mar 29, 2022

carlescufi commented Mar 30, 2022

berendo commented Mar 30, 2022

berendo commented Mar 30, 2022

Possible race condition in TCP connection establishment #44186

Possible race condition in TCP connection establishment #44186

Comments

berendo commented Mar 24, 2022 • edited Loading

berendo commented Mar 24, 2022 • edited Loading

jukkar commented Mar 29, 2022

carlescufi commented Mar 30, 2022

berendo commented Mar 30, 2022

berendo commented Mar 30, 2022

berendo commented Mar 24, 2022 •

edited

Loading

berendo commented Mar 24, 2022 •

edited

Loading