Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible race condition in TCP connection establishment #44186

Closed
berendo opened this issue Mar 24, 2022 · 5 comments · Fixed by #44384
Closed

Possible race condition in TCP connection establishment #44186

berendo opened this issue Mar 24, 2022 · 5 comments · Fixed by #44384
Assignees
Labels
area: Networking bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug

Comments

@berendo
Copy link
Contributor

berendo commented Mar 24, 2022

Describe the bug
At this point I have not tied together all the conditions that contribute to this suspected issue, but I have a hunch that there may be a race condition during TCP/IP connection establishment depending on network stack and device driver event ordering. When an application calls connect() on a TCP/IP socket, tcp_in() is called with a NULL packet to start establishing a connection. That in turn will lead to a SYN packet being produced which, depending on the Ethernet driver, may result in a synchronous transmit of that packet. After that, the connect() implementation, which at this point is executing net_tcp_connect() starts waiting to take a semaphore until the connection timeout is reached.

However, I suspect that if the transmit of the SYN packet results in a RST packet being returned from the connection destination (due to there being no listening socket) very quickly on a local network, the device driver may deliver an interrupt which can cause the receive path of the network stack to run, resulting in the tcp_in() of the RST packet via the network RX thread. That can cause tcp_conn_unref() to be called before the application thread has gotten to the point of acquiring (or failing to) the semaphore.

tl;dr I suspect that there's a race condition between the point where the application thread calling net_tcp_connect() gives up the connection lock mutex (e.g. &conn->lock) and the subsequent processing of the incoming RST packet that prematurely the cleans up and frees the TCP connection structure before it is safe to do so from the application thread's point of view. I am able to "hide" this issue by either making sure that there's always a listening socket at the other end (such that a RST is never sent back during the TCP connection establishment) or by simulating additional latency such that the SYN->RST time is longer.

To Reproduce
It's difficult to provide exact reproduction steps right now, though I suspect it should be possible to reproduce this with QEMU and/or the POSIX emulation layer.

Expected behavior
There should be no crashes due to attempting to connect() to TCP/IP servers where no listening socket is present.

Impact
Pretty much a show-stopper or at least a very big detriment to reliability in under unknown network conditions and latencies.

Logs and console output
N/A

Environment (please complete the following information):
N/A

Additional context
N/A

@berendo berendo added the bug The issue is a bug, or the PR is fixing a bug label Mar 24, 2022
@berendo
Copy link
Contributor Author

berendo commented Mar 24, 2022

The following diff that sets in_connect to true (if applicable) before the tcp_in() to start transmitting the initial SYN packets appears to fix the issues:

diff --git a/subsys/net/ip/tcp2.c b/subsys/net/ip/tcp2.c
index b2729f2704..03de5b2898 100644
--- a/subsys/net/ip/tcp2.c
+++ b/subsys/net/ip/tcp2.c
@@ -2272,11 +2272,10 @@ int net_tcp_connect(struct net_context *context,
        /* Input of a (nonexistent) packet with no flags set will cause
         * a TCP connection to be established
         */
+       conn->in_connect = !IS_ENABLED(CONFIG_NET_TEST_PROTOCOL);
        tcp_in(conn, NULL);
 
        if (!IS_ENABLED(CONFIG_NET_TEST_PROTOCOL)) {
-               conn->in_connect = true;
-
                if (k_sem_take(&conn->connect_sem, timeout) != 0 &&
                    conn->state != TCP_ESTABLISHED) {
                        conn->in_connect = false;

If someone with a bit more expertise on Zephyr's network stack implementation can sanity check my analysis in the description, I can provide this ☝️ as a PR.

@jukkar
Copy link
Member

jukkar commented Mar 29, 2022

@berendo Looks good, just send a PR in order to do a proper review.

@carlescufi carlescufi added the priority: medium Medium impact/importance bug label Mar 29, 2022
@carlescufi
Copy link
Member

@berendo Looks good, just send a PR in order to do a proper review.

@berendo will you be able to post a Pull Request with the change you suggested?

@berendo
Copy link
Contributor Author

berendo commented Mar 30, 2022

@carlescufi (and @jukkar) yes, I will do so before end of day today - thank you for your initial sanity-check!

berendo added a commit to recogni/zephyr that referenced this issue Mar 30, 2022
When connect() is called on a TCP socket, tcp_in() is called with a NULL
packet to start establishing a connection. That in turn leads to a SYN
packet being produced which, depending on the Ethernet driver, may
result in a synchronous transmit of that packet. After that, the
connect() implementation, which at this point is executing
net_tcp_connect() starts waiting to take a semaphore until the
connection timeout is reached. However, if the transmit of the SYN
packet results in a RST packet being returned from the connection
destination (due to there being no listening socket) very quickly on a
local network, the device driver may deliver an interrupt which can
cause the receive path of the network stack to run, resulting in the
tcp_in() of the RST packet via the network RX thread. That can cause
tcp_conn_unref() to be called before the connecting thread has gotten
to the point of acquiring (or failing to) the semaphore, which results
in a deinitialized semaphore being accessed.

This commit fixes the possible race condition by ensuring that the
connection lock mutex is held until after the connection state moves
to "in connect."

Fixes zephyrproject-rtos#44186

Signed-off-by: Berend Ozceri <[email protected]>
@berendo
Copy link
Contributor Author

berendo commented Mar 30, 2022

@carlescufi (and @jukkar) PR submitted #44384

MaureenHelm pushed a commit that referenced this issue Mar 31, 2022
When connect() is called on a TCP socket, tcp_in() is called with a NULL
packet to start establishing a connection. That in turn leads to a SYN
packet being produced which, depending on the Ethernet driver, may
result in a synchronous transmit of that packet. After that, the
connect() implementation, which at this point is executing
net_tcp_connect() starts waiting to take a semaphore until the
connection timeout is reached. However, if the transmit of the SYN
packet results in a RST packet being returned from the connection
destination (due to there being no listening socket) very quickly on a
local network, the device driver may deliver an interrupt which can
cause the receive path of the network stack to run, resulting in the
tcp_in() of the RST packet via the network RX thread. That can cause
tcp_conn_unref() to be called before the connecting thread has gotten
to the point of acquiring (or failing to) the semaphore, which results
in a deinitialized semaphore being accessed.

This commit fixes the possible race condition by ensuring that the
connection lock mutex is held until after the connection state moves
to "in connect."

Fixes #44186

Signed-off-by: Berend Ozceri <[email protected]>
berendo added a commit to recogni/zephyr that referenced this issue Apr 1, 2022
When connect() is called on a TCP socket, tcp_in() is called with a NULL
packet to start establishing a connection. That in turn leads to a SYN
packet being produced which, depending on the Ethernet driver, may
result in a synchronous transmit of that packet. After that, the
connect() implementation, which at this point is executing
net_tcp_connect() starts waiting to take a semaphore until the
connection timeout is reached. However, if the transmit of the SYN
packet results in a RST packet being returned from the connection
destination (due to there being no listening socket) very quickly on a
local network, the device driver may deliver an interrupt which can
cause the receive path of the network stack to run, resulting in the
tcp_in() of the RST packet via the network RX thread. That can cause
tcp_conn_unref() to be called before the connecting thread has gotten
to the point of acquiring (or failing to) the semaphore, which results
in a deinitialized semaphore being accessed.

This commit fixes the possible race condition by ensuring that the
connection lock mutex is held until after the connection state moves
to "in connect."

Fixes zephyrproject-rtos#44186

Signed-off-by: Berend Ozceri <[email protected]>
berendo added a commit to recogni/zephyr that referenced this issue Apr 1, 2022
When connect() is called on a TCP socket, tcp_in() is called with a NULL
packet to start establishing a connection. That in turn leads to a SYN
packet being produced which, depending on the Ethernet driver, may
result in a synchronous transmit of that packet. After that, the
connect() implementation, which at this point is executing
net_tcp_connect() starts waiting to take a semaphore until the
connection timeout is reached. However, if the transmit of the SYN
packet results in a RST packet being returned from the connection
destination (due to there being no listening socket) very quickly on a
local network, the device driver may deliver an interrupt which can
cause the receive path of the network stack to run, resulting in the
tcp_in() of the RST packet via the network RX thread. That can cause
tcp_conn_unref() to be called before the connecting thread has gotten
to the point of acquiring (or failing to) the semaphore, which results
in a deinitialized semaphore being accessed.

This commit fixes the possible race condition by ensuring that the
connection lock mutex is held until after the connection state moves
to "in connect."

Fixes zephyrproject-rtos#44186

Signed-off-by: Berend Ozceri <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Networking bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants