Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wifi: esp_at: race condition on mutex's leading to deadlock #43470

Closed
JordanYates opened this issue Mar 7, 2022 · 0 comments · Fixed by #43522
Closed

wifi: esp_at: race condition on mutex's leading to deadlock #43470

JordanYates opened this issue Mar 7, 2022 · 0 comments · Fixed by #43522
Assignees
Labels
area: Sockets Networking sockets area: Wi-Fi Wi-Fi bug The issue is a bug, or the PR is fixing a bug LTS Long term release branch related platform: ESP32 Espressif ESP32 priority: low Low impact/importance bug

Comments

@JordanYates
Copy link
Collaborator

JordanYates commented Mar 7, 2022

Describe the bug

The send and receive code paths for Zephyr sockets when using the ESP AT command WiFi driver claim the same two mutex's in different orders. This leads to a permanent deadlock when you get unlucky with packet timings.

When receiving

The ESP RX thread claims the socket mutex before calling into the Zephyr socket library.

k_mutex_lock(&sock->lock, K_FOREVER);
if (sock->recv_cb) {
sock->recv_cb(sock->context, pkt, NULL, NULL,

zsock_received_cb then claims the net_context mutex.

static void zsock_received_cb(struct net_context *ctx,
struct net_pkt *pkt,
union net_ip_header *ip_hdr,
union net_proto_header *proto_hdr,
int status,
void *user_data)
{
if (ctx->cond.lock) {
(void)k_mutex_lock(ctx->cond.lock, K_FOREVER);

Screen Shot 2022-03-07 at 2 43 52 pm

When transmitting

zsock_send_message claims the net_context mutex (inside net_context_sendmsg).

int net_context_sendmsg(struct net_context *context,
const struct msghdr *msghdr,
int flags,
net_context_send_cb_t cb,
k_timeout_t timeout,
void *user_data)
{
int ret;
k_mutex_lock(&context->lock, K_FOREVER);
ret = context_sendto(context, msghdr, 0, NULL, 0,

esp_socket_queue_tx then claims the socket mutex.

static inline int esp_socket_queue_tx(struct esp_socket *sock,
struct net_pkt *pkt)
{
int ret = -EBUSY;
k_mutex_lock(&sock->lock, K_FOREVER);

Screen Shot 2022-03-07 at 2 44 15 pm

Expected behavior
Mutex's should be claimed in the same order in all code paths to avoid deadlock.

Impact
WiFi driver is permanently unusable until the application is power cycled.

Environment (please complete the following information):

  • Zephyr v2.7.1
  • Zephyr v3.0.0
@JordanYates JordanYates added bug The issue is a bug, or the PR is fixing a bug area: Sockets Networking sockets area: Wi-Fi Wi-Fi LTS Long term release branch related labels Mar 7, 2022
@JordanYates JordanYates changed the title wifi: esp_at: mutex inversion leading to deadlock wifi: esp_at: race condition on mutex's leading to deadlock Mar 7, 2022
@mbolivar-nordic mbolivar-nordic added platform: ESP32 Espressif ESP32 priority: low Low impact/importance bug labels Mar 8, 2022
carlescufi pushed a commit that referenced this issue Mar 15, 2022
Claim the net_context mutext associated with a socket before claiming
the socket mutex. The receive callback claims the net_context mutex
internally, which will now always succeed immediately.

The TX path claims the net_context mutex before the socket mutex, and if
we don't use the same order, we can end up in a deadlock.

Fixes #43470.

Signed-off-by: Jordan Yates <[email protected]>
zephyrbot pushed a commit that referenced this issue Mar 21, 2022
Claim the net_context mutext associated with a socket before claiming
the socket mutex. The receive callback claims the net_context mutex
internally, which will now always succeed immediately.

The TX path claims the net_context mutex before the socket mutex, and if
we don't use the same order, we can end up in a deadlock.

Fixes #43470.

Signed-off-by: Jordan Yates <[email protected]>
cfriedt pushed a commit that referenced this issue Mar 23, 2022
Claim the net_context mutext associated with a socket before claiming
the socket mutex. The receive callback claims the net_context mutex
internally, which will now always succeed immediately.

The TX path claims the net_context mutex before the socket mutex, and if
we don't use the same order, we can end up in a deadlock.

Fixes #43470.

Signed-off-by: Jordan Yates <[email protected]>
ycsin pushed a commit to G-Technologies-Sdn-Bhd/zephyr that referenced this issue Aug 17, 2022
Claim the net_context mutext associated with a socket before claiming
the socket mutex. The receive callback claims the net_context mutex
internally, which will now always succeed immediately.

The TX path claims the net_context mutex before the socket mutex, and if
we don't use the same order, we can end up in a deadlock.

Fixes zephyrproject-rtos#43470.

Signed-off-by: Jordan Yates <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Sockets Networking sockets area: Wi-Fi Wi-Fi bug The issue is a bug, or the PR is fixing a bug LTS Long term release branch related platform: ESP32 Espressif ESP32 priority: low Low impact/importance bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants