Bluetooth: L2CAP: Deadlock when there are no free buffers while transmitting on multiple channels #34600

KatrineNordic · 2021-04-27T12:41:29Z

Describe the bug
Transmitting many large SDUs as fast as possible on multiple L2CAP channels quickly consumes a large number of buffers. When no more segments can be queued for transmission, either because there are no free buffers or because the channel has run out of credits, queuing of segments stops. Queuing resumes whenever a previously queued segment has been transmitted or more credits are received. If there are still no free buffers at that point, queuing stops again right away. If this happens after all segments previously queued on the channel have already been transmitted, and there are no more credits to receive, nothing remains to restart queuing, so it stops indefinitely.

Expected behavior
Queuing of segments continues when there are free buffers available.

Impact
Annoyance.

Environment (please complete the following information):

OS: Linux
Toolchain (e.g Zephyr SDK, ...)
nrfconnect/sdk-zephyr@6a1d340

Additional context
It is possible to work around this by having a high enough number of buffers (CONFIG_BT_L2CAP_TX_BUF_COUNT or CONFIG_BT_CONN_TX_MAX) that there are always buffers available when queuing of segments is restarted.

carlescufi · 2021-04-27T17:53:21Z

@KatrineNordic which Zephyr (upstream, not NCS/sdk-zephyr) revision did you test this with?

carlescufi · 2021-05-17T15:55:04Z

This seems to be an issue upstream, as per a discussion with @KatrineNordic and @joerchan

carlescufi · 2021-05-20T09:50:40Z

Downgrading to low since:

This is only an issue when using multiple L2CAP Connection-Oriented Channels (a rare use case in practice)
There is a workaround (increase CONFIG_BT_L2CAP_TX_BUF_COUNT)

carlescufi · 2021-06-17T11:11:29Z

@Vudentz FYI, in case you have additional comments to this.

carlescufi · 2021-07-15T11:10:19Z

@Vudentz ping on this one, would be good to get your input.

alwa-nordic · 2021-09-30T14:31:48Z

@KatrineNordic, it sounds like you've already done some digging here to find the cause. Do you have any suspicions on where in the code the bug is? Where does this queuing happen?

github-actions · 2021-11-30T00:29:25Z

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

alwa-nordic · 2022-02-08T15:20:42Z

This issue likely still exists. I will try to reproduce.

alwa-nordic · 2022-02-08T16:11:44Z

I have discussed this with @KatrineNordic. The issue is a missing trigger for l2cap_chan_tx_resume(chan). l2cap_chan_tx_resume(chan) causes packets to be moved off the channel TX queue and onto the connection TX queue.

l2cap_chan_tx_resume(chan) is triggered on:

queuing of an SDU on chan,
reception of credits for chan,
transmission of a segment from chan.

It is possible to end up in a situation where none of those triggers will happen, but there remain segments to be sent. Note that the segment sent trigger will not happen if there was no room to queue any segments from chan onto the connection queue.

This issue can be fixed by adding a trigger for when room becomes available on the connection TX queue. I believe we can make chan.tx_work, which runs l2cap_chan_tx_resume(chan), poll on the connection TX queue.
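The missing trigger described above can be illustrated with a small, self-contained toy model in C. It is only a sketch under stated assumptions: `toy_host`, `toy_chan`, `toy_resume`, `toy_segment_sent` and `toy_run` are hypothetical names, not Zephyr APIs, and the `wake_all` flag stands in for the proposed "room became available" trigger rather than for the fix that was eventually merged.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NUM_CHANS 2

/* Hypothetical per-channel state; not the Zephyr bt_l2cap_le_chan layout. */
struct toy_chan {
	int pending;   /* segments still waiting on the channel TX queue */
	int credits;   /* credits already granted by the peer */
	int in_flight; /* segments sitting on the connection TX queue */
};

/* Hypothetical host state with one buffer pool shared by all channels. */
struct toy_host {
	int free_bufs;
	struct toy_chan chans[TOY_NUM_CHANS];
};

/* Toy stand-in for l2cap_chan_tx_resume(): queue segments while possible. */
static void toy_resume(struct toy_host *h, struct toy_chan *c)
{
	while (c->pending > 0 && c->credits > 0 && h->free_bufs > 0) {
		c->pending--;
		c->credits--;
		c->in_flight++;
		h->free_bufs--;
	}
}

/* A segment from c was sent, so its buffer goes back to the pool. */
static void toy_segment_sent(struct toy_host *h, struct toy_chan *c,
			     bool wake_all)
{
	assert(c->in_flight > 0);
	c->in_flight--;
	h->free_bufs++;

	if (wake_all) {
		/* Assumed extra trigger: room became available, wake everyone. */
		for (int i = 0; i < TOY_NUM_CHANS; i++) {
			toy_resume(h, &h->chans[i]);
		}
	} else {
		/* Existing trigger: only the channel that just sent resumes. */
		toy_resume(h, c);
	}
}

/* Returns the number of segments left stranded once nothing is in flight. */
int toy_run(int bufs, int segs_per_chan, bool wake_all)
{
	struct toy_host h = { .free_bufs = bufs };
	int stranded = 0;
	bool progress = true;

	for (int i = 0; i < TOY_NUM_CHANS; i++) {
		h.chans[i].pending = segs_per_chan;
		h.chans[i].credits = segs_per_chan; /* no more credits will come */
	}

	/* Trigger: an SDU was queued on each channel. */
	for (int i = 0; i < TOY_NUM_CHANS; i++) {
		toy_resume(&h, &h.chans[i]);
	}

	/* Complete in-flight segments until no channel has any left. */
	while (progress) {
		progress = false;
		for (int i = 0; i < TOY_NUM_CHANS; i++) {
			if (h.chans[i].in_flight > 0) {
				toy_segment_sent(&h, &h.chans[i], wake_all);
				progress = true;
			}
		}
	}

	for (int i = 0; i < TOY_NUM_CHANS; i++) {
		stranded += h.chans[i].pending;
	}
	return stranded;
}
```

With one shared buffer and three segments per channel, `toy_run(1, 3, false)` leaves the second channel's three segments stranded even though the buffer is free again, while `toy_run(1, 3, true)` drains both channels. `toy_run(6, 3, false)` also drains, which mirrors the workaround of raising the buffer count.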

alwa-nordic · 2022-02-11T14:20:15Z

I believe this is a corner case of issue #20640. The fix for that issue looks to be based on the belief that if at least one segment is queued, the next will be able to queue in the place of the previous. I don't have a complete overview of why it does not work, but I suspect that there is a race for the queue between the segment-sent callback and the application queuing more.

This commit is the partial fix that relies on that assumption: c654bcf

cvinayak · 2022-04-27T11:18:00Z

@alwa-nordic is following up on possible solutions.

jori-nordic · 2022-08-31T08:11:44Z

Working on it, managed to reproduce -a- deadlock (not sure if it's that specific one yet, but does look like it) on https://github.com/jori-nordic/zephyr/tree/l2cap-deadlock

This test reproduces more-or-less zephyrproject-rtos#34600. It has a central that connects to multiple peripherals, opens one l2cap CoC channel per connection, and transmits a few SDUs largely exceeding the MPS of the channel. In this commit, the test doesn't pass, but when it passes (after the subsequent commits), error and warning messages are expected from the stack, as this is not the happy path. We can later debate on whether these particular error messages should be downgraded to debug. Signed-off-by: Jonathan Rico <[email protected]>
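To give a sense of why SDUs that largely exceed the MPS put pressure on buffers and credits, the number of segments per SDU can be estimated as follows. This is a rough sketch, assuming every segment is filled up to the MPS and that the 2-octet SDU length field travels in the first segment; `toy_sdu_segment_count` is a hypothetical helper, not a Zephyr function.

```c
#include <assert.h>
#include <stddef.h>

/* Size of the SDU length field carried in the first segment. */
#define TOY_SDU_LEN_FIELD 2U

/*
 * Hypothetical helper: number of segments (and therefore credits and
 * buffers) one SDU needs, assuming each segment is filled up to mps.
 */
size_t toy_sdu_segment_count(size_t sdu_len, size_t mps)
{
	size_t total = sdu_len + TOY_SDU_LEN_FIELD;

	assert(mps > 0U);

	/* Integer ceiling of total / mps. */
	return (total + mps - 1U) / mps;
}
```

Under these assumptions, a 1000-octet SDU on a channel with an MPS of 100 needs 11 segments, so a handful of such SDUs on several channels can exhaust a small buffer pool.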

KatrineNordic added the bug label Apr 27, 2021

galak added the area: Bluetooth and priority: low labels Apr 27, 2021

ioannisg assigned carlescufi Apr 27, 2021

ioannisg added the priority: medium label and removed the priority: low label Apr 27, 2021

carlescufi assigned joerchan Apr 27, 2021

carlescufi added the priority: low label and removed the priority: medium label May 20, 2021

carlescufi added this to the v2.7.0 milestone May 27, 2021

carlescufi assigned Vudentz Jun 17, 2021

carlescufi assigned alwa-nordic and unassigned joerchan, Vudentz and carlescufi Sep 9, 2021

prathje mentioned this issue Sep 23, 2021

Bluetooth: Deadlock with TX of ACL data and HCI commands (command blocked by data) #25917

Closed

cfriedt removed this from the v2.7.0 milestone Sep 28, 2021

github-actions bot added the Stale label Nov 30, 2021

github-actions bot closed this as completed Dec 15, 2021

alwa-nordic added the area: Bluetooth Host label Feb 8, 2022

alwa-nordic reopened this Feb 8, 2022

alwa-nordic removed the Stale label Feb 8, 2022

github-actions bot added the Stale label Jun 27, 2022

zephyrproject-rtos deleted a comment from github-actions bot Jun 27, 2022

cvinayak removed the Stale label Jun 27, 2022

carlescufi assigned jori-nordic Jul 15, 2022

jori-nordic unassigned alwa-nordic Jul 28, 2022

jori-nordic mentioned this issue Aug 14, 2022

MCUMGR_SMP_BT: system workqueue blocked during execution of shell commands #46347

Closed

carlescufi added this to the future milestone Aug 18, 2022

jori-nordic mentioned this issue Sep 21, 2022

Bluetooth: host: l2cap deadlock fix #50476

Merged

carlescufi modified the milestones: future, v3.2.0 Sep 22, 2022

carlescufi closed this as completed in #50476 Sep 26, 2022
