Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bluetooth: L2CAP: Deadlock when there are no free buffers while transmitting on multiple channels #34600

Closed
KatrineNordic opened this issue Apr 27, 2021 · 12 comments · Fixed by #50476
Assignees
Labels
area: Bluetooth Host Bluetooth Host (excluding BR/EDR) area: Bluetooth bug The issue is a bug, or the PR is fixing a bug priority: low Low impact/importance bug
Milestone

Comments

@KatrineNordic
Copy link

Describe the bug
Transmitting many large SDUs as quickly as possible on multiple L2CAP channels quickly fills a lot of buffers. When no more segments can be queued to be transmitted because there are no more free buffers, or there are no more credits for the channel, queuing of segments is stopped. Whenever a previously queued segment has been transmitted, or more credits are received, segments are again being queued for transmission. If there are still no free buffers, queuing of segments is stopped again right away. If this happens when all segments that have previously been queued on the channel have already been transmitted, and there are no more credits to receive, queuing of segments is stopped indefinitely.

Expected behavior
Queuing of segments continues when there are free buffers available.

Impact
Annoyance.

Environment (please complete the following information):

Additional context
It is possible to work around this by having a high enough number of buffers (CONFIG_BT_L2CAP_TX_BUF_COUNT or CONFIG_BT_CONN_TX_MAX) that there are always buffers available when queuing of segments is restarted.

@KatrineNordic KatrineNordic added the bug The issue is a bug, or the PR is fixing a bug label Apr 27, 2021
@galak galak added area: Bluetooth priority: low Low impact/importance bug labels Apr 27, 2021
@ioannisg ioannisg added priority: medium Medium impact/importance bug and removed priority: low Low impact/importance bug labels Apr 27, 2021
@carlescufi
Copy link
Member

@KatrineNordic which Nordic (upstream, not NCS/sdk-zephyr) revision did you test this with?

@carlescufi
Copy link
Member

@KatrineNordic which Nordic (upstream, not NCS/sdk-zephyr) revision did you test this with?

This seems to be an issue upstream, as per a discussion with @KatrineNordic and @joerchan

@carlescufi carlescufi added priority: low Low impact/importance bug and removed priority: medium Medium impact/importance bug labels May 20, 2021
@carlescufi
Copy link
Member

Downgrading to low since:

  • This is only an issue when using multiple L2CAP Connection-Oriented Channels (a rare use case in practice)
  • There is a workaround (increase CONFIG_BT_L2CAP_TX_BUF_COUNT)

@carlescufi carlescufi added this to the v2.7.0 milestone May 27, 2021
@carlescufi
Copy link
Member

@Vudentz FYI, in case you have additional comments to this.

@carlescufi
Copy link
Member

@Vudentz FYI, in case you have additional comments to this.

@Vudentz ping on this one, would be good to get your input.

@alwa-nordic
Copy link
Collaborator

@KatrineNordic, it sounds like you've already done some digging here to find the cause. Do you have any suspicions on where in the code the bug is? Where does this queuing happen?

@github-actions
Copy link

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

@github-actions github-actions bot added the Stale label Nov 30, 2021
@alwa-nordic alwa-nordic added the area: Bluetooth Host Bluetooth Host (excluding BR/EDR) label Feb 8, 2022
@alwa-nordic
Copy link
Collaborator

This issue likely still exists. I will try to reproduce.

@alwa-nordic alwa-nordic reopened this Feb 8, 2022
@alwa-nordic
Copy link
Collaborator

I have discussed this with @KatrineNordic. The issue is is a missing trigger for l2cap_chan_tx_resume(chan). l2cap_chan_tx_resume(chan) causes packets to be queued off the channel TX queue and onto the connection TX queue.

l2cap_chan_tx_resume(chan) is triggered on

  • queuing a SDU on chan.
  • reception of credits for chan.
  • when a segment from chan has been sent.

It is possible to end up in a situation where none of those triggers will happen, but there remain segments to be sent. Note that the segment sent trigger will not happen if there was no room to queue any segments from chan onto the connection queue.

This issue can be fixed by a trigger for when room becomes available on the connection TX queue. I believe we can make chan.tx_work, which runs l2cap_chan_tx_resume(chan), poll on the connection TX queue.

@alwa-nordic alwa-nordic removed the Stale label Feb 8, 2022
@alwa-nordic
Copy link
Collaborator

alwa-nordic commented Feb 11, 2022

I believe this is corner case of issue #20640. The fix for that issue looks to be based on the belief that if at least one segment is queued, the next will be able to queue in the place of the previous. I don't have a complete overview over why it does not work, but I suspect that there is a race for the queue between the segment-sent-callback and the application queuing more.

This commit is the partial fix with the assumption: c654bcf

@cvinayak
Copy link
Contributor

@alwa-nordic is following up on possible solutions.

@jori-nordic
Copy link
Collaborator

Working on it, managed to reproduce -a- deadlock (not sure if it's that specific one yet, but does look like it) on https://github.com/jori-nordic/zephyr/tree/l2cap-deadlock

@carlescufi carlescufi modified the milestones: future, v3.2.0 Sep 22, 2022
jori-nordic added a commit to jori-nordic/zephyr that referenced this issue Sep 23, 2022
This test reproduces more-or-less zephyrproject-rtos#34600.

It has a central that connects to multiple peripherals, opens one l2cap CoC
channel per connection, and transmits a few SDUs largely exceeding the MPS
of the channel.

In this commit, the test doesn't pass, but when it passes (after the
subsequent commits), error and warning messages are expected from the
stack, as this is not the happy path.

We can later debate on whether these particular error messages should be
downgraded to debug.

Signed-off-by: Jonathan Rico <[email protected]>
carlescufi pushed a commit that referenced this issue Sep 26, 2022
This test reproduces more-or-less #34600.

It has a central that connects to multiple peripherals, opens one l2cap CoC
channel per connection, and transmits a few SDUs largely exceeding the MPS
of the channel.

In this commit, the test doesn't pass, but when it passes (after the
subsequent commits), error and warning messages are expected from the
stack, as this is not the happy path.

We can later debate on whether these particular error messages should be
downgraded to debug.

Signed-off-by: Jonathan Rico <[email protected]>
theob-pro pushed a commit that referenced this issue Jun 5, 2023
This test reproduces more-or-less #34600.

It has a central that connects to multiple peripherals, opens one l2cap
CoC channel per connection, and transmits a few SDUs largely exceeding
the MPS of the channel.

In this commit, the test doesn't pass, but when it passes (after the
subsequent commits), error and warning messages are expected from the
stack, as this is not the happy path.

We can later debate on whether these particular error messages should
be downgraded to debug.

Signed-off-by: Jonathan Rico <[email protected]>
(cherry picked from commit 7a6872d)
cfriedt pushed a commit that referenced this issue Jun 6, 2023
This test reproduces more-or-less #34600.

It has a central that connects to multiple peripherals, opens one l2cap
CoC channel per connection, and transmits a few SDUs largely exceeding
the MPS of the channel.

In this commit, the test doesn't pass, but when it passes (after the
subsequent commits), error and warning messages are expected from the
stack, as this is not the happy path.

We can later debate on whether these particular error messages should
be downgraded to debug.

Signed-off-by: Jonathan Rico <[email protected]>
(cherry picked from commit 7a6872d)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Bluetooth Host Bluetooth Host (excluding BR/EDR) area: Bluetooth bug The issue is a bug, or the PR is fixing a bug priority: low Low impact/importance bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants