Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bluetooth: Controller: Fix spurious ISO Sync receiver stall #80775

Conversation

cvinayak
Copy link
Contributor

@cvinayak cvinayak commented Nov 2, 2024

Fix spurious ISO Sync Receiver stall due to uninitialised
value accessed due to regression introduced by
commit cvinayak@64facee ("Bluetooth: controller: Stop Sync ISO
ticker when establishment fails").

Align audio test Controller Kconfig value same as used with nRF53bsim.

Fixes #80734.

Align audio test Controller Kconfig value same as used with
nRF53bsim.

Signed-off-by: Vinayak Kariappa Chettimada <[email protected]>
Fix spurious ISO Sync Receiver stall due to uninitialised
value accessed due to regression introduced by
commit 64facee ("Bluetooth: controller: Stop Sync ISO
ticker when establishment fails").

Signed-off-by: Vinayak Kariappa Chettimada <[email protected]>
@cvinayak cvinayak changed the title tests: bsim: Bluetooth: Align audio test Controller Kconfig Bluetooth: Controller: Fix spurious ISO Sync receiver stall Nov 2, 2024
@cvinayak cvinayak marked this pull request as ready for review November 2, 2024 09:45
Copy link
Member

@aescolar aescolar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you

@aescolar aescolar added this to the v4.0.0 milestone Nov 2, 2024
@aescolar
Copy link
Member

aescolar commented Nov 2, 2024

@dkalowsk @mmahadevan108 It would be very nice to get this for 4.0, otherwise we should disable the failing test case (it would be quite bad to release 4.0 with a failing BT testcase as that would prevent backporting any other BT related fix)

@aescolar aescolar added the bug The issue is a bug, or the PR is fixing a bug label Nov 2, 2024
@Thalley
Copy link
Collaborator

Thalley commented Nov 2, 2024

Will try to test this with my PR that saw a similar problem

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we instead just use the nRF53 BSIM .conf file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will investigate as part of enabling some high reliability CAP tests. Reusing the target conf file is challenging as high reliability test need lot more tx and rx buffers to handle worst case retransmissions.

Trying out fixes and Kconfig changes here: #80788

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a feeling it was related to uninitialzed data, but valgrind didn't show anything for me. How did you find this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had to debug inside the lower link layer to check when the reception stopped, there were calls to ticker_stop and lead to the caller which used this uninitialized value (recently added struct member). Since the value is from inside the Rx buffer node allocated from pool, the overall content get assigned something sometime before when receiving other packets.

@@ -1281,6 +1282,7 @@ static void isr_rx_done(void *param)

/* Calculate and place the drift information in done event */
e->type = EVENT_DONE_EXTRA_TYPE_SYNC_ISO;
e->estab_failed = 0U;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This value isn't assigned after line 1252 - Is that intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, intentional. Only relevant structure members are assigned for each type being passed around.

Controller has large Rx data flow (in struct node_rx) overheads (two parts: fixed overhead of unions, and configurable PDU length payload). Generally, not all members of the overhead part (respective to the type of rx node) are assigned.

When adding new struct member, typically all sources of the rx node type in the data flow need to be inspected :-(

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought i need to investigate further today...#80788 pass locally, but fails in CI :-(

@Thalley
Copy link
Collaborator

Thalley commented Nov 3, 2024

I think this fixes the stalling issue (passes those tests locally), but it will fail other tests from my PR #80691:

======== Running CAP AC_13 with 441_1_1 =========
...
d_01: @00:00:01.392685  [0]: Incoming audio on stream 0x84da700 len 97 and ts 1376271 flags 0x09
d_01: @00:00:01.392685  [0]: Incoming audio on stream 0x84da700 len 97 and ts 1376271
d_01: @00:00:01.392685  [0]: Incoming audio on stream 0x84da744 len 97 and ts 1383748 flags 0x09
d_01: @00:00:01.392685  [0]: Incoming audio on stream 0x84da744 len 97 and ts 1383748
d_01: @00:00:01.400187  [1]: Incoming audio on stream 0x84da700 len 0 and ts 1384434 flags 0x0c
d_01: @00:00:01.400187 ERROR: (WEST_TOPDIR/zephyr/tests/bsim/bluetooth/audio/src/cap_acceptor_test.c:321): ISO receive lost

This is a broadcast test, and the host always enqueues 2 SDUs per stream in the controller.

I think it's OK to merge, as it fixes the issues it claims, but it does not solve all issues. Will spend some more time tomorrow verifying that the host correctly enqueues the number of SDUs it expects.

@dkalowsk
Copy link
Contributor

dkalowsk commented Nov 4, 2024

@dkalowsk @mmahadevan108 It would be very nice to get this for 4.0, otherwise we should disable the failing test case (it would be quite bad to release 4.0 with a failing BT testcase as that would prevent backporting any other BT related fix)

@aescolar thanks for bringing it to attention. It does have the associated Issue to fix so as far as I can tell this will work on RC2 provided maintainers are good with it (and it looks like Thally is)

@cvinayak
Copy link
Contributor Author

cvinayak commented Nov 5, 2024

@aescolar @Thalley any new PR fails CI with different audio tests failing for same commit HEAD with re-runs: https://github.com/zephyrproject-rtos/zephyr/actions/runs/11666290372?pr=80788
May be many tests fail in CI (that do not fail locally or vice versa), but CI stops on first failure of many running in parallel?

Comparing the bsim timestamps between CI and local run (local native posix toolchain), they do not appear same and i suspect timing changes in bsim run on CI too for re-runs.

Audio test need the bsim timing to be deterministic on every run otherwise the Controller will experience overlapping states/roles causing tests to fail.

Example, timing compare here: #80788 (comment)

@cvinayak
Copy link
Contributor Author

cvinayak commented Nov 5, 2024

I have only seen it happen on the nRF52 BSIM, and have not been able to reproduce with the nRF53 BSIM nor on target.

Refer to #80734 (comment)

The conf change in the commit in this PR is to address the inconsistency between nrf52 bsim and nrf53 bsim testing.

@cvinayak
Copy link
Contributor Author

cvinayak commented Nov 5, 2024

@aescolar I do not like that you have independently opened two new PRs (#80894, and ##80896) and marked them as hotfixes while this PR is made to wait for the merge timeout.

@cvinayak
Copy link
Contributor Author

cvinayak commented Nov 6, 2024

Replaced by #80894. Other change to be addressed as part of #80788.

@cvinayak cvinayak closed this Nov 6, 2024
@cvinayak cvinayak removed this from the v4.0.0 milestone Nov 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

tests/bsim/bluetooth/audio/test_scripts/cap_broadcast_ac_12.sh failing in main
5 participants