Bluetooth Ext Adv:Sync: While simultaneous advertiser are working, and skip is non-zero, sync terminates repeatedly #42518
Comments
Could you let me know if all 10 advertisers were powered up at the same time (do all the periodic advertising intervals overlap on every interval)? In such cases, synchronizing with a periodic advertiser within 6 intervals may run into problems. Could you take an air trace with a sniffer to confirm whether there are errors due to collisions among these 10 periodic advertisers?
Do you have 20 advertisements in total from 10 nRF52833 DKs, with one Extended and one Legacy advertising instance on each of the 10 DKs? What about the periodic advertising: what parameters are used?
It would be best if a draft PR with samples to reproduce the issue were made available for debugging. |
Thanks @cvinayak for the reply.
To rule this out, I powered on the transmitters at random times, but it didn't change the output.
Sorry, I used incorrect terms; the correct ones are ext_adv_int=80ms and per_adv_int=150ms. To make it possible for everyone interested to test this issue, I programmed direction_finding_connectionless_tx into the transmitters. Only some minor changes were made: the periodic interval was increased to (BT_GAP_PER_ADV_FAST_INT_MIN_1, BT_GAP_PER_ADV_FAST_INT_MAX_1), and the advertising train is not sent.
It was not possible to share the original source code in which I found the problem, so I developed a similar app based on the direction_finding_connectionless_rx sample, which is now capable of syncing with multiple tags in parallel. Here you can find my source code: direction_finding_connectionless_rx_multiple. Please note that some print lines are commented out to focus only on the sync issue. The above source code, in a while(true) loop, monitors the current state of the system and then makes the proper decision:
To build and program the project:
Results: In this scenario it works normally. You can see the output log here. To make it easier to understand, I've added these letters at the start of each print line, as you can find in the source code:
The above output is the result of running the system for about 1 hour. After a while, the sync with three tags is terminated and then re-established without any problem. Scenario B) When skip was set to zero:
The above log was generated in only 5 minutes. Finally, it failed after a while
|
@saleh-unikie, thanks for the reply. Unfortunately, I only have 2 nRF5340 DKs and one nRF52833 DK. With this setup I cannot reproduce the issue, as only 2 simultaneous syncs are possible. I have a temporary change that I would like you to try for the assertion you see; I will send a draft PR and let you know here. |
@saleh-unikie could you please cherry-pick these two PRs (over latest upstream bf45055) and let me know if they help with the assertion: #42664 and #42678. Also, use these values (finger in the air, based on my experience that if Rx events are not processed fast enough by the Host, the Controller will report that sync failed to be established when the Rx buffers have been used up by the other syncs that are receiving reports):
|
It was changed due to debugging of issue #11 zephyrproject-rtos/zephyr#42518 (comment)
Thanks @cvinayak, if there are any other changes, let me know and I will test them as soon as possible. I switched to the latest commit of upstream main and then cherry-picked those two PRs.
I ran the receiver code while 10 transmitters were advertising (turned on at random times) for two rounds, letting it run for ~30 minutes each time. I didn't see any assertion failure, but after a while syncs started to terminate and were not re-established. Here you can see the log:
log files:
Based on what you've mentioned, I thought that the while loop I had implemented with busy waiting might not be a suitable way to establish syncs. So I replaced it with a blocking mechanism in order to release the CPU for other tasks (you can find the updated code here). The behavior now seems much better, but sync loss still happens regularly, and by chance I caught an assertion failure (unfortunately its print line is corrupted). Here you can see the log, each
Another test is ongoing and after ~20 min I haven't seen the failure again. I will repeat it, and if the failure is seen, I will report it here. |
Two things:
|
I made some changes to the print logs to make sure all of them are unique and the output is clearer. The changes are pushed to the repo.
The devices are running, and when a failure is seen I will report it here (fortunately/unfortunately it doesn't happen very often) |
@saleh-unikie This is the modified sample I have been testing under simulation, #42721, and the one-hour simulation log of 10 periodic advertisers to which the periodic sync sample makes up to 16 synchronizations (as advertisers change addresses, some repeated overlapping syncs are established): https://github.com/zephyrproject-rtos/zephyr/files/8046922/periodic_sync_multiple_one_hour.log There are a few sync terminations that I have not analyzed yet, but there are no stalls or assertions under simulation. This is without #42678 (purely upstream code with the modified sample in #42721) |
Thanks @cvinayak I had similar results, after disabling reception of CTE. Here you can find the results: Test Condition: Transmitter: Results:
It is clearly stable: only after 65 min was one of the syncs terminated, and it was re-established after ~4 min. (each Scenario B) CTE reception is enabled. Test duration: 14min
The full log is attached: 8_cte_enabled.txt Clearly, enabling CTE reception affects the syncs. What could be the next step? |
@saleh-unikie Thanks for your quick response.
I will switch my focus to reviewing the code for possible bugs when periodic sync has CTE reporting enabled, and will post here if I find something. |
To confirm the results, I ran the test again: once with the latest upstream/main (5094a6e) and once with #42678. In both cases, CTE reception was disabled. The result was similar: both were stable after running for ~1h. Here you can see the logs:
Yes it didn't stall at any point. The |
Thank you, once again. We are going in the right direction. (I believe the assertion could still happen, as I have done nothing to fix it; if it is a bug, it will recur.) I have switched my focus to reviewing the implementation when CTE is enabled. (I wish I had 11 nRF52833 DKs) |
Thanks for your time! I ran the above tests several times and saw these assertion failures: Test Condition: Transmitter: Failures:
Full log: 10_cte_enabled_42678.txt After the above log, I added a print line just before it tries to enable CTE reception.
Full log: 12_cte_enabled_42678.txt
Full log: 13_cte_enabled_42678.txt
Full log: 14_cte_enabled_42678.txt
Full log: 15_cte_enabled_42678.txt |
@saleh-unikie Based on code inspection, I am not sure if this is required; @ppryga can confirm whether these changes would help: #42757 |
After some time running with 4 DKs as TX and a single RX (based on code from the @saleh-unikie repo), I started to observe this: Then RX starts to lose synchronization. I'll check it with DF functionality disabled. I'll do that tomorrow. |
I observe LE-BIS Transfer PDUs also without CTE TX enabled: |
UPDATE! It was a false alarm with LE-BIS. I've just run a sniffer in the same environment without DKs exercising periodic advertising and still observe LE-BIS PDUs. So there is no issue with regard to periodic advertising. @saleh-unikie could you check if the issue is still present with code in current |
I tested it again using 22c0843 and the problem still exists. Test Scenario: Transmitter: Results:
Round 1 (failed after a long time)
Full log: 10tag_skip0_assertfailed.txt Round 2 (failed after a long time)
Full log: 10tag_skip0_assertfailed(2).txt |
@saleh-unikie we will address this as soon as possible. |
@saleh-unikie and @jakkra after some delays, I now have 10 nRF52833 DKs. Please confirm that I need to use these for debugging: |
Good news! |
@saleh-unikie During code inspection, I found that something that could be related to the ASSERTION FAIL is addressed by fixes in this PR: #44183. I will do more testing tomorrow to confirm whether this PR fixes the assertions reported by you here: #42518 (comment) Could you please try your samples with ZEPHYR_BASE set to the changes in PR #44183 |
Sure, I will do it tomorrow and will report here. |
@cvinayak I've tested #44183 and the reported problems are not seen any more. Thanks a lot! 🙏 There is another problem with receiving IQ samples properly, which is also seen on the main Zephyr repository (but which I have not reported yet). I think these are completely different issues, so we can consider this one solved. |
@saleh-unikie Thank you, and I appreciate the quick responses in helping me slowly resolve a few memory leaks and implementation defects on the way here! I have run your rx_multiple sample for over 8 hours today without terminates or assertions. I believe the terminate issue is resolved. I will keep an open mind on assertions, and will let my boards continue to maintain the sync over the weekend. I have noticed that if you enable Please close this PR; you can re-open it or create a new issue if you find that the original problem reported in this issue is not resolved. |
Describe the bug
Using #41091 (force-pushed on 2 Feb), I tested 10 simultaneous advertisers (AoA tags) with one scanner (AoA locator). The source code is based on the "connection-less direction finding" examples, with simultaneous tag support added to the scanner code. The advertisers are working with extended adv interval = 80ms, legacy adv interval = 150ms. All devices are nRF52833 DK boards.
Successful Scenario: If, on the receiver side, the "skip" parameter is a non-zero value, then "synchronization" is established successfully and the probability of losing the sync is low.
The output log is something like this:
Failure Scenario: If the "skip" parameter is set to zero, syncs are established normally at the beginning of the program, as in the condition above, but after a while it starts to lose the sync with tags, and it usually has to retry more than once to re-establish the sync. In most retries, the sync eventually terminates right after calling the bt_le_per_adv_sync_create function. The output log could be:
In both cases the timeout value is the same. I assumed that the "skip" parameter cannot affect sync termination, as I asked in this Nordic DevZone question, but it seems that it does.
https://devzone.nordicsemi.com/f/nordic-q-a/84232/more-information-about-skip-and-timeout-parameters-of-ble-5-advertising
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Termination should depend only on the "timeout" parameter; changing "skip" should not change how the sync between scanner and advertiser is lost.
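For reference, the relationship between the interval, skip, and timeout values can be sketched numerically. The snippet below is only an illustration, not Zephyr code: it converts the periodic advertising interval (units of 1.25 ms) and the sync timeout (units of 10 ms) to microseconds, and estimates how many listen attempts fit into the timeout window under the simplifying assumption that the receiver wakes only on every (skip + 1)-th periodic event. The function names are hypothetical, and the 1 s timeout used in the note after the code is an assumed value; the 150 ms interval is the per_adv_int mentioned in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Periodic advertising interval is expressed in units of 1.25 ms. */
static inline uint64_t per_adv_interval_to_us(uint16_t interval_units)
{
	return (uint64_t)interval_units * 1250U;
}

/* Periodic sync timeout is expressed in units of 10 ms. */
static inline uint64_t sync_timeout_to_us(uint16_t timeout_units)
{
	return (uint64_t)timeout_units * 10000U;
}

/*
 * Simplified model (an assumption, not controller behavior): the receiver
 * listens only on every (skip + 1)-th periodic event. Returns how many such
 * listen attempts fit into the sync timeout window.
 */
static inline uint64_t listen_attempts_before_timeout(uint16_t interval_units,
						      uint16_t skip,
						      uint16_t timeout_units)
{
	uint64_t spacing_us =
		per_adv_interval_to_us(interval_units) * ((uint64_t)skip + 1U);

	/* An interval of zero units is not a valid input for this model. */
	assert(spacing_us > 0U);

	return sync_timeout_to_us(timeout_units) / spacing_us;
}
```

With a 150 ms interval (120 units) and an assumed 1 s timeout (100 units), this model gives 6 listen attempts for skip = 0 and 1 attempt for skip = 5. Under this model a larger skip leaves fewer chances to recover before the timeout, so by itself it does not explain why skip = 0 would lose sync more often.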
Impact
The system is not stable when skip=0 and the scanner syncs with multiple advertisers.
Logs and console output
As you can see in the bug description (above)
Environment (please complete the following information):
Additional context
In addition to the above problem, sometimes this assertion failure happens too