kernel: Recursive spinlock in k_msgq_get() in the context of a k_work_poll handler #45267
Comments
Hm... can you walk me through which lock is held when the thread yields? The one in the workqueue seems correctly released around the callback. Is the bug that poll is calling its handlers with a spinlock held?
In my case it's the MsgQ lock which is held when the thread yields. Let's take my test program as an example (main.c):
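The linked main.c is not reproduced here; a minimal reproducer along the lines described in this issue might look roughly like this (a sketch only: the queue name, sizes, stack size and priorities are illustrative, and it assumes a recent Zephyr with CONFIG_POLL=y and CONFIG_ASSERT=y in prj.conf):

```c
/* prj.conf (assumed): CONFIG_POLL=y and CONFIG_ASSERT=y */
#include <zephyr/kernel.h>
#include <zephyr/sys/printk.h>

K_MSGQ_DEFINE(my_msgq, sizeof(uint32_t), 4, 4);
K_THREAD_STACK_DEFINE(wq_stack, 1024);

static struct k_work_q wq;              /* cooperative, higher priority */
static struct k_work_poll poll_work;
static struct k_poll_event events[1];

static void poll_work_handler(struct k_work *work)
{
	uint32_t msg;

	ARG_UNUSED(work);

	/* If the k_yield() path is taken, we get here while the putter still
	 * holds msgq->lock, and this k_msgq_get() asserts. */
	while (k_msgq_get(&my_msgq, &msg, K_NO_WAIT) == 0) {
		printk("got %u\n", msg);
	}

	/* Re-arm the triggered work item. */
	k_work_poll_submit_to_queue(&wq, &poll_work, events,
				    ARRAY_SIZE(events), K_FOREVER);
}

int main(void)
{
	uint32_t msg = 42U;

	/* The submitting thread must be preemptible and lower priority than
	 * the workqueue for the bug to show up. */
	k_thread_priority_set(k_current_get(), K_PRIO_PREEMPT(12));

	k_work_queue_init(&wq);
	k_work_queue_start(&wq, wq_stack, K_THREAD_STACK_SIZEOF(wq_stack),
			   K_PRIO_COOP(1), NULL);

	k_work_poll_init(&poll_work, poll_work_handler);
	k_poll_event_init(&events[0], K_POLL_TYPE_MSGQ_DATA_AVAILABLE,
			  K_POLL_MODE_NOTIFY_ONLY, &my_msgq);
	k_work_poll_submit_to_queue(&wq, &poll_work, events,
				    ARRAY_SIZE(events), K_FOREVER);

	/* Putting a message signals the poll event while msgq->lock is held;
	 * the subsequent yield switches to the workqueue before the lock is
	 * released. */
	k_msgq_put(&my_msgq, &msg, K_NO_WAIT);

	return 0;
}
```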
Note:
OK, gotcha. Let me restate and see if this matches your expectations:
This is an evolutionary mess-up, I think. Poll traditionally wouldn't schedule from its handler (and there's no reason to expect it to). But now it's grown the "triggered work" feature, which does exactly that. So, a few possible ways to resolve this that I can see:
That's exactly it.
Which solution do you think is best? I drafted an implementation of the 2nd solution you proposed, so I could propose a PR soon if that one is chosen. Do you think an appropriate test will be needed to cover this case?
This adds the internal function z_work_submit_to_queue(), which submits the work item to the queue but, unlike the public function k_work_submit_to_queue(), doesn't force the thread to yield. When called from poll.c in the context of k_work_poll events, it ensures that the thread does not yield while holding the spinlock of the object that became available. Fixes zephyrproject-rtos#45267 Signed-off-by: Lucas Dietrich <[email protected]>
When an object availability event triggers a k_work_poll item, the object lock should not be held anymore during the execution of the work callback. Signed-off-by: Lucas Dietrich <[email protected]>
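A minimal sketch of the idea behind these commits (simplified and hypothetical: queue_work_item_sketch() is a stand-in helper, and the two functions below only model z_work_submit_to_queue() and k_work_submit_to_queue(); this is not the actual kernel source):

```c
#include <zephyr/kernel.h>

/* Stand-in for the real "append the item and wake the queue thread" step;
 * its internals do not matter for this sketch. */
static inline void queue_work_item_sketch(struct k_work_q *queue,
					  struct k_work *work)
{
	ARG_UNUSED(queue);
	ARG_UNUSED(work);
}

/* Models the internal z_work_submit_to_queue(): only queue the item and
 * never yield, so it is safe to call while the caller still holds the
 * spinlock of the object that just became available (the poll.c case). */
static void submit_no_yield(struct k_work_q *queue, struct k_work *work)
{
	queue_work_item_sketch(queue, work);
}

/* Models the public k_work_submit_to_queue(): after queuing, a preemptible
 * caller yields so that a higher-priority workqueue can run immediately.
 * Reaching this k_yield() with msgq->lock still held is what triggered the
 * "Recursive spinlock" assertion. */
static void submit_and_maybe_yield(struct k_work_q *queue, struct k_work *work)
{
	queue_work_item_sketch(queue, work);

	if (k_is_preempt_thread() != 0) {
		k_yield();
	}
}
```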
Seems like @lucasdietrich is well on the way to the solution here, so reassigning for clarity. Move it back to me if it gets stuck for some reason.
This adds the internal function z_work_submit_to_queue(), which submits the work item to the queue but, unlike the public function k_work_submit_to_queue(), doesn't force the thread to yield. When called from poll.c in the context of k_work_poll events, it ensures that the thread does not yield while holding the spinlock of the object that became available. Fixes zephyrproject-rtos#45267 Signed-off-by: Lucas Dietrich <[email protected]> (cherry picked from commit 9a848b3)
Describe the bug
Recursive spinlock in k_msgq_get() when called from a k_work_poll handler function, when the event is configured for notification on message availability in the MsgQ. This situation appears only if the thread putting the message into the MsgQ is preemptive and if the workqueue assigned to process the k_work_poll handler has a higher priority.
Edit: I described the issue for the MsgQ only, but this issue actually occurs with all "transferable" kernel objects that support polling:
- k_msgq_put() / k_msgq_get() through K_POLL_TYPE_MSGQ_DATA_AVAILABLE
- k_fifo_put() / k_fifo_get() through K_POLL_TYPE_FIFO_DATA_AVAILABLE
- k_sem_give() / k_sem_take() through K_POLL_TYPE_SEM_AVAILABLE
To Reproduce
Please find this Zephyr project zephyr-qemu-dev (msgq: main.c), which can be run in QEMU to reproduce the issue. (The same issue occurs with a FIFO: fifo: main.c.)
Basically:
- Enable CONFIG_POLL and CONFIG_ASSERT.
- The thread putting the message into the MsgQ is preemptive (K_PRIO_PREEMPT(12)).
- The workqueue processing the k_work_poll handler runs at a higher, cooperative priority (K_PRIO_COOP(1)).
- A k_work_poll item is submitted with an event on message availability (K_POLL_TYPE_MSGQ_DATA_AVAILABLE), which will queue the work item to the configured workqueue.
- The work handler calls k_msgq_get() on the MsgQ when called.
- The preemptive thread calls k_msgq_put() on the MsgQ.
When running this application, the execution will fail with an "assertion error" (if CONFIG_ASSERT is enabled):
As far as I remember, when CONFIG_ASSERT is disabled the k_work_poll item is not resubmitted correctly (to be confirmed).
Expected behavior
No assertion error.
Impact
It's then not possible to handle MsgQ messages directly from the k_work_poll handler, and the k_work_poll item cannot be resubmitted from the same handler using k_work_poll_submit(_to_queue). A workaround would be to create a dedicated thread which waits on MsgQ messages instead of using a k_work_poll.
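A rough sketch of that workaround (illustrative only; my_msgq, the stack size and the priority are hypothetical names and values, not taken from the original project):

```c
#include <zephyr/kernel.h>
#include <zephyr/sys/printk.h>

/* The message queue is assumed to be defined elsewhere with K_MSGQ_DEFINE(). */
extern struct k_msgq my_msgq;

static void msgq_consumer(void *p1, void *p2, void *p3)
{
	uint32_t msg;

	ARG_UNUSED(p1);
	ARG_UNUSED(p2);
	ARG_UNUSED(p3);

	for (;;) {
		/* Blocks here without any kernel spinlock held, so there is
		 * no recursive-spinlock issue when a message arrives. */
		if (k_msgq_get(&my_msgq, &msg, K_FOREVER) == 0) {
			printk("got %u\n", msg);
		}
	}
}

/* Any reasonable priority works; this thread simply replaces the
 * triggered k_work_poll item. */
K_THREAD_DEFINE(msgq_consumer_tid, 1024, msgq_consumer, NULL, NULL, NULL,
		K_PRIO_PREEMPT(10), 0, 0);
```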
Logs and console output
The assertion error occurs here:
Environment:
Diagnostic
I debugged this and noticed that a k_yield() is called from k_work_submit_to_queue() in the context of the MsgQ lock, which may cause a thread switch before msgq->lock is unlocked (see the call stack in the screenshot).
zephyr/kernel/work.c, lines 376 to 378 in 83c79d1
If the program follows this particular path, because of this k_yield() the work item handler will be executed while msgq->lock is still held by the thread calling k_msgq_put(). The call to k_msgq_get() in the work handler then causes a "Recursive spinlock" assertion error.
Disabling the k_yield() solves the issue: lucasdietrich@50923a3
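Putting the pieces together, the failing path described above looks roughly like this (a sketch reconstructed from the description, not the actual call stack from the screenshot):

```c
/*
 * preemptive thread, K_PRIO_PREEMPT(12):
 *   k_msgq_put()                    takes msgq->lock
 *     poll event handling           MsgQ data became available
 *       triggered k_work_poll item  submitted to the workqueue
 *         k_work_submit_to_queue()
 *           k_yield()               switch to the higher-priority workqueue
 *                                   ...while msgq->lock is still held
 *
 * cooperative workqueue thread, K_PRIO_COOP(1):
 *   work handler
 *     k_msgq_get()                  tries to take msgq->lock again
 *                                   -> "Recursive spinlock" assertion
 */
```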
Test
I already wrote a test routine to cover this situation (for which I could propose a pull request): lucasdietrich@c0a7deb. Run the test with:
./scripts/twister -i -v -p qemu_x86 -T tests/kernel/workq/work_queue
Solutions
What would be good solutions for this? This k_yield() should definitely not be called in this context, but how can k_work_submit_to_queue() be told not to yield in this case?
A simple and naive solution, which I think is terrible, is this one:
lucasdietrich@9c49c67#diff-06e616b4678c2bb4ee4b383f66a36821aa1940dc7bf20b8a3f68b1573bb02875
Thank you for your help 😄