kernel: Recursive spinlock in k_msgq_get() in the context of a k_work_poll handler #45267

lucasdietrich · 2022-05-01T16:34:50Z

Describe the bug
Recursive spinlock in k_msgq_get(), when called from a k_work_poll handler function, when event is configured for notification on message availability in the MsgQ.

This situation appears only if the thread "putting" the message in the MsgQ is preemptive and if the workqueue assigned for the process of the k_work_poll handler has a higher priority.

Edit: I described the issue for the MsgQ only but actually this issue occurs with all "transferable" kernel objects which support POLLING :

MsgQ: k_msgq_put() / k_msgq_get() through K_POLL_TYPE_MSGQ_DATA_AVAILABLE
Fifo: k_fifo_put() / k_fifo_get() through K_POLL_TYPE_FIFO_DATA_AVAILABLE
Semaphores: k_sem_give() / k_sem_take() through K_POLL_TYPE_SEM_AVAILABLE

To Reproduce
Please find here this Zephyr project zephyr-qemu-dev ( msgq: main.c ) which can be run in qemu to reproduce this issue. (Same issue with FIFO fifo: main.c)

Basically:

Enable CONFIG_POLL and CONFIG_ASSERT
Create a preemptive thread with a certain priority (e.g. K_PRIO_PREEMPT(12))
Create a workqueue with a higher priority (e.g. K_PRIO_COOP(1))
Configure a k_work_poll item to be notified for any new message in a MsgQ (K_POLL_TYPE_MSGQ_DATA_AVAILABLE), which will queue the work item to the configured workqueue.
Configure the k_work_poll handler function to do a k_msg_get on the MsgQ when called
From the thread, "put" a message to the MsgQ using k_msgq_put()

When running this application, the execution will fail with an "assertion error" (if CONFIG_ASSERT is enabled):

As far as I remember, when CONFIG_ASSERT is disabled the k_work_poll is not resubmitted correctly (to confirm).

Expected behavior
No assertion error.

Impact
It's then not possible to handle msgQ messages directly from the k_work_poll handler and the k_work_poll item cannot be resubmitted from the same handler using k_work_pull_submit(_to_queue). A workarround would be to create a dedicated thread which waits on MsgQ messages instead of a k_work_poll.

Logs and console output
The assertion error occurs here

*** Booting Zephyr OS build zephyr-v2.7.2  ***
ASSERTION FAIL [z_spin_lock_valid(l)] @ WEST_TOPDIR/zephyr/include/spinlock.h:142
        Recursive spinlock 0x10a130

Environment :

Linux
Zephyr 2.7.2, 3.0.0 and current main (83c79d1)
All platforms

Diagnostic
I debugged this and noticed that a k_yield() is called from k_work_submit_to_queue() in the context of the MsgQ lock which may cause a thread switch before unlocking the msgq->lock (see the call stack in the screenshot).

zephyr/kernel/work.c

Lines 376 to 378 in 83c79d1

    
           if ((ret > 0) && (k_is_preempt_thread() != 0)) { 
        
           	k_yield(); 
        
           }

If the program follows this particular path, because of this k_yield(), the work item handler will be executed while the msgq->lock stays locked by the thread calling k_msgq_put(). Then the call to k_msgq_get() in the work handler will cause a "Recursive spinlock" assertion error.

Disabling the k_yield() solves the issue : lucasdietrich@50923a3

Test
I already wrote this test routine to cover this situation (for which I could propose a pull request) : lucasdietrich@c0a7deb, run the test with ./scripts/twister -i -v -p qemu_x86 -T tests/kernel/workq/work_queue

Solutions
What would be good solutions for this ? This k_yield() should definitely not be called in this context, but how to notify the k_work_submit_to_queue() to not yield in this case ?

A simple and naive solution is, but which I think is terrible :
lucasdietrich@9c49c67#diff-06e616b4678c2bb4ee4b383f66a36821aa1940dc7bf20b8a3f68b1573bb02875

Thank you for your help 😄

The text was updated successfully, but these errors were encountered:

andyross · 2022-05-03T16:46:47Z

Hm... can you walk me through which lock is held when the thread yields? The one in the workqueue seems correctly released around the callback. Is the bug that poll is calling its handlers with a spinlock held?

lucasdietrich · 2022-05-03T17:14:24Z

In my case it's the MsgQ lock which is held when the thread yields.

Let's take my test program as example (main.c):

In the call of k_msgq_put() :
- msgq->lock first locked here : https://github.com/zephyrproject-rtos/zephyr/blob/main/kernel/msg_q.c#L124
- then thread yields in k_work_submit_to_queue() function while the msgq->lock stays locked.
The k_work_poll callback calls k_msgq_get() function while the msgq->lock is already held.
- I get the assertion error here if it is enabled.

Note:

This situation appears only if the thread "putting" the message in the MsgQ is preemptive and if the workqueue assigned for the process of the k_work_poll handler has a higher priority.

andyross · 2022-05-03T17:31:08Z

OK, gotcha. Let me restate and see if this matches your expectations:

msgq calls z_handle_obj_poll_events() inside its spinlock
for the case of triggered work items, poll then calls k_work_submit_to_queue()
k_work_submit_to_queue() is a scheduling point and tries to context switch

This is an evolutionary messup, I think. Poll traditionally wouldn't schedule from its handler (and there's no reason to expect it to). But now it's grown the "triggered work" feature, which does exactly that.

So, a few possible ways to resolve this that I can see:

Release the msgq lock before signaling poll. This would require some surgery to the way the code is structured, but looks doable. I do worry about the introduction of races in app code though (because now it's possible to sneak in and mutate the queue before the polling thread returns). In theory k_poll() is level triggered and apps all "know" they need to retest, but in practice I'm sure there are messes.
Fix the work queue so that there is an internal "non-scheduling" variant of submit() that code like poll can use. This is I think exactly what you were thinking about with your sched lock solution? Basically take the meat of the code and turn it into z_work_submit() or whatever and then wrap the public API in one that tests a return value and reschedules if needed.
Remove the scheduling point form k_work_submit(). I'm not completely convinced this is a good place to reschedule anyway. Work queues are intended for low priority async work, very few apps are going to expect them to preempt the calling thread!

lucasdietrich · 2022-05-03T19:38:48Z

That's exactly it.

I quickly thought of this solution, it doesn't seem to be the simplest way to do it to me. Is there a particular advantage for using this solution over the others?
You're right, that's what I had in mind: Do not alter the behavior of the normal workqueue mecanism but prevent the thread from yielding in this particular case, to simply solve the bug.
(With my young experience) I admit I was quite suprised when I noticed that the kernel does the yield for us when submitting a work item. I would have expected this to be in charge on the application in only a few special cases.

What is the best solution according to you? I drafted an implementation for the 2nd solution you proposed. Thus I would be able to propose a PR soon if this one is chosen.

Do you think an appropriate test will be needed to cover this case ?

andyross · 2022-05-03T19:50:15Z

I think I'd lean with #2 also, as it doesn't involve an API change and can be done 100% compatibly. #3 is simpler still, and while it is a behavior change, we aren't actually breaking any specified behavior as k_work_submit_to_queue() was never documented to reschedule.

This adds the internal function z_work_submit_to_queue(), which submits the work item to the queue but doesn't force the thread to yield, compared to the public function k_work_submit_to_queue(). When called from poll.c in the context of k_work_poll events, it ensures that the thread does not yield in the context of the spinlock of object that became available. Fixes zephyrproject-rtos#45267 Signed-off-by: Lucas Dietrich <[email protected]>

When an object availability event triggers a k_work_poll item, the object lock should not be held anymore during the execution of the work callback. Signed-off-by: Lucas Dietrich <[email protected]>

andyross · 2022-05-10T14:39:05Z

Seems like @lucasdietrich is well on the way to the solution here, so reassigning for clarity. Move it back to me if it gets stuck for some reason.

This adds the internal function z_work_submit_to_queue(), which submits the work item to the queue but doesn't force the thread to yield, compared to the public function k_work_submit_to_queue(). When called from poll.c in the context of k_work_poll events, it ensures that the thread does not yield in the context of the spinlock of object that became available. Fixes #45267 Signed-off-by: Lucas Dietrich <[email protected]>

When an object availability event triggers a k_work_poll item, the object lock should not be held anymore during the execution of the work callback. Signed-off-by: Lucas Dietrich <[email protected]>

This adds the internal function z_work_submit_to_queue(), which submits the work item to the queue but doesn't force the thread to yield, compared to the public function k_work_submit_to_queue(). When called from poll.c in the context of k_work_poll events, it ensures that the thread does not yield in the context of the spinlock of object that became available. Fixes zephyrproject-rtos#45267 Signed-off-by: Lucas Dietrich <[email protected]>

When an object availability event triggers a k_work_poll item, the object lock should not be held anymore during the execution of the work callback. Signed-off-by: Lucas Dietrich <[email protected]>

This adds the internal function z_work_submit_to_queue(), which submits the work item to the queue but doesn't force the thread to yield, compared to the public function k_work_submit_to_queue(). When called from poll.c in the context of k_work_poll events, it ensures that the thread does not yield in the context of the spinlock of object that became available. Fixes zephyrproject-rtos#45267 Signed-off-by: Lucas Dietrich <[email protected]> (cherry picked from commit 9a848b3)

This adds the internal function z_work_submit_to_queue(), which submits the work item to the queue but doesn't force the thread to yield, compared to the public function k_work_submit_to_queue(). When called from poll.c in the context of k_work_poll events, it ensures that the thread does not yield in the context of the spinlock of object that became available. Fixes #45267 Signed-off-by: Lucas Dietrich <[email protected]> (cherry picked from commit 9a848b3)

This adds the internal function z_work_submit_to_queue(), which submits the work item to the queue but doesn't force the thread to yield, compared to the public function k_work_submit_to_queue(). When called from poll.c in the context of k_work_poll events, it ensures that the thread does not yield in the context of the spinlock of object that became available. Fixes zephyrproject-rtos#45267 Signed-off-by: Lucas Dietrich <[email protected]> (cherry picked from commit 9a848b3)

lucasdietrich added the bug The issue is a bug, or the PR is fixing a bug label May 1, 2022

lucasdietrich changed the title ~~Recursive spinlock in k_msgq_get() in the context of a k_work_poll handler~~ kernel: Recursive spinlock in k_msgq_get() in the context of a k_work_poll handler May 1, 2022

mbolivar-nordic added the priority: high High impact/importance bug label May 3, 2022

mbolivar-nordic assigned andyross May 3, 2022

mbolivar-nordic added the area: Kernel label May 3, 2022

lucasdietrich mentioned this issue May 3, 2022

Fix recursive spinlock in k_msgq_get #45311

Merged

andyross assigned lucasdietrich and unassigned andyross May 10, 2022

carlescufi closed this as completed in #45311 May 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kernel: Recursive spinlock in k_msgq_get() in the context of a k_work_poll handler #45267

kernel: Recursive spinlock in k_msgq_get() in the context of a k_work_poll handler #45267

lucasdietrich commented May 1, 2022 •

edited

Loading

andyross commented May 3, 2022

lucasdietrich commented May 3, 2022 •

edited

Loading

andyross commented May 3, 2022

lucasdietrich commented May 3, 2022

andyross commented May 3, 2022

andyross commented May 10, 2022

kernel: Recursive spinlock in k_msgq_get() in the context of a k_work_poll handler #45267

kernel: Recursive spinlock in k_msgq_get() in the context of a k_work_poll handler #45267

Comments

lucasdietrich commented May 1, 2022 • edited Loading

andyross commented May 3, 2022

lucasdietrich commented May 3, 2022 • edited Loading

andyross commented May 3, 2022

lucasdietrich commented May 3, 2022

andyross commented May 3, 2022

andyross commented May 10, 2022

lucasdietrich commented May 1, 2022 •

edited

Loading

lucasdietrich commented May 3, 2022 •

edited

Loading