Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kernel: Recursive spinlock in k_msgq_get() in the context of a k_work_poll handler #45267

Closed
lucasdietrich opened this issue May 1, 2022 · 6 comments · Fixed by #45311
Closed
Assignees
Labels
area: Kernel bug The issue is a bug, or the PR is fixing a bug priority: high High impact/importance bug

Comments

@lucasdietrich
Copy link
Contributor

lucasdietrich commented May 1, 2022

Describe the bug
Recursive spinlock in k_msgq_get(), when called from a k_work_poll handler function, when event is configured for notification on message availability in the MsgQ.

This situation appears only if the thread "putting" the message in the MsgQ is preemptive and if the workqueue assigned for the process of the k_work_poll handler has a higher priority.

Edit: I described the issue for the MsgQ only but actually this issue occurs with all "transferable" kernel objects which support POLLING :

  • MsgQ: k_msgq_put() / k_msgq_get() through K_POLL_TYPE_MSGQ_DATA_AVAILABLE
  • Fifo: k_fifo_put() / k_fifo_get() through K_POLL_TYPE_FIFO_DATA_AVAILABLE
  • Semaphores: k_sem_give() / k_sem_take() through K_POLL_TYPE_SEM_AVAILABLE

To Reproduce
Please find here this Zephyr project zephyr-qemu-dev ( msgq: main.c ) which can be run in qemu to reproduce this issue. (Same issue with FIFO fifo: main.c)

Basically:

  • Enable CONFIG_POLL and CONFIG_ASSERT
  • Create a preemptive thread with a certain priority (e.g. K_PRIO_PREEMPT(12))
  • Create a workqueue with a higher priority (e.g. K_PRIO_COOP(1))
  • Configure a k_work_poll item to be notified for any new message in a MsgQ (K_POLL_TYPE_MSGQ_DATA_AVAILABLE), which will queue the work item to the configured workqueue.
  • Configure the k_work_poll handler function to do a k_msg_get on the MsgQ when called
  • From the thread, "put" a message to the MsgQ using k_msgq_put()

When running this application, the execution will fail with an "assertion error" (if CONFIG_ASSERT is enabled):

As far as I remember, when CONFIG_ASSERT is disabled the k_work_poll is not resubmitted correctly (to confirm).

Expected behavior
No assertion error.

Impact
It's then not possible to handle msgQ messages directly from the k_work_poll handler and the k_work_poll item cannot be resubmitted from the same handler using k_work_pull_submit(_to_queue). A workarround would be to create a dedicated thread which waits on MsgQ messages instead of a k_work_poll.

Logs and console output
The assertion error occurs here

*** Booting Zephyr OS build zephyr-v2.7.2  ***
ASSERTION FAIL [z_spin_lock_valid(l)] @ WEST_TOPDIR/zephyr/include/spinlock.h:142
        Recursive spinlock 0x10a130

Environment :

  • Linux
  • Zephyr 2.7.2, 3.0.0 and current main (83c79d1)
  • All platforms

Diagnostic
I debugged this and noticed that a k_yield() is called from k_work_submit_to_queue() in the context of the MsgQ lock which may cause a thread switch before unlocking the msgq->lock (see the call stack in the screenshot).

zephyr/kernel/work.c

Lines 376 to 378 in 83c79d1

if ((ret > 0) && (k_is_preempt_thread() != 0)) {
k_yield();
}

issue_z_impl_k_msgq_put_2

If the program follows this particular path, because of this k_yield(), the work item handler will be executed while the msgq->lock stays locked by the thread calling k_msgq_put(). Then the call to k_msgq_get() in the work handler will cause a "Recursive spinlock" assertion error.

Disabling the k_yield() solves the issue : lucasdietrich@50923a3

Test
I already wrote this test routine to cover this situation (for which I could propose a pull request) : lucasdietrich@c0a7deb, run the test with ./scripts/twister -i -v -p qemu_x86 -T tests/kernel/workq/work_queue

Solutions
What would be good solutions for this ? This k_yield() should definitely not be called in this context, but how to notify the k_work_submit_to_queue() to not yield in this case ?

A simple and naive solution is, but which I think is terrible :
lucasdietrich@9c49c67#diff-06e616b4678c2bb4ee4b383f66a36821aa1940dc7bf20b8a3f68b1573bb02875

Thank you for your help 😄

@lucasdietrich lucasdietrich added the bug The issue is a bug, or the PR is fixing a bug label May 1, 2022
@lucasdietrich lucasdietrich changed the title Recursive spinlock in k_msgq_get() in the context of a k_work_poll handler kernel: Recursive spinlock in k_msgq_get() in the context of a k_work_poll handler May 1, 2022
@mbolivar-nordic mbolivar-nordic added the priority: high High impact/importance bug label May 3, 2022
@andyross
Copy link
Contributor

andyross commented May 3, 2022

Hm... can you walk me through which lock is held when the thread yields? The one in the workqueue seems correctly released around the callback. Is the bug that poll is calling its handlers with a spinlock held?

@lucasdietrich
Copy link
Contributor Author

lucasdietrich commented May 3, 2022

In my case it's the MsgQ lock which is held when the thread yields.

Let's take my test program as example (main.c):

Note:

This situation appears only if the thread "putting" the message in the MsgQ is preemptive and if the workqueue assigned for the process of the k_work_poll handler has a higher priority.

@andyross
Copy link
Contributor

andyross commented May 3, 2022

OK, gotcha. Let me restate and see if this matches your expectations:

  1. msgq calls z_handle_obj_poll_events() inside its spinlock
  2. for the case of triggered work items, poll then calls k_work_submit_to_queue()
  3. k_work_submit_to_queue() is a scheduling point and tries to context switch

This is an evolutionary messup, I think. Poll traditionally wouldn't schedule from its handler (and there's no reason to expect it to). But now it's grown the "triggered work" feature, which does exactly that.

So, a few possible ways to resolve this that I can see:

  1. Release the msgq lock before signaling poll. This would require some surgery to the way the code is structured, but looks doable. I do worry about the introduction of races in app code though (because now it's possible to sneak in and mutate the queue before the polling thread returns). In theory k_poll() is level triggered and apps all "know" they need to retest, but in practice I'm sure there are messes.

  2. Fix the work queue so that there is an internal "non-scheduling" variant of submit() that code like poll can use. This is I think exactly what you were thinking about with your sched lock solution? Basically take the meat of the code and turn it into z_work_submit() or whatever and then wrap the public API in one that tests a return value and reschedules if needed.

  3. Remove the scheduling point form k_work_submit(). I'm not completely convinced this is a good place to reschedule anyway. Work queues are intended for low priority async work, very few apps are going to expect them to preempt the calling thread!

@lucasdietrich
Copy link
Contributor Author

That's exactly it.

  1. I quickly thought of this solution, it doesn't seem to be the simplest way to do it to me. Is there a particular advantage for using this solution over the others?
  2. You're right, that's what I had in mind: Do not alter the behavior of the normal workqueue mecanism but prevent the thread from yielding in this particular case, to simply solve the bug.
  3. (With my young experience) I admit I was quite suprised when I noticed that the kernel does the yield for us when submitting a work item. I would have expected this to be in charge on the application in only a few special cases.

What is the best solution according to you? I drafted an implementation for the 2nd solution you proposed. Thus I would be able to propose a PR soon if this one is chosen.

Do you think an appropriate test will be needed to cover this case ?

@andyross
Copy link
Contributor

andyross commented May 3, 2022

I think I'd lean with #2 also, as it doesn't involve an API change and can be done 100% compatibly. #3 is simpler still, and while it is a behavior change, we aren't actually breaking any specified behavior as k_work_submit_to_queue() was never documented to reschedule.

lucasdietrich added a commit to lucasdietrich/zephyr that referenced this issue May 4, 2022
This adds the internal function z_work_submit_to_queue(), which
submits the work item to the queue but doesn't force the thread to yield,
compared to the public function k_work_submit_to_queue().

When called from poll.c in the context of k_work_poll events, it ensures
that the thread does not yield in the context of the spinlock of object
that became available.

Fixes zephyrproject-rtos#45267

Signed-off-by: Lucas Dietrich <[email protected]>
lucasdietrich added a commit to lucasdietrich/zephyr that referenced this issue May 4, 2022
When an object availability event triggers a k_work_poll
item, the object lock should not be held anymore
during the execution of the work callback.

Signed-off-by: Lucas Dietrich <[email protected]>
@andyross
Copy link
Contributor

Seems like @lucasdietrich is well on the way to the solution here, so reassigning for clarity. Move it back to me if it gets stuck for some reason.

@andyross andyross assigned lucasdietrich and unassigned andyross May 10, 2022
carlescufi pushed a commit that referenced this issue May 10, 2022
This adds the internal function z_work_submit_to_queue(), which
submits the work item to the queue but doesn't force the thread to yield,
compared to the public function k_work_submit_to_queue().

When called from poll.c in the context of k_work_poll events, it ensures
that the thread does not yield in the context of the spinlock of object
that became available.

Fixes #45267

Signed-off-by: Lucas Dietrich <[email protected]>
carlescufi pushed a commit that referenced this issue May 10, 2022
When an object availability event triggers a k_work_poll
item, the object lock should not be held anymore
during the execution of the work callback.

Signed-off-by: Lucas Dietrich <[email protected]>
laxiLang pushed a commit to laxiLang/zephyr that referenced this issue May 30, 2022
This adds the internal function z_work_submit_to_queue(), which
submits the work item to the queue but doesn't force the thread to yield,
compared to the public function k_work_submit_to_queue().

When called from poll.c in the context of k_work_poll events, it ensures
that the thread does not yield in the context of the spinlock of object
that became available.

Fixes zephyrproject-rtos#45267

Signed-off-by: Lucas Dietrich <[email protected]>
laxiLang pushed a commit to laxiLang/zephyr that referenced this issue May 30, 2022
When an object availability event triggers a k_work_poll
item, the object lock should not be held anymore
during the execution of the work callback.

Signed-off-by: Lucas Dietrich <[email protected]>
npitre pushed a commit to npitre/zephyr that referenced this issue Feb 3, 2023
This adds the internal function z_work_submit_to_queue(), which
submits the work item to the queue but doesn't force the thread to yield,
compared to the public function k_work_submit_to_queue().

When called from poll.c in the context of k_work_poll events, it ensures
that the thread does not yield in the context of the spinlock of object
that became available.

Fixes zephyrproject-rtos#45267

Signed-off-by: Lucas Dietrich <[email protected]>
(cherry picked from commit 9a848b3)
cfriedt pushed a commit that referenced this issue Feb 3, 2023
This adds the internal function z_work_submit_to_queue(), which
submits the work item to the queue but doesn't force the thread to yield,
compared to the public function k_work_submit_to_queue().

When called from poll.c in the context of k_work_poll events, it ensures
that the thread does not yield in the context of the spinlock of object
that became available.

Fixes #45267

Signed-off-by: Lucas Dietrich <[email protected]>
(cherry picked from commit 9a848b3)
pakar1 pushed a commit to Taggr-AB/zephyr that referenced this issue May 2, 2023
This adds the internal function z_work_submit_to_queue(), which
submits the work item to the queue but doesn't force the thread to yield,
compared to the public function k_work_submit_to_queue().

When called from poll.c in the context of k_work_poll events, it ensures
that the thread does not yield in the context of the spinlock of object
that became available.

Fixes zephyrproject-rtos#45267

Signed-off-by: Lucas Dietrich <[email protected]>
(cherry picked from commit 9a848b3)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Kernel bug The issue is a bug, or the PR is fixing a bug priority: high High impact/importance bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants