
wtclient: ensure correct disk mode for overflow queue #8377

Merged
merged 3 commits into lightningnetwork:master from ellemouton:towerTestFlakeFix
Jan 23, 2024

Conversation


@ellemouton ellemouton commented Jan 12, 2024

Before this commit, the watchtower client's DiskOverflowQueue could, if no consumer was reading from the queue, build up backup tasks in memory instead of overflowing them to disk as expected. This commit fixes this by ensuring that if a task is read from disk, the mode is set to disk mode so that any new items are persisted to disk in this scenario.

More detailed explanation

This order of events could trigger the flake in question:

  1. We start out with toDisk = false (i.e. we are in memory-mode).
  2. Enough items get pushed to the queue to fill up memQueue to its capacity.
  3. Two more items are added to the queue in quick succession.
  4. For the first item, drainInputList will try to push it to inputChan, but because memQueue is full, this will return success=false and the mode will be switched to disk-mode (toDisk=true).
  5. drainInputList will then instead write the item to disk and push a notification onto the newDiskItemSignal channel.
  6. feedMemQueue then responds to this signal and calls feedFromDisk, which will pop the first item from disk and block on pushing the item to the currently full memQueue.
    ⚠️ Here is where things get race-y:
  7. drainInputList starts handling the second item - since the mode is still disk-mode, it prepares to write the item to disk (but does not do so yet!).
  8. One item gets popped from memQueue, meaning that feedMemQueue manages to push the first item to memQueue and then goes on to the next iteration of the feedFromDisk loop. Here, it will try to read an item from disk but will see that there are no more items, and so will set the mode back to memory-mode.
  9. At this point, drainInputList actually writes the second item to disk and sends a new newDiskItemSignal.
  10. feedMemQueue will then respond to this signal by reading the new item from disk and blocking on adding it to the memQueue.
  11. Now, for any new items that drainInputList gets, it will try to push them to inputChan because the mode is memory-mode, BUT feedMemQueue won't be listening on that channel because it is blocked on pushing an item to memQueue.

We want to make sure that if feedMemQueue is ever blocking on pushing to memQueue, then the mode MUST be disk-mode.
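
For readers less familiar with the queue internals, here is a minimal, self-contained sketch of the two goroutines involved. The names and structure (`queue`, `task`, `handleNewItem`, a slice standing in for the on-disk queue) are hypothetical simplifications for illustration only and not the actual lnd implementation; they merely mirror the steps listed above.

```go
package overflowsketch

import "sync"

// task stands in for a backup task; queue stands in for the
// DiskOverflowQueue. All names here are simplified and hypothetical.
type task struct{ id int }

type queue struct {
	mu                sync.Mutex
	toDisk            bool          // false = memory-mode, true = disk-mode
	disk              []task        // stand-in for the on-disk queue
	memQueue          chan task     // bounded; a send blocks when it is full
	inputChan         chan task     // memory-mode hand-off to feedMemQueue
	newDiskItemSignal chan struct{} // "a new item was written to disk"
}

func (q *queue) setToDisk(v bool) { q.mu.Lock(); q.toDisk = v; q.mu.Unlock() }

func (q *queue) getToDisk() bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	return q.toDisk
}

// handleNewItem is the drainInputList side, handling one incoming item.
func (q *queue) handleNewItem(t task) {
	if !q.getToDisk() {
		select {
		case q.inputChan <- t:
			// Memory-mode hand-off succeeded.
			return
		default:
			// Hand-off failed (memQueue is full, so feedMemQueue is
			// not reading inputChan): switch to disk-mode (step 4).
			q.setToDisk(true)
		}
	}

	// Disk-mode: persist the item and signal feedMemQueue (steps 5, 7, 9).
	q.mu.Lock()
	q.disk = append(q.disk, t)
	q.mu.Unlock()
	q.newDiskItemSignal <- struct{}{}
}

// feedFromDisk is the feedMemQueue side *before* the fix: it drains the
// on-disk queue into memQueue. (The memory-mode path, where feedMemQueue
// forwards inputChan to memQueue, is omitted for brevity.)
func (q *queue) feedFromDisk() {
	for {
		q.mu.Lock()
		if len(q.disk) == 0 {
			// Step 8: disk looks empty, so the mode flips back to
			// memory-mode, even though drainInputList may already be
			// preparing to write another item to disk.
			q.toDisk = false
			q.mu.Unlock()
			return
		}
		t := q.disk[0]
		q.disk = q.disk[1:]
		q.mu.Unlock()

		// Steps 6 and 10: this send can block while the mode still says
		// memory-mode, violating the invariant above; new items are then
		// handed to inputChan, which nobody is reading (step 11).
		q.memQueue <- t
	}
}
```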

The Fix

The fix is straightforward: if feedMemQueue successfully pops an item from disk (and is therefore about to potentially block on memQueue), it first sets the mode to disk-mode.
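
Expressed against the same hypothetical sketch as above, the change amounts to one extra assignment before the potentially blocking send:

```go
// feedFromDisk with the fix applied (same simplified sketch as above, not
// the actual lnd code): set disk-mode as soon as an item has been popped
// from disk, before the send on memQueue that may block.
func (q *queue) feedFromDiskFixed() {
	for {
		q.mu.Lock()
		if len(q.disk) == 0 {
			q.toDisk = false
			q.mu.Unlock()
			return
		}
		t := q.disk[0]
		q.disk = q.disk[1:]

		// The fix: we are about to potentially block on memQueue, so the
		// mode must already be disk-mode so that any new items keep being
		// persisted to disk rather than handed to inputChan.
		q.toDisk = true
		q.mu.Unlock()

		q.memQueue <- t
	}
}
```

With that single assignment, the invariant above holds: whenever feedMemQueue can block on memQueue, the mode is already disk-mode, so drainInputList keeps writing new items to disk.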

Confirmation of the correctness of the fix

Before this commit, I ran the test 10k times & ran into the flake 4 times.
After this commit, I ran the test 10k times and did not run into the flake at all.

Fixes #8355


@guggero guggero left a comment


Great investigation! And nice that the fix is so small :)

Could we perhaps add a reproduction as a unit test? Given we now know how it happened in the first place?

@ellemouton

Could we perhaps add a reproduction as a unit test? Given we now know how it happened in the first place?

Good idea @guggero 👍

It was a bit trickier than usual, though, since this is a race issue. So what I've done is first add a few temporary variables to allow easier stop/go control from a test, and then add the reproduction test. After that, I add the commit that fixes the issue and, finally, I remove all the stop/go code as well as the test.
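
For context, the stop/go control pattern looks roughly like the following sketch (continuing the hypothetical types from the sketch above; the hook name and placement are illustrative and not the code actually added in the temporary commit):

```go
// testHooks carries optional channels that production code blocks on at
// specific points. The field name and placement are hypothetical; in
// normal operation the hooks are nil, so behaviour is unchanged.
type testHooks struct {
	// If non-nil, drainInputList waits here right before writing an
	// item to disk (i.e. between steps 7 and 9 in the race above).
	beforeDiskWrite chan struct{}
}

// waitIfSet blocks until the test signals the hook; it is a no-op when
// the hook is unset.
func waitIfSet(hook chan struct{}) {
	if hook != nil {
		<-hook
	}
}
```

The reproduction test can then signal beforeDiskWrite only after it has observed feedMemQueue flip the mode back to memory-mode, forcing the exact interleaving described above; once the fix is in, the same test passes and the hooks and test are removed again.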

@ellemouton ellemouton force-pushed the towerTestFlakeFix branch 2 times, most recently from f942de4 to 1f5a645 Compare January 15, 2024 15:24
In this commit, some temporary variables and logic are added to the
DiskOverflowQueue for easy stop/go control from unit tests. This is then
used to write a temporary unit test that demonstrates a race condition
that can cause the queue to be in memory mode when it should be in disk
mode. This new code & test will be removed after the issue has been
fixed.

Before this commit, in the watchtower client's DiskOverflowQueue, there
could be a situation, _if no consumer was reading from the queue_, that
could lead to an in-memory build-up of backup tasks instead of the
expected disk overflow. This commit fixes this by ensuring that if a
task is read from disk, the mode is set to disk mode so that any new
items are persisted to disk in this scenario.

The unit test added in the previous commit is updated here in order to
show that the fix does in fact make the test pass.

This commit removes the temporary members added to the DiskOverflowQueue
that made it possible to demonstrate the previous bug more easily.

@guggero guggero left a comment


Nice fix, LGTM 🎉

@bitromortac bitromortac self-requested a review January 16, 2024 13:20

@bitromortac bitromortac left a comment


Thank you for the nice write-up, demo, and fix! LGTM ⚡

@guggero guggero merged commit d7796b4 into lightningnetwork:master Jan 23, 2024
23 of 25 checks passed
@ellemouton ellemouton deleted the towerTestFlakeFix branch January 23, 2024 16:10
Successfully merging this pull request may close these issues.

watchtower: unit test flake in TestDiskOverflowQueue/overflow_to_disk