
wtclient: ensure correct disk mode for overflow queue #8377

Merged
merged 3 commits into lightningnetwork:master from ellemouton:towerTestFlakeFix
Jan 23, 2024

Conversation


@ellemouton ellemouton commented Jan 12, 2024

Before this commit, the watchtower client's DiskOverflowQueue could, if no consumer was reading from the queue, build up backup tasks in memory instead of overflowing them to disk as expected. This commit fixes this by ensuring that if a task is read from disk, the mode is set to disk mode so that any new items are persisted to disk in this scenario.

More detailed explanation

This order of events could trigger the flake in question:

  1. We start out with toDisk = false (i.e. we are in memory-mode).
  2. Enough items get pushed to the queue to fill up memQueue to its capacity.
  3. Two more items are added to the queue in quick succession.
  4. For the first item, drainInputList will try to push it to inputChan, but because memQueue is full, this will return success=false and the mode will be switched to disk-mode (toDisk=true).
  5. drainInputList will then instead write the item to disk and push a notification onto the newDiskItemSignal channel.
  6. feedMemQueue then responds to this signal and calls feedFromDisk, which will pop the first item from disk and block on pushing the item to the currently full memQueue.
    ⚠️ Here is where things get race-y:
  7. drainInputList starts handling the second item - since the mode is still disk-mode, it prepares to write the item to disk (but does not do so yet!).
  8. One item gets popped from memQueue, meaning that feedMemQueue manages to push the first item to memQueue and then goes on to the next iteration of the feedFromDisk loop. Here, it will try to read an item from disk but will see that there are no more items, and so will set the mode back to memory-mode.
  9. At this point, drainInputList actually writes the second item to disk and sends a new newDiskItemSignal.
  10. feedMemQueue will then respond to this signal by reading the new item from disk and blocking on adding it to the memQueue.
  11. Now, for any new items that drainInputList gets, it will try to push them to inputChan because the mode is memory-mode, BUT feedMemQueue won't be listening on that channel because it is blocked on pushing an item to memQueue.

We want to make sure that if feedMemQueue is ever blocking on pushing to memQueue, then the mode MUST be disk-mode.
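
For readers less familiar with the queue internals, here is a minimal, self-contained sketch of the two goroutines involved. The names and structure (`queue`, `task`, `handleNewItem`, a slice standing in for the on-disk queue) are hypothetical simplifications for illustration only and not the actual lnd implementation; they merely mirror the steps listed above.

```go
package overflowsketch

import "sync"

// task stands in for a backup task; queue stands in for the
// DiskOverflowQueue. All names here are simplified and hypothetical.
type task struct{ id int }

type queue struct {
	mu                sync.Mutex
	toDisk            bool          // false = memory-mode, true = disk-mode
	disk              []task        // stand-in for the on-disk queue
	memQueue          chan task     // bounded; a send blocks when it is full
	inputChan         chan task     // memory-mode hand-off to feedMemQueue
	newDiskItemSignal chan struct{} // "a new item was written to disk"
}

func (q *queue) setToDisk(v bool) { q.mu.Lock(); q.toDisk = v; q.mu.Unlock() }

func (q *queue) getToDisk() bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	return q.toDisk
}

// handleNewItem is the drainInputList side, handling one incoming item.
func (q *queue) handleNewItem(t task) {
	if !q.getToDisk() {
		select {
		case q.inputChan <- t:
			// Memory-mode hand-off succeeded.
			return
		default:
			// Hand-off failed (memQueue is full, so feedMemQueue is
			// not reading inputChan): switch to disk-mode (step 4).
			q.setToDisk(true)
		}
	}

	// Disk-mode: persist the item and signal feedMemQueue (steps 5, 7, 9).
	q.mu.Lock()
	q.disk = append(q.disk, t)
	q.mu.Unlock()
	q.newDiskItemSignal <- struct{}{}
}

// feedFromDisk is the feedMemQueue side *before* the fix: it drains the
// on-disk queue into memQueue. (The memory-mode path, where feedMemQueue
// forwards inputChan to memQueue, is omitted for brevity.)
func (q *queue) feedFromDisk() {
	for {
		q.mu.Lock()
		if len(q.disk) == 0 {
			// Step 8: disk looks empty, so the mode flips back to
			// memory-mode, even though drainInputList may already be
			// preparing to write another item to disk.
			q.toDisk = false
			q.mu.Unlock()
			return
		}
		t := q.disk[0]
		q.disk = q.disk[1:]
		q.mu.Unlock()

		// Steps 6 and 10: this send can block while the mode still says
		// memory-mode, violating the invariant above; new items are then
		// handed to inputChan, which nobody is reading (step 11).
		q.memQueue <- t
	}
}
```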

The Fix

The fix is straightforward: if feedMemQueue successfully pops an item from disk (and is therefore about to potentially block on memQueue), it first sets the mode to disk-mode.
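
Expressed against the same hypothetical sketch as above, the change amounts to one extra assignment before the potentially blocking send:

```go
// feedFromDisk with the fix applied (same simplified sketch as above, not
// the actual lnd code): set disk-mode as soon as an item has been popped
// from disk, before the send on memQueue that may block.
func (q *queue) feedFromDiskFixed() {
	for {
		q.mu.Lock()
		if len(q.disk) == 0 {
			q.toDisk = false
			q.mu.Unlock()
			return
		}
		t := q.disk[0]
		q.disk = q.disk[1:]

		// The fix: we are about to potentially block on memQueue, so the
		// mode must already be disk-mode so that any new items keep being
		// persisted to disk rather than handed to inputChan.
		q.toDisk = true
		q.mu.Unlock()

		q.memQueue <- t
	}
}
```

With that single assignment, the invariant above holds: whenever feedMemQueue can block on memQueue, the mode is already disk-mode, so drainInputList keeps writing new items to disk.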

Confirmation of the correctness of the fix

Before this commit, I ran the test 10k times & ran into the flake 4 times.
After this commit, I ran the test 10k times and did not run into the flake at all.

Fixes #8355


@guggero guggero left a comment


Great investigation! And nice that the fix is so small :)

Could we perhaps add a reproduction as a unit test? Given we now know how it happened in the first place?

@ellemouton

Could we perhaps add a reproduction as a unit test? Given we now know how it happened in the first place?

Good idea @guggero 👍

It was a bit trickier than usual, though, since this is a race issue. So what I've done is first add a few temporary variables to allow easier stop/go control from a test, and then add the reproduction test. After that, I add the commit that fixes the issue and, finally, I remove all the stop/go code as well as the test.
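
For context, the stop/go control pattern looks roughly like the following sketch (continuing the hypothetical types from the sketch above; the hook name and placement are illustrative and not the code actually added in the temporary commit):

```go
// testHooks carries optional channels that production code blocks on at
// specific points. The field name and placement are hypothetical; in
// normal operation the hooks are nil, so behaviour is unchanged.
type testHooks struct {
	// If non-nil, drainInputList waits here right before writing an
	// item to disk (i.e. between steps 7 and 9 in the race above).
	beforeDiskWrite chan struct{}
}

// waitIfSet blocks until the test signals the hook; it is a no-op when
// the hook is unset.
func waitIfSet(hook chan struct{}) {
	if hook != nil {
		<-hook
	}
}
```

The reproduction test can then signal beforeDiskWrite only after it has observed feedMemQueue flip the mode back to memory-mode, forcing the exact interleaving described above; once the fix is in, the same test passes and the hooks and test are removed again.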

@ellemouton ellemouton force-pushed the towerTestFlakeFix branch 2 times, most recently from f942de4 to 1f5a645 Compare January 15, 2024 15:24
In this commit, some temporary variables and logic are added to the
DiskOverflowQueue for easy stop/go control from unit tests. This is then
used to write a temporary unit test that demonstrates a race condition
that can cause the queue to be in memory mode when it should be in disk
mode. This new code & test will be removed after the issue has been
fixed.

Before this commit, in the watchtower client's DiskOverflowQueue, there
could be a situation, _if no consumer was reading from the queue_, that
could lead to an in-memory build-up of backup tasks instead of the
expected disk overflow. This commit fixes this by ensuring that if a
task is read from disk, the mode is set to disk mode so that any new
items are persisted to disk in this scenario.

The unit test added in the previous commit is updated here in order to
show that the fix does in fact make the test pass.

This commit removes the temporary members added to the DiskOverflowQueue
that made it possible to demonstrate the previous bug more easily.

@guggero guggero left a comment


Nice fix, LGTM 🎉

@bitromortac bitromortac self-requested a review January 16, 2024 13:20

@bitromortac bitromortac left a comment


Thank you for the nice write-up, demo, and fix! LGTM ⚡

@guggero guggero merged commit d7796b4 into lightningnetwork:master Jan 23, 2024
23 of 25 checks passed
@ellemouton ellemouton deleted the towerTestFlakeFix branch January 23, 2024 16:10
Successfully merging this pull request may close these issues.

watchtower: unit test flake in TestDiskOverflowQueue/overflow_to_disk