-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
wtclient: ensure correct disk mode for overflow queue #8377
wtclient: ensure correct disk mode for overflow queue #8377
Conversation
1c60dac
to
b4b23f3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great investigation! And nice that the fix is so small :)
Could we perhaps add a reproduction as a unit test? Given we now know how it happened in the first place?
b4b23f3
to
84e7de6
Compare
Good idea @guggero 👍 It was a bit more tricky than usual though since this is a race issue. So what i've done is first added a few temporary variables to allow more easy stop/go control from a test & then i've added the reproduction test. Then I add the commit that fixes the issue and finally, I remove all the stop/go code and also the test. |
f942de4
to
1f5a645
Compare
In this commit, some temporary variables and logic is added to the DiskOverflowQueue for easy stop/go control from unit tests. This is then used to write a temporary unit tests that demonstrates a race condition that can cause the queue to be in disk mode when it should be in memory mode. This new code & test will be removed after the issue has been fixed.
Before this commit, in the watchtower client's DiskOverflowQueue, there could be a situation _if no consumer was reading from the queue_ that could lead to an in-memory build up of backup tasks instead of the expected disk overflow. This commit fixes this by ensuring that if a task is read from disk, then the mode is set to disk mode to ensure that any new items are persisted to disk in this scenario. The unit tests added in the previous commit is updated here in order to show that the fix does in-fact fix the test.
This commit removes the temporary members added to the DiskOverflowQueue that made it possible to more easily demonstrate a previous bug that existed.
1f5a645
to
65de80b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice fix, LGTM 🎉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the nice write-up, demo, and fix! LGTM ⚡
Before this commit, in the watchtower client's DiskOverflowQueue, there could be a situation if no consumer was reading from the queue that could lead to an in-memory build up of backup tasks instead of the expected disk overflow. This commit fixes this by ensuring that if a task is read from disk, then the mode is set to disk mode to ensure that any new items are persisted to disk in this scenario.
More detailed explanation
This order of events could trigger the flake in question:
toDisk
= false (ie, we are in memory-mode)memQueue
to its capacity.drainInputList
will try push it toinputChan
but becausememQueue
is full, it will returnsuccess=false
and the mode will be switched to disk-mode (toDisk=true
).drainInputList
will then instead write the item to disk & push a notification on thenewDiskItemSignal
channel.feedMemQueue
then responds to this signal and callsfeedFromDisk
which will pop the first item from disk and will block on pushing the item to the currently fullmemQueue
.drainInputList
starts handling the second item - since the mode is still "disk-mode" it preps writing the item to disk (but doesnt yet!!)memQueue
meaning thatfeedMemQueue
manages to push the first item tomemQueue
and then goes onto the next iteration of thefeedFromDisk
loop - Here, it will try read an item from disk but will see that there are no more items items & so will set the mode to memory-mode.drainInputList
actually writes the second item to disk and sends a newnewDiskItemSignal
.feedMemQueue
will then respond to this signal by reading the new item from disk and blocking on adding it to thememQueue
.drainInputList
gets - it will try to push those toinputChan
because the mode is "memory-mode" BUTfeedMemQueue
wont be listening on that channel because it is blocking on pushing an item tomemQueue
.We want to make sure that if
feedMemQueue
is ever blocking on pushing tomemQueue
, then the mode MUST be disk-mode.The Fix
The fix is straight forward: basically, if
feedMemQueue
successfully pops an item from disk (and is therefore about to potentially block onmemQueue
then it first sets the mode todisk-mode
.Confirmation of the correctness of the fix
Before this commit, I ran the test 10k times & ran into the flake 4 times.
After this commit, I ran the test 10k times and did not run into the flake at all.
Fixes #8355