-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix for #2593 #3074
Fix for #2593 #3074
Conversation
LGTM. Seems reasonable to me that testing for this is Hard™. |
// itself is removed, the node is re-added to the group | ||
// (which will trigger another re-proposal that will succeed), or the | ||
// proposal is committed. | ||
if initialProposal { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't initialProposal
equal to the ok
in _, ok := g.pending[p.commandID]
and thus not required?
LGTM mod my question. |
LGTM mod @tschottdorf's question (I think he's right) |
The cause of the issue occurred as such: + Incoming client request to a node creates a trace context. + The context is attached to a raft command which is proposed. The command is added to the 'pending' map in multiraft before being proposed. The client request will be answered once the proposed command is committed and applied. + Concurrently, another raft command changes the configuration of the range's raft group, removing this node's replica. In existing code, all pending commands on that node which target that replica are synchronously dismissed with an error; the trace is therefore finalized. + However, while the replica has been removed from the group, the group itself has not yet been removed from the node; the proposed command can actually commit, it just commits after the configuration change. + When the committed change is applied, it attempts to use the trace, but the trace has already been finalized. The fix is to no longer abort pending commands on a replica just because that replica has been removed from the group; it is not yet safe to immediately abort pending requests, because they may actually complete. Instead, we do not abort commands until the group itself is removed (by the range GC queue).
Replica.Quiesce() was added in an attempt to fix cockroachdb#2593; however, it did not fix that issue, and was not necessary to fix it in the first place. This commit removes Replica.Quiesce().
bd39581
to
30953c1
Compare
Yes, you are correct that initialProposal was not necessary. I have updated this with Tobi's suggestion, everything appears to still be in working order. |
Finally got to the bottom of this, with special thanks to Tobias, Tamir and Ben.
The test which exercises this most consistently is TestStoreRangeRebalance; however, it was only able to quickly reproduce this in the multicpu branch, where it was reproduced with reliability.
A six node cluster was able to run stably under load for well over an hour with this fix; the bug appears to be completely gone.