Bugfix: replication messages dropped during host shutdown #6143
Conversation
```diff
-		return
-	case p.requestChan <- &request{
-	}
+	p.requestChan <- &request{
```
let's still leave the select {} block here with the p.done check, because we don't want this line to block in case of shutdown.
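For reference, a minimal, self-contained sketch of the pattern being requested: the send is wrapped in a select against p.done, so a shutdown never blocks on a full requestChan. The processor and request types here are illustrative stand-ins, not the actual Cadence types.

```go
package main

import "fmt"

type request struct{ shardID int }

type processor struct {
	done        chan struct{}
	requestChan chan *request
}

// sendRequest never blocks past shutdown: if p.done is closed, the select
// takes that branch instead of waiting for a reader on requestChan.
func (p *processor) sendRequest(req *request) bool {
	select {
	case <-p.done:
		return false // shutting down: drop the request rather than block
	case p.requestChan <- req:
		return true
	}
}

func main() {
	p := &processor{
		done:        make(chan struct{}),
		requestChan: make(chan *request), // unbuffered: a bare send would block forever
	}
	close(p.done) // simulate shutdown with no reader on requestChan
	fmt.Println("sent:", p.sendRequest(&request{shardID: 7})) // prints "sent: false"
}
```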
ah, fair, I didn't consider this; let me fix it
```diff
@@ -194,19 +194,18 @@ Loop:
 	respChan := make(chan *types.ReplicationMessages, 1)
 	// TODO: when we support prefetching, LastRetrievedMessageID can be different than LastProcessedMessageID

-	select {
-	case <-p.done:
+	if p.isShuttingDown() {
```
Avoiding populating requestChan when already shut down is an improvement, but it won't cover all the known problematic edge cases discussed offline.
task_fetcher.go is the component making the periodic remote calls for each shard, so not adding a request to requestChan is unlikely to help on its own: there might already be a pending request for the closing shard.
I'd recommend updating the task fetcher's fetchAndDistributeTasks to skip shards that are no longer owned by the current host. The Task Processor could notify the Task Fetcher on shutdown to deregister a previously sent request.
Otherwise this change only helps if the task processor for a given shard hasn't yet sent a request to requestChan before being closed.
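A rough sketch of what that suggestion could look like, assuming the fetcher keeps a registry of owned shards. Every name here (fetcher, register, deregister, and this version of fetchAndDistributeTasks) is hypothetical, not Cadence's actual implementation:

```go
package replication

import "sync"

type request struct{ shardID int }

// fetcher is a hypothetical stand-in for task_fetcher.go's state: it tracks
// which shards the current host still owns and the requests queued per fetch.
type fetcher struct {
	mu      sync.Mutex
	owned   map[int]bool
	pending []*request
}

func (f *fetcher) register(shardID int) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.owned[shardID] = true
}

// deregister is what a task processor would call on shutdown, so any request
// it already enqueued is dropped before the next remote call.
func (f *fetcher) deregister(shardID int) {
	f.mu.Lock()
	defer f.mu.Unlock()
	delete(f.owned, shardID)
}

// fetchAndDistributeTasks (sketched) drains the queue but skips any request
// whose shard is no longer owned, covering the race where the request was
// enqueued before the processor shut down.
func (f *fetcher) fetchAndDistributeTasks() []*request {
	f.mu.Lock()
	defer f.mu.Unlock()
	live := make([]*request, 0, len(f.pending))
	for _, r := range f.pending {
		if f.owned[r.shardID] {
			live = append(live, r)
		}
	}
	f.pending = nil
	return live
}
```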
yeah, that's a much better idea 🤦 I didn't consider the race there
Urgh, this is not going to be easy: we have no in-memory representation of shard ownership, only these shard-ownership-lost errors bubbling up from the DB.
There's no big in-memory map of shard IDs with atomics keeping track of what's present, and even if there were, it would still be subject to races.
At first glance, I think that to truly fix this we need to change this API to have a proper Ack/update rather than acknowledging on read (it's obviously more complex than that, because the naive implementation would explode the number of RPC calls).
If it's OK, let me land this change as-is and we can explore redesigning this replication in more detail.
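To make the Ack idea concrete, a purely speculative interface sketch; none of these names exist in Cadence today:

```go
package replication

// Message is a placeholder for a replication task payload.
type Message struct {
	ID      int64
	Payload []byte
}

// ReplicationSource sketches an explicit-ack protocol: Read no longer
// commits progress as a side effect, and Ack can fail with a
// shard-ownership-lost error instead of silently dropping unacked messages.
// The obvious cost is an extra RPC per poll, which is why a real design
// would need batching or piggybacking acks onto the next Read.
type ReplicationSource interface {
	// Read returns messages after lastMessageID without committing progress.
	Read(shardID int32, lastMessageID int64) ([]Message, error)
	// Ack explicitly commits progress up to upToMessageID for the shard.
	Ack(shardID int32, upToMessageID int64) error
}
```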
It's fine to merge this PR once the other comments are addressed, i.e. the shutdown blocking issue.
Let's discuss the true fix as a follow-up. Please make sure you capture the context from these comments in the follow-up ticket so we don't forget :)
What changed?
Internal details: CDNC-9597
A user reported some problems during a failover in which a workflow, during a continue-as-new event got dropped during replication silently, without any corresponding DLQ message. We were able to track down the (expected) cause to likely have been a shard movement during that time which triggers several unpleasant edge-conditions with interactions with the following:
How did you test it?
Tested locally and with unit tests. I was able to reproduce the sequence of events mostly with unit tests.