Enabling service-discovery driven shutdown of matching engine #6198
Conversation
}

func (e *matchingEngineImpl) Stop() {
	close(e.shutdown)
We're not waiting for subscribeToMembershipChanges to complete, possibly a goroutine leak?
This is true: I was too lazy and didn't think to add a full waitgroup setup to the engine. What do you think? It does mean that I didn't toggle it on for the leak detector.
+1 for not leaving goroutines behind. Let's wait here until subscribeToMembershipChanges returns (via a waitgroup).
Ok, I don't love adding a bunch of complexity back to the shutdown process, but fair. Added a waitgroup.
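For reference, a minimal sketch of what the waitgroup-based shutdown could look like. Apart from Stop, close(e.shutdown), and subscribeToMembershipChanges, which appear in the diff above, the field names and wiring here are assumptions rather than the actual change:

```go
package handler

import "sync"

// Minimal sketch only: the waitgroup field and Start wiring are illustrative
// assumptions, not the PR's exact implementation.
type matchingEngineImpl struct {
	shutdown           chan struct{}
	shutdownCompletion sync.WaitGroup
}

func (e *matchingEngineImpl) Start() {
	e.shutdownCompletion.Add(1)
	go func() {
		defer e.shutdownCompletion.Done()
		e.subscribeToMembershipChanges()
	}()
}

func (e *matchingEngineImpl) Stop() {
	close(e.shutdown)
	// Block until the subscription goroutine has observed the shutdown
	// signal and returned, so Stop leaves no goroutine behind.
	e.shutdownCompletion.Wait()
}

func (e *matchingEngineImpl) subscribeToMembershipChanges() {
	for {
		select {
		// a real implementation would also select on membership change events
		case <-e.shutdown:
			return
		}
	}
}
```

The useful property is that Stop only returns once the subscription goroutine has exited, which is what would allow the goroutine leak detector to be enabled for the engine.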
Looks good, I like this approach a lot! Some minor comments
Co-authored-by: Jakob Haahr Taankvist <[email protected]>
@@ -84,6 +85,7 @@ type MultiringResolver struct {
	status int32

	provider PeerProvider
	mu       sync.Mutex
	rings    map[string]*ring
since this is the first instance of multiple accessors to the rings map, it needs guards
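As an illustration of the kind of guard being asked for, a sketch assuming the mutex from the hunk above protects every access to rings; the getRing accessor name is hypothetical, not part of this PR:

```go
package membership

import "sync"

// ring is elided; only the fields mirrored from the diff hunk matter here.
type ring struct{}

type MultiringResolver struct {
	mu    sync.Mutex
	rings map[string]*ring
}

// getRing takes the lock on every access, since the map is now touched both
// by request-driven lookups and by the background ownership-guard path.
func (rpo *MultiringResolver) getRing(service string) (*ring, bool) {
	rpo.mu.Lock()
	defer rpo.mu.Unlock()
	r, ok := rpo.rings[service]
	return r, ok
}
```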
service/matching/handler/engine.go
Outdated
// can take a while, so do them in parallel so as to not hold the
// lock too long.
Which lock is being held here? If this comment belongs to a previous iteration then let's just stop the tasklists synchronously. Better not to return from shutDownNonOwnedTasklists while some background goroutines are still stopping task lists; shutDownNonOwnedTasklists may be called again via the ticker and might try to do the same/overlapping work and cause unexpected problems.
It does refer to an earlier iteration where it was locking, so let me fix that. My concern with doing it serially is that the net time to shut down will be quite slow, since the tlmgr.Close() method does a lot of IO. What about throwing in a waitgroup and doing it in parallel?
Doing it in parallel should be fine, I think; just avoid leaving them behind before returning.
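A rough sketch of the parallel-but-bounded stop being discussed; the Manager interface and helper name below are assumptions, with only the IO-heavy Close coming from the thread above:

```go
package handler

import "sync"

// Manager stands in for the task list manager; only the fact that closing it
// is IO-heavy is taken from the discussion, the interface itself is assumed.
type Manager interface {
	Close()
}

// stopTasklistsInParallel closes the no-longer-owned managers concurrently,
// but blocks until all of them have finished so the caller never returns
// while stop work is still in flight.
func stopTasklistsInParallel(noLongerOwned []Manager) {
	var wg sync.WaitGroup
	for _, tlMgr := range noLongerOwned {
		wg.Add(1)
		go func(m Manager) {
			defer wg.Done()
			m.Close()
		}(tlMgr)
	}
	wg.Wait()
}
```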
service/matching/handler/engine.go
Outdated
if !e.config.EnableTasklistOwnershipGuard() {
	return nil
}
noLongerOwned, err := e.getNonOwnedTasklistsLocked()
Preemptively stopping tasklists based on the service-discovery result might cause a "freeze" on a tasklist whose new owner hasn't claimed it yet. Ideally the old host keeps performing operations on it until the new owner claims the lease. Is that something we can easily check from the DB? (I guess not.)
I would expect the duration of that wait to be the startup time of a shard plus the latency of service-discovery propagation (internally for us, 2 seconds). I do think it's a slight risk, but that's not super far off the duration of an LWT, so I'm not sure doing a CAS operation to poll or something would be worthwhile.
I think this change should perform better than what we have right now, but the ideal implementation should also check that the new owner is active. We can try that as a follow-up. Do you have plans to make the simulation framework support ring-change scenarios?
I don't have plans to do so personally, since I need to move on to a different project, but I do think that's a good idea.
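For context on the ownership check being discussed, a hedged sketch of how non-owned tasklists might be identified from service discovery; the resolver interface, struct, and method names below are illustrative stand-ins, not the PR's actual getNonOwnedTasklistsLocked implementation:

```go
package handler

// HostInfo and Resolver are illustrative stand-ins for a membership API.
type HostInfo interface {
	Identity() string
}

type Resolver interface {
	Lookup(service string, key string) (HostInfo, error)
}

type taskListManager interface {
	Close()
}

type ownershipGuardSketch struct {
	membership Resolver
	hostID     string
	taskLists  map[string]taskListManager // keyed by task list name
}

// nonOwnedTasklists returns the managers whose task list now hashes to a
// different matching host according to service discovery; stopping these is
// what lets the new owner take over without fighting over the lease counter.
func (e *ownershipGuardSketch) nonOwnedTasklists() ([]taskListManager, error) {
	var noLongerOwned []taskListManager
	for name, tlMgr := range e.taskLists {
		owner, err := e.membership.Lookup("cadence-matching", name)
		if err != nil {
			return nil, err
		}
		if owner.Identity() != e.hostID {
			noLongerOwned = append(noLongerOwned, tlMgr)
		}
	}
	return noLongerOwned, nil
}
```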
What changed?
Matching hosts are presently at considerable risk of delaying processing or contending with other hosts during shard ownership changes, particularly if the service-discovery changes occur before the container shutdown is kicked off.
This change attempts to be more proactive in cleaning up and guarding against shard changes, so that the Matching host relinquishes ownership and processing of any tasklist manager it may have lost sooner, rather than fighting with the new owner by incrementing the shard counter and delaying processing.
This can happen in a few ways. One example: a host shutdown takes a while, and the new host reacts to service-discovery and takes ownership of the original shards, only to hit a taskreader on the old host still attempting to take the lock. Another is a scale-up event where shards are stolen from an existing host.
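As a rough picture of the mechanism described above, assuming a Subscribe-style membership API that delivers change events on a channel; every name here other than subscribeToMembershipChanges and shutDownNonOwnedTasklists from the review is illustrative:

```go
package handler

// ChangedEvent and subscriber are illustrative stand-ins for the
// service-discovery API assumed by this sketch.
type ChangedEvent struct{}

type subscriber interface {
	Subscribe(service string, name string, notifyCh chan<- *ChangedEvent) error
}

type membershipWatcherSketch struct {
	membership subscriber
	shutdown   chan struct{}
}

// subscribeToMembershipChanges reacts to ring changes by proactively shutting
// down task list managers this host no longer owns, instead of waiting for
// the new owner to steal the lease and contend on the shard counter.
func (e *membershipWatcherSketch) subscribeToMembershipChanges() {
	events := make(chan *ChangedEvent, 1)
	if err := e.membership.Subscribe("cadence-matching", "matching-engine", events); err != nil {
		return
	}
	for {
		select {
		case <-events:
			e.shutDownNonOwnedTasklists()
		case <-e.shutdown:
			return
		}
	}
}

func (e *membershipWatcherSketch) shutDownNonOwnedTasklists() {
	// would identify non-owned task lists and stop them in parallel before
	// returning, as in the earlier sketches
}
```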
Why?
How did you test it?
This has been tested by deploying it to a couple of pre-prod environments over a few iterations, but at this point it needs a bit more manual testing.
Testing currently:
Feature enablement
This feature requires the bool flag matching.enableTasklistGuardAgainstOwnershipLoss.