
Enabling service-discovery driven shutdown of matching engine #6198

Conversation


@davidporter-id-au davidporter-id-au commented Jul 30, 2024

What changed?

Matching hosts are currently at considerable risk of delaying processing or contending with one another during shard ownership changes, particularly when the service-discovery change lands before the container shutdown is kicked off.

This change makes the Matching host more proactive about cleaning up and guarding against shard changes, so that it relinquishes ownership and processing of any tasklist managers it has lost earlier, rather than fighting with the new owner by incrementing the shard counter and delaying processing.

This contention can arise in a few ways. One example: a host shutdown takes a while, and the new host reacts to service-discovery and takes ownership of the original shards, only to hit a task reader on the old host that is still trying to take the lock. Another is a scale-up event, where shards are stolen from an existing host.
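
For illustration, here is a minimal sketch of the general shape of this approach, assuming a background subscription loop. The field and method names (shutdownWG, membershipChangedCh, shutDownNonOwnedTasklists) are placeholders inferred from the discussion below, not the PR's exact code; the logger/tag usage follows the usual Cadence idiom.

```go
// Sketch only: react to service-discovery (membership) changes and proactively
// release tasklists this host no longer owns, rather than waiting for shutdown.
func (e *matchingEngineImpl) subscribeToMembershipChanges() {
	defer e.shutdownWG.Done()

	for {
		select {
		case <-e.shutdown:
			return
		case <-e.membershipChangedCh:
			// The ring changed: drop tasklist managers whose new owner is another
			// host, instead of fighting that owner over the shard counter.
			if err := e.shutDownNonOwnedTasklists(); err != nil {
				e.logger.Error("failed to shut down non-owned tasklists", tag.Error(err))
			}
		}
	}
}
```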

Why?

How did you test it?

This has been tested by deploying it to a couple of pre-prod environments over a few iterations, but at this point it still needs a bit more manual testing.

Testing currently:

  • Preliminary testing in development environments
  • Unit testing
  • Manual testing in staging

Feature enablement

This feature requires the boolean flag matching.enableTasklistGuardAgainstOwnershipLoss to be enabled.

@davidporter-id-au changed the title from "WIP commit" to "Enabling service-discovery driven shutdown of matching engine" on Jul 30, 2024
}

func (e *matchingEngineImpl) Stop() {
	close(e.shutdown)
Member

we're not waiting for subscribeToMembershipChanges to complete, possibly goroutine leak?

Member Author

This is true: I was too lazy and didn't think to add a full WaitGroup setup to the engine. What do you think? It does mean I didn't toggle it on for the leak-detector.

Member

+1 for not leaving goroutines behind. Let's wait here until subscribeToMembershipChanges returns (via a waitgroup).

Member Author

OK, I don't love adding a bunch of complexity back to the shutdown process, but fair. Added a waitgroup.
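
For context, a minimal sketch of the agreed shape (shutdownWG is an assumed field name): Start registers the subscriber on a sync.WaitGroup, and Stop waits for it after closing the shutdown channel, so the goroutine cannot outlive the engine.

```go
// Sketch only: tie the background subscriber's lifetime to the engine.
func (e *matchingEngineImpl) Start() {
	e.shutdownWG.Add(1)
	go e.subscribeToMembershipChanges() // calls e.shutdownWG.Done() when it returns
}

func (e *matchingEngineImpl) Stop() {
	close(e.shutdown)
	e.shutdownWG.Wait() // block until subscribeToMembershipChanges has returned
}
```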

@jakobht jakobht left a comment

Looks good, I like this approach a lot! Some minor comments

@@ -84,6 +85,7 @@ type MultiringResolver struct {
	status int32

	provider PeerProvider
	mu       sync.Mutex
	rings    map[string]*ring
Member Author

Since this is the first instance of multiple concurrent accessors to the rings map, it needs a guard.
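
A minimal sketch of the guarded access pattern this implies; the method name is illustrative rather than the resolver's actual API.

```go
// Sketch only: every read or write of rpo.rings now takes rpo.mu first.
func (rpo *MultiringResolver) getRing(service string) (*ring, bool) {
	rpo.mu.Lock()
	defer rpo.mu.Unlock()
	r, ok := rpo.rings[service]
	return r, ok
}
```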

codecov bot commented Aug 13, 2024

Codecov Report

Attention: Patch coverage is 93.10345% with 8 lines in your changes missing coverage. Please review.

Project coverage is 72.92%. Comparing base (de281a6) to head (25e649b).
Report is 6 commits behind head on master.

| Files | Patch % | Lines |
| --- | --- | --- |
| service/matching/handler/membership.go | 87.09% | 4 Missing and 4 partials ⚠️ |
Additional details and impacted files
| Files | Coverage Δ |
| --- | --- |
| common/membership/resolver.go | 80.24% <100.00%> (+1.86%) ⬆️ |
| service/matching/config/config.go | 100.00% <100.00%> (ø) |
| service/matching/handler/engine.go | 79.94% <100.00%> (+0.93%) ⬆️ |
| service/matching/handler/membership.go | 87.09% <87.09%> (ø) |

... and 7 files with indirect coverage changes



Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Comment on lines 1211 to 1212
// can take a while, so do them in parallel so as to not hold the
// lock too long.
Member

Which lock is being held here? If this comment belongs to a previous iteration, let's just stop the tasklists synchronously. It's better not to return from shutDownNonOwnedTasklists while background goroutines are still stopping tasklists: shutDownNonOwnedTasklists may be called again via the ticker, try to do the same or overlapping work, and cause unexpected problems.

Member Author

It does refer to an earlier iteration where it was locking, so let me fix.

My concern with doing it serially is that the net shutdown time will be quite slow, since the tlmgr.Close() method does a lot of IO. What about throwing in a waitgroup and doing it in parallel?

Member

Doing it in parallel should be fine, I think; just avoid leaving the goroutines behind before returning.
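
A minimal sketch of that compromise (tasklist.Manager, its Stop method, and the helper function here are illustrative assumptions; sync.WaitGroup is from the standard library): stop the managers in parallel, but wait for all of them before returning so a later ticker invocation cannot overlap this work.

```go
// Sketch only: stop the no-longer-owned tasklist managers in parallel,
// but wait for all of them before returning so nothing is left running.
func stopTasklistManagers(managers []tasklist.Manager) {
	var wg sync.WaitGroup
	for _, m := range managers {
		wg.Add(1)
		go func(m tasklist.Manager) {
			defer wg.Done()
			m.Stop() // does a lot of IO, hence the parallelism
		}(m)
	}
	wg.Wait() // don't return while goroutines are still stopping tasklists
}
```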

if !e.config.EnableTasklistOwnershipGuard() {
	return nil
}
noLongerOwned, err := e.getNonOwnedTasklistsLocked()
Member

Preemptively stopping tasklists based on the service-discovery result might cause a "freeze" on a tasklist whose new owner hasn't claimed it yet. Ideally the old host keeps operating on it until the new owner claims the lease. Is that something we can easily check from the DB? (I guess not.)

Member Author

I would expect the duration of that wait to be the startup time of a shard plus the latency of service-discovery propagation (internally for us, about 2 seconds). I do think it's a slight risk, but that's not far off the duration of an LWT, so I'm not sure polling with a CAS operation or similar would be worthwhile.

Member

I think this change should perform better than what we have right now, but the ideal implementation should also check that the new owner is active. We can try that as a follow-up. Do you have plans to make the simulation framework support ring-change scenarios?

Member Author

I don't have plans to do so personally, as I need to move on to a different project, but I do think that's a good idea.
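
For readers following along, a hypothetical sketch of what the ownership check behind getNonOwnedTasklistsLocked might look like; the resolver's Lookup call, the host-identity comparison, and the field names here are assumptions for illustration, not the PR's actual code.

```go
// Sketch only: compare each loaded tasklist manager against the current
// service-discovery owner and collect the ones this host no longer owns.
func (e *matchingEngineImpl) getNonOwnedTasklistsLocked() ([]tasklist.Manager, error) {
	var noLongerOwned []tasklist.Manager
	for id, mgr := range e.taskLists {
		owner, err := e.membershipResolver.Lookup(service.Matching, id.GetName())
		if err != nil {
			return nil, err
		}
		if owner.Identity() != e.selfHostIdentity {
			noLongerOwned = append(noLongerOwned, mgr)
		}
	}
	return noLongerOwned, nil
}
```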

@davidporter-id-au davidporter-id-au merged commit b1c923e into cadence-workflow:master Aug 15, 2024
19 checks passed
@davidporter-id-au davidporter-id-au deleted the bugfix/enabling-shutdown-through-service-discovery branch August 15, 2024 00:21