Hotfix for async matching for isolation-group redirection #5423

@davidporter-id-au (Member) commented Oct 16, 2023

What changed?

We observed a problem in production where a heavily backlogged backfill event got stuck and tasks were not dispatched. Rebooting matching helped mitigate the issue, but turning off zonal isolation for this one service helped more.

Context

When a task dispatched through async dispatch times out and there don't appear to be any pollers in the new zone, the feature attempts to redirect the traffic to an isolation group that does have pollers, aka 'task redispatch'. This is intended to keep traffic from being blackholed, and also to keep it from blocking progress for other isolation groups in the tasklist, since the tasklist is a FIFO queue.

Suspected problem:

The suspicion is that task processing is failing because it reaches a deadlock around task redispatch.

This pull request attempts to fix the suspected problem of isolation-group task redirection causing a deadlock between the task pump and task dispatch. The basic idea is that there are a few scenarios where it's conceivable that redispatching might block on putting tasks into that channel, so it needs to be made non-blocking. (All credit to Zijian/@Shaddoll for thinking of this, and to Steven for suggesting that invalid channels should also be checked.)
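
To illustrate the suspected failure mode, here is a minimal sketch. It is not the Cadence code: `taskBuffer`, `sendBlocks`, and `probeTimeout` are hypothetical names chosen for this example. It shows why a goroutine that is the only reader of a buffered channel cannot safely do a plain send back into that channel once the buffer is full.

```go
package main

import (
	"fmt"
	"time"
)

// probeTimeout is an arbitrary illustration value, not a Cadence setting.
const probeTimeout = 50 * time.Millisecond

// sendBlocks tries to send v on ch and reports true if the send was still
// blocked after probeTimeout. It stands in for a dispatcher that is the
// only reader of ch and tries to push a task back onto it.
func sendBlocks(ch chan int, v int) bool {
	select {
	case ch <- v:
		return false
	case <-time.After(probeTimeout):
		return true
	}
}

func main() {
	taskBuffer := make(chan int, 1)        // hypothetical task buffer
	fmt.Println(sendBlocks(taskBuffer, 1)) // false: there is room
	fmt.Println(sendBlocks(taskBuffer, 2)) // true: buffer full, nobody reading
}
```

With a plain `ch <- v` in place of the timed probe, the second send would never return, which is the kind of stall described above.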

Changes

  • Refactors async dispatch code into two sub-functions to make it easier to read
  • Adds some tests around blocking behaviour
  • Adds a default case to the select to prevent this from blocking (the actual bugfix)
  • Adds a check on redirection to ensure nothing is pushed to an invalid channel, since that will also block, I'm told (TIL)
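
The last two bullets can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `Task`, `tryRedirect`, the error values, and the map of per-group channels are all hypothetical, and "invalid channel" is assumed here to mean a missing or nil channel.

```go
package main

import (
	"errors"
	"fmt"
)

// Task and every other name below are hypothetical, for illustration only.
type Task struct{ ID int }

var (
	errNoSuchGroup = errors.New("no channel for isolation group")
	errBufferFull  = errors.New("isolation group buffer is full")
)

// tryRedirect attempts to hand a task to another isolation group's
// channel without ever blocking the caller.
func tryRedirect(buffers map[string]chan Task, group string, t Task) error {
	ch, ok := buffers[group]
	if !ok || ch == nil {
		// A plain send on a nil channel blocks forever, so reject it up front.
		return errNoSuchGroup
	}
	select {
	case ch <- t:
		return nil
	default:
		// Buffer is full: give up instead of blocking.
		return errBufferFull
	}
}

func main() {
	buffers := map[string]chan Task{"zone-a": make(chan Task, 1)}
	fmt.Println(tryRedirect(buffers, "zone-a", Task{ID: 1})) // <nil>
	fmt.Println(tryRedirect(buffers, "zone-a", Task{ID: 2})) // isolation group buffer is full
	fmt.Println(tryRedirect(buffers, "zone-b", Task{ID: 3})) // no channel for isolation group
}
```

Returning distinct errors is a choice made for this sketch so the caller can log or count the two cases separately.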

How did you test it?

tag.TaskID(taskInfo.TaskID),
tag.WorkflowDomainID(taskInfo.DomainID),
)
default:
Member Author

This, here, is basically the single-line hotfix. The default should prevent any attempt to put stuff on the channel from deadlocking if it's full.

Member

Makes sense.
For this case and line 423 we might want to consider a similar backoff policy in the future.
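
The comment above does not spell out what the backoff policy would look like. One possible reading, sketched under that assumption, is to retry the non-blocking send a bounded number of times with a growing wait in between. `trySendWithBackoff` and `demoInitialWait` are hypothetical names, and the attempt count and delays are arbitrary.

```go
package main

import (
	"fmt"
	"time"
)

// demoInitialWait is an arbitrary illustration value.
const demoInitialWait = time.Millisecond

// trySendWithBackoff retries a non-blocking send up to attempts times,
// doubling the wait between tries. It never blocks indefinitely.
func trySendWithBackoff(ch chan<- int, v int, attempts int, initial time.Duration) bool {
	wait := initial
	for i := 0; i < attempts; i++ {
		select {
		case ch <- v:
			return true
		default:
			// Send not possible right now: fall through and back off.
		}
		if i < attempts-1 {
			time.Sleep(wait)
			wait *= 2
		}
	}
	return false
}

func main() {
	ch := make(chan int, 1)
	fmt.Println(trySendWithBackoff(ch, 1, 3, demoInitialWait)) // true
	fmt.Println(trySendWithBackoff(ch, 2, 3, demoInitialWait)) // false: nobody drains ch
}
```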

@@ -150,59 +155,10 @@ dispatchLoop:
if !ok { // Task list getTasks pump is shutdown
break dispatchLoop
}
for {
Member Author

pulled this into a function because it was annoying to test

@davidporter-id-au changed the title from "Seeing if this makes sense" to "Hotfix for async matching for isolation-group redirection" Oct 16, 2023
Comment on lines 394 to 395
breakDispatchLoop = true
return
Member

instead of these 2 lines, return false seems more readable to me. it's already obvious what the returned boolean means since it's named

Member Author

yeah, good feedback, I was just doing early iteration. I've updated it to subdivide it further for easier testing.

}
if group == isolationGroup {
continue
} else {
Member

we can omit the else and reduce the indentation of the following block to improve readability

Member Author

I would normally always go for this, but I'm also trying to limit the number of changes I'm making. Maybe here it's worth the additional improvement, though. Let me update.
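
For reference, the shape being suggested looks roughly like this. It is a generic sketch, not the loop from this PR: `otherGroups` and the zone names are hypothetical. Because the `if` branch ends in `continue`, the main path needs no `else` and sits one indentation level shallower.

```go
package main

import "fmt"

// otherGroups returns every group except current, using continue as a
// guard so the main path needs no else branch.
func otherGroups(groups []string, current string) []string {
	var out []string
	for _, group := range groups {
		if group == current {
			continue
		}
		out = append(out, group)
	}
	return out
}

func main() {
	fmt.Println(otherGroups([]string{"zone-a", "zone-b", "zone-c"}, "zone-b")) // [zone-a zone-c]
}
```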

@davidporter-id-au marked this pull request as ready for review October 17, 2023 00:00
tr.logger.Info("Tasklist manager context is cancelled, shutting down")
return true, true
}
if err == context.DeadlineExceeded {
Member @taylanisikdemir, Oct 17, 2023

we can also early terminate here

if err != context.DeadlineExceeded {
   // lines 491-494 go here
   return false, false
}

// handle deadline exceeded

Member Author

uh... not sure I understood sorry?

Member @taylanisikdemir, Oct 17, 2023

This block has an unnecessary level of indentation, which can be avoided by handling DeadlineExceeded first. Updated my example above to make the early termination obvious.

Member Author

Yeah, this is messy. Let me see if I can make it better.

Member Author

I realised we were handling unknown errors in a way which I think was wrong (they were just assumed to be rate-limit errors, but that seems... naive). So I ended up creating a default catchall for unknown errors, as I feel this is more readable.

service/matching/taskReader.go (resolved)
common/metrics/defs.go (outdated, resolved)
Member @Groxx left a comment

yea, looks good to me. a minor Q and nits but afaict it's correct and also an improvement. merge whenever.

service/matching/matcher.go (outdated, resolved)
Comment on lines 404 to 408
if err == context.DeadlineExceeded {

if errors.Is(err, context.DeadlineExceeded) {
Member

👍 seems very likely safe and correct. code here doesn't seem to wrap anything ever, but this might allow us to.

Member @Groxx left a comment

re-stamping: looks even gooder to me than before, just the comment to fix.
(thanks for adding the comment! I wasn't sure what that one meant either, now we know)

@davidporter-id-au merged commit d15c2ec into cadence-workflow:master Oct 18, 2023