Hotfix for async matching for isolation-group redirection #5423
service/matching/taskReader.go
Outdated
tag.TaskID(taskInfo.TaskID),
tag.WorkflowDomainID(taskInfo.DomainID),
)
default:
This, here, is the single line hotfix basically. The default should prevent any attempts to put stuff on the channel from deadlocking if it's full.
Makes sense.
For this case and line 423 we might want to consider a similar backoff policy in the future.
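For readers unfamiliar with the pattern discussed in this thread, here is a minimal sketch of a non-blocking channel send using `select` with a `default` case. The name `trySend` and the `int` payload are illustrative assumptions, not code from this PR.

```go
package main

import "fmt"

// trySend attempts to put v on ch without blocking.
// It returns false when the send cannot proceed immediately,
// for example because the buffer is full.
// (Hypothetical helper for illustration; not part of taskReader.go.)
func trySend(ch chan<- int, v int) bool {
	select {
	case ch <- v:
		return true
	default:
		// The channel is not ready: give up instead of blocking the caller.
		return false
	}
}

func main() {
	ch := make(chan int, 1)
	fmt.Println(trySend(ch, 1)) // true: the buffer had room
	fmt.Println(trySend(ch, 2)) // false: the buffer is full, and we did not block
}
```

The caller still has to decide what to do with a task that could not be sent (retry, back off, or drop), which is where a backoff policy like the one suggested above would fit.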
@@ -150,59 +155,10 @@ dispatchLoop:
if !ok { // Task list getTasks pump is shutdown
break dispatchLoop
}
for {
pulled this into a function because it was annoying to test
service/matching/taskReader.go
Outdated
breakDispatchLoop = true
return
instead of these 2 lines, return false
seems more readable to me. it's already obvious what the returned boolean means since it's named
yeah, good feedback, I was just doing early iteration. I've updated it to subdivide it further for easier testing.
service/matching/taskReader.go
Outdated
}
if group == isolationGroup {
continue
} else {
we can omit the else
and reduce the indentation of the following block to improve readability
I would normally always go for this, but also am trying to limit the amount of changes I'm making, but maybe in here it's worth the additional improvement. Let me update.
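As a small illustration of the suggestion in this thread, the sketch below drops the `else` after `continue` so the main path stays at one indentation level. The function `otherGroups` and its inputs are hypothetical and only stand in for the loop being reviewed.

```go
package main

import "fmt"

// otherGroups returns every group except skip.
// (Hypothetical helper; the real loop in taskReader.go does more work.)
func otherGroups(groups []string, skip string) []string {
	var out []string
	for _, group := range groups {
		if group == skip {
			continue
		}
		// No else needed: continue already left this iteration,
		// so everything below runs only for the other groups.
		out = append(out, group)
	}
	return out
}

func main() {
	fmt.Println(otherGroups([]string{"zone-a", "zone-b", "zone-c"}, "zone-b"))
}
```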
service/matching/taskReader.go
Outdated
tr.logger.Info("Tasklist manager context is cancelled, shutting down")
return true, true
}
if err == context.DeadlineExceeded {
we can also early terminate here
if err != context.DeadlineExceeded {
    // line 491-494 goes here
    return false, false
}
// handle deadline exceeded
uh... not sure I understood sorry?
This block has unnecessary level of indentation which can be avoided by handling DeadlineExceeded first. Updated my example above to make the early termination obvious.
Yeah, this is messy. Let me see if I can make it better.
I realised we were handling unknown errors in a way which I think was wrong (they were just assumed to be rate-limit errors, but that seems... naive). So I ended up creating a default catch-all for unknown errors, as I feel this is more readable.
yea, looks good to me. a minor Q and nits but afaict it's correct and also an improvement. merge whenever.
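To make the shape of this discussion concrete, here is a rough sketch of classifying dispatch errors with one case per known error and a default catch-all for anything unexpected. The `action` type, the `classify` function, and the sentinels `errRateLimited` and `errUnexpected` are all assumptions made for illustration; the actual return values and error types in `taskReader.go` may differ.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Hypothetical sentinels for this sketch; not the errors used in the PR.
var (
	errRateLimited = errors.New("rate limited")
	errUnexpected  = errors.New("something unexpected")
)

// action is an illustrative stand-in for whatever the caller does next.
type action string

const (
	actionDone     action = "done"
	actionStop     action = "stop"
	actionRedirect action = "redirect"
	actionBackoff  action = "backoff"
	actionUnknown  action = "unknown"
)

// classify maps a dispatch error to the next step. Each known error is
// handled explicitly; anything else falls through to the catch-all
// instead of being assumed to be a rate-limit error.
func classify(err error) action {
	switch {
	case err == nil:
		return actionDone
	case errors.Is(err, context.Canceled):
		return actionStop
	case errors.Is(err, context.DeadlineExceeded):
		return actionRedirect
	case errors.Is(err, errRateLimited):
		return actionBackoff
	default:
		return actionUnknown
	}
}

func main() {
	fmt.Println(classify(context.DeadlineExceeded)) // redirect
	fmt.Println(classify(errUnexpected))            // unknown
}
```

Keeping every branch as a flat `case` avoids the extra indentation level raised earlier in the thread and makes the unknown-error path explicit.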
…r-id-au/cadence into bugfix/tasklist-locking-issue
service/matching/taskReader.go
Outdated
if err == context.DeadlineExceeded {
if errors.Is(err, context.DeadlineExceeded) {
👍 seems very likely safe and correct. code here doesn't seem to wrap anything ever, but this might allow us to.
re-stamping: looks even gooder to me than before, just the comment to fix.
(thanks for adding the comment! I wasn't sure what that one meant either, now we know)
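A quick sketch of why `errors.Is` is the safer comparison if the error ever gets wrapped: `==` only matches the exact sentinel value, while `errors.Is` walks the wrap chain. The helper names below are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// wrapTimeout returns context.DeadlineExceeded wrapped with extra context.
// (Hypothetical helper; per the review, the code under discussion does not
// currently wrap this error.)
func wrapTimeout() error {
	return fmt.Errorf("dispatch task: %w", context.DeadlineExceeded)
}

// isTimeoutEq uses direct comparison, which misses wrapped errors.
func isTimeoutEq(err error) bool { return err == context.DeadlineExceeded }

// isTimeoutIs uses errors.Is, which also matches wrapped errors.
func isTimeoutIs(err error) bool { return errors.Is(err, context.DeadlineExceeded) }

func main() {
	err := wrapTimeout()
	fmt.Println(isTimeoutEq(err)) // false
	fmt.Println(isTimeoutIs(err)) // true
}
```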
What changed?
We observed some problems in production where a heavily backlogged backfill event got stuck and tasks were not dispatched. The issue was mitigated by rebooting matching, but more so by turning off zonal isolation for this one service.
Context
When a task is being dispatched through async dispatch, but the dispatch times out and there don't appear to be any pollers in the new zone, the feature attempts to redirect the traffic to an isolation group where there are pollers (aka 'task redispatch'). This is intended to prevent traffic from being blackholed, and also to prevent it from blocking progress for other isolation groups in the tasklist, since it's a FIFO queue.
Suspected problem:
The suspicion is that task processing is failing because it reaches a deadlock around task redispatch.
This pull request attempts to fix the suspected problem of isolation-group task redirection causing a deadlock between the task pump and task dispatch. The basic idea is that there are a few scenarios where redispatching could conceivably block on putting tasks into that channel, so it needs to be made nonblocking (all credit to Zijian/@Shaddoll for thinking of this, and to Steven for suggesting that invalid channels should also be checked).
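As a rough sketch of the idea described above, and not the actual change, the snippet below checks that a target group has a valid channel before attempting a non-blocking send. The `task` type, the `redirect` function, and the `buffers` map keyed by isolation group are assumptions made for illustration.

```go
package main

import "fmt"

// task is a hypothetical stand-in for the real task info type.
type task struct{ id int }

// redirect tries to hand t to the buffer for group without ever blocking.
// It returns false if the group has no valid channel or the buffer is full.
func redirect(buffers map[string]chan task, group string, t task) bool {
	ch, ok := buffers[group]
	if !ok || ch == nil {
		// Invalid target: a plain send on a nil channel would block forever.
		return false
	}
	select {
	case ch <- t:
		return true
	default:
		// Buffer is full: report failure rather than stalling the caller.
		return false
	}
}

func main() {
	buffers := map[string]chan task{"zone-a": make(chan task, 1)}
	fmt.Println(redirect(buffers, "zone-a", task{id: 1})) // true
	fmt.Println(redirect(buffers, "zone-a", task{id: 2})) // false: buffer full
	fmt.Println(redirect(buffers, "zone-b", task{id: 3})) // false: no such group
}
```

The explicit validity check is separate from the `default` case so the two failure modes can be logged or handled differently.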
Changes
How did you test it?