Bugfix: matching persistence layer update sticky task queue #935
Conversation
TODO: verify that the conditional update with TTL on the Cassandra side does not break anything.
Cassandra behavior needs special handling, since:
* sticky task queue updates should be done within the transaction lock
@@ -2284,54 +2292,51 @@ func (d *cassandraPersistence) LeaseTaskQueue(request *p.LeaseTaskQueueRequest)
func (d *cassandraPersistence) UpdateTaskQueue(request *p.UpdateTaskQueueRequest) (*p.UpdateTaskQueueResponse, error) {
	tli := *request.TaskQueueInfo
	tli.LastUpdateTime = timestamp.TimeNowPtrUtc()
	if tli.Kind == enumspb.TASK_QUEUE_KIND_STICKY { // if task_queue is sticky, then update with TTL
		expiry := types.TimestampNow()
Before your change, was this even used anywhere?
	var applied bool
	previous := make(map[string]interface{})
	if tli.Kind == enumspb.TASK_QUEUE_KIND_STICKY { // if task_queue is sticky, then update with TTL
		batch := d.session.NewBatch(gocql.LoggedBatch)
Any performance implications with this change? Before it was a simple insert, now it is a CAS insert + update?
Cassandra does not allow setting a TTL on primary key columns via UPDATE, hence the insert query.
We need to perform CAS for this query (see the normal task queue path). The only additional cost is the batch query, which targets a single row.
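For illustration, a rough sketch of the sticky branch being described, in the shape of the surrounding diff. The CQL text, column names, and constants (rowTypeTaskQueue, taskQueueTaskID, stickyTaskQueueTTL) as well as the variables holding the task queue key are simplified stand-ins, not the exact templates:

	// Sketch: refresh the sticky row's TTL while keeping the range_id
	// compare-and-set. Both statements hit the same row, so the logged
	// batch stays single-partition; the only extra cost is the batch call.
	batch := d.session.NewBatch(gocql.LoggedBatch)

	// Cassandra cannot set a TTL on primary key columns via UPDATE, so the
	// row is re-inserted with USING TTL to reset its expiry.
	batch.Query(`INSERT INTO tasks (namespace_id, task_queue_name, task_queue_type, type, task_id, range_id, task_queue, task_queue_encoding)
		VALUES (?, ?, ?, ?, ?, ?, ?, ?) USING TTL ?`,
		namespaceID, taskQueueName, taskQueueType, rowTypeTaskQueue, taskQueueTaskID,
		request.RangeID, datablob.Data, datablob.EncodingType.String(), stickyTaskQueueTTL)

	// The non-key columns are updated conditionally on the current range_id,
	// so the optimistic lock on task queue ownership is preserved.
	batch.Query(`UPDATE tasks USING TTL ? SET range_id = ?, task_queue = ?, task_queue_encoding = ?
		WHERE namespace_id = ? AND task_queue_name = ? AND task_queue_type = ? AND type = ? AND task_id = ?
		IF range_id = ?`,
		stickyTaskQueueTTL, request.RangeID, datablob.Data, datablob.EncodingType.String(),
		namespaceID, taskQueueName, taskQueueType, rowTypeTaskQueue, taskQueueTaskID,
		request.RangeID)

	previous := make(map[string]interface{})
	applied, iter, err := d.session.MapExecuteBatchCAS(batch, previous)
	if err != nil {
		return nil, convertCommonErrors("UpdateTaskQueue", err)
	}
	defer func() { _ = iter.Close() }()
	if !applied {
		return nil, &p.ConditionFailedError{
			Msg: fmt.Sprintf("UpdateTaskQueue operation failed. Previous: %v", previous),
		}
	}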
Since we already run a background job to clean up task queues on the MySQL backend, I think we should make this consistent across all databases and follow the same pattern for Cassandra. It looks like this would simplify the logic significantly.
Short-term fix vs. long-term solution: this PR aims to provide the short-term fix, without too much perf impact.
Agreed. Let's file a task for a future improvement to eliminate TTL usage from Cassandra persistence.
	}
	blob, err := serialization.TaskQueueInfoToBlob(tqInfo)
	if err != nil {
		return nil, err
Unrelated to your check-in, but why do we not wrap this error while we wrap the error below it? Is there a specific criterion that comes into play here?
serialization.TaskQueueInfoToBlob(tqInfo)
This function is owned by us; if the error it returns is not correct, then we should update that function instead.
We need to eventually move serialization out of this layer. @wxing1292 can you create a task for this?
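A rough sketch of the wrapping convention discussed above, as a fragment of this function's body; query stands for the CAS statement built from the templates, and the helper names come from the surrounding diff:

	// Errors from code we own (serialization) are returned unwrapped; if the
	// error is not caller-ready, the fix belongs in serialization itself.
	datablob, err := serialization.TaskQueueInfoToBlob(&tli)
	if err != nil {
		return nil, err
	}
	// datablob is bound into the CQL statement built below (omitted here).

	// Errors surfaced by the Cassandra driver are wrapped with the operation
	// name, so callers and logs can see which persistence call failed.
	previous := make(map[string]interface{})
	if _, err := query.MapScanCAS(previous); err != nil {
		return nil, convertCommonErrors("UpdateTaskQueue", err)
	}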
	if err != nil {
		return nil, convertCommonErrors("UpdateTaskQueue", err)
	}
	datablob, err := serialization.TaskQueueInfoToBlob(&tli)
In the MySQL Persistence layer, we actually set the ExpiryTime field on the TLI object before serializing it. What's the reason for the difference between the two layers?
Cassandra has TTL support; SQL does not.
There should be a worker / background job scanning SQL to get rid of stale records.
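As a rough illustration of what such a background scan could look like on the SQL side; the table and column names, the package name, and the stored value of the sticky kind are assumptions, not the actual schema:

package persistencesql // illustrative

import (
	"context"
	"database/sql"
	"time"
)

// stickyKind mirrors the stored value of enumspb.TASK_QUEUE_KIND_STICKY;
// the concrete value and the column it lives in are assumptions here.
const stickyKind = 2

// scavengeStickyTaskQueues deletes sticky task queue rows whose last update
// is older than ttl. Table and column names are illustrative only.
func scavengeStickyTaskQueues(ctx context.Context, db *sql.DB, ttl time.Duration) error {
	cutoff := time.Now().UTC().Add(-ttl)
	_, err := db.ExecContext(ctx,
		`DELETE FROM task_queues WHERE kind = ? AND last_update_time < ?`,
		stickyKind, cutoff)
	return err
}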
Can you briefly explain what the worst-case impact of this bug would have been (e.g. would it have resulted in tasks getting incorrect schedule-to-start timeouts, or something more sinister)?
Overall, not a serious bug (although the error logs look bad). Worst case is the worker seeing around 5s more delay, plus additional work / load on the history service.
	query := d.session.Query(templateUpdateTaskQueueQueryWithTTL,
	var applied bool
	previous := make(map[string]interface{})
	if tli.Kind == enumspb.TASK_QUEUE_KIND_STICKY { // if task_queue is sticky, then update with TTL
Can you make sure we have unit tests for both code paths, sticky and regular?
We do have them; however, the unit test was asserting the wrong behavior:
https://github.com/temporalio/temporal/pull/935/files#diff-6349d51f67c5e6f3d3ea5fb9bf919a04a5856a06738d80ef17e7a89240709a95L384
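As an illustration only, a test covering both paths could be shaped like this; createTaskQueue and updateTaskQueue are hypothetical helpers standing in for the real persistence test fixtures:

package persistence_test // illustrative

import (
	"testing"

	"github.com/stretchr/testify/require"
	enumspb "go.temporal.io/api/enums/v1"
)

// TestUpdateTaskQueue_BothKinds is a rough sketch only: createTaskQueue and
// updateTaskQueue are hypothetical helpers wrapping LeaseTaskQueue /
// UpdateTaskQueue against a test store.
func TestUpdateTaskQueue_BothKinds(t *testing.T) {
	for _, kind := range []enumspb.TaskQueueKind{
		enumspb.TASK_QUEUE_KIND_NORMAL, // regular path: plain conditional update
		enumspb.TASK_QUEUE_KIND_STICKY, // sticky path: conditional update plus TTL refresh
	} {
		tq := createTaskQueue(t, kind)      // leases the queue, acquiring a range ID
		err := updateTaskQueue(t, tq, kind) // must succeed while the range ID is held
		require.NoError(t, err)
	}
}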
	row = sqlplugin.TaskQueuesRow{
		RangeHash:   tqHash,
		TaskQueueID: tqId,
		RangeID:     0,
Just want to make sure the intent here is to set RangeID explicitly to 0? Previously it was set implicitly.
Ideally we should set all fields explicitly.
Plus, this does not introduce any behavior difference.
	if err != nil {
		return nil, err
	}
	if _, err := m.db.ReplaceIntoTaskQueues(&sqlplugin.TaskQueuesRow{
Can you provide more context why this is removed?
- this write is not protected by the range ID
- the logic below will write this record again
Overall, this was a bug.
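For context, a range-ID-fenced write looks roughly like the sketch below; the package, table, and column names are simplified stand-ins for the sqlplugin schema, not the actual API:

package persistencesql // illustrative

import (
	"context"
	"database/sql"
	"errors"
)

// updateTaskQueueGuarded is a sketch of a range-ID-fenced update; the removed
// ReplaceIntoTaskQueues call skipped this fence.
func updateTaskQueueGuarded(
	ctx context.Context,
	tx *sql.Tx,
	rangeHash uint32,
	taskQueueID []byte,
	oldRangeID, newRangeID int64,
	data []byte,
	encoding string,
) error {
	res, err := tx.ExecContext(ctx,
		`UPDATE task_queues
		 SET range_id = ?, data = ?, data_encoding = ?
		 WHERE range_hash = ? AND task_queue_id = ? AND range_id = ?`,
		newRangeID, data, encoding, rangeHash, taskQueueID, oldRangeID)
	if err != nil {
		return err
	}
	rows, err := res.RowsAffected()
	if err != nil {
		return err
	}
	if rows != 1 {
		// Another owner advanced the range ID concurrently; the caller must
		// abort instead of overwriting the newer lease.
		return errors.New("task queue range_id changed concurrently")
	}
	return nil
}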
Overall looks good. Let's quickly sync up tomorrow before merging this in.
What changed?
The matching persistence layer now updates sticky task queues under the same range ID condition as regular task queues: Cassandra uses a conditional batch with TTL, and the SQL layer drops the unguarded write.
Why?
Task queue updates should always be done within the range ID lock.
How did you test it?
Ran existing tests.
Potential risks
N/A