
Bugfix: matching persistence layer update sticky task queue #935

Merged
merged 5 commits into temporalio:master on Nov 2, 2020
Conversation

wxing1292
Contributor

@wxing1292 wxing1292 commented Oct 31, 2020

What changed?

  • Sticky task queue updates should be done within a transaction lock or as a conditional update

Why?
Task queue updates should always be done under a lock

How did you test it?
Run tests

Potential risks
N/A

@wxing1292 wxing1292 requested a review from mastermanu October 31, 2020 03:30
@wxing1292 wxing1292 changed the title Bugfix: SQL matching layer update task queue Bugfix: matching layer update task queue Oct 31, 2020
@wxing1292 wxing1292 changed the title Bugfix: matching layer update task queue Bugfix: matching layer update sticky task queue Oct 31, 2020
@wxing1292
Contributor Author

wxing1292 commented Oct 31, 2020

TODO verify conditional update with TTL on Cassandra side does not break anything

@wxing1292
Contributor Author

> TODO verify conditional update with TTL on Cassandra side does not break anything

Needs special handling of Cassandra behavior, since an UPDATE only sets the TTL on non-primary-key columns.
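The Cassandra quirk discussed here can be sketched as follows. The template names and CQL shapes below are hypothetical stand-ins, not the real templates in the Cassandra persistence layer (which executes the two statements together as a single-row logged batch): a CQL `UPDATE ... USING TTL` only applies the TTL to the non-primary-key columns it sets, so refreshing the liveness of the whole sticky row requires re-INSERTing it with a TTL.

```go
package main

import "fmt"

// Hypothetical CQL templates illustrating the sticky-queue write path
// described in this PR; the real templates in cassandraPersistence differ.
const (
	// An INSERT refreshes the TTL for the whole row, primary key included.
	templateInsertWithTTL = "INSERT INTO task_queues (...) VALUES (...) USING TTL %d"
	// A CAS-guarded UPDATE keeps the write protected by the range_id lease,
	// but on its own it would not refresh the row's primary-key liveness.
	templateUpdateWithCAS = "UPDATE task_queues USING TTL %d SET ... WHERE ... IF range_id = ?"
)

// stickyUpdateStatements returns both statements; in the PR they are
// submitted together as a logged batch targeting a single row.
func stickyUpdateStatements(ttlSeconds int) []string {
	return []string{
		fmt.Sprintf(templateInsertWithTTL, ttlSeconds),
		fmt.Sprintf(templateUpdateWithCAS, ttlSeconds),
	}
}

func main() {
	for _, stmt := range stickyUpdateStatements(86400) {
		fmt.Println(stmt)
	}
}
```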

@wxing1292 wxing1292 requested a review from samarabbas October 31, 2020 05:54
@wxing1292 wxing1292 changed the title Bugfix: matching layer update sticky task queue Bugfix: matching persistence layer update sticky task queue Oct 31, 2020
* Sticky task queue update should be done within transaction lock
@@ -2284,54 +2292,51 @@ func (d *cassandraPersistence) LeaseTaskQueue(request *p.LeaseTaskQueueRequest)
func (d *cassandraPersistence) UpdateTaskQueue(request *p.UpdateTaskQueueRequest) (*p.UpdateTaskQueueResponse, error) {
	tli := *request.TaskQueueInfo
	tli.LastUpdateTime = timestamp.TimeNowPtrUtc()
	if tli.Kind == enumspb.TASK_QUEUE_KIND_STICKY { // if task_queue is sticky, then update with TTL
		expiry := types.TimestampNow()
Member

Before your change, was this even used anywhere?

Contributor Author

	var applied bool
	previous := make(map[string]interface{})
	if tli.Kind == enumspb.TASK_QUEUE_KIND_STICKY { // if task_queue is sticky, then update with TTL
		batch := d.session.NewBatch(gocql.LoggedBatch)
Member

Any performance implications with this change? Before it was a simple insert, now it is a CAS insert + update?

Contributor Author

Cassandra does not allow an UPDATE to set a TTL on the primary key, hence the INSERT query.

We need to perform CAS for this query (see the normal task queue path). The only additional cost is the batch query, which targets a single row.
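The CAS semantics in question can be illustrated with a toy in-memory model (all names here are hypothetical; the real code issues a conditional CQL write through gocql, whose CAS queries report whether the write was applied along with the previous values): the write lands only if the caller still holds the `range_id` lease.

```go
package main

import "fmt"

// taskQueueRow is a toy stand-in for a task queue record guarded by a
// range_id ownership lease, loosely mirroring the rows in this PR.
type taskQueueRow struct {
	rangeID int64
	data    string
}

type store struct {
	rows map[string]*taskQueueRow
}

// updateIfRangeIDMatches mimics "UPDATE ... IF range_id = ?": the write is
// applied only when the stored range_id matches the caller's lease, and the
// previous row is returned when it does not (as with a rejected CAS).
func (s *store) updateIfRangeIDMatches(key string, expectedRangeID int64, data string) (applied bool, previous *taskQueueRow) {
	row, ok := s.rows[key]
	if !ok || row.rangeID != expectedRangeID {
		return false, row
	}
	row.data = data
	return true, nil
}

func main() {
	s := &store{rows: map[string]*taskQueueRow{
		"sticky-q": {rangeID: 7, data: "old"},
	}}
	ok, _ := s.updateIfRangeIDMatches("sticky-q", 7, "new") // matching lease: applied
	fmt.Println(ok)
	ok, prev := s.updateIfRangeIDMatches("sticky-q", 8, "hijack") // stale lease: rejected
	fmt.Println(ok, prev.data)
}
```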

Contributor

Since we already run a background job to cleanup task queues on mysql backend, I think we should make this consistent across all databases and follow the same pattern for Cassandra. Looks like this would simplify the logic significantly.

Contributor Author

> Since we already run a background job to cleanup task queues on mysql backend, I think we should make this consistent across all databases and follow the same pattern for Cassandra. Looks like this would simplify the logic significantly.

Short-term fix vs. long-term solution:

This PR aims to provide the short-term fix, without too much perf impact.

Contributor


Agreed. Let's file a task for future improvement to eliminate TTL usage from Cassandra persistence.

	}
	blob, err := serialization.TaskQueueInfoToBlob(tqInfo)
	if err != nil {
		return nil, err
Member

Unrelated to your check-in, but why do we not wrap this error, while we wrap the error below it? Is there any specific criteria that comes into play here?

Contributor Author

serialization.TaskQueueInfoToBlob(tqInfo)

This function is owned by us; if the error it returns is not correct, then we should update this function.

Contributor
@samarabbas samarabbas Nov 2, 2020

We need to eventually move out serialization from this layer. @wxing1292 can you create a task for this?

	if err != nil {
		return nil, convertCommonErrors("UpdateTaskQueue", err)
	}
	datablob, err := serialization.TaskQueueInfoToBlob(&tli)
Member

In the MySQL Persistence layer, we actually set the ExpiryTime field on the TLI object before serializing it. What's the reason for the difference between the two layers?

Contributor Author

Cassandra has TTL support; SQL does not.

There should be a worker / background job that scans the SQL store to get rid of stale records.

@mastermanu
Member

Can you briefly explain what the worst-case impact of this bug would have been (e.g. would it have resulted in tasks getting incorrect scheduletostart timeouts, or something more sinister)?

@wxing1292
Contributor Author

> Can you briefly explain what the worst-case impact of this bug would have been (e.g. would it have resulted in tasks getting incorrect scheduletostart timeouts, or something more sinister)?

Overall, not a serious bug (although the error logs look bad).
The bug only affects the sticky task queue (only used by command). The sticky command task is also protected by a short timeout (5s, from memory).

So the worst case is the worker seeing around 5s more delay, plus additional work / load on the history service.

	query := d.session.Query(templateUpdateTaskQueueQueryWithTTL,
	var applied bool
	previous := make(map[string]interface{})
	if tli.Kind == enumspb.TASK_QUEUE_KIND_STICKY { // if task_queue is sticky, then update with TTL
Contributor

Can you make sure we have unit tests for both code paths, sticky and regular?

	row = sqlplugin.TaskQueuesRow{
		RangeHash:   tqHash,
		TaskQueueID: tqId,
		RangeID:     0,
Contributor

Just want to make sure the intent here is to set RangeID explicitly to 0? As previously it was implicitly set.

Contributor Author

Ideally we should set all fields explicitly. Plus, this does not introduce any behavior difference.

	if err != nil {
		return nil, err
	}
	if _, err := m.db.ReplaceIntoTaskQueues(&sqlplugin.TaskQueuesRow{
Contributor

Can you provide more context why this is removed?

Contributor Author

  1. This write is not protected by the range ID
  2. The logic below will write this record again

Overall, this was a bug.

Contributor
@samarabbas samarabbas left a comment

Overall looks good. Let's quickly sync up tomorrow before merging this in.

@wxing1292 wxing1292 merged commit 9b6c587 into temporalio:master Nov 2, 2020
@wxing1292 wxing1292 deleted the bugfix-matching branch November 2, 2020 19:54