Bugfix: matching persistence layer update sticky task queue #935
Conversation
TODO: verify that the conditional update with TTL on the Cassandra side does not break anything.
Cassandra behavior needs special handling, since:
* sticky task queue updates should be done within the transaction lock
@@ -2284,54 +2292,51 @@ func (d *cassandraPersistence) LeaseTaskQueue(request *p.LeaseTaskQueueRequest)
func (d *cassandraPersistence) UpdateTaskQueue(request *p.UpdateTaskQueueRequest) (*p.UpdateTaskQueueResponse, error) {
	tli := *request.TaskQueueInfo
	tli.LastUpdateTime = timestamp.TimeNowPtrUtc()
	if tli.Kind == enumspb.TASK_QUEUE_KIND_STICKY { // if task_queue is sticky, then update with TTL
		expiry := types.TimestampNow()
Before your change, was this even used anywhere?
	var applied bool
	previous := make(map[string]interface{})
	if tli.Kind == enumspb.TASK_QUEUE_KIND_STICKY { // if task_queue is sticky, then update with TTL
		batch := d.session.NewBatch(gocql.LoggedBatch)
Any performance implications with this change? Before it was a simple insert, now it is a CAS insert + update?
Cassandra does not allow setting a TTL on primary key columns via UPDATE, hence the insert query.
We need to perform CAS for this query (see the normal task queue path). The only additional cost is the batch query, which targets a single row.
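For illustration, a rough sketch of the sticky branch being described, in the shape of the surrounding diff. The CQL text, column names, and constants (rowTypeTaskQueue, taskQueueTaskID, stickyTaskQueueTTL) as well as the variables holding the task queue key are simplified stand-ins, not the exact templates:

	// Sketch: refresh the sticky row's TTL while keeping the range_id
	// compare-and-set. Both statements hit the same row, so the logged
	// batch stays single-partition; the only extra cost is the batch call.
	batch := d.session.NewBatch(gocql.LoggedBatch)

	// Cassandra cannot set a TTL on primary key columns via UPDATE, so the
	// row is re-inserted with USING TTL to reset its expiry.
	batch.Query(`INSERT INTO tasks (namespace_id, task_queue_name, task_queue_type, type, task_id, range_id, task_queue, task_queue_encoding)
		VALUES (?, ?, ?, ?, ?, ?, ?, ?) USING TTL ?`,
		namespaceID, taskQueueName, taskQueueType, rowTypeTaskQueue, taskQueueTaskID,
		request.RangeID, datablob.Data, datablob.EncodingType.String(), stickyTaskQueueTTL)

	// The non-key columns are updated conditionally on the current range_id,
	// so the optimistic lock on task queue ownership is preserved.
	batch.Query(`UPDATE tasks USING TTL ? SET range_id = ?, task_queue = ?, task_queue_encoding = ?
		WHERE namespace_id = ? AND task_queue_name = ? AND task_queue_type = ? AND type = ? AND task_id = ?
		IF range_id = ?`,
		stickyTaskQueueTTL, request.RangeID, datablob.Data, datablob.EncodingType.String(),
		namespaceID, taskQueueName, taskQueueType, rowTypeTaskQueue, taskQueueTaskID,
		request.RangeID)

	previous := make(map[string]interface{})
	applied, iter, err := d.session.MapExecuteBatchCAS(batch, previous)
	if err != nil {
		return nil, convertCommonErrors("UpdateTaskQueue", err)
	}
	defer func() { _ = iter.Close() }()
	if !applied {
		return nil, &p.ConditionFailedError{
			Msg: fmt.Sprintf("UpdateTaskQueue operation failed. Previous: %v", previous),
		}
	}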
Since we already run a background job to clean up task queues on the MySQL backend, I think we should make this consistent across all databases and follow the same pattern for Cassandra. It looks like this would simplify the logic significantly.
Short-term fix vs. long-term solution: this PR aims to provide the short-term fix, without too much perf impact.
Agreed. Let's file a task for a future improvement to eliminate TTL usage from Cassandra persistence.
	}
	blob, err := serialization.TaskQueueInfoToBlob(tqInfo)
	if err != nil {
		return nil, err
Unrelated to your check-in, but why do we not wrap this error while we wrap the error below it? Is there a specific criterion that comes into play here?
serialization.TaskQueueInfoToBlob(tqInfo)
This function is owned by us; if the error it returns is not correct, then we should update that function instead.
We need to eventually move serialization out of this layer. @wxing1292 can you create a task for this?
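A rough sketch of the wrapping convention discussed above, as a fragment of this function's body; query stands for the CAS statement built from the templates, and the helper names come from the surrounding diff:

	// Errors from code we own (serialization) are returned unwrapped; if the
	// error is not caller-ready, the fix belongs in serialization itself.
	datablob, err := serialization.TaskQueueInfoToBlob(&tli)
	if err != nil {
		return nil, err
	}
	// datablob is bound into the CQL statement built below (omitted here).

	// Errors surfaced by the Cassandra driver are wrapped with the operation
	// name, so callers and logs can see which persistence call failed.
	previous := make(map[string]interface{})
	if _, err := query.MapScanCAS(previous); err != nil {
		return nil, convertCommonErrors("UpdateTaskQueue", err)
	}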
	if err != nil {
		return nil, convertCommonErrors("UpdateTaskQueue", err)
	}
	datablob, err := serialization.TaskQueueInfoToBlob(&tli)
In the MySQL Persistence layer, we actually set the ExpiryTime field on the TLI object before serializing it. What's the reason for the difference between the two layers?
Cassandra has TTL support; SQL does not.
There should be a worker / background job scanning SQL to get rid of stale records.
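As a rough illustration of what such a background scan could look like on the SQL side; the table and column names, the package name, and the stored value of the sticky kind are assumptions, not the actual schema:

package persistencesql // illustrative

import (
	"context"
	"database/sql"
	"time"
)

// stickyKind mirrors the stored value of enumspb.TASK_QUEUE_KIND_STICKY;
// the concrete value and the column it lives in are assumptions here.
const stickyKind = 2

// scavengeStickyTaskQueues deletes sticky task queue rows whose last update
// is older than ttl. Table and column names are illustrative only.
func scavengeStickyTaskQueues(ctx context.Context, db *sql.DB, ttl time.Duration) error {
	cutoff := time.Now().UTC().Add(-ttl)
	_, err := db.ExecContext(ctx,
		`DELETE FROM task_queues WHERE kind = ? AND last_update_time < ?`,
		stickyKind, cutoff)
	return err
}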
Can you briefly explain what the worst-case impact of this bug would have been (e.g. would it have resulted in tasks getting incorrect schedule-to-start timeouts, or something more sinister)?
Overall, not a serious bug (although the error logs look bad). Worst case is the worker seeing around 5s more delay, plus additional work / load on the history service.
	query := d.session.Query(templateUpdateTaskQueueQueryWithTTL,
	var applied bool
	previous := make(map[string]interface{})
	if tli.Kind == enumspb.TASK_QUEUE_KIND_STICKY { // if task_queue is sticky, then update with TTL
Can you make sure we have unit tests for both code paths, sticky and regular?
We do have them; however, the unit test was asserting the wrong behavior:
https://github.com/temporalio/temporal/pull/935/files#diff-6349d51f67c5e6f3d3ea5fb9bf919a04a5856a06738d80ef17e7a89240709a95L384
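As an illustration only, a test covering both paths could be shaped like this; createTaskQueue and updateTaskQueue are hypothetical helpers standing in for the real persistence test fixtures:

package persistence_test // illustrative

import (
	"testing"

	"github.com/stretchr/testify/require"
	enumspb "go.temporal.io/api/enums/v1"
)

// TestUpdateTaskQueue_BothKinds is a rough sketch only: createTaskQueue and
// updateTaskQueue are hypothetical helpers wrapping LeaseTaskQueue /
// UpdateTaskQueue against a test store.
func TestUpdateTaskQueue_BothKinds(t *testing.T) {
	for _, kind := range []enumspb.TaskQueueKind{
		enumspb.TASK_QUEUE_KIND_NORMAL, // regular path: plain conditional update
		enumspb.TASK_QUEUE_KIND_STICKY, // sticky path: conditional update plus TTL refresh
	} {
		tq := createTaskQueue(t, kind)      // leases the queue, acquiring a range ID
		err := updateTaskQueue(t, tq, kind) // must succeed while the range ID is held
		require.NoError(t, err)
	}
}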
	row = sqlplugin.TaskQueuesRow{
		RangeHash:   tqHash,
		TaskQueueID: tqId,
		RangeID:     0,
Just want to make sure the intent here is to set RangeID explicitly to 0? Previously it was set implicitly.
Ideally we should set all fields explicitly.
Plus, this does not introduce any behavior difference.
	if err != nil {
		return nil, err
	}
	if _, err := m.db.ReplaceIntoTaskQueues(&sqlplugin.TaskQueuesRow{
Can you provide more context why this is removed?
- this write is not protected by the range ID
- the logic below will write this record again
Overall, this was a bug.
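For context, a range-ID-fenced write looks roughly like the sketch below; the package, table, and column names are simplified stand-ins for the sqlplugin schema, not the actual API:

package persistencesql // illustrative

import (
	"context"
	"database/sql"
	"errors"
)

// updateTaskQueueGuarded is a sketch of a range-ID-fenced update; the removed
// ReplaceIntoTaskQueues call skipped this fence.
func updateTaskQueueGuarded(
	ctx context.Context,
	tx *sql.Tx,
	rangeHash uint32,
	taskQueueID []byte,
	oldRangeID, newRangeID int64,
	data []byte,
	encoding string,
) error {
	res, err := tx.ExecContext(ctx,
		`UPDATE task_queues
		 SET range_id = ?, data = ?, data_encoding = ?
		 WHERE range_hash = ? AND task_queue_id = ? AND range_id = ?`,
		newRangeID, data, encoding, rangeHash, taskQueueID, oldRangeID)
	if err != nil {
		return err
	}
	rows, err := res.RowsAffected()
	if err != nil {
		return err
	}
	if rows != 1 {
		// Another owner advanced the range ID concurrently; the caller must
		// abort instead of overwriting the newer lease.
		return errors.New("task queue range_id changed concurrently")
	}
	return nil
}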
Overall looks good. Let's quickly sync up tomorrow before merging this in.
What changed?
The matching persistence layer now updates sticky task queues under the same range ID condition as regular task queues: Cassandra uses a conditional batch with TTL, and the SQL layer drops the unguarded write.
Why?
Task queue updates should always be done within the range ID lock.
How did you test it?
Ran existing tests.
Potential risks
N/A