Benchmark concurrent Cassandra LWTs #6186
Merged
Cassandra LWT overview
Cadence uses Cassandra's LWT (a.k.a. compare-and-swap) queries to ensure consistent, atomic updates in various scenarios. LWTs are implemented with Paxos-based coordination between nodes and are scoped to the partition key. Concurrent LWTs against the same partition can impact each other because of the coordination/conflict-resolution mechanisms inherent in the Paxos algorithm and its adaptations for LWTs:
- When multiple coordinators attempt writes to the same partition simultaneously, they can clash, leading to contention. Cassandra mitigates this with randomized exponential backoff, meaning conflicting LWTs experience delays as they retry their operations with new ballot numbers.
- Each LWT must reach a quorum of nodes to proceed through its stages (prepare, propose, commit). Concurrent LWTs compete for the same set of nodes to form a quorum, potentially causing delays or retries.
- In-progress Paxos sessions add latency: if an LWT detects an uncommitted value, it must complete or supersede it before proceeding.
- Cassandra ensures that once a value is decided (accepted by a quorum), no earlier proposal can be re-proposed. Concurrent LWTs must navigate this rule, meaning operations started later see the effects of earlier ones; this preserves linearizability but can require additional rounds of coordination.
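For reference, here is a minimal sketch of what a single LWT looks like through gocql. The keyspace, `executions` table, columns, and literal values are placeholders for illustration, not Cadence's actual schema:

```go
// Minimal gocql sketch of one LWT (compare-and-swap) update.
// Keyspace, table, and column names below are placeholders.
package main

import (
	"fmt"
	"log"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("127.0.0.1")
	cluster.Keyspace = "cadence"
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// The IF clause makes this an LWT: Cassandra runs the Paxos
	// prepare/propose/commit rounds scoped to the shard_id partition.
	previous := map[string]interface{}{}
	applied, err := session.Query(
		`UPDATE executions SET range_id = ? WHERE shard_id = ? IF range_id = ?`,
		int64(42), 1, int64(41),
	).MapScanCAS(previous)
	if err != nil {
		log.Fatal(err)
	}
	if !applied {
		// The condition did not match; `previous` holds the current values,
		// so the caller can decide whether to retry with fresh inputs.
		fmt.Println("CAS lost, current values:", previous)
	}
}
```

Two such updates racing on the same shard_id serialize through Paxos, which is exactly the contention this benchmark measures.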
This PR benchmarks the impact of concurrency by generating 1k `UpdateWorkflow` LWT queries against the same partition with different concurrency limits. The goal is to measure whether tuning LWT concurrency is worth it, and potentially to help prevent the hot-partition problems we face in some environments under high load.

Benchmark setup
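The benchmark loop is roughly shaped like the sketch below (a hedged approximation, not the exact code in this PR): a channel-based semaphore caps in-flight LWTs at a configurable limit while every query targets the same partition, and contended queries surface as write timeouts like the error quoted after the sketch. The schema and values are the same placeholders as above.

```go
// Sketch of the benchmark shape: fire `total` LWTs against one partition
// with at most `limit` of them in flight, then report elapsed time and
// failure count. Schema and names are placeholders.
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"

	"github.com/gocql/gocql"
)

func runBenchmark(session *gocql.Session, limit, total int) {
	var (
		wg       sync.WaitGroup
		failures int64
		sem      = make(chan struct{}, limit) // caps concurrent LWTs
		start    = time.Now()
	)

	for i := 0; i < total; i++ {
		wg.Add(1)
		sem <- struct{}{} // wait for a free slot
		go func(i int) {
			defer wg.Done()
			defer func() { <-sem }()

			// Every query targets the same partition (shard_id = 1),
			// so all of them compete in the same Paxos rounds.
			previous := map[string]interface{}{}
			_, err := session.Query(
				`UPDATE executions SET range_id = ? WHERE shard_id = ? IF range_id = ?`,
				i+1, 1, i,
			).MapScanCAS(previous)
			if err != nil {
				// Under heavy contention these are mostly write timeouts.
				atomic.AddInt64(&failures, 1)
			}
		}(i)
	}
	wg.Wait()

	fmt.Printf("limit=%d total=%d elapsed=%v failures=%d\n",
		limit, total, time.Since(start), atomic.LoadInt64(&failures))
}

func main() {
	cluster := gocql.NewCluster("127.0.0.1")
	cluster.Keyspace = "cadence"
	session, err := cluster.CreateSession()
	if err != nil {
		panic(err)
	}
	defer session.Close()

	// Re-run the same load with different concurrency limits.
	for _, limit := range []int{1, 5, 10, 100, 1000} {
		runBenchmark(session, limit, 1000)
	}
}
```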
`Operation timed out - received only 0 responses`

Benchmark with single node Cassandra
Benchmark with a 2-node Cassandra cluster and replication_factor: 2
Notes on results
What is next?
`UpdateWorkflow` queries for a given partition originate from a single Cadence History service host (the history engine instance of the corresponding shard owner). This means we can easily control the concurrency by introducing a similar implementation. A concurrency limit between 1 and 10 seems to promise the best latency and throughput. To decide on the exact concurrency limit we can try two approaches:
- Benchmark this in a prod-like environment and come up with a good-enough number that handles the volume of queries with minimal contention.
- Introduce an adaptive concurrency limiter that dynamically increases/decreases concurrency based on the ratio of timeouts (see the sketch below).
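For the second approach, here is a minimal sketch of what such an adaptive limiter could look like, assuming an AIMD-style adjustment driven by the timeout ratio observed over a fixed window. All names, thresholds, and the window size are hypothetical; this is not an existing Cadence component.

```go
// Hypothetical sketch of an adaptive per-partition LWT limiter: additive
// increase / multiplicative decrease driven by the observed timeout ratio.
package limiter

import "sync"

type AdaptiveLimiter struct {
	mu       sync.Mutex
	cond     *sync.Cond
	limit    int // current max concurrent LWTs
	inFlight int
	minLimit int
	maxLimit int

	window   int // completed requests per adjustment window
	requests int
	timeouts int
}

func NewAdaptiveLimiter(initial, min, max, window int) *AdaptiveLimiter {
	l := &AdaptiveLimiter{limit: initial, minLimit: min, maxLimit: max, window: window}
	l.cond = sync.NewCond(&l.mu)
	return l
}

// Acquire blocks until an in-flight slot is available under the current limit.
func (l *AdaptiveLimiter) Acquire() {
	l.mu.Lock()
	defer l.mu.Unlock()
	for l.inFlight >= l.limit {
		l.cond.Wait()
	}
	l.inFlight++
}

// Release frees the slot, records whether the LWT timed out, and once per
// window halves the limit if timeouts are common or bumps it by one if rare.
func (l *AdaptiveLimiter) Release(timedOut bool) {
	l.mu.Lock()
	defer l.mu.Unlock()

	l.inFlight--
	l.requests++
	if timedOut {
		l.timeouts++
	}

	if l.requests >= l.window {
		ratio := float64(l.timeouts) / float64(l.requests)
		switch {
		case ratio > 0.05: // too many timeouts: back off multiplicatively
			l.limit /= 2
			if l.limit < l.minLimit {
				l.limit = l.minLimit
			}
		case ratio < 0.01 && l.limit < l.maxLimit: // healthy: probe upward
			l.limit++
		}
		l.requests, l.timeouts = 0, 0
	}

	l.cond.Broadcast()
}
```

Callers would wrap each `UpdateWorkflow` LWT in Acquire/Release; whether the example 5%/1% thresholds and window size are reasonable would still need to be validated with the prod-like benchmark from the first bullet.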