
kvserver: lease maintenance scheduler #98433

Closed
erikgrinaker opened this issue Mar 11, 2023 · 1 comment · Fixed by #101098
Labels: A-kv (Anything in KV that doesn't belong in a more specific category), C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are the exception), T-kv (KV Team)



erikgrinaker commented Mar 11, 2023

We need a scheduler that eagerly maintains range leases.

There are two motivations for this:

  1. We don't want lease acquisition to be lazily driven by client requests.

    1. It can prevent ranges from ever acquiring a lease under certain failure modes, e.g. because the client request times out before the lease acquisition succeeds under disk stalls (see kvserver: disk stall prevents lease transfer #81100 and kvserver: persistent outage when liveness leaseholder deadlocks #80713).
    2. It adds unnecessary latency, e.g. in the case where a scan across many ranges has to sequentially acquire leases for each range, especially under failure modes such as network outages or disk stalls where the lease acquisition has to wait for network timeouts.
  2. When using only expiration leases, we want to ensure ranges keep a lease even when they see no traffic, to avoid lease acquisition latency on the next request to the range.

A few important aspects to consider:

  • Some ranges are more important than others. In particular, the meta and liveness range leases must get priority.
  • Expiration lease extensions are currently expensive (one Raft write per extension per range). We may want to allow very cold ranges to let their leases expire after some period of inactivity (e.g. minutes).
  • There may not be any point in quiescing ranges where we eagerly extend expiration leases (see kvserver: explore quiescence removal #94592 and kvserver: don't quiesce ranges with expiration-based leases #94454).
  • We already have other scheduler infrastructure that we could piggyback on: the Raft scheduler, and the queue infrastructure (e.g. a new lease queue).
  • We should try to honour lease preferences.
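
To make this a bit more concrete, here's a rough sketch of what a single maintenance pass could look like. None of these names exist in the codebase; the priorities, intervals, and cold-range policy are just placeholders for the decisions listed above.

```go
package leasesketch

import (
	"context"
	"sort"
	"time"
)

// rangeInfo is a hypothetical summary of a replica's lease state.
type rangeInfo struct {
	rangeID      int64
	isSystem     bool      // meta/liveness ranges get priority
	lastRequest  time.Time // used to let very cold ranges lapse
	leaseExpires time.Time
}

// maintainLeases is one pass of a toy scheduler: system ranges first,
// cold ranges are skipped so their leases can expire, and everything else
// is renewed once it is within renewMargin of expiring.
func maintainLeases(ctx context.Context, ranges []rangeInfo, renew func(context.Context, int64) error) {
	const coldThreshold = 5 * time.Minute // let colder ranges lapse
	const renewMargin = 3 * time.Second   // e.g. half of a 6s lease interval

	sort.Slice(ranges, func(i, j int) bool {
		return ranges[i].isSystem && !ranges[j].isSystem
	})
	now := time.Now()
	for _, r := range ranges {
		if !r.isSystem && now.Sub(r.lastRequest) > coldThreshold {
			continue // don't pay a Raft write for an idle range
		}
		if r.leaseExpires.Sub(now) > renewMargin {
			continue // lease still has plenty of runway
		}
		// Best-effort: errors are dropped and retried on the next pass.
		_ = renew(ctx, r.rangeID)
	}
}
```

Whether something like this runs as its own loop, a new lease queue, or inside the Raft scheduler is exactly the open question here.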

We already have a few similar mechanisms that should mostly be replaced by this scheduler:

  • The replicate queue acquires leases for ranges that don't have one.

    // TODO(kvoli): This check should fail if not the leaseholder. In the case
    // where we want to use the replicate queue to acquire leases, this should
    // occur before planning or as a result. In order to return this in
    // planning, it is necessary to simulate the prior change having succeeded
    // to then plan this lease transfer.
    if _, pErr := repl.redirectOnOrAcquireLease(ctx); pErr != nil {
        return change, pErr.GoError()
    }

  • Store.startLeaseRenewer() eagerly renews the expiration leases on the meta and liveness ranges to avoid high tail latencies.

    // startLeaseRenewer runs an infinite loop in a goroutine which regularly
    // checks whether the store has any expiration-based leases that should be
    // proactively renewed and attempts to continue renewing them.
    //
    // This reduces user-visible latency when range lookups are needed to serve a
    // request and reduces ping-ponging of r1's lease to different replicas as
    // maybeGossipFirstRange is called on each (e.g. #24753).
    func (s *Store) startLeaseRenewer(ctx context.Context) {
        // Start a goroutine that watches and proactively renews certain
        // expiration-based leases.

  • Replica.maybeExtendLeaseAsyncLocked() will extend an expiration lease when processing a request in the last half of the lease interval.

    func (r *Replica) maybeExtendLeaseAsyncLocked(ctx context.Context, st kvserverpb.LeaseStatus) {
        // Check shouldExtendLeaseRLocked again, because others may have raced to
        // extend the lease and beaten us here after we made the determination
        // (under a shared lock) that the extension was needed.
        if !r.shouldExtendLeaseRLocked(st) {
            return
        }
        if log.ExpensiveLogEnabled(ctx, 2) {
            log.Infof(ctx, "extending lease %s at %s", st.Lease, st.Now)
        }
        // We explicitly ignore the returned handle as we won't block on it.
        //
        // TODO(tbg): this ctx is likely cancelled very soon, which will in turn
        // cancel the lease acquisition (unless joined by another more long-lived
        // ctx). So this possibly isn't working as advertised (which only plays a role
        // for expiration-based leases, at least).
        _ = r.requestLeaseLocked(ctx, st)
    }
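
For reference, the "last half of the lease interval" trigger above is essentially a midpoint check. A minimal standalone sketch with illustrative names (the real `shouldExtendLeaseRLocked` operates on `kvserverpb.LeaseStatus` and HLC timestamps):

```go
package leasesketch

import "time"

// pastLeaseMidpoint reports whether a lease covering [start, expiration)
// has entered the second half of its interval and should be extended.
func pastLeaseMidpoint(now, start, expiration time.Time) bool {
	midpoint := start.Add(expiration.Sub(start) / 2)
	return !now.Before(midpoint)
}
```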

Jira issue: CRDB-25265

Epic: CRDB-25207


erikgrinaker commented Apr 3, 2023

Once #100430 lands, we'll be eagerly extending expiration-based leases during Raft ticks. That solves one of the objectives here. However, the replicate queue is still responsible for acquiring leases for ranges that don't have one, and for switching the lease type (e.g. when kv.expiration_leases_only.enabled is changed). This can take 10 minutes or more.

We should move this eager lease acquisition and type switching into the Raft tick loop as well, so they're more responsive. However, this requires the ranges to be unquiesced. It's possible that we'll want to only do this for expiration leases, and not for epoch leases, to maintain the same performance profile.

We also need to make sure we don't slam the cluster with a ton of concurrent lease requests, but instead stagger them. Maybe #100426 is the solution here, or maybe we need some other policy.
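
Purely as an illustration of staggering (independent of whatever #100426 ends up doing), each range could get a random offset within the remaining renewal window instead of all leases renewing at once:

```go
package leasesketch

import (
	"math/rand"
	"time"
)

// staggerRenewals returns a random renewal offset per range within the given
// window, so a burst of n lease renewals is spread out instead of firing together.
func staggerRenewals(n int, window time.Duration) []time.Duration {
	offsets := make([]time.Duration, n)
	for i := range offsets {
		offsets[i] = time.Duration(rand.Int63n(int64(window))) // assumes window > 0
	}
	return offsets
}
```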

We also need to try to honor lease preferences.

craig bot pushed a commit that referenced this issue Apr 4, 2023
100430: kvserver: eagerly extend expiration leases in Raft scheduler r=erikgrinaker a=erikgrinaker

This is submitted with a 23.1 backport in mind, and is therefore limited in scope. More work is needed to fully address #98433 in the 23.2 timeframe.

---

**kvserver: don't quiesce ranges with expiration-based leases**

This patch prevents quiescence with expiration-based leases, since we'll have to propose a lease extension shortly anyway. Out of caution, we only do this when `kv.expiration_leases_only.enabled` is `true`, with an eye towards 23.1 backports. For 23.2, we will always do this.

Touches #94592.

Epic: none
Release note: None
  
**kvserver: eagerly extend expiration leases in Raft scheduler**

This patch eagerly extends expiration leases in the Raft scheduler, during Raft ticks, but only if `kv.expiration_leases_only.enabled` is `true`. This is possible because we no longer allow ranges with expiration leases to quiesce in this case.

The old store lease renewer is disabled in this case, but still tracks local system ranges with expiration leases in case the setting is later changed. The store lease renewer is still retained for use by default, to allow backporting this change to 23.1 while keeping the existing default behavior.

Doing this during Raft scheduling is compelling, since it already does efficient goroutine scheduling (including liveness range prioritization) and holds the replica mutex.
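
Schematically, the hook in the tick path is something like the following (hypothetical names, not the actual diff):

```go
package leasesketch

// replicaSketch stands in for *Replica; the fields and methods below are
// hypothetical and only mirror the shape of the change described above.
type replicaSketch struct {
	hasExpirationLease bool
	pastLeaseMidpoint  bool
}

// maybeExtendLeaseAsync would propose a lease extension without blocking.
func (r *replicaSketch) maybeExtendLeaseAsync() {}

// tick shows where the extension check slots into the existing Raft tick:
// the scheduler already visits every unquiesced replica with its mutex held,
// so the extra check is close to free.
func (r *replicaSketch) tick() {
	// ... existing Raft tick work (elections, heartbeats, etc.) ...
	if r.hasExpirationLease && r.pastLeaseMidpoint {
		r.maybeExtendLeaseAsync()
	}
}
```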

Compared to an alternative implementation that relied on a separate, improved store lease renewer which also allowed ranges to quiesce between extensions, this has significantly better performance. The main reason for this is that unquiescing involves a Raft append to wake the leader. On an idle cluster with 20,000 ranges using only expiration leases, CPU usage was 26% vs. 30%, write IOPS was 500 vs. 900, and network traffic was 0.8 MB/s vs. 1.0 MB/s. KV benchmark results:

```
name                    old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=3    14.6k ± 3%   15.2k ± 4%  +3.72%  (p=0.003 n=8+8)
kv95/enc=false/nodes=3   30.6k ± 3%   32.2k ± 3%  +5.22%  (p=0.000 n=8+8)

name                    old p50      new p50      delta
kv0/enc=false/nodes=3     11.3 ± 3%    11.0 ± 0%  -2.76%  (p=0.006 n=8+6)
kv95/enc=false/nodes=3    4.59 ± 8%    4.55 ± 3%    ~     (p=0.315 n=8+8)

name                    old p95      new p95      delta
kv0/enc=false/nodes=3     31.5 ± 0%    30.2 ± 4%  -4.25%  (p=0.003 n=6+8)
kv95/enc=false/nodes=3    19.5 ± 7%    18.1 ± 4%  -7.28%  (p=0.000 n=8+7)

name                    old p99      new p99      delta
kv0/enc=false/nodes=3     48.7 ± 3%    47.4 ± 6%    ~     (p=0.059 n=8+8)
kv95/enc=false/nodes=3    32.8 ± 9%    30.2 ± 4%  -8.04%  (p=0.001 n=8+8)
```

Resolves #99812.
Touches #98433.

Epic: none
Release note: None
  
**kvserver: also acquire expiration leases in replicate queue**

Previously, the replicate queue would only reacquire epoch leases, and allow expiration leases to expire. It now reacquires all leases, since they are eagerly extended.

In the future, when quiescence is removed entirely, this responsibility should be moved to the Raft tick loop.

Epic: none
Release note: None
  
**kvserver: conditionally extend expiration leases during requests**

If `kv.expiration_leases_only.enabled` is `true`, this patch disables expiration lease extension during request processing, since the Raft scheduler will eagerly extend leases in this case. This will eventually be removed completely, but is kept for now with an eye towards a 23.1 backport.

Epic: none
Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
craig bot pushed a commit that referenced this issue Apr 26, 2023
100434: kvserver: grow goroutine stack for `requestLease` r=erikgrinaker a=erikgrinaker

We pre-grow the goroutine stack when processing batch requests via RPC, to avoid frequent stack copying. However, this did not apply to lease requests, which are submitted directly to the local replica. When eagerly extending 20,000 expiration leases, `requestLease` stack growth made up 4% of total CPU usage -- this patch reduces that to 1%.
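
For context, the pre-growth trick is roughly this: a non-inlined function with a large local buffer forces one up-front stack growth. This is a generic sketch, not the actual helper used in the codebase:

```go
package leasesketch

import "runtime"

// growStack forces the goroutine stack to grow to at least 16 KB by putting a
// large buffer in a frame the compiler is not allowed to inline away.
//
//go:noinline
func growStack() {
	var buf [16 << 10]byte
	// KeepAlive prevents the buffer from being optimized out of the frame.
	runtime.KeepAlive(&buf)
}
```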

Touches #98433.

Epic: none
Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
craig bot pushed a commit that referenced this issue Apr 27, 2023
101050: kvserver: move eager expiration lease extension to Raft scheduler r=erikgrinaker a=erikgrinaker

**kvserver: never quiesce expiration leases**

Previously, this was contingent on `kv.expiration_leases_only.enabled`. This patch therefore, in practice, disables quiescence for the meta/liveness ranges and for any ranges that are transferring an expiration lease before upgrading it to an epoch lease. Quiescing these will tend to result in more work rather than less, because we'll have to wake the ranges up again shortly which incurs an additional Raft append.

Epic: none
Release note: None
  
**kvserver: move eager expiration lease extension to Raft scheduler**

This patch makes the Raft scheduler responsible for eagerly extending all expiration-based leases, replacing the store lease renewer. Previously, this was contingent on `kv.expiration_leases_only.enabled`, and by default only the meta/liveness ranges were eagerly extended by the store lease renewer.

Touches #98433.

Epic: none
Release note: None
  
**kvserver: don't extend leases during request processing**

This patch removes expiration lease extension during request processing, since the Raft scheduler now always eagerly extends expiration leases. Previously, this was contingent on `kv.expiration_leases_only.enabled`.

Epic: none
Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
craig bot pushed a commit that referenced this issue Apr 27, 2023
100432: kvserver: don't add range info for lease requests r=erikgrinaker a=erikgrinaker

Lease requests are submitted directly to the local replica, bypassing the DistSender. They therefore don't set `ClientRangeInfo`, and don't need range info returned. However, `maybeAddRangeInfoToResponse()` would interpret this as stale range info and always populate the response with range info, which can be expensive -- especially with expiration leases that are frequently extended.

In an idle cluster with 20,000 eagerly extended expiration leases, this was responsible for 1.4% of total CPU usage.

Touches #98433.

Epic: none
Release note: None

102416: githubpost: add team label to issues r=erikgrinaker a=erikgrinaker

**TEAMS: add `T-kv` label for KV**

Epic: none
Release note: None

**githubpost: add team label to issues**

This patch automatically adds team labels to created issues. Previously, this was done by Blathers based on the project board, but not all teams use project boards -- and we may as well just set the correct label to begin with.

Epic: none
Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
craig bot pushed a commit that referenced this issue Apr 29, 2023
100433: batcheval: don't declare latches for `RequestLease` r=erikgrinaker a=erikgrinaker

`RequestLease` does not take out latches. However, it still declared latches to appease the spanset asserter. This has a non-negligible cost with eagerly extended expiration leases, about 1.4% of CPU usage with 20,000 idle ranges. This patch removes latch declaration and disables spanset assertions.

Touches #98433.

Epic: none
Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
craig bot pushed a commit that referenced this issue Apr 30, 2023
100435: kvserver: avoid unnecessary tracing span in `requestLeaseAsync()` r=erikgrinaker a=erikgrinaker

There's no need to create a tracing span when the client isn't running a trace.

Touches #98433.

Epic: none
Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
craig bot closed this as completed in 117dc40 on May 12, 2023