
kvserver: lease maintenance scheduler #98433

Closed
erikgrinaker opened this issue Mar 11, 2023 · 1 comment · Fixed by #101098
Labels: A-kv (Anything in KV that doesn't belong in a more specific category), C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are the exception), T-kv (KV Team)



erikgrinaker commented Mar 11, 2023

We need a scheduler that eagerly maintains range leases.

There are two motivations for this:

  1. We don't want lease acquisition to be lazily driven by client requests.

    1. It can prevent ranges from ever acquiring a lease under certain failure modes, e.g. because the client request times out before the lease acquisition succeeds under disk stalls (see kvserver: disk stall prevents lease transfer #81100 and kvserver: persistent outage when liveness leaseholder deadlocks #80713).
    2. It adds unnecessary latency, e.g. in the case where a scan across many ranges has to sequentially acquire leases for each range, especially under failure modes such as network outages or disk stalls where the lease acquisition has to wait for network timeouts.
  2. When using only expiration leases, we want to ensure ranges keep a lease even when they see no traffic, to avoid lease acquisition latency on the next request to the range.

A few important aspects to consider:

  • Some ranges are more important than others. In particular, the meta and liveness range leases must get priority.
  • Expiration lease extensions are currently expensive (one Raft write per extension per range). We may want to allow very cold ranges to let their leases expire after some period of inactivity (e.g. minutes).
  • There may not be any point in quiescing ranges where we eagerly extend expiration leases (see kvserver: explore quiescence removal #94592 and kvserver: don't quiesce ranges with expiration-based leases #94454).
  • We already have other scheduler infrastructure that we could piggyback on: the Raft scheduler, and the queue infrastructure (e.g. a new lease queue).
  • We should try to honour lease preferences.
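
To make this a bit more concrete, here's a rough sketch of what a single maintenance pass could look like. None of these names exist in the codebase; the priorities, intervals, and cold-range policy are just placeholders for the decisions listed above.

```go
package leasesketch

import (
	"context"
	"sort"
	"time"
)

// rangeInfo is a hypothetical summary of a replica's lease state.
type rangeInfo struct {
	rangeID      int64
	isSystem     bool      // meta/liveness ranges get priority
	lastRequest  time.Time // used to let very cold ranges lapse
	leaseExpires time.Time
}

// maintainLeases is one pass of a toy scheduler: system ranges first,
// cold ranges are skipped so their leases can expire, and everything else
// is renewed once it is within renewMargin of expiring.
func maintainLeases(ctx context.Context, ranges []rangeInfo, renew func(context.Context, int64) error) {
	const coldThreshold = 5 * time.Minute // let colder ranges lapse
	const renewMargin = 3 * time.Second   // e.g. half of a 6s lease interval

	sort.Slice(ranges, func(i, j int) bool {
		return ranges[i].isSystem && !ranges[j].isSystem
	})
	now := time.Now()
	for _, r := range ranges {
		if !r.isSystem && now.Sub(r.lastRequest) > coldThreshold {
			continue // don't pay a Raft write for an idle range
		}
		if r.leaseExpires.Sub(now) > renewMargin {
			continue // lease still has plenty of runway
		}
		// Best-effort: errors are dropped and retried on the next pass.
		_ = renew(ctx, r.rangeID)
	}
}
```

Whether something like this runs as its own loop, a new lease queue, or inside the Raft scheduler is exactly the open question here.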

We already have a few similar mechanisms that should mostly be replaced by this scheduler:

  • The replicate queue acquires leases for ranges that don't have one.

    // TODO(kvoli): This check should fail if not the leaseholder. In the case
    // where we want to use the replicate queue to acquire leases, this should
    // occur before planning or as a result. In order to return this in
    // planning, it is necessary to simulate the prior change having succeeded
    // to then plan this lease transfer.
    if _, pErr := repl.redirectOnOrAcquireLease(ctx); pErr != nil {
        return change, pErr.GoError()
    }

  • Store.startLeaseRenewer() eagerly renews the expiration leases on the meta and liveness ranges to avoid high tail latencies.

    // startLeaseRenewer runs an infinite loop in a goroutine which regularly
    // checks whether the store has any expiration-based leases that should be
    // proactively renewed and attempts to continue renewing them.
    //
    // This reduces user-visible latency when range lookups are needed to serve a
    // request and reduces ping-ponging of r1's lease to different replicas as
    // maybeGossipFirstRange is called on each (e.g. #24753).
    func (s *Store) startLeaseRenewer(ctx context.Context) {
        // Start a goroutine that watches and proactively renews certain
        // expiration-based leases.

  • Replica.maybeExtendLeaseAsyncLocked() will extend an expiration lease when processing a request in the last half of the lease interval.

    func (r *Replica) maybeExtendLeaseAsyncLocked(ctx context.Context, st kvserverpb.LeaseStatus) {
        // Check shouldExtendLeaseRLocked again, because others may have raced to
        // extend the lease and beaten us here after we made the determination
        // (under a shared lock) that the extension was needed.
        if !r.shouldExtendLeaseRLocked(st) {
            return
        }
        if log.ExpensiveLogEnabled(ctx, 2) {
            log.Infof(ctx, "extending lease %s at %s", st.Lease, st.Now)
        }
        // We explicitly ignore the returned handle as we won't block on it.
        //
        // TODO(tbg): this ctx is likely cancelled very soon, which will in turn
        // cancel the lease acquisition (unless joined by another more long-lived
        // ctx). So this possibly isn't working as advertised (which only plays a role
        // for expiration-based leases, at least).
        _ = r.requestLeaseLocked(ctx, st)
    }
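
For reference, the "last half of the lease interval" trigger above is essentially a midpoint check. A minimal standalone sketch with illustrative names (the real `shouldExtendLeaseRLocked` operates on `kvserverpb.LeaseStatus` and HLC timestamps):

```go
package leasesketch

import "time"

// pastLeaseMidpoint reports whether a lease covering [start, expiration)
// has entered the second half of its interval and should be extended.
func pastLeaseMidpoint(now, start, expiration time.Time) bool {
	midpoint := start.Add(expiration.Sub(start) / 2)
	return !now.Before(midpoint)
}
```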

Jira issue: CRDB-25265

Epic: CRDB-25207


erikgrinaker commented Apr 3, 2023

Once #100430 lands, we'll be eagerly extending expiration-based leases during Raft ticks. That solves one of the objectives here. However, the replicate queue is still responsible for acquiring leases for ranges that don't have one, and for switching the lease type (e.g. when kv.expiration_leases_only.enabled is changed). This can take 10 minutes or more.

We should move this eager lease acquisition and type switching into the Raft tick loop as well, so they're more responsive. However, this requires the ranges to be unquiesced. It's possible that we'll want to only do this for expiration leases, and not for epoch leases, to maintain the same performance profile.

We also need to make sure we don't slam the cluster with a ton of concurrent lease requests, but instead stagger them. Maybe #100426 is the solution here, or maybe we need some other policy.
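
Purely as an illustration of staggering (independent of whatever #100426 ends up doing), each range could get a random offset within the remaining renewal window instead of all leases renewing at once:

```go
package leasesketch

import (
	"math/rand"
	"time"
)

// staggerRenewals returns a random renewal offset per range within the given
// window, so a burst of n lease renewals is spread out instead of firing together.
func staggerRenewals(n int, window time.Duration) []time.Duration {
	offsets := make([]time.Duration, n)
	for i := range offsets {
		offsets[i] = time.Duration(rand.Int63n(int64(window))) // assumes window > 0
	}
	return offsets
}
```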

We also need to try to honor lease preferences.

craig bot pushed a commit that referenced this issue Apr 4, 2023
100430: kvserver: eagerly extend expiration leases in Raft scheduler r=erikgrinaker a=erikgrinaker

This is submitted with a 23.1 backport in mind, and is therefore limited in scope. More work is needed to fully address #98433 in the 23.2 timeframe.

---

**kvserver: don't quiesce ranges with expiration-based leases**

This patch prevents quiescence with expiration-based leases, since we'll have to propose a lease extension shortly anyway. Out of caution, we only do this when `kv.expiration_leases_only.enabled` is `true`, with an eye towards 23.1 backports. For 23.2, we will always do this.

Touches #94592.

Epic: none
Release note: None
  
**kvserver: eagerly extend expiration leases in Raft scheduler**

This patch eagerly extends expiration leases in the Raft scheduler, during Raft ticks, but only if `kv.expiration_leases_only.enabled` is `true`. This is possible because we no longer allow ranges with expiration leases to quiesce in this case.

The old store lease renewer is disabled in this case, but still tracks local system ranges with expiration leases in case the setting is later changed. The store lease renewer is still retained for use by default, to allow backporting this change to 23.1 while keeping the existing default behavior.

Doing this during Raft scheduling is compelling, since it already does efficient goroutine scheduling (including liveness range prioritization) and holds the replica mutex.
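
Schematically, the hook in the tick path is something like the following (hypothetical names, not the actual diff):

```go
package leasesketch

// replicaSketch stands in for *Replica; the fields and methods below are
// hypothetical and only mirror the shape of the change described above.
type replicaSketch struct {
	hasExpirationLease bool
	pastLeaseMidpoint  bool
}

// maybeExtendLeaseAsync would propose a lease extension without blocking.
func (r *replicaSketch) maybeExtendLeaseAsync() {}

// tick shows where the extension check slots into the existing Raft tick:
// the scheduler already visits every unquiesced replica with its mutex held,
// so the extra check is close to free.
func (r *replicaSketch) tick() {
	// ... existing Raft tick work (elections, heartbeats, etc.) ...
	if r.hasExpirationLease && r.pastLeaseMidpoint {
		r.maybeExtendLeaseAsync()
	}
}
```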

Compared to an alternative implementation that relied on a separate, improved store lease renewer which also allowed ranges to quiesce between extensions, this has significantly better performance. The main reason for this is that unquiescing involves a Raft append to wake the leader. On an idle cluster with 20,000 ranges using only expiration leases, CPU usage was 26% vs. 30%, write IOPS was 500 vs. 900, and network traffic was 0.8 MB/s vs. 1.0 MB/s. KV benchmark results:

```
name                    old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=3    14.6k ± 3%   15.2k ± 4%  +3.72%  (p=0.003 n=8+8)
kv95/enc=false/nodes=3   30.6k ± 3%   32.2k ± 3%  +5.22%  (p=0.000 n=8+8)

name                    old p50      new p50      delta
kv0/enc=false/nodes=3     11.3 ± 3%    11.0 ± 0%  -2.76%  (p=0.006 n=8+6)
kv95/enc=false/nodes=3    4.59 ± 8%    4.55 ± 3%    ~     (p=0.315 n=8+8)

name                    old p95      new p95      delta
kv0/enc=false/nodes=3     31.5 ± 0%    30.2 ± 4%  -4.25%  (p=0.003 n=6+8)
kv95/enc=false/nodes=3    19.5 ± 7%    18.1 ± 4%  -7.28%  (p=0.000 n=8+7)

name                    old p99      new p99      delta
kv0/enc=false/nodes=3     48.7 ± 3%    47.4 ± 6%    ~     (p=0.059 n=8+8)
kv95/enc=false/nodes=3    32.8 ± 9%    30.2 ± 4%  -8.04%  (p=0.001 n=8+8)
```

Resolves #99812.
Touches #98433.

Epic: none
Release note: None
  
**kvserver: also acquire expiration leases in replicate queue**

Previously, the replicate queue would only reacquire epoch leases, and allow expiration leases to expire. It now reacquires all leases, since they are eagerly extended.

In the future, when quiescence is removed entirely, this responsibility should be moved to the Raft tick loop.

Epic: none
Release note: None
  
**kvserver: conditionally extend expiration leases during requests**

If `kv.expiration_leases_only.enabled` is `true`, this patch disables expiration lease extension during request processing, since the Raft scheduler will eagerly extend leases in this case. This will eventually be removed completely, but is kept for now with an eye towards a 23.1 backport.

Epic: none
Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
craig bot pushed a commit that referenced this issue Apr 26, 2023
100434: kvserver: grow goroutine stack for `requestLease` r=erikgrinaker a=erikgrinaker

We pre-grow the goroutine stack when processing batch requests via RPC, to avoid frequent stack copying. However, this did not apply to lease requests, which are submitted directly to the local replica. When eagerly extending 20,000 expiration leases, `requestLease` stack growth made up 4% of total CPU usage -- this patch reduces that to 1%.
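
For context, the pre-growth trick is roughly this: a non-inlined function with a large local buffer forces one up-front stack growth. This is a generic sketch, not the actual helper used in the codebase:

```go
package leasesketch

import "runtime"

// growStack forces the goroutine stack to grow to at least 16 KB by putting a
// large buffer in a frame the compiler is not allowed to inline away.
//
//go:noinline
func growStack() {
	var buf [16 << 10]byte
	// KeepAlive prevents the buffer from being optimized out of the frame.
	runtime.KeepAlive(&buf)
}
```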

Touches #98433.

Epic: none
Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
craig bot pushed a commit that referenced this issue Apr 27, 2023
101050: kvserver: move eager expiration lease extension to Raft scheduler r=erikgrinaker a=erikgrinaker

**kvserver: never quiesce expiration leases**

Previously, this was contingent on `kv.expiration_leases_only.enabled`. This patch therefore, in practice, disables quiescence for the meta/liveness ranges and for any ranges that are transferring an expiration lease before upgrading it to an epoch lease. Quiescing these will tend to result in more work rather than less, because we'll have to wake the ranges up again shortly which incurs an additional Raft append.

Epic: none
Release note: None
  
**kvserver: move eager expiration lease extension to Raft scheduler**

This patch makes the Raft scheduler responsible for eagerly extending all expiration-based leases, replacing the store lease renewer. Previously, this was contingent on `kv.expiration_leases_only.enabled`, and by default only the meta/liveness ranges were eagerly extended by the store lease renewer.

Touches #98433.

Epic: none
Release note: None
  
**kvserver: don't extend leases during request processing**

This patch removes expiration lease extension during request processing, since the Raft scheduler now always eagerly extends expiration leases. Previously, this was contingent on `kv.expiration_leases_only.enabled`.

Epic: none
Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
craig bot pushed a commit that referenced this issue Apr 27, 2023
100432: kvserver: don't add range info for lease requests r=erikgrinaker a=erikgrinaker

Lease requests are submitted directly to the local replica, bypassing the DistSender. They therefore don't set `ClientRangeInfo`, and don't need range info returned. However, `maybeAddRangeInfoToResponse()` would interpret this as stale range info and always populate the response with range info, which can be expensive -- especially with expiration leases that are frequently extended.

In an idle cluster with 20,000 eagerly extended expiration leases, this was responsible for 1.4% of total CPU usage.

Touches #98433.

Epic: none
Release note: None

102416: githubpost: add team label to issues r=erikgrinaker a=erikgrinaker

**TEAMS: add `T-kv` label for KV**

Epic: none
Release note: None

**githubpost: add team label to issues**

This patch automatically adds team labels to created issues. Previously, this was done by Blathers based on the project board, but not all teams use project boards -- and we may as well just set the correct label to begin with.

Epic: none
Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
craig bot pushed a commit that referenced this issue Apr 29, 2023
100433: batcheval: don't declare latches for `RequestLease` r=erikgrinaker a=erikgrinaker

`RequestLease` does not take out latches. However, it still declared latches to appease the spanset asserter. This has a non-negligible cost with eagerly extended expiration leases, about 1.4% of CPU usage with 20,000 idle ranges. This patch removes latch declaration and disables spanset assertions.

Touches #98433.

Epic: none
Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
craig bot pushed a commit that referenced this issue Apr 30, 2023
100435: kvserver: avoid unnecessary tracing span in `requestLeaseAsync()` r=erikgrinaker a=erikgrinaker

There's no need to create a tracing span when the client isn't running a trace.

Touches #98433.

Epic: none
Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
craig bot closed this as completed in 117dc40 on May 12, 2023