kv: log slow requests on replica level in addition to range level #117117

Merged
merged 1 commit into cockroachdb:master from log_slow_requests
Feb 6, 2024

Conversation

Contributor

@shralex shralex commented Dec 27, 2023

Previously, slow requests were only logged at the range level, and the logs did not indicate which replica was slow. Moreover, the SlowRPC metric attempted to represent the number of requests currently being retried; however, it was tracked at the range level and therefore missed a second layer of replica-level retries happening underneath.

This PR adds logging at the replica level, removes a confusing log line, and changes the metric to count the number of slow requests in a simpler manner.

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-33510
Fixes: #114431
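
For orientation, the two levels being distinguished here can be pictured roughly as follows. This is only a sketch under assumed names (tBegin, tReplicaBegin, curReplica, attempts, and the two threshold constants), not the PR's exact code:

// Range level: a batch may be retried against several replicas before it
// returns; the existing warning fires when the attempt as a whole is slow.
if dur := timeutil.Since(tBegin); dur > slowDistSenderRangeThreshold {
	log.Warningf(ctx, "slow range RPC: %.2fs elapsed across %d attempts", dur.Seconds(), attempts)
}

// Replica level (what this PR adds, sketched): each individual RPC to a
// replica is timed as well, so the log identifies which replica was slow.
if dur := timeutil.Since(tReplicaBegin); dur > slowDistSenderReplicaThreshold {
	log.Warningf(ctx, "slow RPC to replica %s: %.2fs elapsed", curReplica, dur.Seconds())
}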

@shralex shralex requested a review from a team as a code owner December 27, 2023 17:37
@cockroach-teamcity
Member

This change is Reviewable

Collaborator

@andrewbaptist andrewbaptist left a comment

See the comments inline. I'm concerned that if we ONLY count individual requests and not the retries, we won't correctly detect problems with requests that are retried for a long time before completing, and we could actually lose observability. It would also be nice to separate out requests so that we can use a shorter timeout like the one you use here (3s) without triggering it too often.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @shralex)


pkg/kv/kvclient/kvcoord/dist_sender.go line 312 at r1 (raw file):

	InLeaseTransferBackoffs            *metric.Counter
	RangeLookups                       *metric.Counter
	SlowRPCs                           *metric.Counter

I don't think it is generally safe to convert from a Gauge to a Counter. At a minimum, spin up a mixed-mode cluster and verify that both the AdminUI and Grafana can handle it while it is running in this mode.

Since this fundamentally changes the meaning, from a current count (a gauge) to a total ever seen (a counter), it makes sense to rename the metric and deprecate the old gauge.

Looking at the other changes, I think it is best to create distsender.slow.replica and distsender.slow.batch.
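
A minimal sketch of what those two counters could look like, assuming the pkg/util/metric Metadata/NewCounter API; the metric names follow the suggestion above, and the SlowReplicaRPCs/SlowBatches field names are placeholders:

var (
	metaDistSenderSlowReplicaRPCs = metric.Metadata{
		Name:        "distsender.slow.replica",
		Help:        "Number of RPCs to individual replicas that exceeded the per-replica slowness threshold",
		Measurement: "RPCs",
		Unit:        metric.Unit_COUNT,
	}
	metaDistSenderSlowBatches = metric.Metadata{
		Name:        "distsender.slow.batch",
		Help:        "Number of batches that stayed slow across all replica-level retries for a range",
		Measurement: "Batches",
		Unit:        metric.Unit_COUNT,
	}
)

// Wired into the DistSender metrics struct (sketch):
//   SlowReplicaRPCs: metric.NewCounter(metaDistSenderSlowReplicaRPCs),
//   SlowBatches:     metric.NewCounter(metaDistSenderSlowBatches),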


pkg/kv/kvclient/kvcoord/dist_sender.go line 1921 at r1 (raw file):

		prevTok = routingTok
		reply, err = ds.sendToReplicas(ctx, ba, routingTok, withCommit)
		if dur := timeutil.Since(tBegin); dur > slowDistSenderRangeThreshold && !tBegin.IsZero() {

I'm confused about why you aren't incrementing the metric here anymore. If you create the second metric, it should just be incremented when this longer timeout is hit. Maybe this is intentional, since you don't care how long the entire request takes, but then you will miss counting requests that retry for a long period of time.
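
Concretely, the change being asked about might look something like this sketch; whether SlowRPCs stays a single counter or becomes a dedicated batch-level metric is exactly the open question above:

reply, err = ds.sendToReplicas(ctx, ba, routingTok, withCommit)
if dur := timeutil.Since(tBegin); dur > slowDistSenderRangeThreshold {
	// Keep counting batches that are slow end-to-end, across all
	// replica-level retries, in addition to any per-replica accounting.
	ds.metrics.SlowRPCs.Inc(1)
	log.Warningf(ctx, "slow range RPC: %.2fs elapsed across retries", dur.Seconds())
}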


pkg/kv/kvclient/kvcoord/dist_sender.go line 2189 at r1 (raw file):

// slowDistSenderRangeThreshold is a latency threshold for logging slow requests to a range,
// potentially involving RPCs to multiple replicas of the range.
const slowDistSenderRangeThreshold = time.Minute

nit: I would rename these to slowDistSenderBatchTimeout and slowDistSenderReplicaTimeout

When I first read this I thought this was referring to Scan vs Point requests.
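
Spelled out, the rename would amount to something like the following sketch, using the names proposed above and the values discussed in this review:

// slowDistSenderBatchTimeout is the latency threshold above which a whole
// batch, potentially spanning retries to multiple replicas of a range, is
// logged as slow.
const slowDistSenderBatchTimeout = time.Minute

// slowDistSenderReplicaTimeout is the latency threshold above which a single
// replica-bound RPC is logged as slow.
const slowDistSenderReplicaTimeout = 3 * time.Second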


pkg/kv/kvclient/kvcoord/dist_sender.go line 2419 at r1 (raw file):

			var s redact.StringBuilder
			// Note that these RPCs may or may not have succeeded. Errors are counted separately below.
			ds.metrics.SlowRPCs.Inc(1)

nit: some requests (for instance ExportRequest) often take a long time to return. I think we will trigger the SlowRPC here a lot if we include all requests. There are alternatives to consider (the first is sketched after this list):

  1. Create a static map of RequestType to SlowThreshold.
  2. Exclude requests whose AdmissionHeader priority is lower than NormalPriority.
  3. Create separate stats and timeouts based on the admission header priority.
  4. (most complex) Track an "expected time" for each type of request and, for Batches with multiple requests, track the time of the "slowest" request in the batch. When it logs/warns, it could print both the expected and actual time. This could be hard to get right, and I'm not sure if it's worth the effort.
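
A rough sketch of option 1, assuming request types are identified by kvpb.Method; the specific thresholds here are invented for illustration:

// Hypothetical per-request-type slowness thresholds. Methods not listed
// fall back to the default replica-level threshold.
var slowThresholdByMethod = map[kvpb.Method]time.Duration{
	kvpb.Export: 5 * time.Minute, // bulk work is expected to be slow
	kvpb.Scan:   10 * time.Second,
	kvpb.Get:    3 * time.Second,
}

func slowThresholdFor(m kvpb.Method) time.Duration {
	if d, ok := slowThresholdByMethod[m]; ok {
		return d
	}
	return slowDistSenderReplicaThreshold
}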

@shralex shralex force-pushed the log_slow_requests branch from ecc1a99 to 66af98c Compare January 4, 2024 20:11
Contributor Author

@shralex shralex left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist)


pkg/kv/kvclient/kvcoord/dist_sender.go line 312 at r1 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

I don't think it is generally safe to convert from a Gauge to a Counter. At a minimum, spin up a mixed-mode cluster and verify that both the AdminUI and Grafana can handle it while it is running in this mode.

Since this fundamentally changes the meaning, from a current count (a gauge) to a total ever seen (a counter), it makes sense to rename the metric and deprecate the old gauge.

Looking at the other changes, I think it is best to create distsender.slow.replica and distsender.slow.batch.

Ack. I added a new metric and kept the previous one unchanged. Regarding replica vs. batch, this might not be the right distinction, since my understanding is that what we send to individual replicas are also request batches.


pkg/kv/kvclient/kvcoord/dist_sender.go line 1921 at r1 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

I'm confused about why you aren't incrementing the metric here anymore. If you create the second metric, it should just be incremented when this longer timeout is hit. Maybe this is intentional, since you don't care how long the entire request takes, but then you will miss counting requests that retry for a long period of time.

I reverted my changes here.


pkg/kv/kvclient/kvcoord/dist_sender.go line 2189 at r1 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

nit: I would rename these to slowDistSenderBatchTimeout and slowDistSenderReplicaTimeout

When I first read this I thought this was referring to Scan vs Point requests.

As mentioned above, batch versus replica might not be the right distinction. We can discuss offline.


pkg/kv/kvclient/kvcoord/dist_sender.go line 2419 at r1 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

nit: some requests (for instance ExportRequest) often take a long time to return. I think we will trigger the SlowRPC here a lot if we include all requests. There are alternatives to consider:

  1. Create a static map of RequestType to SlowThreshold.
  2. Exclude requests whose AdmissionHeader priority is lower than NormalPriority.
  3. Create separate stats and timeouts based on the admission header priority.
  4. (most complex) Track an "expected time" for each type of request and, for Batches with multiple requests, track the time of the "slowest" request in the batch. When it logs/warns, it could print both the expected and actual time. This could be hard to get right, and I'm not sure if it's worth the effort.

Thanks for these suggestions! I did (2).

@shralex shralex force-pushed the log_slow_requests branch 2 times, most recently from 66d10e8 to 88d1063 Compare January 18, 2024 22:57
@andrewbaptist andrewbaptist self-requested a review January 19, 2024 18:05
Collaborator

@andrewbaptist andrewbaptist left a comment

:lgtm:

Thanks for the changes!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

Member

@nvanbenschoten nvanbenschoten left a comment

Reviewed 1 of 2 files at r1, 2 of 2 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andrewbaptist and @shralex)


pkg/kv/kvclient/kvcoord/dist_sender.go line 163 at r3 (raw file):

	metaDistSenderSlowRPCs = metric.Metadata{
		Name: "requests.slow.distsender",
		Help: `Number of replica-bound RPCs currently stuck or retrying for a long time.

s/replica/range/


pkg/kv/kvclient/kvcoord/dist_sender.go line 2221 at r3 (raw file):

const slowDistSenderRangeThreshold = time.Minute

// slowDistSenderReplicaThreshold is a latency threshold for logging a slow RPC to a single replica.

nit: wrap at 80 chars?


pkg/kv/kvclient/kvcoord/dist_sender.go line 2446 at r3 (raw file):

		tBegin := timeutil.Now() // for slow log message
		br, err = transport.SendNext(ctx, ba)
		if admissionpb.WorkPriority(ba.AdmissionHeader.Priority) >= admissionpb.NormalPri {

Instead of gating this entire check on the admission control priority, should we instead just adjust where we log and whether or not we increment the metric? For example, for admissionpb.WorkPriority(ba.AdmissionHeader.Priority) < admissionpb.NormalPri we could skip the metric and only log.Eventf, so that the message will at least make it into any traces.
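
A sketch of that structure, assuming log.Eventf records into the active trace; the priority check is the one quoted above, while slowReplicaRPCWarningStr and SlowReplicaRPCs are placeholder names:

tBegin := timeutil.Now() // for slow log message
br, err = transport.SendNext(ctx, ba)
if dur := timeutil.Since(tBegin); dur > slowDistSenderReplicaThreshold {
	var s redact.StringBuilder
	slowReplicaRPCWarningStr(&s, ba, dur, err, br) // placeholder helper
	if admissionpb.WorkPriority(ba.AdmissionHeader.Priority) >= admissionpb.NormalPri {
		// Normal-and-above priority work updates the metric and warns, which
		// avoids noise from background work that is expected to be slow.
		ds.metrics.SlowReplicaRPCs.Inc(1) // placeholder metric
		log.Warningf(ctx, "%v", &s)
	} else {
		// Lower-priority work still records the message into the trace.
		log.Eventf(ctx, "%v", &s)
	}
}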


pkg/kv/kvclient/kvcoord/dist_sender.go line 2447 at r3 (raw file):

		br, err = transport.SendNext(ctx, ba)
		if admissionpb.WorkPriority(ba.AdmissionHeader.Priority) >= admissionpb.NormalPri {
			if dur := timeutil.Since(tBegin); dur > slowDistSenderReplicaThreshold && !tBegin.IsZero() {

When will tBegin be zero?


pkg/kv/kvclient/kvcoord/dist_sender_test.go line 4453 at r3 (raw file):

		exp := `slow RPC finished after 8.16s (120 attempts)`
		var s redact.StringBuilder
		slowRangeRPCReturnWarningStr(&s, dur, attempts)

Are we losing test coverage for slowRangeRPCReturnWarningStr?
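
For reference, keeping that coverage could look roughly like the following sketch, built around the quoted expectation; the real test's comparison helper and argument types may differ:

func TestSlowRangeRPCReturnWarningStr(t *testing.T) {
	dur := 8160 * time.Millisecond // renders as 8.16s
	attempts := int64(120)
	var s redact.StringBuilder
	slowRangeRPCReturnWarningStr(&s, dur, attempts)
	// Strip redaction markers for the comparison in this sketch.
	require.Equal(t, `slow RPC finished after 8.16s (120 attempts)`,
		string(s.RedactableString().StripMarkers()))
}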

Contributor Author

@shralex shralex left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andrewbaptist and @nvanbenschoten)


pkg/kv/kvclient/kvcoord/dist_sender.go line 163 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/replica/range/

Done


pkg/kv/kvclient/kvcoord/dist_sender.go line 2221 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: wrap at 80 chars?

Done


pkg/kv/kvclient/kvcoord/dist_sender.go line 2446 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Instead of gating this entire check on the admission control priority, should we instead just adjust where we log and whether or not we increment the metric? For example, for admissionpb.WorkPriority(ba.AdmissionHeader.Priority) < admissionpb.NormalPri we could skip the metric and only log.Eventf, so that the message will at least make it into any traces.

Done


pkg/kv/kvclient/kvcoord/dist_sender.go line 2447 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

When will tBegin be zero?

Removed


pkg/kv/kvclient/kvcoord/dist_sender_test.go line 4453 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Are we losing test coverage for slowRangeRPCReturnWarningStr?

Added it back

Member

@nvanbenschoten nvanbenschoten left a comment

:lgtm:

Reviewed 2 of 2 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @andrewbaptist and @shralex)


pkg/kv/kvclient/kvcoord/dist_sender.go line 2224 at r4 (raw file):

// slowDistSenderReplicaThreshold is a latency threshold for logging a slow RPC
// to a single replica.
const slowDistSenderReplicaThreshold = 3 * time.Second

3 seconds feels a little low to me. If we're going to make this a constant, then should we bump this to something a little less prone to triggering under normal conditions for expensive requests? Maybe 10 seconds?

Contributor Author

@shralex shralex left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @andrewbaptist and @nvanbenschoten)


pkg/kv/kvclient/kvcoord/dist_sender.go line 2224 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

3 seconds feels a little low to me. If we're going to make this a constant, then should we bump this to something a little less prone to triggering under normal conditions for expensive requests? Maybe 10 seconds?

Done

@shralex
Contributor Author

shralex commented Feb 5, 2024

bors r+

@craig craig bot merged commit 804d37e into cockroachdb:master Feb 6, 2024
9 checks passed
@craig
Contributor

craig bot commented Feb 6, 2024

Build succeeded:
