kv: log slow requests on replica level in addition to range level #117117

Merged
merged 1 commit into cockroachdb:master from log_slow_requests
Feb 6, 2024

Conversation

Contributor

@shralex shralex commented Dec 27, 2023

Previously, slow requests were only logged at the range level, and the logs did not indicate which replica was slow. Moreover, the SlowRPC metric attempted to represent the number of requests currently being retried; however, it was tracked at the range level and therefore missed a second layer of replica-level retries happening underneath.

This PR adds logging at the replica level, removes a confusing log line, and changes the metric to count the number of slow requests in a simpler manner.

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-33510
Fixes: #114431
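
For orientation, the two levels being distinguished here can be pictured roughly as follows. This is only a sketch under assumed names (tBegin, tReplicaBegin, curReplica, attempts, and the two threshold constants), not the PR's exact code:

// Range level: a batch may be retried against several replicas before it
// returns; the existing warning fires when the attempt as a whole is slow.
if dur := timeutil.Since(tBegin); dur > slowDistSenderRangeThreshold {
	log.Warningf(ctx, "slow range RPC: %.2fs elapsed across %d attempts", dur.Seconds(), attempts)
}

// Replica level (what this PR adds, sketched): each individual RPC to a
// replica is timed as well, so the log identifies which replica was slow.
if dur := timeutil.Since(tReplicaBegin); dur > slowDistSenderReplicaThreshold {
	log.Warningf(ctx, "slow RPC to replica %s: %.2fs elapsed", curReplica, dur.Seconds())
}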

@shralex shralex requested a review from a team as a code owner December 27, 2023 17:37
@cockroach-teamcity
Member

This change is Reviewable

Collaborator

@andrewbaptist andrewbaptist left a comment

See the comments inline. I'm concerned that if we ONLY count individual requests and not the retries, we won't correctly detect problems with requests that are retried for a long time before completing, and we could actually lose observability. It would also be nice to separate out requests so that we can use a shorter timeout like the one you use here (3s) without triggering it too often.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @shralex)


pkg/kv/kvclient/kvcoord/dist_sender.go line 312 at r1 (raw file):

	InLeaseTransferBackoffs            *metric.Counter
	RangeLookups                       *metric.Counter
	SlowRPCs                           *metric.Counter

I don't think it is generally safe to convert from a Gauge to a Counter. At a minimum, spin up a mixed-mode cluster and verify that both the AdminUI and Grafana can handle it while it is running in this mode.

Since this fundamentally changes the meaning, from a current count (a gauge) to a total ever seen (a counter), it makes sense to rename the metric and deprecate the old gauge.

Looking at the other changes, I think it is best to create distsender.slow.replica and distsender.slow.batch.
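
A minimal sketch of what those two counters could look like, assuming the pkg/util/metric Metadata/NewCounter API; the metric names follow the suggestion above, and the SlowReplicaRPCs/SlowBatches field names are placeholders:

var (
	metaDistSenderSlowReplicaRPCs = metric.Metadata{
		Name:        "distsender.slow.replica",
		Help:        "Number of RPCs to individual replicas that exceeded the per-replica slowness threshold",
		Measurement: "RPCs",
		Unit:        metric.Unit_COUNT,
	}
	metaDistSenderSlowBatches = metric.Metadata{
		Name:        "distsender.slow.batch",
		Help:        "Number of batches that stayed slow across all replica-level retries for a range",
		Measurement: "Batches",
		Unit:        metric.Unit_COUNT,
	}
)

// Wired into the DistSender metrics struct (sketch):
//   SlowReplicaRPCs: metric.NewCounter(metaDistSenderSlowReplicaRPCs),
//   SlowBatches:     metric.NewCounter(metaDistSenderSlowBatches),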


pkg/kv/kvclient/kvcoord/dist_sender.go line 1921 at r1 (raw file):

		prevTok = routingTok
		reply, err = ds.sendToReplicas(ctx, ba, routingTok, withCommit)
		if dur := timeutil.Since(tBegin); dur > slowDistSenderRangeThreshold && !tBegin.IsZero() {

I'm confused about why you aren't incrementing the metric here anymore. If you create the second metric, it should just be incremented when this longer timeout is hit. Maybe this is intentional, since you don't care how long the entire request takes, but then you will miss counting requests that retry for a long period of time.
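
Concretely, the change being asked about might look something like this sketch; whether SlowRPCs stays a single counter or becomes a dedicated batch-level metric is exactly the open question above:

reply, err = ds.sendToReplicas(ctx, ba, routingTok, withCommit)
if dur := timeutil.Since(tBegin); dur > slowDistSenderRangeThreshold {
	// Keep counting batches that are slow end-to-end, across all
	// replica-level retries, in addition to any per-replica accounting.
	ds.metrics.SlowRPCs.Inc(1)
	log.Warningf(ctx, "slow range RPC: %.2fs elapsed across retries", dur.Seconds())
}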


pkg/kv/kvclient/kvcoord/dist_sender.go line 2189 at r1 (raw file):

// slowDistSenderRangeThreshold is a latency threshold for logging slow requests to a range,
// potentially involving RPCs to multiple replicas of the range.
const slowDistSenderRangeThreshold = time.Minute

nit: I would rename these to slowDistSenderBatchTimeout and slowDistSenderReplicaTimeout

When I first read this I thought this was referring to Scan vs Point requests.
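
Spelled out, the rename would amount to something like the following sketch, using the names proposed above and the values discussed in this review:

// slowDistSenderBatchTimeout is the latency threshold above which a whole
// batch, potentially spanning retries to multiple replicas of a range, is
// logged as slow.
const slowDistSenderBatchTimeout = time.Minute

// slowDistSenderReplicaTimeout is the latency threshold above which a single
// replica-bound RPC is logged as slow.
const slowDistSenderReplicaTimeout = 3 * time.Second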


pkg/kv/kvclient/kvcoord/dist_sender.go line 2419 at r1 (raw file):

			var s redact.StringBuilder
			// Note that these RPCs may or may not have succeeded. Errors are counted separately below.
			ds.metrics.SlowRPCs.Inc(1)

nit: some requests (for instance ExportRequest) often take a long time to return. I think we will trigger the SlowRPC here a lot if we include all requests. There are alternatives to consider (the first is sketched after this list):

  1. Create a static map of RequestType to SlowThreshold.
  2. Exclude requests whose AdmissionHeader priority is lower than NormalPriority.
  3. Create separate stats and timeouts based on the admission header priority.
  4. (most complex) Track an "expected time" for each type of request and, for Batches with multiple requests, track the time of the "slowest" request in the batch. When it logs/warns, it could print both the expected and actual time. This could be hard to get right, and I'm not sure if it's worth the effort.
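
A rough sketch of option 1, assuming request types are identified by kvpb.Method; the specific thresholds here are invented for illustration:

// Hypothetical per-request-type slowness thresholds. Methods not listed
// fall back to the default replica-level threshold.
var slowThresholdByMethod = map[kvpb.Method]time.Duration{
	kvpb.Export: 5 * time.Minute, // bulk work is expected to be slow
	kvpb.Scan:   10 * time.Second,
	kvpb.Get:    3 * time.Second,
}

func slowThresholdFor(m kvpb.Method) time.Duration {
	if d, ok := slowThresholdByMethod[m]; ok {
		return d
	}
	return slowDistSenderReplicaThreshold
}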

@shralex shralex force-pushed the log_slow_requests branch from ecc1a99 to 66af98c Compare January 4, 2024 20:11
Contributor Author

@shralex shralex left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist)


pkg/kv/kvclient/kvcoord/dist_sender.go line 312 at r1 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

I don't think it is generally safe to convert from a Gauge to a Counter. At a minimum, spin up a mixed-mode cluster and verify that both the AdminUI and Grafana can handle it while it is running in this mode.

Since this fundamentally changes the meaning, from a current count (a gauge) to a total ever seen (a counter), it makes sense to rename the metric and deprecate the old gauge.

Looking at the other changes, I think it is best to create distsender.slow.replica and distsender.slow.batch.

Ack. I added a new metric and kept the previous one unchanged. Regarding replica vs. batch, this might not be the right distinction, since my understanding is that what we send to individual replicas are also request batches.


pkg/kv/kvclient/kvcoord/dist_sender.go line 1921 at r1 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

I'm confused about why you aren't incrementing the metric here anymore. If you create the second metric, it should just be incremented when this longer timeout is hit. Maybe this is intentional, since you don't care how long the entire request takes, but then you will miss counting requests that retry for a long period of time.

I reverted my changes here.


pkg/kv/kvclient/kvcoord/dist_sender.go line 2189 at r1 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

nit: I would rename these to slowDistSenderBatchTimeout and slowDistSenderReplicaTimeout

When I first read this I thought this was referring to Scan vs Point requests.

As mentioned above, batch versus replica might not be the right distinction. We can discuss offline.


pkg/kv/kvclient/kvcoord/dist_sender.go line 2419 at r1 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

nit: some requests (for instance ExportRequest) often take a long time to return. I think we will trigger the SlowRPC here a lot if we include all requests. There are alternatives to consider:

  1. Create a static map of RequestType to SlowThreshold.
  2. Exclude requests whose AdmissionHeader priority is lower than NormalPriority.
  3. Create separate stats and timeouts based on the admission header priority.
  4. (most complex) Track an "expected time" for each type of request and, for Batches with multiple requests, track the time of the "slowest" request in the batch. When it logs/warns, it could print both the expected and actual time. This could be hard to get right, and I'm not sure if it's worth the effort.

Thanks for these suggestions! I did (2).

@shralex shralex force-pushed the log_slow_requests branch 2 times, most recently from 66d10e8 to 88d1063 Compare January 18, 2024 22:57
@andrewbaptist andrewbaptist self-requested a review January 19, 2024 18:05
Collaborator

@andrewbaptist andrewbaptist left a comment

:lgtm:

Thanks for the changes!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

Member

@nvanbenschoten nvanbenschoten left a comment

Reviewed 1 of 2 files at r1, 2 of 2 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andrewbaptist and @shralex)


pkg/kv/kvclient/kvcoord/dist_sender.go line 163 at r3 (raw file):

	metaDistSenderSlowRPCs = metric.Metadata{
		Name: "requests.slow.distsender",
		Help: `Number of replica-bound RPCs currently stuck or retrying for a long time.

s/replica/range/


pkg/kv/kvclient/kvcoord/dist_sender.go line 2221 at r3 (raw file):

const slowDistSenderRangeThreshold = time.Minute

// slowDistSenderReplicaThreshold is a latency threshold for logging a slow RPC to a single replica.

nit: wrap at 80 chars?


pkg/kv/kvclient/kvcoord/dist_sender.go line 2446 at r3 (raw file):

		tBegin := timeutil.Now() // for slow log message
		br, err = transport.SendNext(ctx, ba)
		if admissionpb.WorkPriority(ba.AdmissionHeader.Priority) >= admissionpb.NormalPri {

Instead of gating this entire check on the admission control priority, should we instead just adjust where we log and whether or not we increment the metric? For example, for admissionpb.WorkPriority(ba.AdmissionHeader.Priority) < admissionpb.NormalPri we could skip the metric and only log.Eventf, so that the message will at least make it into any traces.
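
A sketch of that structure, assuming log.Eventf records into the active trace; the priority check is the one quoted above, while slowReplicaRPCWarningStr and SlowReplicaRPCs are placeholder names:

tBegin := timeutil.Now() // for slow log message
br, err = transport.SendNext(ctx, ba)
if dur := timeutil.Since(tBegin); dur > slowDistSenderReplicaThreshold {
	var s redact.StringBuilder
	slowReplicaRPCWarningStr(&s, ba, dur, err, br) // placeholder helper
	if admissionpb.WorkPriority(ba.AdmissionHeader.Priority) >= admissionpb.NormalPri {
		// Normal-and-above priority work updates the metric and warns, which
		// avoids noise from background work that is expected to be slow.
		ds.metrics.SlowReplicaRPCs.Inc(1) // placeholder metric
		log.Warningf(ctx, "%v", &s)
	} else {
		// Lower-priority work still records the message into the trace.
		log.Eventf(ctx, "%v", &s)
	}
}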


pkg/kv/kvclient/kvcoord/dist_sender.go line 2447 at r3 (raw file):

		br, err = transport.SendNext(ctx, ba)
		if admissionpb.WorkPriority(ba.AdmissionHeader.Priority) >= admissionpb.NormalPri {
			if dur := timeutil.Since(tBegin); dur > slowDistSenderReplicaThreshold && !tBegin.IsZero() {

When will tBegin be zero?


pkg/kv/kvclient/kvcoord/dist_sender_test.go line 4453 at r3 (raw file):

		exp := `slow RPC finished after 8.16s (120 attempts)`
		var s redact.StringBuilder
		slowRangeRPCReturnWarningStr(&s, dur, attempts)

Are we losing test coverage for slowRangeRPCReturnWarningStr?
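
For reference, keeping that coverage could look roughly like the following sketch, built around the quoted expectation; the real test's comparison helper and argument types may differ:

func TestSlowRangeRPCReturnWarningStr(t *testing.T) {
	dur := 8160 * time.Millisecond // renders as 8.16s
	attempts := int64(120)
	var s redact.StringBuilder
	slowRangeRPCReturnWarningStr(&s, dur, attempts)
	// Strip redaction markers for the comparison in this sketch.
	require.Equal(t, `slow RPC finished after 8.16s (120 attempts)`,
		string(s.RedactableString().StripMarkers()))
}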

Contributor Author

@shralex shralex left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andrewbaptist and @nvanbenschoten)


pkg/kv/kvclient/kvcoord/dist_sender.go line 163 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/replica/range/

Done


pkg/kv/kvclient/kvcoord/dist_sender.go line 2221 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: wrap at 80 chars?

Done


pkg/kv/kvclient/kvcoord/dist_sender.go line 2446 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Instead of gating this entire check on the admission control priority, should we instead just adjust where we log and whether or not we increment the metric? For example, for admissionpb.WorkPriority(ba.AdmissionHeader.Priority) < admissionpb.NormalPri we could skip the metric and only log.Eventf, so that the message will at least make it into any traces.

Done


pkg/kv/kvclient/kvcoord/dist_sender.go line 2447 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

When will tBegin be zero?

Removed


pkg/kv/kvclient/kvcoord/dist_sender_test.go line 4453 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Are we losing test coverage for slowRangeRPCReturnWarningStr?

Added it back

Member

@nvanbenschoten nvanbenschoten left a comment

:lgtm:

Reviewed 2 of 2 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @andrewbaptist and @shralex)


pkg/kv/kvclient/kvcoord/dist_sender.go line 2224 at r4 (raw file):

// slowDistSenderReplicaThreshold is a latency threshold for logging a slow RPC
// to a single replica.
const slowDistSenderReplicaThreshold = 3 * time.Second

3 seconds feels a little low to me. If we're going to make this a constant, then should we bump this to something a little less prone to triggering under normal conditions for expensive requests? Maybe 10 seconds?

Contributor Author

@shralex shralex left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @andrewbaptist and @nvanbenschoten)


pkg/kv/kvclient/kvcoord/dist_sender.go line 2224 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

3 seconds feels a little low to me. If we're going to make this a constant, then should we bump this to something a little less prone to triggering under normal conditions for expensive requests? Maybe 10 seconds?

Done

@shralex
Contributor Author

shralex commented Feb 5, 2024

bors r+

@craig craig bot merged commit 804d37e into cockroachdb:master Feb 6, 2024
9 checks passed
@craig
Contributor

craig bot commented Feb 6, 2024

Build succeeded:
