kv: log slow requests on replica level in addition to range level #117117
Conversation
Force-pushed from b3dcc05 to ecc1a99.
See the comments inline. I'm concerned that if we ONLY count individual requests and not the retries, we won't correctly detect problems with requests that are retried for a long time before completing, and we could actually lose observability. It would also be nice to separate out requests so that we can use a shorter timeout, like the 3s you use here, without triggering it too often.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @shralex)
pkg/kv/kvclient/kvcoord/dist_sender.go
line 312 at r1 (raw file):
InLeaseTransferBackoffs *metric.Counter
RangeLookups            *metric.Counter
SlowRPCs                *metric.Counter
I don't think it is generally safe to convert from Gauge to Counter. At a minimum, spin up a mixed-mode cluster and verify that both the AdminUI and Grafana can handle it when running in this mode. Since this fundamentally changes the meaning, from the current number (a gauge) to the total ever (a counter), it makes sense to rename this and deprecate the old gauge. Looking at the other changes, I think it is best to create distsender.slow.replica and distsender.slow.batch.
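For reference, a minimal sketch of "keep the old gauge, add a new cumulative counter", assuming the metric.Metadata / NewGauge / NewCounter APIs already used in this package; the new metric's name, help text, and field name are illustrative, not the final diff:

```go
var (
	// Existing metadata, left attached to the gauge (help text follows the
	// s/replica/range/ suggestion later in this review).
	metaDistSenderSlowRPCs = metric.Metadata{
		Name:        "requests.slow.distsender",
		Help:        "Number of range-bound RPCs currently stuck or retrying for a long time.",
		Measurement: "Requests",
		Unit:        metric.Unit_COUNT,
	}
	// Hypothetical metadata for a new cumulative, replica-level counter.
	metaDistSenderSlowReplicaRPCs = metric.Metadata{
		Name:        "distsender.slow.replica",
		Help:        "Number of slow RPCs to individual replicas, cumulative.",
		Measurement: "RPCs",
		Unit:        metric.Unit_COUNT,
	}
)

// In the metrics struct / constructor, roughly:
//   SlowRPCs        *metric.Gauge   // unchanged: current number of slow RPCs
//   SlowReplicaRPCs *metric.Counter // new: total slow replica RPCs ever
//
//   SlowRPCs:        metric.NewGauge(metaDistSenderSlowRPCs),
//   SlowReplicaRPCs: metric.NewCounter(metaDistSenderSlowReplicaRPCs),
```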
pkg/kv/kvclient/kvcoord/dist_sender.go
line 1921 at r1 (raw file):
prevTok = routingTok
reply, err = ds.sendToReplicas(ctx, ba, routingTok, withCommit)
if dur := timeutil.Since(tBegin); dur > slowDistSenderRangeThreshold && !tBegin.IsZero() {
I'm confused why you aren't incrementing the metric here anymore. If you create the second metric then it should just be incremented when this longer timeout is hit. Maybe this is intentional since you don't care how long it takes for the entire request, but then you will miss counting requests that retry for a long period of time.
pkg/kv/kvclient/kvcoord/dist_sender.go
line 2189 at r1 (raw file):
// slowDistSenderRangeThreshold is a latency threshold for logging slow requests to a range,
// potentially involving RPCs to multiple replicas of the range.
const slowDistSenderRangeThreshold = time.Minute
nit: I would rename these to slowDistSenderBatchTimeout and slowDistSenderReplicaTimeout. When I first read this, I thought it was referring to Scan vs Point requests.
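Whatever the final names, the distinction being drawn is between a whole range-level attempt (which may retry across replicas) and a single replica-bound RPC. Roughly, using the constant names that appear in later revisions of this diff (values as discussed in this review, so treat this as a sketch):

```go
// Range level: the full attempt against a range, potentially spanning
// retries across multiple replicas.
const slowDistSenderRangeThreshold = time.Minute

// Replica level: a single RPC to one replica (proposed at 3s here, bumped
// to 10s later in the review).
const slowDistSenderReplicaThreshold = 10 * time.Second
```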
pkg/kv/kvclient/kvcoord/dist_sender.go
line 2419 at r1 (raw file):
var s redact.StringBuilder
// Note that these RPC may or may not have succeeded. Errors are counted separately below.
ds.metrics.SlowRPCs.Inc(1)
nit: some requests (for instance ExportRequest) often take a long time to return. I think we will trigger the SlowRPC warning here a lot if we include all requests. There are alternatives to consider (a sketch of (2) follows this list):
1. Create a static map of RequestType to SlowThreshold.
2. Exclude requests that have an AdmissionHeader lower than NormalPriority.
3. Create separate stats and timeouts based on the admission header priority.
4. (most complex) Track an "expected time" for each type of request and, for Batches with multiple requests, track the time of the "slowest" request in the batch. When it logs/warns, it could print both the expected and actual time. This could be hard to get right and I'm not sure if it's worth the effort.
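A minimal sketch of alternative (2), which the author ends up choosing below: gate the replica-level slow check on the batch's admission priority. Only identifiers quoted in this review are used, except the counter field name and log message, which are hypothetical:

```go
tBegin := timeutil.Now() // for the slow-RPC log message
br, err = transport.SendNext(ctx, ba)
// Only count normal-or-higher priority work, so that intentionally
// long-running, low-priority requests (e.g. ExportRequest) don't dominate
// the slow-replica signal.
if admissionpb.WorkPriority(ba.AdmissionHeader.Priority) >= admissionpb.NormalPri {
	if dur := timeutil.Since(tBegin); dur > slowDistSenderReplicaThreshold {
		ds.metrics.SlowReplicaRPCs.Inc(1) // hypothetical counter name
		log.Warningf(ctx, "slow RPC to replica took %.2fs", dur.Seconds())
	}
}
```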
Force-pushed from ecc1a99 to 66af98c.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist)
pkg/kv/kvclient/kvcoord/dist_sender.go
line 312 at r1 (raw file):
Previously, andrewbaptist (Andrew Baptist) wrote…
I don't think it is generally safe to convert from Gauge to Counter. At a minimum, spin up a mixed-mode cluster and verify that both the AdminUI and Grafana can handle it when running in this mode. Since this fundamentally changes the meaning, from the current number (a gauge) to the total ever (a counter), it makes sense to rename this and deprecate the old gauge. Looking at the other changes, I think it is best to create distsender.slow.replica and distsender.slow.batch.
Ack. I added a new metric and kept the previous one unchanged. Regarding replica vs batch, this might not be the right distinction, since my understanding is that what we send to individual replicas are also request batches.
pkg/kv/kvclient/kvcoord/dist_sender.go
line 1921 at r1 (raw file):
Previously, andrewbaptist (Andrew Baptist) wrote…
I'm confused why you aren't incrementing the metric here anymore. If you create the second metric then it should just be incremented when this longer timeout is hit. Maybe this is intentional since you don't care how long it takes for the entire request, but then you will miss counting requests that retry for a long period of time.
I reverted my changes here.
pkg/kv/kvclient/kvcoord/dist_sender.go
line 2189 at r1 (raw file):
Previously, andrewbaptist (Andrew Baptist) wrote…
nit: I would rename these to slowDistSenderBatchTimeout and slowDistSenderReplicaTimeout. When I first read this, I thought it was referring to Scan vs Point requests.
As mentioned above, batch versus replica might not be the right distinction. We can discuss offline.
pkg/kv/kvclient/kvcoord/dist_sender.go
line 2419 at r1 (raw file):
Previously, andrewbaptist (Andrew Baptist) wrote…
nit: some requests (for instance ExportRequest) often take a long time to return. I think we will trigger the SlowRPC warning here a lot if we include all requests. There are alternatives to consider:
1. Create a static map of RequestType to SlowThreshold.
2. Exclude requests that have an AdmissionHeader lower than NormalPriority.
3. Create separate stats and timeouts based on the admission header priority.
4. (most complex) Track an "expected time" for each type of request and, for Batches with multiple requests, track the time of the "slowest" request in the batch. When it logs/warns, it could print both the expected and actual time. This could be hard to get right and I'm not sure if it's worth the effort.
Thanks for these suggestions! I did (2).
Force-pushed from 66d10e8 to 88d1063.
Reviewed 1 of 2 files at r1, 2 of 2 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andrewbaptist and @shralex)
pkg/kv/kvclient/kvcoord/dist_sender.go
line 163 at r3 (raw file):
metaDistSenderSlowRPCs = metric.Metadata{
    Name: "requests.slow.distsender",
    Help: `Number of replica-bound RPCs currently stuck or retrying for a long time.
s/replica/range/
pkg/kv/kvclient/kvcoord/dist_sender.go
line 2221 at r3 (raw file):
const slowDistSenderRangeThreshold = time.Minute
// slowDistSenderReplicaThreshold is a latency threshold for logging a slow RPC to a single replica.
nit: wrap at 80 chars?
pkg/kv/kvclient/kvcoord/dist_sender.go
line 2446 at r3 (raw file):
tBegin := timeutil.Now() // for slow log message
br, err = transport.SendNext(ctx, ba)
if admissionpb.WorkPriority(ba.AdmissionHeader.Priority) >= admissionpb.NormalPri {
Instead of gating this entire check on the admission control priority, should we instead just adjust where we log and whether or not we increment the metric? For example, we could skip the metric and use log.Eventf when admissionpb.WorkPriority(ba.AdmissionHeader.Priority) < admissionpb.NormalPri, so that the message will at least make it into any traces.
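A hedged sketch of that suggestion: always measure the RPC, but only bump the metric and warn for NormalPri-and-above work, while lower-priority requests still record the slowness into any active trace via log.Eventf (the counter field name and log message are again hypothetical):

```go
tBegin := timeutil.Now() // for the slow log message
br, err = transport.SendNext(ctx, ba)
if dur := timeutil.Since(tBegin); dur > slowDistSenderReplicaThreshold {
	if admissionpb.WorkPriority(ba.AdmissionHeader.Priority) >= admissionpb.NormalPri {
		ds.metrics.SlowReplicaRPCs.Inc(1) // hypothetical counter name
		log.Warningf(ctx, "slow replica RPC: took %.2fs", dur.Seconds())
	} else {
		// Low-priority work: skip the metric and the warning, but keep the
		// event so it shows up in traces.
		log.Eventf(ctx, "slow replica RPC: took %.2fs", dur.Seconds())
	}
}
```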
pkg/kv/kvclient/kvcoord/dist_sender.go
line 2447 at r3 (raw file):
br, err = transport.SendNext(ctx, ba)
if admissionpb.WorkPriority(ba.AdmissionHeader.Priority) >= admissionpb.NormalPri {
    if dur := timeutil.Since(tBegin); dur > slowDistSenderReplicaThreshold && !tBegin.IsZero() {
When will tBegin be zero?
pkg/kv/kvclient/kvcoord/dist_sender_test.go
line 4453 at r3 (raw file):
exp := `slow RPC finished after 8.16s (120 attempts)`
var s redact.StringBuilder
slowRangeRPCReturnWarningStr(&s, dur, attempts)
Are we losing test coverage for slowRangeRPCReturnWarningStr?
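If coverage did get dropped, a minimal sketch of restoring it, assuming the (dur time.Duration, attempts int64) signature suggested by the call quoted above and that this lives in the kvcoord package test file (so the testing, time, and redact imports are already present); the test name is hypothetical, and the expected string is the one from the existing test:

```go
func TestSlowRangeRPCReturnWarningStr(t *testing.T) {
	const exp = `slow RPC finished after 8.16s (120 attempts)`
	dur := 8160 * time.Millisecond
	attempts := int64(120)
	var s redact.StringBuilder
	slowRangeRPCReturnWarningStr(&s, dur, attempts)
	// Compare against the redaction-marker-free form of the output.
	if got := s.RedactableString().StripMarkers(); got != exp {
		t.Errorf("expected %q, got %q", exp, got)
	}
}
```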
Force-pushed from 88d1063 to c515cb7.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andrewbaptist and @nvanbenschoten)
pkg/kv/kvclient/kvcoord/dist_sender.go
line 163 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
s/replica/range/
Done
pkg/kv/kvclient/kvcoord/dist_sender.go
line 2221 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: wrap at 80 chars?
Done
pkg/kv/kvclient/kvcoord/dist_sender.go
line 2446 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Instead of gating this entire check on the admission control priority, should we instead just adjust where we log and whether or not we increment the metric? For example, we could skip the metric and use log.Eventf when admissionpb.WorkPriority(ba.AdmissionHeader.Priority) < admissionpb.NormalPri, so that the message will at least make it into any traces.
Done
pkg/kv/kvclient/kvcoord/dist_sender.go
line 2447 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
When will tBegin be zero?
Removed
pkg/kv/kvclient/kvcoord/dist_sender_test.go
line 4453 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Are we losing test coverage for slowRangeRPCReturnWarningStr?
Added it back
Reviewed 2 of 2 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @andrewbaptist and @shralex)
pkg/kv/kvclient/kvcoord/dist_sender.go
line 2224 at r4 (raw file):
// slowDistSenderReplicaThreshold is a latency threshold for logging a slow RPC
// to a single replica.
const slowDistSenderReplicaThreshold = 3 * time.Second
3 seconds feels a little low to me. If we're going to make this a constant, then should we bump this to something a little less prone to triggering under normal conditions for expensive requests? Maybe 10 seconds?
Force-pushed from c515cb7 to 05f2ab2.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @andrewbaptist and @nvanbenschoten)
pkg/kv/kvclient/kvcoord/dist_sender.go
line 2224 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
3 seconds feels a little low to me. If we're going to make this a constant, then should we bump this to something a little less prone to triggering under normal conditions for expensive requests? Maybe 10 seconds?
Done
Force-pushed from 05f2ab2 to a516b8a.
Force-pushed from a516b8a to b034714.
bors r+
Build succeeded.
Previously, slow requests were only logged at the range level, and the logs did not indicate which replica was slow. Moreover, the SlowRPC metric attempted to represent the number of requests currently being retried; however, it was tracked at the range level and therefore missed a second level of replica-level retries happening underneath.
This PR adds logging at the replica level, removes a confusing log line, and changes the metric to count the number of slow requests in a simpler manner.
Epic: https://cockroachlabs.atlassian.net/browse/CRDB-33510
Fixes: #114431