Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: backup-restore/mixed-version failed #112373

Closed
cockroach-teamcity opened this issue Oct 14, 2023 · 23 comments
Closed

roachtest: backup-restore/mixed-version failed #112373

cockroach-teamcity opened this issue Oct 14, 2023 · 23 comments
Labels
A-admission-control A-kv Anything in KV that doesn't belong in a more specific category. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-storage Storage Team X-noreuse Prevent automatic commenting from CI test failures
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Oct 14, 2023

roachtest.backup-restore/mixed-version failed with artifacts on release-23.2 @ abd56c6d6e9dbf131fc1fa0d3f8d1fbcfb7e4901:

(mixedversion.go:545).Run: mixed-version test failure while running step 29 (run "plan and run backups"): error waiting for job to finish: job 908356280488263681 failed with error: failed to run backup: exporting 230 ranges: export request timeout: operation "intent_resolver_ir_batcher.sendBatch" timed out after 1m0.041s (given timeout 1m0s): work kv deadline expired while waiting: deadline: 2023-10-14 10:20:57.098424558 +0000 UTC m=+377.360809249, start: 2023-10-14 10:19:57.098548 +0000 UTC m=+317.360932704, dur: 1m0.039413971s
test artifacts and logs in: /artifacts/backup-restore/mixed-version/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-32378

@cockroach-teamcity cockroach-teamcity added branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-disaster-recovery labels Oct 14, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Oct 14, 2023
@msbutler

This comment was marked as outdated.

@dt
Copy link
Member

dt commented Oct 16, 2023

A whole backup should not fail on this error

If it is taking a whole minute for one intent resolution batch request to respond, arguably failing the backup is correct, since something is busted and the backup just hanging on retries is only going to hide that something.

@sumeerbhola
Copy link
Collaborator

A backup export request failed due to an admission control timeout recently introduced in #109932 by @sumeerbhola

I don't remember adding a timeout. How do we know this timeout was due to AC queueing? Am I missing something?
This seems like another place where we need the observability mentioned in #112126 to understand the causes of the latency incurred by the request.
@adityamaru

@msbutler
Copy link
Collaborator

msbutler commented Oct 17, 2023

Apologies @sumeerbhola. Git blame fooled me. I no longer think the failure relates to #109932 as the limiter you added here wasn't even enabled. The timeout instead relates to #91815. I'll put this on kv's board to asses why intent resolution timed out.

@msbutler
Copy link
Collaborator

Sending this failure to kv to dig into why an export request failed due to:

operation "intent_resolver_ir_batcher.sendBatch" timed out after 1m0.041s (given timeout 1m0s): work kv deadline expired while waiting: deadline: 2023-10-14 10:20:57.098424558 +0000 UTC m=+377.360809249, start: 2023-10-14 10:19:57.098548 +0000 UTC m=+317.360932704, dur: 1m0.039413971s

Some context on this test: this backup timed out after several successful backups of the foreground workload. Note that this timeout occurred while all nodes in the cluster were running on binary version release-23.2 , but had yet to finalize to 23.2.

@msbutler msbutler added T-kv KV Team and removed T-admission-control Admission Control labels Oct 17, 2023
@sumeerbhola
Copy link
Collaborator

I no longer think the failure relates to #109932 as the limiter you added here wasn't even enabled.

kv.intent_resolver.batch.bypass_admission_control.enabled defaults to false, so AC is enabled for most intent resolution.

@andrewbaptist andrewbaptist added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Oct 19, 2023
@aadityasondhi
Copy link
Collaborator

Seeing a bunch of these right before the failure:

I231014 10:21:05.508193 671447 server/node.go:1368 ⋮ [T1,Vsystem,n3] 1491 +     0.000ms      0.000ms    === operation:/cockroach.roachpb.Internal/Batch
W231014 10:21:05.711642 600806 kv/kvserver/txnwait/queue.go:619 ⋮ [T1,Vsystem,n3,s3,r67/3:‹/Table/10{6/1-7}›] 1492  pusher ‹3b1b0f28›: have been waiting 60.01s for pushee ‹fdfe2850›
W231014 10:21:05.920440 601174 kv/kvserver/txnwait/queue.go:619 ⋮ [T1,Vsystem,n3,s3,r67/3:‹/Table/10{6/1-7}›] 1493  pusher ‹ff90cb52›: have been waiting 60.00s for pushee ‹fdfe2850›
W231014 10:21:06.045302 601168 kv/kvserver/txnwait/queue.go:619 ⋮ [T1,Vsystem,n3,s3,r67/3:‹/Table/10{6/1-7}›] 1494  pusher ‹6bc668a4›: have been waiting 60.00s for pushee ‹639140cd›
W231014 10:21:06.262348 601265 kv/kvserver/txnwait/queue.go:619 ⋮ [T1,Vsystem,n3,s3,r67/3:‹/Table/10{6/1-7}›] 1495  pusher ‹c8addbf1›: have been waiting 60.14s for pushee ‹7499a0b9›
W231014 10:21:06.262425 601318 kv/kvserver/txnwait/queue.go:619 ⋮ [T1,Vsystem,n3,s3,r67/3:‹/Table/10{6/1-7}›] 1496  pusher ‹2c283425›: have been waiting 60.12s for pushee ‹3555f153›
W231014 10:21:06.287687 601531 kv/kvserver/txnwait/queue.go:619 ⋮ [T1,Vsystem,n3,s3,r67/3:‹/Table/10{6/1-7}›] 1497  pusher ‹bddece9c›: have been waiting 60.00s for pushee ‹4f16652a›
W231014 10:21:06.478808 601750 kv/kvserver/txnwait/queue.go:619 ⋮ [T1,Vsystem,n3,s3,r67/3:‹/Table/10{6/1-7}›] 1498  pusher ‹d3d77518›: have been waiting 60.00s for pushee ‹4f16652a›

It's not a clear signal to me yet for anything, but I am posting it here since it seems like it could be related.

@blathers-crl blathers-crl bot added the T-storage Storage Team label Oct 25, 2023
@nvanbenschoten
Copy link
Member

Moving to Storage, now that AC has moved over. Please move back over if this appears to be KV related.

@miraradeva
Copy link
Contributor

@sumeerbhola thanks for taking a look!

If we have a tsdump we could probably make an educated guess.

There is a tsdump in the artifacts that Michael uploaded (last line of his latest comment).

@blathers-crl blathers-crl bot added the T-storage Storage Team label Nov 8, 2023
@sumeerbhola
Copy link
Collaborator

@nvanbenschoten how can I tell which (node,store) pair was slow to respond from these log statements?

W231014 10:20:57.138825 310484 kv/kvclient/kvcoord/dist_sender.go:1887 ⋮ [T1,Vsystem,n4,s4] 321  slow range RPC: have been waiting 60.04s (1 attempts) for RPC ResolveIntent [/Table/114/1/‹90›/‹2›/‹-3068›/‹2›/‹0›,/Min), ResolveIntent [/Table/114/1/‹90›/‹2›/‹-3068›/‹3›/‹0›,/Min), ResolveIntent [/Table/114/1/‹90›/‹2›/‹-3068›/‹4›/‹0›,/Min), ResolveIntent [/Table/114/1/‹90›/‹2›/‹-3068›/‹5›/‹0›,/Min), ResolveIntent [/Table/114/1/‹90›/‹2›/‹-3068›/‹6›/‹0›,/Min), ResolveIntent [/Table/114/1/‹90›/‹2›/‹-3068›/‹7›/‹0›,/Min), ResolveIntent [/Table/114/1/‹90›/‹2›/‹-3068›/‹8›/‹0›,/Min), ResolveIntent [/Table/114/1/‹90›/‹2›/‹-3068›/‹9›/‹0›,/Min), ResolveIntent [/Table/114/1/‹90›/‹2›/‹-3068›/‹10›/‹0›,/Min), ResolveIntent [/Table/114/1/‹90›/‹2›/‹-3068›/‹11›/‹0›,/Min), ResolveIntent [/Table/114/1/‹90›/‹2›/‹-3068›/‹12›/‹0›,/Min), ResolveIntent [/Table/114/1/‹90›/‹2›/‹-3068›/‹13›/‹0›,/Min), ResolveIntent [/Table/114/1/‹90›/‹2›/‹-3068›/‹14›/‹0›,/Min) to r171:‹/Table/11{4/1/79/5/-2431/8-7}› [(n1,s1):1, (n4,s4):5, (n3,s3):3, next=6, gen=21, sticky=1697278789.191696346,0]; resp: ‹(err: work kv deadline expired while waiting: deadline: 2023-10-14 10:20:57.098424558 +0000 UTC m=+377.360809249, start: 2023-10-14 10:19:57.098548 +0000 UTC m=+317.360932704, dur: 1m0.039413971s)›
W231014 10:20:57.138898 310484 kv/kvclient/kvcoord/dist_sender.go:1897 ⋮ [T1,Vsystem,n4,s4] 322  slow RPC response: slow RPC finished after 60.04s (1 attempts)
W231014 10:20:57.143646 248440 kv/kvclient/kvcoord/dist_sender.go:1887 ⋮ [T1,Vsystem,n4,f‹9c324142›,job=908356280488263681,distsql.gateway=3,distsql.appname=‹$ internal-resume-job-908356280488263681›] 323  slow range RPC: have been waiting 61.87s (1 attempts) for RPC Export [/Table/114/1/‹79›/‹5›/‹-2431›/‹8›,/Table/114/2), [wait-policy: Error] to r171:‹/Table/11{4/1/79/5/-2431/8-7}› [(n1,s1):1, (n4,s4):5, (n3,s3):3, next=6, gen=21, sticky=1697278789.191696346,0]; resp: ‹(err: operation "intent_resolver_ir_batcher.sendBatch" timed out after 1m0.041s (given timeout 1m0s): work kv deadline expired while waiting: deadline: 2023-10-14 10:20:57.098424558 +0000 UTC m=+377.360809249, start: 2023-10-14 10:19:57.098548 +0000 UTC m=+317.360932704, dur: 1m0.039413971s)›

@nvanbenschoten
Copy link
Member

Those logs statements don't include that information, as they're coming from a level where we know which range to send to, but not which replica in that range to send to. We may end up trying multiple. We see that range 171 contains: (n1,s1):1, (n4,s4):5, (n3,s3):3.

@sumeerbhola
Copy link
Collaborator

Those logs statements don't include that information, as they're coming from a level where we know which range to send to, but not which replica in that range to send to. We may end up trying multiple.

What would it take to improve the logging here -- knowing which node/store one was waiting on when the deadline expired would be helpful? Is there an existing issue?

@sumeerbhola
Copy link
Collaborator

Summary: this is mostly working as intended, since intent resolution is subject to AC.

The two followups for AC team are:

Details:
As discussed in #112373 (comment) we can't be sure which node (among (n1,s1):1, (n4,s4):5, (n3,s3):3) was responsible for the AC queueing.
CPU utilization was 75% during the test and p99 goroutine scheduling latency was > 1ms, but runnable goroutine counts were low, so we expect regular CPU (which is what this intent resolution should run as) to not be a bottleneck and elastic CPU to be a bottleneck.
Screenshot 2023-11-13 at 10 18 57 AM
Which is corroborated by slots exhausted (regular requests) being low
Screenshot 2023-11-13 at 10 24 42 AM
And elastic CPU utilization running at its limit of 5% (which is the floor of elastic CPU utilization so can't be reduced further even if goroutine scheduling latency p99 > 1ms)
Screenshot 2023-11-13 at 10 25 55 AM

n1's IO tokens are completely exhausted
Screenshot 2023-11-13 at 10 28 08 AM
And it is not being able to keep its sublevel count at 10
Screenshot 2023-11-13 at 10 28 37 AM
There are various reasons it could be having difficulty controlling the sub-level count, including lack of replication admission control for regular writes it receives as a follower. There are no logs for n1, so we can't investigate further. It doesn't matter -- even if it was able to exercise more control, there would be queueing and the intent resolution would possibly wait (even more so if intent resolution was running as an elastic write).

@sumeerbhola
Copy link
Collaborator

I'm removing the release and GA blocker labels.
Assigning this to @nvanbenschoten only for creating an issue for #112373 (comment). Once that is done, I think we can close this.

@sumeerbhola sumeerbhola removed GA-blocker branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 labels Nov 13, 2023
@sumeerbhola
Copy link
Collaborator

investigate why intent resolution on behalf of ExportRequest seems to be running as regular instead of elastic write

I had forgotten that we bump up the priority when doing intent resolution. This is documented in

// - Txns that already hold locks/intents use an AC priority that is logically
// +1 of the original txn priority. Since there are gaps between the
// different user priorities (UserLowPri=-50, NormalPri=0, UserHighPri=50),
// work in one part of the key space that executes with NormalPri+1 will not
// get prioritized over another part of the key space executing at
// UserHighPri. And when different tenantIDs are being used for performance
// isolation, the priority is irrelevant. The actual implementation uses a
// step that is bigger than +1. See
// admissionpb.AdjustedPriorityWhenHoldingLocks.

So ExportRequest triggered intent resolution running as regular writes (and not elastic writes) is expected.

@nvanbenschoten
Copy link
Member

Assigning this to @nvanbenschoten only for creating an issue for #112373 (comment). Once that is done, I think we can close this.

This is now being tracked #114431, which will help us connect hanging requests to the specific replica that the request is hanging on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-admission-control A-kv Anything in KV that doesn't belong in a more specific category. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-storage Storage Team X-noreuse Prevent automatic commenting from CI test failures
Projects
No open projects
Archived in project
Archived in project
Archived in project
Development

No branches or pull requests

10 participants