solver: fix printing progress messages after merged edges #4347
Conversation
Implementation seems correct to me. Had a question about a potential alternative way of doing the same thing.
solver/jobs.go
Outdated
targetSt, ok := jl.actives[targetEdge.edge.Vertex.Digest()]
if !ok {
	return
}
Can we retrieve the state directly from the target edge? The edge itself contains a reference to activeOp, and this is always an instance of sharedOp. sharedOp contains a reference to the state, which I believe should be identical to this one.

It might be better if we add a method to the activeOp interface like this:

func State() *state

Then have sharedOp implement that to return the state. These are all private interfaces, so it shouldn't be a problem.
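For illustration, a minimal sketch of what that accessor could look like (a sketch only; the st field name on sharedOp is an assumption, not verified against the source):

// Sketch: extend the private activeOp interface with a state accessor.
type activeOp interface {
	// ...existing methods elided...

	// State returns the solver state backing this op.
	State() *state
}

// sharedOp already holds the state, so it can just return it.
func (op *sharedOp) State() *state {
	return op.st // assumes the state lives in a field named st
}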
I would also say I'm a bit uncomfortable with the idea of returning when this fails. It feels to me like we should just pass nil to setEdge and have setEdge check whether the state is nil. While I don't think anything could ever cause line 292 to trigger, if it does, I'd like it to just continue on through the old control flow for safety reasons.
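A rough sketch of that fallback control flow (the setEdge signature and field names here are assumptions for illustration, not the exact ones in jobs.go):

// Sketch: tolerate a missing state instead of returning early.
targetSt := jl.actives[targetEdge.edge.Vertex.Digest()] // nil if the vertex is not tracked

// Pass the possibly-nil state through and let the callee decide.
st.setEdge(e.Index, targetEdge, targetSt)

// ...and inside the callee, only wire up the progress writer when a state exists:
// if targetSt != nil { /* attach the redirected vertex's progress writer */ }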
It is somewhat intentional that this can't be accessed directly. We want to avoid the state machine in the scheduler having the capability to directly modify the builder state or the shared graph. This is why there are the edgeFactory and activeOp interfaces, and the scheduler always needs to go through these methods.

"I would also say I'm a bit uncomfortable with the idea of returning when this fails."

This indeed isn't ideal, but it is not new to this PR. It could error, but that is a bit tricky from the event loop and should be a separate PR. The main issue is that if this returns early, there is basically some reference-counting bug and the graph is in an inconsistent state. So even if we errored, it would be hard to fix up this state and probably something else would fail. We can't just skip the merge, because this function is only called after the merge has already completed, as a final pointer switch.
I think I misunderstood your second comment: you just suggested skipping setting the progress writer if this new state lookup fails for any reason. Yes, I can change that.
Force-pushed from 00c326e to 759b90c
I'll look at the other PR today and maybe we can get both merged.
When different LLB vertexes (e.g. parallel requests referencing local sources from different sessions) generate the same cache keys during solve, they are merged together into a single operation. Currently, when this happened, the progress for the vertex that was dropped got lost. This fixes that case by adding the progressWriter of the redirected vertex as a target to the source one. This should also work with multiple levels of merged edges, just with multiple nested multiwriters.

Signed-off-by: Tonis Tiigi <[email protected]>
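As a self-contained illustration of the nesting idea (a toy sketch only, not buildkit's actual progress package):

// Toy multi-writer: fan every progress message out to all attached targets.
package main

import "fmt"

type multiWriter struct {
	targets []func(string)
}

func (mw *multiWriter) Add(target func(string)) {
	mw.targets = append(mw.targets, target)
}

func (mw *multiWriter) Write(msg string) {
	for _, t := range mw.targets {
		t(msg)
	}
}

func main() {
	// a is the surviving vertex; c was merged into b, and b into a.
	a, b, c := &multiWriter{}, &multiWriter{}, &multiWriter{}
	a.Add(func(s string) { fmt.Println("a:", s) })
	b.Add(func(s string) { fmt.Println("b:", s) })
	c.Add(func(s string) { fmt.Println("c:", s) })

	// Each redirected vertex's writer becomes a target of the source one,
	// so multiple levels of merges just nest multi-writers.
	b.Add(c.Write)
	a.Add(b.Write)

	a.Write("progress for the merged operation") // reaches a, b, and c
}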
Force-pushed from 759b90c to e1da8b7
🤔 Ok, on attempting to upgrade the buildkit version in dagger (cc @sipsma), we hit a deadlock that seems to have been introduced by this patch. I think the reason we see it there (and not in buildkit tests) is that we run all our tests against one single buildkit solver, which results in a lot more edge merging than usual. (Of course, it's possible there's something downstream in dagger that would affect this, but given the way the solver package is structured, I can't think of anything in here that it would even be possible to override to change the behavior.)

I'm still trying to understand how, but this seems like the relevant part of the stack trace obtained from sending
Somehow, we can get into a state where

Will keep investigating as I have time, but any insights from anyone else would be interesting - still not 100% sure how to trigger a repro either.
Ok, did some more investigation (master...jedevc:buildkit:wip-repro-the-deadlock) to add some more relevant logs, and to detect this case and panic (instead of deadlocking, which was a lot harder to detect). Looking through the logs, I notice something suspicious:
(note, the "merging edge" messages, which I've updated to include the vertex digests - we attempt to merge Essentially, what appears to happen is that we have two vertexes, A and B. Both of these vertexes seem to have at least two edges, A1/A2 and B1/B2. Somehow, we perform a merge that merges A1 into B1, but then merges B2 into A2. (note, the edges are different in the From my understanding, this has always been a possibility? It's just with this PR, this caused us to deadlock, since the progress writer (which are on the states) then becomes nested. Thinking about this, it feels completely possible for edge merges to form these cycles (at least, I don't see any protection against this case), and we could just patch to skip the underlying progress cycles, but it does feel like a bit of an issue that we could get this cycle in the first place (if that's really what's happening). This would mean that an operation that could be cached isn't, and we'd be doing an operation twice, since I think each vertex would still be @tonistiigi @sipsma any ideas as to next steps here - am I missing something that should stop this case from appearing in the first place (so the root cause is something else more than this), or would this be it? If this is the root cause, then I can submit a patch that avoids these kind of cycles (though my simple hack to just reverse merging direction doesn't quite work, since we could potentially have cycles of N vertices instead of just 2, so we'd need to also avoid merging edges such that we'd get cycles of A->B->C->A). |
fixes #4330
depends on #4285

When different LLB vertexes (e.g. parallel requests referencing
local sources from different sessions) generate the same cache keys
during solve, they are merged together into a single operation.
Currently, when this happened, the progress for the vertex that
was dropped got lost. This fixes this case by adding the
progressWriter of the redirected vertex as a target to the
source one.
This should also work with multiple levels of merged edges,
just with multiple nested multiwriters.