solver: fix printing progress messages after merged edges #4347
Conversation
Implementation seems correct to me. Had a question about a potential alternative way of doing the same thing.
solver/jobs.go
Outdated
targetSt, ok := jl.actives[targetEdge.edge.Vertex.Digest()]
if !ok {
	return
}
Can we retrieve the state directly from the target edge? The edge itself contains a reference to activeOp, and this is always an instance of sharedOp. sharedOp contains a reference to the state, which I believe should be identical to this one.

It might be better if we add a method to the activeOp interface like this:

func State() *state

Then have sharedOp implement that to return the state. These are all private interfaces, so it shouldn't be a problem.
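For illustration, a minimal sketch of what that accessor could look like (a sketch only; the st field name on sharedOp is an assumption, not verified against the source):

// Sketch: extend the private activeOp interface with a state accessor.
type activeOp interface {
	// ...existing methods elided...

	// State returns the solver state backing this op.
	State() *state
}

// sharedOp already holds the state, so it can just return it.
func (op *sharedOp) State() *state {
	return op.st // assumes the state lives in a field named st
}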
I would also say I'm a bit uncomfortable with the idea of returning when this fails. It feels to me like we should just pass nil to setEdge and have setEdge check whether the state is nil. While I don't think anything could ever cause line 292 to trigger, if it does, I'd like it to just continue on through the old control flow for safety reasons.
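A rough sketch of that fallback control flow (the setEdge signature and field names here are assumptions for illustration, not the exact ones in jobs.go):

// Sketch: tolerate a missing state instead of returning early.
targetSt := jl.actives[targetEdge.edge.Vertex.Digest()] // nil if the vertex is not tracked

// Pass the possibly-nil state through and let the callee decide.
st.setEdge(e.Index, targetEdge, targetSt)

// ...and inside the callee, only wire up the progress writer when a state exists:
// if targetSt != nil { /* attach the redirected vertex's progress writer */ }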
It is somewhat intentional that this can't be accessed directly. We want to avoid the state machine in the scheduler having the capability to directly modify the builder state or the shared graph. This is why there are the edgeFactory and activeOp interfaces, and the scheduler always needs to go through these methods.

"I would also say I'm a bit uncomfortable with the idea of returning when this fails."

This indeed isn't ideal, but it is not new to this PR. It could error, but that is a bit tricky from the event loop and should be a separate PR. The main issue is that if this returns early, there is basically some reference-counting bug and the graph is in an inconsistent state. So even if we errored, it would be hard to fix up this state and probably something else would fail. We can't just skip the merge, because this function is only called after the merge has already completed, as a final pointer switch.
I think I misunderstood your second comment: you just suggested skipping setting the progress writer if this new state lookup fails for any reason. Yes, I can change that.
Force-pushed from 00c326e to 759b90c
I'll look at the other PR today and maybe we can get both merged.
When different LLB vertexes (e.g. parallel requests referencing local sources from different sessions) generate the same cache keys during solve, they are merged together into a single operation. Currently, when this happened, the progress for the vertex that was dropped got lost. This fixes that case by adding the progressWriter of the redirected vertex as a target to the source one. This should also work with multiple levels of merged edges, just with multiple nested multiwriters.

Signed-off-by: Tonis Tiigi <[email protected]>
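As a self-contained illustration of the nesting idea (a toy sketch only, not buildkit's actual progress package):

// Toy multi-writer: fan every progress message out to all attached targets.
package main

import "fmt"

type multiWriter struct {
	targets []func(string)
}

func (mw *multiWriter) Add(target func(string)) {
	mw.targets = append(mw.targets, target)
}

func (mw *multiWriter) Write(msg string) {
	for _, t := range mw.targets {
		t(msg)
	}
}

func main() {
	// a is the surviving vertex; c was merged into b, and b into a.
	a, b, c := &multiWriter{}, &multiWriter{}, &multiWriter{}
	a.Add(func(s string) { fmt.Println("a:", s) })
	b.Add(func(s string) { fmt.Println("b:", s) })
	c.Add(func(s string) { fmt.Println("c:", s) })

	// Each redirected vertex's writer becomes a target of the source one,
	// so multiple levels of merges just nest multi-writers.
	b.Add(c.Write)
	a.Add(b.Write)

	a.Write("progress for the merged operation") // reaches a, b, and c
}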
Force-pushed from 759b90c to e1da8b7
🤔 Ok, on attempting to upgrade the buildkit version in dagger (cc @sipsma), we hit a deadlock that seems to have been introduced by this patch. I think the reason we see it there (and not in buildkit tests) is that we run all our tests against one single buildkit solver, which results in a lot more edge merging than usual. (Of course, it's possible there's something downstream in dagger that would affect this, but given the way the solver package is structured, I can't think of anything in here that it would even be possible to override to change the behavior.)

I'm still trying to understand how, but this seems like the relevant part of the stack trace obtained from sending
Somehow, we can get into a state where

Will keep investigating as I have time, but any insights from anyone else would be interesting - still not 100% sure how to trigger a repro either.
Ok, did some more investigation (master...jedevc:buildkit:wip-repro-the-deadlock) to add some more relevant logs, and to detect this case and panic (instead of deadlocking, which was a lot harder to detect). Looking through the logs, I notice something suspicious:
(note, the "merging edge" messages, which I've updated to include the vertex digests - we attempt to merge Essentially, what appears to happen is that we have two vertexes, A and B. Both of these vertexes seem to have at least two edges, A1/A2 and B1/B2. Somehow, we perform a merge that merges A1 into B1, but then merges B2 into A2. (note, the edges are different in the From my understanding, this has always been a possibility? It's just with this PR, this caused us to deadlock, since the progress writer (which are on the states) then becomes nested. Thinking about this, it feels completely possible for edge merges to form these cycles (at least, I don't see any protection against this case), and we could just patch to skip the underlying progress cycles, but it does feel like a bit of an issue that we could get this cycle in the first place (if that's really what's happening). This would mean that an operation that could be cached isn't, and we'd be doing an operation twice, since I think each vertex would still be @tonistiigi @sipsma any ideas as to next steps here - am I missing something that should stop this case from appearing in the first place (so the root cause is something else more than this), or would this be it? If this is the root cause, then I can submit a patch that avoids these kind of cycles (though my simple hack to just reverse merging direction doesn't quite work, since we could potentially have cycles of N vertices instead of just 2, so we'd need to also avoid merging edges such that we'd get cycles of A->B->C->A). |
fixes #4330
depends on #4285

When different LLB vertexes (e.g. parallel requests referencing
local sources from different sessions) generate the same cache keys
during solve, they are merged together into a single operation.
Currently, when this happened, the progress for the vertex that
was dropped got lost. This fixes this case by adding the
progressWriter of the redirected vertex as a target to the
source one.
This should also work with multiple levels of merged edges,
just with multiple nested multiwriters.