
clusterresolver: fix deadlock when dns resolver responds inline with update or error at build time #6563

Merged: 4 commits, Aug 23, 2023

Conversation

@easwars (Contributor) commented Aug 17, 2023:

The clusterresolver LB policy currently deadlocks if the dns resolver reports an update or error inline at build time. This is because the dns resolver is built while holding a lock, and that same lock must be grabbed to handle an update or error from the resolver.
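For a sense of the shape of the bug, here is a minimal, self-contained sketch (the type and function names are hypothetical, not the actual clusterresolver code): the resolver is built while a mutex is held, and an inline report from the resolver tries to take the same mutex.

```go
package main

import "sync"

// dnsDiscoveryMechanism is a hypothetical stand-in for the component
// that owns the DNS resolver and its state.
type dnsDiscoveryMechanism struct {
	mu sync.Mutex
}

// onUpdate handles a resolver update; like the real handler, it needs
// the same lock that is held while the resolver is being built.
func (d *dnsDiscoveryMechanism) onUpdate() {
	d.mu.Lock() // deadlock: mu is already held by newDNSMechanism
	defer d.mu.Unlock()
	// ... process the update ...
}

// newDNSMechanism builds the resolver while holding mu. A resolver
// that reports inline (e.g. when handed a ready host:port) calls
// onUpdate before "building" returns, and Go mutexes are not
// reentrant, so the program deadlocks.
func newDNSMechanism() *dnsDiscoveryMechanism {
	d := &dnsDiscoveryMechanism{}
	d.mu.Lock()
	defer d.mu.Unlock()
	d.onUpdate() // simulates the resolver reporting inline at build time
	return d
}

func main() {
	newDNSMechanism() // fatal error: all goroutines are asleep - deadlock!
}
```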

This was not caught by our current tests because the dns resolver was being overridden with a fake one. I switched as many tests as possible to use the real dns resolver. I also ensure that the dns resolver pushes an update inline at build time, because I pass it the actual host:port and not a name to be resolved.

I ran into this issue while fixing some tests in the cds LB policy.

Fixes #6562

RELEASE NOTES:

  • clusterresolver: fix deadlock when dns resolver responds inline with update or error at build time

@easwars requested a review from zasweq Aug 17, 2023 23:59
@easwars added this to the 1.58 Release milestone Aug 17, 2023
@zasweq (Contributor) left a comment:

Some comments.

Comment on lines 104 to 117
```go
ret.serializer.Schedule(func(context.Context) {
	r, err := newDNS(resolver.Target{URL: *u}, ret, resolver.BuildOptions{})
	if err == nil {
		ret.dnsR = r
		return
	}

	if ret.logger.V(2) {
		ret.logger.Infof("Failed to build DNS resolver for target %q: %v", target, err)
	}
	ret.mu.Lock()
	ret.updateReceived = true
	ret.mu.Unlock()
	ret.topLevelResolver.onUpdate()
```
@zasweq (Contributor):

This seems correct, thinking about it. My musing is super minor here. Previously, when you called this function and an onUpdate() got called, this particular discovery mechanism was considered to have "received" configuration, which would trigger fallback if the DNS resolver hadn't sent anything yet. Now that happens async (we wait for the dns resolver to build before setting the bool). Oh, I guess we can't consider this discovery mechanism to have had a chance to receive an update before the dns resolver has had a chance to return results inline. I was mainly concerned that there's now a new time window between when we build this resource resolver and when this callback actually executes; previously it would build the config and trigger fallback immediately (in the case of no update sent, so no addrs), whereas now it just waits until onUpdate() is called here. But this wait seems minor and ok. So seems correct.

@easwars (Contributor, Author):

Thanks for this comment. I realized that I was not handling all cases correctly. Specifically, if a url parse failure happened, things would still deadlock with my previous commit.

So, I thought about this for a while, and came to the conclusion that the correct place to handle this is in the resourceResolver component. My current commit therefore:

  • makes onUpdate non-blocking, by pushing the call to generateLocked through a callback serializer
  • makes it possible for the cluster_resolver LB policy to notify the resourceResolver (in stop()) whether it is actually being closed

The use of the callback serializer is not absolutely required in the resourceResolver. I could just as well have pushed a signal onto an unbounded buffer and had a goroutine read from it, but using a callback serializer made it easier. Let me know if you have any questions or concerns about this approach.
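A rough, self-contained sketch of this approach, with a simplified stand-in for grpc-go's internal grpcsync.CallbackSerializer (the names below are illustrative, not the actual PR code): onUpdate only enqueues the generateLocked call and returns immediately, so it can no longer block on the resolver's mutex.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// serializer is a simplified stand-in for grpcsync.CallbackSerializer:
// callbacks run one at a time, in order, on a background goroutine.
type serializer struct {
	callbacks chan func(context.Context)
}

func newSerializer(ctx context.Context) *serializer {
	s := &serializer{callbacks: make(chan func(context.Context), 64)}
	go func() {
		for cb := range s.callbacks {
			cb(ctx)
		}
	}()
	return s
}

// Schedule enqueues cb; it never runs cb inline.
func (s *serializer) Schedule(cb func(context.Context)) {
	s.callbacks <- cb
}

type resourceResolver struct {
	mu         sync.Mutex
	serializer *serializer
}

// onUpdate is now non-blocking: it pushes the generateLocked call
// through the serializer instead of grabbing rr.mu synchronously.
func (rr *resourceResolver) onUpdate() {
	rr.serializer.Schedule(func(context.Context) {
		rr.mu.Lock()
		defer rr.mu.Unlock()
		rr.generateLocked()
	})
}

func (rr *resourceResolver) generateLocked() {
	fmt.Println("recomputing aggregated endpoints")
}

func main() {
	rr := &resourceResolver{serializer: newSerializer(context.Background())}
	rr.mu.Lock()
	rr.onUpdate() // safe even while rr.mu is held: it only enqueues
	rr.mu.Unlock()
	time.Sleep(100 * time.Millisecond) // give the queued callback time to run
}
```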

@zasweq (Contributor):

Oh yeah, I see now how a url parse error triggers the deadlock with onUpdate(). There was a discussion in the xDS chat about adding NACK validation in the xDS client, but it got tabled; with that, this error handling would no longer be needed. Let me think some more about what layer you put the buffer of callbacks at.

```go
if dr.dnsR != nil {
	dr.dnsR.Close()
}
dr.serializerCancel()
```
@zasweq (Contributor):

The documentation for the callback serializer makes it seem like once its context is cancelled, no more callbacks can be added ("It is guaranteed that no callbacks will be added once this context is canceled"), and reading the run() goroutine in that component backs that up: https://github.com/easwars/grpc-go/blob/02463732635a827362bcfc44c7169d1131336e85/internal/grpcsync/callback_serializer.go#L84. Should this come after the scheduling on line 150?

@easwars (Contributor, Author):

This change has been deleted.

The callback serializer's Schedule method returns a value indicating whether or not the callback was scheduled, and it guarantees that all scheduled callbacks are executed even if the provided context is cancelled. It also provides a Done() method which the caller can block on, after cancelling the context, to be 100% sure that the callback serializer has executed everything it had to and has freed up all its resources.
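A sketch of the shutdown contract being described, using the CallbackSerializer API as characterized in this thread (this would have to live inside the grpc-go module, since grpcsync is an internal package):

```go
package clusterresolver // sketch: grpcsync is importable only within grpc-go

import (
	"context"

	"google.golang.org/grpc/internal/grpcsync"
)

// drainSerializer illustrates the ordering contract: Schedule reports
// whether the callback was accepted, cancelling the context stops new
// callbacks from being accepted, callbacks that were already scheduled
// still run, and Done() is closed once the serializer has drained.
func drainSerializer() {
	ctx, cancel := context.WithCancel(context.Background())
	s := grpcsync.NewCallbackSerializer(ctx)

	accepted := s.Schedule(func(context.Context) {
		// Runs even if cancel() fires before this callback is picked up.
	})
	_ = accepted // true: scheduled before cancellation

	cancel()   // from here on, Schedule returns false
	<-s.Done() // blocks until every previously scheduled callback finishes
}
```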

@zasweq (Contributor) commented Aug 23, 2023:

Right. But wouldn't this recv happen (https://github.com/easwars/grpc-go/blob/02463732635a827362bcfc44c7169d1131336e85/internal/grpcsync/callback_serializer.go#L84), which opens the possibility of grabbing the close mu and running the subsequent code block before Schedule() does (https://github.com/easwars/grpc-go/blob/02463732635a827362bcfc44c7169d1131336e85/internal/grpcsync/callback_serializer.go#L98, setting closed to true), so that Schedule() then fails here: https://github.com/easwars/grpc-go/blob/02463732635a827362bcfc44c7169d1131336e85/internal/grpcsync/callback_serializer.go#L71? Anyway, this is no longer relevant, but I do think there was a misordering of operations here.

@zasweq assigned easwars and unassigned zasweq Aug 21, 2023
@easwars assigned zasweq and unassigned easwars Aug 22, 2023
Comment on lines +241 to +244
```go
if closing {
	rr.serializerCancel()
	<-rr.serializer.Done()
}
```
@zasweq (Contributor):

Is there a way to close the serializer without causing any queued, un-run callbacks to execute? I don't think there is, but should we add such a method to that type? It feels like a waste to execute all the potentially queued operations (i.e. generateLocked()) when we know the whole cluster resolver component is closing anyway.

@easwars (Contributor, Author):

Yes, there is currently no way to cause the serializer to shut down without running queued callbacks. In fact, the serializer in its first version used to do exactly that, but when I was making changes for channel idleness, I quickly realized that the more common case is the one where all scheduled callbacks are guaranteed to run.

It should be trivial for the user of the serializer to add some logic that turns the callback into a no-op when it knows that the context passed to the serializer has been cancelled.
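A minimal sketch of that guard (self-contained, with simplified stand-ins; the real serializer runs callbacks on its own goroutine and hands each one the context it was created with):

```go
package main

import (
	"context"
	"sync"
)

// serializer is a toy stand-in; it runs the callback inline with the
// context it was created with, purely to keep the sketch short.
type serializer struct{ ctx context.Context }

func (s *serializer) Schedule(cb func(context.Context)) { cb(s.ctx) }

type resourceResolver struct {
	mu         sync.Mutex
	serializer *serializer
}

func (rr *resourceResolver) generateLocked() { /* recompute aggregate state */ }

// onUpdate turns the queued work into a no-op once the serializer's
// context has been cancelled, so callbacks drained during shutdown
// cost essentially nothing.
func (rr *resourceResolver) onUpdate() {
	rr.serializer.Schedule(func(ctx context.Context) {
		if ctx.Err() != nil {
			return // shutting down: skip generateLocked
		}
		rr.mu.Lock()
		defer rr.mu.Unlock()
		rr.generateLocked()
	})
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	rr := &resourceResolver{serializer: &serializer{ctx: ctx}}
	cancel()      // simulate the component closing
	rr.onUpdate() // the callback observes cancellation and returns early
}
```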

@zasweq (Contributor):

Ok. So you think that, say, if you have 5 generateLocked() calls queued, it's fine to just run them before closing, even though each would search through all discovery mechanisms. Eh, whatever; it's minor, and in practice it won't scale out of proportion.

```diff
@@ -280,7 +280,7 @@ func (b *clusterResolverBalancer) handleErrorFromUpdate(err error, fromParent bool) {
 	// EDS resource was removed. No action needs to be taken for this, and we
 	// should continue watching the same EDS resource.
 	if fromParent && xdsresource.ErrType(err) == xdsresource.ErrorTypeResourceNotFound {
-		b.resourceWatcher.stop()
+		b.resourceWatcher.stop(false)
```
@zasweq (Contributor):

I'm having a bit of trouble following this logic (false plumbed down). Is it to support the use case described in this comment:

// Save the previous childrenMap to stop the children outside the mutex,

(i.e. reusing the LB policy in the future, so the callback serializer the resource_resolver holds onto is still active?)

@easwars (Contributor, Author):

It is basically to handle the comment here:

// stop() is called when the LB policy is closed or when the underlying

stop() is called when:

  • the CDS resource is deleted, or
  • the LB policy is being stopped

And in the first case, when the CDS resource is added back, the cluster_resolver will get a config update with new mechanisms and will need to process it.

I initially thought about cancelling the serializer unconditionally in stop() and recreating it in updateMechanisms(), instead of in newResourceResolver. But that was hard to do, because access to the serializer would then need to be protected by a mutex, and if we have to do that, there is no way to guarantee that onUpdate won't block on that mutex.
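Putting that together, a sketch of the stop() shape being described (field names are hypothetical; this is not the exact grpc-go code):

```go
package sketch

import "context"

// serializer mirrors just the two pieces of the CallbackSerializer
// surface used here.
type serializer struct{ done chan struct{} }

func (s *serializer) Done() <-chan struct{} { return s.done }

type resourceResolver struct {
	serializer       *serializer
	serializerCancel context.CancelFunc
}

// stop distinguishes its two callers. When the CDS resource is merely
// deleted (closing == false), the serializer stays alive so that a
// later config update with new mechanisms can still be processed.
// Only when the LB policy itself is being closed (closing == true) is
// the serializer cancelled and drained.
func (rr *resourceResolver) stop(closing bool) {
	// ... stop the underlying discovery mechanisms ...
	if closing {
		rr.serializerCancel()
		<-rr.serializer.Done()
	}
}
```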

@zasweq (Contributor):

Ok, sounds good, thanks for the explanation. It keeps its lifecycle coupled with this resource resolver type, which sounds fine since it's a buffer that the resource resolver type uses.

@zasweq assigned easwars and unassigned zasweq Aug 23, 2023
@easwars assigned zasweq and unassigned easwars Aug 23, 2023
@zasweq (Contributor) left a comment:

LGTM.

@zasweq assigned easwars and unassigned zasweq Aug 23, 2023
@easwars merged commit 4c9777c into grpc:master Aug 23, 2023
10 checks passed
@github-actions bot locked as resolved and limited conversation to collaborators Feb 20, 2024
Closes #6562: clusterresolver: deadlock when updating dns discovery mechanism