Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ringhash: handle config updates properly #5557

Merged
merged 5 commits into from
Aug 4, 2022

Conversation

easwars
Copy link
Contributor

@easwars easwars commented Aug 2, 2022

Fixes #5460

Fix a few things in the config handling routine of the ringhash LB policy:

  • config changes in the ring size should result in a new picker with a new ring
  • config changes should result in the internal copy of the config being updated

RELEASE NOTES:

  • ringhash: fix config update processing to recreate ring and picker when min/max ring size changes

@easwars easwars added this to the 1.49 Release milestone Aug 2, 2022
@easwars easwars requested a review from zasweq August 2, 2022 19:31
// picker is pushed on the channel.
func (s) TestUpdateClientConnState_NewRingSize(t *testing.T) {
addrs := []resolver.Address{{Addr: testBackendAddrStrs[0]}}
cc, b, p0 := setupTest(t, addrs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I know it's p0 elsewhere, but for some reason p1 and p2 make more sense in my mind than p0 and p1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


// If the ring configuration has changed, we need to regenerate the ring and
// send a new picker.
var regenerateRing bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In correspondence to your comment on my Outlier Detection PR:
regenerateRing := b.updateAddresses(s.ResolverState.Addresses)
if b.config == nil || b.config.MinRingSize != newConfig.MinRingSize || b.config.MaxRingSize != newConfig.MaxRingSize {
regenerateRing = true
}
b.config = newConfig

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Thanks.

}

if regenerateRing {
ring, err := newRing(b.subConns, b.config.MinRingSize, b.config.MaxRingSize)
Copy link
Contributor

@zasweq zasweq Aug 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes logic, and I don't know if this is correct. I still think it needs to write to b.ring. Otherwise it would persist a built ring map from the previous unrelated good update? Then subsequent ResolverError calls (line 318) and UpdateSubConnState() calls (line 369) call into regeneratePicker which reads b.ring, which would not be in the correct state and "synchronized" with the UpdateClientConnState that caused the SubConn error from the address list. Please triage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Earlier, when we were assigning to b.ring here and if err != nil, we were still calling ResolverError, but not sending a new picker. So, we would have still ended up using the old ring (from the old picker) to service RPCs. So, in that sense, i feel that the behavior has not changed. But either ways, I'm not sure if our existing behavior was right. So, I have posted on our chat room, let's see what comes out of it. Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, based on the discussion in the chat room, I've change our ring creation routine newRing() to not return an error.

return balancer.ErrBadResolverState
}
// Successful resolution; clear resolver error and return nil.
b.resolverErr = nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I don't know if this breaks it, but this doesn't clear it if hits line 295, whereas in master it did. Although it feels appropriate here, but please triage if this breaks anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one looks totally fine to me.

// TestUpdateClientConnState_NewRingSize tests the scenario where the ringhash
// LB policy receives new configuration which specifies new values for the ring
// min and max sizes. The test verifies that a new ring is created and a new
// picker is pushed on the channel.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: we're not testing whether the picker is pushed on "the channel". We're testing whether the new picker gets forwarded to this balancer's client conn, which in this case sends on the channel for verification purposes. Would prefer rewording "is pushed on the channel" to "sent to the ClientConn".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@zasweq zasweq assigned easwars and unassigned zasweq Aug 2, 2022
@easwars easwars assigned zasweq and unassigned easwars Aug 2, 2022
@zasweq zasweq assigned easwars and unassigned zasweq Aug 3, 2022
zasweq
zasweq previously approved these changes Aug 3, 2022
Copy link
Contributor

@zasweq zasweq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

// getWeightAttribute() returns 1 if the weight attribute is not found
// on the address. And since this function is guaranteed to be called
// with a non-empty subConns map, weightSum is guaranteed to be
// non-zero. So, we need not worry about divide by zero error here.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, super super minor nit: "need not worry about a divide"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

//
// Must be called with a non-empty subConns map.
func normalizeWeights(subConns *resolver.AddressMap) ([]subConnWithWeight, float64) {
var weightSum float64
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of keeping this uint32, and instead of having another var to convert it like in master, you convert in the denominator float64(weightsum) to avoid cast on line 119. I don't care/don't feel strongly about this but was just proposing a suggestion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}

// normalizeWeights divides all the weights by the sum, so that the total weight
// is 1.
func normalizeWeights(subConns *resolver.AddressMap) ([]subConnWithWeight, float64, error) {
//
// Must be called with a non-empty subConns map.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing "non-empty" encapsualtes nil and len(0)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right.

@@ -291,6 +272,29 @@ func (b *ringhashBalancer) UpdateClientConnState(s balancer.ClientConnState) err
b.ResolverError(errors.New("produced zero addresses"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not correct. This happens before b.subConns gets written to. Thus, line 303 won't hit guaranteed. Say you get a first update with 3 addresses, you make 3 subconns. Then, the next UpdateClientConnState comes in with 0 addresses. Then, this conditional hits, the SubConn len is still 3, and doens't set to connectivity.TransientFailure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right. I moved it to be right before we go about constructing the ring. So, if we end up with zero addresses, we save the effort of making a new ring.

@zasweq zasweq dismissed their stale review August 3, 2022 18:45

Bug, not approved.

@easwars easwars assigned zasweq and unassigned easwars Aug 3, 2022
@easwars
Copy link
Contributor Author

easwars commented Aug 3, 2022

Thanks for the review.

Copy link
Contributor

@zasweq zasweq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I found my first correctness issue ever I'm so proud of myself :D!

@easwars
Copy link
Contributor Author

easwars commented Aug 4, 2022

LGTM. I found my first correctness issue ever I'm so proud of myself :D!

Yay Yay Yay !!!

@easwars easwars merged commit f9409d3 into grpc:master Aug 4, 2022
@easwars easwars deleted the ringhash_config_update_fix branch August 4, 2022 19:11
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 1, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ringhash: correctly apply ring size changes received as part of an update
2 participants