KEP-1880: graduating to beta #4645
Conversation
Left several comments
2. IP repair loop pkg/registry/core/service/ipallocator/controller/metrics.go
Users can also obtain the `ServiceCIDR` and `IPAddress` objects that are only available
if the feature is enabled.
Are there any other data observable by the cluster admin, so that one can clearly say this functionality is or is not working in the cluster?
I'd probably suggest writing a statement building around the metrics you provide below.
expanded now with more detailed instructions, PTAL
are not available IPs checking the metrics with the number of available IPs
- Diagnostics: The ServiceCIDR object has conditions in its status that can help to understand if there are problems with
the available ranges. The kube-apiserver metrics show the IP repair and IP allocation errors.
- Testing: See the integration tests added
Either in this section or in the upgrade section, can you describe a potential problem during upgrade, when you have two apiservers (one with this new feature enabled and one without) working in parallel. The new one will work against the new API objects as the source of truth, but the old one will still work of the in-memory one, which can potentially differ as you pointed out earlier in this document. I believe the answer is that the repair loop will eventually settle this situation, but I'd appreciate calling it out here explicitly.
expanded the "Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?" section, PTAL
I think there's still a potential issue here. #4645 (comment)
solved the issue and added a test validating it kubernetes/kubernetes@b29cbfa
/approve
for the PRR
Does anyone delete the old bitmap object? Or do we just orphan it? We need to be able to rollback or to disable the feature, but we don't want to leave a time-bomb. The repair loop SHOULD fix it up no matter what, right? So no bomb? Maybe we schedule that cleanup for when this goes GA, should we just leave a TODO to delete that old object when it goes GA (and then another to delete the delete code :)
Other:
Line 335: "any error creating the IPAddress object will cause an error on the Service creation"
So does that mean that 2 apiservers might allocate the same IP and collide on creating the IPAddress, and one of them comes back to the user as an error (that they did not cause)? Or do we retry in that case, with a different IP? You mention retry later, but this says "any"?
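The retry question above can be made concrete with a minimal sketch. This is not the real kube-apiserver code; `tryCreate` and `allocateWithRetry` are hypothetical stand-ins showing the idea of retrying with a different IP on a create conflict instead of surfacing the error to the user.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict simulates the "IPAddress already exists" error returned when
// another apiserver grabbed the same IP first.
var errConflict = errors.New("IPAddress already exists")

// tryCreate simulates the IPAddress create call; "taken" stands in for
// objects already written by a concurrent apiserver.
func tryCreate(ip string, taken map[string]bool) error {
	if taken[ip] {
		return errConflict
	}
	taken[ip] = true
	return nil
}

// allocateWithRetry tries candidate IPs in order, retrying only on
// conflicts; any other error still fails the Service creation.
func allocateWithRetry(candidates []string, taken map[string]bool) (string, error) {
	for _, ip := range candidates {
		err := tryCreate(ip, taken)
		if err == nil {
			return ip, nil
		}
		if !errors.Is(err, errConflict) {
			return "", err
		}
	}
	return "", errors.New("range exhausted")
}

func main() {
	taken := map[string]bool{"10.0.0.1": true} // another apiserver won this IP
	ip, err := allocateWithRetry([]string{"10.0.0.1", "10.0.0.2"}, taken)
	fmt.Println(ip, err)
}
```

Under this reading, only conflict errors are retried, so "any error ... will cause an error on the Service creation" would apply to non-conflict failures only.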
// IPv6 is an IPv6 block in CIDR notation "fd12:3456:789a:1::/64"
IPv6 string `json:"ipv6"`
// CIDRs defines the IP blocks in CIDR notation (e.g. "192.168.0.0/24" or "2001:db8::/64")
// from which to assign service cluster IPs. Max of two CIDRs is allowed, one of each IP family.
We should spec that order of IP families does not matter
Should we also spec that, for dual-stack services, both IPs will be allocated from ranges in the same ServiceCIDR if possible, but might be split if necessary? Do we test that?
hmm, good point, there is always a concept of primary IP family that is the one provided by the flags https://github.com/kubernetes/kubernetes/blob/164276175fb2fa36d223e3fdeae799d8cf86b0cd/pkg/registry/core/rest/storage_core.go#L354-L358
We also need this because kubernetes.default is single stack, even though users can modify the service CIDRs.
So, right now, it works like this:
- The API server will have a primary IP family on a dual-stack cluster, provided via the order of the IP families in the Service CIDR flags
- This primary IP family will be used as the default for the creation of Services; the kubernetes.default Service is a special case
- The order of the CIDRs in the ServiceCIDR object does not matter, although the order of the CIDRs in the default ServiceCIDR reflects the order of the flags in the apiserver, hence the first family will be the default one
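The "first CIDR picks the primary family" rule can be sketched in a few lines. This is an illustrative helper, not the real apiserver code; `primaryFamily` is a hypothetical name.

```go
package main

import (
	"fmt"
	"net/netip"
)

// primaryFamily returns the IP family of the first CIDR in the list,
// mirroring the rule that the first entry of the default ServiceCIDR
// (i.e. the order of the apiserver flags) determines the default family.
func primaryFamily(cidrs []string) (string, error) {
	if len(cidrs) == 0 {
		return "", fmt.Errorf("no CIDRs configured")
	}
	prefix, err := netip.ParsePrefix(cidrs[0])
	if err != nil {
		return "", err
	}
	if prefix.Addr().Is4() {
		return "IPv4", nil
	}
	return "IPv6", nil
}

func main() {
	// IPv4 listed first, so IPv4 is the primary family for new Services.
	family, err := primaryFamily([]string{"10.0.0.0/16", "fd12:3456:789a:1::/64"})
	if err != nil {
		panic(err)
	}
	fmt.Println(family)
}
```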
test/integration/servicecidr/allocator_test.go TestSkewedAllocators
1. start one apiserver with the feature gate enabled and another with the feature gate disabled
2. create 5 Services against each apiserver
3. the Services created against the apiserver with the bitmap will not allocate the IPAddresses automatically,
To clarify: These will be allocated in the bitmap but not as IPAddress objects, but the OTHER apiserver will allocate them as IPAddress objects in the repair loop, right?
Is it possible to have both apiservers allocate the same IP? I think it is, right?
- the old API allocates an IP and writes the bitmap
- the new API allocates the same IP and writes the IPAddress
- the repair loop catches up and finds a service (from old API) with an IP that doesn't point back to it
Am I forgetting some place where we prevent that? What can we do about it?
What if the new allocator also wrote the bitmap (when it is allocating from the "kubernetes" range(s))?
Do we need to do a release that always writes IPAddress objects but doesn't allow multiple CIDRs? That way the NEXT release will always collide on IPAddress?
Similar, but with a single APIserver:
- Boot with feature disabled
- Create service with IP X
- Enable feature
- Auto-create the flags-based range
- Create another service with same IP
- Repair loop catches up
Writing down the conversations we had yesterday and some thoughts.
Some details of the existing implementation:
- The apiserver will not be ready until one execution of the repair loop is successful https://github.com/kubernetes/kubernetes/blob/c3689b9f8b8c1a1ac2bbc4e3b3cda28d83849ec2/pkg/registry/core/rest/storage_core.go#L432-L470
- The single apiserver scenario (last comment) is tested with an integration test, see below
TestMigrateService
the Services created against the apiserver with the bitmap will not allocate the IPAddresses automatically,
The repair loop in the new allocator watches Service events; the moment a new Service is created in the old apiserver, it reconciles and creates the corresponding IPAddress object, which reduces the time window where the race is possible.
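The event-driven reconciliation described above can be sketched minimally. This is not the real controller; `serviceEvent` and `repair` are illustrative stand-ins for a Service watch feeding IPAddress creation.

```go
package main

import "fmt"

// serviceEvent stands in for a Service create event seen on a watch.
type serviceEvent struct {
	name      string // namespace/name of the Service
	clusterIP string // ClusterIP allocated by the (possibly old) apiserver
}

// repair mirrors ClusterIPs into IPAddress objects (here just a map) as
// soon as events arrive, instead of waiting for a periodic resync. This is
// what shrinks the race window between skewed apiservers.
func repair(events <-chan serviceEvent, ipAddresses map[string]string) {
	for ev := range events {
		if _, ok := ipAddresses[ev.clusterIP]; !ok {
			// Create the missing IPAddress pointing back at the Service.
			ipAddresses[ev.clusterIP] = ev.name
		}
	}
}

func main() {
	events := make(chan serviceEvent, 2)
	events <- serviceEvent{name: "default/db", clusterIP: "10.0.0.12"}
	events <- serviceEvent{name: "default/web", clusterIP: "10.0.0.13"}
	close(events)

	ipAddresses := map[string]string{}
	repair(events, ipAddresses)
	fmt.Println(ipAddresses["10.0.0.12"], ipAddresses["10.0.0.13"])
}
```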
Problem statement in a scenario with 2 apiservers:
- apiserver A has the feature enabled, apiserver B has it disabled, and both are ready
- t1 is the time needed by A to get a Service creation event and reconcile the corresponding IPAddress
- t2 is the time needed by B to reconcile the Service IPs created by A; right now this is triggered periodically, but it could be triggered by events too
The problem is when two Services are created at the same time requesting the same IP.
If the IP is randomly allocated, we are in a birthday-paradox-like problem: what are the chances that two different allocators choose the same value? It depends on the ServiceCIDR size ...
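A back-of-envelope estimate of that collision chance, using the standard birthday-paradox approximation P ≈ 1 - exp(-k(k-1)/(2n)) for k allocations from a pool of n IPs. The pool sizes are illustrative; real allocators also retry on conflict, so this overstates the user-visible risk.

```go
package main

import (
	"fmt"
	"math"
)

// collisionProb estimates the probability that any two of k independent,
// uniformly random allocations from a pool of n IPs pick the same value,
// via the birthday-paradox approximation 1 - exp(-k(k-1)/(2n)).
func collisionProb(k, n float64) float64 {
	return 1 - math.Exp(-k*(k-1)/(2*n))
}

func main() {
	// A /24 Service CIDR has ~254 assignable IPs; a /16 has ~65534.
	fmt.Printf("/24, 10 services: %.3f\n", collisionProb(10, 254))
	fmt.Printf("/16, 10 services: %.5f\n", collisionProb(10, 65534))
}
```

The takeaway matches the comment: on small CIDRs the random-collision chance is non-trivial, on typical /16-sized ranges it is vanishingly small.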
If the IP is statically allocated, this is the worst scenario; it does not seem very realistic, but the consequences are bad. If two different Services ask for the same IP:
- in the current version, one of these requests will fail
- with skewed apiservers, if each request goes to a different apiserver, the outcome depends on the time window defined by max(t1, t2), and we may create two Services with the same IP ... the IP repair loop will start generating events, but it is up to the admin to notice the problem
The simplest solution is to make the old repair loop trigger on Service create events; this will significantly reduce the time window, and we may rely on statistics to justify this decision (there are a lot of ifs that have to happen in a small window of time) and use documentation to mitigate the risks.
I was thinking of providing a tool to run against the cluster to detect this and also to offer the option to delete the old stored bitmaps ... we have done something similar in the past with the storage version migrator https://github.com/kubernetes-sigs/kube-storage-version-migrator (interestingly, this is moving in-tree in 1.30 https://kubernetes.io/docs/tasks/manage-kubernetes-objects/storage-version-migration/)
If we want zero chance of this happening, we need to sync both apiservers; that means some sort of synchronization between them, and it seems the opaque bitmap used by the old allocator is the best alternative.
@thockin suggested an additional feature gate that will allow us to do a two-phase rollout; it can solve the upgrade/skew problem, but it does not solve the long-term problem, because one of the main use cases of this new feature is being able to shrink or grow the default Service CIDR without restarting the apiservers, which implies resizing the bitmap allocator and breaking the skewed servers.
Capturing discussion, the idea:
- Add a new gate, e.g. "ServiceCIDRLegacyBitmap", stage Deprecated, starts as beta default=on (captures the existing behavior)
- Add logic to the new-path allocator to write BOTH the IPAddress resources and the bitmap when allocating from the flag-derived range (bitmap first, I think)
- Roll this all out as beta in 1.31
- In case of the version skew described above, both old and new will use the bitmap, so no collision
- In 1.32 change gate ServiceCIDRLegacyBitmap to default=off
- When we move the real gate to GA, add code to delete the bitmap entirely
- Next release we remove that cleanup code
@aojea - agree or did I get some detail wrong?
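The dual-write step in the plan above could look roughly like the following sketch. This is a toy model, not the real allocator: maps stand in for the etcd-backed bitmap and the IPAddress objects, and the rollback-on-failure ordering ("bitmap first") is the assumption being illustrated.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// dualWriteAllocator models writing BOTH stores when allocating from the
// flag-derived range: the legacy bitmap first (so old apiservers see the
// allocation), then the IPAddress object, rolling the bitmap back if the
// second write fails so the two stores stay consistent.
type dualWriteAllocator struct {
	mu     sync.Mutex
	bitmap map[string]bool // stands in for the etcd-backed bitmap
	ipaddr map[string]bool // stands in for IPAddress objects
}

func (a *dualWriteAllocator) Allocate(ip string) error {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.bitmap[ip] {
		return errors.New("ip already allocated in bitmap")
	}
	a.bitmap[ip] = true // bitmap first: old apiservers only see this
	if a.ipaddr[ip] {
		delete(a.bitmap, ip) // roll back the bitmap write
		return errors.New("IPAddress object already exists")
	}
	a.ipaddr[ip] = true
	return nil
}

func main() {
	a := &dualWriteAllocator{bitmap: map[string]bool{}, ipaddr: map[string]bool{}}
	fmt.Println(a.Allocate("10.0.0.10")) // first allocation succeeds
	fmt.Println(a.Allocate("10.0.0.10")) // second collides on the bitmap
}
```

Because every allocation passes through the bitmap, an old apiserver racing against a new one contends on the same store, which is what closes the skew window.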
- In 1.32 change gate ServiceCIDRLegacyBitmap to default=off
Do you envision default=off and lock=true, or can users still turn this back on? In that case I believe you fall back to the skew scenario and write both, did I get that correct?
there was a thread in Slack about it; I was initially going with 2 beta cycles, but that thread seemed to encourage going GA https://kubernetes.slack.com/archives/C0EG7JC6T/p1718289997541569?thread_ts=1718249135.862569&cid=C0EG7JC6T
To be honest, at this point I don't think we expect many changes in the API, so maybe it is better to go GA without the lock to avoid beta APIs lingering on clusters; the effect will be the same
We don't have to pin it down here/yet. My fear is that the move from "off" to "on" is the riskiest moment, and if "on" means "GA", we have no recourse.
agree; if the main concern is the API objects, can we move the API to v1 and leave the feature gate in beta?
I don't think that helps, and is probably more confusing overall.
I think we need a cycle of beta+on -- that could be this cycle, or it could be 32, but I think it has to happen
ok, let me push the latest update: one beta cycle off now, one beta cycle on next
5. create new services and assert previous and new ones are correct
6. shutdown apiserver
7. start new apiserver with feature disabled
8. create new services and assert previous and new ones are correct
Does this include checking the bitmap?
| Version | MultiCIDRServiceAllocator | DisableAllocatorDualWrite |
|---------|---------------------------|---------------------------|
| 1.31    | Beta off                  | Alpha off                 |
| 1.32    | GA on (locked)            | Beta off                  |
I'm pretty sure we want one cycle of beta+on
beta: "v1.30"
stable: "v1.32"
beta: "v1.31"
stable:

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
feature-gates:
  - name: MultiCIDRServiceAllocator
Want to add the new gate here?
Thanks!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: aojea, soltysh, thockin.
One-line PR description: graduate to beta
Issue link: Multiple Service CIDRs #1880
Other comments: Since the feature depends on a beta API, it will be disabled by default