KEP-1880: graduating to beta #4645
Conversation
Left several comments
2. IP repair loop pkg/registry/core/service/ipallocator/controller/metrics.go
Users can also obtain the `ServiceCIDR` and `IPAddress` objects that are only available
if the feature is enabled.
Are there any other data observable by the cluster admin, so that one can clearly say this functionality is or is not working in the cluster?
I'd probably suggest writing a statement building around the metrics you provide below.
expanded now with more detailed instructions, PTAL
are not available IPs checking the metrics with the number of available IPs
- Diagnostics: The ServiceCIDR object has conditions in its status that can help to understand if there are problems with
the available ranges. The kube-apiserver metrics show the IP repair and IP allocation errors.
- Testing: See the integration tests added
Either in this section or in the upgrade section, can you describe a potential problem during upgrade, when you have two apiservers (one with this new feature enabled and one without) working in parallel. The new one will work against the new API objects as the source of truth, but the old one will still work of the in-memory one, which can potentially differ as you pointed out earlier in this document. I believe the answer is that the repair loop will eventually settle this situation, but I'd appreciate calling it out here explicitly.
expanded the "Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?" section, PTAL
I think there's still a potential issue here. #4645 (comment)
solved the issue and added a test validating it kubernetes/kubernetes@b29cbfa
/approve
for the PRR
Does anyone delete the old bitmap object? Or do we just orphan it? We need to be able to rollback or to disable the feature, but we don't want to leave a time-bomb. The repair loop SHOULD fix it up no matter what, right? So no bomb? Maybe we schedule that cleanup for when this goes GA, should we just leave a TODO to delete that old object when it goes GA (and then another to delete the delete code :)
Other:
Line 335: "any error creating the IPAddress object will cause an error on the Service creation"
So does that mean that 2 apiservers might allocate the same IP and collide on creating the IPAddress, and one of them comes back to the user as an error (that they did not cause)? Or do we retry in that case, with a different IP? You mention retry later, but this says "any"?
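The retry question above can be made concrete with a minimal sketch. This is not the real kube-apiserver code; `tryCreate` and `allocateWithRetry` are hypothetical stand-ins showing the idea of retrying with a different IP on a create conflict instead of surfacing the error to the user.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict simulates the "IPAddress already exists" error returned when
// another apiserver grabbed the same IP first.
var errConflict = errors.New("IPAddress already exists")

// tryCreate simulates the IPAddress create call; "taken" stands in for
// objects already written by a concurrent apiserver.
func tryCreate(ip string, taken map[string]bool) error {
	if taken[ip] {
		return errConflict
	}
	taken[ip] = true
	return nil
}

// allocateWithRetry tries candidate IPs in order, retrying only on
// conflicts; any other error still fails the Service creation.
func allocateWithRetry(candidates []string, taken map[string]bool) (string, error) {
	for _, ip := range candidates {
		err := tryCreate(ip, taken)
		if err == nil {
			return ip, nil
		}
		if !errors.Is(err, errConflict) {
			return "", err
		}
	}
	return "", errors.New("range exhausted")
}

func main() {
	taken := map[string]bool{"10.0.0.1": true} // another apiserver won this IP
	ip, err := allocateWithRetry([]string{"10.0.0.1", "10.0.0.2"}, taken)
	fmt.Println(ip, err)
}
```

Under this reading, only conflict errors are retried, so "any error ... will cause an error on the Service creation" would apply to non-conflict failures only.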
// IPv6 is an IPv6 block in CIDR notation "fd12:3456:789a:1::/64"
IPv6 string `json:"ipv6"`
// CIDRs defines the IP blocks in CIDR notation (e.g. "192.168.0.0/24" or "2001:db8::/64")
// from which to assign service cluster IPs. Max of two CIDRs is allowed, one of each IP family.
We should spec that order of IP families does not matter
Should we also spec that, for dual-stack services, both IPs will be allocated from ranges in the same ServiceCIDR if possible, but might be split if necessary? Do we test that?
hmm, good point, there is always a concept of primary IP family that is the one provided by the flags https://github.com/kubernetes/kubernetes/blob/164276175fb2fa36d223e3fdeae799d8cf86b0cd/pkg/registry/core/rest/storage_core.go#L354-L358
We also need this because kubernetes.default is single stack, even though users can modify the service CIDRs.
So, right now, it works like this:
- The API server will have a primary IP family on a dual-stack cluster, provided via the order of the IP families in the Service CIDR flags
- This primary IP family will be used as the default for the creation of Services; the kubernetes.default Service is a special case
- The order of the CIDRs in the ServiceCIDR object does not matter, although the order of the CIDRs in the default ServiceCIDR reflects the order of the flags in the apiserver, hence the first family will be the default one
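The "first CIDR picks the primary family" rule can be sketched in a few lines. This is an illustrative helper, not the real apiserver code; `primaryFamily` is a hypothetical name.

```go
package main

import (
	"fmt"
	"net/netip"
)

// primaryFamily returns the IP family of the first CIDR in the list,
// mirroring the rule that the first entry of the default ServiceCIDR
// (i.e. the order of the apiserver flags) determines the default family.
func primaryFamily(cidrs []string) (string, error) {
	if len(cidrs) == 0 {
		return "", fmt.Errorf("no CIDRs configured")
	}
	prefix, err := netip.ParsePrefix(cidrs[0])
	if err != nil {
		return "", err
	}
	if prefix.Addr().Is4() {
		return "IPv4", nil
	}
	return "IPv6", nil
}

func main() {
	// IPv4 listed first, so IPv4 is the primary family for new Services.
	family, err := primaryFamily([]string{"10.0.0.0/16", "fd12:3456:789a:1::/64"})
	if err != nil {
		panic(err)
	}
	fmt.Println(family)
}
```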
test/integration/servicecidr/allocator_test.go TestSkewedAllocators
1. start one apiserver with the feature gate enabled and another with the feature gate disabled
2. create 5 Services against each apiserver
3. the Services created against the apiserver with the bitmap will not allocate the IPAddresses automatically,
To clarify: These will be allocated in the bitmap but not as IPAddress objects, but the OTHER apiserver will allocate them as IPAddress objects in the repair loop, right?
Is it possible to have both apiservers allocate the same IP? I think it is, right?
- the old API allocates an IP and writes the bitmap
- the new API allocates the same IP and writes the IPAddress
- the repair loop catches up and finds a service (from old API) with an IP that doesn't point back to it
Am I forgetting some place where we prevent that? What can we do about it?
What if the new allocator also wrote the bitmap (when it is allocating from the "kubernetes" range(s))?
Do we need to do a release that always writes IPAddress objects but doesn't allow multiple CIDRs? That way the NEXT release will always collide on IPAddress?
Similar, but with a single APIserver:
- Boot with feature disabled
- Create service with IP X
- Enable feature
- Auto-create the flags-based range
- Create another service with same IP
- Repair loop catches up
Writing down the conversations we had yesterday and some thoughts.
Some details of the existing implementation:
- The apiserver will not be ready until one execution of the repair loop is successful https://github.com/kubernetes/kubernetes/blob/c3689b9f8b8c1a1ac2bbc4e3b3cda28d83849ec2/pkg/registry/core/rest/storage_core.go#L432-L470
- The single apiserver scenario (last comment) is tested with an integration test, see below
TestMigrateService
the Services created against the apiserver with the bitmap will not allocate the IPAddresses automatically,
The repair loop in the new allocator watches Service events; the moment a new Service is created in the old apiserver, it reconciles and creates the corresponding IPAddress object, which reduces the time window where the race is possible.
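The event-driven reconciliation described above can be sketched minimally. This is not the real controller; `serviceEvent` and `repair` are illustrative stand-ins for a Service watch feeding IPAddress creation.

```go
package main

import "fmt"

// serviceEvent stands in for a Service create event seen on a watch.
type serviceEvent struct {
	name      string // namespace/name of the Service
	clusterIP string // ClusterIP allocated by the (possibly old) apiserver
}

// repair mirrors ClusterIPs into IPAddress objects (here just a map) as
// soon as events arrive, instead of waiting for a periodic resync. This is
// what shrinks the race window between skewed apiservers.
func repair(events <-chan serviceEvent, ipAddresses map[string]string) {
	for ev := range events {
		if _, ok := ipAddresses[ev.clusterIP]; !ok {
			// Create the missing IPAddress pointing back at the Service.
			ipAddresses[ev.clusterIP] = ev.name
		}
	}
}

func main() {
	events := make(chan serviceEvent, 2)
	events <- serviceEvent{name: "default/db", clusterIP: "10.0.0.12"}
	events <- serviceEvent{name: "default/web", clusterIP: "10.0.0.13"}
	close(events)

	ipAddresses := map[string]string{}
	repair(events, ipAddresses)
	fmt.Println(ipAddresses["10.0.0.12"], ipAddresses["10.0.0.13"])
}
```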
Problem statement in a scenario with 2 apiservers:
- apiserver A has the feature enabled, apiserver B has it disabled, and both are ready
- t1 is the time needed by A to get a Service creation event and reconcile the corresponding IPAddress
- t2 is the time needed by B to reconcile the Service IPs created by A; right now this is triggered periodically, but it could be triggered by events too
The problem is when two Services are created at the same time requesting the same IP.
If the IP is randomly allocated, we are in a birthday-paradox-like problem: what are the chances that two different allocators choose the same value? It depends on the ServiceCIDR size ...
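A back-of-envelope estimate of that collision chance, using the standard birthday-paradox approximation P ≈ 1 - exp(-k(k-1)/(2n)) for k allocations from a pool of n IPs. The pool sizes are illustrative; real allocators also retry on conflict, so this overstates the user-visible risk.

```go
package main

import (
	"fmt"
	"math"
)

// collisionProb estimates the probability that any two of k independent,
// uniformly random allocations from a pool of n IPs pick the same value,
// via the birthday-paradox approximation 1 - exp(-k(k-1)/(2n)).
func collisionProb(k, n float64) float64 {
	return 1 - math.Exp(-k*(k-1)/(2*n))
}

func main() {
	// A /24 Service CIDR has ~254 assignable IPs; a /16 has ~65534.
	fmt.Printf("/24, 10 services: %.3f\n", collisionProb(10, 254))
	fmt.Printf("/16, 10 services: %.5f\n", collisionProb(10, 65534))
}
```

The takeaway matches the comment: on small CIDRs the random-collision chance is non-trivial, on typical /16-sized ranges it is vanishingly small.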
If the IP is statically allocated, this is the worst scenario; it does not seem very realistic, but the consequences are bad. If two different Services ask for the same IP:
- in the current version, one of these requests will fail
- with skewed apiservers, if each request goes to a different apiserver, the outcome depends on the time window defined by max(t1, t2), and we may create two Services with the same IP ... the IP repair loop will start generating events, but it is up to the admin to notice the problem
The simplest solution is to make the old repair loop trigger on Service create events; this will significantly reduce the time window, and we may rely on statistics to justify this decision (there are a lot of ifs that have to happen in a small window of time) and use documentation to mitigate the risks.
I was thinking of providing a tool to run against the cluster to detect this and also to offer the option to delete the old stored bitmaps ... we have done something similar in the past with the storage version migrator https://github.com/kubernetes-sigs/kube-storage-version-migrator (interestingly, this is moving in-tree in 1.30 https://kubernetes.io/docs/tasks/manage-kubernetes-objects/storage-version-migration/)
If we want zero chance of this happening, we need to sync both apiservers; that means some sort of synchronization between them, and it seems the opaque bitmap used by the old allocator is the best alternative.
@thockin suggested an additional feature gate that will allow us to do a two-phase rollout; it can solve the upgrade/skew problem, but it does not solve the long-term problem, because one of the main use cases of this new feature is being able to shrink or grow the default Service CIDR without restarting the apiservers, which implies resizing the bitmap allocator and breaking the skewed servers.
Capturing discussion, the idea:
- Add a new gate, e.g. "ServiceCIDRLegacyBitmap", stage Deprecated, starts as beta default=on (captures the existing behavior)
- Add logic to the new-path allocator to write BOTH the IPAddress resources and the bitmap when allocating from the flag-derived range (bitmap first, I think)
- Roll this all out as beta in 1.31
- In case of the version skew described above, both old and new will use the bitmap, so no collision
- In 1.32 change gate ServiceCIDRLegacyBitmap to default=off
- When we move the real gate to GA, add code to delete the bitmap entirely
- Next release we remove that cleanup code
@aojea - agree or did I get some detail wrong?
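The dual-write step in the plan above could look roughly like the following sketch. This is a toy model, not the real allocator: maps stand in for the etcd-backed bitmap and the IPAddress objects, and the rollback-on-failure ordering ("bitmap first") is the assumption being illustrated.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// dualWriteAllocator models writing BOTH stores when allocating from the
// flag-derived range: the legacy bitmap first (so old apiservers see the
// allocation), then the IPAddress object, rolling the bitmap back if the
// second write fails so the two stores stay consistent.
type dualWriteAllocator struct {
	mu     sync.Mutex
	bitmap map[string]bool // stands in for the etcd-backed bitmap
	ipaddr map[string]bool // stands in for IPAddress objects
}

func (a *dualWriteAllocator) Allocate(ip string) error {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.bitmap[ip] {
		return errors.New("ip already allocated in bitmap")
	}
	a.bitmap[ip] = true // bitmap first: old apiservers only see this
	if a.ipaddr[ip] {
		delete(a.bitmap, ip) // roll back the bitmap write
		return errors.New("IPAddress object already exists")
	}
	a.ipaddr[ip] = true
	return nil
}

func main() {
	a := &dualWriteAllocator{bitmap: map[string]bool{}, ipaddr: map[string]bool{}}
	fmt.Println(a.Allocate("10.0.0.10")) // first allocation succeeds
	fmt.Println(a.Allocate("10.0.0.10")) // second collides on the bitmap
}
```

Because every allocation passes through the bitmap, an old apiserver racing against a new one contends on the same store, which is what closes the skew window.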
- In 1.32 change gate ServiceCIDRLegacyBitmap to default=off
Do you envision default=off and lock=true, or can users still turn this back on? In that case I believe you fall back to the skew scenario and write both, did I get that correct?
there was a thread in Slack about it; I was initially going with 2 beta cycles, but that thread seemed to encourage going GA https://kubernetes.slack.com/archives/C0EG7JC6T/p1718289997541569?thread_ts=1718249135.862569&cid=C0EG7JC6T
To be honest, at this point I don't think we expect many changes in the API, so maybe it is better to go GA without the lock to avoid beta APIs lingering on clusters; the effect will be the same
We don't have to pin it down here/yet. My fear is that the move from "off" to "on" is the riskiest moment, and if "on" means "GA", we have no recourse.
agree; if the main concern is the API objects, can we move the API to v1 and leave the feature gate in beta?
I don't think that helps, and is probably more confusing overall.
I think we need a cycle of beta+on -- that could be this cycle, or it could be 32, but I think it has to happen
ok, let me push the latest update: one beta cycle off now, one beta cycle on next
5. create new services and assert previous and new ones are correct
6. shutdown apiserver
7. start new apiserver with feature disabled
8. create new services and assert previous and new ones are correct
Does this include checking the bitmap?
| Version | MultiCIDRServiceAllocator | DisableAllocatorDualWrite |
|---------|---------------------------|---------------------------|
| 1.31    | Beta off                  | Alpha off                 |
| 1.32    | GA on (locked)            | Beta off                  |
I'm pretty sure we want one cycle of beta+on
beta: "v1.30"
stable: "v1.32"
beta: "v1.31"
stable:

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
feature-gates:
  - name: MultiCIDRServiceAllocator
Want to add the new gate here?
Thanks!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: aojea, soltysh, thockin.
One-line PR description: graduate to beta
Issue link: Multiple Service CIDRs #1880
Other comments: Since the feature depends on a beta API, it will be disabled by default