
KEP-1880: graduating to beta #4645

Merged: 7 commits merged into kubernetes:master from the servicecidrbeta branch on Jun 18, 2024

Conversation

aojea
Member

@aojea aojea commented May 20, 2024

  • One-line PR description: graduate to beta

  • Issue link: Multiple Service CIDRs #1880

  • Other comments: Since the feature depends on a beta API, it will be disabled by default

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels May 20, 2024
@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 20, 2024
@aojea
Member Author

aojea commented May 20, 2024

for sig net
/assign @thockin @danwinship
for prr
/assign @wojtek-t @johnbelamaric
John did the last one IIRC

@aojea aojea changed the title from "KEP-1880: graduating to beta" to "[WIP] KEP-1880: graduating to beta" on May 21, 2024
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 21, 2024
@aojea aojea changed the title from "[WIP] KEP-1880: graduating to beta" to "KEP-1880: graduating to beta" on May 21, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 21, 2024
@aojea aojea mentioned this pull request May 24, 2024
@aojea aojea force-pushed the servicecidrbeta branch from a8a75cc to e0c6056 on May 28, 2024 18:59
@k8s-ci-robot k8s-ci-robot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 28, 2024
@aojea aojea force-pushed the servicecidrbeta branch 2 times, most recently from 4428200 to 49c42aa on June 5, 2024 09:49
@aojea aojea force-pushed the servicecidrbeta branch from 49c42aa to 416c70c on June 5, 2024 13:55
@wojtek-t wojtek-t removed their assignment Jun 7, 2024
@BenTheElder BenTheElder self-assigned this Jun 7, 2024
Contributor

@soltysh soltysh left a comment

Left several comments

2. IP repair loop pkg/registry/core/service/ipallocator/controller/metrics.go

Users can also obtain the `ServiceCIDR` and `IPAddress` objects that are only available
if the feature is enabled.
Contributor

Are there any other data observable by the cluster admin, so that one can clearly say this functionality is or is not working in the cluster?

Contributor

I'd probably suggest writing a statement built around the metrics you provide below.

Member Author

expanded now with more detailed instructions, PTAL
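
For illustration, a minimal sketch (my own example, assuming the networking.k8s.io/v1beta1 API proposed for this graduation is served and standard client-go access is available) of the kind of check a cluster admin could run: list the ServiceCIDR objects with their conditions and count the IPAddress objects.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (assumption: admin access).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// ServiceCIDR and IPAddress are cluster-scoped and are only served when the
	// feature gate and the corresponding API group version are enabled.
	cidrs, err := client.NetworkingV1beta1().ServiceCIDRs().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err) // a "resource not found" style error suggests the feature is off
	}
	for _, sc := range cidrs.Items {
		fmt.Printf("ServiceCIDR %s cidrs=%v\n", sc.Name, sc.Spec.CIDRs)
		for _, cond := range sc.Status.Conditions {
			fmt.Printf("  condition %s=%s reason=%q\n", cond.Type, cond.Status, cond.Reason)
		}
	}

	ips, err := client.NetworkingV1beta1().IPAddresses().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d IPAddress objects currently allocated\n", len(ips.Items))
}
```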

are not available IPs checking the metrics with the number of available IPs
- Diagnostics: The ServiceCIDR object has conditions in its status that can help to understand whether there are problems with
the available ranges. The kube-apiserver metrics show IP repair and IP allocation errors.
- Testing: See integration tests added
Contributor

Either in this section or in the upgrade section, can you describe a potential problem during upgrade, when you have two apiservers (one with this new feature enabled and one without) working in parallel? The new one will work against the new API objects as the source of truth, but the old one will still work off the in-memory one, which can potentially differ as you pointed out earlier in this document. I believe the answer is that the repair loop will eventually settle this situation, but I'd appreciate calling it out here explicitly.

Member Author

expanded the "Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?" section, PTAL

Member

I think there's still a potential issue here. #4645 (comment)

Member Author

solved the issue and added a test validating it kubernetes/kubernetes@b29cbfa

keps/sig-network/1880-multiple-service-cidrs/kep.yaml (outdated; conversation resolved)
Contributor

@soltysh soltysh left a comment

/approve
for the PRR

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 12, 2024
Member

@thockin thockin left a comment

Does anyone delete the old bitmap object? Or do we just orphan it? We need to be able to roll back or to disable the feature, but we don't want to leave a time bomb. The repair loop SHOULD fix it up no matter what, right? So no bomb? Maybe we schedule that cleanup for when this goes GA; should we just leave a TODO to delete that old object when it goes GA (and then another to delete the delete code :)

Other:

Line 335: "any error creating the IPAddress object will cause an error on the Service creation"

So does that mean that 2 apiservers might allocate the same IP and collide on creating the IPAddress, and one of them comes back to the user as an error (that they did not cause)? Or do we retry in that case, with a different IP? You mention retry later, but this says "any"?

// IPv6 is an IPv6 block in CIDR notation "fd12:3456:789a:1::/64"
IPv6 string `json:"ipv6"`
// CIDRs defines the IP blocks in CIDR notation (e.g. "192.168.0.0/24" or "2001:db8::/64")
// from which to assign service cluster IPs. Max of two CIDRs is allowed, one of each IP family.
Member

We should spec that the order of IP families does not matter

Member

Should we also spec that, for dual-stack services, both IPs will be allocated from ranges in the same ServiceCIDR if possible, but might be split if necessary? Do we test that?

Member Author

hmm, good point, there is always the concept of a primary IP family, which is the one provided by the flags https://github.com/kubernetes/kubernetes/blob/164276175fb2fa36d223e3fdeae799d8cf86b0cd/pkg/registry/core/rest/storage_core.go#L354-L358

We also need this because the kubernetes.default Service is single stack, even though users can modify the service CIDRs.

So, right now, it works like this (a short sketch follows the list):

  • The API server will have a primary IP family on a dual-stack cluster, which is given by the order of the IP families in the Service CIDR flags
  • This primary IP family will be used as the default for the creation of Services; a special case is the kubernetes.default Service
  • The order of the CIDRs in a ServiceCIDR object does not matter, although the order of the CIDRs in the default ServiceCIDR reflects the order of the flags in the apiserver, hence the first family will be the default one
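
As a hypothetical illustration of that last point, a dual-stack ServiceCIDR built with the v1beta1 types (the object name and CIDRs are made up); the order of the two entries carries no special meaning except in the default "kubernetes" object:

```go
package main

import (
	"fmt"

	networkingv1beta1 "k8s.io/api/networking/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Hypothetical extra range added by an admin. The order of the two entries
	// does not matter here; only in the special default "kubernetes" ServiceCIDR
	// does the order mirror the apiserver --service-cluster-ip-range flags and
	// therefore define the primary IP family.
	sc := networkingv1beta1.ServiceCIDR{
		ObjectMeta: metav1.ObjectMeta{Name: "extra-range"},
		Spec: networkingv1beta1.ServiceCIDRSpec{
			CIDRs: []string{"10.96.0.0/16", "2001:db8::/64"},
		},
	}
	fmt.Printf("%s: %v\n", sc.Name, sc.Spec.CIDRs)
}
```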

test/integration/servicecidr/allocator_test.go TestSkewedAllocators
1. start one apiserver with the feature gate enabled and other with the feature gate disabled
2. create 5 Services against each apiserver
3. the Services created against the apiserver with the bitmap will not allocate the IPAddresses automatically,
Member

To clarify: These will be allocated in the bitmap but not as IPAddress objects, but the OTHER apiserver will allocate them as IPAddress objects in the repair loop, right?

Is it possible to have both apiservers allocate the same IP ? I think it is, right?

  • the old API allocates an IP and writes the bitmap
  • the new API allocates the same IP and writes the IPAddress
  • the repair loop catches up and finds a service (from old API) with an IP that doesn't point back to it

Am I forgetting some place where we prevent that? What can we do about it?

What if the new allocator also wrote the bitmap (when it is allocating from the "kubernetes" range(s))?

Do we need to do a release that always writes IPAddress objects but doesn't allow multiple CIDRs? That way the NEXT release will always collide on IPAddress?

Member

Similar, but with a single APIserver:

  • Boot with feature disabled
  • Create service with IP X
  • Enable feature
  • Auto-create the flags-based range
  • Create another service with same IP
  • Repair loop catches up

Member Author

Writing down the conversations we had yesterday and some thoughts.
Some details of the existing implementation:

the Services created against the apiserver with the bitmap will not allocate the IPAddresses automatically,

The repair loop in the new allocator watches Service events; the moment a new Service is created via the old apiserver, it will reconcile and create the corresponding IPAddress object, which reduces the time window where the race is possible.

Problem statement in a scenario with 2 apiservers:

  • apiserver A has the feature enabled, apiserver B has it disabled, and both are ready
  • t1 is the time A needs to receive a Service creation event and reconcile the corresponding IPAddress
  • t2 is the time B needs to reconcile the Service IPs created by A; right now this is triggered periodically, but it could be triggered by events too

The problem appears when two Services are created at the same time requesting the same IP.
If the IP is randomly allocated we are in a birthday-paradox-like problem: what are the chances that two different allocators choose the same value? It will depend on the ServiceCIDR size ...
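
As a rough illustration of that birthday-paradox point (my own numbers and helper, not from the KEP or the allocator code), a short Go sketch estimating the chance that at least two of n random ClusterIP allocations pick the same address in ranges of different sizes:

```go
package main

import (
	"fmt"
	"math"
)

// collisionProbability approximates the chance that at least two of n
// independently random allocations pick the same address out of a range with
// `size` usable IPs, using the usual birthday approximation
// 1 - exp(-n*(n-1)/(2*size)).
func collisionProbability(n int, size float64) float64 {
	k := float64(n)
	return 1 - math.Exp(-k*(k-1)/(2*size))
}

func main() {
	// /24 and /16 Service CIDRs, assuming roughly 10 Services created inside the
	// unsynchronized window between the two apiservers.
	for _, size := range []float64{254, 65534} {
		fmt.Printf("range of %.0f IPs: p(collision among 10 allocations) = %.4f\n",
			size, collisionProbability(10, size))
	}
}
```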

If the IP is statically allocated, this is the worst scenario; it does not seem very realistic, but the consequences are bad. If two different Services ask for the same IP:

  • in the current version, one of those requests will fail with skewed apiservers
  • in the current scenario, if each request goes to a different apiserver, it will depend on the window of time defined by max(t1,t2) and may create two Services with the same IP ... the IP repair loop will start generating events, but it will be up to the admin to notice the problem

The simplest solution is to move the old repair loop to be triggered by Service create events; this will significantly reduce the time window, and we may rely on statistics to justify this decision (there are a lot of ifs that have to happen in a small window of time) and use documentation to mitigate the risks.

I was thinking of providing a tool to run against the cluster to detect this, and also to offer the option to delete the old stored bitmaps ... we have done something similar in the past with the storage version migrator https://github.com/kubernetes-sigs/kube-storage-version-migrator (interestingly, this is moving in-tree in 1.30 https://kubernetes.io/docs/tasks/manage-kubernetes-objects/storage-version-migration/).

If we want zero chance of this happening, we need to sync both apiservers, which means we need some sort of synchronization between them, and it seems the opaque bitmap used by the old allocator is the best alternative.
@thockin suggested an additional feature gate that would allow us to do a two-phase rollout, and it can work to solve the upgrade/skew problem; however, it does not solve the long-term problem, because one of the main use cases of this new feature is to be able to shrink or grow the default service CIDR without having to restart the apiservers, and that would imply resizing the bitmap allocator, which would break the skewed servers.

Member

Capturing discussion, the idea:

  • Add a new gate, e.g. "ServiceCIDRLegacyBitmap", stage Deprecated, starts as beta default=on (captures the existing behavior)
  • Add logic to new-path allocator to write BOTH the IPAddress resources and the bitmap when allocating from the flag-derived range (bitmap first, I think)
  • Roll this all out as beta in 1.31
  • In case of version skew described above, both old and new will use the bitmap so no collision
  • In 1.32 change gate ServiceCIDRLegacyBitmap to default=off
  • When we move the real gate to GA, add code to delete the bitmap entirely
  • Next release we remove that cleanup code

@aojea - agree or did I get some detail wrong?
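
A toy sketch of the dual-write ordering described above (all types and names here are made up for illustration; this is not the real apiserver allocator code): reserve in the legacy bitmap first, so a skewed apiserver that only reads the bitmap sees the reservation, then record the IPAddress object, undoing the bitmap write if the second step fails.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

type bitmapAllocator interface {
	Allocate(ip net.IP) error
	Release(ip net.IP)
}

type ipAddressWriter interface {
	Create(ip net.IP) error
}

// dualWriteAllocate reserves the IP in the legacy bitmap first and then creates
// the IPAddress record used by the new allocator, rolling back on failure.
func dualWriteAllocate(bitmap bitmapAllocator, ipAddresses ipAddressWriter, ip net.IP) error {
	if err := bitmap.Allocate(ip); err != nil {
		return fmt.Errorf("bitmap allocation failed: %w", err)
	}
	if err := ipAddresses.Create(ip); err != nil {
		// Undo the bitmap reservation so the two views do not diverge.
		bitmap.Release(ip)
		return fmt.Errorf("IPAddress creation failed: %w", err)
	}
	return nil
}

// Minimal in-memory fakes so the sketch runs on its own.
type fakeBitmap struct{ used map[string]bool }

func (f *fakeBitmap) Allocate(ip net.IP) error {
	if f.used[ip.String()] {
		return errors.New("already allocated")
	}
	f.used[ip.String()] = true
	return nil
}

func (f *fakeBitmap) Release(ip net.IP) { delete(f.used, ip.String()) }

type fakeIPAddresses struct{}

func (fakeIPAddresses) Create(ip net.IP) error { return nil }

func main() {
	bm := &fakeBitmap{used: map[string]bool{}}
	err := dualWriteAllocate(bm, fakeIPAddresses{}, net.ParseIP("10.96.0.10"))
	fmt.Println("allocated without conflict:", err == nil)
}
```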

Contributor

  • In 1.32 change gate ServiceCIDRLegacyBitmap to default=off

Do you envision default=off and lock=true, or can users still turn this back on? In the latter case I believe you fall back to the skew scenario and write both, did I get that correct?

Member Author

there was a thread in Slack about it; I was initially going with 2 beta cycles, but that thread seemed to encourage going GA https://kubernetes.slack.com/archives/C0EG7JC6T/p1718289997541569?thread_ts=1718249135.862569&cid=C0EG7JC6T

To be honest, at this point I don't think we expect many changes in the API, so maybe it is better to go GA without the lock to avoid the beta APIs lingering on clusters; the effect will be the same.

Member

We don't have to pin it down here/yet. My fear is that the move from "off" to "on" is the riskiest moment, and if "on" means "GA", we have no recourse.

Member Author

agree; if the main concern is the API objects, can we move the API to v1 and leave the feature gate in beta?

Member

I don't think that helps, and is probably more confusing overall.

I think we need a cycle of beta+on -- that could be this cycle, or it could be 32, but I think it has to happen

Member Author

ok, let me push the latest update, 1 beta off now, 1 beta on next

5. create new services and assert previous and new ones are correct
6. shutdown apiserver
7. start new apiserver with feature disabled
8. create new services and assert previous and new ones are correct
Member

Does this include checking the bitmap?

@aojea aojea force-pushed the servicecidrbeta branch from b343a2f to d80d6fa on June 18, 2024 20:59
| Version | MultiCIDRServiceAllocator | DisableAllocatorDualWrite |
|----------|----------|----------|
| 1.31 | Beta off | Alpha off |
| 1.32 | GA on (locked) | Beta off |
Member

I'm pretty sure we want one cycle of beta+on

beta: "v1.30"    ->  beta: "v1.31"
stable: "v1.32"  ->  stable:

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
feature-gates:
- name: MultiCIDRServiceAllocator
Member

Want to add the new gate here?

Member

@thockin thockin left a comment

Thanks!

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 18, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, soltysh, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit ba637b8 into kubernetes:master Jun 18, 2024
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.31 milestone Jun 18, 2024