📖 Add designs/multi-cluster.md #2746

sttts · 2024-03-31T13:38:14Z

Controller-runtime today only supports writing controllers against a single cluster.
Multi-cluster use-cases require creating multiple managers and/or cluster
objects. This proposal adds native support for multi-cluster use-cases
to controller-runtime.

The proposed changes are prototyped in #2726.

Signed-off-by: Dr. Stefan Schimanski <[email protected]>

k8s-ci-robot · 2024-03-31T13:38:18Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sttts
Once this PR has been reviewed and has the lgtm label, please assign sbueringer for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

alvaroaleman · 2024-03-31T14:22:31Z

designs/multi-cluster.md

+}
+
+// pkg/handler
+type DeepCopyableEventHandler interface {


The eventhandlers are stateless, why do we need the deepcopy for them?

Looking at the prototype, I think this is because EventHandler would then store the Cluster (it uses that info to set the ClusterName field in the request).

This is gone now in #2726.

Will update the design here.

With the BYO request/eventhandler changes in #3019, I brought this back after remembering it was mentioned in the proposal. The previous version of my prototype had a weird second layer of event handler that was wrapping the actual event wrapper and was using the event object to communicate the cluster name in. That felt all kinds of weird.

Because we now have BYO EventHandlers, it's possible that they are not entirely stateless (as @sbueringer pointed out, some event handlers might have to store the cluster name in the absence of any information on the event object itself). So I think this approach is the cleanest, to be honest. It's entirely optional in #3019; existing EventHandlers don't need to be changed.
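A minimal sketch of the point above, assuming events carry no cluster information: the handler has to remember which cluster it was built for, so one handler instance per cluster is needed. All types here (`Event`, `Request`, `EventHandler`, `ForCluster`) are hypothetical stand-ins and not the controller-runtime API.

```go
package main

import "fmt"

// Event is a stand-in for an event that carries no cluster information.
type Event struct {
	Namespace, Name string
}

// Request is a hypothetical cluster-aware reconcile request.
type Request struct {
	ClusterName, Namespace, Name string
}

// EventHandler maps an event to the request that should be enqueued.
type EventHandler interface {
	Handle(ev Event) Request
}

// clusterHandler is deliberately stateful: it remembers the cluster it was
// built for, because the event itself does not say where it came from.
type clusterHandler struct {
	clusterName string
}

func (h clusterHandler) Handle(ev Event) Request {
	return Request{ClusterName: h.clusterName, Namespace: ev.Namespace, Name: ev.Name}
}

// ForCluster returns a fresh handler bound to one cluster (hypothetical helper).
func ForCluster(clusterName string) EventHandler {
	return clusterHandler{clusterName: clusterName}
}

func main() {
	ev := Event{Namespace: "default", Name: "web"}
	for _, name := range []string{"cluster-a", "cluster-b"} {
		fmt.Printf("%+v\n", ForCluster(name).Handle(ev))
	}
}
```

Sharing a single handler across clusters would lose the origin of the event, which is why some form of per-cluster copy or constructor comes up in this thread.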

alvaroaleman · 2024-03-31T14:31:07Z

designs/multi-cluster.md

+	Disengage(context.Context, Cluster) error
+}
+```
+In particular, controllers implement the `AwareRunnable` interface. They react


Rather than changing the controller type directly and requiring all its dependencies to know how to deepcopy themselves, how about having something like a controllerconstructor (name tbd) in between that is filled with a []watchConstructor{source func(Cluster) source.Source, handler func(Cluster) handler.Handler, predicate func(Cluster) []predicate.Predicate}?

I think this would require more invasive changes to our public API (the Controller interface)

No, you can call Watch on an existing controller. The idea is to not let the Controller or its dependencies have any knowledge about this but instead have a thing on top of the Controller that is configured with constructors that take a cluster.Cluster and return a source/predicate/handler and then uses those to call Watch when a new cluster appears.

When one disappears, it would cancel the context on the Source.

The idea really is the opposite, I do not want the Controller to know how to extend itself like this, IMHO this is a higher-level abstraction.

Compare #2726 after latest push. I have implemented @alvaroaleman's idea via a MultiClusterController wrapper implementing cluster.AwareRunnable and just calling Watch on the actual controller. All the deepcopy'ing is gone 🎉 Much nicer IMO. @alvaroaleman great intuition!

alvaroaleman · 2024-03-31T14:33:58Z

designs/multi-cluster.md

+// pkg/cluster
+type Provider interface {
+   Get(ctx context.Context, clusterName string, opts ...Option) (Cluster, error)
+   List(ctx context.Context) ([]string, error)


Why return []string here rather than []Cluster?

+1 Would be good for consistency with the Get func

There is a misunderstanding of the interface. The getter is actually the constructor. The life-cycle of the returned clusters is owned by the manager (they are added as runnables). Hence, the List returns names, not clusters. We should rather rename Get to Create or Connect.

alvaroaleman · 2024-03-31T14:37:16Z

designs/multi-cluster.md

+}
+```
+
+The `ctrl.Manager` will use the provider to watch clusters coming and going, and


I'll have to think about if and how this is doable, but ideally the "thing that comes and goes" wouldn't be typed to cluster.Cluster but could be anything, so this mechanism can also be used if folks have sources that are not kube watches.

Would this be mostly about a more generic name? (can't think of much that would work, maybe something like scope)

designs/multi-cluster.md

elmiko

i think this is an interesting idea and i could see using it, i just have a question about some of the mechanics.

for context, i am investigating a cluster-api provider for karpenter and it would be nice to have the controllers discriminate between objects in the management cluster and objects in the workload clusters.

elmiko · 2024-04-08T17:29:42Z

designs/multi-cluster.md

+### Examples
+
+- Run a controller-runtime controller against a kubeconfig with arbitrary many contexts, all being reconciled.
+- Run a controller-runtime controller against cluster-managers like kind, Cluster-API, Open-Cluster-Manager or Hypershift.


given the cluster-api example here, is the intention that controllers will be able to reconcile CRDs in clusters that they know about that may only exist in a subset of clusters (e.g. Machine objects in the management cluster but not in the workload cluster) ?

Good point. I think that has to be possible. Otherwise every resource we watch would have to exist in every cluster.

(especially good point because today a controller crashes if a resource doesn't exist)

EDIT: Further down:

For example, it can well be that every cluster has different REST mapping because installed CRDs are different. Without a context, we cannot return the right REST mapper.

Good point. Question is whether one would rather group them in managers such that every manager has a uniform set of clusters.

See my updated PR #2726. You can now opt into provider and/or the default cluster per controller via options:

    // EngageWithDefaultCluster indicates whether the controller should engage
    // with the default cluster of a manager. This defaults to false through the
    // global controller options of the manager if a cluster provider is set,
    // and to true otherwise. Here it can be overridden.
    EngageWithDefaultCluster *bool
    // EngageWithProvidedClusters indicates whether the controller should engage
    // with the provided clusters of a manager. This defaults to true through the
    // global controller options of the manager if a cluster provider is set,
    // and to false otherwise. Here it can be overridden.
    EngageWithProviderClusters *bool

There is no logic yet for a controller to decide whether to engage with a provider cluster or not. For now it engages with all of them. If the setup is more diverse, we might want such functionality, e.g. some kind of pre-check: ctrl.WantsToEngage(ctx, cluster) bool.
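The defaulting described in the option comments can be sketched like this. It is a simplification under assumptions: the real defaults are said to flow through the manager's global controller options, whereas the hypothetical `resolve` below takes a plain `providerSet` flag; `Options` and `ptr` are likewise local stand-ins.

```go
package main

import "fmt"

// Options carries the two per-controller overrides; nil means "use the default".
type Options struct {
	EngageWithDefaultCluster   *bool
	EngageWithProviderClusters *bool
}

// resolve applies the described defaults: with a provider set, engage with
// provider clusters only; without one, engage with the default cluster only.
// Explicit overrides win in both cases.
func resolve(o Options, providerSet bool) (engageDefault, engageProvider bool) {
	engageDefault = !providerSet
	engageProvider = providerSet
	if o.EngageWithDefaultCluster != nil {
		engageDefault = *o.EngageWithDefaultCluster
	}
	if o.EngageWithProviderClusters != nil {
		engageProvider = *o.EngageWithProviderClusters
	}
	return engageDefault, engageProvider
}

// ptr is a small helper for building *bool overrides.
func ptr(b bool) *bool { return &b }

func main() {
	d, p := resolve(Options{}, true)
	fmt.Println("provider set, no overrides:", d, p)

	d, p = resolve(Options{}, false)
	fmt.Println("no provider, no overrides:", d, p)

	d, p = resolve(Options{EngageWithDefaultCluster: ptr(true)}, true)
	fmt.Println("provider set, default cluster opted in:", d, p)
}
```

Using `*bool` rather than `bool` is what lets an unset option be told apart from an explicit false.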

i'm still understanding the changes in #2726, but i think what you are saying here makes sense to me and would solve the issue.

some kind of pre-check: ctrl.WantsToEngage(ctx, cluster) bool.

+1, i think we definitely need some way for the client user to specify when it should check a specific cluster for a resource.

I somehow think it should be the author's and manager's responsibility (for now) to group them into groups that work with the pattern. At this point, we don't know what we don't know. Once this is released, we can gather some feedback on edge cases and take it from there. I suspect the majority of use cases will still be single-cluster reconcile loops.

Maybe document this edge case and mark this feature overall as experimental? This way we are not committing to full production-level stability, and can gather more feedback.

sttts · 2024-05-28T13:05:45Z

For those reading, this is currently a little outdated. #2726 has a changed design proposed by @alvaroaleman. Will come back soon to both PRs.

k8s-triage-robot · 2024-08-26T13:36:34Z

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2024-09-25T14:27:24Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle rotten
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · 2024-10-25T15:12:48Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Reopen this PR with /reopen
Mark this PR as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot · 2024-10-25T15:12:54Z

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied

After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied

After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Reopen this PR with /reopen

Mark this PR as fresh with /remove-lifecycle rotten

Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

embik · 2024-10-28T11:20:28Z

/reopen

We'd like to continue working on this, time is simply a bit scarce at the moment.

k8s-ci-robot · 2024-10-28T11:20:34Z

@embik: Reopened this PR.

In response to this:

/reopen

We'd like to continue working on this, time is simply a bit scarce at the moment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Gomaya · 2024-11-13T06:13:58Z

Hi, could you please share the future plans for this feature? Thank you!
@embik

embik · 2024-11-13T08:16:28Z

@Gomaya I'm working on a prototype that attempts to address the review comments in #2726. Once everyone is back from KubeCon, I plan to run this by everyone involved and try to move the feature forward.

embik · 2024-12-03T11:42:59Z

/remove-lifecycle rotten

Add designs/multi-cluster.md

eb207cb

Signed-off-by: Dr. Stefan Schimanski <[email protected]>

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 31, 2024

k8s-ci-robot requested review from varshaprasad96 and vincepri March 31, 2024 13:38

k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 31, 2024

alvaroaleman reviewed Mar 31, 2024

embik reviewed Apr 4, 2024

sbueringer mentioned this pull request Apr 4, 2024

Multi Cluster Example / Pattern #2755

Closed

elmiko reviewed Apr 8, 2024

sbueringer reviewed Apr 11, 2024

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 26, 2024

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 25, 2024

k8s-ci-robot closed this Oct 25, 2024

k8s-ci-robot reopened this Oct 28, 2024

embik mentioned this pull request Nov 22, 2024

✨ WIP: Cluster provider and cluster-aware controllers #3019

Open

k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Dec 3, 2024
