Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

migrate nrt to ctrl runtime client #655

Merged

Conversation

zwpaper
Copy link
Member

@zwpaper zwpaper commented Nov 1, 2023

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

part of #485

Special notes for your reviewer:

after deleting the generated vendor, CI found noderesourcetopology crd file changed, v1alpha1 version was deleted, I checked and found nowhere that used the v1alpha1, so I deleted the version.

now we only have noderesourcetopology v1alpha2

Does this PR introduce a user-facing change?

migrate noderesourcetopology to ctrl runtime client

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 1, 2023
Copy link

netlify bot commented Nov 1, 2023

Deploy Preview for kubernetes-sigs-scheduler-plugins canceled.

Name Link
🔨 Latest commit 0b5755d
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-scheduler-plugins/deploys/6550a66ff119c20008ad111c

@zwpaper zwpaper marked this pull request as draft November 1, 2023 15:25
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Nov 1, 2023
@zwpaper zwpaper changed the title WIP migrate nrt to ctrl runtime WIP migrate nrt to ctrl runtime client Nov 1, 2023
@Huang-Wei
Copy link
Contributor

/assign

@zwpaper zwpaper force-pushed the refactor-ctrl-runtime-capacity branch from c3ee7ed to 723706e Compare November 2, 2023 16:45
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 2, 2023
@zwpaper zwpaper changed the title WIP migrate nrt to ctrl runtime client migrate nrt to ctrl runtime client Nov 3, 2023
@zwpaper zwpaper marked this pull request as ready for review November 3, 2023 00:08
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 3, 2023
@zwpaper zwpaper force-pushed the refactor-ctrl-runtime-capacity branch 3 times, most recently from 0524bcf to d447dfc Compare November 3, 2023 04:35
@zwpaper
Copy link
Member Author

zwpaper commented Nov 3, 2023

@Huang-Wei CI passed, PTAL

Copy link
Contributor

@Huang-Wei Huang-Wei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nits. LGTM overall.

/approve
/lgtm

pkg/noderesourcetopology/cache/overreserve.go Show resolved Hide resolved
podLister: podLister,
resyncMethod: resyncMethod,
}
return obj, nil
}

func (ov *OverReserve) GetCachedNRTCopy(nodeName string, pod *corev1.Pod) (*topologyv1alpha2.NodeResourceTopology, bool) {
func (ov *OverReserve) GetCachedNRTCopy(ctx context.Context, nodeName string, pod *corev1.Pod) (*topologyv1alpha2.NodeResourceTopology, bool) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems ctx is not used in this function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, either pass it through ov.Get (which will become ov.client.Get?) or discard it please (applies to similar places elsewhere in the code)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, my bad. It would be nice to change it as
func (ov *OverReserve) GetCachedNRTCopy(_ context.Context, nodeName string, pod *corev1.Pod) (*topologyv1alpha2.NodeResourceTopology, bool) { }
because ctx is indeed not needed in the body, but this is borderline a nit and totally not blocking

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 4, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Huang-Wei, zwpaper

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 4, 2023
@Huang-Wei
Copy link
Contributor

Huang-Wei commented Nov 4, 2023

/hold

cc @ffromani @swatisehgal @Tal-or for awareness.

The background is that scheduler-plugins start to use more CRDs as time goes, and the current way is to generate typed client for each CRD and leverage in the plugin codebase. The codegen part is annoying and raise the bar of using/understanding scheduler plugin. So the overall goal to turn to dynamic client by leverage controller runtime's utilities.

cc-ing you is b/c it seems you're working on pushing NRT to upstream. Hope this work doesn't conflict with your work.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 4, 2023
Copy link
Contributor

@Tal-or Tal-or left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR.
I left some notes from my side, and I would wait for @ffromani and @swatisehgal to review this PR as well.

pkg/noderesourcetopology/cache/overreserve.go Outdated Show resolved Hide resolved
@@ -43,32 +45,26 @@ const (
)

func initNodeTopologyInformer(tcfg *apiconfig.NodeResourceTopologyMatchArgs, handle framework.Handle) (nrtcache.Interface, error) {
topoClient, err := topoclientset.NewForConfig(handle.KubeConfig())
scheme := runtime.NewScheme()
utilruntime.Must(clientgoscheme.AddToScheme(scheme))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would handle this error instead of panic

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this error in practice won't happen IMO. If we treat it as an error, @Tal-or do you mean we can find a way to proceed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why this commit (which seems to want to focus on test code) needs to add manage schemas

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Huang-Wei Not sure if it's possible to proceed, but at least throwing a detailed error which bubble up and explained what happen, would be more user friendly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how the controller runtime client can know what to do with the types, this is a common usage from kube-builder, which is one of the most famous libs using controller runtime.

you can check: https://book.kubebuilder.io/cronjob-tutorial/empty-main

image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @Huang-Wei . The error is not recoverable, so it's fine to use utilruntime.Must and fail badly in the (I reckon) exceptionally rare case on which AddToScheme fails.

The only thing which doesn't fully click to me is why we need to fix the schema in this precise code location. Disclosure: I didn't review the latest changes, so this is possibly already improved.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ffromani, I did not update the codes here.

we must handle the scheme somewhere, as the main scheduler, we added in main.go, and for this nrt plugin, I add it here when initing it, so that we are not adding this scheme if nrt plugin is not used.

as for utilruntime.Must, WDYT @Tal-or ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ffromani raised a very good point. The AddToScheme logic should only happen once per its usage:

  • for running scheduler, it's should in main's init() or equivalent place. @zwpaper I know this is generic code problem, but if you can add this logic in this PR, it'd be much appreciated 🙇‍♂️
  • for UTs, it's fine to init in each UT as we treat each UT as a client

Copy link
Contributor

@Tal-or Tal-or Nov 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as for utilruntime.Must, WDYT @Tal-or ?

@zwpaper since the other reviewers are in favor of utilruntime.Must that's ok

Copy link
Member Author

@zwpaper zwpaper Nov 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Huang-Wei moved all scheme initial into init func, PTAL

# rg -B 1 -g '!vendor/' -g '!*_test.go' -- utilruntime.Must
pkg/noderesourcetopology/plugin.go
45-func init() {
46:	utilruntime.Must(clientgoscheme.AddToScheme(scheme))
47:	utilruntime.Must(topologyv1alpha2.AddToScheme(scheme))

pkg/networkaware/topologicalsort/topologicalsort.go
45-func init() {
46:	utilruntime.Must(clientgoscheme.AddToScheme(scheme))
47:	utilruntime.Must(agv1alpha.AddToScheme(scheme))

pkg/networkaware/networkoverhead/networkoverhead.go
66-func init() {
67:	utilruntime.Must(clientgoscheme.AddToScheme(scheme))
68-
69:	utilruntime.Must(agv1alpha1.AddToScheme(scheme))
70:	utilruntime.Must(ntv1alpha1.AddToScheme(scheme))

pkg/generated/clientset/versioned/scheme/register.go
54-	v1.AddToGroupVersion(Scheme, schema.GroupVersion{Version: "v1"})
55:	utilruntime.Must(AddToScheme(Scheme))

pkg/generated/clientset/versioned/fake/register.go
54-	v1.AddToGroupVersion(scheme, schema.GroupVersion{Version: "v1"})
55:	utilruntime.Must(AddToScheme(scheme))

cmd/controller/app/server.go
37-func init() {
38:	utilruntime.Must(clientgoscheme.AddToScheme(scheme))
39-
40:	utilruntime.Must(schedulingv1a1.AddToScheme(scheme))

apis/config/scheme/scheme.go
44-func AddToScheme(scheme *runtime.Scheme) {
45:	utilruntime.Must(config.AddToScheme(scheme))
46:	utilruntime.Must(v1.AddToScheme(scheme))
47:	utilruntime.Must(v1beta2.AddToScheme(scheme))
48:	utilruntime.Must(v1beta3.AddToScheme(scheme))

apis/scheduling/scheme/scheme.go
31-func AddToScheme(scheme *runtime.Scheme) {
32:	utilruntime.Must(schedv1alpha1.AddToScheme(scheme

pkg/noderesourcetopology/pluginhelpers.go Outdated Show resolved Hide resolved
pkg/noderesourcetopology/score_test.go Show resolved Hide resolved
Copy link
Contributor

@ffromani ffromani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there could be some impl details to iterate over, but design wise moving to controller-runtime seems OK. Doesn't help us in general in our journey towards k8s core merge (or its proposal, at least), but its' manageable.

What puzzles me more is if this approach is still compatible with #599 .
Note that #599 is not necessarily something which I want to merge, but still the problem it attacks is real, and we didn't reach a conclusion which side should carry a fix. In addition, it would be nice to have the option to workaround the issue also on scheduler side if needed.

pkg/noderesourcetopology/cache/overreserve.go Outdated Show resolved Hide resolved
podLister: podLister,
resyncMethod: resyncMethod,
}
return obj, nil
}

func (ov *OverReserve) GetCachedNRTCopy(nodeName string, pod *corev1.Pod) (*topologyv1alpha2.NodeResourceTopology, bool) {
func (ov *OverReserve) GetCachedNRTCopy(ctx context.Context, nodeName string, pod *corev1.Pod) (*topologyv1alpha2.NodeResourceTopology, bool) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, either pass it through ov.Get (which will become ov.client.Get?) or discard it please (applies to similar places elsewhere in the code)

)

type Passthrough struct {
lister listerv1alpha2.NodeResourceTopologyLister
client ctrlclient.Client
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

likewise, we're not using embedding here so better be consistent in OverReserve as well

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed the OverReserve to private client to keep it consistent.

@@ -43,32 +45,26 @@ const (
)

func initNodeTopologyInformer(tcfg *apiconfig.NodeResourceTopologyMatchArgs, handle framework.Handle) (nrtcache.Interface, error) {
topoClient, err := topoclientset.NewForConfig(handle.KubeConfig())
scheme := runtime.NewScheme()
utilruntime.Must(clientgoscheme.AddToScheme(scheme))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why this commit (which seems to want to focus on test code) needs to add manage schemas

Comment on lines +484 to +486
scheme := runtime.NewScheme()
utilruntime.Must(clientgoscheme.AddToScheme(scheme))
utilruntime.Must(topologyv1alpha2.AddToScheme(scheme))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while this is probably OK, it seems wasteful to do each time. Can we do it once when test runs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the tests run independently, how can we init once and run cross all the tests? maybe we have to refactor to use a test.M?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sort of agree with @zwpaper, in the perspective to making each UT hermetic.

Comment on lines +670 to +672
scheme := runtime.NewScheme()
utilruntime.Must(clientgoscheme.AddToScheme(scheme))
utilruntime.Must(topologyv1alpha2.AddToScheme(scheme))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Comment on lines +135 to +138
scheme := runtime.NewScheme()
utilruntime.Must(clientgoscheme.AddToScheme(scheme))
utilruntime.Must(topologyv1alpha2.AddToScheme(scheme))
extClient, err := ctrlclient.New(globalKubeConfig, ctrlclient.Options{Scheme: scheme})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@zwpaper zwpaper force-pushed the refactor-ctrl-runtime-capacity branch from c14580c to f4b63d2 Compare November 9, 2023 15:33
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 9, 2023
@Huang-Wei
Copy link
Contributor

What puzzles me more is if this approach is still compatible with #599 .

@ffromani this PR won't block #599 as #599 touched scheduler config API, and the introduced informer are imposed on core api objects (pods).

@zwpaper zwpaper force-pushed the refactor-ctrl-runtime-capacity branch from f4b63d2 to 9471d79 Compare November 12, 2023 09:45
@zwpaper zwpaper force-pushed the refactor-ctrl-runtime-capacity branch from 9471d79 to 0b5755d Compare November 12, 2023 10:18
Copy link
Contributor

@Huang-Wei Huang-Wei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall.

A potential future improvement is to instantiate the scheme once (in main.go) for non-test code, and then pass to all plugins that need it using a WithScheme option. But this doesn't block on this PR's merge.

@@ -28,7 +28,7 @@ import (
"k8s.io/apimachinery/pkg/runtime"
"k8s.io/apimachinery/pkg/util/rand"
utilruntime "k8s.io/apimachinery/pkg/util/runtime"
"k8s.io/client-go/kubernetes/scheme"
clientgoscheme "k8s.io/client-go/kubernetes/scheme"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: revert this?

Copy link
Member Author

@zwpaper zwpaper Nov 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the scheme name is in conflict with the var scheme for init schemes, changed to keep consistency

Copy link
Contributor

@ffromani ffromani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
(non-binding to let other tagged reviewers review)

podLister: podLister,
resyncMethod: resyncMethod,
}
return obj, nil
}

func (ov *OverReserve) GetCachedNRTCopy(nodeName string, pod *corev1.Pod) (*topologyv1alpha2.NodeResourceTopology, bool) {
func (ov *OverReserve) GetCachedNRTCopy(ctx context.Context, nodeName string, pod *corev1.Pod) (*topologyv1alpha2.NodeResourceTopology, bool) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, my bad. It would be nice to change it as
func (ov *OverReserve) GetCachedNRTCopy(_ context.Context, nodeName string, pod *corev1.Pod) (*topologyv1alpha2.NodeResourceTopology, bool) { }
because ctx is indeed not needed in the body, but this is borderline a nit and totally not blocking

nrtCandidate, err := ov.nrtLister.Get(nodeName)
if err != nil {
nrtCandidate := &topologyv1alpha2.NodeResourceTopology{}
if err := ov.client.Get(context.Background(), types.NamespacedName{Name: nodeName}, nrtCandidate); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should use the same context for all the Gets - and I think we should (and we should also make it cancellable as further extension). But this change is better served in a followup PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, other PRs also have some leftovers. The general idea would be use context wherever you can:

  1. the chained calls in PreFitler/Filter can use the context that is provided by scheduler framework
  2. for calls outside PreFilter/Filter/..., we should instantiate one at the start of main() (or make some change in scheduler framework to expose it), and pass it to all plugins

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

created an issue to track this #664

@Huang-Wei
Copy link
Contributor

/lgtm
/unhold

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Nov 14, 2023
@k8s-ci-robot k8s-ci-robot merged commit 862667a into kubernetes-sigs:master Nov 14, 2023
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants