
[BUG] hub-agent making noop /status updates on every reconciliation #940

Open
ahmetb opened this issue Oct 30, 2024 · 1 comment
Labels
bug Something isn't working

Comments


ahmetb commented Oct 30, 2024

Describe the bug

I'm observing that every time hub-agent restarts, it makes a large number of updates to clusterresourceplacement/status for no reason. Given that the default worker count is 1, this starves new items on the work queue for 10-20 minutes on large clusters with lots of work to do.

Similarly, the resync period on the hub-agent defaults to 5 minutes (way too aggressive IMO), which exacerbates the frequency of the problem:

// ResyncPeriod is the base frequency the informers are resynced. Defaults is 5 minutes.
ResyncPeriod metav1.Duration

Environment

Please provide the following:

  • Hub cluster details: hub-agent v0.8.5
  • Member cluster details: member-agent v0.8.5

To Reproduce

Steps to reproduce the behavior:

  • Restart controller (trigger a rolling update)

  • Tail logs from the leader

  • Observe that it prints log statements as it re-reconciles everything:

    1030 17:55:03.104691       1 framework/framework.go:1310] "No change in scheduling decisions and condition, and the observed CRP generation remains the same" clusterSchedulingPolicySnapshot="lep-...
    1030 17:55:03.522789       1 framework/framework.go:1310] "No change in scheduling decisions and condition, and the observed CRP generation remains the same" clusterSchedulingPolicySnapshot="depo...
    1030 17:55:03.930300       1 framework/framework.go:1310] "No change in scheduling decisions and condition, and the observed CRP generation remains the same" clusterSchedulingPolicySnapshot="tran...
    1030 17:55:04.400769       1 framework/framework.go:1310] "No change in scheduling decisions and condition, and the observed CRP generation remains the same" clusterSchedulingPolicySnapshot="job-...
    1030 17:55:04.812419       1 framework/framework.go:1310] "No change in scheduling decisions and condition, and the observed CRP generation remains the same" clusterSchedulingPolicySnapshot="deci...
    1030 17:55:05.282259       1 framework/framework.go:1310] "No change in scheduling decisions and condition, and the observed CRP generation remains the same" clusterSchedulingPolicySnapshot="auto...
    1030 17:55:05.690881       1 framework/framework.go:1310] "No change in scheduling decisions and condition, and the observed CRP generation remains the same" clusterSchedulingPolicySnapshot="samp...
    1030 17:55:06.145613       1 framework/framework.go:1310] "No change in scheduling decisions and condition, and the observed CRP generation remains the same" clusterSchedulingPolicySnapshot="anti...
    
  • Look at API Server audit logs and observe that it's making updates to clusterresourceplacement/status for every clusterSchedulingPolicySnapshot object, even though there should be no changes to the object (timestamps match):

    (screenshot of audit log entries attached)

  • I'm not seeing any updated timestamps etc. in clusterresourceplacement/status that would warrant this /status update:

    status:
      conditions:
      - lastTransitionTime: "2024-09-19T00:06:37Z" # this is not today
        message: found all the clusters needed as specified by the scheduling policy
        observedGeneration: 1
        reason: SchedulingPolicyFulfilled
        status: "True"
        type: Scheduled
      observedCRPGeneration: 1
      targetClusters:
      - clusterName: redacted
        reason: picked by scheduling policy
        selected: true
    

In this part of the code we can clearly see that the status update call is made unconditionally:

    isClusterScheduled, err := r.setPlacementStatus(ctx, crp, selectedResourceIDs, latestSchedulingPolicySnapshot, latestResourceSnapshot)
    if err != nil {
        return ctrl.Result{}, err
    }
    if err := r.Client.Status().Update(ctx, crp); err != nil {
        klog.ErrorS(err, "Failed to update the status", "clusterResourcePlacement", crpKObj)
        return ctrl.Result{}, err
    }

Expected behavior

The controller should compare the old and new status with apiequality.Semantic.DeepEqual(oldStatus, newStatus), and skip the /status update on the API when there's no reason to make the call.

This would let full resyncs and controller startup complete quickly, and reduce the load on the API Server.

Screenshots

Attached above

Additional context

N/A

@ahmetb ahmetb added the bug Something isn't working label Oct 30, 2024
@ryanzhang-oss
Contributor

Thank you very much, we will triage this bug.
