Describe the bug
I'm observing that every time the hub-agent restarts, it makes a ton of updates to clusterresourceplacement/status for no reason. Given that the default worker count is 1, this starves new items on the work queue for 10-20 minutes on large clusters with lots of work to do.
Similarly, the resync period on the hub-agent defaults to 5 minutes (way too aggressive IMO), which exacerbates the frequency of the problem:
fleet/cmd/hubagent/options/options.go (lines 66 to 67 at 6b81bdb):
// ResyncPeriod is the base frequency the informers are resynced. Default is 5 minutes.
ResyncPeriod metav1.Duration
Environment
Hub cluster details: hub-agent v0.8.5
Member cluster details: member-agent v0.8.5
To Reproduce
Steps to reproduce the behavior:
Restart controller (trigger a rolling update)
Tail logs from the leader
Observe that it prints log statements re-reconciling everything:
I1030 17:55:03.104691 1 framework/framework.go:1310] "No change in scheduling decisions and condition, and the observed CRP generation remains the same" clusterSchedulingPolicySnapshot="lep-...
I1030 17:55:03.522789 1 framework/framework.go:1310] "No change in scheduling decisions and condition, and the observed CRP generation remains the same" clusterSchedulingPolicySnapshot="depo...
I1030 17:55:03.930300 1 framework/framework.go:1310] "No change in scheduling decisions and condition, and the observed CRP generation remains the same" clusterSchedulingPolicySnapshot="tran...
I1030 17:55:04.400769 1 framework/framework.go:1310] "No change in scheduling decisions and condition, and the observed CRP generation remains the same" clusterSchedulingPolicySnapshot="job-...
I1030 17:55:04.812419 1 framework/framework.go:1310] "No change in scheduling decisions and condition, and the observed CRP generation remains the same" clusterSchedulingPolicySnapshot="deci...
I1030 17:55:05.282259 1 framework/framework.go:1310] "No change in scheduling decisions and condition, and the observed CRP generation remains the same" clusterSchedulingPolicySnapshot="auto...
I1030 17:55:05.690881 1 framework/framework.go:1310] "No change in scheduling decisions and condition, and the observed CRP generation remains the same" clusterSchedulingPolicySnapshot="samp...
I1030 17:55:06.145613 1 framework/framework.go:1310] "No change in scheduling decisions and condition, and the observed CRP generation remains the same" clusterSchedulingPolicySnapshot="anti...
Look at the API Server audit logs and observe that it's making updates to clusterresourceplacement/status for every clusterSchedulingPolicySnapshot object, even though there should be no changes to the object (timestamps match).
I'm not seeing any updated timestamps etc. in clusterresourceplacement/status that would warrant this /status update:
status:
  conditions:
  - lastTransitionTime: "2024-09-19T00:06:37Z" # this is not today
    message: found all the clusters needed as specified by the scheduling policy
    observedGeneration: 1
    reason: SchedulingPolicyFulfilled
    status: "True"
    type: Scheduled
  observedCRPGeneration: 1
  targetClusters:
  - clusterName: redacted
    reason: picked by scheduling policy
    selected: true
In this part of the code (fleet/pkg/controllers/clusterresourceplacement/controller.go, lines 193 to 201 at cb9a7a0) we can clearly see the status update call is made unconditionally:
	klog.ErrorS(err, "Failed to update the status", "clusterResourcePlacement", crpKObj)
	return ctrl.Result{}, err
}
Expected behavior
The controller should do apiequality.Semantic.DeepEqual(old.Status, new.Status) and skip the status update on the API when there's no reason to make the call.
This would ensure full resyncs and controller startup happen quickly, and reduce the load on the API Server.
Screenshots
Attached above
Additional context
N/A