Adjust the logic for the backup_last_status metric to stop incorrectly incrementing over time #7445
Conversation
Force-pushed from 10909b6 to 269d806 (compare)
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@           Coverage Diff           @@
##             main    #7445   +/-   ##
=======================================
  Coverage   61.64%   61.64%
=======================================
  Files         262      262
  Lines       28477    28477
=======================================
  Hits        17556    17556
  Misses       9689     9689
  Partials     1232     1232

☔ View full report in Codecov by Sentry.
…ly incrementing over time
Signed-off-by: allenxu404 <[email protected]>
Force-pushed from 269d806 to 84fb88c (compare)
@@ -468,7 +468,7 @@ func (m *ServerMetrics) InitSchedule(scheduleName string) {
 		c.WithLabelValues(scheduleName).Add(0)
 	}
 	if c, ok := m.metrics[backupLastStatus].(*prometheus.GaugeVec); ok {
-		c.WithLabelValues(scheduleName).Add(1)
+		c.WithLabelValues(scheduleName).Set(float64(1))
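To make the change concrete, here is a minimal standalone sketch (not Velero code; the metric name, label, and values are only illustrative) of how a client_golang GaugeVec child behaves under repeated Add versus Set calls:

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

func main() {
	g := prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "backup_last_status", Help: "illustrative gauge"},
		[]string{"schedule"},
	)

	// Old behavior: Add(1) on every reconcile pass keeps growing the gauge.
	for i := 0; i < 3; i++ {
		g.WithLabelValues("daily").Add(1)
	}
	fmt.Println(testutil.ToFloat64(g.WithLabelValues("daily"))) // 3

	// New behavior: Set(1) is idempotent, so repeated passes leave the value at 1.
	for i := 0; i < 3; i++ {
		g.WithLabelValues("weekly").Set(float64(1))
	}
	fmt.Println(testutil.ToFloat64(g.WithLabelValues("weekly"))) // 1
}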
Found this PR when trying to figure out why the metric was increasing; if I understand correctly, this ends up running periodically (every time the schedule is reconciled), right?
If so, wouldn't this incorrectly clear a failing status? Say the schedule created a backup that failed: the metric would become 0, then get reset back to 1 after a couple of minutes, which seems problematic since it can hide failures.
If anything, I think lines 470-471 should be omitted entirely. This would prevent emitting a data point until we actually have one, which should still fix what #6838 was aiming for. If an accurate status must be set, it should be done as part of scheduleReconciler.Reconcile, using the status of the last backup created by the schedule.
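As a side note, a quick standalone sketch (not Velero code) of the client_golang behavior this relies on: a GaugeVec exports no series for a label set until a child is actually created, so dropping the init block simply leaves backup_last_status absent for the schedule until a real result is recorded.

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

func main() {
	g := prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "backup_last_status", Help: "illustrative gauge"},
		[]string{"schedule"},
	)

	// No child has been created yet, so nothing is exported for this metric.
	fmt.Println(testutil.CollectAndCount(g)) // 0

	// Once the first real backup result is recorded, the series appears.
	g.WithLabelValues("daily").Set(1)
	fmt.Println(testutil.CollectAndCount(g)) // 1
}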
@Snaipe yes, you're right.
This PR only aimed to address the specific bug caused by #6838. I will create a separate PR to resolve the issue of the failure status being incorrectly cleared and keep you posted. Thanks for highlighting this larger problem!
The purpose of setting the default value of backup_last_status to 1 was to prevent mistaken failure reports that could occur if the code exits without updating that metric, so I prefer to leave that implementation unchanged. Upon further analysis, I think the true underlying issue is that the metrics are re-initialized each time the schedule is reconciled; instead, they should be initialized only when a new backup is due to be created. So I think the InitSchedule call would be better placed below, where backups are created, to avoid unnecessary resets. What are your thoughts?
	if c.ifDue(schedule, cronSchedule) && !c.checkIfBackupInNewOrProgress(schedule) {
		if err := c.submitBackup(ctx, schedule); err != nil {
			return ctrl.Result{}, errors.Wrapf(err, "error submit backup for schedule %s", req.String())
		}
		c.metrics.InitSchedule(schedule.Name)
	}
I think it'd help, but I'm not sure it would fix the underlying problem. As far as I can tell, if your schedule is failing and a new backup is created, the schedule would be reported as "fixed" metrics-wise for the entire duration of that backup.
It's less bad than the first situation, but it would definitely confuse alerting.
I think not advertising a value for the metric until there is an actual data point seems slightly better, in the sense that users can choose to ignore the missing series or default it to the value of their choice (0 or 1) in Grafana/Alertmanager. A better solution would be to set the initial value of the metric directly to the result of the last backup of the schedule.
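A hedged sketch of that last idea (not the actual Velero reconciler code: the seedLastStatus helper, the "schedule" label, and the phase strings are assumptions for illustration): look up the phase of the newest backup produced by the schedule and seed the gauge from it, emitting nothing when there is no history yet.

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

var backupLastStatus = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{Name: "backup_last_status", Help: "illustrative gauge"},
	[]string{"schedule"},
)

// seedLastStatus seeds the gauge from the phase of the schedule's most recent
// backup. An empty phase means the schedule has never produced a backup, in
// which case no series is created and alerting can pick its own default.
func seedLastStatus(scheduleName, lastPhase string) {
	switch lastPhase {
	case "":
		return // no history yet: leave the series absent
	case "Completed":
		backupLastStatus.WithLabelValues(scheduleName).Set(1)
	default:
		backupLastStatus.WithLabelValues(scheduleName).Set(0)
	}
}

func main() {
	seedLastStatus("daily", "PartiallyFailed") // hypothetical last phase: reported as failed
	seedLastStatus("weekly", "")               // never ran: no series emitted
	fmt.Println(testutil.ToFloat64(backupLastStatus.WithLabelValues("daily"))) // 0
	fmt.Println(testutil.CollectAndCount(backupLastStatus))                    // 1
}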
A better solution would be to set the initial value of the metric directly to the result of the last backup of the schedule.
I agree, will raise another issue to track it.
Seems we already have a similar issue to track this: #6936
Thank you for contributing to Velero!

Please add a summary of your change

Does your change fix a particular issue?
Fixes #7378

Please indicate you've done the following:
Added /kind changelog-not-required as a comment on this pull request.