-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adjust the logic for the backup_last_status metric to stop incorrectly incrementing over time #7445
Merged
blackpiglet
merged 1 commit into
vmware-tanzu:main
from
allenxu404:backup-last-status-fixed
Feb 21, 2024
+2
−1
Merged
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
Adjust the logic for the backup_last_status metrics to stop incorrectly incrementing over time |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found this PR when trying to figure out why the metric was increasing; if I understand correctly, this ends up running periodically (every time the schedule is reconciled), right?
If so, wouldn't this incorrectly clear a failing status? Say that the schedule created a backup that failed, the metric would become 0, and then get reset back to 1 after a couple of minutes, which seems problematic since this can hide failures.
If anything, I think lines 470-471 should be omitted entirely. This would prevent from emitting a data point until the point where we actually have one, which should still fix what #6838 was aiming for. If an accurate status must be set, it should be done as part of scheduleReconciler.Reconcile, using the status of the last backup created by the schedule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Snaipe yes, you're right.
This PR only aimed to address the specific bug caused in #6838. I will create a separate PR to resolve the issue with incorrect failure status clear and keep you posted. Thanks for highlighting this larger problem!
The purpose of setting the default value of backup_last_status to 1 was to prevent mistaken failure reports that could occur if the code exits without updating that metric. I prefer to leave that implementation unchanged. Upon further analysis, I think the true underlying issue seems to be that metrics are re-initialized each time the schedule is reconciled, instead, the metrics should be initialized only if a new backup is due to be created. So I think the
InitSchedule
function would be better placed below, where backups are created, to avoid unnecessary reset. what's your thought?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'd help, but I'm not sure it would fix the underlying problem. As far as I can tell, what this would do is that if your schedule is failing, and a new backup is created, then the schedule would be reported as "fixed" metrics-wise for the entire duration of the backup.
It's less bad than the first situation, but it would definitely confuse alerting.
I think not advertising a value for the metric up until the point where there is a data point seems slightly better in the sense that users can choose to ignore them or set it to the default value of their choice (0 or 1) in grafana/alertmanager. A better solution would be to set the initial value of the metric directly to the result of the last backup of the schedule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, will raise another issue to track it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems we've already has the similar issue to track: #6936