
Velero fails to expose correct backup metrics after a pod restart #6936

Open
Ahmad-Faizan opened this issue Oct 10, 2023 · 20 comments
Assignees
Labels
backlog Metrics Related to prometheus metrics

Comments

@Ahmad-Faizan
Contributor

Ahmad-Faizan commented Oct 10, 2023

What steps did you take and what happened:
The metric velero_backup_last_status exposes the status of the latest backups.
Once a backup has been taken, the metric gets updated.
However, a pod restart between any two scheduled backups resets the metric exposed by velero_backup_last_status.
After the restart, the metric is only updated again once a new backup is created.

What did you expect to happen:
Ideally, on startup Velero should read the list of existing backups and set the velero_backup_last_status metric accordingly.
So if a backup happens at 12:00 and the Velero pod is restarted or killed at 12:30, the metric
should not be reset to 0 (which indicates that no backup has been taken).
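To illustrate the mechanism, here is a minimal, hypothetical model (not Velero's actual code) of why an in-memory gauge loses its value: on every process start the gauge holds its zero value until the next backup completes, regardless of any backup taken before the restart.

```go
package main

import "fmt"

// backupStatusGauge is a hypothetical stand-in for the in-memory
// Prometheus gauge behind velero_backup_last_status.
type backupStatusGauge struct {
	value float64 // 1 = last backup succeeded, 0 = default / failed
}

func newGauge() *backupStatusGauge {
	// On process start the gauge holds its zero value (0) until the
	// next backup completes.
	return &backupStatusGauge{}
}

func (g *backupStatusGauge) recordBackup(succeeded bool) {
	if succeeded {
		g.value = 1
	} else {
		g.value = 0
	}
}

func main() {
	g := newGauge()
	g.recordBackup(true)
	fmt.Println(g.value) // 1 after a successful backup

	g = newGauge() // simulate a pod restart
	fmt.Println(g.value) // back to 0 until the next scheduled backup
}
```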

The following information will help us better understand what's going on:

Environment:

  • Velero version (use velero version): v1.11.0
  • Velero features (use velero client config get features):
  • Kubernetes version (use kubectl version): v1.26.8
  • Kubernetes installer & version: kops v1.26.5
  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Ubuntu 20.04.5 LTS

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@yanggangtony
Contributor

Are you using version 1.11?
The default was changed to 1 in https://github.com/vmware-tanzu/velero/pull/6838/files

@yanggangtony
Contributor

I checked release 1.12; the PR is not included in it.
Maybe it will be included in release 1.13?

@allenxu404
Contributor

Ideally, the metric should read the list of backups and set the velero_backup_last_status metric.

Do you mean the velero_backup_last_status metric should read the most recently completed backup before the Velero pod restarts to determine the metric's value for scheduled backups?

The metric is reset to 0 when the Velero pod restarts because its default value is 0. This has been changed in PR #6838, as @yanggangtony mentioned.

@jkroepke

Do you mean the velero_backup_last_status metric should read the most recently completed backup before the Velero pod restarts to determine the metric's value for scheduled backups?

Yes. A default value of one seems pointless to me.

If a backup fails and Velero then restarts, the metric reflects a wrong state. My personal feeling is that using a default value as the initial value results in unexpected behavior.
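For illustration, a hypothetical alert expression against this metric shows why the initial value matters either way (assuming the conventional encoding of 1 = success, 0 = failure):

```promql
# Hypothetical alert: fires when the last backup for a schedule did
# not succeed. With a default of 0 on restart it fires spuriously;
# with a default of 1, a failure before the restart is masked until
# the next scheduled run.
velero_backup_last_status{schedule!=""} == 0
```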

@yanggangtony
Contributor

yanggangtony commented Oct 14, 2023

@jkroepke

A default value of one seems pointless to me.

In issues/6809, we observed that when Velero restarts, the schedule continues with a new cron run.

So the default value changes to 0 when the backup hits an error.

And you are suggesting not to initialize the value of velero_backup_last_status, but to calculate it in real time from the most recently completed backup.

This may warrant a discussion with the maintainers, like @allenxu404 @sseago @ywk253100

@jkroepke

Yes, I'm expecting the same behavior as velero_backup_last_successful_timestamp, where the timestamp is the real timestamp of the latest backup and not a default value.

@reasonerjt reasonerjt added the Metrics Related to prometheus metrics label Oct 16, 2023
@Ahmad-Faizan
Contributor Author

Other metrics from Velero expose non-default values after a pod restart.
A similar behaviour is expected from this metric too. As @jkroepke mentioned, Velero can calculate the timestamp of the last successful backup and expose the correct value for velero_backup_last_successful_timestamp even after a pod restart.

@weshayutin
Contributor

@mpryc FYI


This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

@yanggangtony
Contributor

not stale

@github-actions github-actions bot removed the staled label Dec 21, 2023

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

@jkroepke

I feel this is still relevant

@mpryc
Contributor

mpryc commented Feb 20, 2024

The velero_backup_last_status metric could be re-read on Velero restart, as could some of the other metrics in https://github.com/vmware-tanzu/velero/blob/main/pkg/metrics/metrics.go#L31-L86

This may have side effects that need to be checked. When Velero restarts and re-reads information about backups and their states, the time of that re-read is the time of the Velero restart, not of the actual backup. This does not apply to all the metrics, but metrics such as backupLastSuccessfulTimestamp need to be handled carefully.

Another solution would be to not report any metrics after a restart and only expose the ones for backups that happen after the restart. This would, however, require modifying the Prometheus queries to gather information about past events from the database.


This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

@jkroepke

For weekly backups, it also takes a while until the status is reported correctly.

@github-actions github-actions bot removed the staled label Apr 22, 2024
@vinayan3

vinayan3 commented Jun 1, 2024

We are hitting this issue as well: the pod gets restarted because we replace nodes on a regular cadence, which causes backup metrics to be misreported. We have held off setting up alerts because of this.

@jkroepke

jkroepke commented Jun 1, 2024

@vinayan3 we are using velero_backup_last_successful_timestamp, since it works as expected.

velero_backup_last_status is just unusable for now.
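As an illustration of that workaround (hypothetical threshold; adjust to the schedule's interval), an age-based alert on the timestamp metric is robust to restarts:

```promql
# Fires when the newest successful backup of a daily schedule is older
# than 26 hours. Unlike velero_backup_last_status, this timestamp is
# exposed correctly again after a pod restart, so the alert does not
# misfire on restarts.
time() - velero_backup_last_successful_timestamp{schedule!=""} > 26 * 3600
```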


This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

@vinayan3

This isn't stale.

@kaovilai
Member

unstalev2

@github-actions github-actions bot removed the staled label Aug 1, 2024
9 participants