
Backups failing when schedule for whole cluster and schedule with pvc are running at the same time because velero deletes volumesnapshots #7625

Open
Elyytscha opened this issue Apr 4, 2024 · 14 comments
Labels: Area/CSI (Related to Container Storage Interface support), Needs investigation


Elyytscha commented Apr 4, 2024

What steps did you take and what happened:

It's not that easy; this is a really complex problem in my opinion, and I'm not sure whether I'm completely right or have all the steps needed to reproduce it reliably.

Let's assume we have two backup schedules: one that backs up all k8s resources for the whole cluster without PVCs,
and another that backs up a specific namespace with its PVCs at the same time.
This can fail because the first schedule wants to back up the VolumeSnapshot resources while the second schedule deletes the VolumeSnapshots after its snapshot completes successfully.

# general-etcd-backup-without-pvc
apiVersion: velero.io/v1
kind: Schedule
metadata:
  annotations:
  name: general-etcd-backup-schedule
  namespace: velero
spec:
  schedule: "0 1 * * *"
  skipImmediately: false
  template:
    snapshotVolumes: false
    ttl: 720h
---
# vault-backup-with-pvc
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: vault-backup-with-pvc
  namespace: velero
spec:
  schedule: "0 1 * * *"
  skipImmediately: false
  template:
    includeClusterResources: true
    includedNamespaces:
      - core-vault-migration
    includedResources:
      - pv
      - pvc
      - secret
      - vault
      - configmap
      - deployment
      - service
      - statefulset
      - pod
      - ingress
      - replicaset
    labelSelector:
      matchLabels:
        vault_cr: vault
    snapshotVolumes: true
    storageLocation: default
    volumeSnapshotLocations:
      - default

In a setup where CSI is used with a VolumeSnapshotClass, the general backup schedule above CAN fail, because the schedule running at the same time with PVC backup enabled does (IMO) really weird things:

The VolumeSnapshot objects will be removed from the cluster after the backup is uploaded to the object storage, so that the namespace that is backed up can be deleted without removing the snapshot in the storage provider if the DeletionPolicy is Delete.
(https://velero.io/docs/main/csi/)

Velero deletes the VolumeSnapshot and keeps only the VolumeSnapshotContent; the argument is that this way a backup can't be deleted along with the namespace.

  1. I want those objects to remain; they are what someone will use to create a new PVC with the backed-up data.
    Removing them disables core functionality and defeats much of the purpose of VolumeSnapshots.

  2. Whether the backup should be kept when I delete the namespace should be determined by the deletion policy.

If the deletionPolicy is Delete, then the underlying storage snapshot will be deleted along with the VolumeSnapshotContent object. If the deletionPolicy is Retain, then both the underlying snapshot and VolumeSnapshotContent remain.
(https://kubernetes.io/docs/concepts/storage/volume-snapshot-classes/#deletionpolicy)

  3. It creates ERRORS, which is why I'm opening this ticket: when those two schedules run at the same time, this CAN happen (I'm not yet sure whether it happens every time; possibly some sort of race condition).
    It CAN fail because the first schedule wants to back up the VolumeSnapshot resources while the second schedule deletes the VolumeSnapshots after its snapshot completed successfully:
velero backup logs backup-schedule-20240404010045 -n core-backup | grep error
W0404 11:47:52.061687  564298 gcp.go:120] WARNING: the gcp auth plugin is deprecated in v1.22+, unavailable in v1.25+; use gcloud instead.
To learn more, consult https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke
time="2024-04-04T01:02:18Z" level=error msg="Fail to patch volumesnapshot with content {\"metadata\":{\"annotations\":{\"velero.io/csi-vsc-deletion-policy\":\"Retain\",\"velero.io/csi-volumesnapshot-handle\":\"projects/sep-infra/global/snapshots/snapshot-de6ee1ba-f64e-40e0-b893-78f262a12307\",\"velero.io/csi-driver-name\":\"pd.csi.storage.gke.io\",\"velero.io/vsi-volumesnapshot-restore-size\":\"1Gi\"}}}: volumesnapshots.snapshot.storage.k8s.io \"velero-vault-file-vault-0-444mc\" not found." backup=core-backup/backup-schedule-20240404010045 cmd=/plugins/velero-plugin-for-csi logSource="/go/src/velero-plugin-for-csi/internal/backup/volumesnapshot_action.go:169" pluginName=velero-plugin-for-csi
time="2024-04-04T01:02:18Z" level=info msg="1 errors encountered backup up item" backup=core-backup/backup-schedule-20240404010045 logSource="pkg/backup/backup.go:457" name=velero-vault-file-vault-0-444mc
time="2024-04-04T01:02:18Z" level=error msg="Error backing up item" backup=core-backup/backup-schedule-20240404010045 error="error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=core-vault-migration, name=velero-vault-file-vault-0-444mc): rpc error: code = Unknown desc = volumesnapshots.snapshot.storage.k8s.io \"velero-vault-file-vault-0-444mc\" not found" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/backup/item_backupper.go:380" error.function="github.com/vmware-tanzu/velero/pkg/backup.(*itemBackupper).executeActions" logSource="pkg/backup/backup.go:461" name=velero-vault-file-vault-0-444mc
time="2024-04-04T01:02:18Z" level=error msg="Fail to patch volumesnapshot with content {\"metadata\":{\"annotations\":{\"velero.io/csi-vsc-deletion-policy\":\"Retain\",\"velero.io/csi-volumesnapshot-handle\":\"projects/sep-infra/global/snapshots/snapshot-6268e816-a697-4340-a130-b7b2457d7dc4\",\"velero.io/csi-driver-name\":\"pd.csi.storage.gke.io\",\"velero.io/vsi-volumesnapshot-restore-size\":\"1Gi\"}}}: volumesnapshots.snapshot.storage.k8s.io \"velero-vault-file-vault-1-jb9tw\" not found." backup=core-backup/backup-schedule-20240404010045 cmd=/plugins/velero-plugin-for-csi logSource="/go/src/velero-plugin-for-csi/internal/backup/volumesnapshot_action.go:169" pluginName=velero-plugin-for-csi
time="2024-04-04T01:02:18Z" level=info msg="1 errors encountered backup up item" backup=core-backup/backup-schedule-20240404010045 logSource="pkg/backup/backup.go:457" name=velero-vault-file-vault-1-jb9tw
time="2024-04-04T01:02:18Z" level=error msg="Error backing up item" backup=core-backup/backup-schedule-20240404010045 error="error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=core-vault-migration, name=velero-vault-file-vault-1-jb9tw): rpc error: code = Unknown desc = volumesnapshots.snapshot.storage.k8s.io \"velero-vault-file-vault-1-jb9tw\" not found" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/backup/item_backupper.go:380" error.function="github.com/vmware-tanzu/velero/pkg/backup.(*itemBackupper).executeActions" logSource="pkg/backup/backup.go:461" name=velero-vault-file-vault-1-jb9tw
time="2024-04-04T01:02:18Z" level=error msg="Fail to patch volumesnapshot with content {\"metadata\":{\"annotations\":{\"velero.io/csi-vsc-deletion-policy\":\"Retain\",\"velero.io/csi-volumesnapshot-handle\":\"projects/sep-infra/global/snapshots/snapshot-29f79263-84f3-419e-8ffd-228fd5d84264\",\"velero.io/csi-driver-name\":\"pd.csi.storage.gke.io\",\"velero.io/vsi-volumesnapshot-restore-size\":\"1Gi\"}}}: volumesnapshots.snapshot.storage.k8s.io \"velero-vault-file-vault-2-wk9cs\" not found." backup=core-backup/backup-schedule-20240404010045 cmd=/plugins/velero-plugin-for-csi logSource="/go/src/velero-plugin-for-csi/internal/backup/volumesnapshot_action.go:169" pluginName=velero-plugin-for-csi
time="2024-04-04T01:02:18Z" level=info msg="1 errors encountered backup up item" backup=core-backup/backup-schedule-20240404010045 logSource="pkg/backup/backup.go:457" name=velero-vault-file-vault-2-wk9cs
time="2024-04-04T01:02:18Z" level=error msg="Error backing up item" backup=core-backup/backup-schedule-20240404010045 error="error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=core-vault-migration, name=velero-vault-file-vault-2-wk9cs): rpc error: code = Unknown desc = volumesnapshots.snapshot.storage.k8s.io \"velero-vault-file-vault-2-wk9cs\" not found" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/backup/item_backupper.go:380" error.function="github.com/vmware-tanzu/velero/pkg/backup.(*itemBackupper).executeActions" logSource="pkg/backup/backup.go:461" name=velero-vault-file-vault-2-wk9cs

What did you expect to happen:

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue; for more options, refer to velero debug --help.
I have a support bundle created, but I cannot upload it here because it contains sensitive information, like the BackupStorageLocations, so the submitted bundle could leak things like bucket names, GCP projects, etc.

Anything else you would like to add:

Environment:

Velero Version: v1.13.0
velero/velero-plugin-for-gcp:v1.9.0
velero/velero-plugin-for-csi:v0.7.0
Kubernetes Version: v1.27.10-gke.1055000

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"

kaovilai commented Apr 4, 2024

I want to have those objects remaining, thats the objects someone will use to create a new pvc from with the backuped data

You can create a new VolumeSnapshot from the VSC. I don't think the use case here justifies more config complexity; most people test Velero by deleting the namespace, which would delete the VolumeSnapshot and its VSC. Someone could then test a Velero CSI backup and claim restore isn't working due to the missing VSC, deleted via the deletion of the VS during the disaster (namespace deletion).

Velero today generally does not run more than one backup at a time, but it may be possible that the VolumeSnapshot deletion from the first backup completing does not block a new backup from starting.
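For reference, re-binding a retained VolumeSnapshotContent to a fresh VolumeSnapshot looks roughly like this (a sketch; names are placeholders, and note that the VolumeSnapshotContent's volumeSnapshotRef must also point at the new VolumeSnapshot for the two objects to bind):

```yaml
# Hypothetical example: expose a retained VolumeSnapshotContent as a new
# VolumeSnapshot so a PVC can be provisioned from the snapshot again.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: recovered-snapshot          # placeholder name
  namespace: core-vault-migration   # namespace from this issue
spec:
  source:
    # pre-provisioned binding: reference the content object directly
    volumeSnapshotContentName: existing-vsc-name   # placeholder
```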


Elyytscha commented Apr 4, 2024

You can create a new VolumeSnapshot from VSC. I don't think the usecase here justify more config complexity, most people test velero by deleting namespace which will delete VolumeSnapshot and its VSC. Someone can try test velero csi backup and claim restore isn't working due to missing VSC deleted via deletion of VS during disaster (namespace deletion).

But this is, per definition, what DeletionPolicy in the VolumeSnapshotClass does, or is meant to do:

It enables you to configure what happens to a VolumeSnapshotContent when the VolumeSnapshot object it is bound to is to be deleted.
If the deletionPolicy is Delete, then the underlying storage snapshot will be deleted along with the VolumeSnapshotContent object. If the deletionPolicy is Retain, then both the underlying snapshot and VolumeSnapshotContent remain.
(https://kubernetes.io/docs/concepts/storage/volume-snapshot-classes/#deletionpolicy)

If Velero recommends, per default, a VolumeSnapshotClass for Velero backups with
DeletionPolicy = Retain
in the docs for CSI, situations like someone testing it the way you described shouldn't arise, or at least they can be pointed to the docs.

it would be the exact same situation.
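For context, the class the Velero CSI docs point at would look roughly like this (a sketch built from values in this issue; the label is what Velero uses to pick the class for a given driver):

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: velero-snapclass            # placeholder name
  labels:
    # label the Velero CSI integration looks for when choosing a class
    velero.io/csi-volumesnapshot-class: "true"
driver: pd.csi.storage.gke.io       # driver from the logs above
deletionPolicy: Retain              # underlying snapshot survives VS deletion
```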

So wouldn't it:

  1. reduce complexity for Velero, because the code for the removal could be removed,
  2. make room for a decision where someone wants an (additional) VolumeSnapshotClass setup that keeps the VS in the cluster, because they explicitly want ephemeral backups, for example in e2e or dev environments, where backups are needed/wanted, but if you delete the namespace you explicitly want the backups deleted with it too, because backups can also cost a lot of $ if your k8s is hosted by a cloud provider,
  3. and fix the race condition we are seeing here?

@ywk253100 ywk253100 added Area/CSI Related to Container Storage Interface support Needs investigation labels Apr 5, 2024

kaovilai commented Apr 8, 2024

  1. reduce complexity for velero because the code for removal can be removed

this code stays, because it's used in CI, and is the current behavior.

  3. fix the race condition we are seeing here

this can be fixed without removal of 1.


kaovilai commented Apr 8, 2024

The default behavior is unlikely to change, if anything, the requirements you sought would add more configuration IMO.

if you delete the namespace, you want explicitly to delete the backups with it too

That is usually not the case; it would need to be explicitly configured if this behavior is pursued.

Elyytscha commented:

this code stays, because it's used in CI, and is the current behavior.

I'm fine with that. I also agree that this can be fixed without changing that behaviour (though I would still call it a workaround), but in that case I would like to vote for an additional setting, maybe like this:

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: general-etcd-backup-schedule
  namespace: velero
spec:
...
  disableVolumeSnapshotDeletion: true
...

which would be false per default. Maybe I should open a separate issue for this.

That is usually not the case, it needs to be explicitly set that way if this behavior is pursued.

I already noticed that when I tried to delete backups, which were magically appearing again without a tool like ArgoCD doing it. Could you maybe link me to the documentation describing how I can turn that off (for specific schedules)?


kaovilai commented Apr 9, 2024

I tried to delete backups .. magically appearing again

I assume you deleted via kubectl, which doesn't work because Velero syncs backups that are missing from the cluster back from the object store.

velero backup delete <backupName> will delete the backup resource, including all data in object/block storage.
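Under the hood, the CLI creates a DeleteBackupRequest that the Velero server processes. An equivalent sketch as a manifest (the request name is a placeholder; the backup name is taken from the logs above):

```yaml
# Hypothetical manifest equivalent of `velero backup delete --confirm`:
# the server removes the Backup resource and its data in object storage.
apiVersion: velero.io/v1
kind: DeleteBackupRequest
metadata:
  name: delete-my-backup            # placeholder
  namespace: velero
spec:
  backupName: backup-schedule-20240404010045
```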

Elyytscha commented:

which doesn't work because velero sync backup missing from cluster from object store.

That's one of the reasons why I opened this ticket: in specific scenarios I want to delete the backup from (cloud provider) storage when I delete the namespace, because that's how VolumeSnapshots should work, or it should at least be configurable.
For obvious reasons: cloud is expensive, and CI/CD / ArgoCD deletes namespaces.

To note, this code creates other issues: #7648

On the other hand, I never had time to test it, but what would happen if I had a VolumeSnapshotClass like this:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: test-snapclass
driver: disk.csi.cloud.com
deletionPolicy: Delete

Would this directly delete my backups right after they were taken (because the code deletes the VolumeSnapshot, which results in a cascading delete of the VolumeSnapshotContent)?


sseago commented Apr 12, 2024

@Elyytscha one problem with "delete backup when deleting namespace" is that backups are shared across clusters. If I add the same BSL to 2 clusters, then cluster1 backs up namespace "foo", backup gets synced to cluster 2, then cluster2 deletes namespace "foo", you'd lose the cluster1 backup.

Another problem with automatically deleting backups on namespace deletion is that one of the reasons users make backups is so that accidentally deleted data can be recovered. What if I have 2 namespaces, "myapp" (my main app namespace) and "myapp-1" (a clone I made for temporary use), and I accidentally type "kubectl delete myapp 1" when I meant to type "kubectl delete myapp-1"? Now I've lost my main app, so I need to restore it from backup. But if Velero deleted all of the backups for my main app when I accidentally deleted the app, that defeats the purpose of having backed it up. I think a better approach would be to modify whatever workflow you're using to delete namespaces to also delete specific backups if you need to.


Elyytscha commented Apr 13, 2024

@Elyytscha one problem with "delete backup when deleting namespace" is that backups are shared across clusters. If I add the same BSL to 2 clusters, then cluster1 backs up namespace "foo", backup gets synced to cluster 2, then cluster2 deletes namespace "foo", you'd lose the cluster1 backup.

Another problem with automatically deleting backups on namespace deletion is that one of the reasons users make backups is so that if data is accidentally deleted, it can be recovered. What if I have 2 namespaces "myapp" (my main app namespace) and "myapp-1" (a clone I made for temporary use) and I accidentally type "kubectl delete myapp 1" when I meant to type "kubectl delete myapp-1" Now I've lost my main app, so I need to restore it from backup. But if Velero deleted all of the backups for my main app when I accidentally deleted the app, that defeats the purpose of having backed it up.

all of this couldn't arise if one could have..

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: default-backup
  labels:
    velero.io/csi-volumesnapshot-class: "true"
driver: disk.csi.cloud.com
deletionPolicy: Retain
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ephremeral-backup
driver: disk.csi.cloud.com
deletionPolicy: Delete
---
# safe backup
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: test-backup
spec:
    includedNamespaces:
    - default
---
# ephremeral backup
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: test-backup-ephremeral
  annotations:
    velero.io/csi-volumesnapshot-class_disk.csi.cloud.com: "ephremeral-backup"
spec:
    includedNamespaces:
    - default

I think a better approach would be to modify whatever workflow you're using to delete namespaces to also delete specific backups if you need to.

The problem is that users often don't control the backups; they have no choice whether their namespace gets backed up or not, since that is defined by CI/CD. But when a user removes an app (its k8s resources are deleted from the git repo),
ArgoCD will terminate those resources while the backups still remain, so in this case ArgoCD needs to terminate those VolumeSnapshots, not Velero.

And in CICD someone would want to define:
prod -> VolumeSnapshotClass -> default-backup
staging -> VolumeSnapshotClass -> ephremeral-backup
dev -> VolumeSnapshotClass -> ephremeral-backup
feature/branches -> VolumeSnapshotClass -> ephremeral-backup

This is how cloud-native CI/CD works in general: you have Tekton, which adds, updates, and deletes Kubernetes manifests in git repos, which are then picked up (or deleted) by ArgoCD. It could, or should, be that simple.

But I mean, if no one else sees this like me, I might just be allergic to CI/CD systems that don't handle correct cleanup of resources...


Elyytscha commented Oct 4, 2024

I just wanted to note that we still have this issue in production, that we have failing backups because of it, and that our only workaround right now is to have NO backup schedules running at the same time.

status:
  formatVersion: 1.1.0
  errors: 2
  progress:
    itemsBackedUp: 3644
    totalItems: 3644
  expiration: '2024-11-03T01:00:20Z'
  startTimestamp: '2024-10-04T01:00:20Z'
  hookStatus:
    hooksAttempted: 6
  version: 1
  completionTimestamp: '2024-10-04T01:00:49Z'
  phase: PartiallyFailed
time="2024-10-04T01:00:46Z" level=error msg="Fail to patch VolumeSnapshot: volumesnapshots.snapshot.storage.k8s.io \"velero-central-db-5blj4\" not found." backup=core-backup/backup-schedule-20241004010000 cmd=/velero logSource="pkg/backup/actions/csi/volumesnapshot_action.go:223" pluginName=velero
time="2024-10-04T01:00:46Z" level=info msg="1 errors encountered backup up item" backup=core-backup/backup-schedule-20241004010000 logSource="pkg/backup/backup.go:507" name=velero-central-db-5blj4
time="2024-10-04T01:00:46Z" level=error msg="Error backing up item" backup=core-backup/backup-schedule-20241004010000 error="error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=stackrox, name=velero-central-db-5blj4): rpc error: code = Unknown desc = volumesnapshots.snapshot.storage.k8s.io \"velero-central-db-5blj4\" not found" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/backup/item_backupper.go:400" error.function="github.com/vmware-tanzu/velero/pkg/backup.(*itemBackupper).executeActions" logSource="pkg/backup/backup.go:511" name=velero-central-db-5blj4

Velero with CSI backups is basically unusable in this state; every time someone adds a backup schedule running at the same time as an existing one, we run into this.


kaovilai commented Oct 4, 2024

Please create a design proposal that we can discuss with the maintainers the feasibility.


kaovilai commented Oct 4, 2024


sseago commented Oct 4, 2024

From what I remember, we should only be deleting VS for the current backup -- if that is correct, then there may be a bug here. We should only be deleting the specific VolumeSnapshots that were created by the current backup. This may be a consequence of the recent change to use async actions for CSI snapshots. If there are 2 queued backups that overlap, when the first backup moves to WaitingForPluginOperations, the second backup starts -- this backup may find the first backup's VS resources and back them up as well, and then the delete code at the end of backup processing may end up removing them.

I think there are 2 things we may need to do here:

  1. confirm that our VS/VSC deletion code never deletes a VS or VSC that doesn't belong to the current backup (and if we're not filtering that properly, then we should fix that)
  2. We may want to also filter out backed-up VolumeSnapshots that belong to another backup on restore -- RestoreItemAction plugins have the opportunity to discard a resource. If we're restoring a VS and this VS 1) has a velero backup label and 2) the backup name is different from the backup we're restoring, then we want to discard it.


Elyytscha commented Oct 7, 2024

From what I remember, we should only be deleting VS for the current backup -- if that is correct, then there may be a bug here. We should only be deleting specific VolumeSnapshots that were created by the current backup. This may be a consequence of the recent change to use async actions for CSI snapshots. If there are 2 queued backups that overlap, when the first backup moves to WaitingForPlugin operations, the second backup starts -- this backup may find the first backup's VS resources and back them up as well, and then the delete code at the end of backup processing may end up removing it.

I think there are 2 things we may need to do here:

  1. confirm that our VS/VSC deletion code never deletes a VS or VSC that doesn't belong to the current backup (and if we're not filtering that properly, then we should fix that)
  2. We may want to also filter out backed-up VolumeSnapshots that belong to another backup on restore -- RestoreItemAction plugins have the opportunity to discard a resource. If we're restoring a VS and this VS 1) has a velero backup label and 2) the backup name is different from the backup we're restoring, then we want to discard it.

As far as I have seen, the issue is basically:

Two backups run at the same time; one backs up a namespace with CSI-based PVCs, the other backs up the cluster globally. The global backup runs, finds the VolumeSnapshot dynamically created by the namespaced CSI PVC backup, and adds it to its list of items to back up; the namespaced CSI backup then deletes its VolumeSnapshot; the global backup now can't find the VolumeSnapshot anymore and ends up PartiallyFailed.

So basically, what should be done is somehow preventing VolumeSnapshots dynamically created by Velero from being included in Velero backups (Velero deletes the VolumeSnapshot after it has done the backup, so why should Velero back up those VolumeSnapshots if it deletes them afterwards),

or preventing VolumeSnapshots from OTHER Velero backups from being included in another backup schedule.

We also don't see this every day; I would say that out of 10 runs of the general backup schedule, 1 fails with this error, so it's sporadic.
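Until that is fixed, one possible mitigation (a sketch, not verified here) is to exclude the snapshot API objects from the cluster-wide schedule, so it never tries to back up VolumeSnapshots that another, concurrent backup creates and then deletes:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: general-etcd-backup-schedule
  namespace: velero
spec:
  schedule: "0 1 * * *"
  template:
    snapshotVolumes: false
    ttl: 720h
    # assumption: dropping these resources from the cluster-wide backup
    # avoids the race over VolumeSnapshots owned by other backups
    excludedResources:
      - volumesnapshots.snapshot.storage.k8s.io
      - volumesnapshotcontents.snapshot.storage.k8s.io
```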

Please create a design proposal that we can discuss with the maintainers the feasibility.

I'm fine with the architecture; IMO Velero does not need a redesign here right now, but this bug should get fixed, and for that this ticket should be sufficient.
