Backups failing when a schedule for the whole cluster and a schedule with PVCs run at the same time, because velero deletes VolumeSnapshots #7625
Comments
You can create a new VolumeSnapshot from the VSC. I don't think the use case here justifies more config complexity; most people test velero by deleting a namespace, which would delete the VolumeSnapshot and its VSC. Someone could test a velero CSI backup and claim the restore isn't working because the VSC was deleted via deletion of the VS during the disaster (namespace deletion). Velero today generally does not run more than one backup at a time, but it may be possible that the VolumeSnapshot deletion from the first backup completing does not block a new backup from starting. |
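For context, re-creating a VolumeSnapshot from a retained VolumeSnapshotContent follows the static-provisioning pattern of the CSI snapshot API. A minimal sketch, with purely illustrative names that are not taken from this issue:

# Hypothetical example: bind a new VolumeSnapshot to an existing, retained
# VolumeSnapshotContent so a PVC can later be provisioned from it.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: restored-snapshot          # illustrative name
  namespace: myapp                 # illustrative namespace
spec:
  source:
    volumeSnapshotContentName: snapcontent-1234   # the retained VSC

Depending on how the VolumeSnapshotContent was created, its spec.volumeSnapshotRef may also need to be updated to point at the new VolumeSnapshot before the two objects bind.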
But this is by definition what the DeletionPolicy in the VolumeSnapshotClass does, or is meant to do.
If velero recommended by default a VolumeSnapshotClass for velero backups with that policy, it would be the exact same situation, wouldn't it?
|
This code stays, because it's used in CI and is the current behavior.
This can be fixed without removal of 1. |
The default behavior is unlikely to change; if anything, the requirements you're asking for would add more configuration, IMO.
That is usually not the case; it needs to be explicitly set that way if this behavior is pursued. |
I'm fine with that. I also agree that this can be fixed without changing that behaviour (I would still call it a workaround), but in that case I would like to vote for an additional setting, maybe like this:

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: general-etcd-backup-schedule
  namespace: velero
spec:
  ...
  disableVolumeSnaphotDeletion: true
  ...

which would be false by default. Maybe I should open a separate issue for this.
I already noticed that when I tried to delete backups, which were magically reappearing without a tool like ArgoCD doing it. Could you maybe link me to the documentation where it's documented how I could turn that off (for specific schedules)? |
I assume you deleted via kubectl, which doesn't work because velero syncs backups that are missing from the cluster back from the object store.
|
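For reference, a sketch of the supported deletion path, with an illustrative backup name: rather than deleting the Backup object with kubectl, the velero backup delete CLI command creates a DeleteBackupRequest, which also removes the backup data from object storage so the Backup does not get re-synced. Such a request can also be created directly:

# Hypothetical example: request deletion of a backup through velero itself
# instead of removing the Backup resource with kubectl.
apiVersion: velero.io/v1
kind: DeleteBackupRequest
metadata:
  name: delete-myapp-backup          # illustrative name
  namespace: velero
spec:
  backupName: myapp-daily-20240401   # illustrative backup name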
That's one of the reasons why I opened this ticket: I want to delete the backup from (cloud provider) storage when I delete the namespace in specific scenarios, because that's how VolumeSnapshots should work, or it should at least be configurable. To note, this code creates other issues: #7648

On the other side, I never had time to test it, but what would happen if I have a VolumeSnapshotClass like this:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: test-snapclass
driver: disk.csi.cloud.com
deletionPolicy: Delete

Would this directly delete my backups right after they have been taken (because the code deletes the VolumeSnapshot, which results in a cascading delete of the VolumeSnapshotContent)? |
@Elyytscha one problem with "delete backup when deleting namespace" is that backups are shared across clusters. If I add the same BSL to 2 clusters, then cluster1 backs up namespace "foo", the backup gets synced to cluster2, and then cluster2 deletes namespace "foo", you'd lose the cluster1 backup.

Another problem with automatically deleting backups on namespace deletion is that one of the reasons users make backups is so that if data is accidentally deleted, it can be recovered. What if I have 2 namespaces, "myapp" (my main app namespace) and "myapp-1" (a clone I made for temporary use), and I accidentally type "kubectl delete myapp 1" when I meant to type "kubectl delete myapp-1"? Now I've lost my main app, so I need to restore it from backup. But if Velero deleted all of the backups for my main app when I accidentally deleted the app, that defeats the purpose of having backed it up.

I think a better approach would be to modify whatever workflow you're using to delete namespaces to also delete specific backups if you need to. |
All of this couldn't arise if one could have..

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: default-backup
  labels:
    velero.io/csi-volumesnapshot-class: "true"
driver: disk.csi.cloud.com
deletionPolicy: Retain
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ephremeral-backup
driver: disk.csi.cloud.com
deletionPolicy: Delete
---
# safe backup
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: test-backup
spec:
  includedNamespaces:
    - default
---
# ephremeral backup
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: test-backup-ephremeral
  annotations:
    velero.io/csi-volumesnapshot-class_disk.csi.cloud.com: "ephremeral-backup"
spec:
  includedNamespaces:
    - default
The problem is that users often don't control the backups; they don't have a choice whether their namespace gets backed up or not, that is defined by CI/CD. But when a user removes an app (the k8s resources are deleted from the git repo), in CI/CD someone would want to define what should happen then. This is how cloud-native CI/CD works in general: you have Tekton, which adds, updates and deletes kubernetes manifests in git repos, which are then picked up (or deleted) by ArgoCD; it could or should be that simple.. But I mean, if no one sees this like me, I might just be allergic to CI/CD systems which don't handle correct cleanup of resources... |
I just wanted to note that we still have this issue in production, that we have failing backups because of it, and that our only workaround right now is to have NO backup schedules running at the same time.
velero with CSI backups is basically unusable in this state.. every time someone adds a backup schedule running at the same time as an existing one, we run into this. |
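A minimal sketch of that workaround, with illustrative names and times: stagger the cron expressions so the cluster-wide schedule never runs while the per-namespace CSI schedule is creating and cleaning up its VolumeSnapshots.

# Hypothetical workaround: schedules that never overlap in time.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: cluster-wide-backup          # illustrative name
  namespace: velero
spec:
  schedule: "0 1 * * *"              # 01:00 - whole cluster, no volume snapshots
  template:
    snapshotVolumes: false
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: myapp-pvc-backup             # illustrative name
  namespace: velero
spec:
  schedule: "0 3 * * *"              # 03:00 - well after the cluster-wide run
  template:
    includedNamespaces:
      - myapp
    snapshotVolumes: true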
Please create a design proposal so that the feasibility can be discussed with the maintainers. |
https://github.com/vmware-tanzu/velero/blob/main/design/_template.md should get you started. |
From what I remember, we should only be deleting VS for the current backup -- if that is correct, then there may be a bug here. We should only be deleting the specific VolumeSnapshots that were created by the current backup. This may be a consequence of the recent change to use async actions for CSI snapshots. If there are 2 queued backups that overlap, when the first backup moves to WaitingForPluginOperations, the second backup starts -- this backup may find the first backup's VS resources and back them up as well, and then the delete code at the end of backup processing may end up removing them. I think there are 2 things we may need to do here:
|
As far as I have seen, the issue is basically 2 backups at the same time: one backs up a namespace with CSI-based PVCs, the other backs up the whole cluster. The global backup runs, finds the VolumeSnapshot dynamically created by the namespaced CSI PVC backup, and adds it to the list of items to be backed up; the namespaced CSI-based backup then deletes its VolumeSnapshot, so the global backup can't find the VolumeSnapshot anymore and ends up PartiallyFailed. So basically what should be done is somehow preventing velero's dynamically created VolumeSnapshots from being included in other velero backups (velero deletes the VolumeSnapshot after it has done the backup, so why should velero back up those VolumeSnapshots if it deletes them afterwards), or preventing VolumeSnapshots from OTHER velero backups from being included in another backup schedule. We also don't see this every day; I would say out of 10 backups from the general backup schedule, 1 fails with this error, so it's sporadic.
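One possible stop-gap along those lines, as a sketch rather than a fix for the underlying race, and assuming the cluster-wide schedule does not need the snapshot objects themselves (names are illustrative): explicitly exclude the CSI snapshot resources from the cluster-wide template.

# Hypothetical stop-gap: keep the cluster-wide schedule from picking up
# VolumeSnapshots that other backups create and later delete.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: general-cluster-backup       # illustrative name
  namespace: velero
spec:
  schedule: "0 2 * * *"
  template:
    snapshotVolumes: false
    excludedResources:
      - volumesnapshots.snapshot.storage.k8s.io
      - volumesnapshotcontents.snapshot.storage.k8s.io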
I'm fine with the architecture; IMO velero does not need a redesign here right now, but this bug should get fixed, therefore this ticket should be sufficient. |
What steps did you take and what happened:
It's not that easy; this is a really complex problem in my opinion, and I'm not sure if I'm completely right or have all the steps to successfully reproduce it.
Let's assume we have two backup schedules: one which backs up all k8s resources for the whole cluster without PVCs,
and another schedule which backs up a specific namespace with its PVCs at the same time.
This possibly fails because the first schedule wants to back up the VolumeSnapshot resources and the second schedule deletes the VolumeSnapshots after the snapshot has completed successfully.
In a setup where CSI is used with a VolumeSnapshotClass, the general backup schedule above CAN fail, because the backup schedule running at the same time with the PVC backup enabled does (IMO) really weird things:
Velero deletes the VolumeSnapshot and just keeps the VolumeSnapshotContent; the argument is that this way a backup can't be deleted when the namespace is deleted.
I want to have those objects remain; they are the objects someone will use to create a new PVC from the backed-up data.
So it's disabling core functionality and a lot of the purpose of VolumeSnapshots.
Whether the backup should be kept or not when I delete the namespace should be determined by the deletion policy.
It CAN fail because the first schedule wants to back up the VolumeSnapshot resources and the second schedule deletes the VolumeSnapshots after the snapshot has completed successfully.
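For illustration only, with names, times and the namespace as assumptions rather than details from this cluster, a pair of Schedules matching the described overlap would look like this:

# Hypothetical reproduction: a cluster-wide schedule without PVC data and a
# per-namespace CSI schedule firing at the same time.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: whole-cluster-backup         # illustrative name
  namespace: velero
spec:
  schedule: "0 2 * * *"              # same time as the schedule below
  template:
    snapshotVolumes: false           # all k8s resources, no PVC snapshots
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: myapp-pvc-backup             # illustrative name
  namespace: velero
spec:
  schedule: "0 2 * * *"              # same time as the schedule above
  template:
    includedNamespaces:
      - myapp                        # illustrative namespace with CSI PVCs
    snapshotVolumes: true

The cluster-wide run can then pick up the VolumeSnapshot that the per-namespace run creates and later deletes, which is the race described above.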
What did you expect to happen:
The following information will help us better understand what's going on:
If you are using velero v1.7.0+:
Please use
velero debug --backup <backupname> --restore <restorename>
to generate the support bundle and attach it to this issue; for more options please refer to velero debug --help
I have created a support bundle, but I cannot upload it here; it contains sensitive information, like the BackupStorageLocations, so the submitted bundle could leak things like bucket names, GCP projects, etc.
Anything else you would like to add:
Environment:
Velero Version: v1.13.0
velero/velero-plugin-for-gcp:v1.9.0
velero/velero-plugin-for-csi:v0.7.0
Kubernetes Version: v1.27.10-gke.1055000
Vote on this issue!
This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.