Data mover: when a data move fails, the created disk (and snapshot) are not cleaned #8135
Which remaining resources can you see?
Please share the velero debug bundle by running:
We can see loads and loads of Azure disks. We have 163 actual, in-use disk PVCs in our cluster, and each new Velero backup creates an additional 163 PVCs to do the data upload. At the time I created this issue, we had 2577 "unwanted" PVCs in our cluster, which is ~15 days of Velero backups failing at our rate. We have deleted all of them since then. From a look at the Velero DataUpload spec, Velero does this when backing up:
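Broadly, my reading of the DataUpload spec is that Velero snapshots the source PVC and then provisions a temporary PVC/PV in the velero namespace from that snapshot to upload from; treat that as my assumption rather than the documented design. A quick way to enumerate those temporary volumes, assuming Velero runs in the `velero` namespace and `jq` is available:

```bash
# List PVs whose claim lives in the velero namespace, i.e. the data-mover staging volumes
kubectl get pv -o json \
  | jq -r '.items[] | select(.spec.claimRef.namespace == "velero")
           | "\(.metadata.name)\t\(.status.phase)"'
```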
I'm running:
Could you post one of the DataUploads' YAML content here instead? I suspect insufficient memory resources on the node-agent pod caused the DataUpload to cancel: the node-agent restarted due to OOM, and the DataUploads were marked as Cancelled on node-agent pod restart.
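To check that, it may help to look at whether the node-agent pods have restarted and why; a minimal sketch, assuming the default `velero` namespace and the `name=node-agent` pod label:

```bash
# Show restart counts for the node-agent pods
kubectl -n velero get pods -l name=node-agent

# Show the last termination state (an OOM kill shows up as Reason: OOMKilled)
kubectl -n velero describe pods -l name=node-agent | grep -A 3 "Last State"
```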
@Gui13 If any one of them is not empty, it means the PVC was indeed deleted but something had blocked it. Please also share one of the DUs' YAML as @blackpiglet mentioned.
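The exact fields are not quoted above; assuming they are the leftover PVC's `deletionTimestamp` and `finalizers`, they can be checked like this (the PVC name is a placeholder):

```bash
# Print the deletionTimestamp and finalizers of a leftover data-mover PVC
kubectl -n velero get pvc <dataupload-name> \
  -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'
```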
Hi @Lyndon-Li, here is a backup that failed 15 days ago.
The backup describe:
The data upload:
➜ ~ kubectl get datauploads velero-loki-20240811000043-lw8wz -n velero -o yaml
apiVersion: velero.io/v2alpha1
kind: DataUpload
metadata:
  creationTimestamp: "2024-08-11T00:00:53Z"
  generateName: velero-loki-20240811000043-
  generation: 3
  labels:
    velero.io/accepted-by: aks-default-97945801-vmss00001c
    velero.io/async-operation-id: du-67757235-06b4-424a-a64b-5fd59b93e421.e4334b14-893c-44a2dc9bb
    velero.io/backup-name: velero-loki-20240811000043
    velero.io/backup-uid: 67757235-06b4-424a-a64b-5fd59b93e421
    velero.io/pvc-uid: e4334b14-893c-44a5-9dfa-c41bc3a63d88
  name: velero-loki-20240811000043-lw8wz
  namespace: velero
  ownerReferences:
  - apiVersion: velero.io/v1
    controller: true
    kind: Backup
    name: velero-loki-20240811000043
    uid: 67757235-06b4-424a-a64b-5fd59b93e421
  resourceVersion: "326785784"
  uid: 0b76aef5-6016-42d9-9082-7db360c73937
spec:
  backupStorageLocation: default
  csiSnapshot:
    snapshotClass: csi-azure-disk
    storageClass: managed-premium-lrs
    volumeSnapshot: velero-storage-logging-loki-0-cmj2s
  operationTimeout: 10m0s
  snapshotType: CSI
  sourceNamespace: monitoring
  sourcePVC: storage-logging-loki-0
status:
  completionTimestamp: "2024-08-11T00:00:58Z"
  message: 'found a dataupload velero/velero-loki-20240811000043-lw8wz with expose
    error: Pod is unschedulable: 0/44 nodes are available: pod has unbound immediate
    PersistentVolumeClaims. preemption: 0/44 nodes are available: 44 Preemption is
    not helpful for scheduling... mark it as cancel'
  phase: Canceled
  progress: {}
  startTimestamp: "2024-08-11T00:00:58Z"

The remaining PV (not a PVC, just a PV). The PV description:
~ kubectl get pv pvc-116fa2fa-c49c-4975-a029-655ac58b404a -A -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: disk.csi.azure.com
    volume.kubernetes.io/provisioner-deletion-secret-name: ""
    volume.kubernetes.io/provisioner-deletion-secret-namespace: ""
  creationTimestamp: "2024-08-11T00:01:00Z"
  finalizers:
  - kubernetes.io/pv-protection
  name: pvc-116fa2fa-c49c-4975-a029-655ac58b404a
  resourceVersion: "326785886"
  uid: 68cb0368-8796-4644-9f28-88671800c7a9
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 256Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: velero-loki-20240811000043-lw8wz
    namespace: velero
    resourceVersion: "326785771"
    uid: 116fa2fa-c49c-4975-a029-655ac58b404a
  csi:
    driver: disk.csi.azure.com
    volumeAttributes:
      cachingmode: ReadOnly
      csi.storage.k8s.io/pv/name: pvc-116fa2fa-c49c-4975-a029-655ac58b404a
      csi.storage.k8s.io/pvc/name: velero-loki-20240811000043-lw8wz
      csi.storage.k8s.io/pvc/namespace: velero
      kind: Managed
      requestedsizegib: "256"
      storage.kubernetes.io/csiProvisionerIdentity: 1723070305729-221-disk.csi.azure.com
      storageaccounttype: Premium_LRS
    volumeHandle: /subscriptions/b7f4b112-855a-4cf5-9485-a4733fc136ea/resourceGroups/braincube-prod-aks/providers/Microsoft.Compute/disks/pvc-116fa2fa-c49c-4975-a029-655ac58b404a
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.disk.csi.azure.com/zone
          operator: In
          values:
          - ""
  persistentVolumeReclaimPolicy: Retain
  storageClassName: managed-premium-lrs
  volumeMode: Filesystem
status:
  phase: Released

As you can see, the PVC doesn't exist, but the DataUpload is still lingering around. You could probably add a phase to the DataUpload ("CancelCleanup") for when you want to cancel it, so that it tries to remove the lingering PV and then transitions to the "Canceled" phase.
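In the meantime, this is roughly how we clear out the Released PVs left behind by cancelled DataUploads. It is a sketch that assumes every Released PV claimed from the `velero` namespace is a data-mover leftover, so review the list before running the loop. Since the reclaim policy is Retain (see above), it is flipped to Delete first so removing the PV also removes the underlying Azure disk:

```bash
# Find Released PVs that were claimed from the velero namespace and delete them,
# letting the CSI driver delete the underlying Azure disk as well.
for pv in $(kubectl get pv -o json \
    | jq -r '.items[] | select(.status.phase == "Released"
             and .spec.claimRef.namespace == "velero") | .metadata.name'); do
  kubectl patch pv "$pv" -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'
  # The provisioner may already have removed the PV after the patch, hence --ignore-not-found
  kubectl delete pv "$pv" --ignore-not-found
done
```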
This is still not the expected behavior, because before deleting the PVC, the PV's reclaim policy will be set to Delete. So we still need the debug log to see what happened. If you are not able to share all the logs, you can just filter the Error and Warning logs.
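For example, something along these lines should be enough, assuming the default `name=node-agent` label and Velero's plain-text log format:

```bash
# Collect only error and warning lines from the node-agent pods.
# --max-log-requests is raised because there is one node-agent pod per node.
kubectl -n velero logs -l name=node-agent --tail=-1 --max-log-requests=50 \
  | grep -E 'level=(error|warning)'
```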
If I can catch another instance of this problem, I'll get you the logs. Right now, with the 1.14.1 RC, we don't have the issue anymore.
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.
This issue was closed because it has been stalled for 14 days with no activity.
What steps did you take and what happened:
We are using Velero 1.14.0 on Azure AKS with the data mover feature, and we are hitting bug #7898: all our backups are partially failed because the data move is canceled.
We are awaiting 1.14.1, but in the meantime we have an ongoing issue with a side effect of the bug: when the data mover job fails, it doesn't clean up the managed disks that were created for the data mover.
This is causing a runaway cost increase: we currently have 2000+ provisioned disks that are never cleaned up and keep incurring daily costs.
Typical output in the error section of the Velero backup looks like this:
Although the DataUpload is "Canceled", the remaining resources are not cleaned up.
Deleting the faulty backup will not release the created disks. We tried:
velero delete backup <FaultyBackupWith163LingeringDisk>
but this didn't work (although the backup itself was correctly deleted). We are removing these disks manually right now, but I think they should be cleaned up by the Velero data mover as a last-ditch measure, even if the data move failed.
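For the Azure side, we list the unattached managed disks in the cluster's node resource group before removing them; a rough sketch with placeholder names, so double-check the output before deleting anything:

```bash
# List unattached managed disks that look like PVC-provisioned disks
az disk list -g <node-resource-group> \
  --query "[?diskState=='Unattached' && starts_with(name, 'pvc-')].name" -o tsv

# After reviewing the list, delete a confirmed orphan:
# az disk delete -g <node-resource-group> -n <disk-name> --yes
```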
What did you expect to happen:
When the data mover fails, it should try to clean up the lingering resources it created (snapshots and disks).
Or at least, deleting the Failed (or PartiallyFailed) backup should clean up the resources.
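For the snapshot half of that cleanup, the leftover CSI snapshot objects can be spotted with the following (a sketch; names and namespaces vary per setup):

```bash
# Look for CSI snapshot objects left behind by cancelled data movements
kubectl get volumesnapshot --all-namespaces
kubectl get volumesnapshotcontent
```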
Anything else you would like to add:
We have tried the yet-to-be-released release-1.14-dev branch to see if the #7898 "DataUpload is canceled" issue was fixed, and it was. So that's a good point. I think you should release 1.14.1 quickly for people in the same situation as us.
Environment:
- Velero version (use `velero version`): 1.14.0
- Velero features (use `velero client config get features`): features:
- Kubernetes version (use `kubectl version`):
- OS (e.g. from `/etc/os-release`): --