Data mover: when a data move fails, the created disk (and snapshot) are not cleaned #8135

Closed
Gui13 opened this issue Aug 21, 2024 · 10 comments

Comments

Gui13 commented Aug 21, 2024

What steps did you take and what happened:

We are using velero 1.14.0 on Azure AKS with the data mover feature, and are suffering from bug #7898. All our backups end up PartiallyFailed because the data move is canceled.

We are awaiting 1.14.1, but in the meantime we have an ongoing issue with a side effect of the bug: when the data mover job fails, it doesn't clean up the managed disks that were created for the data mover.

This is causing runaway cost growth: we currently have 2000+ provisioned disks which are never cleaned up and accrue daily costs.

Typical output in the error section of the velero backup looks like this:

message: /DataUpload is canceled, plugin: velero.io/csi-pvc-backupper
message: /DataUpload is canceled, plugin: velero.io/csi-pvc-backupper
message: /DataUpload is canceled, plugin: velero.io/csi-pvc-backupper
message: /DataUpload is canceled, plugin: velero.io/csi-pvc-backupper
[ times 163 in our case ]

Although the DataUpload is "canceled", the remaining resources are not cleaned up.
Deleting the faulty backup does not release the created disks either: we tried velero delete backup <FaultyBackupWith163LingeringDisk>, but this did not remove the disks (although the backup itself was correctly deleted).

We are removing these disks manually right now, but I think they should be cleaned up by the Velero data mover as a last-ditch step, even if the data move failed.

What did you expect to happen:

When the data mover fails, it should try to clean up the lingering resources it created (snapshots & disks).

Or at least, deleting the Failed (or PartiallyFailed) backup should clean up the resources.

Anything else you would like to add:

We tried the yet-to-be-released release-1.14-dev branch to see if the #7898 "/DataUpload is canceled" issue was fixed, and it was. So that's a good point. I think you should release 1.14.1 quickly for people in the same situation as us.

Environment:

  • Velero version (use velero version): 1.14.0
  • Velero features (use velero client config get features): features:
  • Kubernetes version (use kubectl version):
Client Version: v1.29.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.27.16
  • Kubernetes installer & version: Azure AKS, provisioned by Terraform
  • Cloud provider or hardware configuration: Azure
  • OS (e.g. from /etc/os-release): --

Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@Lyndon-Li
Contributor

Although the dataupload is "canceled", the remaining resources are not cleaned.

Which remaining resources can you see?

@Lyndon-Li
Contributor

Please share the velero debug bundle by running velero debug

@Gui13
Author

Gui13 commented Aug 21, 2024

Although the dataupload is "canceled", the remaining resources are not cleaned.

Which remaining resources can you see?

We can see loads and loads of Azure disks.

We have 163 actual, in-use disk PVCs in our cluster, and each new Velero backup creates an additional 163 PVCs to do the data upload. At the time I created this issue, we had 2577 "unwanted" PVCs in our cluster, which is roughly 15 days of Velero backups failing at our rate. We have deleted all of them since then.

From a look at the Velero DataUpload spec, Velero does the following when backing up (a sketch of how we inspect these intermediate objects follows the list):

  1. it creates snapshots with the CSI driver for each volume of type disk,
  2. for each snapshot, it creates a disk derived from the snapshot,
  3. once the disk is created, it removes the snapshot // <--- this is working for us right now
  4. it mounts the disk on a node of the cluster // <--- this used to fail in our cluster until the "Fail early DM pod checks have false positives" fix (#7898); we switched our agents to the release-1.14-dev docker image, and this is where the disks were left behind, IMHO
  5. once mounted, it "snapshots" (in kopia terms) this disk to the object store (Blob Storage in Azure terms).
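
For anyone hitting the same leftovers, this is roughly how we enumerate the intermediate objects (just a sketch, assuming a standard install in the velero namespace; <backup-name> is a placeholder, and the label comes from the DataUpload objects Velero creates):

# DataUploads created for a given backup
kubectl -n velero get datauploads -l velero.io/backup-name=<backup-name>

# intermediate PVCs/PVs the data mover creates in the velero namespace
kubectl -n velero get pvc
kubectl get pv --no-headers | grep 'velero/velero-'

# CSI snapshots taken from the source namespaces
kubectl get volumesnapshots -A
kubectl get volumesnapshotcontents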

I'm running velero debug right now, but after a quick test on my personal cluster I see that it retrieves a lot of information. I'm not sure I'm allowed to share all of it publicly, since it is our client's cluster. Do you have a way to scrub the logs of PII, or maybe to transmit them privately? (Also: the bundle is 390 MiB.)

@blackpiglet
Contributor

The velero debug CLI collects the Velero-related k8s resources. There is no easy way to erase sensitive information from the bundle.

Could you post one of the DataUpload's YAML content here instead?

I suspect insufficient memory for the node-agent pod caused the DataUpload to be canceled: the node-agent restarted due to OOM, and the DataUploads were marked as Canceled on the node-agent pod restart.
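
A quick way to check that (just a sketch; the node-agent pod label may differ depending on how Velero was installed):

# restart counts and last termination reason of the node-agent pods
kubectl -n velero get pods -l name=node-agent
kubectl -n velero describe pods -l name=node-agent | grep -E 'Restart Count|Last State|Reason'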

@Lyndon-Li
Contributor

@Gui13
Please check this if you still have the leftover PVCs created by Velero:
Describe the PVCs, check the deletionTimestamp and finalizers, and share their values with us.

If either of them is non-empty, it means the PVC was indeed deleted but something blocked the deletion.
If they are both empty, it looks like a problem in Velero's DU cancel mechanism.
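
For example (a sketch; <leftover-pvc> is a placeholder for one of the leftover PVC names, assuming they live in the velero namespace):

kubectl -n velero get pvc <leftover-pvc> -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'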

Please also share the YAML of one of the DUs, as @blackpiglet mentioned.

@Gui13
Author

Gui13 commented Aug 26, 2024

Hi @Lyndon-Li

Here is a backup that failed 15 days ago:

The backup describe
~ velero backup describe velero-loki-20240811000043 --details
Name:         velero-loki-20240811000043
Namespace:    velero
Labels:       app.kubernetes.io/instance=velero
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=velero
              helm.sh/chart=velero-7.1.4
              velero.io/schedule-name=velero-loki
              velero.io/storage-location=default
Annotations:  meta.helm.sh/release-name=velero
              meta.helm.sh/release-namespace=velero
              velero.io/resource-timeout=10m0s
              velero.io/source-cluster-k8s-gitversion=v1.27.16
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=27

Phase:  PartiallyFailed (run velero backup logs velero-loki-20240811000043 for more information)


Errors:
  Velero:     <none>
  Cluster:    <none>
  Namespaces: <none>

Namespaces:
  Included:  monitoring
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  app=loki

Or label selector:  <none>

Storage Location:  default

Velero-Native Snapshot PVs:  auto
Snapshot Move Data:          true
Data Mover:                  velero

TTL:  17568h0m0s

CSISnapshotTimeout:    10m0s
ItemOperationTimeout:  4h0m0s

Hooks:  <none>

Backup Format Version:  1.1.0

Started:    2024-08-11 02:00:44 +0200 CEST
Completed:  2024-08-11 02:01:10 +0200 CEST

Expiration:  2026-08-13 02:00:43 +0200 CEST

Total items to be backed up:  21
Items backed up:              21

Backup Item Operations:
  Operation for persistentvolumeclaims monitoring/storage-logging-loki-0:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-67757235-06b4-424a-a64b-5fd59b93e421.e4334b14-893c-44a2dc9bb
    Items to Update:
                           datauploads.velero.io velero/velero-loki-20240811000043-lw8wz
    Phase:                 Failed
    Operation Error:       DataUpload is canceled
    Progress description:  Canceled
    Created:               2024-08-11 02:00:53 +0200 CEST
    Started:               2024-08-11 02:00:58 +0200 CEST
    Updated:               2024-08-11 02:00:58 +0200 CEST
Resource List:
  apps/v1/ControllerRevision:
    - monitoring/logging-loki-666f698c9c
    - monitoring/logging-loki-78bc8885cf
    - monitoring/logging-loki-85447f888
  apps/v1/StatefulSet:
    - monitoring/logging-loki
  discovery.k8s.io/v1/EndpointSlice:
    - monitoring/logging-loki-headless-rq5dn
    - monitoring/logging-loki-m6tlz
    - monitoring/logging-loki-memberlist-4kgrk
  rbac.authorization.k8s.io/v1/Role:
    - monitoring/logging-loki
  rbac.authorization.k8s.io/v1/RoleBinding:
    - monitoring/logging-loki
  v1/Endpoints:
    - monitoring/logging-loki
    - monitoring/logging-loki-headless
    - monitoring/logging-loki-memberlist
  v1/Namespace:
    - monitoring
  v1/PersistentVolume:
    - pvc-e4334b14-893c-44a5-9dfa-c41bc3a63d88
  v1/PersistentVolumeClaim:
    - monitoring/storage-logging-loki-0
  v1/Pod:
    - monitoring/logging-loki-0
  v1/Secret:
    - monitoring/logging-loki
  v1/Service:
    - monitoring/logging-loki
    - monitoring/logging-loki-headless
    - monitoring/logging-loki-memberlist
  v1/ServiceAccount:
    - monitoring/logging-loki

Backup Volumes:
  Velero-Native Snapshots: <none included>

  CSI Snapshots:
    monitoring/storage-logging-loki-0:
      Data Movement:
        Operation ID: du-67757235-06b4-424a-a64b-5fd59b93e421.e4334b14-893c-44a2dc9bb
        Data Mover: velero
        Uploader Type: kopia

  Pod Volume Backups: <none included>

HooksAttempted:  0
HooksFailed:     0

The data upload
➜  ~ kubectl get datauploads velero-loki-20240811000043-lw8wz -n velero -o yaml
apiVersion: velero.io/v2alpha1
kind: DataUpload
metadata:
  creationTimestamp: "2024-08-11T00:00:53Z"
  generateName: velero-loki-20240811000043-
  generation: 3
  labels:
    velero.io/accepted-by: aks-default-97945801-vmss00001c
    velero.io/async-operation-id: du-67757235-06b4-424a-a64b-5fd59b93e421.e4334b14-893c-44a2dc9bb
    velero.io/backup-name: velero-loki-20240811000043
    velero.io/backup-uid: 67757235-06b4-424a-a64b-5fd59b93e421
    velero.io/pvc-uid: e4334b14-893c-44a5-9dfa-c41bc3a63d88
  name: velero-loki-20240811000043-lw8wz
  namespace: velero
  ownerReferences:
  - apiVersion: velero.io/v1
    controller: true
    kind: Backup
    name: velero-loki-20240811000043
    uid: 67757235-06b4-424a-a64b-5fd59b93e421
  resourceVersion: "326785784"
  uid: 0b76aef5-6016-42d9-9082-7db360c73937
spec:
  backupStorageLocation: default
  csiSnapshot:
    snapshotClass: csi-azure-disk
    storageClass: managed-premium-lrs
    volumeSnapshot: velero-storage-logging-loki-0-cmj2s
  operationTimeout: 10m0s
  snapshotType: CSI
  sourceNamespace: monitoring
  sourcePVC: storage-logging-loki-0
status:
  completionTimestamp: "2024-08-11T00:00:58Z"
  message: 'found a dataupload velero/velero-loki-20240811000043-lw8wz with expose
    error: Pod is unschedulable: 0/44 nodes are available: pod has unbound immediate
    PersistentVolumeClaims. preemption: 0/44 nodes are available: 44 Preemption is
    not helpful for scheduling... mark it as cancel'
  phase: Canceled
  progress: {}
  startTimestamp: "2024-08-11T00:00:58Z"

The remaining PV (not a PVC, just a PV):

The PV description
~ kubectl get pv pvc-116fa2fa-c49c-4975-a029-655ac58b404a -A -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: disk.csi.azure.com
    volume.kubernetes.io/provisioner-deletion-secret-name: ""
    volume.kubernetes.io/provisioner-deletion-secret-namespace: ""
  creationTimestamp: "2024-08-11T00:01:00Z"
  finalizers:
  - kubernetes.io/pv-protection
  name: pvc-116fa2fa-c49c-4975-a029-655ac58b404a
  resourceVersion: "326785886"
  uid: 68cb0368-8796-4644-9f28-88671800c7a9
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 256Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: velero-loki-20240811000043-lw8wz
    namespace: velero
    resourceVersion: "326785771"
    uid: 116fa2fa-c49c-4975-a029-655ac58b404a
  csi:
    driver: disk.csi.azure.com
    volumeAttributes:
      cachingmode: ReadOnly
      csi.storage.k8s.io/pv/name: pvc-116fa2fa-c49c-4975-a029-655ac58b404a
      csi.storage.k8s.io/pvc/name: velero-loki-20240811000043-lw8wz
      csi.storage.k8s.io/pvc/namespace: velero
      kind: Managed
      requestedsizegib: "256"
      storage.kubernetes.io/csiProvisionerIdentity: 1723070305729-221-disk.csi.azure.com
      storageaccounttype: Premium_LRS
    volumeHandle: /subscriptions/b7f4b112-855a-4cf5-9485-a4733fc136ea/resourceGroups/braincube-prod-aks/providers/Microsoft.Compute/disks/pvc-116fa2fa-c49c-4975-a029-655ac58b404a
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.disk.csi.azure.com/zone
          operator: In
          values:
          - ""
  persistentVolumeReclaimPolicy: Retain
  storageClassName: managed-premium-lrs
  volumeMode: Filesystem
status:
  phase: Released

As you can see, the PVC doesn't exist anymore, but the DataUpload is still lingering around.
The PV has no deletionTimestamp, but it inherited our default Retain reclaim policy. However, this doesn't seem to be a problem for the other PVs when the DataUpload completes correctly (they are still removed).

You could probably add a phase to the DataUpload (e.g. "CancelCleanup") when you want to cancel it, so that it tries to remove the lingering PV before transitioning to the "Canceled" phase.
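
For reference, this is roughly how we clean the leftovers up by hand (a sketch, not an official procedure; it assumes the leftover PVs are Released, were claimed by PVCs in the velero namespace, and that flipping the reclaim policy to Delete makes the CSI driver remove the backing Azure disk; if a PV still lingers afterwards, we delete the PV and the Azure disk manually):

# list Released PVs whose claim was a PVC in the velero namespace
kubectl get pv --no-headers | awk '$5 == "Released" && $6 ~ /^velero\// {print $1}'

# flip their reclaim policy to Delete so the CSI driver reclaims the backing disk
for pv in $(kubectl get pv --no-headers | awk '$5 == "Released" && $6 ~ /^velero\// {print $1}'); do
  kubectl patch pv "$pv" -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'
done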

@Lyndon-Li
Contributor

This is still not expected behavior, because before deleting the PVC, the PV's reclaim policy is set to Delete.
So the case where the PVC is successfully deleted but the PV is kept with Retain as its persistentVolumeReclaimPolicy should not happen.

So we still need the debug log to see what happened. If you are not able to share all the logs, you can just filter the Error and Warning logs.
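
For example, something like this would be enough (a sketch; it assumes the default deployment/daemonset names and labels, and relies on Velero's level=error / level=warning log fields):

kubectl -n velero logs deploy/velero | grep -E 'level=(error|warning)' > velero-errors.log
for p in $(kubectl -n velero get pods -l name=node-agent -o name); do
  kubectl -n velero logs "$p"
done | grep -E 'level=(error|warning)' > node-agent-errors.log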

@Gui13
Author

Gui13 commented Aug 26, 2024

If I can catch another instance of this problem, I'll get you the logs. Right now, with the 1.14.1-rc, we don't have the issue anymore.


This issue is stale because it has been open 60 days with no activity. Remove the stale label or comment, or this will be closed in 14 days. If a Velero team member has requested logs or more information, please provide the output of the shared commands.


This issue was closed because it has been stalled for 14 days with no activity.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Nov 10, 2024