
Data mover failed - error to delete volume snapshot content - not found #7068

Closed
trunet opened this issue Nov 7, 2023 · 23 comments

@trunet

trunet commented Nov 7, 2023

What steps did you take and what happened:

I created a schedule with velero csi snapshot and data mover enabled. I'm using openebs/zfs-localpv CSI.

17 succeeded, 1 failed to data move:

  Operation for persistentvolumeclaims mariadb/storage-mariadb-0:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-1a7e0ab8-ec92-4460-b617-dab149484bf3.47e56040-6544-41b01f5b1
    Items to Update:
                           datauploads.velero.io velero/k8s-20231107030552-f7s9t
    Phase:                 Failed
    Operation Error:       error to expose snapshot: error to delete volume snapshot content: error to delete volume snapshot content: volumesnapshotcontents.snapshot.storage.k8s.io "snapcontent-d633b8c9-2d5d-420e-8b96-666150fc45c1" not found
    Progress description:  Failed
    Created:               2023-11-07 04:06:34 +0100 CET
    Started:               2023-11-07 04:06:34 +0100 CET
    Updated:               2023-11-07 04:06:50 +0100 CET

What did you expect to happen:

Backup should complete successfully.

The following information will help us better understand what's going on:

bundle-2023-11-07-10-51-47.tar.gz

Anything else you would like to add:

Environment:

  • Velero version: main
  • Velero features: NOT SET
  • Kubernetes version: Server Version: v1.28.3
  • Kubernetes installer & version: Talos
  • Cloud provider or hardware configuration: Talos bare-metal
  • OS (e.g. from /etc/os-release): Talos v1.5.4

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@Lyndon-Li
Contributor

@trunet Could you help confirm whether this is an intermittent problem or you keep seeing this error with zfs-localpv?

@trunet
Author

trunet commented Nov 7, 2023

@Lyndon-Li My local snapshots with zfs-localpv work just fine, every time I have taken them. zfs-localpv is my only CSI driver. What do you need me to check specifically?

@Lyndon-Li
Contributor

@trunet
Could you help check whether this is an intermittent problem or you see it for every backup in your environment?
One more thing: when the problem happens, could you also collect the CSI driver's logs? We need to see what happened to the missing VolumeSnapshotContent.
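
For example, something along these lines (the pod and container names here are assumptions based on a typical zfs-localpv install, so adjust to your deployment):

❯ kubectl -n openebs logs openebs-zfs-localpv-controller-0 -c csi-snapshotter --timestamps > csi-snapshotter.log
❯ kubectl -n openebs logs openebs-zfs-localpv-controller-0 -c snapshot-controller --timestamps > snapshot-controller.log
❯ kubectl -n openebs logs <zfs-localpv node pod> -c openebs-zfs-plugin --timestamps > openebs-zfs-plugin.log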

@Lyndon-Li Lyndon-Li self-assigned this Nov 8, 2023
@trunet
Author

trunet commented Nov 8, 2023

@Lyndon-Li I'll wait to see if the problem happens again tomorrow and report back with the logs. Thanks!

@trunet
Author

trunet commented Nov 8, 2023

@Lyndon-Li I've now had the error on 2 different volumes. Let's focus on one of them:

I redacted the output here, but the attached log is not redacted.

  Operation for persistentvolumeclaims [NAMESPACE]/[PV_NAME]-0:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-1b4258a5-fec8-42b7-b7e0-f994fc447604.f44628b1-4fc2-406a02d65
    Items to Update:
                           datauploads.velero.io velero/k8s-20231108030554-85dzc
    Phase:                 Failed
    Operation Error:       DataUpload is canceled
    Progress description:  Canceled
    Created:               2023-11-08 04:07:10 +0100 CET
    Started:               2023-11-08 04:07:10 +0100 CET
    Updated:               2023-11-08 04:08:30 +0100 CET

I fetched all ZFS CSI container logs:

openebs-zfs-localpv-controller-0_csi-provisioner.log
openebs-zfs-localpv-controller-0_csi-resizer.log
openebs-zfs-localpv-controller-0_csi-snapshotter.log
openebs-zfs-localpv-controller-0_openebs-zfs-plugin.log
openebs-zfs-localpv-controller-0_snapshot-controller.log
openebs-zfs-localpv-node-8qpjf_openebs-zfs-plugin.log

I grepped by PVC, PV and snapcontent and am uploading the result here. If you need the full logs for some reason, let me know and I can upload those as well.

grepped.log.gz

Further grepping by error:

E1108 03:07:05.534722       1 volume.go:309] Get snapshot failed  err: zfssnapshots.zfs.openebs.io "snapshot-51eddca1-98c1-4696-9d87-07b9d2f4ab03" not found
E1108 03:08:32.197414       1 zfs_util.go:732] zfs: could not destroy snapshot pvc-f44628b1-4fc2-4069-83c5-39d5b72231a0@snapshot-51eddca1-98c1-4696-9d87-07b9d2f4ab03 cmd [destroy zfspv-pool/pvc-f44628b1-4fc2-4069-83c5-39d5b72231a0@snapshot-51eddca1-98c1-4696-9d87-07b9d2f4ab03] error: cannot destroy 'zfspv-pool/pvc-f44628b1-4fc2-4069-83c5-39d5b72231a0@snapshot-51eddca1-98c1-4696-9d87-07b9d2f4ab03': snapshot has dependent clones
E1108 03:08:32.197484       1 snapshot.go:248] error syncing 'openebs/snapshot-51eddca1-98c1-4696-9d87-07b9d2f4ab03': exit status 1, requeuing
E1108 03:08:32.936279       1 zfs_util.go:732] zfs: could not destroy snapshot pvc-f44628b1-4fc2-4069-83c5-39d5b72231a0@snapshot-51eddca1-98c1-4696-9d87-07b9d2f4ab03 cmd [destroy zfspv-pool/pvc-f44628b1-4fc2-4069-83c5-39d5b72231a0@snapshot-51eddca1-98c1-4696-9d87-07b9d2f4ab03] error: cannot destroy 'zfspv-pool/pvc-f44628b1-4fc2-4069-83c5-39d5b72231a0@snapshot-51eddca1-98c1-4696-9d87-07b9d2f4ab03': snapshot has dependent clones
E1108 03:08:32.936659       1 snapshot.go:248] error syncing 'openebs/snapshot-51eddca1-98c1-4696-9d87-07b9d2f4ab03': exit status 1, requeuing
E1108 03:08:34.633941       1 snapshot.go:50] zfs snapshot 'openebs/snapshot-51eddca1-98c1-4696-9d87-07b9d2f4ab03' has been deleted

Yesterday I checked and none of my volumes had any snapshots: no zfssnapshots, no volumesnapshotcontents, and zfs list -t snapshot was also clean. Therefore any snapshot that was created/deleted/managed was triggered by the velero backup/data upload.
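
For reference, the check was roughly along these lines (zfssnapshots is the zfs-localpv CRD living in the openebs namespace; the zfs command runs on each node that hosts the pool):

❯ kubectl get volumesnapshots -A
❯ kubectl get volumesnapshotcontents
❯ kubectl get zfssnapshots -n openebs
❯ zfs list -t snapshot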

@Lyndon-Li
Contributor

@trunet Could you attach the new Velero log that matches the CSI logs in grepped.log.gz?

@trunet
Author

trunet commented Nov 8, 2023

@Lyndon-Li here
velero-backup.log.gz

@Lyndon-Li
Contributor

Lyndon-Li commented Nov 9, 2023

@trunet Please help share the entire Velero log bundle; we need to check the node-agent logs further. Meanwhile, if you have the full CSI driver logs, please share those as well.

@trunet
Author

trunet commented Nov 9, 2023

@Lyndon-Li these are today's logs. 4 data uploads failed today, some with DataUpload is canceled and some with error to expose snapshot: error to delete volume snapshot content: error to delete volume snapshot content: volumesnapshotcontents.snapshot.storage.k8s.io "snapcontent-873f0016-9846-4d96-b2d1-16d5941bed29" not found.

bundle-2023-11-09-10-54-54.tar.gz
openebs_logs-2023-11-09-10-28-26.tar.gz

@Lyndon-Li
Contributor

Lyndon-Li commented Nov 9, 2023

@trunet
For the cancelled DataUploads, the node-agent logged found a dataupload with status "InProgress" during the node-agent starting, mark it as cancel, which suggests the node-agent pods have restarted.
So a quick question: have you restarted/updated any of the Velero pods manually?
If not, could you run the two commands below and share the output:
kubectl get po -n velero
kubectl logs --previous <node-agent pod> (against all the node-agent pods)
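
For the second command, a small loop over every node-agent pod works as a sketch (it assumes the name=node-agent label that the DaemonSet sets on its pods):

❯ for p in $(kubectl get pods -n velero -l name=node-agent -o name); do
    echo "=== $p"                              # print the pod name as a header
    kubectl logs -n velero --previous "$p"     # logs of the previously terminated container
  done
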

@trunet
Author

trunet commented Nov 9, 2023

Nice catch... OOMKilled. kubectl logs --previous just shows the data upload happening and nothing else.

I'll remove the resources I had defined in my Helm values so it uses the chart defaults (double the current limits). I had reduced them a while ago to fit my cluster (and had forgotten about that). I'll come back to you tomorrow once the next backup finishes. I wonder if this will also fix the volumesnapshotcontents.snapshot.storage.k8s.io not found error.

❯ kubectl logs --previous node-agent-9hctq | tail -n2
time="2023-11-09T03:08:17Z" level=info msg="Reconcile k8s-20231109030556-88rhs" controller=dataupload dataupload=velero/k8s-20231109030556-88rhs logSource="pkg/controller/data_upload_controller.go:113"
time="2023-11-09T03:08:17Z" level=info msg="Data upload is in progress" controller=dataupload dataupload=velero/k8s-20231109030556-88rhs logSource="pkg/controller/data_upload_controller.go:293"

❯ kubectl get pods
NAME                      READY   STATUS    RESTARTS        AGE
node-agent-2wl6l          1/1     Running   0               2d20h
node-agent-9hctq          1/1     Running   1 (9h ago)      3d12h
node-agent-t5bqf          1/1     Running   1 (33h ago)     3d12h
velero-649dcfb86f-hhwdq   1/1     Running   1 (7h46m ago)   2d20h

❯ kubectl describe pod node-agent-9hctq
Name:             node-agent-9hctq
Namespace:        velero
Priority:         0
Service Account:  velero-server
Node:             talos01/192.168.178.91
Start Time:       Mon, 06 Nov 2023 00:26:39 +0100
Labels:           app.kubernetes.io/instance=velero
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=velero
                  controller-revision-hash=558478db98
                  helm.sh/chart=velero-5.1.2
                  name=node-agent
                  pod-template-generation=3
Annotations:      cni.projectcalico.org/containerID: c491f8f6ba6b3ff98a5a82a9b9481b779143d4c69bc6973e755f2917c9a75cc3
                  cni.projectcalico.org/podIP: 10.244.251.26/32
                  cni.projectcalico.org/podIPs: 10.244.251.26/32,[IPV6 REDACTED]
                  k8s.v1.cni.cncf.io/network-status:
                    [{
                        "name": "k8s-pod-network",
                        "ips": [
                            "10.244.251.26",
                            "2a02:58:148:8244:120e:9743:bfb:297e"
                        ],
                        "default": true,
                        "dns": {}
                    }]
                  prometheus.io/path: /metrics
                  prometheus.io/port: 8085
                  prometheus.io/scrape: true
Status:           Running
IP:               10.244.251.26
IPs:
  IP:           10.244.251.26
  IP:           [IPv6 REDACTED]
Controlled By:  DaemonSet/node-agent
Containers:
  node-agent:
    Container ID:  containerd://47aa9c915a65bea5c1c5d27f0abb42e6d8b5aa04ea986057cec5116fb9968cfa
    Image:         velero/velero:main
    Image ID:      docker.io/velero/velero@sha256:8a0793cbc22db3b19fb419953b97e56f7edc46e80b206e64a493bfc80c1de3db
    Port:          8085/TCP
    Host Port:     0/TCP
    Command:
      /velero
    Args:
      node-agent
      server
      --features=EnableCSI
      --log-level=info
    State:          Running
      Started:      Thu, 09 Nov 2023 04:08:20 +0100
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 06 Nov 2023 00:26:41 +0100
      Finished:     Thu, 09 Nov 2023 04:08:19 +0100
    Ready:          True
    Restart Count:  1
    Limits:
      cpu:     500m
      memory:  512Mi
    Requests:
      cpu:     250m
      memory:  256Mi
    Environment:
      VELERO_NAMESPACE:                velero (v1:metadata.namespace)
      NODE_NAME:                        (v1:spec.nodeName)
      VELERO_SCRATCH_DIR:              /scratch
      AWS_SHARED_CREDENTIALS_FILE:     /credentials/cloud
      GOOGLE_APPLICATION_CREDENTIALS:  /credentials/cloud
      AZURE_CREDENTIALS_FILE:          /credentials/cloud
      ALIBABA_CLOUD_CREDENTIALS_FILE:  /credentials/cloud
      OPENEBS_NAMESPACE:               openebs
    Mounts:
      /credentials from cloud-credentials (rw)
      /host_pods from host-pods (rw)
      /scratch from scratch (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gtmmf (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       True
  ContainersReady             True
  PodScheduled                True
Volumes:
  cloud-credentials:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  velero-aws-credentials
    Optional:    false
  host-pods:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/pods
    HostPathType:
  scratch:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  kube-api-access-gtmmf:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:                      <none>

@Lyndon-Li
Contributor

@trunet
Please help test it after increasing the memory limit. It is very likely the snapshot not found problem was also caused by the node-agent pod restart: after a node-agent pod restarts, some in-memory information is lost, so several code paths try to cancel/fail the DataUpload. However, not all the node-agent pods have restarted, and a surviving one may still be processing the DataUpload; as a result, there is a race.

So please help confirm this.
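
One quick way to check that (a sketch that assumes the name=node-agent label on the DaemonSet pods):

❯ kubectl get pods -n velero -l name=node-agent \
    -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount,LAST_REASON:.status.containerStatuses[0].lastState.terminated.reason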

@trunet
Author

trunet commented Nov 10, 2023

@Lyndon-Li I really appreciate the time and effort you're putting into this.

Today's result: I got no restarts after fixing the resources:

❯ kubectl get pods
NAME                      READY   STATUS    RESTARTS   AGE
node-agent-78kg9          1/1     Running   0          21h
node-agent-j7ntj          1/1     Running   0          21h
node-agent-nhvfv          1/1     Running   0          21h
velero-67974cfcc4-l9pb8   1/1     Running   0          21h

❯ kubectl get pods -n openebs
NAME                                           READY   STATUS    RESTARTS   AGE
openebs-localpv-provisioner-759dd54cf4-7mxrw   1/1     Running   0          2d21h
openebs-zfs-localpv-controller-0               5/5     Running   0          2d21h
openebs-zfs-localpv-node-7mpd7                 2/2     Running   0          2d21h
openebs-zfs-localpv-node-8qpjf                 2/2     Running   0          2d21h
openebs-zfs-localpv-node-d47wh                 2/2     Running   0          2d21h

One failed:

  Operation for persistentvolumeclaims heimdall/heimdall-data1-heimdall-0:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-d87ffaac-fed8-417a-95eb-5a7b0a5efc5a.7e3d211c-2721-4a7d422d1
    Items to Update:
                           datauploads.velero.io velero/k8s-20231110030553-pt87l
    Phase:                 Failed
    Operation Error:       error to expose snapshot: error to delete volume snapshot content: error to delete volume snapshot content: volumesnapshotcontents.snapshot.storage.k8s.io "snapcontent-9a328e95-bd95-4a25-9bf9-4deb607668e6" not found
    Progress description:  Failed
    Created:               2023-11-10 04:06:23 +0100 CET
    Started:               2023-11-10 04:06:23 +0100 CET
    Updated:               2023-11-10 04:06:26 +0100 CET

Logs:
bundle-2023-11-10-10-56-39.tar.gz
openebs_zfs_localpv.tar.gz

Yesterday I installed a logging stack with fluent-bit sending to an Elasticsearch cluster, and I generated this CSV report from Kibana, filtered by the PVC name, PV name, or snapcontent ID, to try to help you.

kibana_report.csv.gz

@boedy

boedy commented Nov 10, 2023

Seems like I'm running into the same issue. I'm on Velero 5.1.3. All my node agents are running fine (no restarts).

Backup Item Operations:
  Operation for persistentvolumeclaims main/storage-mariadb-1:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-48f9d651-4f46-45f9-a252-89c8b00bee07.66019010-54d5-41a5e4666
    Items to Update:
                           datauploads.velero.io velero/10min-20231110154025-qdjhk
    Phase:                 Completed
    Progress:              544381632 of 544381632 complete (Bytes)
    Progress description:  Completed
    Created:               2023-11-10 16:40:35 +0100 CET
    Started:               2023-11-10 16:40:36 +0100 CET
    Updated:               2023-11-10 16:40:49 +0100 CET
  Operation for persistentvolumeclaims main/storage-mariadb-0:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-48f9d651-4f46-45f9-a252-89c8b00bee07.18caf264-1fe4-475b820e8
    Items to Update:
                           datauploads.velero.io velero/10min-20231110154025-sj9sp
    Phase:                 Completed
    Progress:              544381173 of 544381173 complete (Bytes)
    Progress description:  Completed
    Created:               2023-11-10 16:40:41 +0100 CET
    Started:               2023-11-10 16:40:41 +0100 CET
    Updated:               2023-11-10 16:40:59 +0100 CET
  Operation for persistentvolumeclaims main/storage-mariadb-2:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-48f9d651-4f46-45f9-a252-89c8b00bee07.4991cf58-17c7-4919909eb
    Items to Update:
                           datauploads.velero.io velero/10min-20231110154025-v5zwh
    Phase:                 Failed
    Operation Error:       error to expose snapshot: error to delete volume snapshot content: error to delete volume snapshot content: volumesnapshotcontents.snapshot.storage.k8s.io "snapcontent-bc37a272-3dbc-4785-a672-1012db16932c" not found
    Progress description:  Failed
    Created:               2023-11-10 16:40:46 +0100 CET
    Started:               2023-11-10 16:40:46 +0100 CET
    Updated:               2023-11-10 16:40:49 +0100 CET
  Operation for persistentvolumeclaims main/beanstalkd-beanstalkd-0:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-48f9d651-4f46-45f9-a252-89c8b00bee07.8bf958ac-7945-4e7789610
    Items to Update:
                           datauploads.velero.io velero/10min-20231110154025-cckpz
    Phase:                 Completed
    Progress:              10485760 of 10485760 complete (Bytes)
    Progress description:  Completed
    Created:               2023-11-10 16:40:57 +0100 CET
    Started:               2023-11-10 16:40:57 +0100 CET
    Updated:               2023-11-10 16:41:11 +0100 CET

We are using the Linstor CSI. The issue seems intermittent, but backups fail for the most part:

$ velero backup get
NAME                    STATUS            ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
10min-20231110161025    PartiallyFailed   2        0          2023-11-10 17:10:25 +0100 CET   16m       default            <none>
10min-20231110160025    PartiallyFailed   1        0          2023-11-10 17:00:25 +0100 CET   6m        default            <none>
10min-20231110155025    Completed         0        0          2023-11-10 16:50:25 +0100 CET   3m ago    default            <none>
10min-20231110154025    PartiallyFailed   1        0          2023-11-10 16:40:25 +0100 CET   13m ago   default            <none>
10min-20231110153025    Completed         0        0          2023-11-10 16:30:25 +0100 CET   23m ago   default            <none>
hourly-20231110160025   PartiallyFailed   3        0          2023-11-10 17:00:57 +0100 CET   1d        default            <none>

Snapshots are successfully taken:

$ linstor snapshot list
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ResourceName                             | SnapshotName                                  | NodeNames              | Volumes  | CreatedOn           | State      |
|=================================================================================================================================================================|
| pvc-77f31934-95a6-4b41-a360-4a2b8ae9733f | snapshot-bc37a272-3dbc-4785-a672-1012db16932c | h-fsn-ded2             | 0: 5 GiB | 2023-11-10 15:40:42 | Successful |
| pvc-99c35065-89db-4bba-a407-ed0de2becb6e | snapshot-cb0c3f24-ec51-4b47-820e-1b736667c71d | h-fsn-ded4             | 0: 5 GiB | 2023-11-10 16:02:00 | Successful |
| pvc-a393d2f1-54dc-4deb-9721-b3e5ab60d429 | snapshot-3201e114-3617-4bc9-962a-3123275ed257 | h-fsn-ded4, h-fsn-ded5 | 0: 3 GiB | 2023-11-10 16:01:20 | Successful |
| pvc-cfee054e-53db-4b0e-b826-cd332a87b81f | snapshot-37e86026-e638-4332-bba1-bb4072f0a269 | h-fsn-ded5             | 0: 5 GiB | 2023-11-10 16:00:31 | Successful |
| pvc-e53269b5-4234-42cd-a1da-7d9bf84ef970 | snapshot-f666b96a-3007-4115-959c-9074d6ab3de4 | h-fsn-ded4, h-fsn-ded5 | 0: 1 GiB | 2023-11-10 16:01:36 | Successful |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+

@boedy

boedy commented Nov 10, 2023

In my case I get the impression the wrong node-agent is processing the snapshot. These logs were taken from the velero node-agent running on node h-fsn-ded4, whilst the actual snapshot data is stored on h-fsn-ded2, as can be seen in the previous table.

Relevant logs from the node-agent running on h-fsn-ded4:

time="2023-11-10T15:40:46Z" level=info msg="Exposing CSI snapshot" logSource="pkg/exposer/csi_snapshot.go:95" owner=10min-20231110154025-v5zwh
time="2023-11-10T15:40:46Z" level=info msg="Volumesnapshot is ready" logSource="pkg/exposer/csi_snapshot.go:102" owner=10min-20231110154025-v5zwh
time="2023-11-10T15:40:46Z" level=info msg="Got VSC from VS in namespace main" logSource="pkg/exposer/csi_snapshot.go:109" owner=10min-20231110154025-v5zwh vs name=velero-storage-mariadb-2-qzp89 vsc name=snapcontent-bc37a272-3dbc-4785-a672-1012db16932c
time="2023-11-10T15:40:46Z" level=info msg="Finished to retain VSC" logSource="pkg/exposer/csi_snapshot.go:116" owner=10min-20231110154025-v5zwh retained=true vsc name=snapcontent-bc37a272-3dbc-4785-a672-1012db16932c
time="2023-11-10T15:40:48Z" level=info msg="VS is deleted in namespace main" logSource="pkg/exposer/csi_snapshot.go:129" owner=10min-20231110154025-v5zwh vs name=velero-storage-mariadb-2-qzp89
1.6996308490858662e+09	ERROR	Reconciler error	{"controller": "dataupload", "controllerGroup": "velero.io", "controllerKind": "DataUpload", "dataUpload": {"name":"10min-20231110154025-v5zwh","namespace":"velero"}, "namespace": "velero", "name": "10min-20231110154025-v5zwh", "reconcileID": "5531df84-bee8-4bab-b53e-0305e459945d", "error": "error to delete volume snapshot content: error to delete volume snapshot content: volumesnapshotcontents.snapshot.storage.k8s.io \"snapcontent-bc37a272-3dbc-4785-a672-1012db16932c\" not found", "errorVerbose": "volumesnapshotcontents.snapshot.storage.k8s.io \"snapcontent-bc37a272-3dbc-4785-a672-1012db16932c\" not found\nerror to delete volume snapshot content\ngithub.com/vmware-tanzu/velero/pkg/util/csi.EnsureDeleteVSC\n\t/go/src/github.com/vmware-tanzu/velero/pkg/util/csi/volume_snapshot.go:177\ngithub.com/vmware-tanzu/velero/pkg/exposer.(*csiSnapshotExposer).Expose\n\t/go/src/github.com/vmware-tanzu/velero/pkg/exposer/csi_snapshot.go:131\ngithub.com/vmware-tanzu/velero/pkg/controller.(*DataUploadReconciler).Reconcile\n\t/go/src/github.com/vmware-tanzu/velero/pkg/controller/data_upload_controller.go:188\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598\nerror to delete volume snapshot content\ngithub.com/vmware-tanzu/velero/pkg/exposer.(*csiSnapshotExposer).Expose\n\t/go/src/github.com/vmware-tanzu/velero/pkg/exposer/csi_snapshot.go:133\ngithub.com/vmware-tanzu/velero/pkg/controller.(*DataUploadReconciler).Reconcile\n\t/go/src/github.com/vmware-tanzu/velero/pkg/controller/data_upload_controller.go:188\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
time="2023-11-10T15:40:49Z" level=info msg="Reconcile 10min-20231110154025-qdjhk" controller=dataupload dataupload=velero/10min-20231110154025-qdjhk logSource="pkg/controller/data_upload_controller.go:113"
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234

The Pod mounting the PV is also running on h-fsn-ded2.

@Lyndon-Li
Contributor

Lyndon-Li commented Nov 14, 2023

@trunet @boedy
I have created a trial fix; please help verify it through the latest main image velero/velero:main (if you have pulled the main image previously, make sure it is re-pulled, i.e. by setting the Velero server and node-agent pods' imagePullPolicy to Always).
I cannot reproduce the issue locally, so I will wait for your verification results.
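
If the install is Helm-based, a minimal sketch of one way to force the re-pull (the image.pullPolicy key and the resource names here are assumptions about your setup):

❯ helm upgrade velero vmware-tanzu/velero -n velero --reuse-values --set image.pullPolicy=Always
❯ kubectl -n velero rollout restart deployment/velero daemonset/node-agent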

Hopefully we can resolve and verify it this week, so the fix can catch the 1.12.2 release. Thanks!

@boedy

boedy commented Nov 14, 2023

I just tried out your trial fix. It's complaining "no new finalizers can be added if the object is being deleted":

Backup Item Operations:
  Operation for persistentvolumeclaims main/storage-mariadb-1:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-46a5b339-7c7e-4cc2-b7e5-cb800d91f9c2.66019010-54d5-41abdb916
    Items to Update:
                           datauploads.velero.io velero/10min-20231114114009-zdvxt
    Phase:                 Completed
    Progress:              544381632 of 544381632 complete (Bytes)
    Progress description:  Completed
    Created:               2023-11-14 12:40:20 +0100 CET
    Started:               2023-11-14 12:40:20 +0100 CET
    Updated:               2023-11-14 12:40:43 +0100 CET
  Operation for persistentvolumeclaims main/storage-mariadb-0:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-46a5b339-7c7e-4cc2-b7e5-cb800d91f9c2.18caf264-1fe4-4757ac248
    Items to Update:
                           datauploads.velero.io velero/10min-20231114114009-248sv
    Phase:                 Completed
    Progress:              544381173 of 544381173 complete (Bytes)
    Progress description:  Completed
    Created:               2023-11-14 12:40:25 +0100 CET
    Started:               2023-11-14 12:40:25 +0100 CET
    Updated:               2023-11-14 12:40:53 +0100 CET
  Operation for persistentvolumeclaims main/storage-mariadb-2:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-46a5b339-7c7e-4cc2-b7e5-cb800d91f9c2.4991cf58-17c7-491225a20
    Items to Update:
                           datauploads.velero.io velero/10min-20231114114009-n95fr
    Phase:                 Completed
    Progress:              544660117 of 544660117 complete (Bytes)
    Progress description:  Completed
    Created:               2023-11-14 12:40:30 +0100 CET
    Started:               2023-11-14 12:40:30 +0100 CET
    Updated:               2023-11-14 12:40:53 +0100 CET
  Operation for persistentvolumeclaims main/beanstalkd-beanstalkd-0:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-46a5b339-7c7e-4cc2-b7e5-cb800d91f9c2.8bf958ac-7945-4e78e4147
    Items to Update:
                           datauploads.velero.io velero/10min-20231114114009-xmh5c
    Phase:                 Failed
    Operation Error:       error to expose snapshot: error to delete volume snapshot content: error to remove protect finalizer from vsc snapcontent-85024560-ca99-4f46-8378-5439bdb11a15: error patching VSC: VolumeSnapshotContent.snapshot.storage.k8s.io "snapcontent-85024560-ca99-4f46-8378-5439bdb11a15" is invalid: metadata.finalizers: Forbidden: no new finalizers can be added if the object is being deleted, found new finalizers []string{"snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection"}
    Progress description:  Failed
    Created:               2023-11-14 12:40:40 +0100 CET
    Started:               2023-11-14 12:40:40 +0100 CET
    Updated:               2023-11-14 12:40:43 +0100 CET

@trunet
Author

trunet commented Nov 14, 2023

I just updated mine. Will have results tomorrow.

@Lyndon-Li
Contributor

Lyndon-Li commented Nov 15, 2023

@boedy @trunet
I updated the code again, so please use the latest main image velero/velero:main to test it one more time (So do remember to set the Velero server & node-agent pods' pullPolicy to Always).

The below error indicates that we are heading in the right direction to fix the problem; the previous code was just not robust enough to resolve the race:
error to delete volume snapshot content: error to remove protect finalizer from vsc snapcontent-85024560-ca99-4f46-8378-5439bdb11a15: error patching VSC: VolumeSnapshotContent.snapshot.storage.k8s.io "snapcontent-85024560-ca99-4f46-8378-5439bdb11a15" is invalid: metadata.finalizers: Forbidden: no new finalizers can be added if the object is being deleted, found new finalizers []string{"snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection"}

Hopefully, with the new code update, it will succeed.

(The latest main image should be available in 30 min from now)

@boedy

boedy commented Nov 15, 2023

Great work, @trunet!

I believe the latest image fixed the issue. I've just completed 6 consecutive backups without any errors! 🥳

@trunet
Author

trunet commented Nov 15, 2023

I had one backup complete and one fail yesterday (before this last commit fix). It failed because the default helm chart memory resources were not enough and the velero pod got OOMKilled. I increased the memory limit to 768Mi and the request to 256Mi (see the sketch after the pod description below).

I updated the container once again; let's wait for tomorrow's run to confirm.

...
Containers:
  velero:
    Container ID:  containerd://6d72a8d239d77ae3c7634a4508d1ec398f8a57c63aa5142195aa1422b27641b3
    Image:         velero/velero:main
    Image ID:      docker.io/velero/velero@sha256:a1151590d9d904555c9fa2a131f1d37205a924004dbcd0c3d39bf27bbed9696a
    Port:          8085/TCP
    Host Port:     0/TCP
    Command:
      /velero
    Args:
      server
      --uploader-type=kopia
      --log-level=info
      --features=EnableCSI
    State:          Running
      Started:      Wed, 15 Nov 2023 04:58:47 +0100
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Tue, 14 Nov 2023 12:58:19 +0100
      Finished:     Wed, 15 Nov 2023 04:58:41 +0100
    Ready:          True
    Restart Count:  1
    Limits:
      cpu:     1
      memory:  512Mi
    Requests:
      cpu:      500m
      memory:   128Mi
    Liveness:   http-get http://:http-monitoring/metrics delay=10s timeout=5s period=30s #success=1 #failure=5
    Readiness:  http-get http://:http-monitoring/metrics delay=10s timeout=5s period=30s #success=1 #failure=5
    Environment:
      VELERO_SCRATCH_DIR:              /scratch
      VELERO_NAMESPACE:                velero (v1:metadata.namespace)
      LD_LIBRARY_PATH:                 /plugins
      AWS_SHARED_CREDENTIALS_FILE:     /credentials/cloud
      GOOGLE_APPLICATION_CREDENTIALS:  /credentials/cloud
      AZURE_CREDENTIALS_FILE:          /credentials/cloud
      ALIBABA_CLOUD_CREDENTIALS_FILE:  /credentials/cloud
      OPENEBS_NAMESPACE:               openebs
    Mounts:
      /credentials from cloud-credentials (rw)
      /plugins from plugins (rw)
      /scratch from scratch (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5vvp7 (ro)
...
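
For reference, a minimal sketch of how that memory bump could be applied through the Helm chart (assuming the vmware-tanzu/velero chart, where the top-level resources key controls the server Deployment; adjust if you keep these settings in a values file):

❯ helm upgrade velero vmware-tanzu/velero -n velero --reuse-values \
    --set resources.requests.memory=256Mi \
    --set resources.limits.memory=768Mi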

@Lyndon-Li Lyndon-Li added this to the v1.13 milestone Nov 16, 2023
@Lyndon-Li
Contributor

@trunet Looks like the issue has been fixed, but I will wait for your verification. Hopefully we can get the result this week; otherwise, we will have to deliver the fix after 1.12.2.

@trunet
Author

trunet commented Nov 16, 2023

@Lyndon-Li I haven't had any more backup errors since this change. I'm closing this ticket; feel free to release it.
