
Data mover failed - error to delete volume snapshot content - not found #7068

Closed
trunet opened this issue Nov 7, 2023 · 23 comments

@trunet

trunet commented Nov 7, 2023

What steps did you take and what happened:

I created a schedule with velero csi snapshot and data mover enabled. I'm using openebs/zfs-localpv CSI.

17 succeeded, 1 failed to data move:

  Operation for persistentvolumeclaims mariadb/storage-mariadb-0:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-1a7e0ab8-ec92-4460-b617-dab149484bf3.47e56040-6544-41b01f5b1
    Items to Update:
                           datauploads.velero.io velero/k8s-20231107030552-f7s9t
    Phase:                 Failed
    Operation Error:       error to expose snapshot: error to delete volume snapshot content: error to delete volume snapshot content: volumesnapshotcontents.snapshot.storage.k8s.io "snapcontent-d633b8c9-2d5d-420e-8b96-666150fc45c1" not found
    Progress description:  Failed
    Created:               2023-11-07 04:06:34 +0100 CET
    Started:               2023-11-07 04:06:34 +0100 CET
    Updated:               2023-11-07 04:06:50 +0100 CET

What did you expect to happen:

Backup should complete successfully.

The following information will help us better understand what's going on:

bundle-2023-11-07-10-51-47.tar.gz

Anything else you would like to add:

Environment:

  • Velero version: main
  • Velero features: NOT SET
  • Kubernetes version: Server Version: v1.28.3
  • Kubernetes installer & version: Talos
  • Cloud provider or hardware configuration: Talos bare-metal
  • OS (e.g. from /etc/os-release): Talos v1.5.4

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@Lyndon-Li
Contributor

@trunet Could you help confirm whether this is an intermittent problem or you keep seeing this error with zfs-localpv?

@trunet
Author

trunet commented Nov 7, 2023

@Lyndon-Li My local snapshots with zfs-localpv work just fine, every time I have taken them. zfs-localpv is my only CSI driver. What do you need me to check specifically?

@Lyndon-Li
Contributor

@trunet
Could you help check whether this is an intermittent problem or you see it for every backup in your environment?
One more thing: when the problem happens, could you also collect the CSI driver's logs? We need to see what happened to the missing VolumeSnapshotContent.
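
For example, something along these lines (the pod and container names here are assumptions based on a typical zfs-localpv install, so adjust to your deployment):

❯ kubectl -n openebs logs openebs-zfs-localpv-controller-0 -c csi-snapshotter --timestamps > csi-snapshotter.log
❯ kubectl -n openebs logs openebs-zfs-localpv-controller-0 -c snapshot-controller --timestamps > snapshot-controller.log
❯ kubectl -n openebs logs <zfs-localpv node pod> -c openebs-zfs-plugin --timestamps > openebs-zfs-plugin.log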

@Lyndon-Li Lyndon-Li self-assigned this Nov 8, 2023
@trunet
Author

trunet commented Nov 8, 2023

@Lyndon-Li I'll wait to see if the problem happens again tomorrow and report back with the logs. Thanks!

@trunet
Author

trunet commented Nov 8, 2023

@Lyndon-Li I've now had the error on 2 different volumes. Let's focus on one of them:

I redacted the output here, but the attached log is not redacted.

  Operation for persistentvolumeclaims [NAMESPACE]/[PV_NAME]-0:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-1b4258a5-fec8-42b7-b7e0-f994fc447604.f44628b1-4fc2-406a02d65
    Items to Update:
                           datauploads.velero.io velero/k8s-20231108030554-85dzc
    Phase:                 Failed
    Operation Error:       DataUpload is canceled
    Progress description:  Canceled
    Created:               2023-11-08 04:07:10 +0100 CET
    Started:               2023-11-08 04:07:10 +0100 CET
    Updated:               2023-11-08 04:08:30 +0100 CET

I fetched all ZFS CSI container logs:

openebs-zfs-localpv-controller-0_csi-provisioner.log
openebs-zfs-localpv-controller-0_csi-resizer.log
openebs-zfs-localpv-controller-0_csi-snapshotter.log
openebs-zfs-localpv-controller-0_openebs-zfs-plugin.log
openebs-zfs-localpv-controller-0_snapshot-controller.log
openebs-zfs-localpv-node-8qpjf_openebs-zfs-plugin.log

I grepped by PVC, PV and snapcontent and am uploading the result here. If you need the full logs for some reason, let me know and I can upload those as well.

grepped.log.gz

Further grepping by error:

E1108 03:07:05.534722       1 volume.go:309] Get snapshot failed  err: zfssnapshots.zfs.openebs.io "snapshot-51eddca1-98c1-4696-9d87-07b9d2f4ab03" not found
E1108 03:08:32.197414       1 zfs_util.go:732] zfs: could not destroy snapshot pvc-f44628b1-4fc2-4069-83c5-39d5b72231a0@snapshot-51eddca1-98c1-4696-9d87-07b9d2f4ab03 cmd [destroy zfspv-pool/pvc-f44628b1-4fc2-4069-83c5-39d5b72231a0@snapshot-51eddca1-98c1-4696-9d87-07b9d2f4ab03] error: cannot destroy 'zfspv-pool/pvc-f44628b1-4fc2-4069-83c5-39d5b72231a0@snapshot-51eddca1-98c1-4696-9d87-07b9d2f4ab03': snapshot has dependent clones
E1108 03:08:32.197484       1 snapshot.go:248] error syncing 'openebs/snapshot-51eddca1-98c1-4696-9d87-07b9d2f4ab03': exit status 1, requeuing
E1108 03:08:32.936279       1 zfs_util.go:732] zfs: could not destroy snapshot pvc-f44628b1-4fc2-4069-83c5-39d5b72231a0@snapshot-51eddca1-98c1-4696-9d87-07b9d2f4ab03 cmd [destroy zfspv-pool/pvc-f44628b1-4fc2-4069-83c5-39d5b72231a0@snapshot-51eddca1-98c1-4696-9d87-07b9d2f4ab03] error: cannot destroy 'zfspv-pool/pvc-f44628b1-4fc2-4069-83c5-39d5b72231a0@snapshot-51eddca1-98c1-4696-9d87-07b9d2f4ab03': snapshot has dependent clones
E1108 03:08:32.936659       1 snapshot.go:248] error syncing 'openebs/snapshot-51eddca1-98c1-4696-9d87-07b9d2f4ab03': exit status 1, requeuing
E1108 03:08:34.633941       1 snapshot.go:50] zfs snapshot 'openebs/snapshot-51eddca1-98c1-4696-9d87-07b9d2f4ab03' has been deleted

Yesterday I checked and none of my volumes had any snapshots: no zfssnapshots, no volumesnapshotcontents, and zfs list -t snapshot was also clean. Therefore any snapshot that was created/deleted/managed was triggered by the velero backup/data upload.
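
For reference, the check was roughly along these lines (zfssnapshots is the zfs-localpv CRD living in the openebs namespace; the zfs command runs on each node that hosts the pool):

❯ kubectl get volumesnapshots -A
❯ kubectl get volumesnapshotcontents
❯ kubectl get zfssnapshots -n openebs
❯ zfs list -t snapshot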

@Lyndon-Li
Contributor

@trunet Could you attach the new Velero log that matches the CSI logs in grepped.log.gz?

@trunet
Author

trunet commented Nov 8, 2023

@Lyndon-Li here
velero-backup.log.gz

@Lyndon-Li
Contributor

Lyndon-Li commented Nov 9, 2023

@trunet Please help share the entire Velero log bundle; we need to check the node-agent logs further. Meanwhile, if you have the full CSI driver logs, please share those as well.

@trunet
Author

trunet commented Nov 9, 2023

@Lyndon-Li these are today's logs. 4 data uploads failed today, some with DataUpload is canceled and some with error to expose snapshot: error to delete volume snapshot content: error to delete volume snapshot content: volumesnapshotcontents.snapshot.storage.k8s.io "snapcontent-873f0016-9846-4d96-b2d1-16d5941bed29" not found.

bundle-2023-11-09-10-54-54.tar.gz
openebs_logs-2023-11-09-10-28-26.tar.gz

@Lyndon-Li
Contributor

Lyndon-Li commented Nov 9, 2023

@trunet
For the cancelled DataUploads, the node-agent logged found a dataupload with status "InProgress" during the node-agent starting, mark it as cancel, which suggests the node-agent pods have restarted.
So a quick question: have you restarted/updated any of the Velero pods manually?
If not, could you run the two commands below and share the output:
kubectl get po -n velero
kubectl logs --previous <node-agent pod> (against all the node-agent pods)
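
For the second command, a small loop over every node-agent pod works as a sketch (it assumes the name=node-agent label that the DaemonSet sets on its pods):

❯ for p in $(kubectl get pods -n velero -l name=node-agent -o name); do
    echo "=== $p"                              # print the pod name as a header
    kubectl logs -n velero --previous "$p"     # logs of the previously terminated container
  done
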

@trunet
Author

trunet commented Nov 9, 2023

Nice catch... OOMKilled. kubectl logs --previous just shows the data upload happening and nothing else.

I'll remove the resources I had defined in my Helm values so it uses the chart defaults (double the current limits). I had reduced them a while ago to fit my cluster (and had forgotten about that). I'll come back to you tomorrow once the next backup finishes. I wonder if this will also fix the volumesnapshotcontents.snapshot.storage.k8s.io not found error.

❯ kubectl logs --previous node-agent-9hctq | tail -n2
time="2023-11-09T03:08:17Z" level=info msg="Reconcile k8s-20231109030556-88rhs" controller=dataupload dataupload=velero/k8s-20231109030556-88rhs logSource="pkg/controller/data_upload_controller.go:113"
time="2023-11-09T03:08:17Z" level=info msg="Data upload is in progress" controller=dataupload dataupload=velero/k8s-20231109030556-88rhs logSource="pkg/controller/data_upload_controller.go:293"

❯ kubectl get pods
NAME                      READY   STATUS    RESTARTS        AGE
node-agent-2wl6l          1/1     Running   0               2d20h
node-agent-9hctq          1/1     Running   1 (9h ago)      3d12h
node-agent-t5bqf          1/1     Running   1 (33h ago)     3d12h
velero-649dcfb86f-hhwdq   1/1     Running   1 (7h46m ago)   2d20h

❯ kubectl describe pod node-agent-9hctq
Name:             node-agent-9hctq
Namespace:        velero
Priority:         0
Service Account:  velero-server
Node:             talos01/192.168.178.91
Start Time:       Mon, 06 Nov 2023 00:26:39 +0100
Labels:           app.kubernetes.io/instance=velero
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=velero
                  controller-revision-hash=558478db98
                  helm.sh/chart=velero-5.1.2
                  name=node-agent
                  pod-template-generation=3
Annotations:      cni.projectcalico.org/containerID: c491f8f6ba6b3ff98a5a82a9b9481b779143d4c69bc6973e755f2917c9a75cc3
                  cni.projectcalico.org/podIP: 10.244.251.26/32
                  cni.projectcalico.org/podIPs: 10.244.251.26/32,[IPV6 REDACTED]
                  k8s.v1.cni.cncf.io/network-status:
                    [{
                        "name": "k8s-pod-network",
                        "ips": [
                            "10.244.251.26",
                            "2a02:58:148:8244:120e:9743:bfb:297e"
                        ],
                        "default": true,
                        "dns": {}
                    }]
                  prometheus.io/path: /metrics
                  prometheus.io/port: 8085
                  prometheus.io/scrape: true
Status:           Running
IP:               10.244.251.26
IPs:
  IP:           10.244.251.26
  IP:           [IPv6 REDACTED]
Controlled By:  DaemonSet/node-agent
Containers:
  node-agent:
    Container ID:  containerd://47aa9c915a65bea5c1c5d27f0abb42e6d8b5aa04ea986057cec5116fb9968cfa
    Image:         velero/velero:main
    Image ID:      docker.io/velero/velero@sha256:8a0793cbc22db3b19fb419953b97e56f7edc46e80b206e64a493bfc80c1de3db
    Port:          8085/TCP
    Host Port:     0/TCP
    Command:
      /velero
    Args:
      node-agent
      server
      --features=EnableCSI
      --log-level=info
    State:          Running
      Started:      Thu, 09 Nov 2023 04:08:20 +0100
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 06 Nov 2023 00:26:41 +0100
      Finished:     Thu, 09 Nov 2023 04:08:19 +0100
    Ready:          True
    Restart Count:  1
    Limits:
      cpu:     500m
      memory:  512Mi
    Requests:
      cpu:     250m
      memory:  256Mi
    Environment:
      VELERO_NAMESPACE:                velero (v1:metadata.namespace)
      NODE_NAME:                        (v1:spec.nodeName)
      VELERO_SCRATCH_DIR:              /scratch
      AWS_SHARED_CREDENTIALS_FILE:     /credentials/cloud
      GOOGLE_APPLICATION_CREDENTIALS:  /credentials/cloud
      AZURE_CREDENTIALS_FILE:          /credentials/cloud
      ALIBABA_CLOUD_CREDENTIALS_FILE:  /credentials/cloud
      OPENEBS_NAMESPACE:               openebs
    Mounts:
      /credentials from cloud-credentials (rw)
      /host_pods from host-pods (rw)
      /scratch from scratch (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gtmmf (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       True
  ContainersReady             True
  PodScheduled                True
Volumes:
  cloud-credentials:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  velero-aws-credentials
    Optional:    false
  host-pods:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/pods
    HostPathType:
  scratch:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  kube-api-access-gtmmf:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:                      <none>

@Lyndon-Li
Contributor

@trunet
Please help test it after increasing the memory limit. It is very likely the snapshot not found problem was also caused by the node-agent pod restart: after a node-agent pod restarts, some in-memory information is lost, so several code paths try to cancel/fail the DataUpload. However, not all the node-agent pods have restarted, and a surviving one may still be processing the DataUpload; as a result, there is a race.

So please help confirm this.
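
One quick way to check that (a sketch that assumes the name=node-agent label on the DaemonSet pods):

❯ kubectl get pods -n velero -l name=node-agent \
    -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount,LAST_REASON:.status.containerStatuses[0].lastState.terminated.reason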

@trunet
Author

trunet commented Nov 10, 2023

@Lyndon-Li I really appreciate the time and effort you're putting into this.

Today's result: I got no restarts after fixing the resources:

❯ kubectl get pods
NAME                      READY   STATUS    RESTARTS   AGE
node-agent-78kg9          1/1     Running   0          21h
node-agent-j7ntj          1/1     Running   0          21h
node-agent-nhvfv          1/1     Running   0          21h
velero-67974cfcc4-l9pb8   1/1     Running   0          21h

❯ kubectl get pods -n openebs
NAME                                           READY   STATUS    RESTARTS   AGE
openebs-localpv-provisioner-759dd54cf4-7mxrw   1/1     Running   0          2d21h
openebs-zfs-localpv-controller-0               5/5     Running   0          2d21h
openebs-zfs-localpv-node-7mpd7                 2/2     Running   0          2d21h
openebs-zfs-localpv-node-8qpjf                 2/2     Running   0          2d21h
openebs-zfs-localpv-node-d47wh                 2/2     Running   0          2d21h

One failed:

  Operation for persistentvolumeclaims heimdall/heimdall-data1-heimdall-0:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-d87ffaac-fed8-417a-95eb-5a7b0a5efc5a.7e3d211c-2721-4a7d422d1
    Items to Update:
                           datauploads.velero.io velero/k8s-20231110030553-pt87l
    Phase:                 Failed
    Operation Error:       error to expose snapshot: error to delete volume snapshot content: error to delete volume snapshot content: volumesnapshotcontents.snapshot.storage.k8s.io "snapcontent-9a328e95-bd95-4a25-9bf9-4deb607668e6" not found
    Progress description:  Failed
    Created:               2023-11-10 04:06:23 +0100 CET
    Started:               2023-11-10 04:06:23 +0100 CET
    Updated:               2023-11-10 04:06:26 +0100 CET

Logs:
bundle-2023-11-10-10-56-39.tar.gz
openebs_zfs_localpv.tar.gz

Yesterday I installed a logging stack with fluent-bit sending to an Elasticsearch cluster, and I generated this CSV report from Kibana, filtered by the PVC name, PV name, or snapcontent ID, to try to help you.

kibana_report.csv.gz

@boedy

boedy commented Nov 10, 2023

Seems like I'm running into the same issue. I'm on Velero 5.1.3. All my node agents are running fine (no restarts).

Backup Item Operations:
  Operation for persistentvolumeclaims main/storage-mariadb-1:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-48f9d651-4f46-45f9-a252-89c8b00bee07.66019010-54d5-41a5e4666
    Items to Update:
                           datauploads.velero.io velero/10min-20231110154025-qdjhk
    Phase:                 Completed
    Progress:              544381632 of 544381632 complete (Bytes)
    Progress description:  Completed
    Created:               2023-11-10 16:40:35 +0100 CET
    Started:               2023-11-10 16:40:36 +0100 CET
    Updated:               2023-11-10 16:40:49 +0100 CET
  Operation for persistentvolumeclaims main/storage-mariadb-0:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-48f9d651-4f46-45f9-a252-89c8b00bee07.18caf264-1fe4-475b820e8
    Items to Update:
                           datauploads.velero.io velero/10min-20231110154025-sj9sp
    Phase:                 Completed
    Progress:              544381173 of 544381173 complete (Bytes)
    Progress description:  Completed
    Created:               2023-11-10 16:40:41 +0100 CET
    Started:               2023-11-10 16:40:41 +0100 CET
    Updated:               2023-11-10 16:40:59 +0100 CET
  Operation for persistentvolumeclaims main/storage-mariadb-2:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-48f9d651-4f46-45f9-a252-89c8b00bee07.4991cf58-17c7-4919909eb
    Items to Update:
                           datauploads.velero.io velero/10min-20231110154025-v5zwh
    Phase:                 Failed
    Operation Error:       error to expose snapshot: error to delete volume snapshot content: error to delete volume snapshot content: volumesnapshotcontents.snapshot.storage.k8s.io "snapcontent-bc37a272-3dbc-4785-a672-1012db16932c" not found
    Progress description:  Failed
    Created:               2023-11-10 16:40:46 +0100 CET
    Started:               2023-11-10 16:40:46 +0100 CET
    Updated:               2023-11-10 16:40:49 +0100 CET
  Operation for persistentvolumeclaims main/beanstalkd-beanstalkd-0:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-48f9d651-4f46-45f9-a252-89c8b00bee07.8bf958ac-7945-4e7789610
    Items to Update:
                           datauploads.velero.io velero/10min-20231110154025-cckpz
    Phase:                 Completed
    Progress:              10485760 of 10485760 complete (Bytes)
    Progress description:  Completed
    Created:               2023-11-10 16:40:57 +0100 CET
    Started:               2023-11-10 16:40:57 +0100 CET
    Updated:               2023-11-10 16:41:11 +0100 CET

We are using the Linstor CSI. The issue seems intermittent, but backups fail for the most part:

$ velero backup get
NAME                    STATUS            ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
10min-20231110161025    PartiallyFailed   2        0          2023-11-10 17:10:25 +0100 CET   16m       default            <none>
10min-20231110160025    PartiallyFailed   1        0          2023-11-10 17:00:25 +0100 CET   6m        default            <none>
10min-20231110155025    Completed         0        0          2023-11-10 16:50:25 +0100 CET   3m ago    default            <none>
10min-20231110154025    PartiallyFailed   1        0          2023-11-10 16:40:25 +0100 CET   13m ago   default            <none>
10min-20231110153025    Completed         0        0          2023-11-10 16:30:25 +0100 CET   23m ago   default            <none>
hourly-20231110160025   PartiallyFailed   3        0          2023-11-10 17:00:57 +0100 CET   1d        default            <none>

Snapshots are successfully taken:

$ linstor snapshot list
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ResourceName                             | SnapshotName                                  | NodeNames              | Volumes  | CreatedOn           | State      |
|=================================================================================================================================================================|
| pvc-77f31934-95a6-4b41-a360-4a2b8ae9733f | snapshot-bc37a272-3dbc-4785-a672-1012db16932c | h-fsn-ded2             | 0: 5 GiB | 2023-11-10 15:40:42 | Successful |
| pvc-99c35065-89db-4bba-a407-ed0de2becb6e | snapshot-cb0c3f24-ec51-4b47-820e-1b736667c71d | h-fsn-ded4             | 0: 5 GiB | 2023-11-10 16:02:00 | Successful |
| pvc-a393d2f1-54dc-4deb-9721-b3e5ab60d429 | snapshot-3201e114-3617-4bc9-962a-3123275ed257 | h-fsn-ded4, h-fsn-ded5 | 0: 3 GiB | 2023-11-10 16:01:20 | Successful |
| pvc-cfee054e-53db-4b0e-b826-cd332a87b81f | snapshot-37e86026-e638-4332-bba1-bb4072f0a269 | h-fsn-ded5             | 0: 5 GiB | 2023-11-10 16:00:31 | Successful |
| pvc-e53269b5-4234-42cd-a1da-7d9bf84ef970 | snapshot-f666b96a-3007-4115-959c-9074d6ab3de4 | h-fsn-ded4, h-fsn-ded5 | 0: 1 GiB | 2023-11-10 16:01:36 | Successful |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+

@boedy

boedy commented Nov 10, 2023

In my case I get the impression the wrong node-agent is processing the snapshot. These logs were taken from the velero node-agent running on node h-fsn-ded4, whilst the actual snapshot data is stored on h-fsn-ded2, as can be seen in the previous table.

Relevant logs from the node-agent running on h-fsn-ded4:

time="2023-11-10T15:40:46Z" level=info msg="Exposing CSI snapshot" logSource="pkg/exposer/csi_snapshot.go:95" owner=10min-20231110154025-v5zwh
time="2023-11-10T15:40:46Z" level=info msg="Volumesnapshot is ready" logSource="pkg/exposer/csi_snapshot.go:102" owner=10min-20231110154025-v5zwh
time="2023-11-10T15:40:46Z" level=info msg="Got VSC from VS in namespace main" logSource="pkg/exposer/csi_snapshot.go:109" owner=10min-20231110154025-v5zwh vs name=velero-storage-mariadb-2-qzp89 vsc name=snapcontent-bc37a272-3dbc-4785-a672-1012db16932c
time="2023-11-10T15:40:46Z" level=info msg="Finished to retain VSC" logSource="pkg/exposer/csi_snapshot.go:116" owner=10min-20231110154025-v5zwh retained=true vsc name=snapcontent-bc37a272-3dbc-4785-a672-1012db16932c
time="2023-11-10T15:40:48Z" level=info msg="VS is deleted in namespace main" logSource="pkg/exposer/csi_snapshot.go:129" owner=10min-20231110154025-v5zwh vs name=velero-storage-mariadb-2-qzp89
1.6996308490858662e+09	ERROR	Reconciler error	{"controller": "dataupload", "controllerGroup": "velero.io", "controllerKind": "DataUpload", "dataUpload": {"name":"10min-20231110154025-v5zwh","namespace":"velero"}, "namespace": "velero", "name": "10min-20231110154025-v5zwh", "reconcileID": "5531df84-bee8-4bab-b53e-0305e459945d", "error": "error to delete volume snapshot content: error to delete volume snapshot content: volumesnapshotcontents.snapshot.storage.k8s.io \"snapcontent-bc37a272-3dbc-4785-a672-1012db16932c\" not found", "errorVerbose": "volumesnapshotcontents.snapshot.storage.k8s.io \"snapcontent-bc37a272-3dbc-4785-a672-1012db16932c\" not found\nerror to delete volume snapshot content\ngithub.com/vmware-tanzu/velero/pkg/util/csi.EnsureDeleteVSC\n\t/go/src/github.com/vmware-tanzu/velero/pkg/util/csi/volume_snapshot.go:177\ngithub.com/vmware-tanzu/velero/pkg/exposer.(*csiSnapshotExposer).Expose\n\t/go/src/github.com/vmware-tanzu/velero/pkg/exposer/csi_snapshot.go:131\ngithub.com/vmware-tanzu/velero/pkg/controller.(*DataUploadReconciler).Reconcile\n\t/go/src/github.com/vmware-tanzu/velero/pkg/controller/data_upload_controller.go:188\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598\nerror to delete volume snapshot content\ngithub.com/vmware-tanzu/velero/pkg/exposer.(*csiSnapshotExposer).Expose\n\t/go/src/github.com/vmware-tanzu/velero/pkg/exposer/csi_snapshot.go:133\ngithub.com/vmware-tanzu/velero/pkg/controller.(*DataUploadReconciler).Reconcile\n\t/go/src/github.com/vmware-tanzu/velero/pkg/controller/data_upload_controller.go:188\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
time="2023-11-10T15:40:49Z" level=info msg="Reconcile 10min-20231110154025-qdjhk" controller=dataupload dataupload=velero/10min-20231110154025-qdjhk logSource="pkg/controller/data_upload_controller.go:113"
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234

The Pod mounting the PV is also running on h-fsn-ded2.

@Lyndon-Li
Contributor

Lyndon-Li commented Nov 14, 2023

@trunet @boedy
I have created a trial fix; please help verify it through the latest main image velero/velero:main (if you have pulled the main image previously, make sure it is re-pulled, i.e. by setting the Velero server and node-agent pods' imagePullPolicy to Always).
I cannot reproduce the issue locally, so I will wait for your verification results.
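
If the install is Helm-based, a minimal sketch of one way to force the re-pull (the image.pullPolicy key and the resource names here are assumptions about your setup):

❯ helm upgrade velero vmware-tanzu/velero -n velero --reuse-values --set image.pullPolicy=Always
❯ kubectl -n velero rollout restart deployment/velero daemonset/node-agent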

Hopefully we can resolve and verify it this week, so the fix can catch the 1.12.2 release. Thanks!

@boedy

boedy commented Nov 14, 2023

I just tried out your trial fix. It's complaining "no new finalizers can be added if the object is being deleted":

Backup Item Operations:
  Operation for persistentvolumeclaims main/storage-mariadb-1:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-46a5b339-7c7e-4cc2-b7e5-cb800d91f9c2.66019010-54d5-41abdb916
    Items to Update:
                           datauploads.velero.io velero/10min-20231114114009-zdvxt
    Phase:                 Completed
    Progress:              544381632 of 544381632 complete (Bytes)
    Progress description:  Completed
    Created:               2023-11-14 12:40:20 +0100 CET
    Started:               2023-11-14 12:40:20 +0100 CET
    Updated:               2023-11-14 12:40:43 +0100 CET
  Operation for persistentvolumeclaims main/storage-mariadb-0:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-46a5b339-7c7e-4cc2-b7e5-cb800d91f9c2.18caf264-1fe4-4757ac248
    Items to Update:
                           datauploads.velero.io velero/10min-20231114114009-248sv
    Phase:                 Completed
    Progress:              544381173 of 544381173 complete (Bytes)
    Progress description:  Completed
    Created:               2023-11-14 12:40:25 +0100 CET
    Started:               2023-11-14 12:40:25 +0100 CET
    Updated:               2023-11-14 12:40:53 +0100 CET
  Operation for persistentvolumeclaims main/storage-mariadb-2:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-46a5b339-7c7e-4cc2-b7e5-cb800d91f9c2.4991cf58-17c7-491225a20
    Items to Update:
                           datauploads.velero.io velero/10min-20231114114009-n95fr
    Phase:                 Completed
    Progress:              544660117 of 544660117 complete (Bytes)
    Progress description:  Completed
    Created:               2023-11-14 12:40:30 +0100 CET
    Started:               2023-11-14 12:40:30 +0100 CET
    Updated:               2023-11-14 12:40:53 +0100 CET
  Operation for persistentvolumeclaims main/beanstalkd-beanstalkd-0:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-46a5b339-7c7e-4cc2-b7e5-cb800d91f9c2.8bf958ac-7945-4e78e4147
    Items to Update:
                           datauploads.velero.io velero/10min-20231114114009-xmh5c
    Phase:                 Failed
    Operation Error:       error to expose snapshot: error to delete volume snapshot content: error to remove protect finalizer from vsc snapcontent-85024560-ca99-4f46-8378-5439bdb11a15: error patching VSC: VolumeSnapshotContent.snapshot.storage.k8s.io "snapcontent-85024560-ca99-4f46-8378-5439bdb11a15" is invalid: metadata.finalizers: Forbidden: no new finalizers can be added if the object is being deleted, found new finalizers []string{"snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection"}
    Progress description:  Failed
    Created:               2023-11-14 12:40:40 +0100 CET
    Started:               2023-11-14 12:40:40 +0100 CET
    Updated:               2023-11-14 12:40:43 +0100 CET

@trunet
Author

trunet commented Nov 14, 2023

I just updated mine. Will have results tomorrow.

@Lyndon-Li
Contributor

Lyndon-Li commented Nov 15, 2023

@boedy @trunet
I updated the code again, so please use the latest main image velero/velero:main to test it one more time (So do remember to set the Velero server & node-agent pods' pullPolicy to Always).

The below error indicates that we are heading in the right direction to fix the problem; the previous code was just not robust enough to resolve the race:
error to delete volume snapshot content: error to remove protect finalizer from vsc snapcontent-85024560-ca99-4f46-8378-5439bdb11a15: error patching VSC: VolumeSnapshotContent.snapshot.storage.k8s.io "snapcontent-85024560-ca99-4f46-8378-5439bdb11a15" is invalid: metadata.finalizers: Forbidden: no new finalizers can be added if the object is being deleted, found new finalizers []string{"snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection"}

Hopefully, with the new code update, it will succeed.

(The latest main image should be available in 30 min from now)

@boedy

boedy commented Nov 15, 2023

Great work, @trunet!

I believe the latest image fixed the issue. I've just completed 6 consecutive backups without any errors! 🥳

@trunet
Author

trunet commented Nov 15, 2023

I had one backup complete and one fail yesterday (before this last commit fix). It failed because the default helm chart memory resources were not enough and the velero pod got OOMKilled. I increased the memory limit to 768Mi and the request to 256Mi (see the sketch after the pod description below).

I updated the container once again; let's wait for tomorrow's run to confirm.

...
Containers:
  velero:
    Container ID:  containerd://6d72a8d239d77ae3c7634a4508d1ec398f8a57c63aa5142195aa1422b27641b3
    Image:         velero/velero:main
    Image ID:      docker.io/velero/velero@sha256:a1151590d9d904555c9fa2a131f1d37205a924004dbcd0c3d39bf27bbed9696a
    Port:          8085/TCP
    Host Port:     0/TCP
    Command:
      /velero
    Args:
      server
      --uploader-type=kopia
      --log-level=info
      --features=EnableCSI
    State:          Running
      Started:      Wed, 15 Nov 2023 04:58:47 +0100
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Tue, 14 Nov 2023 12:58:19 +0100
      Finished:     Wed, 15 Nov 2023 04:58:41 +0100
    Ready:          True
    Restart Count:  1
    Limits:
      cpu:     1
      memory:  512Mi
    Requests:
      cpu:      500m
      memory:   128Mi
    Liveness:   http-get http://:http-monitoring/metrics delay=10s timeout=5s period=30s #success=1 #failure=5
    Readiness:  http-get http://:http-monitoring/metrics delay=10s timeout=5s period=30s #success=1 #failure=5
    Environment:
      VELERO_SCRATCH_DIR:              /scratch
      VELERO_NAMESPACE:                velero (v1:metadata.namespace)
      LD_LIBRARY_PATH:                 /plugins
      AWS_SHARED_CREDENTIALS_FILE:     /credentials/cloud
      GOOGLE_APPLICATION_CREDENTIALS:  /credentials/cloud
      AZURE_CREDENTIALS_FILE:          /credentials/cloud
      ALIBABA_CLOUD_CREDENTIALS_FILE:  /credentials/cloud
      OPENEBS_NAMESPACE:               openebs
    Mounts:
      /credentials from cloud-credentials (rw)
      /plugins from plugins (rw)
      /scratch from scratch (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5vvp7 (ro)
...
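
For reference, a minimal sketch of how that memory bump could be applied through the Helm chart (assuming the vmware-tanzu/velero chart, where the top-level resources key controls the server Deployment; adjust if you keep these settings in a values file):

❯ helm upgrade velero vmware-tanzu/velero -n velero --reuse-values \
    --set resources.requests.memory=256Mi \
    --set resources.limits.memory=768Mi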

@Lyndon-Li Lyndon-Li added this to the v1.13 milestone Nov 16, 2023
@Lyndon-Li
Contributor

@trunet Looks like the issue has been fixed, but I will wait for your verification. Hopefully we can get the result this week; otherwise, we will have to deliver the fix after 1.12.2.

@trunet
Author

trunet commented Nov 16, 2023

@Lyndon-Li I haven't had any more backup errors since this change. I'm closing this ticket; feel free to release it.
