CSI Snapshots fail after velero CSI plugin update #6919

Closed
stocc opened this issue Oct 4, 2023 · 9 comments

stocc commented Oct 4, 2023

What steps did you take and what happened:

After an upgrade of velero-plugin-for-csi from version 0.5.1 to 0.6.0, all our backups involving CSI volume snapshots started to partially fail.
The relevant log line appears to be

time="2023-10-04T15:30:34Z" level=error msg="error getting volumesnapshot webportal/velero-redis-data-test-redis-replicas-2-5q55n: volumesnapshots.snapshot.storage.k8s.io \"velero-redis-data-test-redis-replicas-2-5q55n\" not found" backup=velero/webportal-20231004151837 cmd=/plugins/velero-plugin-for-csi logSource="/go/src/velero-plugin-for-csi/internal/backup/volumesnapshot_action.go:234" pluginName=velero-plugin-for-csi

This has happened in multiple clusters, both on-premises and on AWS EKS, with different CSI backends (Longhorn and AWS EBS).
After downgrading the plugin to v0.5.1, our backups work again.
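
The log line quoted above is in key=value form, so its fields can be pulled apart for filtering or grouping. Below is a minimal sketch that assumes every value is either a bare word or a double-quoted string with backslash-escaped quotes, as in that line. The function name `parse_logfmt` is made up for this example and is not part of Velero.

```python
import shlex


def parse_logfmt(line: str) -> dict:
    """Split a key=value log line into a dict, honoring double-quoted values."""
    fields = {}
    # shlex handles the quoting: "..." groups words, \" is an escaped quote.
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # ignore any token that has no '='
            fields[key] = value
    return fields


# The exact error line from this report.
LINE = r'time="2023-10-04T15:30:34Z" level=error msg="error getting volumesnapshot webportal/velero-redis-data-test-redis-replicas-2-5q55n: volumesnapshots.snapshot.storage.k8s.io \"velero-redis-data-test-redis-replicas-2-5q55n\" not found" backup=velero/webportal-20231004151837 cmd=/plugins/velero-plugin-for-csi logSource="/go/src/velero-plugin-for-csi/internal/backup/volumesnapshot_action.go:234" pluginName=velero-plugin-for-csi'

fields = parse_logfmt(LINE)
print(fields["level"], fields["backup"], fields["logSource"])
```

Filtering on `level`, `pluginName` and `backup` makes it easier to find the same error across several backups in a debug bundle.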

What did you expect to happen:
After the upgrade, the backups continue to run without issues.

Anything else you would like to add:

bundle-2023-10-04-17-35-17.tar.gz

I set the log level to debug if that helps.
Here's the relevant backup schedule:

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: webportal
  namespace: velero
spec:
  schedule: 20 2 * * *
  template:
    csiSnapshotTimeout: 30m
    hooks: {}
    includedNamespaces:
    - 'webportal'
    metadata: {}
    snapshotVolumes: true
    ttl: 720h0m0s
  useOwnerReferencesInBackup: true

Environment:

  • Velero version (use velero version): v1.11.1 (Helm Chart v5.0.2)
  • Velero features (use velero client config get features): features:
  • Kubernetes version (use kubectl version): Server Version: v1.27.6
  • Kubernetes installer & version: kubeadm (Issue also occurs on AWS EKS)
  • Cloud provider or hardware configuration: vSphere VMs (also AWS EKS)
  • OS (e.g. from /etc/os-release): AlmaLinux 8 (and Amazon Linux)

@danfengliu
Contributor

danfengliu commented Oct 5, 2023

I need to investigate further to find the root cause of these failures. Until there is a conclusion, please check the Velero CSI plugin compatibility statement. Could you also tell us why velero-plugin-for-csi was upgraded on its own?
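
A version check of this kind can be sketched in a few lines. The table below contains only the two pairings reported in this thread (plugin v0.5.1 working with Velero v1.11.1, and plugin v0.6.0 intended for Velero v1.12.0). Treating them as whole minor-version lines is an assumption, and the project's compatibility statement remains the authoritative source. `KNOWN_PAIRINGS`, `minor` and `is_known_pairing` are hypothetical names.

```python
from typing import Optional

# Plugin minor version -> Velero server minor version.
# Assumption: pairings from this thread, generalized to minor versions.
KNOWN_PAIRINGS = {
    "0.5": "1.11",  # plugin v0.5.1 worked with Velero v1.11.1
    "0.6": "1.12",  # plugin v0.6.0 is intended for Velero v1.12.0
}


def minor(version: str) -> str:
    """Reduce 'v1.11.1' or '1.11.1' to its 'major.minor' form, e.g. '1.11'."""
    return ".".join(version.lstrip("v").split(".")[:2])


def is_known_pairing(plugin_version: str, server_version: str) -> Optional[bool]:
    """Return True/False for a listed plugin line, None if it is not listed."""
    expected = KNOWN_PAIRINGS.get(minor(plugin_version))
    if expected is None:
        return None
    return expected == minor(server_version)


# The combination from the original report.
print(is_known_pairing("v0.6.0", "v1.11.1"))
```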

@Lyndon-Li
Contributor

Looks like a duplicate of #6852.

@danfengliu danfengliu self-assigned this Oct 10, 2023
@blackpiglet
Contributor

I agree with @danfengliu. The v0.6.0 Velero CSI plugin is meant to work with the v1.12.0 Velero server.
The way VolumeSnapshot resources created during backup are handled has changed.
These VolumeSnapshot resources should be cleaned up, because doing so prevents the underlying snapshots from being deleted when the VolumeSnapshots or their namespace are deleted.

The change introduced in v1.12.0 is that the VolumeSnapshot cleanup logic moved into the CSI plugin. The benefit is that the time-consuming handling of multiple VolumeSnapshots now runs concurrently.
If a v1.11.x Velero server is used with a v0.6.x Velero CSI plugin, both the Velero server and the CSI plugin may try to delete the same VolumeSnapshot, which may cause the described issue.
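
The double-cleanup scenario can be illustrated with a small in-memory model. This is only a sketch of the general pattern, not Velero's or the plugin's actual code. `FakeCluster`, `NotFound` and `delete_ignoring_not_found` are hypothetical stand-ins, and the snapshot name is shortened for the example.

```python
class NotFound(Exception):
    """Stand-in for the API server's 'not found' error."""


class FakeCluster:
    """Tiny in-memory model of the VolumeSnapshot objects in a namespace."""

    def __init__(self, names):
        self.snapshots = set(names)

    def delete(self, name):
        if name not in self.snapshots:
            raise NotFound(f'volumesnapshots "{name}" not found')
        self.snapshots.remove(name)


def delete_ignoring_not_found(cluster, name):
    """Delete, treating an already-missing object as success.

    Returns True if this call removed the object, False if it was already gone.
    """
    try:
        cluster.delete(name)
        return True
    except NotFound:
        return False


cluster = FakeCluster(["velero-redis-data-5q55n"])  # hypothetical name

# First component cleans up the VolumeSnapshot.
cluster.delete("velero-redis-data-5q55n")

# Second component repeats the cleanup and hits "not found".
try:
    cluster.delete("velero-redis-data-5q55n")
except NotFound as err:
    print("second cleanup failed:", err)

# A tolerant delete would not surface this as an error.
print(delete_ignoring_not_found(cluster, "velero-redis-data-5q55n"))
```

Whichever component runs second finds the object already gone, which matches the shape of the "not found" error in the original report.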

@al-cheb

al-cheb commented Oct 17, 2023

Looks like the same issue for me with Velero 1.12.0 and velero-plugin-for-csi:v0.6.0:

$ velero backup get -n backup
wordpress-17102023-1247   PartiallyFailed   1        0          2023-10-17 13:47:38 +0300 MSK   29d       default            <none>
wordpress-17102023-1355   Completed         0        0          2023-10-17 13:55:09 +0300 MSK   29d       default            <none>

$ velero backup describe wordpress-17102023-1247 -n backup --details

Started:    2023-10-17 13:47:38 +0300 MSK
Completed:  2023-10-17 13:48:53 +0300 MSK

Expiration:  2023-11-16 13:47:38 +0300 MSK

Total items to be backed up:  27
Items backed up:              27

Backup Item Operations:
  Operation for persistentvolumeclaims wordpress/wordpress:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-2cafb064-988d-4705-83d8-5abbd45d33ba.93e126eb-7c4b-4531c1ac6
    Items to Update:
                           datauploads.velero.io backup/wordpress-17102023-1247-hznzp
    Phase:                 Failed
    Operation Error:       error to expose snapshot: error to delete volume snapshot content: error to delete volume snapshot content: volumesnapshotcontents.snapshot.storage.k8s.io "snapcontent-d8059ccb-2bca-4334-b5d4-2fbd03dd5241" not found
    Progress description:  Failed
    Created:               2023-10-17 13:47:55 +0300 MSK
    Started:               2023-10-17 13:47:55 +0300 MSK
    Updated:               2023-10-17 13:47:58 +0300 MSK
  Operation for persistentvolumeclaims wordpress/data-wordpress-mariadb-0:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-2cafb064-988d-4705-83d8-5abbd45d33ba.0a76ef60-540c-47f59da0e
    Items to Update:
                           datauploads.velero.io backup/wordpress-17102023-1247-fd9ff
    Phase:                 Completed
    Progress:              172612867 of 172612867 complete (Bytes)
    Progress description:  Completed
    Created:               2023-10-17 13:48:11 +0300 MSK
    Started:               2023-10-17 13:48:11 +0300 MSK
    Updated:               2023-10-17 13:48:53 +0300 MSK

$ velero backup describe wordpress-17102023-1355 -n backup --details

Started:    2023-10-17 13:55:09 +0300 MSK
Completed:  2023-10-17 13:57:23 +0300 MSK

Expiration:  2023-11-16 13:55:09 +0300 MSK

Total items to be backed up:  41
Items backed up:              41

Backup Item Operations:
  Operation for persistentvolumeclaims wordpress/wordpress:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-fe2c661c-fb84-4e9a-b423-1decc2e5ff6c.93e126eb-7c4b-4534ae47f
    Items to Update:
                           datauploads.velero.io backup/wordpress-17102023-1355-2g97j
    Phase:                 Completed
    Progress:              151450621 of 151450621 complete (Bytes)
    Progress description:  Completed
    Created:               2023-10-17 13:55:26 +0300 MSK
    Started:               2023-10-17 13:55:26 +0300 MSK
    Updated:               2023-10-17 13:56:24 +0300 MSK
  Operation for persistentvolumeclaims wordpress/data-wordpress-mariadb-0:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-fe2c661c-fb84-4e9a-b423-1decc2e5ff6c.0a76ef60-540c-47fa6eaf4
    Items to Update:
                           datauploads.velero.io backup/wordpress-17102023-1355-25vkk
    Phase:                 Completed
    Progress:              172612867 of 172612867 complete (Bytes)
    Progress description:  Completed
    Created:               2023-10-17 13:55:41 +0300 MSK
    Started:               2023-10-17 13:55:41 +0300 MSK
    Updated:               2023-10-17 13:57:14 +0300 MSK
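
To spot the failing operation in output like the above without reading every block, the per-PVC phases can be extracted with a short script. This sketch assumes the text layout shown in this comment ("Operation for <resource> <namespace/name>:" followed by a "Phase:" line), which may differ between Velero versions. `operation_phases` is a hypothetical helper, and `SAMPLE` is a trimmed excerpt of the first describe output.

```python
import re


def operation_phases(describe_output: str) -> dict:
    """Map each 'Operation for <resource> <ns/name>:' header to its Phase."""
    phases = {}
    current = None
    for raw in describe_output.splitlines():
        line = raw.strip()
        header = re.match(r"Operation for \S+ (\S+):$", line)
        if header:
            current = header.group(1)
        elif current and line.startswith("Phase:"):
            phases[current] = line.split(":", 1)[1].strip()
            current = None
    return phases


# Trimmed excerpt of the output for backup wordpress-17102023-1247.
SAMPLE = """\
Backup Item Operations:
  Operation for persistentvolumeclaims wordpress/wordpress:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Phase:                 Failed
  Operation for persistentvolumeclaims wordpress/data-wordpress-mariadb-0:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Phase:                 Completed
"""

failed = [pvc for pvc, phase in operation_phases(SAMPLE).items() if phase != "Completed"]
print(failed)  # prints ['wordpress/wordpress']
```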

@blackpiglet
Contributor

@al-cheb
Could you collect the debug bundle for both the succeeded and the failed backup?
The command to collect a debug bundle is velero debug --backup <backup-name>

@al-cheb

al-cheb commented Oct 18, 2023

@blackpiglet
Contributor

Thanks for the debug information.
It looks like the Failed and Completed backups used the same Velero and CSI plugin versions: both ran Velero v1.12.0 with the v0.6.0 CSI plugin image.

I found an error in the node-agent pod log:

time="2023-10-17T12:17:25Z" level=error msg="Error when processing docker/registry/v2/repositories/re/tent/_uploads/05efba34-ec82-44cf-aceb-5c800c626174/data: ConcatenateObjects is not supported" backup=backup/harbor-fsb-02 controller=podvolumebackup logSource="pkg/uploader/kopia/progress.go:86" parentSnapshot= path="/host_pods/2843cf3c-ca70-4e89-a79b-e6d378d5998d/volumes/kubernetes.io~csi/pvc-cc7f441f-32a4-46ba-93ca-be1943c1c422/mount" podvolumebackup=backup/harbor-fsb-02-8scrj realSource=

The problem is the same as #6880.
It is already fixed in the upcoming v1.12.1 release and on the main branch.


This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.


github-actions bot commented Jan 1, 2024

This issue was closed because it has been stalled for 14 days with no activity.

@github-actions github-actions bot closed this as not planned Jan 1, 2024