Backup partially failed with csi plugin 0.6.0-rc2 on OVH cluster #6852
I took some time to debug while looking at the source code, so here are my investigations in case they help in any way:
But if it is iterating twice over this loop, it would mean that the first time it was able to successfully get the VolumeSnapshot and reached the
From the log below, the Velero CSI plugin indeed polled the VS twice. The first time it got the VS successfully, but it failed the second time:
Perhaps the VS was deleted after the first poll, but I don't know why. I searched the log; Velero didn't do it, since the DataUpload request had not been created yet, so no data mover modules would touch the VS. @Arcahub Additionally, could you also try a CSI snapshot backup (without data movement) with Velero 1.12 + CSI plugin 0.6.0? You can run this by removing the

CSI snapshot backup has a somewhat different workflow from CSI snapshot data movement backup; let's see whether or not this is a generic problem related to CSI snapshot.
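For reference, here is a minimal sketch of the kind of poll loop being discussed, written against the upstream external-snapshotter client. It only illustrates the suspected failure mode (the VolumeSnapshot disappearing between two polls turning into a NotFound error); it is not the CSI plugin's actual code, and the function name is made up:

```go
package snapshotwait

import (
	"context"
	"fmt"
	"time"

	snapshotv1 "github.com/kubernetes-csi/external-snapshotter/client/v6/apis/volumesnapshot/v1"
	snapshotclient "github.com/kubernetes-csi/external-snapshotter/client/v6/clientset/versioned"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForSnapshotHandle (hypothetical) polls a VolumeSnapshot until it is
// bound to a VolumeSnapshotContent, i.e. until a snapshot handle can exist.
func waitForSnapshotHandle(ctx context.Context, c snapshotclient.Interface,
	ns, name string, timeout time.Duration) (*snapshotv1.VolumeSnapshot, error) {
	var vs *snapshotv1.VolumeSnapshot
	err := wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		got, err := c.SnapshotV1().VolumeSnapshots(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			// If the VS is deleted between two polls -- the scenario suspected
			// in this issue -- the second Get returns NotFound and the whole
			// wait fails, even though the first poll succeeded.
			if apierrors.IsNotFound(err) {
				return false, fmt.Errorf("volumesnapshot %s/%s disappeared while waiting: %w", ns, name, err)
			}
			return false, err // other errors also abort the wait
		}
		vs = got
		// Done once the VS is bound to a content object.
		return vs.Status != nil && vs.Status.BoundVolumeSnapshotContentName != nil, nil
	})
	return vs, err
}
```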
Hello @Lyndon-Li, thank you for taking a look at my issue. I didn't mention it in my previous post, but the OVH CSI driver is Cinder, if that helps somehow.
Hello @Lyndon-Li, I just tested running the backup without the data movement and it failed. The installation was the same, and the command was also the same without the

bundle-2023-10-02-12-33-45.tar.gz

As I said previously, the CSI driver is Cinder on OVHcloud, but I wasn't able to find any logs.
@Arcahub There was a modification in how the VolumeSnapshot resources created during backup are handled. The change introduced in v1.12.0 is that the VolumeSnapshot cleanup logic moved into the CSI plugin. The benefit is that the time-consuming handling of multiple VolumeSnapshots is now done concurrently. It's possible that the v1.12.0 Velero and the v0.5.1 CSI plugin together have no VolumeSnapshot resource cleanup at all.
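To make the concurrency point concrete, here is a minimal sketch of what "cleaning up many VolumeSnapshots concurrently instead of sequentially" can look like. It assumes the upstream external-snapshotter clientset and is not the CSI plugin's actual implementation:

```go
package snapshotcleanup

import (
	"context"
	"sync"

	snapshotv1 "github.com/kubernetes-csi/external-snapshotter/client/v6/apis/volumesnapshot/v1"
	snapshotclient "github.com/kubernetes-csi/external-snapshotter/client/v6/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// cleanupSnapshots deletes each VolumeSnapshot in its own goroutine, so a
// backup with many snapshots is no longer serialized on cleanup.
func cleanupSnapshots(ctx context.Context, c snapshotclient.Interface,
	snapshots []snapshotv1.VolumeSnapshot) []error {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		errs []error
	)
	for i := range snapshots {
		vs := snapshots[i] // capture the loop variable for the goroutine
		wg.Add(1)
		go func() {
			defer wg.Done()
			err := c.SnapshotV1().VolumeSnapshots(vs.Namespace).
				Delete(ctx, vs.Name, metav1.DeleteOptions{})
			if err != nil {
				mu.Lock()
				errs = append(errs, err)
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return errs
}
```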
I already tested with

I am currently using the file-system backup, since data movement is an essential feature in my case, and that is why I am experimenting with CSI data movement: I would rather prefer this strategy. I also tested with the official 1.12.0 release of Velero and the
@Arcahub But I found some things in the successful backup.
The client version is right, but the server's version is still v1.12.0. The images used are:
Second, although the backup finished as Completed, no PVs' data was backed up.
Could you please use the v1.11.x version of the Velero CLI to reinstall the Velero environment? Please uninstall the Velero environment with

To debug further, could you also check the CSI snapshotter pods' logs to find whether there is some information about why the VolumeSnapshots were deleted?
I am sorry for my mistake; I was using aliases to switch between versions, but they were not expanded in my bash scripts. Here is the bundle of the test with

The

I am 100% sure that those snapshots are created and managed by Velero, since there is no other snapshot mechanism currently enabled on this cluster, and when I delete the backup the snapshots are also deleted. Sadly, as I said before, I am not able to provide the Cinder CSI pods' logs, since I just can't access them.

Pods list
OVH might not be managing the CSI driver through pods, or it may just be hiding them from users, but I am not able to provide any logs since I don't have access to them. I totally agree that they would help to debug this issue, and at least I can try to contact the support to ask for the logs. Just in case, I reran with the official latest release 1.12.0, since I had made the same mistake by not changing the version. It ended with the same PartiallyFailed result as before.
Thanks for the feedback. Could you check the other failed PVCs' StorageClass settings?
@blackpiglet The

Here is the list of PVCs in the cluster:

PVC list
My interpretation is that the error we are facing is somehow a latency error, or at least a time-related one: high-speed PVCs are more likely to complete or be reachable at the moment Velero makes the API call. Still, we can see that not all high-speed PVCs are successful. I checked the other bundles I uploaded earlier in this issue and was able to find other PVCs that succeeded, but they were not always using
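If the timeout hypothesis is right, one generic mitigation is a longer, backoff-based wait before declaring failure, so slower storage classes get more time to report a handle. A sketch only, using the apimachinery wait helpers; the interval and step values are made up:

```go
package snapshotwait

import (
	"context"
	"time"

	snapshotclient "github.com/kubernetes-csi/external-snapshotter/client/v6/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitWithBackoff retries with exponentially growing intervals (about four
// minutes total here), which tolerates slow volumes better than a short
// fixed-interval poll would.
func waitWithBackoff(ctx context.Context, c snapshotclient.Interface, ns, name string) error {
	backoff := wait.Backoff{
		Duration: 1 * time.Second, // first retry interval
		Factor:   2.0,             // double the interval each step
		Steps:    8,               // 1+2+4+...+128s ~= 4m15s worst case
	}
	return wait.ExponentialBackoff(backoff, func() (bool, error) {
		vs, err := c.SnapshotV1().VolumeSnapshots(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return false, err // NotFound still aborts, as in the failure above
		}
		return vs.Status != nil && vs.Status.BoundVolumeSnapshotContentName != nil, nil
	})
}
```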
Thanks.
Yeah, I do agree on that. I have created a ticket with OVH support to ask for access to the CSI driver logs and for some help on this issue from their side. I am waiting for an answer from them and will keep you updated. I also have an OpenStack installation on premises, so I will try to install a Kubernetes cluster with my own Cinder CSI driver to test whether this issue is specific to OVH or affects the Cinder CSI driver overall.
@Arcahub I think you may not need to contact the CSI driver vendor, because the snapshot controller is a Kubernetes upstream module and its pods should be in the kube-system namespace.
I'm running on OVH too, with the same behavior as far as I understand it so far.
On each K8s node runs a container like this

The log below starts together with the Velero backup.
Let me know if you need any further logs from me to assist.
@MrOffline77 Actually, we need the external-snapshotter log, as mentioned in #7068. There are multiple containers, including sidecar containers; we need the logs from all of them.
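Since OVH hides the CSI pods, here is a minimal client-go sketch of the kind of collection being asked for: dumping the logs of every container (sidecars included) of any snapshot-related pod the API server will show. The name filters are guesses, not OVH specifics:

```go
package main

import (
	"context"
	"fmt"
	"io"
	"os"
	"strings"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// The upstream snapshot controller usually lives in kube-system.
	pods, err := cs.CoreV1().Pods("kube-system").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		// Guessed name filter: adjust to the actual deployment names.
		if !strings.Contains(p.Name, "snapshot") && !strings.Contains(p.Name, "csi") {
			continue
		}
		// Dump logs from every container, including sidecars.
		for _, c := range p.Spec.Containers {
			fmt.Printf("--- %s / %s ---\n", p.Name, c.Name)
			req := cs.CoreV1().Pods(p.Namespace).GetLogs(p.Name, &corev1.PodLogOptions{Container: c.Name})
			stream, err := req.Stream(context.TODO())
			if err != nil {
				fmt.Println("  cannot read logs:", err)
				continue
			}
			io.Copy(os.Stdout, stream)
			stream.Close()
		}
	}
}
```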
name: Bug report
about: Using the Velero 1.12.0 Data Movement feature on an OVH managed cluster makes backups partially fail with the matching CSI plugin version v0.6.0-rc2, while they worked with v0.5.1.
What steps did you take and what happened:
I wanted to test the Data Movement feature.
I installed the Velero CLI v1.12.0-rc.2
The backup ended in a PartiallyFailed state with an error for the majority of PVCs: "Fail to wait VolumeSnapshot snapshot handle created". Still, some PVCs were backed up while others weren't, so I am guessing it's related to some timeout error.

What did you expect to happen:
I expected the backup to work with the RC version of the CSI plugin, since nothing else changed on the cluster except this version.
The following information will help us better understand what's going on:
The bundle extracted from velero debug --backup:
bundle-2023-09-21-11-15-47.tar.gz
Anything else you would like to add:
I tried running a backup with the exact same install commands mentioned before, but changing the CSI plugin version to v0.5.1, and it worked without any error. Here is the debug bundle of the working backup with the CSI plugin in version v0.5.1:
bundle-2023-09-21-12-20-13.tar.gz
Of course, even though it worked, it is missing the DataUpload part needed to achieve Data Movement, so it is not what I am looking for.
Environment:
- Velero version (use velero version): v1.12.0-rc.2 7112c62
- Velero features (use velero client config get features):
- Kubernetes version (use kubectl version):
- OS (e.g. from /etc/os-release): RuntimeOS: linux, RuntimeArch: amd64