Data mover failed - error to delete volume snapshot content - not found #7068
@trunet Could you help confirm whether this is an intermittent problem, or whether you keep seeing this error with zfs-localpv?
@Lyndon-Li My local snapshots with zfs-localpv work just fine every time I take one. zfs-localpv is my only CSI driver. What do you need me to check specifically?
@trunet
@Lyndon-Li I'll wait to see if the problem happens again tomorrow and report back with the logs. Thanks!
@Lyndon-Li I've had the error on two different volumes now. Let's focus on one of them. I redacted the names here, but the attached log is not redacted.
I fetched all ZFS CSI container logs:
I grepped by PVC, PV and snapcontent and am uploading the result here. If you need the full log for some reason, let me know and I can upload it as well. Further grepping by error:
Yesterday I checked and none of my volumes had any snapshots: no zfssnapshots, no volumesnapshotcontents, and zfs list -t snapshot was also clean. Therefore any snapshot created/deleted/managed was triggered by the velero backup/data upload.
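For reference, the grep workflow described above can be sketched as follows. The log contents and object names here are fabricated placeholders, not the actual (redacted) ones from this issue:

```shell
# Fabricated sample standing in for the real ZFS CSI container logs;
# the actual PV and snapcontent names in this issue are redacted.
cat > csi-driver.log <<'EOF'
line mentioning pvc-1234abcd
line with error: snapcontent-5678efgh not found
line mentioning pvc-ffff0000
EOF

PV="pvc-1234abcd"            # hypothetical PV name
SNAP="snapcontent-5678efgh"  # hypothetical VolumeSnapshotContent name

# Filter the log down to the objects involved in the failed DataUpload,
# then narrow further to error lines only.
grep -E "${PV}|${SNAP}" csi-driver.log > grepped.log
grep -i "error" grepped.log > grepped-errors.log
cat grepped-errors.log
```

The same two-stage filter works against any CSI driver log once you know the PV and VolumeSnapshotContent names from the failed DataUpload.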
@trunet Could you attach the new velero log that matches the CSI logs in grepped.log.gz?
@trunet Please share the entire Velero log bundle; we need to check more from the node-agent logs. Meanwhile, if you have the full CSI driver logs, please share those as well.
@Lyndon-Li This is today's log; 4 backups failed today. bundle-2023-11-09-10-54-54.tar.gz
@trunet
Nice catch... OOMKilled. The logs from --previous just show the data upload happening and nothing else. I'll remove the resources I had defined in my Helm values, so it will use the chart defaults (double what I had set). I had reduced them a while ago to fit my cluster (and had forgotten about that). I'll get back to you tomorrow if it finishes successfully. I wonder if this will also fix
@trunet So please help confirm this.
@Lyndon-Li I really appreciate the time and effort you're putting into this. As of today, I got no restarts after fixing the resources:
One failed:
Logs: yesterday I installed a logging stack with fluent-bit sending to an Elasticsearch cluster, and I generated this CSV report from Kibana, filtering by the PVC name, PV name or the snapcontent ID, to try to help you.
Seems like I'm running into the same issue. I'm on Velero 5.1.3. All my node agents are running fine (no restarts).
We are using the Linstor CSI driver. The issue seems intermittent, but backups fail for the most part.
Snapshots are successfully taken:
In my case I get the impression the wrong node-agent is processing the snapshot. The relevant logs below were taken from the velero node-agent running on node h-fsn-4.
The Pod mounting the PV is also running on
@trunet @boedy Hopefully we can resolve and verify it this week; then we will be able to catch the 1.12.2 release. Thanks!
I just tried out your trial fix. It's complaining "no new finalizers can be added if the object is being deleted":
I just updated mine. Will have results tomorrow. |
@boedy @trunet The below error indicates that we are on the right track to fix the problem; the previous code was merely not robust enough to handle the contention. Hopefully, with the new code update, we can make it succeed. (The latest main image should be available about 30 minutes from now.)
Great work! @trunet I believe the latest image fixed the issue. I've just completed 6 consecutive backups without any errors! 🥳
I had one backup complete and one fail yesterday (before this last commit fix). It failed because the default Helm chart memory resources were not enough and it got OOMKilled. I increased the memory resources to a 768Mi limit and a 256Mi request. I updated the container once again; let's wait until tomorrow to confirm.
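For anyone hitting the same OOMKill, the resource override mentioned above (768Mi limit, 256Mi request) would look roughly like the following Helm values fragment. The `nodeAgent.resources` key layout is an assumption based on the vmware-tanzu/velero chart and should be verified against your chart version:

```shell
# Write a Helm values override giving the node-agent more memory headroom.
# NOTE: the nodeAgent.resources key path is an assumption; verify it
# against the values.yaml of your velero chart version.
cat > velero-values.yaml <<'EOF'
nodeAgent:
  resources:
    requests:
      memory: 256Mi
    limits:
      memory: 768Mi
EOF
# Then apply with (hypothetical release name "velero"):
#   helm upgrade velero vmware-tanzu/velero -n velero -f velero-values.yaml
cat velero-values.yaml
```

Raising only the limit (and keeping a modest request) lets the pod burst during data uploads without permanently reserving the memory on every node.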
@trunet Looks like the issue has been fixed, but I will wait for your verification. Hopefully we can get the result this week; otherwise, we will have to deliver the fix after 1.12.2.
@Lyndon-Li I haven't had any more backup errors since this change. I'm closing this ticket; feel free to release it.
What steps did you take and what happened:
I created a schedule with velero CSI snapshot and data mover enabled. I'm using the openebs/zfs-localpv CSI driver.
17 succeeded, 1 failed to data move:
What did you expect to happen:
Backup should complete successfully.
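The schedule described in the steps above can be sketched as a manifest like the following. The name and cron expression are placeholders; `snapshotMoveData` is the Velero 1.12 BackupSpec field that enables the CSI snapshot data mover (the CLI equivalent of `--snapshot-move-data`):

```shell
# Generate a Velero Schedule manifest with CSI snapshot data movement
# enabled. The schedule name and cron expression are placeholders.
cat > schedule.yaml <<'EOF'
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 3 * * *"
  template:
    snapshotMoveData: true
EOF
# Apply with: kubectl apply -f schedule.yaml
cat schedule.yaml
```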
The following information will help us better understand what's going on:
bundle-2023-11-07-10-51-47.tar.gz
Anything else you would like to add:
Environment:
OS (e.g. from /etc/os-release): Talos v1.5.4