You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
What steps did you take and what happened:
Velero has been upgraded to 1.15.0. When executing a backup, I noticed that for a few DataUploads that are in status Completed, the status for BytesDone and TotalBytes differ, so that BytesDone < TotalBytes, but status is still Completed.
What did you expect to happen:
All DataUploads should only have the status Completed, insofar as BytesDone = TotalBytes, as otherwise you could ask yourself whether this is an error when updating the CR or whether bytes were actually not successfully uploaded to S3.
The following information will help us better understand what's going on:
As requested from the discussion #8492, here is the information on the quantity and the node utilization at this time on CPU:
./check_datauploads.sh
backupname-backup-20241206063243-7rwjd is Completed but 42990145943 < 58373579741 at xxx00214.xxx.de
backupname-backup-20241206063243-dvk4k is Completed but 878837521525 < 886194227785 at xxx00114.xxx.de
backupname-backup-20241206063243-xjxp7 is Completed but 16786601663 < 21805896215 at xxx00114.xxx.de
backupname-backup-20241206063243-zlxtm is Completed but 447458401232 < 466810484914 at xxx00214.xxx.de
As you can see, the behavior occurred in 4 of a total of 733 DataUploads as part of the same backup process.
The two nodes mentioned have uploaded some more DataUploads in this backup process (i.e. not only those that show the corresponding behavior). They are RHEL 8.10 nodes and are each equipped with 20 cores (or 40 threaded) and 512GB RAM.
The CPU load over the entire backup process is as follows for the two nodes (I am only including the nodes on which the behavior occurred in the pods, otherwise there would be too many):
xxx00114.xxx.de:
xxx00214.xxx.de:
Please note the timezones, the dashboard has CET timezone (+1) by default.
In this section, I will add Pod-Log excerpts for the affected DataUploads. The only difference or commonality in the affected pods is a timeout, which I cannot see in the other data uploads at random.
Judging by the logs, it looks to me as if the DataPath is complete and nothing is missing - only the CR update is not complete. You can see that these pods remain on Completed for a little longer (30 seconds - 1 minute) and the status only changes from InProgress to Completed in the CR after "deletion". The other DataUpload pods end immediately after completion.
backupname-backup-20241206063243-7rwjd:
backupname-backup-20241206063243-dvk4k:
backupname-backup-20241206063243-xjxp7:
backupname-backup-20241206063243-zlxtm:
"Working" one - backupname-backup-20241206063243-tsdh5:
Environment:
Velero version (use velero version): 1.15.0
Velero features (use velero client config get features): EnableCSI
Kubernetes version (use kubectl version): 1.27.10
Kubernetes installer & version: kubeadm 1.27.10 as customized Ansible rollout
OS (e.g. from /etc/os-release): RHEL 8.10
Vote on this issue!
This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.
👍 for "I would like to see this bug fixed as soon as possible"
👎 for "There are more important bugs to focus on right now"
The text was updated successfully, but these errors were encountered:
This is due to limitation of Kubernete Event that it doesn't wait and drain out the queue before closing the event sink.
But this problem should be rare, as the current behavior, Velero will wait for as long as 1 min before closing the event sink.
You can use the backed up data safely even for the DU with this problem, the backup has actually completed, merely, the progress is not updated correctly due to not receiving the last several event updates.
@Lyndon-Li all clear, thanks for the quick confirmation that the log entries are correct, only the CR is no longer finally updated. Maybe a longer timeout is not a bad idea in case someone else encounters the same behavior.
We've also seen this issue as well. The kopia backup integrity is in good state on restore. Just the reporting of BytesDone vs BytesTotal is incorrect.
What steps did you take and what happened:
Velero has been upgraded to 1.15.0. When executing a backup, I noticed that for a few DataUploads that are in status Completed, the status for BytesDone and TotalBytes differ, so that BytesDone < TotalBytes, but status is still Completed.
What did you expect to happen:
All DataUploads should only have the status Completed, insofar as BytesDone = TotalBytes, as otherwise you could ask yourself whether this is an error when updating the CR or whether bytes were actually not successfully uploaded to S3.
The following information will help us better understand what's going on:
As requested from the discussion #8492, here is the information on the quantity and the node utilization at this time on CPU:
As you can see, the behavior occurred in 4 of a total of 733 DataUploads as part of the same backup process.
The two nodes mentioned have uploaded some more DataUploads in this backup process (i.e. not only those that show the corresponding behavior). They are RHEL 8.10 nodes and are each equipped with 20 cores (or 40 threaded) and 512GB RAM.
The CPU load over the entire backup process is as follows for the two nodes (I am only including the nodes on which the behavior occurred in the pods, otherwise there would be too many):
xxx00114.xxx.de:
xxx00214.xxx.de:
Please note the timezones, the dashboard has CET timezone (+1) by default.
Anything else you would like to add:
In this section, I will add Pod-Log excerpts for the affected DataUploads. The only difference or commonality in the affected pods is a timeout, which I cannot see in the other data uploads at random.
Judging by the logs, it looks to me as if the DataPath is complete and nothing is missing - only the CR update is not complete. You can see that these pods remain on Completed for a little longer (30 seconds - 1 minute) and the status only changes from InProgress to Completed in the CR after "deletion". The other DataUpload pods end immediately after completion.
backupname-backup-20241206063243-7rwjd:
backupname-backup-20241206063243-dvk4k:
backupname-backup-20241206063243-xjxp7:
backupname-backup-20241206063243-zlxtm:
"Working" one - backupname-backup-20241206063243-tsdh5:
Environment:
velero version
): 1.15.0velero client config get features
): EnableCSIkubectl version
): 1.27.10/etc/os-release
): RHEL 8.10Vote on this issue!
This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.
The text was updated successfully, but these errors were encountered: