
DataUpload CR sometimes not updated / finalized or Bytes missing? #8497

Open
realjump opened this issue Dec 9, 2024 · 4 comments

realjump commented Dec 9, 2024

What steps did you take and what happened:
Velero has been upgraded to 1.15.0. When executing a backup, I noticed that a few DataUploads in status Completed have differing BytesDone and TotalBytes values, i.e. BytesDone < TotalBytes, yet the status is still Completed.

What did you expect to happen:
A DataUpload should only reach status Completed once BytesDone = TotalBytes; otherwise one has to wonder whether this is an error in updating the CR or whether some bytes were actually not uploaded to S3.

The following information will help us better understand what's going on:
As requested in discussion #8492, here is information on the number of affected DataUploads and the CPU utilization of the nodes at that time:

./check_datauploads.sh
backupname-backup-20241206063243-7rwjd is Completed but 42990145943 < 58373579741 at xxx00214.xxx.de
backupname-backup-20241206063243-dvk4k is Completed but 878837521525 < 886194227785 at xxx00114.xxx.de
backupname-backup-20241206063243-xjxp7 is Completed but 16786601663 < 21805896215 at xxx00114.xxx.de
backupname-backup-20241206063243-zlxtm is Completed but 447458401232 < 466810484914 at xxx00214.xxx.de

kubectl get datauploads.velero.io -n velero | grep backupname-backup-20241206063243 | wc -l
733

As you can see, the behavior occurred in 4 of a total of 733 DataUploads as part of the same backup process.
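
For reference, check_datauploads.sh is not reproduced here; below is a minimal sketch of an equivalent check, assuming kubectl and jq are available and that the status fields are named as in the Velero DataUpload API (status.phase, status.progress.bytesDone, status.progress.totalBytes, status.node):

#!/usr/bin/env bash
# Sketch: list Completed DataUploads whose reported bytesDone is below totalBytes.
kubectl get datauploads.velero.io -n velero -o json |
  jq -r '.items[]
    | select(.status.phase == "Completed")
    | select((.status.progress.bytesDone // 0) < (.status.progress.totalBytes // 0))
    | "\(.metadata.name) is Completed but \(.status.progress.bytesDone) < \(.status.progress.totalBytes) at \(.status.node // "unknown")"'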

The two nodes mentioned processed additional DataUploads in this backup run (i.e., not only the ones showing this behavior). They are RHEL 8.10 nodes, each equipped with 20 cores (40 threads) and 512GB RAM.

The CPU load over the entire backup run for the two nodes is shown below (I am only including the nodes on which the affected pods ran; otherwise there would be too many graphs):

xxx00114.xxx.de:
[image: CPU load graph for xxx00114.xxx.de]

xxx00214.xxx.de:
[image: CPU load graph for xxx00214.xxx.de]

Please note the timezones: the dashboard uses CET (+1) by default.

backupname-backup-20241206063243-7rwjd: Start 2024-12-06T11:57:24Z, Completion: 2024-12-06T12:06:22Z
backupname-backup-20241206063243-dvk4k: Start 2024-12-06T08:55:37Z, Completion: 2024-12-06T11:47:19Z
backupname-backup-20241206063243-xjxp7: Start 2024-12-06T07:59:21Z, Completion: 2024-12-06T08:08:30Z
backupname-backup-20241206063243-zlxtm: Start 2024-12-06T08:56:30Z, Completion: 2024-12-06T09:24:45Z
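
These timestamps come straight from the CRs; assuming the status fields startTimestamp and completionTimestamp as defined in the Velero DataUpload API, they can be pulled per DataUpload like this:

kubectl get datauploads.velero.io -n velero backupname-backup-20241206063243-7rwjd \
  -o jsonpath='{.status.startTimestamp} {.status.completionTimestamp}{"\n"}'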

Anything else you would like to add:

Below are pod-log excerpts for the affected DataUploads. The only commonality I can find among the affected pods is a timeout, which I do not see in a random sample of the other DataUploads.

Judging by the logs, it looks to me as if the data path is complete and nothing is missing; only the CR update is incomplete. These pods linger a little longer after completion (30 seconds to 1 minute), and the CR status only changes from InProgress to Completed after "deletion". The other DataUpload pods terminate immediately after completion.
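
For anyone wanting to observe this transition live, one possible approach (a sketch; column paths per the DataUpload API fields mentioned above) is to watch phase and progress side by side:

kubectl get datauploads.velero.io -n velero -w \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,DONE:.status.progress.bytesDone,TOTAL:.status.progress.totalBytes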

backupname-backup-20241206063243-7rwjd:
[image: pod log excerpt]

backupname-backup-20241206063243-dvk4k:
[image: pod log excerpt]

backupname-backup-20241206063243-xjxp7:
[image: pod log excerpt]

[image: pod log excerpt]

backupname-backup-20241206063243-zlxtm:
[image: pod log excerpt]

"Working" one - backupname-backup-20241206063243-tsdh5:
[image: pod log excerpt]

[image: pod log excerpt]

Environment:

  • Velero version (use velero version): 1.15.0
  • Velero features (use velero client config get features): EnableCSI
  • Kubernetes version (use kubectl version): 1.27.10
  • Kubernetes installer & version: kubeadm 1.27.10, via a customized Ansible rollout
  • OS (e.g. from /etc/os-release): RHEL 8.10

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" at the top right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
Lyndon-Li (Contributor) commented:

This is due to a limitation of Kubernetes Events: the queue is not drained before the event sink is closed.
This problem should be rare, though, since with the current behavior Velero waits up to 1 minute before closing the event sink.

Let me try to enhance this in 1.16.

Lyndon-Li added this to the v1.16 milestone on Dec 11, 2024
Lyndon-Li (Contributor) commented:

You can safely use the backed-up data even for a DU with this problem; the backup has actually completed. Only the progress is not updated correctly, because the last several event updates were not received.

realjump (Author) commented:

@Lyndon-Li all clear, thanks for the quick confirmation that the log entries are correct and that only the final CR update is missing. Maybe a longer timeout would not be a bad idea, in case someone else encounters the same behavior.

msfrucht (Contributor) commented:

We've seen this issue as well. The Kopia backup integrity is in a good state on restore; only the reported BytesDone vs. TotalBytes is incorrect.
