
Fail early DM pod checks have false positives #7898

Closed
sseago opened this issue Jun 17, 2024 · 16 comments · Fixed by #7899

@sseago
Collaborator

sseago commented Jun 17, 2024

What steps did you take and what happened:
In Velero 1.13, we added code to fail early if the datamover pods are in an unrecoverable state: #7052
In 1.14, some additional changes were made here: #7437

In our OADP testing with Velero 1.14, we found that in some cases DDs or DUs were being canceled almost immediately. Digging into the code added in the above 2 PRs, there are a few situations where Velero considers a pod unrecoverable:

  1. pod phase of Failed or Unknown
  2. pod phase of Pending with an "unschedulable" condition
  3. container status with ImagePullBackOff or ErrImageNeverPull

It was the second situation that was causing the failure for us. The pod will have an "unschedulable" condition until the PVC is bound to a PV. In many cases this all happens quickly enough that by the time Velero checks, the PVC is bound and the pod is no longer unschedulable. But if the provisioner takes a few seconds to bind the PV and PVC together, the pod may still have an "unschedulable" condition when the node agent first checks, which causes the DU or DD to be canceled immediately.

We are seeing this happen frequently with Ceph volumes.
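For reference, here is a minimal sketch (in Go, using client-go types) of the three checks described above. This is an illustration only, not Velero's actual implementation; the function name and the exact reason strings are assumptions.

```go
// Sketch of the three "unrecoverable" checks listed above; illustrative only.
package main

import corev1 "k8s.io/api/core/v1"

func isPodUnrecoverable(pod *corev1.Pod) (bool, string) {
	// 1. Pod phase of Failed or Unknown.
	if pod.Status.Phase == corev1.PodFailed || pod.Status.Phase == corev1.PodUnknown {
		return true, "pod is in abnormal state " + string(pod.Status.Phase)
	}

	// 2. Pod phase of Pending with an "Unschedulable" condition. This is the
	// check that fires while the PVC is still unbound.
	if pod.Status.Phase == corev1.PodPending {
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodScheduled &&
				cond.Status == corev1.ConditionFalse &&
				cond.Reason == corev1.PodReasonUnschedulable {
				return true, "pod is unschedulable: " + cond.Message
			}
		}
	}

	// 3. Container status stuck on an image pull error.
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Waiting != nil &&
			(cs.State.Waiting.Reason == "ImagePullBackOff" || cs.State.Waiting.Reason == "ErrImageNeverPull") {
			return true, "pod image pull failure: " + cs.State.Waiting.Message
		}
	}

	return false, ""
}
```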

We have a couple of options here.

  1. The simplest option would be to remove "unschedulable" from the list of unrecoverable conditions. As we see here, "unschedulable" does not always indicate an unrecoverable state. The only downside here is that we lose fast fail for permanently-unschedulable pods.

  2. Slightly more complicated: we could wrap the unschedulable check in a PollUntilContextTimeout call with some hard-coded timeout (2 minutes maybe?), which would still give us an almost-fast fail for unschedulable pods (see the sketch after this list). This assumes that no provisioner should take more than 2 minutes to provision the PV, which is probably safe.
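A rough sketch of what option 2 could look like, built on wait.PollUntilContextTimeout from apimachinery; the waitForSchedulable helper, the 5-second interval, and the 2-minute timeout are illustrative assumptions, not existing Velero code:

```go
// Sketch of option 2: only give up on an Unschedulable pod after a short,
// hard-coded window. Illustrative only.
package main

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForSchedulable keeps polling while the pod reports Unschedulable; only
// if the condition persists past the timeout would the DU/DD be canceled.
func waitForSchedulable(ctx context.Context, client kubernetes.Interface, ns, name string) error {
	return wait.PollUntilContextTimeout(ctx, 5*time.Second, 2*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			pod, err := client.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
			if err != nil {
				return false, nil // transient API error: keep polling
			}
			for _, cond := range pod.Status.Conditions {
				if cond.Type == corev1.PodScheduled &&
					cond.Status == corev1.ConditionFalse &&
					cond.Reason == corev1.PodReasonUnschedulable {
					return false, nil // still unschedulable, e.g. PVC not yet bound
				}
			}
			return true, nil // scheduled, or no longer reporting Unschedulable
		})
}
```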

I will be posting a draft PR for option 1) -- I'm open to moving to 2) if the consensus is that it's a better solution.

Environment:

  • Velero version (use velero version): 1.14
  • Velero features (use velero client config get features):
  • Kubernetes version (use kubectl version):
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@Lyndon-Li
Contributor

@sseago

if the provisioner takes a few seconds to bind the PV and PVC together, the Pod may still have an "unschedulable" condition

During data mover restore, I think this happens for almost every volume: the workload pod and PVC are created first, but the workload PVC is not bound to any PV until the DDCR completes, which does not finish in seconds or minutes. However, we didn't see this problem during testing.

Or do you mean this is an intermediate moment when the PV has been provisioned and it takes a few seconds to bind the PVC and PV?

@sseago
Collaborator Author

sseago commented Jun 18, 2024

It is definitely an intermediate scenario. I think some provisioners are fast enough that the current code doesn't force-cancel the DU/DD, but with Ceph as the CSI provisioner we are frequently seeing DUs or DDs get canceled because Velero treats the "unschedulable" pod state as unrecoverable even though it isn't.

@kaovilai
Member

I personally think ignoring unschedulable makes the most sense. An admin (or autoscaler) can scale up more nodes, etc., to resolve the scheduling problem. Timing out at 1 minute and marking the restore failed can be misleading: Velero did everything right, the infrastructure was just not yet available for scheduling, but could be totally fine 30 minutes later.

@Lyndon-Li
Contributor

Lyndon-Li commented Jun 19, 2024

Some background about the pod unrecoverable checking:
It was introduced by the data mover backup node selection, as described here.

This is not a must-have for node selection; it just makes the backup fail earlier. But if we want to keep this benefit, one possible way is to check the message along with the unrecoverable status:

message: '0/2 nodes are available: 1 node(s) didn''t match Pod''s node affinity/selector,
        1 node(s) had volume node affinity conflict. preemption: 0/2 nodes are available:
        2 Preemption is not helpful for scheduling..'

@sseago Could you share the message from the same place when you see the problem with Ceph? Let's see if we can differentiate based on the messages. E.g., can we just filter on "didn't match Pod's node affinity/selector" in the message?

@sseago
Collaborator Author

sseago commented Jun 19, 2024

@Lyndon-Li here is the error we saw with Ceph:

 message: 'found a dataupload openshift-adp/backup20-llp79 with expose error: Pod
  is unschedulable: 0/6 nodes are available: pod has unbound immediate PersistentVolumeClaims.
  preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling...
  mark it as cancel'

The "unbound immediate PersistentVolumeClaims" part is probably the most relevant part of the message. Unbound PVC is a temporary error.

@Lyndon-Li
Contributor

Lyndon-Li commented Jun 20, 2024

@sseago
Then I think we can preserve the benefit by doing more checking on the message, though it is a little bit tricky (if Kubernetes changes the message, we also need to change our check). Specifically (a rough sketch follows this list):

  • We add a new parameter (e.g., unschedulableMsg) to IsPodUnrecoverable
  • We check the Unschedulable state only when unschedulableMsg is not empty
  • When we detect that the pod is in Unschedulable status, we further check that the message contains unschedulableMsg
  • There is only one place that needs to pass a non-empty unschedulableMsg, which is PeekExposed, where we set unschedulableMsg to node affinity conflict
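A minimal sketch of this idea, assuming the hypothetical unschedulableMsg parameter described above; the signature and names are illustrative, not Velero's actual IsPodUnrecoverable:

```go
// Sketch of the proposed message-filtered Unschedulable check; illustrative only.
package main

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

func isPodUnrecoverable(pod *corev1.Pod, unschedulableMsg string) (bool, string) {
	// Only treat Unschedulable as unrecoverable when the caller supplied a
	// message filter, e.g. PeekExposed passing "node affinity conflict".
	if unschedulableMsg != "" && pod.Status.Phase == corev1.PodPending {
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodScheduled &&
				cond.Reason == corev1.PodReasonUnschedulable &&
				strings.Contains(cond.Message, unschedulableMsg) {
				return true, "pod is unschedulable: " + cond.Message
			}
		}
	}
	// ... Failed/Unknown phase and image pull checks would stay unchanged ...
	return false, ""
}
```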

What do you think?

cc @reasonerjt

@sseago
Collaborator Author

sseago commented Jun 21, 2024

@Lyndon-Li Looks like we have another bug report seeing the same thing, this time with node autoscaling: no nodes are available initially, and new nodes become available after a short wait, but by then the DU has already been canceled: #7910

@sseago
Collaborator Author

sseago commented Jun 21, 2024

@Lyndon-Li given that even "no nodes available" isn't necessarily permanent, I'm wondering whether the message-parsing approach might be too error-prone, missing edge cases, etc.

@Lyndon-Li
Contributor

We further discussed this issue and realized that there is no reliable way to filter out the unschedulable problem caused by users' data mover node selection.
Therefore, we suggest simply removing the "UnReachable" container message check and letting the DU/DD be cancelled after the 30-minute timeout. This should be what #7899 is doing, plus a data mover node selection design change.

@sseago Let us know your thoughts.

@sseago
Collaborator Author

sseago commented Jun 24, 2024

@Lyndon-Li so it sounds like you're saying the change in #7899 is appropriate? We still fail early for ImagePullBackOff or Failed pods, but for unschedulable pods we just let the 30-minute timeout take effect? I'm marking that PR as ready for review now -- we can approve/merge that one or continue to discuss changes needed there.

@Lyndon-Li
Contributor

Yes, #7899 has been merged. Meanwhile, I will submit a PR to change the node-selection design.

@sseago
Collaborator Author

sseago commented Jun 25, 2024

@Lyndon-Li reopening for 1.14.1 -- let me know if it's better to create a new issue instead.

@kaovilai
Member

IMO, cherry-picks don't have to reference a still-open issue; one could have cherry-picks that close an already-closed issue for a different release branch.

@Lyndon-Li
Contributor

@Lyndon-Li reopening for 1.14.1 -- let me know if it's better to create a new issue instead.

No problem, we can use the same issue to track everything (it was auto-closed when #7899 was merged).

@Lyndon-Li
Contributor

Closing this issue, as everything being tracked here has been completed.

@Lyndon-Li
Contributor

Given the problem described here, we cannot fail early, but we still need to preserve as much info as possible to help troubleshooting when the prepare timeout happens. Opened issue #8125 for this.
