
Fail early DM pod checks have false positives #7898

Closed
sseago opened this issue Jun 17, 2024 · 16 comments · Fixed by #7899

@sseago
Collaborator

sseago commented Jun 17, 2024

What steps did you take and what happened:
In Velero 1.13, we added code to fail early if the datamover pods are in an unrecoverable state: #7052
In 1.14, some additional changes were made here: #7437

In our OADP testing with Velero 1.14, we found that in some cases DDs or DUs were being canceled almost immediately. Digging into the code added in the above 2 PRs, there are a few situations where Velero considers a pod unrecoverable:

  1. pod phase of Failed or Unknown
  2. pod phase of Pending with an "unschedulable" condition
  3. container status with ImagePullBackOff or ErrImageNeverPull

It was the second situation that was causing the failure for us. The pod will have an "unschedulable" condition until the PVC is bound to a PV. In many cases this all happens quickly enough that by the time Velero checks, the PVC is bound and the pod is no longer unschedulable. But if the provisioner takes a few seconds to bind the PV and PVC together, the pod may still have an "unschedulable" condition when the node agent first checks, which causes the DU or DD to be canceled immediately.

We are seeing this happen frequently with Ceph volumes.
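For reference, here is a minimal sketch (in Go, using client-go types) of the three checks described above. This is an illustration only, not Velero's actual implementation; the function name and the exact reason strings are assumptions.

```go
// Sketch of the three "unrecoverable" checks listed above; illustrative only.
package main

import corev1 "k8s.io/api/core/v1"

func isPodUnrecoverable(pod *corev1.Pod) (bool, string) {
	// 1. Pod phase of Failed or Unknown.
	if pod.Status.Phase == corev1.PodFailed || pod.Status.Phase == corev1.PodUnknown {
		return true, "pod is in abnormal state " + string(pod.Status.Phase)
	}

	// 2. Pod phase of Pending with an "Unschedulable" condition. This is the
	// check that fires while the PVC is still unbound.
	if pod.Status.Phase == corev1.PodPending {
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodScheduled &&
				cond.Status == corev1.ConditionFalse &&
				cond.Reason == corev1.PodReasonUnschedulable {
				return true, "pod is unschedulable: " + cond.Message
			}
		}
	}

	// 3. Container status stuck on an image pull error.
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Waiting != nil &&
			(cs.State.Waiting.Reason == "ImagePullBackOff" || cs.State.Waiting.Reason == "ErrImageNeverPull") {
			return true, "pod image pull failure: " + cs.State.Waiting.Message
		}
	}

	return false, ""
}
```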

We have a couple of options here.

  1. The simplest option would be to remove "unschedulable" from the list of unrecoverable conditions. As we see here, "unschedulable" does not always indicate an unrecoverable state. The only downside here is that we lose fast fail for permanently-unschedulable pods.

  2. Slightly more complicated: we could wrap the unschedulable check in a PollUntilContextTimeout call with some hard-coded timeout (2 minutes maybe?), which would still give us an almost-fast fail for unschedulable pods (see the sketch after this list). This assumes that no provisioner should take more than 2 minutes to provision the PV, which is probably safe.
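A rough sketch of what option 2 could look like, built on wait.PollUntilContextTimeout from apimachinery; the waitForSchedulable helper, the 5-second interval, and the 2-minute timeout are illustrative assumptions, not existing Velero code:

```go
// Sketch of option 2: only give up on an Unschedulable pod after a short,
// hard-coded window. Illustrative only.
package main

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForSchedulable keeps polling while the pod reports Unschedulable; only
// if the condition persists past the timeout would the DU/DD be canceled.
func waitForSchedulable(ctx context.Context, client kubernetes.Interface, ns, name string) error {
	return wait.PollUntilContextTimeout(ctx, 5*time.Second, 2*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			pod, err := client.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
			if err != nil {
				return false, nil // transient API error: keep polling
			}
			for _, cond := range pod.Status.Conditions {
				if cond.Type == corev1.PodScheduled &&
					cond.Status == corev1.ConditionFalse &&
					cond.Reason == corev1.PodReasonUnschedulable {
					return false, nil // still unschedulable, e.g. PVC not yet bound
				}
			}
			return true, nil // scheduled, or no longer reporting Unschedulable
		})
}
```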

I will be posting a draft PR for option 1) -- I'm open to moving to 2) if the consensus is that it's a better solution.

Environment:

  • Velero version (use velero version): 1.14
  • Velero features (use velero client config get features):
  • Kubernetes version (use kubectl version):
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@Lyndon-Li
Contributor

@sseago

if the provisioner takes a few seconds to bind the PV and PVC together, the Pod may still have an "unschedulable" condition

During data mover restore, I think this happens for almost every volume: the workload pod and PVC are created first, but the workload PVC is not bound to any PV until the DDCR completes, which does not finish in seconds or minutes. However, we didn't see this problem during testing.

Or do you mean this is an intermediate moment when the PV has been provisioned and it takes a few seconds to bind the PVC and PV?

@sseago
Collaborator Author

sseago commented Jun 18, 2024

It is definitely an intermediate scenario. I think some provisioners are fast enough that the current code doesn't force-cancel the DU/DD, but with Ceph as the CSI provisioner we are frequently seeing DUs or DDs get canceled because Velero treats the "unschedulable" pod state as unrecoverable even though it isn't.

@kaovilai
Member

I personally think ignoring unschedulable makes the most sense. An admin (or autoscaler) can scale up more nodes, etc., to resolve the scheduling problem. Timing out at 1 minute and marking the restore failed can be misleading: Velero did everything right, the infrastructure was just not yet available for scheduling, but could be totally fine 30 minutes later.

@Lyndon-Li
Contributor

Lyndon-Li commented Jun 19, 2024

Some background about the pod unrecoverable checking:
It was introduced by the data mover backup node selection, as described here.

This is not a must-have for node selection; it just makes the backup fail earlier. But if we want to keep this benefit, one possible way is to check the message along with the unrecoverable status:

message: '0/2 nodes are available: 1 node(s) didn''t match Pod''s node affinity/selector,
        1 node(s) had volume node affinity conflict. preemption: 0/2 nodes are available:
        2 Preemption is not helpful for scheduling..'

@sseago Could you share the message from the same place when you see the problem with Ceph? Let's see if we can differentiate based on the messages. E.g., can we just filter on "didn't match Pod's node affinity/selector" in the message?

@sseago
Collaborator Author

sseago commented Jun 19, 2024

@Lyndon-Li here is the error we saw with Ceph:

 message: 'found a dataupload openshift-adp/backup20-llp79 with expose error: Pod
  is unschedulable: 0/6 nodes are available: pod has unbound immediate PersistentVolumeClaims.
  preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling...
  mark it as cancel'

The "unbound immediate PersistentVolumeClaims" part is probably the most relevant part of the message. Unbound PVC is a temporary error.

@Lyndon-Li
Contributor

Lyndon-Li commented Jun 20, 2024

@sseago
Then I think we can preserve the benefit by doing more checking on the message, though it is a little bit tricky (if Kubernetes changes the message, we also need to change our check). Specifically (a rough sketch follows this list):

  • We add a new parameter (e.g., unschedulableMsg) to IsPodUnrecoverable
  • We check the Unschedulable state only when unschedulableMsg is not empty
  • When we detect that the pod is in Unschedulable status, we further check that the message contains unschedulableMsg
  • There is only one place that needs to pass a non-empty unschedulableMsg, which is PeekExposed, where we set unschedulableMsg to node affinity conflict
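A minimal sketch of this idea, assuming the hypothetical unschedulableMsg parameter described above; the signature and names are illustrative, not Velero's actual IsPodUnrecoverable:

```go
// Sketch of the proposed message-filtered Unschedulable check; illustrative only.
package main

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

func isPodUnrecoverable(pod *corev1.Pod, unschedulableMsg string) (bool, string) {
	// Only treat Unschedulable as unrecoverable when the caller supplied a
	// message filter, e.g. PeekExposed passing "node affinity conflict".
	if unschedulableMsg != "" && pod.Status.Phase == corev1.PodPending {
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodScheduled &&
				cond.Reason == corev1.PodReasonUnschedulable &&
				strings.Contains(cond.Message, unschedulableMsg) {
				return true, "pod is unschedulable: " + cond.Message
			}
		}
	}
	// ... Failed/Unknown phase and image pull checks would stay unchanged ...
	return false, ""
}
```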

What do you think?

cc @reasonerjt

@sseago
Collaborator Author

sseago commented Jun 21, 2024

@Lyndon-Li Looks like we have another bug report seeing the same thing, this time with node autoscaling: no nodes are available initially, and new nodes become available after a short wait, but by then the DU has already been canceled: #7910

@sseago
Collaborator Author

sseago commented Jun 21, 2024

@Lyndon-Li given that even "no nodes available" isn't necessarily permanent, I'm wondering whether the message-parsing approach might be too error-prone, missing edge cases, etc.

@Lyndon-Li
Contributor

We further discussed this issue and realized that there is no reliable way to filter out the unschedulable problem caused by users' data mover node selection.
Therefore, we suggest simply removing the "UnReachable" container message check and letting the DU/DD be cancelled after the 30-minute timeout. This should be what #7899 is doing, plus a data mover node selection design change.

@sseago Let us know your thoughts.

@sseago
Collaborator Author

sseago commented Jun 24, 2024

@Lyndon-Li so it sounds like you're saying the change in #7899 is appropriate? We still fail early for ImagePullBackOff or Failed pods, but for unschedulable pods we just let the 30-minute timeout take effect? I'm marking that PR as ready for review now -- we can approve/merge that one or continue to discuss changes needed there.

@Lyndon-Li
Contributor

Yes, #7899 has been merged. Meanwhile, I will submit a PR to change the node-selection design.

@sseago
Collaborator Author

sseago commented Jun 25, 2024

@Lyndon-Li reopening for 1.14.1 -- let me know if it's better to create a new issue instead.

@kaovilai
Member

IMO, cherry-picks don't have to reference a still-open issue; one could have cherry-picks that close an already-closed issue for a different release branch.

@Lyndon-Li
Contributor

@Lyndon-Li reopening for 1.14.1 -- let me know if it's better to create a new issue instead.

No problem, we can use the same issue to track everything (it was auto-closed when #7899 was merged).

@Lyndon-Li
Contributor

Closing this issue, as everything being tracked here has been completed.

@Lyndon-Li
Contributor

Given the problem described here, we cannot fail early, but we still need to preserve as much info as possible to help troubleshooting when the prepare timeout happens. Opened issue #8125 for this.
