Fail early DM pod checks have false positives #7898
During data mover restore, I think this happens for almost every volume: the workload pod and PVC are created first, but the workload PVC is not bound to any PV until the DDCR completes, which won't happen within seconds or minutes. However, we didn't see this problem during the test. Or do you mean this is an intermediate moment when the PV is provisioned and it takes some seconds to bind the PVC and PV?
It is definitely an intermediate scenario. I think some provisioners are fast enough that the current code doesn't force a cancel of DU/DD, but with Ceph as a CSI provisioner we are seeing frequent situations where DUs or DDs are getting canceled because the "unschedulable" pod state is considered "unrecoverable" by Velero even though it's not.
I personally think ignoring unschedulable makes the most sense. An admin (or autoscaler) can scale up more nodes to resolve the scheduling issue. Timing out at 1 minute and marking the restore failed can be misleading: Velero did everything right, the infrastructure was just not yet available for scheduling but could be totally fine 30 minutes later.
Some background about the pod unrecoverable checking: this is not a must-have for node selection, it just makes the backup fail earlier. But if we want to keep this benefit, one possible way is to check the message more closely along with the Unrecoverable status.
@sseago Could you help share the message from the same place where you see the problem for Ceph? Let's see if we can distinguish cases from the messages. E.g., can we just filter the
@Lyndon-Li here is the error we saw with Ceph:
The "unbound immediate PersistentVolumeClaims" part is probably the most relevant part of the message. Unbound PVC is a temporary error. |
@sseago What do you think? cc @reasonerjt
@Lyndon-Li Looks like we have another bug seeing the same thing -- this time with node autoscaling -- i.e. no nodes available initially, but wait a little bit and new nodes are made available, but the DU has already been canceled: #7910
@Lyndon-Li given that even "no nodes available" isn't necessarily permanent, I'm wondering whether the message-parsing approach might be too error-prone, missing edge cases, etc.
We discussed this issue further and realized that there is no reliable way to filter out the unschedulable problem caused by users' data mover node selection. @sseago Let us know your thoughts.
@Lyndon-Li so it sounds like you're saying the change in #7899 is appropriate? We still fail early for ImagePullBackOff or Failed pods, but for unschedulable, we just let the 30-minute timeout take effect? I'm marking that PR as ready for review now -- we can approve/merge that one or continue to discuss changes needed there.
Yes, #7899 has been merged. Meanwhile, I will submit a PR to change the node-selection design.
@Lyndon-Li reopening for 1.14.1 -- let me know if it's better to create a new issue instead.
IMO, cherry-picks don't have to reference a still-open issue; one could have cherry-picks that close an already-closed issue for a different release branch.
No problem, we can use the same issue to track everything (it was auto-closed when #7899 was merged).
Closing this issue as everything being tracked has been completed.
cf. vmware-tanzu/velero#7898; contains the fix in vmware-tanzu/velero#7899. Signed-off-by: Clément Nussbaumer <[email protected]>
As the problem described here shows, we cannot fail early, but we still need to preserve as much info as possible to help troubleshooting when the prepare timeout happens. Opened issue #8125 for this.
What steps did you take and what happened:
In Velero 1.13, we added code to fail early if the datamover pods are in an unrecoverable state: #7052
In 1.14, some additional changes were made here: #7437
In our OADP testing with Velero 1.14, we found that in some cases DDs or DUs were being canceled almost immediately. Digging into the code added in the above 2 PRs, there are a few situations where Velero considers a pod unrecoverable: the pod's image cannot be pulled (ImagePullBackOff), the pod is unschedulable, or the pod has entered a Failed phase.
It was the second situation that was causing the failure for us. The pod will have an "unschedulable" condition until the PVC is bound to a PV. In many cases, this all happens quickly enough that by the time Velero checks, the PVC is bound and the pod is no longer unschedulable; but if the provisioner takes a few seconds to bind the PV and PVC together, the pod may still have an "unschedulable" condition when the node agent first checks, which will cause the DU or DD to be immediately canceled.
We are seeing this happen frequently with Ceph volumes.
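For context, a hedged sketch of a guard against that bind race, assuming a client-go clientset; `hasPendingPVC` is a hypothetical helper for illustration, not part of Velero:

```go
package podcheck

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// hasPendingPVC reports whether any PVC mounted by the pod is still Pending.
// If so, an "unschedulable" condition is plausibly just the transient bind
// race described above rather than a permanent failure.
func hasPendingPVC(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod) (bool, error) {
	for _, vol := range pod.Spec.Volumes {
		if vol.PersistentVolumeClaim == nil {
			continue
		}
		pvc, err := client.CoreV1().PersistentVolumeClaims(pod.Namespace).Get(
			ctx, vol.PersistentVolumeClaim.ClaimName, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		if pvc.Status.Phase == corev1.ClaimPending {
			return true, nil
		}
	}
	return false, nil
}
```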
We have a couple of options here.
The simplest option would be to remove "unschedulable" from the list of unrecoverable conditions. As we see here, "unschedulable" does not always indicate an unrecoverable state. The only downside here is that we lose fast fail for permanently-unschedulable pods.
Slightly more complicated -- we could wrap the unschedulable check in a PollUntilContextTimeout call with some hard-coded timeout (2 minutes maybe?) -- this will result in almost-fast-failing for unschedulable pods. This assumes that no provisioner should take more than 2 minutes to provision the PV. That's probably safe.
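A hedged sketch of what option 2 could look like, using the upstream `wait.PollUntilContextTimeout` helper from k8s.io/apimachinery; the function name, interval, and timeout values are assumptions for illustration:

```go
package podcheck

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitOutUnschedulable polls the pod and returns nil once it is no longer
// Unschedulable, or a timeout error if the condition persists past the
// grace period (hard-coded here at 2 minutes, per the proposal above).
func waitOutUnschedulable(ctx context.Context, client kubernetes.Interface, ns, name string) error {
	return wait.PollUntilContextTimeout(ctx, 5*time.Second, 2*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			pod, err := client.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
			if err != nil {
				return false, err
			}
			for _, cond := range pod.Status.Conditions {
				if cond.Type == corev1.PodScheduled && cond.Reason == corev1.PodReasonUnschedulable {
					return false, nil // still unschedulable; keep polling
				}
			}
			return true, nil // scheduled (or condition cleared); stop polling
		})
}
```

The grace period trades a small delay for keeping fast-fail behavior in genuinely unschedulable cases.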
I will be posting a draft PR for option 1) -- I'm open to moving to 2) if the consensus is that it's a better solution.
Environment:
- Velero version (use `velero version`): 1.14
- Velero features (use `velero client config get features`):
- Kubernetes version (use `kubectl version`):
- OS (e.g. from `/etc/os-release`):

Vote on this issue!
This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.