[BUG] Flyte task permanently marked as failed after transient "failed to sync secret" error
#4309
Closed
2 tasks done
Describe the bug
When launching a large number of tasks (~50000 with 500 max parallelism), we sometimes get transient errors like
Error: failed to sync secret cache: timed out waiting
during pod initialisation. As soon as flytepropeller sees this in the pod events, it marks the task as failed. However, if we look at the pod directly, we see that it recovers and runs successfully. We can even see the output that it wrote to blob storage.
The error visible in the flyteconsole is:
Looking at the flyteworkflow CRD, the relevant part is:
Relevant Pod events:
Expected behavior
If the k8s pod succeeds then flyte should detect it as succeeded. I don't think it should matter if there are some transient errors during pod initialisation.
Additional context to reproduce
Not sure I can provide a neat way to reproduce, but I believe I have been able to debug very precisely where the error is.
A few features of our setup that might be relevant to reproducing:
The problem:
We can see from the snippet above from the workflow CRD that this error is considered a CreateContainerConfigError. The relevant code is DemystifyPending. If we look at this logic, we can see that a few error types, e.g. CreateContainerError, allow a grace period on errors before flyte considers them to mean the task has failed. However, CreateContainerConfigError does not allow such a grace period.
Building a custom version of flytepropeller that adds a grace period on CreateContainerConfigErrors has vastly reduced the frequency of the problem. I'm happy to make a PR if we think this is a sensible solution.
Screenshots
No response
Are you sure this issue hasn't been raised already?
Have you read the Code of Conduct?