
[BUG] Flyte task permanently marked as failed after transient "failed to sync secret" error #4309

Closed
2 tasks done
Tom-Newton opened this issue Oct 26, 2023 · 0 comments · Fixed by #4310
Labels: bug (Something isn't working), flytepropeller

Comments

Tom-Newton (Contributor) commented Oct 26, 2023

Describe the bug

When launching a large number of tasks (~50000 with 500 max parallelism) we sometimes get transient errors like Error: failed to sync secret cache: timed out waiting for the condition during pod initialisation. As soon as flytepropeller sees this in the pod events it marks the task as failed. However, if we look at the pod directly we see that it recovers and runs successfully. We can even see the output that it wrote to blob storage.

The error visible in the flyteconsole is:

Error: failed to sync secret cache: timed out waiting for the condition

Looking at the flyteworkflow CRD, the relevant part is:

    n1:
      TaskNodeStatus:
        pState: <some long string of characters that looked a bit like a secret, maybe>
        phase: 8
        psv: 1
        updAt: "2023-10-26T09:51:56.050539301Z"
      dynamicNodeStatus: {}
      error:
        code: ContainersNotReady|CreateContainerConfigError
        kind: USER
        message: 'containers with unready status: [primary]|failed to sync secret
          cache: timed out waiting for the condition'
      laStartedAt: "2023-10-26T09:48:33Z"
      lastUpdatedAt: "2023-10-26T09:51:56Z"
      message: 'containers with unready status: [primary]|failed to sync secret cache:
        timed out waiting for the condition'
      phase: 6
      queuedAt: "2023-10-26T09:48:33Z"
      startedAt: "2023-10-26T09:48:33Z"
      stoppedAt: "2023-10-26T09:51:56Z"

Relevant Pod events:

Events:
  Type     Reason                  Age                    From                Message
  ----     ------                  ----                   ----                -------
  Normal   Pulled                  6m13s                  kubelet             Successfully pulled image <image tag> in 82.521605ms (153.070494ms including waiting)
  Warning  Failed                  6m12s (x4 over 6m44s)  kubelet             Error: failed to sync secret cache: timed out waiting for the condition
  Normal   Pulling                 6m (x5 over 8m9s)      kubelet             Pulling image <image tag>
  Normal   Pulled                  6m                     kubelet             Successfully pulled image <image tag> in 74.073486ms (152.72983ms including waiting)
  Normal   Created                 6m                     kubelet             Created container primary
  Normal   Started                 6m                     kubelet             Started container primary

Expected behavior

If the k8s pod succeeds then flyte should detect it as succeeded. I don't think it should matter if there are some transient errors during pod initialisation.

Additional context to reproduce

I'm not sure I can provide a neat way to reproduce this, but I believe I have been able to pinpoint precisely where the error comes from.

A few features of our setup that might be relevant to reproducing:

  1. Running in Azure kubernetes service
  2. Tasks have a pod template that adds environment variables that reference the k8s secrets (a minimal sketch of what this looks like follows this list).
  3. Launching lots of pods at the same time.
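
To illustrate point 2, here is a minimal sketch of a secret-backed environment variable, written with the standard k8s.io/api/core/v1 types. The secret and variable names ("my-secret", "API_TOKEN") are made up for illustration; our real pod template injects its own set of variables this way. A secret reference like this is what the kubelet has to resolve before the container can start, which is where the "failed to sync secret cache" event comes from.

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
    )

    // examplePodSpec sketches a container whose environment variable is filled
    // from a Kubernetes secret via secretKeyRef. The names here are hypothetical;
    // in our setup the equivalent entries come from the task's pod template.
    func examplePodSpec() corev1.PodSpec {
        return corev1.PodSpec{
            Containers: []corev1.Container{{
                Name:  "primary",
                Image: "<image tag>",
                Env: []corev1.EnvVar{{
                    Name: "API_TOKEN",
                    ValueFrom: &corev1.EnvVarSource{
                        SecretKeyRef: &corev1.SecretKeySelector{
                            LocalObjectReference: corev1.LocalObjectReference{Name: "my-secret"},
                            Key:                  "token",
                        },
                    },
                }},
            }},
        }
    }

    func main() {
        spec := examplePodSpec()
        fmt.Println(spec.Containers[0].Env[0].Name) // API_TOKEN
    }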

The problem:

We can see from the workflow CRD snippet above that this error is classified as a CreateContainerConfigError. The relevant code is DemystifyPending. Looking at this logic, a few error reasons, e.g. CreateContainerError, are allowed a grace period before flyte considers the task to have failed. However, CreateContainerConfigError is not given such a grace period.

Building a custom version of flytepropeller that adds a grace period on CreateContainerConfigError has vastly reduced the frequency of the problem. I'm happy to make a PR if we think this is a sensible solution.
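
For clarity, here is a minimal sketch (not the actual flyteplugins code) of the grace-period pattern being proposed: a CreateContainerConfigError is tolerated until a configured amount of time has passed since the pod was created, mirroring how reasons like CreateContainerError are already handled in DemystifyPending. The constant name and the three-minute duration below are hypothetical.

    package main

    import (
        "fmt"
        "time"
    )

    // Hypothetical setting; the real flytepropeller config key and default
    // would differ from this sketch.
    const createContainerConfigErrorGracePeriod = 3 * time.Minute

    // shouldFailOnConfigError decides whether a CreateContainerConfigError seen
    // on a pending pod should fail the task, or be tolerated because the pod may
    // still recover (e.g. once the kubelet manages to sync the secret cache).
    func shouldFailOnConfigError(podCreatedAt, now time.Time) (fail bool, reason string) {
        if now.Sub(podCreatedAt) < createContainerConfigErrorGracePeriod {
            // Within the grace period: treat the error as transient and keep the
            // task pending, mirroring how CreateContainerError is already handled.
            return false, "CreateContainerConfigError within grace period; waiting"
        }
        // Grace period exhausted: give up and surface the error to the user.
        return true, "CreateContainerConfigError persisted beyond grace period"
    }

    func main() {
        createdAt := time.Now().Add(-1 * time.Minute)
        fail, reason := shouldFailOnConfigError(createdAt, time.Now())
        fmt.Println(fail, reason) // false: still inside the grace period
    }

The point of the design is simply to distinguish "the error appeared" from "the error has persisted long enough that recovery is unlikely", which is what makes transient secret-sync timeouts survivable.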

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
Tom-Newton added the bug (Something isn't working) and untriaged (This issues has not yet been looked at by the Maintainers) labels on Oct 26, 2023
hamersaw added the flytepropeller label and removed the untriaged label on Dec 1, 2023