Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Mark taskResult completed if wait container terminated not gracefully. Fixes #13373 #13491

Merged
merged 1 commit into from
Aug 29, 2024

Conversation

jswxstw
Copy link
Member

@jswxstw jswxstw commented Aug 21, 2024

Fixes #13373

Motivation

#12103 (comment)

Modifications

Mark its taskResult as completed if pod has been terminated not gracefully.

Verification

This issue only occurs when wait container is terminated not gracefully, it is hard to reproduce it.

  1. submit a workflow with a long running pod.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: sleep-example-
spec:
  entrypoint: main
  templates:
  - name: main
    steps:
    - - name: sleep
        template: sleep

  - name: sleep
    container:
      image: alpine:latest
      command: [sh, -c, sleep 3600]
  1. kill -9 wait/main container
ctr task kill --signal 9 8c66238d4723b9237c4e71d3fd16710f5f985d2ce8f7218ff987e143b3e8140b # wait container
ctr task kill --signal 9 5da93afce108bc59f746e8c8df791ae6c808deaedaa87abc8f6ea5f4697da2df # main container
  1. Pod Error and wait container does not write LabelKeyReportOutputsCompleted to true
# sleep-example-tdnk7-sleep-1943014577   0/2     Error     0            9m21s
# k logs -f sleep-example-tdnk7-sleep-1943014577 -c wait
time="2024-08-21T03:55:14.519Z" level=info msg="Starting Workflow Executor" version=untagged
time="2024-08-21T03:55:14.522Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-08-21T03:55:14.522Z" level=info msg="Executor initialized" deadline="2024-08-21 04:00:13 +0000 UTC" includeScriptOutput=false namespace=argo podName=sleep-example-tdnk7-sleep-1943014577 templateName=sleep version="&Version{Version:untagged,BuildDate:2024-08-15T12:58:34Z,GitCommit:c143e3e78c1b6d585d64a2d9363bf0c9c64caf6c,GitTag:untagged,GitTreeState:clean,GoVersion:go1.22.6,Compiler:gc,Platform:linux/amd64,}"
time="2024-08-21T03:55:14.529Z" level=debug msg="Create workflowtaskresults 201"
time="2024-08-21T03:55:14.530Z" level=info msg="Starting deadline monitor"
time="2024-08-21T03:55:21.534Z" level=debug msg="Create workflowtaskresults 409"
time="2024-08-21T03:55:21.538Z" level=debug msg="Patch workflowtaskresults 200"
time="2024-08-21T03:55:28.535Z" level=debug msg="Create workflowtaskresults 409"
time="2024-08-21T03:55:28.538Z" level=debug msg="Patch workflowtaskresults 200"
time="2024-08-21T03:55:35.535Z" level=debug msg="Create workflowtaskresults 409"
time="2024-08-21T03:55:35.538Z" level=debug msg="Patch workflowtaskresults 200"
time="2024-08-21T03:55:42.535Z" level=debug msg="Create workflowtaskresults 409"
time="2024-08-21T03:55:42.538Z" level=debug msg="Patch workflowtaskresults 200"
time="2024-08-21T03:55:49.536Z" level=debug msg="Create workflowtaskresults 409"
time="2024-08-21T03:55:49.540Z" level=debug msg="Patch workflowtaskresults 200"
time="2024-08-21T03:55:56.534Z" level=debug msg="Create workflowtaskresults 409"
time="2024-08-21T03:55:56.538Z" level=debug msg="Patch workflowtaskresults 200"
time="2024-08-21T03:56:03.534Z" level=debug msg="Create workflowtaskresults 409"
time="2024-08-21T03:56:03.537Z" level=debug msg="Patch workflowtaskresults 200"
time="2024-08-21T03:56:10.535Z" level=debug msg="Create workflowtaskresults 409"
time="2024-08-21T03:56:10.538Z" level=debug msg="Patch workflowtaskresults 200"
time="2024-08-21T03:56:17.535Z" level=debug msg="Create workflowtaskresults 409"
time="2024-08-21T03:56:17.538Z" level=debug msg="Patch workflowtaskresults 200"
  1. Workflow Failed as expected.
# argo get sleep-example-tdnk7
Name:                sleep-example-tdnk7
Namespace:           argo
ServiceAccount:      unset (will run with the default ServiceAccount)
Status:              Failed
Message:             child 'sleep-example-tdnk7-1943014577' failed
Conditions:
 PodRunning          False
 Completed           True
Created:             Wed Aug 21 11:55:13 +0800 (14 minutes ago)
Started:             Wed Aug 21 11:55:13 +0800 (14 minutes ago)
Finished:            Wed Aug 21 11:56:37 +0800 (13 minutes ago)
Duration:            1 minute 24 seconds
Progress:            0/1
ResourcesDuration:   13s*(1 cpu),2m0s*(100Mi memory)

STEP                    TEMPLATE  PODNAME                               DURATION  MESSAGE
 ✖ sleep-example-tdnk7  main                                                      child 'sleep-example-tdnk7-1943014577' failed
 └───✖ sleep            sleep     sleep-example-tdnk7-sleep-1943014577  1m        Error (exit code 137)

controller log:

time="2024-08-21T11:56:37.580Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=2696088 namespace=argo workflow=sleep-example-tdnk7
time="2024-08-21T11:56:37.581Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=sleep-example-tdnk7
time="2024-08-21T11:56:37.581Z" level=debug msg="task result:\n&WorkflowTaskResult{ObjectMeta:{sleep-example-tdnk7-1943014577  argo  64e126a5-83fe-4fc8-8c4d-9e416f3e8de4 2696021 1 2024-08-21 11:55:14 +0800 CST <nil> <nil> map[workflows.argoproj.io/report-outputs-completed:false workflows.argoproj.io/workflow:sleep-example-tdnk7] map[] [{argoproj.io/v1alpha1 Workflow sleep-example-tdnk7 28e951a9-8353-4d92-9cab-2853e00b7134 <nil> <nil>}] [] [{argoexec Update argoproj.io/v1alpha1 2024-08-21 11:55:14 +0800 CST FieldsV1 {\"f:metadata\":{\"f:labels\":{\".\":{},\"f:workflows.argoproj.io/report-outputs-completed\":{},\"f:workflows.argoproj.io/workflow\":{}},\"f:ownerReferences\":{\".\":{},\"k:{\\\"uid\\\":\\\"28e951a9-8353-4d92-9cab-2853e00b7134\\\"}\":{}}}} }]},NodeResult:NodeResult{Phase:,Message:,Outputs:nil,Progress:,},}" namespace=argo workflow=sleep-example-tdnk7
time="2024-08-21T11:56:37.581Z" level=debug msg="task result name:\nsleep-example-tdnk7-1943014577" namespace=argo workflow=sleep-example-tdnk7
time="2024-08-21T11:56:37.581Z" level=debug msg="Marking task result incomplete sleep-example-tdnk7-1943014577" namespace=argo workflow=sleep-example-tdnk7
time="2024-08-21T11:56:37.581Z" level=info msg="task-result changed" namespace=argo nodeID=sleep-example-tdnk7-1943014577 workflow=sleep-example-tdnk7
time="2024-08-21T11:56:37.581Z" level=debug msg="Skipping artifact GC" namespace=argo workflow=sleep-example-tdnk7
time="2024-08-21T11:56:37.581Z" level=info msg="Pod failed: Error (exit code 137)" displayName=sleep namespace=argo pod=sleep-example-tdnk7-sleep-1943014577 templateName=sleep workflow=sleep-example-tdnk7
time="2024-08-21T11:56:37.581Z" level=info msg="node changed" namespace=argo new.message="Error (exit code 137)" new.phase=Failed new.progress=0/1 nodeID=sleep-example-tdnk7-1943014577 old.message= old.phase=Running old.progress=0/1 workflow=sleep-example-tdnk7
time="2024-08-21T11:56:37.581Z" level=debug msg="TaskResult of the node not yet completed" namespace=argo nodeID=sleep-example-tdnk7-1943014577 workflow=sleep-example-tdnk7
time="2024-08-21T11:56:37.581Z" level=debug msg="Node failed or error, mark its taskResult as completed" namespace=argo nodeID=sleep-example-tdnk7-1943014577 workflow=sleep-example-tdnk7
time="2024-08-21T11:56:37.581Z" level=debug msg="Evaluating node sleep-example-tdnk7: template: *v1alpha1.WorkflowStep (main), boundaryID: " namespace=argo workflow=sleep-example-tdnk7
time="2024-08-21T11:56:37.581Z" level=debug msg="Resolving the template" base="*v1alpha1.Workflow (namespace=argo,name=sleep-example-tdnk7)" tmpl="*v1alpha1.WorkflowStep (main)"
time="2024-08-21T11:56:37.581Z" level=debug msg="Getting the template" base="*v1alpha1.Workflow (namespace=argo,name=sleep-example-tdnk7)" tmpl="*v1alpha1.WorkflowStep (main)"
time="2024-08-21T11:56:37.581Z" level=debug msg="Getting the template by name: main" base="*v1alpha1.Workflow (namespace=argo,name=sleep-example-tdnk7)" tmpl="*v1alpha1.WorkflowStep (main)"
time="2024-08-21T11:56:37.582Z" level=debug msg="Executing node sleep-example-tdnk7 of Steps is Running" namespace=argo workflow=sleep-example-tdnk7
time="2024-08-21T11:56:37.582Z" level=debug msg="Evaluating node sleep-example-tdnk7[0].sleep: template: *v1alpha1.WorkflowStep (sleep), boundaryID: sleep-example-tdnk7" namespace=argo workflow=sleep-example-tdnk7
time="2024-08-21T11:56:37.582Z" level=debug msg="Resolving the template" base="*v1alpha1.Workflow (namespace=argo,name=sleep-example-tdnk7)" tmpl="*v1alpha1.WorkflowStep (sleep)"
time="2024-08-21T11:56:37.582Z" level=debug msg="Getting the template" base="*v1alpha1.Workflow (namespace=argo,name=sleep-example-tdnk7)" tmpl="*v1alpha1.WorkflowStep (sleep)"
time="2024-08-21T11:56:37.582Z" level=debug msg="Getting the template by name: sleep" base="*v1alpha1.Workflow (namespace=argo,name=sleep-example-tdnk7)" tmpl="*v1alpha1.WorkflowStep (sleep)"
time="2024-08-21T11:56:37.582Z" level=debug msg="Node sleep-example-tdnk7[0].sleep already completed" namespace=argo workflow=sleep-example-tdnk7
time="2024-08-21T11:56:37.582Z" level=info msg="Step group node sleep-example-tdnk7-2176115732 deemed failed: child 'sleep-example-tdnk7-1943014577' failed" namespace=argo workflow=sleep-example-tdnk7
time="2024-08-21T11:56:37.582Z" level=info msg="node sleep-example-tdnk7-2176115732 phase Running -> Failed" namespace=argo workflow=sleep-example-tdnk7
time="2024-08-21T11:56:37.582Z" level=info msg="node sleep-example-tdnk7-2176115732 message: child 'sleep-example-tdnk7-1943014577' failed" namespace=argo workflow=sleep-example-tdnk7
time="2024-08-21T11:56:37.582Z" level=info msg="node sleep-example-tdnk7-2176115732 finished: 2024-08-21 03:56:37.582246691 +0000 UTC" namespace=argo workflow=sleep-example-tdnk7
time="2024-08-21T11:56:37.582Z" level=info msg="step group sleep-example-tdnk7-2176115732 was unsuccessful: child 'sleep-example-tdnk7-1943014577' failed" namespace=argo workflow=sleep-example-tdnk7
time="2024-08-21T11:56:37.582Z" level=debug msg="Resolving the template" base="*v1alpha1.Workflow (namespace=argo,name=sleep-example-tdnk7)" tmpl="*v1alpha1.NodeStatus (sleep)"
time="2024-08-21T11:56:37.582Z" level=debug msg="Getting the template" base="*v1alpha1.Workflow (namespace=argo,name=sleep-example-tdnk7)" tmpl="*v1alpha1.NodeStatus (sleep)"
time="2024-08-21T11:56:37.582Z" level=debug msg="Getting the template by name: sleep" base="*v1alpha1.Workflow (namespace=argo,name=sleep-example-tdnk7)" tmpl="*v1alpha1.NodeStatus (sleep)"
time="2024-08-21T11:56:37.582Z" level=info msg="Outbound nodes of sleep-example-tdnk7-1943014577 is [sleep-example-tdnk7-1943014577]" namespace=argo workflow=sleep-example-tdnk7
time="2024-08-21T11:56:37.582Z" level=info msg="Outbound nodes of sleep-example-tdnk7 is [sleep-example-tdnk7-1943014577]" namespace=argo workflow=sleep-example-tdnk7
time="2024-08-21T11:56:37.582Z" level=info msg="node sleep-example-tdnk7 phase Running -> Failed" namespace=argo workflow=sleep-example-tdnk7
time="2024-08-21T11:56:37.582Z" level=info msg="node sleep-example-tdnk7 message: child 'sleep-example-tdnk7-1943014577' failed" namespace=argo workflow=sleep-example-tdnk7
time="2024-08-21T11:56:37.582Z" level=info msg="node sleep-example-tdnk7 finished: 2024-08-21 03:56:37.582346369 +0000 UTC" namespace=argo workflow=sleep-example-tdnk7
time="2024-08-21T11:56:37.582Z" level=debug msg="Checking daemoned children of sleep-example-tdnk7" namespace=argo workflow=sleep-example-tdnk7
time="2024-08-21T11:56:37.582Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=sleep-example-tdnk7
time="2024-08-21T11:56:37.582Z" level=info msg=reconcileAgentPod namespace=argo workflow=sleep-example-tdnk7
time="2024-08-21T11:56:37.582Z" level=debug msg="Task results completion status: map[sleep-example-tdnk7-1943014577:true]" namespace=argo workflow=sleep-example-tdnk7
time="2024-08-21T11:56:37.582Z" level=info msg="Updated phase Running -> Failed" namespace=argo workflow=sleep-example-tdnk7
time="2024-08-21T11:56:37.582Z" level=info msg="Updated message  -> child 'sleep-example-tdnk7-1943014577' failed" namespace=argo workflow=sleep-example-tdnk7
time="2024-08-21T11:56:37.582Z" level=info msg="Marking workflow completed" namespace=argo workflow=sleep-example-tdnk7
time="2024-08-21T11:56:37.582Z" level=debug msg="Event(v1.ObjectReference{Kind:\"Workflow\", Namespace:\"argo\", Name:\"sleep-example-tdnk7\", UID:\"28e951a9-8353-4d92-9cab-2853e00b7134\", APIVersion:\"argoproj.io/v1alpha1\", ResourceVersion:\"2696088\", FieldPath:\"\"}): type: 'Warning' reason: 'WorkflowFailed' child 'sleep-example-tdnk7-1943014577' failed"
time="2024-08-21T11:56:37.582Z" level=debug msg="Log changes patch: {\"metadata\":{\"labels\":{\"workflows.argoproj.io/completed\":\"true\",\"workflows.argoproj.io/phase\":\"Failed\"}},\"status\":{\"conditions\":[{\"status\":\"False\",\"type\":\"PodRunning\"},{\"status\":\"True\",\"type\":\"Completed\"}],\"finishedAt\":\"2024-08-21T03:56:37Z\",\"message\":\"child 'sleep-example-tdnk7-1943014577' failed\",\"nodes\":{\"sleep-example-tdnk7\":{\"finishedAt\":\"2024-08-21T03:56:37Z\",\"message\":\"child 'sleep-example-tdnk7-1943014577' failed\",\"outboundNodes\":[\"sleep-example-tdnk7-1943014577\"],\"phase\":\"Failed\"},\"sleep-example-tdnk7-1943014577\":{\"finishedAt\":\"2024-08-21T03:56:34Z\",\"message\":\"Error (exit code 137)\",\"phase\":\"Failed\",\"resourcesDuration\":{\"cpu\":13,\"memory\":120}},\"sleep-example-tdnk7-2176115732\":{\"finishedAt\":\"2024-08-21T03:56:37Z\",\"message\":\"child 'sleep-example-tdnk7-1943014577' failed\",\"phase\":\"Failed\"}},\"phase\":\"Failed\",\"taskResultsCompletionStatus\":{\"sleep-example-tdnk7-1943014577\":true}}}"
time="2024-08-21T11:56:37.582Z" level=info msg="Workflow to be dehydrated" Workflow Size=2233
time="2024-08-21T11:56:37.586Z" level=debug msg="Create events 201"
time="2024-08-21T11:56:37.588Z" level=debug msg="Update workflows 200"
time="2024-08-21T11:56:37.589Z" level=info msg="Workflow update successful" namespace=argo phase=Failed resourceVersion=2696092 workflow=sleep-example-tdnk7

@jswxstw jswxstw marked this pull request as draft August 21, 2024 06:53
@jswxstw

This comment was marked as resolved.

@jswxstw jswxstw marked this pull request as ready for review August 21, 2024 08:47
@agilgur5 agilgur5 requested review from Joibel and isubasinghe August 21, 2024 13:44
@agilgur5 agilgur5 changed the title fix: Mark its taskResult as completed if pod has been terminated not gracefully. Fixes #13373 fix: Mark taskResult as completed if pod terminated not gracefully. Fixes #13373 Aug 21, 2024
@agilgur5 agilgur5 changed the title fix: Mark taskResult as completed if pod terminated not gracefully. Fixes #13373 fix: Mark taskResult completed if pod terminated not gracefully. Fixes #13373 Aug 21, 2024
@agilgur5 agilgur5 added the area/controller Controller issues, panics label Aug 21, 2024
@yonirab
Copy link
Contributor

yonirab commented Aug 22, 2024

@jswxstw Does this PR also address the scenario of a workflow with a step that gets OOMKilled and then retries and succeeds?
Does it make sure the workflow status is correctly reflected as Succeeded in such a case?

@jswxstw
Copy link
Member Author

jswxstw commented Aug 22, 2024

@jswxstw Does this PR also address the scenario of a workflow with a step that gets OOMKilled and then retries and succeeds? Does it make sure the workflow status is correctly reflected as Succeeded in such a case?

@yonirab Can you provide your problem scenario(detail status of workflow and pod, logs of workflow-controller) in issue #13373?
Based on my testing, if the main container is OOMKilled, it doesn't cause any issues, and I couldn't create a scenario where the wait container is OOMKilled.
I guess the wait container in your OOMKilled pod should be terminated with non-zero exitcode, and there might be a line like this in the controller logs:

time="2024-07-22T08:07:05.077Z" level=debug msg="taskresults of workflow are incomplete or still have daemon nodes, so can't mark workflow completed" fromPhase=Running namespace=argo-simu-simulation-platform toPhase=Error workflow=simulation-platform-new-scenario-183614-1557733-prod-fqjrw

workflow/controller/operator.go Outdated Show resolved Hide resolved
@jswxstw jswxstw force-pushed the fix-13373 branch 3 times, most recently from 97f93d2 to 606f994 Compare August 23, 2024 02:40
@jswxstw jswxstw requested a review from Joibel August 23, 2024 02:43
@jswxstw jswxstw changed the title fix: Mark taskResult completed if pod terminated not gracefully. Fixes #13373 fix: Mark taskResult completed if wait container terminated not gracefully. Fixes #13373 Aug 23, 2024
@agilgur5 agilgur5 added this to the v3.5.x patches milestone Aug 24, 2024
@agilgur5 agilgur5 added the prioritized-review For members of the Sustainability Effort label Aug 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/controller Controller issues, panics prioritized-review For members of the Sustainability Effort
Projects
None yet
Development

Successfully merging this pull request may close these issues.

v3.5.8: workflow stuck in Running, but only pod exited with OOMKilled (exit code 137)
4 participants