Expected Behavior

If a node is cordoned (marked as unschedulable) for maintenance, any PipelineRuns with TaskRuns running on that node should run to completion.
(Out of scope = nodes going down or pods being evicted)
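For reference, "cordoned" here means the node was marked unschedulable with kubectl; <node-name> is a placeholder:

$ kubectl cordon <node-name>     # mark the node unschedulable; pods already running there are untouched
$ kubectl get nodes              # the node now reports SchedulingDisabled
$ kubectl uncordon <node-name>   # revert once maintenance is done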
Actual Behavior
This situation can result in a deadlock when the affinity assistant is enabled: subsequent TaskRun pods have a required affinity for the placeholder pod, which sits on the now-unschedulable node. These pods cannot be scheduled and do not trigger a cluster autoscaler scale-up, so they remain Pending until the TaskRuns time out. (Reported by @skaegi and @pritidesai.)

Related: #4699
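For context, the affinity assistant adds a required pod-affinity term to each TaskRun pod that points at the placeholder pod, roughly of the following shape (illustrative only; the instance label value is taken from the placeholder pod name in the output below):

affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app.kubernetes.io/component: affinity-assistant
          # assumed instance name, derived from the placeholder pod name below
          app.kubernetes.io/instance: affinity-assistant-6d8794b076
      topologyKey: kubernetes.io/hostname

Because the term is required and the placeholder pod lives on the cordoned node, the scheduler has no node it can use, and the autoscaler cannot propose a new node that would satisfy the affinity.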
With the affinity assistant disabled, this is not a problem: you can cordon a node and wait for the existing TaskRuns to finish before deleting any pods; the cluster autoscaler then triggers a scale-up, creating a new node that matches the node affinity terms of the original PV. Subsequent TaskRun pods are scheduled on the new node and the PipelineRun completes successfully.
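The disabled case above assumes the affinity assistant is turned off via the feature-flags ConfigMap, e.g. (namespace may differ depending on the install):

$ kubectl patch configmap feature-flags -n tekton-pipelines \
    -p '{"data":{"disable-affinity-assistant":"true"}}'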
Steps to Reproduce the Problem

When the second TaskRun is created, its pod is stuck in Pending:
$ kubectl get po
NAME                               READY   STATUS      RESTARTS   AGE
affinity-assistant-6d8794b076-0    1/1     Running     0          117s
good-morning-run-kcsr4-first-pod   0/1     Completed   0          117s
good-morning-run-kcsr4-last-pod    0/1     Pending     0          26s

$ kubectl get events -n default --field-selector involvedObject.name=good-morning-run-kcsr4-last-pod
LAST SEEN   TYPE      REASON              OBJECT                                MESSAGE
60s         Warning   FailedScheduling    pod/good-morning-run-kcsr4-last-pod   0/4 nodes are available: 1 node(s) didn't match pod affinity rules, 1 node(s) were unschedulable, 2 node(s) had volume node affinity conflict. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.
60s         Normal    NotTriggerScaleUp   pod/good-morning-run-kcsr4-last-pod   pod didn't trigger scale-up: 2 node(s) had volume node affinity conflict, 1 node(s) didn't match pod affinity rules
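A minimal PipelineRun sketch consistent with the pod names above (the Task contents, workspace name, image, and storage settings are assumptions, not the manifests actually used):

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: good-morning-run-
spec:
  workspaces:
  - name: source
    volumeClaimTemplate:            # assumed: a per-run PVC so the workspace is backed by a PV
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 16Mi
  pipelineSpec:
    workspaces:
    - name: source
    tasks:
    - name: first                   # produces pod <run-name>-first-pod
      workspaces:
      - name: source
        workspace: source
      taskSpec:
        workspaces:
        - name: source
        steps:
        - name: write
          image: alpine
          script: |
            echo "good morning" > $(workspaces.source.path)/message
    - name: last                    # produces pod <run-name>-last-pod
      runAfter: ["first"]
      workspaces:
      - name: source
        workspace: source
      taskSpec:
        workspaces:
        - name: source
        steps:
        - name: read
          image: alpine
          script: |
            cat $(workspaces.source.path)/message

Creating this PipelineRun, waiting for the first TaskRun to complete, and then cordoning the node hosting the affinity assistant pod should leave the last pod Pending as shown above.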
Additional Info
Kubernetes version:
Client Version: v1.25.4
Kustomize Version: v4.5.7
Server Version: v1.24.10-gke.2300
Tekton Pipeline version:
main