DaemonSet controller actively kills failed pods (to recreate them) #40330
case shouldContinueRunning && len(daemonPods) > 1:
case shouldContinueRunning:
// If a daemon pod failed, delete it
// TODO: handle the case when the daemon pods fail consistently and causes kill-recreate hot loop
How often does the controller sync?
Shouldn't be that often. @janetkuo we probably want to cap the number of retries and then drop daemon sets out of the queue so we won't end up hotlooping.
How about returning an error (at the end) whenever there's a failed daemon pod? We use a rate limiter when syncHandler returns an error, so this can prevent the hot loop.
Why not add or update a unit test for the new behavior?
There is a new extended test added
Just a comment about the hotloop, lgtm otherwise.
There is an e2e test, but I would much rather see a unit test.
f599875 to 33cf0c9
Thanks. It looks reasonable overall, but I don't have much context. I'll let @Kargakis LGTM.
@@ -547,6 +563,10 @@ func (dsc *DaemonSetsController) manage(ds *extensions.DaemonSet) error {
for err := range errCh {
errors = append(errors, err)
}
if failedPodsObserved > 0 {
I am not sure I understand this - why do you need to return the error here? Won't the daemon set be resynced because of the deleted pod event anyway?
Ah you want to use the ratelimiter - ok. Although for perma-failed daemon sets we probably want to stop retrying them after a while.
We don't support perma-failed daemon sets yet. Normally the DS controller checks whether the daemon pod can be scheduled on the node before creating it, so it's unlikely to create pods that are doomed to fail. However, there can be a race: the kubelet uses its own node object to admit pods and may then reject them (the pods become Failed).
Let's deal with this in a follow-up PR?
Moved the comment to before the if statement to make it clearer
@@ -653,6 +661,31 @@ func TestObservedGeneration(t *testing.T) {
}
}

// DaemonSet controller should kill all failed pods and recreate at most 1 failed pod.
"at most 1 pod on every node"
Fixed
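For readers following along, the per-node behavior under review (delete every failed daemon pod, keep at most one pod per node) can be sketched as below. This is a simplified illustration, not the PR's code: `pod`, `planNode`, and the single `shouldRun` flag are hypothetical, and the real controller distinguishes more cases than this.

```go
package main

import (
	"fmt"
	"sort"
)

// pod is a hypothetical, stripped-down view of a daemon pod on one node.
type pod struct {
	name    string
	failed  bool
	created int64 // creation timestamp; smaller means older
}

// planNode returns the pods to delete on a node, whether a new pod should be
// created, and how many failed pods were seen.
func planNode(shouldRun bool, pods []pod) (deletes []string, create bool, failedObserved int) {
	if !shouldRun {
		// The daemon should not run here: remove everything.
		for _, p := range pods {
			deletes = append(deletes, p.name)
		}
		return deletes, false, 0
	}
	if len(pods) == 0 {
		// Nothing on the node yet: create exactly one pod.
		return nil, true, 0
	}
	var running []pod
	for _, p := range pods {
		if p.failed {
			// Failed pods are deleted; a replacement comes from a later sync.
			deletes = append(deletes, p.name)
			failedObserved++
			continue
		}
		running = append(running, p)
	}
	// Keep the oldest non-failed pod and delete any extras.
	sort.Slice(running, func(i, j int) bool { return running[i].created < running[j].created })
	for i := 1; i < len(running); i++ {
		deletes = append(deletes, running[i].name)
	}
	return deletes, false, failedObserved
}

func main() {
	pods := []pod{
		{name: "a", failed: true, created: 3},
		{name: "b", created: 2},
		{name: "c", created: 1},
	}
	deletes, create, failed := planNode(true, pods)
	fmt.Println(deletes, create, failed)
}
```

In this sketch a node whose only pod has failed ends up with zero pods after the deletion, so the next sync takes the "create exactly one pod" branch, which is how the limit of one pod per node is preserved.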
24f2926 to e502f78
e502f78 to 81c1e0c
@Kargakis ptal
/approve
[APPROVALNOTIFIER] This PR is NOT APPROVED
The following people have approved this PR: mikedanese
Needs approval from an approver in each of these OWNERS Files:
We suggest the following people:
/lgtm
Automatic merge from submit-queue
Automatic merge from submit-queue (batch tested with PRs 40556, 40720)
Emit events on 'Failed' daemon pods
Follow up #40330
@erictune @mikedanese @Kargakis @lukaszo @kubernetes/sig-apps-bugs
Is there a reason to do this in the controller instead of just letting kubelet do it?
Ref #36482, @erictune @yujuhong @mikedanese @Kargakis @lukaszo @piosz @kubernetes/sig-apps-bugs
This also helps with DaemonSet update