Make HandleError prevent hot-loops #40497
Conversation
I will make another PR with a test. This is the minimum fix and it should be easy to cherrypick.
// package for that to be accessible here.
lastErrorTime time.Time
minPeriod     time.Duration
lock          sync.Mutex
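For orientation, here is a minimal sketch of how fields like these could drive a sleep-based throttle. Only the three field names come from the diff above; the package name, type name, and OnError method are illustrative assumptions, not the PR's exact code.

    package runtime // assumption: the util/runtime package this PR touches

    import (
        "sync"
        "time"
    )

    // rudimentaryErrorBackoff is an illustrative wrapper around the fields
    // shown in the diff; lock guards lastErrorTime.
    type rudimentaryErrorBackoff struct {
        lastErrorTime time.Time
        minPeriod     time.Duration
        lock          sync.Mutex
    }

    // OnError blocks until at least minPeriod has elapsed since the previous
    // error, so code that reports errors in a tight loop is slowed down
    // instead of spinning freely.
    func (r *rudimentaryErrorBackoff) OnError() {
        r.lock.Lock()
        defer r.lock.Unlock()
        if d := r.minPeriod - time.Since(r.lastErrorTime); d > 0 {
            time.Sleep(d)
        }
        r.lastErrorTime = time.Now()
    }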
nit: lock should be the field above the thing it locks and should be named lastErrorTimeLock or something.
ok, will fix here and in the cherrypick branch
cherrypick branch is fixed.
[APPROVALNOTIFIER] Needs approval from an approver in each of these OWNERS files. We suggest the following people:
/lgtm
I want to make the fix in the release branch here before this merges.
(but I'm building and testing the release branch first)
Add an error "handler" that just sleeps for a bit if errors happen more often than 500ms. Manually tested against kubernetes#39816.
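To make the effect concrete, here is a rough caller-side sketch. The failing step and the two-second bound are hypothetical, and the import path is assumed to be the pkg/util/runtime package this PR changes; the point is that a loop reporting every failure through HandleError is capped at roughly two reports per second instead of spinning as fast as it can.

    package main

    import (
        "errors"
        "time"

        utilruntime "k8s.io/kubernetes/pkg/util/runtime"
    )

    // failingStep stands in for whatever keeps erroring (for example the GC
    // hot-loop in #39816); it is purely illustrative.
    func failingStep() error {
        return errors.New("persistent failure")
    }

    func main() {
        deadline := time.Now().Add(2 * time.Second)
        for time.Now().Before(deadline) {
            if err := failingStep(); err != nil {
                // With this PR, HandleError sleeps when the previous error was
                // reported less than minPeriod (500ms) ago, so this loop makes
                // roughly two reports per second rather than millions.
                utilruntime.HandleError(err)
            }
        }
    }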
OK, this should be mergeable now.
Added a tag to resolve questions about alternative solutions before this merges and affects all callers.
I'll second what @deads2k said. Let's focus on the code that's hot-looping and formalize our error handling functionality in the controller framework.
So there are two different arguments going on here:
David is concerned that 2 should be fixed at the same time as 1, and that 1 makes 2 not work as well (because it allows other code that is not controllers to break controllers). I agree with this, but we'd need to police the mechanism for 2. So:
Doesn't glog have a rate limiter? Why wouldn't we also set that?
It'd be used by the majority, but it's not being used by GC. This effectively breaks the backoff handling of most controllers in a way that unnecessarily blocks execution of that controller (AddRateLimited doesn't), and the majority of callers from the controller packages don't want a delay like this.

Further, if you're building something to just stop a hotloop on an unconsidered error (again, not the case in the majority of controllers), you can sleep for a very short period (tens of milliseconds), and you'd just do it unconditionally, since the purpose is just to avoid DDoS-ing yourself. However, since it's the opposite of what the majority of controllers want, all the existing controllers with proper handling need to be updated to not use this new (or severely changed) method.

Ending up in this place means that instead of fixing rate limiting for the GC controller (clients have rate limiters), and instead of fixing the GC controller to AddRateLimited (libraries already exist), we've taken the biggest hammer we have, applied it across multiple processes, and allowed errors in one goroutine to rate limit already rate-limited controllers because of one bad apple.
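For contrast, the per-controller pattern being described looks roughly like the sketch below. This is not any specific controller's code: the package name and sync function are placeholders, and the import path assumes the workqueue package as vendored in the kubernetes tree at the time.

    package example // illustrative

    import (
        "k8s.io/kubernetes/pkg/util/workqueue"
    )

    // processNextItem shows the usual controller worker shape: a failure backs
    // off only the offending key via the queue's rate limiter, instead of
    // sleeping the whole process.
    func processNextItem(queue workqueue.RateLimitingInterface, sync func(key string) error) bool {
        key, quit := queue.Get()
        if quit {
            return false
        }
        defer queue.Done(key)

        if err := sync(key.(string)); err != nil {
            // Per-item exponential backoff; other keys keep flowing.
            queue.AddRateLimited(key)
            return true
        }
        // Success clears this key's backoff history.
        queue.Forget(key)
        return true
    }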
Yes, that's 2, and we need 3 to prevent it from being abused.
I had the expectation that
I would expect us to fix that in 1.6.
#38679 switches GC to a rate-limited work queue for 1.6, and a simple wait could be added here https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/garbagecollector/garbagecollector.go#L594 as a minimal touch in 1.5, instead of disrupting most of the other controllers in the 1.5 stream.
I did this because I never want to have an afternoon destroyed by someone logging every 20 microseconds again, and this seemed like the most general place with the fewest side effects. I don't care if we make it 50ms instead of 500. Should we really be
We spoke on Slack and decided that a 1ms delay would protect infrastructure with minimal impact to callers like controllers. Once it's updated, I'm OK with sleeping here.
@lavalamp can you cherrypick into 1.4 while you are at it?
We had a discussion on Slack, and it seems 1ms is the number that makes everyone able to live with a global limit like this. I will update this PR and send an adjustment to the 1.5 branch.
OK, number adjusted, in a second commit so it'll be easier to cherrypick.
/lgtm
@grodrigues3 @apelisse The bot seems confused about the LGTM ordering here? Or is there some other reason why this is stuck?
Agreed, there is some confusion. Thanks Lavalamp, somebody noticed this bug before but we couldn't figure out what was going wrong. I think I somewhat understand it now.
Automatic merge from submit-queue
Commit found in the "release-1.5" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error, find help to get your PR picked.
Add an error "handler" that just sleeps for a bit if errors happen more often than 500ms. Manually tested against #39816. This doesn't fix #39816, but it does keep it from crippling a cluster.