⚠️ Refactor manager to avoid race conditions and provide clean shutdown #1695
Conversation
Force-pushed from 7c59ac6 to e98ac23
Force-pushed from 48a59d3 to 21128d1
Force-pushed from 7a08e72 to 66670ea
Testing on Cluster API AWS, this did seem to work on the first try.
Just to also mention it here: I ran full CAPI/CAPD e2e tests with and without leader election and they were all successful.
Force-pushed from 77f47df to 51c8abd
Looks pretty good to me. Just a few questions and comments.
fakeCluster.informer.mu.Lock()
defer fakeCluster.informer.mu.Unlock()
return fakeCluster.informer.wasStarted && fakeCluster.informer.wasSynced
Does this need Eventually() now? Just wondering because before, it looks like we expect these values to be true the first time we check them.
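For reference, a minimal sketch (assuming Gomega's dot-imported matchers, as in the surrounding tests) of what wrapping this check in Eventually() could look like now that readiness is set from a background goroutine:

```go
// Hypothetical sketch only: poll the mutex-protected flags until both become
// true, instead of asserting them on the first read.
Eventually(func() bool {
	fakeCluster.informer.mu.Lock()
	defer fakeCluster.informer.mu.Unlock()
	return fakeCluster.informer.wasStarted && fakeCluster.informer.wasSynced
}).Should(BeTrue())
```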
pkg/manager/runnable_group.go
Outdated
default:
	return r.LeaderElection.Add(fn, ready)
Just double checking. Unknown runnable types are started after leader election? I (maybe naively) expected unknown types to be added with r.Others.Add.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #1695 (comment)
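For readers following along, a rough sketch of the dispatch being discussed (type and parameter names such as hasCache and readyCheck are assumptions, not necessarily the exact controller-runtime code): only runnables that explicitly opt out via NeedLeaderElection() land in Others, so unknown types fall through to the LeaderElection group.

```go
// Sketch: route runnables into groups by type; anything unrecognized is
// treated as requiring leader election.
func (r *runnables) Add(fn Runnable, ready readyCheck) error {
	switch runnable := fn.(type) {
	case hasCache:
		return r.Caches.Add(fn, ready)
	case *webhook.Server:
		return r.Webhooks.Add(fn, ready)
	case LeaderElectionRunnable:
		if !runnable.NeedLeaderElection() {
			return r.Others.Add(fn, ready)
		}
		return r.LeaderElection.Add(fn, ready)
	default:
		// Unknown runnable types start after leader election by default.
		return r.LeaderElection.Add(fn, ready)
	}
}
```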
pkg/manager/runnable_group.go
Outdated
go func() {
	if rn.Check(r.ctx) {
		r.group.Store(rn, true)
	}
}()
I'm not sure I'm fully grokking this goroutine. We're running a Check() in the background, and meanwhile we Start() the runnable. Check() is supposed to block until the runnable is ready, and then returns true, at which point we set the runnable as being ready in the group store? If Check() returns false, the runnable is never set as ready?
> at which point we set the runnable as being ready in the group store?

Once Check() returns true, the runnable is marked as ready; if it never becomes ready we block execution and eventually time out the waits.

> If Check() returns false, the runnable is never set as ready?

Correct, the runnable should then exit and return an error on its own, which is propagated to the errChan and exits the manager.
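To illustrate the flow described above, a hedged sketch (field names follow the diff fragment; the surrounding types are assumptions) of how the background check and the runnable's Start interact:

```go
// Sketch of the pattern under discussion, not the exact implementation.
go func() {
	// Check blocks until the runnable reports ready (e.g. a cache that has
	// finished its initial sync). Only a true result marks it ready.
	if rn.Check(r.ctx) {
		r.group.Store(rn, true)
	}
	// If Check returns false, the entry is never marked ready; the runnable
	// is expected to fail on its own and report the error below.
}()

// Meanwhile the runnable itself is started; any error is propagated to the
// manager's error channel, which shuts the manager down.
if err := rn.Start(r.ctx); err != nil {
	r.errChan <- err
}
```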
pkg/manager/runnable_group.go
Outdated
// or returned an error to the channel.
//
// We should always decrement the WaitGroup and
// mark the runnable as ready.
Expand on why the runnable always needs to be marked as ready when we return?
Done!
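The reasoning, roughly, is that the start goroutine can exit on several paths (clean return, error, context cancellation), and each of them must unblock anyone waiting on the group. A hedged sketch of the defer-based version (names are assumptions):

```go
// Sketch: whatever path the goroutine takes out, decrement the WaitGroup and
// mark the runnable ready, otherwise a caller blocked waiting on the group
// would hang forever on a runnable that already returned or errored out.
go func() {
	defer r.wg.Done()
	defer r.group.Store(rn, true)

	if err := rn.Start(r.ctx); err != nil {
		r.errChan <- err
	}
}()
```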
pkg/manager/runnable_group.go
Outdated
// WaitReady polls until the group is ready or until the context is cancelled.
func (r *runnableGroup) WaitReady(ctx context.Context) error {
	return wait.PollImmediateInfiniteWithContext(ctx,
		100*time.Millisecond,
Is this wait interval too long? It seems like most runnables (other than hasCache runnables) will be ready within a few milliseconds. Perhaps we could have a backoff here as well, where the first few polls happen pretty quickly (maybe starting at 1ms)?
With 4-5 runnable groups, we could take up to half a second in this phase of the operator startup, which is fairly fast, but it seems like it could be faster.
100ms seemed like a good balance between polling too quickly and not quickly enough. I can reduce it if you prefer, although the delay is mostly on manager startup.
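For context, a sketch of the poll-based WaitReady being discussed (later replaced by a channel-based approach further down in this thread); the readiness state is assumed to live in a sync.Map keyed by runnable:

```go
// Sketch of the polling loop: every 100ms, check whether every runnable in
// the group has been marked ready.
func (r *runnableGroup) WaitReady(ctx context.Context) error {
	return wait.PollImmediateInfiniteWithContext(ctx,
		100*time.Millisecond,
		func(_ context.Context) (bool, error) {
			ready := true
			r.group.Range(func(key, value interface{}) bool {
				if ok, _ := value.(bool); !ok {
					ready = false
					return false // stop iterating as soon as one isn't ready
				}
				return true
			})
			return ready, nil
		},
	)
}
```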
Expect(found.Runnable).To(BeAssignableToTypeOf(runnable))
Expect(found.Runnable.Start(context.Background())).To(MatchError(err))
})
})
Add tests for runnables that implement LeaderElectionRunnable and return both true and false from NeedLeaderElection()?
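Something along these lines could cover both cases (the fake type below is illustrative, not from the PR):

```go
// Sketch of a test double that can exercise both branches of
// NeedLeaderElection(): an instance constructed with needElection=true should
// end up in the LeaderElection group, one with false in Others.
type fakeLeaderElectionRunnable struct {
	needElection bool
}

func (f *fakeLeaderElectionRunnable) Start(ctx context.Context) error { return nil }
func (f *fakeLeaderElectionRunnable) NeedLeaderElection() bool        { return f.needElection }
```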
Force-pushed from 51c8abd to adeca0f
Needs linter fixup.
Force-pushed from adeca0f to 38f1b90
@alvaroaleman @joelanford ptal
/retitle
pkg/manager/runnable_group.go
Outdated
}

// WaitReady polls until the group is ready or until the context is cancelled.
func (r *runnableGroup) WaitReady(ctx context.Context) error {
It's conceptually slightly weird that we internally have this notion of a ready runnable but don't make that part of our API in any way. Is it still correct that we only need this for the cache? If so, can we document it like that?
I also have to say I am not a huge fan of the poll-based architecture we end up with here and would prefer it if we found a way to push this instead. Maybe make reconcile first loop over an initial channel and then an additional one for runnables that are added after the manager is started, and close a ready chan in between?
> It's conceptually slightly weird that we internally have this notion of a ready runnable but don't make that part of our API in any way. Is it still correct that we only need this for the cache? If so, can we document it like that?

For now, yes, although it could be expanded later to have more built-in checks (for the webhook server, for example), or to give runnables the capability to define their own.

> I also have to say I am not a huge fan of the poll-based architecture we end up with here and would prefer it if we found a way to push this instead. Maybe make reconcile first loop over an initial channel and then an additional one for runnables that are added after the manager is started, and close a ready chan in between?

Agreed, I pushed an update to use channels instead and to only wait for the initial (before-start) runnables.
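A rough sketch of what the push-based variant can look like, assuming the group keeps a queue of runnables added before Start (field names like startQueue and errChan are assumptions): each pre-start runnable signals on a channel once its check passes, and the group blocks on those signals instead of polling.

```go
// Sketch only: wait for the runnables queued before Start by having each one
// signal readiness on a channel, rather than polling the group state.
func (r *runnableGroup) Start(ctx context.Context) error {
	ready := make(chan struct{}, len(r.startQueue))

	for _, rn := range r.startQueue {
		rn := rn
		go func() {
			// Signal once the readiness check passes; if it never does, the
			// runnable is expected to error out and hit errChan instead.
			if rn.Check(ctx) {
				ready <- struct{}{}
			}
		}()
		go func() {
			if err := rn.Start(ctx); err != nil {
				r.errChan <- err
			}
		}()
	}

	// Block until every pre-start runnable has signalled, or the context is
	// cancelled; runnables added after Start are handled separately.
	for range r.startQueue {
		select {
		case <-ready:
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return nil
}
```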
Force-pushed from 33b61fa to 423ec74
@vincepri full CAPI e2e tests with and without leader election are successful with the latest commit.
Modulo a squash, this looks good to go from my end 🚢
Force-pushed from 423ec74 to 612e9b2
@alvaroaleman Thanks! The modulo has been satisfied ✔️ 😊
thanks!
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: alvaroaleman, vincepri
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Signed-off-by: Vince Prignano [email protected]
This changeset provides a series of improvements and refactors how the manager starts and stops. During testing with Cluster API (a user of controller-runtime), folks noticed that the manager, which runs a series of components, can deadlock itself when using conversion webhooks or health checks, or won't cleanly shut down and clean up all the running controllers, runnables, caches, webhooks, and HTTP servers.
In particular:
- The Manager's internal mutex didn't actually lock operations while the manager was in the process of starting up. The manager's internal Start() implementation started off a series of goroutines and then waited. Concurrent operations on the manager, like AddHealthzCheck, AddReadyzCheck, or AddMetricsExtraHandler, modified the internal maps while or after their respective servers were being configured, causing potential races or being ineffective.
- Unclear ordering in the manager caused deadlocks when the caches started up. Upon startup, conversion webhooks are required while waiting for the cache's initial List() call, which warms the internal caches. If a webhook server or a healthz/readyz probe didn't start in time, the cache's List() call failed because the webhooks would be unavailable.
- The manager would report Elected() (note: this is used regardless of whether leader election is enabled) without waiting for all the caches to warm up, which could result in failed client calls.
- The stop procedure cancelled everything at once regardless of ordering. Previously, the context cancelled all the runnables regardless of ordering, which can also cause dependency issues. With these changes, if graceful shutdown is set, we try to cancel and wait for runnable groups to be done in a strict order before proceeding to exit the program.
- The stop procedure cancelled leader election only if graceful shutdown was set. This was probably an oversight; we now cancel leader election regardless of whether a graceful timeout is set.
- The http.Server used throughout the codebase now properly sets idle and read header timeouts to match the api-server.
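On the last point, a sketch of what configuring those timeouts on an http.Server looks like (the concrete durations below are illustrative; the PR aligns them with the api-server's defaults):

```go
package httpserver

import (
	"net/http"
	"time"
)

// newServer returns an *http.Server with idle and read-header timeouts set,
// so slow or idle clients can't hold connections open indefinitely.
// The durations here are placeholders, not necessarily the PR's exact values.
func newServer(handler http.Handler) *http.Server {
	return &http.Server{
		Handler: handler,
		// Bound how long a client may take to send the request headers.
		ReadHeaderTimeout: 32 * time.Second,
		// Close keep-alive connections that stay idle too long.
		IdleTimeout: 90 * time.Second,
	}
}
```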