Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
🐛 Refactor manager to avoid race conditions and provide clean shutdown
This changeset provides a series of improvements and refactors how the manager starts and stops. During testing with Cluster API (a user of controller runtime), folks noticed that the manager which runs a series of components can deadlock itself when using conversion webhooks, or health checks, or won't cleanly shutdown and cleanup all the running controller, runnables, caches, webhooks, and http servers. In particular: - The Manager internal mutex didn't actually lock operations while the manager was in the process of starting up. The manager internal Start() implementation started off a series of goroutines internally and then waits. Concurrent operations on the manager, like AddHealthzCheck or AddReadyzCheck or AddMetricsExtraHandler modified the internals map while or after their respective servers were being configured, causing potential races or being ineffective. - Unclear ordering of the manager caused deadlock when the caches would start up. Upon startup, conversion webhooks are required when waiting for the cache initial List() call, which warms the internal caches. If a webook server or a healthz/readyz probe didn't start in time, the cache List() call fails because the webhooks would be unavailable. - Manager would say it was Elected() (note: this is used regardless if leader election is enabled or not) without waiting for all the caches to warm up, which could result in failed client calls. - Stop proceduce cancelled everything at once regardless of ordering. Previously, the context cancelled all the runnables regardless of ordering which can also cause dependencies issues. With these changes, if graceful shutdown is set, we try to cancel and wait for runnable groups to be done in a strict order before proceeding to exit the program. - Stop procedure cancelled leader election only if graceful shutdown was set. This was probably an oversight, now we're cancelling leader election regardless if graceful timeout is set or not. - The http.Server used throughout the codebase now properly sets idle and read header timeout to match the api-server. Signed-off-by: Vince Prignano <[email protected]>
- Loading branch information