Skip to content

Commit

Permalink
🐛 Refactor manager to avoid race conditions and provide clean shutdown
Browse files Browse the repository at this point in the history
This changeset provides a series of improvements and refactors how the
manager starts and stops. During testing with Cluster API (a user of
controller runtime), folks noticed that the manager which runs a series
of components can deadlock itself when using conversion webhooks, or
health checks, or won't cleanly shutdown and cleanup all the running
controller, runnables, caches, webhooks, and http servers.

In particular:
- The Manager internal mutex didn't actually lock operations while the
  manager was in the process of starting up. The manager internal
  Start() implementation started off a series of goroutines internally
  and then waits. Concurrent operations on the manager, like
  AddHealthzCheck or AddReadyzCheck or AddMetricsExtraHandler modified
  the internals map while or after their respective servers were being
  configured, causing potential races or being ineffective.

- Unclear ordering of the manager caused deadlock when the caches would
  start up. Upon startup, conversion webhooks are required when
  waiting for the cache initial List() call, which warms the internal
  caches. If a webook server or a healthz/readyz probe didn't start in
  time, the cache List() call fails because the webhooks would be
  unavailable.

- Manager would say it was Elected() (note: this is used regardless if
  leader election is enabled or not) without waiting for all the caches
  to warm up, which could result in failed client calls.

- Stop proceduce cancelled everything at once regardless of ordering.
  Previously, the context cancelled all the runnables regardless of
  ordering which can also cause dependencies issues. With these changes,
  if graceful shutdown is set, we try to cancel and wait for runnable
  groups to be done in a strict order before proceeding to exit the
  program.

- Stop procedure cancelled leader election only if graceful shutdown was
  set. This was probably an oversight, now we're cancelling leader
  election regardless if graceful timeout is set or not.

- The http.Server used throughout the codebase now properly sets idle
  and read header timeout to match the api-server.

Signed-off-by: Vince Prignano <[email protected]>
  • Loading branch information
vincepri committed Nov 2, 2021
1 parent c73b143 commit 38f1b90
Show file tree
Hide file tree
Showing 8 changed files with 842 additions and 298 deletions.
16 changes: 16 additions & 0 deletions pkg/internal/httpserver/server.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
package httpserver

import (
"net/http"
"time"
)

// New returns a new server with sane defaults.
func New(handler http.Handler) *http.Server {
return &http.Server{
Handler: handler,
MaxHeaderBytes: 1 << 20,
IdleTimeout: 90 * time.Second, // matches http.DefaultTransport keep-alive timeout
ReadHeaderTimeout: 32 * time.Second,
}
}
Loading

0 comments on commit 38f1b90

Please sign in to comment.