Setting Sysctls on the Pod Level

This proposal aims to extend the current pod specification with support for namespaced kernel parameters (sysctls) set per pod.

Roadmap

v1.4

  • initial implementation for v1.4 kubernetes/kubernetes#27180
    • node-level whitelist for safe sysctls: kernel.shm_rmid_forced, net.ipv4.ip_local_port_range, net.ipv4.tcp_max_syn_backlog, net.ipv4.tcp_syncookies
    • (disabled by default) unsafe sysctls: kernel.msg*, kernel.sem, kernel.shm*, fs.mqueue.*, net.*
    • new kubelet flag: --experimental-allowed-unsafe-sysctls
    • PSP default: *
  • document node-level whitelist with kubelet flags and taints/tolerations
  • document host-level sysctls with daemon sets + taints/tolerations
  • in parallel: kernel upstream patches to fix ipc accounting for 4.5+

v1.5+

  • prerequisites for kernel.sem, kernel.msg*, fs.mqueue.* on the node-level whitelist
    • pod cgroups active by default (compare Pod Resource Management)
    • kmem accounting active by default
    • kernel patches for 4.5+ (merged since 4.9)
  • reconsider what to do with kernel.shm* and other resource-limit sysctls with proper isolation: (a) keep them in the API, or (b) set node-level defaults


Abstract

In Linux, the sysctl interface allows an administrator to modify kernel parameters at runtime. Parameters are available via the /proc/sys/ virtual file system. The parameters cover various subsystems such as:

  • kernel (common prefix: kernel.)
  • networking (common prefix: net.)
  • virtual memory (common prefix: vm.)
  • MDADM (common prefix: dev.)

More subsystems are described in Kernel docs.

To get a list of basic prefixes on your system, you can run

$ sudo sysctl -a | cut -d' ' -f1 | cut -d'.' -f1 | sort -u

To get a list of all parameters, you can run

$ sudo sysctl -a

A number of them are namespaced and can therefore be set for a container independently with today's Linux kernels.
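
Each dotted parameter name maps onto a path below /proc/sys, with the dots replaced by slashes. The following minimal Go sketch (purely illustrative, not part of this proposal) reads a sysctl value through that path:

package main

import (
	"fmt"
	"io/ioutil"
	"path/filepath"
	"strings"
)

// readSysctl reads the current value of a sysctl by translating its dotted
// name into the corresponding /proc/sys path, e.g.
// "net.ipv4.ip_local_port_range" -> "/proc/sys/net/ipv4/ip_local_port_range".
func readSysctl(name string) (string, error) {
	path := filepath.Join("/proc/sys", strings.Replace(name, ".", "/", -1))
	data, err := ioutil.ReadFile(path)
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(data)), nil
}

func main() {
	value, err := readSysctl("net.ipv4.ip_local_port_range")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(value)
}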

Note: This proposal - while sharing some use-cases - does not cover ulimits (compare Expose or utilize docker's rlimit support).

Motivation

A number of Linux applications need certain kernel parameter settings to

  • either run at all
  • or perform well.

In Kubernetes we want to allow these parameters to be set within a pod specification in order to enable the use of the platform for those applications.

With Docker version 1.11.1 it is possible to change kernel parameters inside privileged containers. However, the process is purely manual, and the changes may apply across all containers, affecting the entire host system. It is not possible to set the parameters within a non-privileged container.

With docker#19265 docker-run as of 1.12.0 supports setting a number of whitelisted sysctls during the container creation process.

Some real-world examples for the use of sysctls:

  • PostgreSQL requires kernel.shmmax and kernel.shmall (among others) to be set to reasonably high values (compare PostgreSQL Manual 17.4.1. Shared Memory and Semaphores). The default of 32 MB for shared memory is not reasonable for a database.

  • RabbitMQ proposes a number of sysctl settings to optimize networking: https://www.rabbitmq.com/networking.html.

  • web applications with many concurrent connections require high values for net.core.somaxconn.

  • a containerized IPv6 routing daemon requires e.g. /proc/sys/net/ipv6/conf/all/forwarding and /proc/sys/net/ipv6/conf/all/accept_redirects (compare docker#4717)

  • the nginx ingress controller in kubernetes/contrib uses a privileged sidekick container to set net.core.somaxconn and net.ipv4.ip_local_port_range.

  • a huge software-as-a-service provider uses shared memory (kernel.shm*) and message queues (kernel.msg*) to communicate between containers of their web-serving pods, configuring up to 20 GB of shared memory.

    For optimal network layer performance they set net.core.rmem_max, net.core.wmem_max, net.ipv4.tcp_rmem and net.ipv4.tcp_wmem to much higher values than kernel defaults.

  • In Linux Tuning guides for 10G ethernet it is suggested to set net.core.rmem_max/net.core.wmem_max to values as high as 64 MB and similar dimensions for net.ipv4.tcp_rmem/net.ipv4.tcp_wmem.

    It is noted that

    tuning settings described here will actually decrease performance of hosts connected at rates of OC3 (155 Mbps) or less.

  • For integration of a web-backend with the load-balancer retry mechanics it is suggested in http://serverfault.com/questions/518862/will-increasing-net-core-somaxconn-make-a-difference:

    Sometimes it's preferable to fail fast and let the load-balancer to do it's job(retry) than to make user wait - for that purpose we set net.core.somaxconn any value, and limit application backlog to e.g. 10 and set net.ipv4.tcp_abort_on_overflow to 1.

    In other words, sysctls can radically change the observable application behavior from the load-balancer's point of view.

Abstract Use Cases

As an administrator I want to set customizable kernel parameters for a container

  1. To be able to limit consumed kernel resources
    1. so I can provide more resources to other containers
    2. to restrict system communication that slows down the host or other containers
    3. to protect against programming errors like resource leaks
    4. to protect against DDoS attacks.
  2. To be able to increase limits for certain applications while not changing the default for all containers on a host
    1. to enable resource hungry applications like databases to perform well while the default limits for all other applications can be kept low
    2. to enable many network connections e.g. for web backends
    3. to allow special memory management like Java hugepages.
  3. To be able to enable kernel features.
    1. to enable containerized execution of special purpose applications without the need to enable those kernel features host wide, e.g. ip forwarding for network router daemons

Constraints and Assumptions

  • Only namespaced kernel parameters can be modified
  • Resource isolation is ensured for all safe sysctls. Sysctls with unclear, weak, or non-existent isolation are called unsafe sysctls. The latter are disabled by default.
  • Built on-top of the existing security context work
  • Be container-runtime agnostic
    • on the API level
    • the implementation (and the set of supported sysctls) will depend on the runtime
  • Kernel parameters can only be set during the container creation process.

Further work (out of scope for this proposal)

Community Work

Docker support for sysctl

Supported sysctls (whitelist) as of Docker 1.12.0:

  • IPC namespace
    • System V: kernel.msgmax, kernel.msgmnb, kernel.msgmni, kernel.sem, kernel.shmall, kernel.shmmax, kernel.shmmni, kernel.shm_rmid_forced
    • POSIX queues: fs.mqueue.*
  • network namespace: net.*

Error behavior:

  • sysctls that are not whitelisted are rejected:
$ docker run --sysctl=foo=bla -it busybox /bin/sh
invalid value "foo=bla" for flag --sysctl: sysctl 'foo=bla' is not whitelisted
See 'docker run --help'.

Applied changes:

Related issues:

Runc support for sysctl

Supported sysctls (whitelist) as of RunC 0.1.1 (compare libcontainer config validator):

  • IPC namespace
    • System V: kernel.msgmax, kernel.msgmnb, kernel.msgmni, kernel.sem, kernel.shmall, kernel.shmmax, kernel.shmmni, kernel.shm_rmid_forced
    • POSIX queues: fs.mqueue.*
  • network namespace: net.*

Applied changes:

Rkt support for sysctl

The only sysctl support in rkt is through a CNI plugin. The Kubernetes network plugin kubenet can easily be extended to call this with a given list of sysctls during pod launch.

The default network plugin for rkt is no-op though. This mode leaves all network initialization to rkt itself. Rkt in turn uses the static CNI plugin configuration in /etc/rkt/net.d. This does not allow customizing the sysctls for a pod. Hence, in order to implement this proposal in no-op mode, additional changes in rkt are necessary.

Supported sysctls (whitelist):

  • network namespace: net.*

Applied changes:

Issues:

Design Alternatives and Considerations

  • Each pod has its own network stack that is shared among its containers. A privileged side-kick or init container (compare https://git.k8s.io/contrib/ingress/controllers/nginx/examples/sysctl/change-proc-values-rc.yaml#L80) is able to set net.* sysctls.

    Clearly, this is completely uncontrolled by the kubelet, but is a usable work-around if privileged containers are permitted in the environment. As privileged container permissions (in the admission controller) are an all-or-nothing decision and the actual code executed in them is not limited, allowing privileged containers might be a security threat.

    The same work-around also works for shared memory and message queue sysctls as they are shared among the containers of a pod in their ipc namespace.

  • Instead of giving the user a way to set sysctls for their pods, an alternative seems to be to set high values for the limits of interest from the beginning inside the kubelet or the runtime. Then - so the theory goes - the user's pods operate under quasi-unlimited bounds.

    This might be true for some of the sysctls, which purely set limits for some host resources, but

    • some sysctls influence the behavior of the application, e.g.:
      • kernel.shm_rmid_forced adds garbage-collection semantics to shared memory segments when the owning processes die. This is against the System V standard though.
      • net.ipv4.tcp_abort_on_overflow makes the kernel send RST packets when the application is overloaded, giving a load-balancer the chance to reschedule a request to another backend.
    • some sysctls lead to changed resource requirement characteristics, e.g.:
      • net.ipv4.tcp_rmem/net.ipv4.tcp_wmem not only define min and max values, but also the default tcp window buffer size for each socket. While large values are necessary for certain environments and applications, they lead to waste of resources in the 90% case.
    • some sysctls have a different error behavior, e.g.:
      • creating a shared memory segment will fail immediately when kernel.shmmax is too small.

        With a large kernel.shmmax default, the creation of a segment always succeeds, but the OOM killer will do its job when a shared memory segment exceeds the memory request of the container.

    The high values that could be set by the kubelet on launch might depend on the node's capacity and capabilities. But for portability of workloads it is helpful to have a common baseline of sysctl settings one can expect on every node. The kernel defaults (which are active if the kubelet does not change defaults) are such a (natural) baseline.

  • One could imagine offering certain non-namespaced sysctls as well, which taint a host such that only containers with compatible sysctl settings are scheduled there. Scheduling pods with certain sysctls onto certain hosts according to some given rules is considered out of scope; this must be done manually by the admin, e.g. by using taints and tolerations.

  • (Next to namespacing) isolation is the key requirement for a sysctl to be unconditionally allowed in a pod spec. There are the following alternatives:

    1. allow only namespaced and isolated sysctls (= safe) in the API
    2. allow only namespaced and isolated sysctls by-default and make all other namespaced sysctls with unclear or weak isolation (= unsafe) opt-in by the cluster admin.

    For v1.4 only a handful of safe sysctls are defined. There are known, non-edge-case use-cases (see above) for a number of further sysctls. Some of them (especially the ipc sysctls) will probably be promoted onto the whitelist of safe sysctls in the near future when Kubernetes implements better resource isolation.

    On the other hand, especially in the net.* hierarchy there are a number of very low-level knobs to tune the network stack. They might be necessary for classes of applications requiring high-performance or realtime behavior. It is hard to foresee which knobs will be necessary in the future. At the same time the net.* hierarchy is huge, making a deep analysis on a 1-on-1 basis hard. If there is no way to use them at your own risk, those users are forced into the use of privileged containers. This might be a security threat and a no-go for certain environments. Sysctls in the API (even if unsafe), in contrast, allow fine-grained control by the cluster admin without essentially opening up root access to the cluster nodes for some users.

    This requirement for a large number of accessible sysctls must be balanced though with the desire for a minimal API surface: removing certain (unsafe) sysctls from an official API in a later version (e.g. because they turned out to be problematic for node health) is hard.

    To balance those two desires the API can be split in half: one official way to declare safe sysctls in a pod spec (this one will be promoted to beta and stable some day) and an alternative way to define unsafe sysctls. Possibly the second way will stay alpha forever to make it clear that unsafe sysctls are not a stable API of Kubernetes. Moreover, for all unsafe sysctls an opt-in policy is desirable, only controllable by the cluster admin, not by each cluster user.

Analysis of Sysctls of Interest

Note: The kmem accounting has fundamentally changed in kernel 4.5 (compare https://github.com/torvalds/linux/commit/a9bb7e620efdfd29b6d1c238041173e411670996): older kernels (e.g. 4.4 from Ubuntu 16.04, 3.10 from CentOS 7.2) use a blacklist (__GFP_NOACCOUNT), newer kernels (e.g. 4.6.x from Fedora 24) use a whitelist (__GFP_ACCOUNT). In the following the analysis is done for kernel >= 4.5:

  • kernel.shmall, kernel.shmmax, kernel.shmmni: configure System V shared memory

  • kernel.shm_rmid_forced: enforce removal of shared memory segments on process shutdown

    • namespaced in ipc ns
  • kernel.msgmax, kernel.msgmnb, kernel.msgmni: configure System V messages

    Even without using a sysctl the kernel default allows any pod to allocate 512 MB of message memory (compare https://github.com/sttts/kmem-ipc-msg-queues as a test-case). If kmem accounting is not active, this is outside of the pod resource limits. Then a node with 8 GB will not survive with >16 replicas of such a pod.

  • fs.mqueue.*: configure POSIX message queues.

  • kernel.sem: configure System V semaphores

  • net.*: configure the network stack

    • net.core.somaxconn: maximum queue length specifiable by listen.
      • namespaced in net ns
      • might have application influence for high values as it limits the socket queue length
      • [?] No real evidence found so far for accounting. The limit is checked by sk_acceptq_is_full at http://lxr.free-electrons.com/source/net/ipv4/tcp_ipv4.c#L1276. After that a new socket is created. Probably the tcp socket buffer sysctls apply then, with their accounting, see below.
      • very unreliable tcp memory accounting: there have been a number of attempts to drop it from the kernel completely, e.g. https://lkml.org/lkml/2014/9/12/401. On Fedora 24 (4.6.3) tcp accounting did not work at all; on Ubuntu 16.04 (4.4) it kind of worked in the root-cg, but in containers only values copied from the root-cg appeared.
    • net.ipv4.tcp_rmem/net.ipv4.tcp_wmem/net.core.rmem_max/net.core.wmem_max: socket buffer sizes
      • not namespaced in net ns, and they are not even available under /sys/net
    • net.ipv4.ip_local_port_range: local tcp/udp port range
      • namespaced in net ns
      • no memory involved
    • net.ipv4.tcp_max_syn_backlog: number of half-open connections
      • not namespaced
    • net.ipv4.tcp_syncookies: enable syn cookies
      • namespaced in net ns
      • no memory involved

Summary of Namespacing and Isolation

The individual analysis above leads to the following summary of:

  • namespacing (ns) - the sysctl is set in this namespace, independently from the parent/root namespace

  • accounting (acc.) - the memory resources caused by the sysctl are accounted for by the given cgroup

    Kernels <= 4.4 and >= 4.5 have fundamentally different kernel memory accounting (see note above). The two columns describe the two cases.

| sysctl | ns | acc. for <= 4.4 | acc. for >= 4.5 |
|---|---|---|---|
| kernel.shm* | ipc | user memcg 1) | user memcg 1) |
| kernel.msg* | ipc | kmem memcg 3) | - 3) |
| fs.mqueue.* | ipc | kmem memcg | - |
| kernel.sem | ipc | kmem memcg | - |
| net.core.somaxconn | net | unreliable 4) | unreliable 4) |
| net.*.tcp_wmem/rmem | - 2) | unreliable 4) | unreliable 4) |
| net.core.wmem/rmem_max | - 2) | unreliable 4) | unreliable 4) |
| net.ipv4.ip_local_port_range | net | not needed 5) | not needed 5) |
| net.ipv4.tcp_syncookies | net | not needed 5) | not needed 5) |
| net.ipv4.tcp_max_syn_backlog | - 2) | ? | ? |

Footnotes:

  1. a pod memory cgroup is necessary to catch segments from a dying process.
  2. only available in root-ns, not even visible in a container
  3. compare https://github.com/sttts/kmem-ipc-msg-queues as a test-case
  4. in theory socket buffers should be accounted for by the kmem.tcp memcg counters. In practice this only worked very unreliably and not reproducibly, on some kernels not at all. kmem.tcp accounting seems to be deprecated, and patches have been posted on lkml to drop this broken feature.
  5. because no memory is involved, i.e. a purely functional difference

Note: for all sysctls marked as "kmem memcg" kernel memory accounting must be enabled in the container for proper isolation. This will not be the case for 1.4, but is planned for 1.5.

Classification

From the previous analysis the following classification is derived:

| sysctl | ns | accounting | reclaim | prerequisites |
|---|---|---|---|---|
| kernel.shm* | pod | container | pod | i 1) |
| kernel.msg* | pod | container | pod | i + ii + iii |
| fs.mqueue.* | pod | container | pod | i + ii + iii |
| kernel.sem | pod | container | pod | i + ii + iii |
| net.core.somaxconn | pod | container | container | i + ii + iv |
| net.*.tcp_wmem/rmem | host | container | container | i + ii + iv |
| net.core.wmem/rmem_max | host | container | container | i + ii + iv |
| net.ipv4.ip_local_port_range | pod | n/a | n/a | - |
| net.ipv4.tcp_syncookies | pod | n/a | n/a | - |
| net.ipv4.tcp_max_syn_backlog | pod | n/a | n/a | - |

Explanation:

  • ns: value is namespaced on this level
  • accounting: memory is accounted for against limits of this level
  • reclaim: in the worst case, memory resources fall through to this level and are accounted for there until they get destroyed
  • prerequisites:
    i. pod level cgroups
    ii. kmem accounting enabled in Kubernetes
    iii. kmem accounting fixes for the ipc namespace in kernel >= 4.5
    iv. reliable kernel tcp net buffer accounting, which probably means waiting for cgroups v2.

Footnote:

  1. Pod level cgroups don't exist today and pages are already re-parented on container deletion in v1.3. So supporting pod level sysctls in v1.4 that are tracked by user space memcg is not introducing any regression.

Note: with the exception of kernel.shm*, all of the listed pod-level sysctls depend on kernel memory accounting being enabled for proper resource isolation. This will not be the case by default in 1.4, but is planned for 1.5.

Note: all the ipc objects persist when the originating container dies. Their resources (if kmem accounting is enabled) fall back to the parent cgroup. As long as there is no pod level memory cgroup, the parent will be the container runtime, e.g. the docker daemon or the RunC process. It is planned to introduce a pod level memory cgroup with v1.5, which will fix this problem.

Note: in general it is good practice to reserve special nodes for pods that set sysctls for which the kernel does not guarantee proper isolation.

Proposed Design

Sysctls in pods and PodSecurityPolicy are first introduced as an alpha feature for Kubernetes 1.4. This means that the API will model these as annotations, with the plan to turn them into first-class fields in a later release when the feature is promoted to beta.

It is proposed to use a syntactical validation in the apiserver and a node-level whitelist of safe sysctls in the kubelet. The whitelist shall be fixed per version and might grow in the future when better resource isolation is in place in the kubelet. In addition a list of allowed unsafe sysctls will be configured per node by the cluster admin, with an empty list as the default.

The following rules apply:

  • Only sysctls shall be whitelisted in the kubelet
    • that are properly namespaced by the container or the pod (e.g. in the ipc or net namespace)
    • and that cannot lead to resource consumption outside of the limits of the container or the pod. These are called safe.
  • The cluster admin shall only be able to manually enable sysctls in the kubelet
    • that are properly namespaced by the container or the pod (e.g. in the ipc or net namespace). These are called unsafe.

This means that sysctls that are not namespaced must be set by the admin on the host level at their own risk, e.g. by running a privileged daemonset, possibly limited to a restricted, special-purpose set of nodes, if necessary with the host network namespace. This is considered out of scope of this proposal and out of scope of what the kubelet will do for the admin. A section is going to be added to the documentation describing this.

The allowed unsafe sysctls will be configurable on the node via a flag of the kubelet.

Pod API Changes

The pod specification must be changed to allow the specification of kernel parameters:

// Sysctl defines a kernel parameter to be set
type Sysctl struct {
	// Name of a property to set
	Name string `json:"name"`
	// Value of a property to set
	Value intstr.IntOrString `json:"value"`
	// Must be true for unsafe sysctls.
	Unsafe bool `json:"unsafe,omitempty"`
}

// PodSecurityContext holds pod-level security attributes and common container settings.
// Some fields are also present in container.securityContext.  Field values of
// container.securityContext take precedence over field values of PodSecurityContext.
type PodSecurityContext struct {
	...
	// Sysctls hold a list of namespaced sysctls used for the pod. Pods with unsupported
	// sysctls (by the container runtime) might fail to launch.
	Sysctls []Sysctl `json:"sysctls,omitempty"`
}

During alpha the extension of PodSecurityContext is modeled with annotations:

security.alpha.kubernetes.io/sysctls: kernel.shm_rmid_forced=1
security.alpha.kubernetes.io/unsafe-sysctls: net.ipv4.route.min_pmtu=1000,kernel.msgmax=1 2 3

The value is a comma-separated list of name=value pairs.

Safe sysctls may be declared with unsafe: true (or in the respective annotation), while for unsafe sysctls unsafe: true is mandatory. This guarantees backwards-compatibility in future versions when sysctls have been promoted to the whitelist: old pod specs will still work.

Possibly, the security.alpha.kubernetes.io/unsafe-sysctls annotation will stay as an alpha API (replacing the Unsafe bool field) even when security.alpha.kubernetes.io/sysctls has been promoted to beta or stable. This helps to make clear that unsafe sysctls are not a stable feature.

Note: none of the whitelisted sysctls (and in general none, with the exception of descriptive plain-text ones) use anything other than numbers, possibly separated by spaces.

Note: sysctls must be on the pod level because containers in a pod share the IPC and network namespaces (if pod.spec.hostIPC and pod.spec.hostNetwork are false) and therefore cannot have conflicting sysctl values. Moreover, note that all namespaced sysctls supported by Docker/RunC are either in the IPC or network namespace.
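
The following sketch shows how the annotation value could be parsed into the Sysctl slice defined above. It reuses the Sysctl type and the existing intstr utility package; the helper name and error wording are assumptions for illustration, not the actual implementation:

// parseSysctlAnnotation splits a comma separated "name=value,..." annotation
// into Sysctl entries, marking them as unsafe when they come from the
// unsafe-sysctls annotation.
func parseSysctlAnnotation(annotation string, markUnsafe bool) ([]Sysctl, error) {
	var sysctls []Sysctl
	if annotation == "" {
		return sysctls, nil
	}
	for _, kv := range strings.Split(annotation, ",") {
		parts := strings.SplitN(kv, "=", 2)
		if len(parts) != 2 {
			return nil, fmt.Errorf("sysctl %q is not of the form name=value", kv)
		}
		sysctls = append(sysctls, Sysctl{
			Name:   parts[0],
			Value:  intstr.FromString(parts[1]),
			Unsafe: markUnsafe,
		})
	}
	return sysctls, nil
}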

Apiserver Validation and Kubelet Admission

In the Apiserver

The name of each sysctl in PodSecurityContext.Sysctls[*].Name (or the annotation security.alpha.kubernetes.io/[unsafe-]sysctls during alpha) is validated by the apiserver against:

  • a maximum length of 253 characters
  • a match against sysctlRegexp:
const SysctlSegmentFmt string = "[a-z0-9]([-_a-z0-9]*[a-z0-9])?"
const SysctlFmt string = "(" + SysctlSegmentFmt + "\\.)*" + SysctlSegmentFmt
var sysctlRegexp = regexp.MustCompile("^" + SysctlFmt + "$")
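
A sketch of the corresponding check (the helper name is an assumption; the regular expression and length limit are the ones given above):

// validateSysctlName rejects names that are too long or do not match the
// syntactical format, as the apiserver validation described above would.
func validateSysctlName(name string) error {
	if len(name) > 253 {
		return fmt.Errorf("sysctl %q is longer than 253 characters", name)
	}
	if !sysctlRegexp.MatchString(name) {
		return fmt.Errorf("sysctl %q must match %q", name, SysctlFmt)
	}
	return nil
}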

In the Kubelet

The name of each sysctl in PodSecurityContext.Sysctls[*].Name (or the annotation security.alpha.kubernetes.io/[unsafe-]sysctls during alpha) is checked by the kubelet against a static whitelist.

The whitelist is defined under pkg/kubelet and to be maintained by the nodes team.

The initial whitelist of safe sysctls will be:

var whitelist = []string{
    "kernel.shm_rmid_forced",
    "net.ipv4.ip_local_port_range",
    "net.ipv4.tcp_syncookies",
    "net.ipv4.tcp_max_syn_backlog",
}

In parallel a namespace list is maintained with all sysctls and their respective, known kernel namespaces. This is initially derived from Docker's internal sysctl whitelist:

var namespaces = map[string]string{
    "kernel.sem": "ipc",
}

var prefixNamespaces = map[string]string{
    "kernel.msg": "ipc",
    "kenrel.shm": "ipc",
    "fs.mqueue.": "ipc",
    "net.":       "net",
}

If a pod is created with host ipc or host network namespace, the respective sysctls are forbidden.
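
Putting these pieces together, the kubelet-side admission could look roughly like the following sketch. The function name, signature and error messages are assumptions for illustration; whitelist, namespaces and prefixNamespaces are the variables defined above, and allowedUnsafe is the admin-configured list described below:

// admitSysctl decides whether a single sysctl may be set for a pod on this
// node: sysctls in a namespace shared with the host are rejected, safe
// whitelisted sysctls are accepted, and everything else must match one of the
// admin-configured unsafe sysctls or patterns.
func admitSysctl(name string, hostIPC, hostNet bool, allowedUnsafe []string) error {
	ns := namespaces[name]
	if ns == "" {
		for prefix, prefixNS := range prefixNamespaces {
			if strings.HasPrefix(name, prefix) {
				ns = prefixNS
				break
			}
		}
	}
	if (ns == "ipc" && hostIPC) || (ns == "net" && hostNet) {
		return fmt.Errorf("sysctl %q is not allowed with host %s namespace", name, ns)
	}
	for _, safe := range whitelist {
		if name == safe {
			return nil
		}
	}
	for _, pattern := range allowedUnsafe {
		if pattern == name ||
			(strings.HasSuffix(pattern, "*") && strings.HasPrefix(name, strings.TrimSuffix(pattern, "*"))) {
			return nil
		}
	}
	return fmt.Errorf("sysctl %q is not whitelisted on this node", name)
}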

Error behavior

Pods that do not comply with the syntactical sysctl format will be rejected by the apiserver. Pods that do not comply with the whitelist (or are not manually enabled as allowed unsafe sysctls for a node by the cluster admin) will fail to launch. An event will be created by the kubelet to notify the user.

Kubelet Flags to Extend the Whitelist

The kubelet will get a new flag:

--experimental-allowed-unsafe-sysctls  Comma-separated whitelist of unsafe
                                       sysctls or unsafe sysctl patterns
                                       (ending in *). Use these at your own
                                       risk.

It defaults to the empty list.

During kubelet launch the given value is checked against the list of known namespaces for sysctls or sysctl prefixes. If a namespace is not known, the kubelet will terminate with an error.
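
A sketch of that launch-time check, reusing the namespaces and prefixNamespaces maps from above (the helper name is an assumption):

// validateAllowedUnsafeSysctls verifies that every flag entry, plain name or
// "*"-pattern, resolves to a known kernel namespace; otherwise the kubelet
// refuses to start.
func validateAllowedUnsafeSysctls(entries []string) error {
	for _, entry := range entries {
		name := strings.TrimSuffix(entry, "*")
		if _, found := namespaces[name]; found {
			continue
		}
		known := false
		for prefix := range prefixNamespaces {
			// Either the entry lies below a known prefix (e.g. "net.ipv4.route.min_pmtu"),
			// or it is a pattern covering one (e.g. "net.*", i.e. name "net.").
			if strings.HasPrefix(name, prefix) || strings.HasPrefix(prefix, name) {
				known = true
				break
			}
		}
		if !known {
			return fmt.Errorf("unsafe sysctl %q has no known namespace", entry)
		}
	}
	return nil
}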

SecurityContext Enforcement

Alternative 1: by name

A list of permissible sysctls is to be added to pkg/apis/extensions/types.go (compare pod-security-policy):

// PodSecurityPolicySpec defines the policy enforced.
type PodSecurityPolicySpec struct {
	...
	// Sysctls is a white list of allowed sysctls in a pod spec. Each entry
	// is either a plain sysctl name or ends in "*" in which case it is considered
	// as a prefix of allowed sysctls.
	Sysctls []string `json:"sysctls,omitempty"`
}

The simpleProvider in pkg/security/podsecuritypolicy will validate the sysctls of a given pod against PodSecurityPolicySpec.Sysctls in ValidatePodSecurityContext.

The default policy will be *, i.e. all syntactically correct sysctls are admitted by the PodSecurityPolicySpec.

The PodSecurityPolicySpec applies to safe and unsafe sysctls in the same way.

During alpha the following annotation will be used:

security.alpha.kubernetes.io/sysctls: kernel.shmmax,kernel.msgmax,fs.mqueue.*

on PodSecurityPolicy objects to customize the allowed sysctls.

Note: This does not override the whitelist or the allowed unsafe sysctls on the nodes. They still apply. This only changes admission of pods in the apiserver. Pods can still fail to launch due to failed admission on the kubelet.
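
The matching itself is straightforward; a sketch under the assumption that the policy is the plain string list from above (the helper name is illustrative, not the actual simpleProvider code):

// sysctlAllowedByPolicy returns true if the sysctl name is covered by the
// PodSecurityPolicySpec.Sysctls whitelist: an exact entry, a "prefix*" pattern
// or the catch-all "*".
func sysctlAllowedByPolicy(name string, policy []string) bool {
	for _, allowed := range policy {
		if allowed == "*" {
			return true
		}
		if strings.HasSuffix(allowed, "*") {
			if strings.HasPrefix(name, strings.TrimSuffix(allowed, "*")) {
				return true
			}
		} else if name == allowed {
			return true
		}
	}
	return false
}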

Alternative 2: SysctlPolicy

// SysctlPolicy defines how a sysctl may be set. If neither Values,
// nor Min, Max are set, any value is allowed.
type SysctlPolicy struct {
    // Name is the name of a sysctl or a pattern for a name. It consists of
    // dot separated name segments. A name segment matches [a-z]+[a-z_-0-9]* or
    // equals "*". The later is interpretated as a wildcard for that name
    // segment.
    Name string `json:"name"`

    // Values are allowed values to be set. Either Values is
    // set or Min and Max.
    Values []string `json:"values,omitempty"`

    // Min is the minimal value allowed to be set.
    Min *int64 `json:"min,omitempty"`

    // Max is the maximum value allowed to be set.
    Max *int64 `json:"max,omitempty"`
}

// PodSecurityPolicySpec defines the policy enforced on sysctls.
type PodSecurityPolicySpec struct {
    ...
    // Sysctls is a white list of allowed sysctls in a pod spec.
    Sysctls []SysctlPolicy `json:"sysctls,omitempty"`
}

During alpha the following annotation will be used:

security.alpha.kubernetes.io/sysctls: kernel.shmmax,kernel.msgmax=max:10:min:1,kernel.msgmni=values:1000 2000 3000

This extended syntax is a natural extension of that of alternative 1 and therefore can be implemented any time during alpha.
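
A sketch of how a value could be validated against a single SysctlPolicy under this alternative (the helper name is an assumption; strconv is used to interpret numeric values for the Min/Max case):

// valueAllowed accepts any value for an empty policy; otherwise the value must
// either be listed in Values or, if Min/Max are set, parse as an integer
// within that range.
func valueAllowed(p SysctlPolicy, value string) bool {
	if len(p.Values) == 0 && p.Min == nil && p.Max == nil {
		return true
	}
	for _, v := range p.Values {
		if v == value {
			return true
		}
	}
	if p.Min != nil || p.Max != nil {
		n, err := strconv.ParseInt(value, 10, 64)
		if err != nil {
			return false
		}
		if p.Min != nil && n < *p.Min {
			return false
		}
		if p.Max != nil && n > *p.Max {
			return false
		}
		return true
	}
	return false
}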

Alternative 1 or 2 has to be chosen for the external API once the feature is promoted to beta.

Application of the given Sysctls

Finally, the container runtime will interpret pod.spec.securityContext.sysctls, e.g. in the case of Docker the DockerManager will apply the given sysctls to the infra container in createPodInfraContainer.

In a later implementation of a container runtime interface (compare kubernetes/kubernetes#25899), sysctls will be part of LinuxPodSandboxConfig (compare kubernetes/kubernetes#25899 (comment)) and to be applied by the runtime implementation to the PodSandbox by the PodSandboxManager implementation.
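
As a hedged sketch of the Docker path (the helper name and package alias are assumptions, not the actual DockerManager code): the sysctls from the pod's security context end up in the Sysctls map of the infra container's HostConfig, the same mechanism docker run --sysctl uses since 1.12:

import dockercontainer "github.com/docker/engine-api/types/container"

// sysctlsToHostConfig copies the pod's sysctls into the infra container's
// HostConfig so that the Docker daemon applies them when creating the
// container (and with it the pod's ipc and net namespaces).
func sysctlsToHostConfig(sysctls []Sysctl, hc *dockercontainer.HostConfig) {
	if hc.Sysctls == nil {
		hc.Sysctls = map[string]string{}
	}
	for _, s := range sysctls {
		hc.Sysctls[s.Name] = s.Value.String()
	}
}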

Examples

Use in a pod

Here is an example of a pod that has the safe sysctl net.ipv4.ip_local_port_range set to 1024 65535 and the unsafe sysctl net.ipv4.route.min_pmtu set to 1000.

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
  securityContext:
    sysctls:
    - name: net.ipv4.ip_local_port_range
      value: "1024 65535"
    - name: net.ipv4.route.min_pmtu
      value: 1000
      unsafe: true

Allowing only certain sysctls

Here is an example of a PodSecurityPolicy, allowing kernel.shmmax, kernel.shmall and all net.* sysctls to be set:

apiVersion: extensions/v1beta1
kind: PodSecurityPolicy
metadata:
  name: database
spec:
  sysctls:
  - kernel.shmmax
  - kernel.shmall
  - net.*

and a restricted default PodSecurityPolicy:

apiVersion: extensions/v1beta1
kind: PodSecurityPolicy
metadata:
  name:
spec:
  sysctls: # none

in contrast to a permissive default PodSecurityPolicy:

apiVersion: extensions/v1beta1
kind: PodSecurityPolicy
metadata:
  name:
spec:
  sysctls:
  - "*"