This proposal aims at extending the current pod specification with support for namespaced kernel parameters (sysctls) set for each pod.
- initial implementation for v1.4: kubernetes/kubernetes#27180
  - node-level whitelist for safe sysctls: `kernel.shm_rmid_forced`, `net.ipv4.ip_local_port_range`, `net.ipv4.tcp_max_syn_backlog`, `net.ipv4.tcp_syncookies`
  - (disabled by-default) unsafe sysctls: `kernel.msg*`, `kernel.sem`, `kernel.shm*`, `fs.mqueue.*`, `net.*`
  - new kubelet flag: `--experimental-allowed-unsafe-sysctls`
  - PSP default: `*`
- node-level whitelist for safe sysctls:
  - document node-level whitelist with kubelet flags and taints/tolerations
  - document host-level sysctls with daemon sets + taints/tolerations
- in parallel: kernel upstream patches to fix ipc accounting for 4.5+
  - submitted to mainline
  - merged into mainline, compare https://github.com/torvalds/linux/commit/8c8d4d45204902e144abc0f15b7c658828028fa1
- pre-requisites for `kernel.sem`, `kernel.msg*`, `fs.mqueue.*` on the node-level whitelist:
  - pod cgroups active by default (compare Pod Resource Management)
  - kmem accounting active by default
  - kernel patches for 4.5+ (merged since 4.9)
- reconsider what to do with `kernel.shm*` and other resource-limit sysctls with proper isolation: (a) keep them in the API (b) set node-level defaults
Setting Sysctls on the Pod Level
In Linux, the sysctl interface allows an administrator to modify kernel
parameters at runtime. Parameters are available via the `/proc/sys/` virtual
process file system. The parameters cover various subsystems such as:
- kernel (common prefix: `kernel.`)
- networking (common prefix: `net.`)
- virtual memory (common prefix: `vm.`)
- MDADM (common prefix: `dev.`)
More subsystems are described in Kernel docs.
To get a list of basic prefixes on your system, you can run
$ sudo sysctl -a | cut -d' ' -f1 | cut -d'.' -f1 | sort -u
To get a list of all parameters, you can run
$ sudo sysctl -a
A number of them are namespaced and can therefore be set for a container independently with today's Linux kernels.
Note: This proposal - while sharing some use-cases - does not cover ulimits (compare Expose or utilize docker's rlimit support).
A number of Linux applications need certain kernel parameter settings to
- either run at all
- or perform well.
In Kubernetes we want to allow setting these parameters within a pod specification in order to enable the use of the platform for those applications.
With Docker version 1.11.1 it is possible to change kernel parameters inside privileged containers. However, the process is purely manual and the changes might be applied across all containers affecting the entire host system. It is not possible to set the parameters within a non-privileged container.
With docker#19265 docker-run as of 1.12.0 supports setting a number of whitelisted sysctls during the container creation process.
Some real-world examples for the use of sysctls:
- PostgreSQL requires `kernel.shmmax` and `kernel.shmall` (among others) to be set to reasonably high values (compare PostgreSQL Manual 17.4.1 Shared Memory and Semaphores). The default of 32 MB for shared memory is not reasonable for a database.
- RabbitMQ proposes a number of sysctl settings to optimize networking: https://www.rabbitmq.com/networking.html.
- Web applications with many concurrent connections require high values for `net.core.somaxconn`.
- A containerized IPv6 routing daemon requires e.g. `/proc/sys/net/ipv6/conf/all/forwarding` and `/proc/sys/net/ipv6/conf/all/accept_redirects` (compare docker#4717).
- The nginx ingress controller in kubernetes/contrib uses a privileged sidekick container to set `net.core.somaxconn` and `net.ipv4.ip_local_port_range`.
- A huge software-as-a-service provider uses shared memory (`kernel.shm*`) and message queues (`kernel.msg*`) to communicate between containers of their web-serving pods, configuring up to 20 GB of shared memory. For optimal network layer performance they set `net.core.rmem_max`, `net.core.wmem_max`, `net.ipv4.tcp_rmem` and `net.ipv4.tcp_wmem` to much higher values than the kernel defaults.
- Linux tuning guides for 10G Ethernet suggest setting `net.core.rmem_max`/`net.core.wmem_max` to values as high as 64 MB and similar dimensions for `net.ipv4.tcp_rmem`/`net.ipv4.tcp_wmem`. It is noted that "tuning settings described here will actually decrease performance of hosts connected at rates of OC3 (155 Mbps) or less."
- For integration of a web backend with the load-balancer retry mechanics it is suggested in http://serverfault.com/questions/518862/will-increasing-net-core-somaxconn-make-a-difference: "Sometimes it's preferable to fail fast and let the load-balancer do its job (retry) than to make the user wait - for that purpose we set net.core.somaxconn to any value, and limit the application backlog to e.g. 10 and set net.ipv4.tcp_abort_on_overflow to 1." In other words, sysctls can radically change the observable application behavior from the view of the load-balancer.
As an administrator I want to set customizable kernel parameters for a container
- To be able to limit consumed kernel resources
  - so I can provide more resources to other containers
  - to restrict system communication that slows down the host or other containers
  - to protect against programming errors like resource leaks
  - to protect against DDoS attacks.
- To be able to increase limits for certain applications while not changing the default for all containers on a host
  - to enable resource hungry applications like databases to perform well while the default limits for all other applications can be kept low
  - to enable many network connections e.g. for web backends
  - to allow special memory management like Java hugepages.
- To be able to enable kernel features.
  - to enable containerized execution of special purpose applications without the need to enable those kernel features host-wide, e.g. ip forwarding for network router daemons
- Only namespaced kernel parameters can be modified
- Resource isolation is ensured for all safe sysctls. Sysctls with unclear, weak or non-existent isolation are called unsafe sysctls. The latter are disabled by default.
- Built on top of the existing security context work
- Be container-runtime agnostic
  - on the API level
  - the implementation (and the set of supported sysctls) will depend on the runtime
- Kernel parameters can be set during a container creation process only.
- Updating kernel parameters in running containers is out of scope.
- Integration with the new container runtime proposal (kubernetes/kubernetes#25899) is out of scope.
- Hugepages support (compare docker#4717) - while also partly configured through sysctls (`vm.nr_hugepages`, compare http://andrigoss.blogspot.de/2008/02/jvm-performance-tuning.html) - is out-of-scope for this proposal as it is not namespaced and, as a limited resource (similar to normal memory), needs deeper integration e.g. with the scheduler.
Supported sysctls (whitelist) as of Docker 1.12.0:
- IPC namespace
  - System V: `kernel.msgmax`, `kernel.msgmnb`, `kernel.msgmni`, `kernel.sem`, `kernel.shmall`, `kernel.shmmax`, `kernel.shmmni`, `kernel.shm_rmid_forced`
  - POSIX queues: `fs.mqueue.*`
- network namespace: `net.*`
Error behavior:
- non-whitelisted sysctls are rejected:

  ```
  $ docker run --sysctl=foo=bla -it busybox /bin/sh
  invalid value "foo=bla" for flag --sysctl: sysctl 'foo=bla' is not whitelisted
  See 'docker run --help'.
  ```
Applied changes:
Related issues:
Supported sysctls (whitelist) as of RunC 0.1.1 (compare libcontainer config validator):
- IPC namespace
  - System V: `kernel.msgmax`, `kernel.msgmnb`, `kernel.msgmni`, `kernel.sem`, `kernel.shmall`, `kernel.shmmax`, `kernel.shmmni`, `kernel.shm_rmid_forced`
  - POSIX queues: `fs.mqueue.*`
- network namespace: `net.*`
Applied changes:
The only sysctl support in rkt is through a CNI plugin. The Kubernetes network plugin kubenet
can easily be extended to call this with a given list of sysctls during pod launch.
The default network plugin for rkt is `no-op` though. This mode leaves all network initialization to rkt itself. Rkt in turn uses the static CNI plugin configuration in `/etc/rkt/net.d`. This does not allow customizing the sysctls per pod. Hence, in order to implement this proposal in `no-op` mode, additional changes in rkt are necessary.
Supported sysctls (whitelist):
- network namespace: `net.*`
Applied changes:
Issues:
- Each pod has its own network stack that is shared among its containers. A privileged side-kick or init container (compare https://git.k8s.io/contrib/ingress/controllers/nginx/examples/sysctl/change-proc-values-rc.yaml#L80) is able to set `net.*` sysctls. Clearly, this is completely uncontrolled by the kubelet, but it is a usable work-around if privileged containers are permitted in the environment. As privileged container permissions (in the admission controller) are an all-or-nothing decision and the actual code executed in them is not limited, allowing privileged containers might be a security threat.
  The same work-around also works for shared memory and message queue sysctls as they are shared among the containers of a pod in their ipc namespace.
- Instead of giving the user a way to set sysctls for his pods, an alternative seems to be to set high values for the limits of interest from the beginning inside the kubelet or the runtime. Then - so the theory goes - the user's pods operate under quasi-unlimited bounds.
  This might be true for some of the sysctls which purely set limits for some host resources, but
  - some sysctls influence the behavior of the application, e.g.:
    - `kernel.shm_rmid_forced` adds garbage collection semantics to shared memory segments when the owning processes die. This is against the System V standard though.
    - `net.ipv4.tcp_abort_on_overflow` makes the kernel send RST packets when the application is overloaded, giving a load-balancer the chance to reschedule a request to another backend.
  - some sysctls lead to changed resource requirement characteristics, e.g.:
    - `net.ipv4.tcp_rmem`/`net.ipv4.tcp_wmem` not only define min and max values, but also the default tcp window buffer size for each socket. While large values are necessary for certain environments and applications, they lead to a waste of resources in the 90% case.
  - some sysctls have a different error behavior, e.g.:
    - creating a shared memory segment will fail immediately when `kernel.shmmax` is too small. With a large `kernel.shmmax` default, the creation of a segment always succeeds, but the OOM killer will do its job when a shared memory segment exceeds the memory request of the container.
  - The high values that could be set by the kubelet on launch might depend on the node's capacity and capabilities. But for portability of workloads it is helpful to have a common baseline of sysctl settings one can expect on every node. The kernel defaults (which are active if the kubelet does not change them) are such a (natural) baseline.
- One could imagine offering certain non-namespaced sysctls as well, which would taint a host such that only containers with compatible sysctl settings are scheduled there. Scheduling pods with certain sysctls onto certain hosts according to some given rules is considered out of scope; this must be done manually by the admin, e.g. by using taints and tolerations.
- (Next to namespacing) isolation is the key requirement for a sysctl to be unconditionally allowed in a pod spec. There are the following alternatives:
  - allow only namespaced and isolated sysctls (= safe) in the API
  - allow only namespaced and isolated sysctls by default and make all other namespaced sysctls with unclear or weak isolation (= unsafe) opt-in by the cluster admin.

  For v1.4 only a handful of safe sysctls are defined. There are known, non-edge-case use-cases (see above) for a number of further sysctls. Some of them (especially the ipc sysctls) will probably be promoted onto the whitelist of safe sysctls in the near future when Kubernetes implements better resource isolation.
  On the other hand, especially in the `net.*` hierarchy there are a number of very low-level knobs to tune the network stack. They might be necessary for classes of applications requiring high-performance or realtime behavior. It is hard to foresee which knobs will be necessary in the future. At the same time the `net.*` hierarchy is huge, making a deep analysis on a 1-on-1 basis hard. If there is no way to use them at-your-own-risk, those users are forced into the use of privileged containers. This might be a security threat and a no-go for certain environments. Sysctls in the API (even if unsafe) in contrast allow fine-grained control by the cluster admin without essentially opening up root access to the cluster nodes for some users.
  This requirement for a large number of accessible sysctls must be balanced though with the desire to have a minimal API surface: removing certain (unsafe) sysctls from an official API in a later version (e.g. because they turned out to be problematic for node health) is painful.
  To balance those two desires the API can be split in half: one official way to declare safe sysctls in a pod spec (this one will be promoted to beta and stable some day) and an alternative way to define unsafe sysctls. Possibly the second way will stay alpha forever to make it clear that unsafe sysctls are not a stable API of Kubernetes. Moreover, for all unsafe sysctls an opt-in policy is desirable, only controllable by the cluster admin, not by each cluster user.
Note: The kmem accounting has fundamentally changed in kernel 4.5 (compare https://github.com/torvalds/linux/commit/a9bb7e620efdfd29b6d1c238041173e411670996): older kernels (e.g. 4.4 from Ubuntu 16.04, 3.10 from CentOS 7.2) use a blacklist (`__GFP_NOACCOUNT`), newer kernels (e.g. 4.6.x from Fedora 24) use a whitelist (`__GFP_ACCOUNT`). In the following the analysis is done for kernel >= 4.5:
- `kernel.shmall`, `kernel.shmmax`, `kernel.shmmni`: configure System V shared memory
  - namespaced in ipc ns
  - accounted for as user memory in memcg, using sparse allocation (like tmpfs; uses the resizable virtual memory filesystem)
  - hence safe to customize
  - no application influence with high values
  - defaults to unlimited pages, unlimited size, 4096 segments on today's kernels. This makes customization practically unnecessary, at least for the segment sizes. IBM's DB2 suggests `256 * GB of RAM` for `kernel.shmmni` (compare http://www.ibm.com/support/knowledgecenter/SSEPGG_10.1.0/com.ibm.db2.luw.qb.server.doc/doc/c0057140.html), exceeding the kernel defaults for machines with >16 GB of RAM.
- `kernel.shm_rmid_forced`: enforce removal of shared memory segments on process shutdown
  - namespaced in ipc ns
- `kernel.msgmax`, `kernel.msgmnb`, `kernel.msgmni`: configure System V messages
  - namespaced in ipc ns
  - temporarily allocated in kmem in a linked message list, but not accounted for in memcg with kernel >= 4.5
  - defaults to 8 kb max packet size, 16384 kb total queue size, 32000 queues, which might be too small for certain applications
  - arbitrary values up to INT_MAX. Hence, a potential DoS attack vector against the host.
    Even without using a sysctl the kernel default allows any pod to allocate 512 MB of message memory (compare https://github.com/sttts/kmem-ipc-msg-queues as a test-case). If kmem accounting is not active, this is outside of the pod resource limits. Then a node with 8 GB will not survive >16 replicas of such a pod (16 x 512 MB = 8 GB).
- `fs.mqueue.*`: configure POSIX message queues
  - namespaced in ipc ns
  - uses the same `load_msg` as System V messages, i.e. no accounting for kernel >= 4.5
  - does strict checking against rlimits though
  - defaults to 256 queues, max queue length 10, message size 8 kb
  - can be customized via sysctls up to 64k max queue length and 16 MB message size. Hence, a potential DoS attack vector against the host
- `kernel.sem`: configure System V semaphores
  - namespaced in ipc ns
  - uses plain kmalloc and vmalloc without accounting
  - defaults to 32000 ids and 32000 semaphores per id (needing a double-digit number of bytes each), probably enough for all applications: "The values has been chosen to be larger than necessary for any known configuration." (linux/sem.h)
- `net.*`: configure the network stack
  - `net.core.somaxconn`: maximum queue length specifiable by listen
    - namespaced in net ns
    - might have application influence for high values as it limits the socket queue length
    - [?] No real evidence found until now for accounting. The limit is checked by `sk_acceptq_is_full` at http://lxr.free-electrons.com/source/net/ipv4/tcp_ipv4.c#L1276. After that a new socket is created. Probably the tcp socket buffer sysctls apply then, with their accounting, see below.
    - very unreliable tcp memory accounting. There have been a number of attempts to drop that from the kernel completely, e.g. https://lkml.org/lkml/2014/9/12/401. On Fedora 24 (4.6.3) tcp accounting did not work at all, on Ubuntu 16.04 (4.4) it kind of worked in the root-cg, but in containers only values copied from the root-cg appeared.
  - `net.ipv4.tcp_rmem`/`net.ipv4.tcp_wmem`/`net.core.rmem_max`/`net.core.wmem_max`: socket buffer sizes
    - not namespaced in net ns, and they are not even visible under `/proc/sys/net` inside a container
  - `net.ipv4.ip_local_port_range`: local tcp/udp port range
    - namespaced in net ns
    - no memory involved
  - `net.ipv4.tcp_max_syn_backlog`: number of half-open connections
    - not namespaced
  - `net.ipv4.tcp_syncookies`: enable syn cookies
    - namespaced in net ns
    - no memory involved
The individual analysis above leads to the following summary of:
- namespacing (ns) - the sysctl is set in this namespace, independently from the parent/root namespace
- accounting (acc.) - the memory resources caused by the sysctl are accounted for by the given cgroup

Kernels <= 4.4 and >= 4.5 have fundamentally different kernel memory accounting (see the note above). The two columns describe the two cases.
sysctl | ns | acc. (kernel <= 4.4) | acc. (kernel >= 4.5) |
---|---|---|---|
kernel.shm* | ipc | user memcg 1) | user memcg 1) |
kernel.msg* | ipc | kmem memcg 3) | - 3) |
fs.mqueue.* | ipc | kmem memcg | - |
kernel.sem | ipc | kmem memcg | - |
net.core.somaxconn | net | unreliable 4) | unreliable 4) |
net.*.tcp_wmem/rmem | - 2) | unreliable 4) | unreliable 4) |
net.core.wmem/rmem_max | - 2) | unreliable 4) | unreliable 4) |
net.ipv4.ip_local_port_range | net | not needed 5) | not needed 5) |
net.ipv4.tcp_syncookies | net | not needed 5) | not needed 5) |
net.ipv4.tcp_max_syn_backlog | - 2) | ? | ? |
Footnotes:
1. a pod memory cgroup is necessary to catch segments from a dying process.
2. only available in the root-ns, not even visible in a container
3. compare https://github.com/sttts/kmem-ipc-msg-queues as a test-case
4. in theory socket buffers should be accounted for by the kmem.tcp memcg counters. In practice this only worked very unreliably and not reproducibly, on some kernels not at all. kmem.tcp accounting seems to be deprecated, and patches have been posted on lkml to drop this broken feature.
5. because no memory is involved, i.e. a purely functional difference
Note: for all sysctls marked as "kmem memcg" kernel memory accounting must be enabled in the container for proper isolation. This will not be the case for 1.4, but is planned for 1.5.
From the previous analysis the following classification is derived:
sysctl | ns | accounting | reclaim | pre-requisites |
---|---|---|---|---|
kernel.shm* | pod | container | pod | i 1) |
kernel.msg* | pod | container | pod | i + ii + iii |
fs.mqueue.* | pod | container | pod | i + ii + iii |
kernel.sem | pod | container | pod | i + ii + iii |
net.core.somaxconn | pod | container | container | i + ii + iv |
net.*.tcp_wmem/rmem | host | container | container | i + ii + iv |
net.core.wmem/rmem_max | host | container | container | i + ii + iv |
net.ipv4.ip_local_port_range | pod | n/a | n/a | - |
net.ipv4.tcp_syncookies | pod | n/a | n/a | - |
net.ipv4.tcp_max_syn_backlog | pod | n/a | n/a | - |
Explanation:
- ns: value is namespaced on this level
- accounting: memory is accounted for against limits of this level
- reclaim: in the worst case, memory resources fall-through to this level and are accounted for there until they get destroyed
- pre-requisites:
  - (i) pod level cgroups
  - (ii) kmem accounting enabled in Kubernetes
  - (iii) kmem accounting fixes for ipc namespace in kernel >= 4.5
  - (iv) reliable kernel tcp net buffer accounting, which probably means waiting for cgroups v2.
Footnote:
1. Pod level cgroups don't exist today and pages are already re-parented on container deletion in v1.3. So supporting pod level sysctls in v1.4 that are tracked by user space memcg is not introducing any regression.
Note: with the exception of `kernel.shm*`, all of the listed pod-level sysctls depend on kernel memory accounting to be enabled for proper resource isolation. This will not be the case for 1.4 by default, but is planned for 1.5.
Note: all the ipc objects persist when the originating container dies. Their resources (if kmem accounting is enabled) fall back to the parent cgroup. As long as there is no pod level memory cgroup, the parent will be the container runtime, e.g. the docker daemon or the RunC process. It is planned with v1.5 to introduce a pod level memory cgroup which will fix this problem.
Note: in general it is good practice to reserve special nodes for those pods which set sysctls which the kernel does not guarantee proper isolation for.
Sysctls in pods and `PodSecurityPolicy` are first introduced as an alpha feature for Kubernetes 1.4. This means that the API will model these as annotations, with the plan to turn them into first-class citizens in a later release when the feature is promoted to beta.
It is proposed to use a syntactical validation in the apiserver and a node-level whitelist of safe sysctls in the kubelet. The whitelist shall be fixed per version and might grow in the future when better resource isolation is in place in the kubelet. In addition a list of allowed unsafe sysctls will be configured per node by the cluster admin, with an empty list as the default.
The following rules apply:
- Only sysctls shall be whitelisted in the kubelet
  - that are properly namespaced by the container or the pod (e.g. in the ipc or net namespace)
  - and that cannot lead to resource consumption outside of the limits of the container or the pod. These are called safe.
- The cluster admin shall only be able to manually enable sysctls in the kubelet
  - that are properly namespaced by the container or the pod (e.g. in the ipc or net namespace). These are called unsafe.
This means that sysctls that are not namespaced must be set by the admin on host level at his own risk, e.g. by running a privileged daemonset, possibly limited to a restricted, special-purpose set of nodes, if necessary with the host network namespace. This is considered out-of-scope of this proposal and out-of-scope of what the kubelet will do for the admin. A section is going to be added to the documentation describing this.
The allowed unsafe sysctls will be configurable on the node via a flag of the kubelet.
Pod specification must be changed to allow the specification of kernel parameters:
```go
// Sysctl defines a kernel parameter to be set
type Sysctl struct {
	// Name of a property to set
	Name string `json:"name"`
	// Value of a property to set
	Value intstr.IntOrString `json:"value"`
	// Must be true for unsafe sysctls.
	Unsafe bool `json:"unsafe,omitempty"`
}
```
```go
// PodSecurityContext holds pod-level security attributes and common container settings.
// Some fields are also present in container.securityContext. Field values of
// container.securityContext take precedence over field values of PodSecurityContext.
type PodSecurityContext struct {
	...
	// Sysctls hold a list of namespaced sysctls used for the pod. Pods with unsupported
	// sysctls (by the container runtime) might fail to launch.
	Sysctls []Sysctl `json:"sysctls,omitempty"`
}
```
During alpha the extension of `PodSecurityContext` is modeled with annotations:

```yaml
security.alpha.kubernetes.io/sysctls: kernel.shm_rmid_forced=1
security.alpha.kubernetes.io/unsafe-sysctls: net.ipv4.route.min_pmtu=1000,kernel.msgmax=1 2 3
```

The value is a comma-separated list of key-value pairs, each of the form `name=value`.
Safe sysctls may be declared with `unsafe: true` (or in the respective annotation), while for unsafe sysctls `unsafe: true` is mandatory. This guarantees backwards-compatibility in future versions when sysctls have been promoted to the whitelist: old pod specs will still work.
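For illustration, here is a rough Go sketch (not the final implementation; the helper name and error handling are made up) of how such an annotation value could be parsed into the `Sysctl` type defined above, with `unsafe` set to true for the `unsafe-sysctls` annotation:

```go
import (
	"fmt"
	"strings"

	"k8s.io/kubernetes/pkg/util/intstr"
)

// parseSysctlAnnotation splits a comma-separated list of name=value pairs.
// unsafe is true when parsing the security.alpha.kubernetes.io/unsafe-sysctls
// annotation.
func parseSysctlAnnotation(annotation string, unsafe bool) ([]Sysctl, error) {
	var sysctls []Sysctl
	for _, kv := range strings.Split(annotation, ",") {
		parts := strings.SplitN(kv, "=", 2)
		if len(parts) != 2 {
			return nil, fmt.Errorf("sysctl %q is not of the form name=value", kv)
		}
		sysctls = append(sysctls, Sysctl{
			Name:   parts[0],
			Value:  intstr.FromString(parts[1]), // values may contain spaces, e.g. "1 2 3"
			Unsafe: unsafe,
		})
	}
	return sysctls, nil
}
```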
Possibly, the `security.alpha.kubernetes.io/unsafe-sysctls` annotation will stay as an alpha API (replacing the `Unsafe bool` field) even when `security.alpha.kubernetes.io/sysctls` has been promoted to beta or stable. This helps to make clear that unsafe sysctls are not a stable feature.
Note: none of the whitelisted sysctls (and in general none with the exception of descriptive plain-text ones) uses anything other than numbers, possibly separated by spaces.
Note: sysctls must be on the pod level because containers in a pod share IPC and network namespaces (if `pod.spec.hostIPC` and `pod.spec.hostNetwork` are false) and therefore cannot have conflicting sysctl values. Moreover, note that all namespaced sysctls supported by Docker/RunC are either in the IPC or the network namespace.
The name of each sysctl in `PodSecurityContext.Sysctls[*].Name` (or the annotation `security.alpha.kubernetes.io/[unsafe-]sysctls` during alpha) is validated by the apiserver against:
- a maximum length of 253 characters
- a match of `sysctlRegexp`:

```go
const SysctlSegmentFmt string = "[a-z0-9]([-_a-z0-9]*[a-z0-9])?"
const SysctlFmt string = "(" + SysctlSegmentFmt + "\\.)*" + SysctlSegmentFmt

var sysctlRegexp = regexp.MustCompile("^" + SysctlFmt + "$")
```
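A minimal sketch of this validation step, reusing the `sysctlRegexp` from the block above; the helper name and the constant are illustrative, the 253-character bound mirrors the rule stated above:

```go
const sysctlMaxLength = 253 // maximum accepted length of a sysctl name

// isValidSysctlName reports whether a sysctl name passes the apiserver's
// syntactical validation.
func isValidSysctlName(name string) bool {
	return len(name) <= sysctlMaxLength && sysctlRegexp.MatchString(name)
}
```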
The name of each sysctl in `PodSecurityContext.Sysctls[*].Name` (or the annotation `security.alpha.kubernetes.io/[unsafe-]sysctls` during alpha) is checked by the kubelet against a static whitelist. The whitelist is defined under `pkg/kubelet` and is to be maintained by the nodes team.
The initial whitelist of safe sysctls will be:
```go
var whitelist = []string{
	"kernel.shm_rmid_forced",
	"net.ipv4.ip_local_port_range",
	"net.ipv4.tcp_syncookies",
	"net.ipv4.tcp_max_syn_backlog",
}
```
In parallel a namespace list is maintained with all sysctls and their respective, known kernel namespaces. This is initially derived from Docker's internal sysctl whitelist:
```go
var namespaces = map[string]string{
	"kernel.sem": "ipc",
}

var prefixNamespaces = map[string]string{
	"kernel.msg": "ipc",
	"kernel.shm": "ipc",
	"fs.mqueue.": "ipc",
	"net.":       "net",
}
```
If a pod is created with host ipc or host network namespace, the respective sysctls are forbidden.
Pods that do not comply with the syntactical sysctl format will be rejected by the apiserver. Pods that do not comply with the whitelist (or are not manually enabled as allowed unsafe sysctls for a node by the cluster admin) will fail to launch. An event will be created by the kubelet to notify the user.
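A sketch of the kubelet-side decision described above, reusing the `Sysctl` type and the whitelist from the previous blocks; the helper names are illustrative, not the actual kubelet code:

```go
import "strings"

// matchesList reports whether name matches one of the entries, which may be
// plain sysctl names or prefixes ending in "*".
func matchesList(name string, entries []string) bool {
	for _, e := range entries {
		if strings.HasSuffix(e, "*") {
			if strings.HasPrefix(name, strings.TrimSuffix(e, "*")) {
				return true
			}
		} else if name == e {
			return true
		}
	}
	return false
}

// admitSysctl: safe, whitelisted sysctls are always admitted; everything else
// must be marked unsafe in the pod spec and be enabled by the cluster admin
// via --experimental-allowed-unsafe-sysctls on this node.
func admitSysctl(s Sysctl, whitelist, allowedUnsafe []string) bool {
	if matchesList(s.Name, whitelist) {
		return true
	}
	return s.Unsafe && matchesList(s.Name, allowedUnsafe)
}
```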
The kubelet will get a new flag:
```
--experimental-allowed-unsafe-sysctls   Comma-separated whitelist of unsafe
                                        sysctls or unsafe sysctl patterns
                                        (ending in *). Use these at your own
                                        risk.
```
It defaults to the empty list.
During kubelet launch the given value is checked against the list of known namespaces for sysctls or sysctl prefixes. If a namespace is not known, the kubelet will terminate with an error.
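The startup check could look roughly like the following sketch, reusing the `namespaces` and `prefixNamespaces` maps from above (the function name and error messages are made up):

```go
import (
	"fmt"
	"strings"
)

// validateAllowedUnsafeSysctls verifies that every flag entry resolves to a
// known sysctl namespace; otherwise the kubelet refuses to start.
func validateAllowedUnsafeSysctls(allowed []string) error {
	for _, s := range allowed {
		if strings.HasSuffix(s, "*") {
			if _, ok := prefixNamespaces[strings.TrimSuffix(s, "*")]; ok {
				continue
			}
			return fmt.Errorf("unknown namespace for sysctl pattern %q", s)
		}
		if _, ok := namespaces[s]; ok {
			continue
		}
		// fall back to a prefix lookup for plain names, e.g. kernel.shmmax -> kernel.shm
		known := false
		for prefix := range prefixNamespaces {
			if strings.HasPrefix(s, prefix) {
				known = true
				break
			}
		}
		if !known {
			return fmt.Errorf("unknown namespace for sysctl %q", s)
		}
	}
	return nil
}
```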
A list of permissible sysctls is to be added to `pkg/apis/extensions/types.go` (compare pod-security-policy):
```go
// PodSecurityPolicySpec defines the policy enforced.
type PodSecurityPolicySpec struct {
	...
	// Sysctls is a white list of allowed sysctls in a pod spec. Each entry
	// is either a plain sysctl name or ends in "*" in which case it is considered
	// as a prefix of allowed sysctls.
	Sysctls []string `json:"sysctls,omitempty"`
}
```
The `simpleProvider` in `pkg/security/podsecuritypolicy` will validate the value of `PodSecurityPolicySpec.Sysctls` against the sysctls of a given pod in `ValidatePodSecurityContext`.
The default policy will be `*`, i.e. all syntactically correct sysctls are admitted by the `PodSecurityPolicySpec`.
The `PodSecurityPolicySpec` applies to safe and unsafe sysctls in the same way.
During alpha the following annotation will be used on `PodSecurityPolicy` objects to customize the allowed sysctls:

```yaml
security.alpha.kubernetes.io/sysctls: kernel.shmmax,kernel.msgmax,fs.mqueue.*
```
Note: This does not override the whitelist or the allowed unsafe sysctls on the nodes. They still apply. This only changes admission of pods in the apiserver. Pods can still fail to launch due to failed admission on the kubelet.
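A rough sketch of the check the `simpleProvider` could perform in `ValidatePodSecurityContext`, with a trailing `*` in a policy entry interpreted as a prefix (and `*` alone matching everything); the function name is illustrative:

```go
import (
	"fmt"
	"strings"
)

// sysctlsAllowedByPolicy verifies that every sysctl of a pod matches one of
// the PodSecurityPolicySpec.Sysctls entries.
func sysctlsAllowedByPolicy(podSysctls []Sysctl, policy []string) error {
	for _, s := range podSysctls {
		allowed := false
		for _, p := range policy {
			if p == "*" ||
				(strings.HasSuffix(p, "*") && strings.HasPrefix(s.Name, strings.TrimSuffix(p, "*"))) ||
				s.Name == p {
				allowed = true
				break
			}
		}
		if !allowed {
			return fmt.Errorf("sysctl %q is not allowed by the PodSecurityPolicy", s.Name)
		}
	}
	return nil
}
```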
```go
// SysctlPolicy defines how a sysctl may be set. If neither Values,
// nor Min, Max are set, any value is allowed.
type SysctlPolicy struct {
	// Name is the name of a sysctl or a pattern for a name. It consists of
	// dot separated name segments. A name segment matches [a-z]+[-_a-z0-9]* or
	// equals "*". The latter is interpreted as a wildcard for that name
	// segment.
	Name string `json:"name"`
	// Values are allowed values to be set. Either Values is
	// set or Min and Max.
	Values []string `json:"values,omitempty"`
	// Min is the minimal value allowed to be set.
	Min *int64 `json:"min,omitempty"`
	// Max is the maximum value allowed to be set.
	Max *int64 `json:"max,omitempty"`
}

// PodSecurityPolicySpec defines the policy enforced on sysctls.
type PodSecurityPolicySpec struct {
	...
	// Sysctls is a white list of allowed sysctls in a pod spec.
	Sysctls []SysctlPolicy `json:"sysctls,omitempty"`
}
```
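To make the semantics of `SysctlPolicy` concrete, here is a sketch of how a single requested value might be checked against one policy entry; the helper is illustrative and not part of the proposed API:

```go
import "strconv"

// valueAllowed mirrors the comment on SysctlPolicy: if neither Values nor
// Min/Max are set, any value is allowed.
func valueAllowed(p SysctlPolicy, value string) bool {
	if len(p.Values) > 0 {
		for _, v := range p.Values {
			if v == value {
				return true
			}
		}
		return false
	}
	if p.Min != nil || p.Max != nil {
		n, err := strconv.ParseInt(value, 10, 64)
		if err != nil {
			return false // Min/Max only make sense for numeric values
		}
		if p.Min != nil && n < *p.Min {
			return false
		}
		if p.Max != nil && n > *p.Max {
			return false
		}
		return true
	}
	return true
}
```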
During alpha the following annotation will be used:
```yaml
security.alpha.kubernetes.io/sysctls: kernel.shmmax,kernel.msgmax=max:10:min:1,kernel.msgmni=values:1000 2000 3000
```
This extended syntax is a natural extension of that of alternative 1 and therefore can be implemented any time during alpha.
Alternative 1 or 2 has to be chosen for the external API once the feature is promoted to beta.
Finally, the container runtime will interpret `pod.spec.securityContext.sysctls`, e.g. in the case of Docker the `DockerManager` will apply the given sysctls to the infra container in `createPodInfraContainer`.
In a later implementation of a container runtime interface (compare kubernetes/kubernetes#25899), sysctls will be part of `LinuxPodSandboxConfig` (compare kubernetes/kubernetes#25899 (comment)) and will be applied to the `PodSandbox` by the `PodSandboxManager` implementation.
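For the Docker case, the translation is essentially a name/value map: Docker's create API accepts a `Sysctls map[string]string` in the host config, the equivalent of `docker run --sysctl`. A minimal sketch of that step, with an illustrative helper name:

```go
// sysctlsToDockerMap converts the pod-level sysctls into the name/value map
// handed to the infra container's host config.
func sysctlsToDockerMap(sysctls []Sysctl) map[string]string {
	m := make(map[string]string, len(sysctls))
	for _, s := range sysctls {
		m[s.Name] = s.Value.String() // IntOrString renders ints and strings alike
	}
	return m
}
```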
Here is an example of a pod that has the safe sysctl `net.ipv4.ip_local_port_range` set to `1024 65535` and the unsafe sysctl `net.ipv4.route.min_pmtu` set to `1000`:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
  securityContext:
    sysctls:
    - name: net.ipv4.ip_local_port_range
      value: "1024 65535"
    - name: net.ipv4.route.min_pmtu
      value: 1000
      unsafe: true
```
Here is an example of a `PodSecurityPolicy`, allowing `kernel.shmmax`, `kernel.shmall` and all `net.*` sysctls to be set:
```yaml
apiVersion: v1
kind: PodSecurityPolicy
metadata:
  name: database
spec:
  sysctls:
  - kernel.shmmax
  - kernel.shmall
  - net.*
```
and a restricted default `PodSecurityPolicy`:
```yaml
apiVersion: v1
kind: PodSecurityPolicy
metadata:
  name:
spec:
  sysctls: # none
```
in contrast to a permissive default `PodSecurityPolicy`:
```yaml
apiVersion: v1
kind: PodSecurityPolicy
metadata:
  name:
spec:
  sysctls:
  - "*"
```