This proposal aims at extending the current pod specification with support for namespaced kernel parameters (sysctls) set for each pod.
- initial implementation for v1.4: kubernetes/kubernetes#27180
  - node-level whitelist for safe sysctls: `kernel.shm_rmid_forced`, `net.ipv4.ip_local_port_range`, `net.ipv4.tcp_max_syn_backlog`, `net.ipv4.tcp_syncookies`
  - (disabled by-default) unsafe sysctls: `kernel.msg*`, `kernel.sem`, `kernel.shm*`, `fs.mqueue.*`, `net.*`
  - new kubelet flag: `--experimental-allowed-unsafe-sysctls`
  - PSP default: `*`
- node-level whitelist for safe sysctls:
  - document node-level whitelist with kubelet flags and taints/tolerations
  - document host-level sysctls with daemon sets + taints/tolerations
- in parallel: kernel upstream patches to fix ipc accounting for 4.5+
  - submitted to mainline
  - merged into mainline, compare https://github.com/torvalds/linux/commit/8c8d4d45204902e144abc0f15b7c658828028fa1
- pre-requisites for `kernel.sem`, `kernel.msg*`, `fs.mqueue.*` on the node-level whitelist:
  - pod cgroups active by default (compare Pod Resource Management)
  - kmem accounting active by default
  - kernel patches for 4.5+ (merged since 4.9)
- reconsider what to do with `kernel.shm*` and other resource-limit sysctls with proper isolation: (a) keep them in the API (b) set node-level defaults
Setting Sysctls on the Pod Level
In Linux, the sysctl interface allows an administrator to modify kernel
parameters at runtime. Parameters are available via the `/proc/sys/` virtual
process file system. The parameters cover various subsystems such as:
- kernel (common prefix: `kernel.`)
- networking (common prefix: `net.`)
- virtual memory (common prefix: `vm.`)
- MDADM (common prefix: `dev.`)
More subsystems are described in Kernel docs.
To get a list of basic prefixes on your system, you can run
$ sudo sysctl -a | cut -d' ' -f1 | cut -d'.' -f1 | sort -u
To get a list of all parameters, you can run
$ sudo sysctl -a
A number of them are namespaced and can therefore be set for a container independently with today's Linux kernels.
Note: This proposal - while sharing some use-cases - does not cover ulimits (compare Expose or utilize docker's rlimit support).
A number of Linux applications need certain kernel parameter settings to
- either run at all
- or perform well.
In Kubernetes we want to allow setting these parameters within a pod specification in order to enable the use of the platform for those applications.
With Docker version 1.11.1 it is possible to change kernel parameters inside privileged containers. However, the process is purely manual and the changes might be applied across all containers affecting the entire host system. It is not possible to set the parameters within a non-privileged container.
With docker#19265 docker-run as of 1.12.0 supports setting a number of whitelisted sysctls during the container creation process.
Some real-world examples for the use of sysctls:
- PostgreSQL requires `kernel.shmmax` and `kernel.shmall` (among others) to be set to reasonably high values (compare PostgreSQL Manual 17.4.1 Shared Memory and Semaphores). The default of 32 MB for shared memory is not reasonable for a database.
- RabbitMQ proposes a number of sysctl settings to optimize networking: https://www.rabbitmq.com/networking.html.
- Web applications with many concurrent connections require high values for `net.core.somaxconn`.
- A containerized IPv6 routing daemon requires e.g. `/proc/sys/net/ipv6/conf/all/forwarding` and `/proc/sys/net/ipv6/conf/all/accept_redirects` (compare docker#4717).
- The nginx ingress controller in kubernetes/contrib uses a privileged sidekick container to set `net.core.somaxconn` and `net.ipv4.ip_local_port_range`.
- A huge software-as-a-service provider uses shared memory (`kernel.shm*`) and message queues (`kernel.msg*`) to communicate between containers of their web-serving pods, configuring up to 20 GB of shared memory. For optimal network layer performance they set `net.core.rmem_max`, `net.core.wmem_max`, `net.ipv4.tcp_rmem` and `net.ipv4.tcp_wmem` to much higher values than the kernel defaults.
- Linux tuning guides for 10G Ethernet suggest setting `net.core.rmem_max`/`net.core.wmem_max` to values as high as 64 MB and similar dimensions for `net.ipv4.tcp_rmem`/`net.ipv4.tcp_wmem`. It is noted that "tuning settings described here will actually decrease performance of hosts connected at rates of OC3 (155 Mbps) or less."
- For integration of a web backend with the load-balancer retry mechanics it is suggested in http://serverfault.com/questions/518862/will-increasing-net-core-somaxconn-make-a-difference: "Sometimes it's preferable to fail fast and let the load-balancer do its job (retry) than to make the user wait - for that purpose we set net.core.somaxconn to any value, and limit the application backlog to e.g. 10 and set net.ipv4.tcp_abort_on_overflow to 1." In other words, sysctls can radically change the observable application behavior from the view of the load-balancer.
As an administrator I want to set customizable kernel parameters for a container
- To be able to limit consumed kernel resources
  - so I can provide more resources to other containers
  - to restrict system communication that slows down the host or other containers
  - to protect against programming errors like resource leaks
  - to protect against DDoS attacks.
- To be able to increase limits for certain applications while not changing the default for all containers on a host
  - to enable resource hungry applications like databases to perform well while the default limits for all other applications can be kept low
  - to enable many network connections e.g. for web backends
  - to allow special memory management like Java hugepages.
- To be able to enable kernel features.
  - to enable containerized execution of special purpose applications without the need to enable those kernel features host-wide, e.g. ip forwarding for network router daemons
- Only namespaced kernel parameters can be modified
- Resource isolation is ensured for all safe sysctls. Sysctls with unclear, weak or non-existent isolation are called unsafe sysctls. The latter are disabled by default.
- Built on top of the existing security context work
- Be container-runtime agnostic
  - on the API level
  - the implementation (and the set of supported sysctls) will depend on the runtime
- Kernel parameters can be set during a container creation process only.
- Updating kernel parameters in running containers is out of scope.
- Integration with the new container runtime proposal (kubernetes/kubernetes#25899) is out of scope.
- Hugepages support (compare docker#4717) - while also partly configured through sysctls (`vm.nr_hugepages`, compare http://andrigoss.blogspot.de/2008/02/jvm-performance-tuning.html) - is out-of-scope for this proposal as it is not namespaced and, as a limited resource (similar to normal memory), needs deeper integration e.g. with the scheduler.
Supported sysctls (whitelist) as of Docker 1.12.0:
- IPC namespace
  - System V: `kernel.msgmax`, `kernel.msgmnb`, `kernel.msgmni`, `kernel.sem`, `kernel.shmall`, `kernel.shmmax`, `kernel.shmmni`, `kernel.shm_rmid_forced`
  - POSIX queues: `fs.mqueue.*`
- network namespace: `net.*`
Error behavior:
- non-whitelisted sysctls are rejected:

  ```
  $ docker run --sysctl=foo=bla -it busybox /bin/sh
  invalid value "foo=bla" for flag --sysctl: sysctl 'foo=bla' is not whitelisted
  See 'docker run --help'.
  ```
Applied changes:
Related issues:
Supported sysctls (whitelist) as of RunC 0.1.1 (compare libcontainer config validator):
- IPC namespace
  - System V: `kernel.msgmax`, `kernel.msgmnb`, `kernel.msgmni`, `kernel.sem`, `kernel.shmall`, `kernel.shmmax`, `kernel.shmmni`, `kernel.shm_rmid_forced`
  - POSIX queues: `fs.mqueue.*`
- network namespace: `net.*`
Applied changes:
The only sysctl support in rkt is through a CNI plugin. The Kubernetes network plugin kubenet
can easily be extended to call this with a given list of sysctls during pod launch.
The default network plugin for rkt is `no-op` though. This mode leaves all network initialization to rkt itself. Rkt in turn uses the static CNI plugin configuration in `/etc/rkt/net.d`. This does not allow customizing the sysctls per pod. Hence, in order to implement this proposal in `no-op` mode, additional changes in rkt are necessary.
Supported sysctls (whitelist):
- network namespace: `net.*`
Applied changes:
Issues:
- Each pod has its own network stack that is shared among its containers. A privileged side-kick or init container (compare https://git.k8s.io/contrib/ingress/controllers/nginx/examples/sysctl/change-proc-values-rc.yaml#L80) is able to set `net.*` sysctls. Clearly, this is completely uncontrolled by the kubelet, but it is a usable work-around if privileged containers are permitted in the environment. As privileged container permissions (in the admission controller) are an all-or-nothing decision and the actual code executed in them is not limited, allowing privileged containers might be a security threat.
  The same work-around also works for shared memory and message queue sysctls as they are shared among the containers of a pod in their ipc namespace.
- Instead of giving the user a way to set sysctls for his pods, an alternative seems to be to set high values for the limits of interest from the beginning inside the kubelet or the runtime. Then - so the theory goes - the user's pods operate under quasi-unlimited bounds.
  This might be true for some of the sysctls which purely set limits for some host resources, but
  - some sysctls influence the behavior of the application, e.g.:
    - `kernel.shm_rmid_forced` adds garbage collection semantics to shared memory segments when the owning processes die. This is against the System V standard though.
    - `net.ipv4.tcp_abort_on_overflow` makes the kernel send RST packets when the application is overloaded, giving a load-balancer the chance to reschedule a request to another backend.
  - some sysctls lead to changed resource requirement characteristics, e.g.:
    - `net.ipv4.tcp_rmem`/`net.ipv4.tcp_wmem` not only define min and max values, but also the default tcp window buffer size for each socket. While large values are necessary for certain environments and applications, they lead to a waste of resources in the 90% case.
  - some sysctls have a different error behavior, e.g.:
    - creating a shared memory segment will fail immediately when `kernel.shmmax` is too small. With a large `kernel.shmmax` default, the creation of a segment always succeeds, but the OOM killer will do its job when a shared memory segment exceeds the memory request of the container.
  - The high values that could be set by the kubelet on launch might depend on the node's capacity and capabilities. But for portability of workloads it is helpful to have a common baseline of sysctl settings one can expect on every node. The kernel defaults (which are active if the kubelet does not change them) are such a (natural) baseline.
- One could imagine offering certain non-namespaced sysctls as well, which would taint a host such that only containers with compatible sysctl settings are scheduled there. Scheduling pods with certain sysctls onto certain hosts according to some given rules is considered out of scope; this must be done manually by the admin, e.g. by using taints and tolerations.
- (Next to namespacing) isolation is the key requirement for a sysctl to be unconditionally allowed in a pod spec. There are the following alternatives:
  - allow only namespaced and isolated sysctls (= safe) in the API
  - allow only namespaced and isolated sysctls by default and make all other namespaced sysctls with unclear or weak isolation (= unsafe) opt-in by the cluster admin.

  For v1.4 only a handful of safe sysctls are defined. There are known, non-edge-case use-cases (see above) for a number of further sysctls. Some of them (especially the ipc sysctls) will probably be promoted onto the whitelist of safe sysctls in the near future when Kubernetes implements better resource isolation.
  On the other hand, especially in the `net.*` hierarchy there are a number of very low-level knobs to tune the network stack. They might be necessary for classes of applications requiring high-performance or realtime behavior. It is hard to foresee which knobs will be necessary in the future. At the same time the `net.*` hierarchy is huge, making a deep analysis on a 1-on-1 basis hard. If there is no way to use them at-your-own-risk, those users are forced into the use of privileged containers. This might be a security threat and a no-go for certain environments. Sysctls in the API (even if unsafe) in contrast allow fine-grained control by the cluster admin without essentially opening up root access to the cluster nodes for some users.
  This requirement for a large number of accessible sysctls must be balanced though with the desire to have a minimal API surface: removing certain (unsafe) sysctls from an official API in a later version (e.g. because they turned out to be problematic for node health) is painful.
  To balance those two desires the API can be split in half: one official way to declare safe sysctls in a pod spec (this one will be promoted to beta and stable some day) and an alternative way to define unsafe sysctls. Possibly the second way will stay alpha forever to make it clear that unsafe sysctls are not a stable API of Kubernetes. Moreover, for all unsafe sysctls an opt-in policy is desirable, only controllable by the cluster admin, not by each cluster user.
Note: The kmem accounting has fundamentally changed in kernel 4.5 (compare https://github.com/torvalds/linux/commit/a9bb7e620efdfd29b6d1c238041173e411670996): older kernels (e.g. 4.4 from Ubuntu 16.04, 3.10 from CentOS 7.2) use a blacklist (`__GFP_NOACCOUNT`), newer kernels (e.g. 4.6.x from Fedora 24) use a whitelist (`__GFP_ACCOUNT`). In the following the analysis is done for kernel >= 4.5:
- `kernel.shmall`, `kernel.shmmax`, `kernel.shmmni`: configure System V shared memory
  - namespaced in ipc ns
  - accounted for as user memory in memcg, using sparse allocation (like tmpfs; uses the resizable virtual memory filesystem)
  - hence safe to customize
  - no application influence with high values
  - defaults to unlimited pages, unlimited size, 4096 segments on today's kernels. This makes customization practically unnecessary, at least for the segment sizes. IBM's DB2 suggests `256 * GB of RAM` for `kernel.shmmni` (compare http://www.ibm.com/support/knowledgecenter/SSEPGG_10.1.0/com.ibm.db2.luw.qb.server.doc/doc/c0057140.html), exceeding the kernel defaults for machines with >16 GB of RAM.
- `kernel.shm_rmid_forced`: enforce removal of shared memory segments on process shutdown
  - namespaced in ipc ns
- `kernel.msgmax`, `kernel.msgmnb`, `kernel.msgmni`: configure System V messages
  - namespaced in ipc ns
  - temporarily allocated in kmem in a linked message list, but not accounted for in memcg with kernel >= 4.5
  - defaults to 8 kb max packet size, 16384 kb total queue size, 32000 queues, which might be too small for certain applications
  - arbitrary values up to INT_MAX. Hence, a potential DoS attack vector against the host.
    Even without using a sysctl the kernel default allows any pod to allocate 512 MB of message memory (compare https://github.com/sttts/kmem-ipc-msg-queues as a test-case). If kmem accounting is not active, this is outside of the pod resource limits. Then a node with 8 GB will not survive >16 replicas of such a pod (16 x 512 MB = 8 GB).
- `fs.mqueue.*`: configure POSIX message queues
  - namespaced in ipc ns
  - uses the same `load_msg` as System V messages, i.e. no accounting for kernel >= 4.5
  - does strict checking against rlimits though
  - defaults to 256 queues, max queue length 10, message size 8 kb
  - can be customized via sysctls up to 64k max queue length and 16 MB message size. Hence, a potential DoS attack vector against the host
- `kernel.sem`: configure System V semaphores
  - namespaced in ipc ns
  - uses plain kmalloc and vmalloc without accounting
  - defaults to 32000 ids and 32000 semaphores per id (needing a double-digit number of bytes each), probably enough for all applications: "The values has been chosen to be larger than necessary for any known configuration." (linux/sem.h)
- `net.*`: configure the network stack
  - `net.core.somaxconn`: maximum queue length specifiable by listen
    - namespaced in net ns
    - might have application influence for high values as it limits the socket queue length
    - [?] No real evidence found until now for accounting. The limit is checked by `sk_acceptq_is_full` at http://lxr.free-electrons.com/source/net/ipv4/tcp_ipv4.c#L1276. After that a new socket is created. Probably the tcp socket buffer sysctls apply then, with their accounting, see below.
    - very unreliable tcp memory accounting. There have been a number of attempts to drop that from the kernel completely, e.g. https://lkml.org/lkml/2014/9/12/401. On Fedora 24 (4.6.3) tcp accounting did not work at all, on Ubuntu 16.04 (4.4) it kind of worked in the root-cg, but in containers only values copied from the root-cg appeared.
  - `net.ipv4.tcp_rmem`/`net.ipv4.tcp_wmem`/`net.core.rmem_max`/`net.core.wmem_max`: socket buffer sizes
    - not namespaced in net ns, and they are not even visible under `/proc/sys/net` inside a container
  - `net.ipv4.ip_local_port_range`: local tcp/udp port range
    - namespaced in net ns
    - no memory involved
  - `net.ipv4.tcp_max_syn_backlog`: number of half-open connections
    - not namespaced
  - `net.ipv4.tcp_syncookies`: enable syn cookies
    - namespaced in net ns
    - no memory involved
The individual analysis above leads to the following summary of:
- namespacing (ns) - the sysctl is set in this namespace, independently from the parent/root namespace
- accounting (acc.) - the memory resources caused by the sysctl are accounted for by the given cgroup

Kernels <= 4.4 and >= 4.5 have fundamentally different kernel memory accounting (see the note above). The two columns describe the two cases.
sysctl | ns | acc. (kernel <= 4.4) | acc. (kernel >= 4.5) |
---|---|---|---|
kernel.shm* | ipc | user memcg 1) | user memcg 1) |
kernel.msg* | ipc | kmem memcg 3) | - 3) |
fs.mqueue.* | ipc | kmem memcg | - |
kernel.sem | ipc | kmem memcg | - |
net.core.somaxconn | net | unreliable 4) | unreliable 4) |
net.*.tcp_wmem/rmem | - 2) | unreliable 4) | unreliable 4) |
net.core.wmem/rmem_max | - 2) | unreliable 4) | unreliable 4) |
net.ipv4.ip_local_port_range | net | not needed 5) | not needed 5) |
net.ipv4.tcp_syncookies | net | not needed 5) | not needed 5) |
net.ipv4.tcp_max_syn_backlog | - 2) | ? | ? |
Footnotes:
1. a pod memory cgroup is necessary to catch segments from a dying process.
2. only available in the root-ns, not even visible in a container
3. compare https://github.com/sttts/kmem-ipc-msg-queues as a test-case
4. in theory socket buffers should be accounted for by the kmem.tcp memcg counters. In practice this only worked very unreliably and not reproducibly, on some kernels not at all. kmem.tcp accounting seems to be deprecated, and patches have been posted on lkml to drop this broken feature.
5. because no memory is involved, i.e. a purely functional difference
Note: for all sysctls marked as "kmem memcg" kernel memory accounting must be enabled in the container for proper isolation. This will not be the case for 1.4, but is planned for 1.5.
From the previous analysis the following classification is derived:
sysctl | ns | accounting | reclaim | pre-requisites |
---|---|---|---|---|
kernel.shm* | pod | container | pod | i 1) |
kernel.msg* | pod | container | pod | i + ii + iii |
fs.mqueue.* | pod | container | pod | i + ii + iii |
kernel.sem | pod | container | pod | i + ii + iii |
net.core.somaxconn | pod | container | container | i + ii + iv |
net.*.tcp_wmem/rmem | host | container | container | i + ii + iv |
net.core.wmem/rmem_max | host | container | container | i + ii + iv |
net.ipv4.ip_local_port_range | pod | n/a | n/a | - |
net.ipv4.tcp_syncookies | pod | n/a | n/a | - |
net.ipv4.tcp_max_syn_backlog | pod | n/a | n/a | - |
Explanation:
- ns: value is namespaced on this level
- accounting: memory is accounted for against limits of this level
- reclaim: in the worst case, memory resources fall-through to this level and are accounted for there until they get destroyed
- pre-requisites:
  - (i) pod level cgroups
  - (ii) kmem accounting enabled in Kubernetes
  - (iii) kmem accounting fixes for ipc namespace in kernel >= 4.5
  - (iv) reliable kernel tcp net buffer accounting, which probably means waiting for cgroups v2.
Footnote:
1. Pod level cgroups don't exist today and pages are already re-parented on container deletion in v1.3. So supporting pod level sysctls in v1.4 that are tracked by user space memcg is not introducing any regression.
Note: with the exception of `kernel.shm*`, all of the listed pod-level sysctls depend on kernel memory accounting to be enabled for proper resource isolation. This will not be the case for 1.4 by default, but is planned for 1.5.
Note: all the ipc objects persist when the originating container dies. Their resources (if kmem accounting is enabled) fall back to the parent cgroup. As long as there is no pod level memory cgroup, the parent will be the container runtime, e.g. the docker daemon or the RunC process. It is planned with v1.5 to introduce a pod level memory cgroup which will fix this problem.
Note: in general it is good practice to reserve special nodes for those pods which set sysctls which the kernel does not guarantee proper isolation for.
Sysctls in pods and `PodSecurityPolicy` are first introduced as an alpha feature for Kubernetes 1.4. This means that the API will model these as annotations, with the plan to turn them into first-class citizens in a later release when the feature is promoted to beta.
It is proposed to use a syntactical validation in the apiserver and a node-level whitelist of safe sysctls in the kubelet. The whitelist shall be fixed per version and might grow in the future when better resource isolation is in place in the kubelet. In addition a list of allowed unsafe sysctls will be configured per node by the cluster admin, with an empty list as the default.
The following rules apply:
- Only sysctls shall be whitelisted in the kubelet
  - that are properly namespaced by the container or the pod (e.g. in the ipc or net namespace)
  - and that cannot lead to resource consumption outside of the limits of the container or the pod. These are called safe.
- The cluster admin shall only be able to manually enable sysctls in the kubelet
  - that are properly namespaced by the container or the pod (e.g. in the ipc or net namespace). These are called unsafe.
This means that sysctls that are not namespaced must be set by the admin on host level at his own risk, e.g. by running a privileged daemonset, possibly limited to a restricted, special-purpose set of nodes, if necessary with the host network namespace. This is considered out-of-scope of this proposal and out-of-scope of what the kubelet will do for the admin. A section is going to be added to the documentation describing this.
The allowed unsafe sysctls will be configurable on the node via a flag of the kubelet.
Pod specification must be changed to allow the specification of kernel parameters:
```go
// Sysctl defines a kernel parameter to be set
type Sysctl struct {
	// Name of a property to set
	Name string `json:"name"`
	// Value of a property to set
	Value intstr.IntOrString `json:"value"`
	// Must be true for unsafe sysctls.
	Unsafe bool `json:"unsafe,omitempty"`
}
```
```go
// PodSecurityContext holds pod-level security attributes and common container settings.
// Some fields are also present in container.securityContext. Field values of
// container.securityContext take precedence over field values of PodSecurityContext.
type PodSecurityContext struct {
	...
	// Sysctls hold a list of namespaced sysctls used for the pod. Pods with unsupported
	// sysctls (by the container runtime) might fail to launch.
	Sysctls []Sysctl `json:"sysctls,omitempty"`
}
```
During alpha the extension of `PodSecurityContext` is modeled with annotations:

```yaml
security.alpha.kubernetes.io/sysctls: kernel.shm_rmid_forced=1
security.alpha.kubernetes.io/unsafe-sysctls: net.ipv4.route.min_pmtu=1000,kernel.msgmax=1 2 3
```

The value is a comma-separated list of key-value pairs, each of the form `name=value`.
Safe sysctls may be declared with `unsafe: true` (or in the respective annotation), while for unsafe sysctls `unsafe: true` is mandatory. This guarantees backwards-compatibility in future versions when sysctls have been promoted to the whitelist: old pod specs will still work.
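For illustration, here is a rough Go sketch (not the final implementation; the helper name and error handling are made up) of how such an annotation value could be parsed into the `Sysctl` type defined above, with `unsafe` set to true for the `unsafe-sysctls` annotation:

```go
import (
	"fmt"
	"strings"

	"k8s.io/kubernetes/pkg/util/intstr"
)

// parseSysctlAnnotation splits a comma-separated list of name=value pairs.
// unsafe is true when parsing the security.alpha.kubernetes.io/unsafe-sysctls
// annotation.
func parseSysctlAnnotation(annotation string, unsafe bool) ([]Sysctl, error) {
	var sysctls []Sysctl
	for _, kv := range strings.Split(annotation, ",") {
		parts := strings.SplitN(kv, "=", 2)
		if len(parts) != 2 {
			return nil, fmt.Errorf("sysctl %q is not of the form name=value", kv)
		}
		sysctls = append(sysctls, Sysctl{
			Name:   parts[0],
			Value:  intstr.FromString(parts[1]), // values may contain spaces, e.g. "1 2 3"
			Unsafe: unsafe,
		})
	}
	return sysctls, nil
}
```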
Possibly, the `security.alpha.kubernetes.io/unsafe-sysctls` annotation will stay as an alpha API (replacing the `Unsafe bool` field) even when `security.alpha.kubernetes.io/sysctls` has been promoted to beta or stable. This helps to make clear that unsafe sysctls are not a stable feature.
Note: none of the whitelisted sysctls (and in general none with the exception of descriptive plain-text ones) uses anything other than numbers, possibly separated by spaces.
Note: sysctls must be on the pod level because containers in a pod share IPC and network namespaces (if `pod.spec.hostIPC` and `pod.spec.hostNetwork` are false) and therefore cannot have conflicting sysctl values. Moreover, note that all namespaced sysctls supported by Docker/RunC are either in the IPC or the network namespace.
The name of each sysctl in `PodSecurityContext.Sysctls[*].Name` (or the annotation `security.alpha.kubernetes.io/[unsafe-]sysctls` during alpha) is validated by the apiserver against:
- a maximum length of 253 characters
- a match of `sysctlRegexp`:

```go
const SysctlSegmentFmt string = "[a-z0-9]([-_a-z0-9]*[a-z0-9])?"
const SysctlFmt string = "(" + SysctlSegmentFmt + "\\.)*" + SysctlSegmentFmt

var sysctlRegexp = regexp.MustCompile("^" + SysctlFmt + "$")
```
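A minimal sketch of this validation step, reusing the `sysctlRegexp` from the block above; the helper name and the constant are illustrative, the 253-character bound mirrors the rule stated above:

```go
const sysctlMaxLength = 253 // maximum accepted length of a sysctl name

// isValidSysctlName reports whether a sysctl name passes the apiserver's
// syntactical validation.
func isValidSysctlName(name string) bool {
	return len(name) <= sysctlMaxLength && sysctlRegexp.MatchString(name)
}
```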
The name of each sysctl in `PodSecurityContext.Sysctls[*].Name` (or the annotation `security.alpha.kubernetes.io/[unsafe-]sysctls` during alpha) is checked by the kubelet against a static whitelist. The whitelist is defined under `pkg/kubelet` and is to be maintained by the nodes team.
The initial whitelist of safe sysctls will be:
```go
var whitelist = []string{
	"kernel.shm_rmid_forced",
	"net.ipv4.ip_local_port_range",
	"net.ipv4.tcp_syncookies",
	"net.ipv4.tcp_max_syn_backlog",
}
```
In parallel a namespace list is maintained with all sysctls and their respective, known kernel namespaces. This is initially derived from Docker's internal sysctl whitelist:
```go
var namespaces = map[string]string{
	"kernel.sem": "ipc",
}

var prefixNamespaces = map[string]string{
	"kernel.msg": "ipc",
	"kernel.shm": "ipc",
	"fs.mqueue.": "ipc",
	"net.":       "net",
}
```
If a pod is created with host ipc or host network namespace, the respective sysctls are forbidden.
Pods that do not comply with the syntactical sysctl format will be rejected by the apiserver. Pods that do not comply with the whitelist (or are not manually enabled as allowed unsafe sysctls for a node by the cluster admin) will fail to launch. An event will be created by the kubelet to notify the user.
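A sketch of the kubelet-side decision described above, reusing the `Sysctl` type and the whitelist from the previous blocks; the helper names are illustrative, not the actual kubelet code:

```go
import "strings"

// matchesList reports whether name matches one of the entries, which may be
// plain sysctl names or prefixes ending in "*".
func matchesList(name string, entries []string) bool {
	for _, e := range entries {
		if strings.HasSuffix(e, "*") {
			if strings.HasPrefix(name, strings.TrimSuffix(e, "*")) {
				return true
			}
		} else if name == e {
			return true
		}
	}
	return false
}

// admitSysctl: safe, whitelisted sysctls are always admitted; everything else
// must be marked unsafe in the pod spec and be enabled by the cluster admin
// via --experimental-allowed-unsafe-sysctls on this node.
func admitSysctl(s Sysctl, whitelist, allowedUnsafe []string) bool {
	if matchesList(s.Name, whitelist) {
		return true
	}
	return s.Unsafe && matchesList(s.Name, allowedUnsafe)
}
```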
The kubelet will get a new flag:
```
--experimental-allowed-unsafe-sysctls   Comma-separated whitelist of unsafe
                                        sysctls or unsafe sysctl patterns
                                        (ending in *). Use these at your own
                                        risk.
```
It defaults to the empty list.
During kubelet launch the given value is checked against the list of known namespaces for sysctls or sysctl prefixes. If a namespace is not known, the kubelet will terminate with an error.
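The startup check could look roughly like the following sketch, reusing the `namespaces` and `prefixNamespaces` maps from above (the function name and error messages are made up):

```go
import (
	"fmt"
	"strings"
)

// validateAllowedUnsafeSysctls verifies that every flag entry resolves to a
// known sysctl namespace; otherwise the kubelet refuses to start.
func validateAllowedUnsafeSysctls(allowed []string) error {
	for _, s := range allowed {
		if strings.HasSuffix(s, "*") {
			if _, ok := prefixNamespaces[strings.TrimSuffix(s, "*")]; ok {
				continue
			}
			return fmt.Errorf("unknown namespace for sysctl pattern %q", s)
		}
		if _, ok := namespaces[s]; ok {
			continue
		}
		// fall back to a prefix lookup for plain names, e.g. kernel.shmmax -> kernel.shm
		known := false
		for prefix := range prefixNamespaces {
			if strings.HasPrefix(s, prefix) {
				known = true
				break
			}
		}
		if !known {
			return fmt.Errorf("unknown namespace for sysctl %q", s)
		}
	}
	return nil
}
```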
A list of permissible sysctls is to be added to `pkg/apis/extensions/types.go` (compare pod-security-policy):
```go
// PodSecurityPolicySpec defines the policy enforced.
type PodSecurityPolicySpec struct {
	...
	// Sysctls is a white list of allowed sysctls in a pod spec. Each entry
	// is either a plain sysctl name or ends in "*" in which case it is considered
	// as a prefix of allowed sysctls.
	Sysctls []string `json:"sysctls,omitempty"`
}
```
The `simpleProvider` in `pkg/security/podsecuritypolicy` will validate the value of `PodSecurityPolicySpec.Sysctls` against the sysctls of a given pod in `ValidatePodSecurityContext`.
The default policy will be `*`, i.e. all syntactically correct sysctls are admitted by the `PodSecurityPolicySpec`.
The `PodSecurityPolicySpec` applies to safe and unsafe sysctls in the same way.
During alpha the following annotation will be used on `PodSecurityPolicy` objects to customize the allowed sysctls:

```yaml
security.alpha.kubernetes.io/sysctls: kernel.shmmax,kernel.msgmax,fs.mqueue.*
```
Note: This does not override the whitelist or the allowed unsafe sysctls on the nodes. They still apply. This only changes admission of pods in the apiserver. Pods can still fail to launch due to failed admission on the kubelet.
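A rough sketch of the check the `simpleProvider` could perform in `ValidatePodSecurityContext`, with a trailing `*` in a policy entry interpreted as a prefix (and `*` alone matching everything); the function name is illustrative:

```go
import (
	"fmt"
	"strings"
)

// sysctlsAllowedByPolicy verifies that every sysctl of a pod matches one of
// the PodSecurityPolicySpec.Sysctls entries.
func sysctlsAllowedByPolicy(podSysctls []Sysctl, policy []string) error {
	for _, s := range podSysctls {
		allowed := false
		for _, p := range policy {
			if p == "*" ||
				(strings.HasSuffix(p, "*") && strings.HasPrefix(s.Name, strings.TrimSuffix(p, "*"))) ||
				s.Name == p {
				allowed = true
				break
			}
		}
		if !allowed {
			return fmt.Errorf("sysctl %q is not allowed by the PodSecurityPolicy", s.Name)
		}
	}
	return nil
}
```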
```go
// SysctlPolicy defines how a sysctl may be set. If neither Values,
// nor Min, Max are set, any value is allowed.
type SysctlPolicy struct {
	// Name is the name of a sysctl or a pattern for a name. It consists of
	// dot separated name segments. A name segment matches [a-z]+[-_a-z0-9]* or
	// equals "*". The latter is interpreted as a wildcard for that name
	// segment.
	Name string `json:"name"`
	// Values are allowed values to be set. Either Values is
	// set or Min and Max.
	Values []string `json:"values,omitempty"`
	// Min is the minimal value allowed to be set.
	Min *int64 `json:"min,omitempty"`
	// Max is the maximum value allowed to be set.
	Max *int64 `json:"max,omitempty"`
}

// PodSecurityPolicySpec defines the policy enforced on sysctls.
type PodSecurityPolicySpec struct {
	...
	// Sysctls is a white list of allowed sysctls in a pod spec.
	Sysctls []SysctlPolicy `json:"sysctls,omitempty"`
}
```
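To make the semantics of `SysctlPolicy` concrete, here is a sketch of how a single requested value might be checked against one policy entry; the helper is illustrative and not part of the proposed API:

```go
import "strconv"

// valueAllowed mirrors the comment on SysctlPolicy: if neither Values nor
// Min/Max are set, any value is allowed.
func valueAllowed(p SysctlPolicy, value string) bool {
	if len(p.Values) > 0 {
		for _, v := range p.Values {
			if v == value {
				return true
			}
		}
		return false
	}
	if p.Min != nil || p.Max != nil {
		n, err := strconv.ParseInt(value, 10, 64)
		if err != nil {
			return false // Min/Max only make sense for numeric values
		}
		if p.Min != nil && n < *p.Min {
			return false
		}
		if p.Max != nil && n > *p.Max {
			return false
		}
		return true
	}
	return true
}
```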
During alpha the following annotation will be used:
```yaml
security.alpha.kubernetes.io/sysctls: kernel.shmmax,kernel.msgmax=max:10:min:1,kernel.msgmni=values:1000 2000 3000
```
This extended syntax is a natural extension of that of alternative 1 and therefore can be implemented any time during alpha.
Alternative 1 or 2 has to be chosen for the external API once the feature is promoted to beta.
Finally, the container runtime will interpret `pod.spec.securityContext.sysctls`, e.g. in the case of Docker the `DockerManager` will apply the given sysctls to the infra container in `createPodInfraContainer`.
In a later implementation of a container runtime interface (compare kubernetes/kubernetes#25899), sysctls will be part of `LinuxPodSandboxConfig` (compare kubernetes/kubernetes#25899 (comment)) and will be applied to the `PodSandbox` by the `PodSandboxManager` implementation.
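For the Docker case, the translation is essentially a name/value map: Docker's create API accepts a `Sysctls map[string]string` in the host config, the equivalent of `docker run --sysctl`. A minimal sketch of that step, with an illustrative helper name:

```go
// sysctlsToDockerMap converts the pod-level sysctls into the name/value map
// handed to the infra container's host config.
func sysctlsToDockerMap(sysctls []Sysctl) map[string]string {
	m := make(map[string]string, len(sysctls))
	for _, s := range sysctls {
		m[s.Name] = s.Value.String() // IntOrString renders ints and strings alike
	}
	return m
}
```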
Here is an example of a pod that has the safe sysctl `net.ipv4.ip_local_port_range` set to `1024 65535` and the unsafe sysctl `net.ipv4.route.min_pmtu` set to `1000`:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
  securityContext:
    sysctls:
    - name: net.ipv4.ip_local_port_range
      value: "1024 65535"
    - name: net.ipv4.route.min_pmtu
      value: 1000
      unsafe: true
```
Here is an example of a `PodSecurityPolicy`, allowing `kernel.shmmax`, `kernel.shmall` and all `net.*` sysctls to be set:
```yaml
apiVersion: v1
kind: PodSecurityPolicy
metadata:
  name: database
spec:
  sysctls:
  - kernel.shmmax
  - kernel.shmall
  - net.*
```
and a restricted default `PodSecurityPolicy`:
```yaml
apiVersion: v1
kind: PodSecurityPolicy
metadata:
  name:
spec:
  sysctls: # none
```
in contrast to a permissive default `PodSecurityPolicy`:
```yaml
apiVersion: v1
kind: PodSecurityPolicy
metadata:
  name:
spec:
  sysctls:
  - "*"
```