- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- (R) Graduation criteria is in place
- (R) Production readiness review completed
- Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Privileged containers are containers that run with access to the host similar to processes running directly on the host. With privileged containers, users can package and distribute management operations and functionality that require host access while retaining the versioning and deployment methods provided by containers. Linux privileged containers are currently used for a variety of key scenarios in Kubernetes, including kube-proxy (via kubeadm), storage, and networking scenarios. Support for these scenarios on Windows currently requires workarounds via proxies or other implementations. This proposal aims to extend the Windows container model to support privileged containers, and to enable host network mode for privileged networking scenarios. Enabling privileged containers and host network mode for privileged containers would allow users to package and distribute key functionality requiring host access.
The lack of privileged container support in the Windows container model has resulted in separate workarounds and privileged proxies for Windows workloads that are not required for Linux workloads. These workarounds have provided the necessary functionality for key scenarios such as networking, storage, and device access, but they have also presented many challenges, including an increased attack surface, complex change and update management, and scenario-specific solutions. There is significant interest from the community in the Windows container model supporting privileged containers and host network mode (which enables pods to be created in the host's network compartment/namespace, as opposed to getting their own), so workloads can transition off such workarounds and align more closely with Linux support and operational models.
Furthermore, since kube-proxy cannot be run as a privileged daemonset, it must either be run via a proxy or directly on the host as a service. In the case that it is run as a service, the admin kubeconfig must be stored on the Windows node, which poses a security concern. The same is true for networking daemons such as Flannel.
- To provide a method to build, launch, and run a Windows-based container with privileged access to host resources, including the host network service, devices, disks (including hostPath volumes), etc.
- To enable access to host network resources for privileged containers and pods with host network mode
- To provide access to host network resources for non-privileged containers and pods. This is a non-goal.
- To provide a privileged mode for Hyper-V containers, or a method to run privileged process containers within a Hyper-V isolation boundary. This is a non-goal as running a Hyper-V container in the root namespace from within the isolation boundary is not supported.
- To enable privileged containers for Docker. This is a non-goal; support will only be implemented for containerd.
Privileged daemon sets are used to deploy networking (CNI), storage (CSI), and device plugins, kube-proxy, and other agents to Linux nodes. Currently, similar set-up and deployment operations on Windows nodes rely on wins or proxies (e.g., csi-proxy, hns-proxy). With Windows privileged containers, privileged daemon sets will also be available to deploy the desired plugins and agents to Windows nodes. For networking scenarios, host network mode will enable these privileged deployments to access and configure host network resources.
Enablement of both host network mode and privileged containers will allow users to configure network policies between pods and the hosts. Privileged containers will also enable service mesh support on Windows by enabling the creation of an init container that can configure HNS.
Windows privileged containers would also enable deployment of single privileged containers to Windows nodes. Other functionality that privileged containers could enable, inside or outside of a daemon set deployment, includes device enumeration and monitoring add-ons, among others.
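For illustration, the following is a hedged sketch (in Go, using client-go API types) of what such a privileged Windows daemon set could look like if the existing privileged SecurityContext field is reused as this proposal describes; the image name and labels are placeholders.

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// privilegedWindowsDaemonSet builds a daemon set whose pods request host
// networking and the existing privileged SecurityContext flag, as this
// proposal intends for Windows nodes.
func privilegedWindowsDaemonSet() *appsv1.DaemonSet {
	privileged := true
	labels := map[string]string{"app": "windows-host-agent"} // placeholder label

	return &appsv1.DaemonSet{
		ObjectMeta: metav1.ObjectMeta{Name: "windows-host-agent", Namespace: "kube-system"},
		Spec: appsv1.DaemonSetSpec{
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					// Privileged Windows pods run in the host's network compartment.
					HostNetwork:  true,
					NodeSelector: map[string]string{"kubernetes.io/os": "windows"},
					Containers: []corev1.Container{{
						Name:  "agent",
						Image: "example.com/windows-host-agent:latest", // placeholder image
						SecurityContext: &corev1.SecurityContext{
							Privileged: &privileged,
						},
					}},
				},
			},
		},
	}
}

func main() {
	fmt.Println(privilegedWindowsDaemonSet().Name)
}
```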
Some interesting scenario examples:
- Cluster API
- CSI Proxy
- Logging Daemons
- Host network mode support is only targeted for privileged containers and pods.
- Privileged pods can only consist of privileged containers. Standard Windows Server containers or other non-privileged containers will not be supported. This is because containers in a Kubernetes pod share an IP. For the privileged containers with host network mode, this container IP will be the host IP. As a result, a pod cannot consist of a privileged container with the host IP and an unprivileged Windows Server container(s) sharing a vNic on the host with a different IP, or vice versa.
- We are currently investigating service mesh scenarios where privileged containers in a pod will need host networking access but run alongside non-privileged containers in a pod. This would require further changes and investigation.
Most of the fundamental changes needed to enable this feature for Windows containers are dependent on changes within hcsshim, which serves as the runtime (container creation and management) coordinator and shim layer for containerd. However:
- Several upstream changes are required to support this feature in Kubernetes, including changes to containerd, the OCI spec, the CRI, and the kubelet. The identified changes include (see the CRI and Kubelet Implementation Details below for more details):
- Containerd: enabling host network mode for privileged containers and pods (working prototype demo). The prototype uses a containerd runtime handler, but this proposal is to use the cri-api.
- OCI spec: https://github.com/opencontainers/runtime-spec
- [TBD]
- CRI-api:
- Adding a privileged field to the runtime spec and passing it through to containers
- Passing the security context and privileged flag of the runtime spec to the pod sandbox spec (not currently supported; open issue: kubernetes/kubernetes#92963)
- Kubelet: passing the privileged flag and Windows security context to the runtime spec, plus other PSP changes (see below).
- There are risks that changes at each of these levels may not be supported.
- If containerd changes are not supported, host network mode will not be enabled. This would restrict the scenarios that privileged containers would enable, as CNI plugins, network policy, etc. rely on host network mode for access to host network resources.
- If CRI changes to enable a privileged flag are not supported, there would be a less-ideal workaround via annotations in the pod container spec.
- The CRI changes may use an annotation in the OCI spec until the OCI updates are included.
Additionally, privileged containers may impact other pod security policies (PSPs) beyond allowPrivilegeEscalation. We will provide guidance similar to the Pod Security Standards for Windows privileged containers. There is an existing analysis for non-privileged containers which can be augmented with the details below. There is an open question whether PSP should be updated or whether the recommendation should be to use an out-of-tree enforcement tool such as Gatekeeper. The anticipated impacted PSPs include:
| Use case | Field name | Applicable | Scenario | Priority |
| --- | --- | --- | --- | --- |
| Running of privileged containers | privileged | yes | Required. | Alpha |
| Usage of host namespaces | hostPID, hostIPC | no | No support for PID and IPC since job objects don't provide isolation at that level. | N/A |
| Usage of host networking and ports | hostNetwork | yes | Will be in host network by default initially. Support to set the network to a different compartment may be desirable in the future. | Beta |
| Usage of volume types | volumes | no | Not applicable. | N/A |
| Usage of the host filesystem | allowedHostPaths | no | Job objects have full access to write to the root file system. The current design does not have a way to restrict access to read-only. | N/A |
| Allow specific FlexVolume drivers | allowedFlexVolumes | no | Not applicable. | N/A |
| Allocating an FSGroup that owns the pod's volumes | fsGroup (file system group) | no | The privileged container can be tied to run as a particular user that determines access to different FSGroups. | N/A |
| The user and group IDs of the container | runAsUser, runAsGroup, supplementalGroups | no | Assigning users to groups would have to occur at node provisioning, or via a privileged container deployment. | N/A |
| Allow privilege escalation | allowPrivilegeEscalation, defaultAllowPrivilegeEscalation | no | Privilege via job objects is not granularly configurable. | N/A |
| Linux capabilities | capabilities | no | The Windows OS has a concept of "capabilities" (referred to as "privileged constants"), but they are not supported in the platform today. | N/A |
| Restrictions that could be applied to Windows privileged containers | Other restrictions for job objects | TBD | There are restrictions that could be enabled via the job object, e.g., UI restrictions. | N/A |
| Use GMSA with privileged containers | GMSA (would need to implement) | yes | Required for auth to the domain controller. | GA |
Windows privileged containers will be implemented with Job Objects, a break from the previous container model using server silos. Job objects provide the ability to manage a group of processes as a group, and assign resource constraints to the processes in the job. Job objects have no process or file system isolation, enabling the privileged payload to view and edit the host file system with the correct permissions, among other host resources. The init process, and any processes it launches or that are explicitly launched by the user, are all assigned to the job object of that container. When the init process exits or is signaled to exit, all the processes in the job will be signaled to exit, the job handle will be closed and the storage will be unmounted.
- The container will be in the host’s network namespace (default network compartment) so it will have access to all the host’s network interfaces and have the host's IP as well.
- Resource limits (disk, memory, CPU count) will be applied to the job and will be job wide. For example, if a memory limit of 10 MB is set for the job and the memory allocations of all processes in the job add up to more than 10 MB, the limit is reached. This is the same behavior as other Windows container types. These limits would be specified the same way they are currently for whatever orchestrator/runtime is being used.
- The container's lifecycle will be managed by the container runtime just like other Windows container types.
- The privileged container should be able to run as any user that's available on the host or in the domain of the host machine. Password accounts are being investigated.
- Directory mounts (e.g., directory C:\test on the host mapped to C:\test in the privileged container) are still being investigated. Note that file system mapping in the current implementation would expose both the host filesystem and the container filesystem to the processes in the privileged container. We are still investigating how external mounts such as secrets or storage accounts would be added. The default visibility of the whole host filesystem may present backwards compatibility challenges in the future if host path mounting is enabled on Windows. Currently the only method of restriction would be via the user, in a way similar to UID or GID in the PSP. However, this host path mounting is unlikely to be enabled in the future due to the limitations of job objects.
- A slim base image may be required to satisfy hierarchy requirements from HCS. It was found that the graphdriver calls expect certain files to be present in a Windows image when un-tarring. These files turned out to be simply several registry hive files in /windows/system32/config (which can be empty).
- We are currently planning to release a simple image alongside this feature that includes the above-mentioned files, which can be used instead of traditional Windows base images such as Server Core. Server Core and other images that contain the necessary files would also work as privileged container base images.
- This is another ongoing area of investigation and open to feedback.
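To make the job-object model described above more concrete, the following is a minimal, Windows-only sketch using golang.org/x/sys/windows; the launched command is purely illustrative, and the real implementation lives in hcsshim rather than in code like this.

```go
//go:build windows

// A sketch of the job-object grouping described above: the container's init
// process (and anything it spawns) is tracked by a job object, and
// terminating the job ends every process in it.
package main

import (
	"log"
	"os/exec"

	"golang.org/x/sys/windows"
)

const (
	// Access rights AssignProcessToJobObject needs on the process handle
	// (PROCESS_SET_QUOTA | PROCESS_TERMINATE), defined locally for clarity.
	processSetQuota  = 0x0100
	processTerminate = 0x0001
)

func main() {
	// Create an anonymous job object to own the "container's" processes.
	job, err := windows.CreateJobObject(nil, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer windows.CloseHandle(job)

	// Launch the init process of the payload (illustrative command).
	cmd := exec.Command("powershell.exe", "-Command", "Start-Sleep -Seconds 60")
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}

	// Assign the process to the job; child processes inherit membership.
	proc, err := windows.OpenProcess(processSetQuota|processTerminate, false, uint32(cmd.Process.Pid))
	if err != nil {
		log.Fatal(err)
	}
	defer windows.CloseHandle(proc)

	if err := windows.AssignProcessToJobObject(job, proc); err != nil {
		log.Fatal(err)
	}

	// Stopping the "container" terminates every process in the job.
	if err := windows.TerminateJobObject(job, 0); err != nil {
		log.Fatal(err)
	}
}
```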
We will need to add a privileged field to the runtime spec. We can model this after the Linux pod security context and container security context, which use a boolean that is set to true for privileged containers. References:
For Windows, the same privileged field would need to be added to the Windows pod security context (open issue), and the container security context would need to be augmented with the privileged flag.
The new WindowsPodSandboxConfig:
message WindowsPodSandboxConfig {
WindowsSandboxSecurityContext security_context = 1;
}
A new WindowsSandboxSecurityContext:
message WindowsSandboxSecurityContext {
string run_as_user = 1;
bool privileged = 2;
}
The new field for the WindowsContainerSecurityContext:
message WindowsContainerSecurityContext {
string run_as_username = 1;
string credential_spec = 2;
bool privileged = 3;
}
No Kubernetes API changes are required. The kubelet can use the privileged flag from the SecurityContext API and pass it to the new privileged flag in the CRI layer.
Add functionality to kuberuntime_sandbox to split out the Linux sandbox creation and add Windows sandbox creation (WIP PR).
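A rough sketch of how the Windows sandbox creation path could populate the proposed fields is shown below. The `WindowsPodSandboxConfig` and `WindowsSandboxSecurityContext` types are local stand-ins for the proposed CRI messages above (they do not exist in the CRI today), and the function name is illustrative rather than the actual kubelet code.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// Local stand-ins for the proposed CRI messages; once the CRI change merges,
// these would instead come from the generated cri-api Go package.
type WindowsSandboxSecurityContext struct {
	RunAsUser  string
	Privileged bool
}

type WindowsPodSandboxConfig struct {
	SecurityContext *WindowsSandboxSecurityContext
}

// generateWindowsPodSandboxConfig sketches the Windows half of the sandbox
// split: per this proposal a privileged pod consists only of privileged
// containers, so any container's privileged flag is representative.
func generateWindowsPodSandboxConfig(pod *v1.Pod) *WindowsPodSandboxConfig {
	privileged := false
	for _, c := range pod.Spec.Containers {
		if c.SecurityContext != nil && c.SecurityContext.Privileged != nil && *c.SecurityContext.Privileged {
			privileged = true
			break
		}
	}
	return &WindowsPodSandboxConfig{
		SecurityContext: &WindowsSandboxSecurityContext{Privileged: privileged},
	}
}

func main() {
	privileged := true
	pod := &v1.Pod{Spec: v1.PodSpec{Containers: []v1.Container{{
		Name:            "agent",
		SecurityContext: &v1.SecurityContext{Privileged: &privileged},
	}}}}
	fmt.Printf("%+v\n", generateWindowsPodSandboxConfig(pod).SecurityContext)
}
```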
When using Windows privileged containers, host network mode will also be in effect, as the pod will automatically get the host IP. The PodSpec has a hostNetwork boolean field:
For Windows, this field needs to be addressed. During the alpha stage, the proposal is to add documentation noting that the field is only applicable to Linux. In addition, the host network mode check will need to be updated to check the privileged flag in the Windows PodSpec. This will be implemented in the following way (see the sketch after this list), with validation loosened in the future if required:
- The pod must run on a Windows host, and kubelets must reject it if not on a Windows host.
- All pods marked privileged on Windows must have host networking enabled; if not, the pod fails validation.
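A hedged sketch of the second rule follows; the function name is illustrative, and the rejection of privileged pods on non-Windows hosts (a kubelet-side check) is not shown.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// validatePrivilegedWindowsPod is an illustrative version of the alpha rule
// above: a Windows pod with privileged containers must also enable host
// networking.
func validatePrivilegedWindowsPod(pod *v1.Pod) error {
	privileged := false
	for _, c := range pod.Spec.Containers {
		if c.SecurityContext != nil && c.SecurityContext.Privileged != nil && *c.SecurityContext.Privileged {
			privileged = true
			break
		}
	}
	if privileged && !pod.Spec.HostNetwork {
		return fmt.Errorf("privileged Windows pods must set hostNetwork: true")
	}
	return nil
}

func main() {
	privileged := true
	pod := &v1.Pod{Spec: v1.PodSpec{
		HostNetwork: false,
		Containers: []v1.Container{{
			Name:            "agent",
			SecurityContext: &v1.SecurityContext{Privileged: &privileged},
		}},
	}}
	fmt.Println(validatePrivilegedWindowsPod(pod)) // fails: hostNetwork not set
}
```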
In beta, there is a possibility of enabling the privileged container to be part of a different network compartment. If this capability is added, we will use the existing pod hostNetwork field to enable/disable it.
There are no plans to update Docker to support privileged containers, due to the requirement to support HCSv2. The additional fields would be ignored by Docker. The default value should be false.
We need to investigate whether the choice of base image would cause errors when pulling with Docker.
Although no changes are required to the Kubernetes API, the ability to pass the privileged flag to the CRI should be behind a feature gate in the kubelet:
https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/#feature-stages
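For illustration, the gating could look roughly like the sketch below. The gate name `WindowsPrivilegedContainers` is a placeholder rather than a decided name; the real gate would be registered in the kubelet's feature list, and the privileged flag would only be copied into the (proposed) CRI field when the gate is enabled.

```go
package main

import (
	"fmt"

	"k8s.io/component-base/featuregate"
)

// WindowsPrivilegedContainers is a placeholder feature gate name for this
// proposal; the real name would be defined alongside the other kubelet gates.
const WindowsPrivilegedContainers featuregate.Feature = "WindowsPrivilegedContainers"

func main() {
	// Register the gate as alpha and disabled by default.
	gate := featuregate.NewFeatureGate()
	if err := gate.Add(map[featuregate.Feature]featuregate.FeatureSpec{
		WindowsPrivilegedContainers: {Default: false, PreRelease: featuregate.Alpha},
	}); err != nil {
		panic(err)
	}

	// The kubelet would only pass the privileged flag to the CRI's Windows
	// security context when the gate is enabled.
	if gate.Enabled(WindowsPrivilegedContainers) {
		fmt.Println("passing privileged flag to the CRI")
	} else {
		fmt.Println("feature gate disabled; privileged flag not passed")
	}
}
```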
Alpha
- Preliminary analysis/testing of known kubernetes scenarios:
- [List]
Beta
- Testing and validation of key scenarios identified in the alpha analysis
- Test grids
- Validate running kube-proxy as a daemon set
- [List]
Alpha
- Version of containerd: v1.5
- Version of Kubernetes: Target 1.20 or 1.21
- Version of OS support: 1809/Windows 2019 LTSC and 2004
- Alpha Feature Gate for passing privilege flag to CRI
Beta
- Go through the Linux PSP tests (e2e: validation & conformance) and make them relevant for Windows (which apply, which don't, and where we need to write new tests).
- Provide guidance similar to Pod Security Standards for Windows privileged containers
- Containerd: v1.5
- Kubernetes Target 1.21 or 1.22
- OS support: 1809/Windows 2019 LTSC and 2004
- Beta Feature Gate for passing privilege flag to CRI
GA:
- [need feedback]
- Remove feature gate for passing privileged flag
- Windows: This implementation requires no backports for OS components.
- Kubernetes: [need feedback on]
- Containerd: [need feedback]
N/A
This section must be completed when targeting alpha to a release.
-
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name:
  - Components depending on the feature gate:
- Other
  - Describe the mechanism:
  - Will enabling / disabling the feature require downtime of the control plane?
  - Will enabling / disabling the feature require downtime or reprovisioning of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).
-
Does enabling the feature change any default behavior? Any change of default behavior may be surprising to users or break existing automations, so be extremely careful here.
-
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? Also set `disable-supported` to `true` or `false` in `kep.yaml`. Describe the consequences on existing workloads (e.g., if this is a runtime feature, can it break the existing applications?).
-
What happens if we reenable the feature if it was previously rolled back?
-
Are there any tests for feature enablement/disablement? The e2e framework does not currently support enabling or disabling feature gates. However, unit tests in each component dealing with managing data, created with and without the feature, are necessary. At the very least, think about conversion tests if API types are being modified.
This section must be completed when targeting beta graduation to a release.
-
How can a rollout fail? Can it impact already running workloads? Try to be as paranoid as possible - e.g., what if some components will restart mid-rollout?
-
What specific metrics should inform a rollback?
-
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? Describe manual testing that was done and the outcomes. Longer term, we may want to require automated upgrade/rollback tests, but we are missing a bunch of machinery and tooling and can't do that now.
-
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? Even if applying deprecation policies, they may still surprise some users.
This section must be completed when targeting beta graduation to a release.
-
How can an operator determine if the feature is in use by workloads? Ideally, this should be a metric. Operations against the Kubernetes API (e.g., checking if there are objects with field X set) may be a last resort. Avoid logs or events for this purpose.
-
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name:
  - [Optional] Aggregation method:
  - Components exposing the metric:
- Other (treat as last resort)
  - Details:
-
What are the reasonable SLOs (Service Level Objectives) for the above SLIs? At a high level, this usually will be in the form of "high percentile of SLI per day <= X". It's impossible to provide comprehensive guidance, but at the very high level (needs more precise definitions) those may be things like:
- per-day percentage of API calls finishing with 5XX errors <= 1%
- 99th percentile over day of absolute value from (job creation time minus expected job creation time) for cron job <= 10%
- 99.9% of /health requests per day finish with 200 code
-
Are there any missing metrics that would be useful to have to improve observability of this feature? Describe the metrics themselves and the reasons why they weren't added (e.g., cost, implementation difficulties, etc.).
This section must be completed when targeting beta graduation to a release.
-
Does this feature depend on any specific services running in the cluster? Think about both cluster-level services (e.g. metrics-server) as well as node-level agents (e.g. specific version of CRI). Focus on external or optional services that are needed. For example, if this feature depends on a cloud provider API, or upon an external software-defined storage or network control plane.
For each of these, fill in the following—thinking about running existing user workloads and creating new ones, as well as about cluster-level services (e.g. DNS):
- [Dependency name]
  - Usage description:
    - Impact of its outage on the feature:
    - Impact of its degraded performance or high-error rates on the feature:
For alpha, this section is encouraged: reviewers should consider these questions and attempt to answer them.
For beta, this section is required: reviewers must answer these questions.
For GA, this section is required: approvers should be able to confirm the previous answers based on experience in the field.
-
Will enabling / using this feature result in any new API calls? Describe them, providing:
- API call type (e.g. PATCH pods)
- estimated throughput
- originating component(s) (e.g. Kubelet, Feature-X-controller) focusing mostly on:
- components listing and/or watching resources they didn't before
- API calls that may be triggered by changes of some Kubernetes resources (e.g. update of object X triggers new updates of object Y)
- periodic API calls to reconcile state (e.g. periodic fetching state, heartbeats, leader election, etc.)
-
Will enabling / using this feature result in introducing new API types? Describe them, providing:
- API type
- Supported number of objects per cluster
- Supported number of objects per namespace (for namespace-scoped objects)
-
Will enabling / using this feature result in any new calls to the cloud provider?
-
Will enabling / using this feature result in increasing size or count of the existing API objects? Describe them, providing:
- API type(s):
- Estimated increase in size: (e.g., new annotation of size 32B)
- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
-
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? Think about adding additional work or introducing new steps in between (e.g. need to do X to start a container), etc. Please describe the details.
-
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? Things to keep in mind include: additional in-memory state, additional non-trivial computations, excessive access to disks (including increased log volume), significant amount of data sent and/or received over network, etc. Think through this both in small and large cases, again with respect to the supported limits.
The Troubleshooting section currently serves the Playbook role. We may consider splitting it into a dedicated Playbook document (potentially with some monitoring details). For now, we leave it here.
This section must be completed when targeting beta graduation to a release.
-
How does this feature react if the API server and/or etcd is unavailable?
-
What are other known failure modes? For each of them, fill in the following information by copying the below template:
- [Failure mode brief description]
  - Detection: How can it be detected via metrics? Stated another way: how can an operator troubleshoot without logging into a master or worker node?
  - Mitigations: What can be done to stop the bleeding, especially for already running user workloads?
  - Diagnostics: What are the useful log messages and their required logging levels that could help debug the issue? Not required until feature graduated to beta.
  - Testing: Are there any tests for failure mode? If not, describe why.
-
What steps should be taken if SLOs are not being met to determine the problem?
-
Use containerd runtime handlers and Kubernetes RuntimeClasses - Runtime handlers are used in the prototype. Adding the capability to the CRI instead gives the kubelet more control over the security context and the fields it allows through, enabling additional checks (such as runAsNonRoot).
-
Use annotations on the CRI to pass the privileged flag to containerd - Adding the field to the CRI spec instead allows the existing CRI calls to work as is; the resulting code is cleaner and doesn't rely on magic strings (see the sketch below for what the annotation alternative would look like). There is currently a PR adding the security fields to the CRI API, adding sandbox-level security support for Windows containers. The run_as_username field will be required for privileged containers to make sure every container (including pause) runs as the correct user, to limit access to the file system.
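For contrast, here is a sketch of the rejected annotation-based alternative: the privileged request would travel as a magic-string annotation on the existing CRI sandbox config instead of a typed field. The annotation key is purely illustrative.

```go
package main

import (
	"fmt"

	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1alpha2"
)

func main() {
	// The runtime would have to special-case this hypothetical key rather
	// than reading a typed, validated field on the sandbox config.
	config := &runtimeapi.PodSandboxConfig{
		Annotations: map[string]string{
			"example.com/windows-privileged": "true",
		},
	}
	fmt.Println(config.Annotations["example.com/windows-privileged"])
}
```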
- What is the future of plug-ins that will be impacted?
  - CSI-proxy and HNS-proxy are likely to be impacted
- Container base image support
  - Is “from scratch” required?
  - Would a slimmer “privileged base image” be more desirable than using standard Server Core?
  - Container image build differences from traditional Windows Server images, and impacts on image use and distribution
- Should PSP be updated with the latest checks, or should an out-of-tree enforcement tool be used?
  - PSP will be deprecated, and documentation and guidance should be produced for Pod Security Standards. Implementations in out-of-tree enforcement tools should be favored; a POC/implementation in Gatekeeper would be a great way to demonstrate this.
- Scheduling checks
  - Privileged containers in the same network compartment as a non-privileged pod; otherwise, privileged init containers may still be able to access the host network