`virtcontainers` is a Go library that can be used to build hardware-virtualized container runtimes.
The few existing VM-based container runtimes (Clear Containers, runv, rkt's kvm stage 1) all share the same hardware virtualization semantics but use different code bases to implement them. The goal of `virtcontainers` is to factor this code out into a common Go library.
Ideally, VM-based container runtime implementations would become translation layers from the runtime specification they implement (e.g. the OCI runtime-spec or the Kubernetes CRI) to the `virtcontainers` API.
Implementing a container runtime tool is out of scope for this project. Any tools or executables in this repository are only provided for demonstration or testing purposes.
The `virtcontainers` API is loosely inspired by the Kubernetes CRI because we believe it provides the right level of abstraction for containerized pods. However, despite the API similarities between the two projects, the goal of `virtcontainers` is not to build a CRI implementation, but instead to provide a generic, runtime-specification agnostic, hardware-virtualized containers library that other projects could leverage to implement CRI themselves.
The `virtcontainers` execution unit is a pod, i.e. `virtcontainers` users start pods where containers will be running. `virtcontainers` creates a pod by starting a virtual machine and setting the pod up within that environment. Starting a pod means launching all containers within the VM pod runtime environment.
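To make this concrete, a pod description can be pictured as a configuration carrying the hypervisor setup, the agent setup, and the list of containers to launch. The structs below are a hypothetical sketch for illustration only; all field names and types are assumptions, not the library's actual `PodConfig` and `ContainerConfig` definitions.

```go
// Hypothetical sketch of what a pod description might carry. Field names
// and types are assumptions made for illustration.
type PodConfigSketch struct {
	ID string // unique pod identifier

	Hypervisor HypervisorConfigSketch // VM settings (kernel, image, memory, vCPUs)
	Agent      AgentConfigSketch      // how to reach the guest OS

	Containers []ContainerConfigSketch // workloads launched inside the VM
}

type HypervisorConfigSketch struct {
	KernelPath string
	ImagePath  string
	MemoryMB   uint
	VCPUs      uint
}

type AgentConfigSketch struct {
	URL string // e.g. the socket or serial port used to reach the guest agent
}

type ContainerConfigSketch struct {
	ID     string
	RootFs string
	Cmd    []string
}
```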
The `virtcontainers` package relies on hypervisors to start and stop the virtual machines where pods will be running. A hypervisor is defined by a `Hypervisor` interface implementation; the default implementation is the QEMU one.
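An implementation-agnostic hypervisor abstraction could look roughly like the interface below, reusing the hypothetical config types from the sketch above. This is a hedged sketch only: the method names are assumptions chosen to mirror the pod lifecycle, not the library's actual `Hypervisor` interface.

```go
// Illustrative only: method names and signatures are assumptions, not the
// library's real Hypervisor interface.
type hypervisorSketch interface {
	// init validates and stores the VM-level configuration
	// (kernel, image, memory, vCPUs, ...).
	init(config HypervisorConfigSketch) error

	// createPod builds the virtual machine that will back the pod,
	// without booting it.
	createPod(podConfig PodConfigSketch) error

	// startPod boots the virtual machine.
	startPod() error

	// stopPod shuts the virtual machine down.
	stopPod() error

	// addDevice plugs a device (disk, network interface, ...) into the VM.
	addDevice(device interface{}) error
}

// A QEMU-backed type would satisfy this interface and act as the default
// hypervisor implementation.
type qemuSketch struct {
	config HypervisorConfigSketch
}
```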
During the lifecycle of a container, the runtime running on the host needs to interact with the virtual machine guest OS in order to start new commands to be executed as part of a given container workload, set new networking routes or interfaces, fetch a container's standard or error output, and so on. There are many existing and potential solutions to this problem, and `virtcontainers` abstracts it through the `Agent` interface.
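As with the hypervisor, the guest-side interactions can be pictured as an interface. The sketch below is hypothetical: the method names only mirror the interactions listed above (command execution, networking updates, I/O retrieval) and are not the library's actual `Agent` definition.

```go
// Illustrative only: these methods are assumptions mirroring the
// host-to-guest interactions described above.
type agentSketch interface {
	// start establishes the communication channel with the guest OS.
	start() error

	// exec runs a new command inside an existing container workload.
	exec(podID, containerID string, cmd []string) error

	// updateNetwork pushes new routes or interfaces to the guest.
	updateNetwork(routes []string, interfaces []string) error

	// readStdout and readStderr fetch a container's output streams.
	readStdout(containerID string) ([]byte, error)
	readStderr(containerID string) ([]byte, error)

	// stop tears the communication channel down.
	stop() error
}
```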
The high level `virtcontainers` API is the following (a short usage sketch is provided after the API list):
- `CreatePod(podConfig PodConfig)` creates a Pod. The Pod is prepared and will run in a virtual machine. It is not started, i.e. the VM is not running after `CreatePod()` is called.
- `DeletePod(podID string)` deletes a Pod. The function will fail if the Pod is running. In that case `StopPod()` needs to be called first.
- `StartPod(podID string)` starts an already created Pod.
- `StopPod(podID string)` stops an already running Pod.
- `ListPod()` lists all running Pods on the host.
- `EnterPod(cmd Cmd)` enters a Pod root filesystem and runs a given command.
- `PodStatus(podID string)` returns a detailed Pod status.
- `CreateContainer(podID string, container ContainerConfig)` creates a Container on a given Pod.
- `DeleteContainer(podID, containerID string)` deletes a Container from a Pod. If the container is running it needs to be stopped first.
- `StartContainer(podID, containerID string)` starts an already created container.
- `StopContainer(podID, containerID string)` stops an already running container.
- `EnterContainer(podID, containerID string, cmd Cmd)` enters an already running container and runs a given command.
- `ContainerStatus(podID, containerID string)` returns a detailed container status.
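The sketch below strings these calls together into a basic pod and container lifecycle. Only the function names come from the list above; the import path, the pod handle with its `ID()` accessor, the `ID` field on `ContainerConfig`, and the return value shapes are assumptions, and the near-empty configs would need to be filled with real hypervisor, agent, and workload settings.

```go
package main

import (
	"log"

	// Assumed import path; adjust to wherever virtcontainers is vendored.
	vc "github.com/containers/virtcontainers"
)

func main() {
	// Hypothetical minimal configs; real ones carry hypervisor, agent and
	// workload settings.
	podConfig := vc.PodConfig{}
	containerID := "container-1"                           // hypothetical ID
	containerConfig := vc.ContainerConfig{ID: containerID} // ID field is an assumption

	// Prepare the pod: the VM is not running yet.
	pod, err := vc.CreatePod(podConfig)
	if err != nil {
		log.Fatal(err)
	}
	podID := pod.ID() // the pod handle and its ID() accessor are assumptions

	// Boot the VM and launch the pod's containers.
	if _, err := vc.StartPod(podID); err != nil {
		log.Fatal(err)
	}

	// Add one more container to the running pod and start it.
	if _, _, err := vc.CreateContainer(podID, containerConfig); err != nil {
		log.Fatal(err)
	}
	if _, err := vc.StartContainer(podID, containerID); err != nil {
		log.Fatal(err)
	}

	// Tear everything down: stop first, then delete.
	if _, err := vc.StopPod(podID); err != nil {
		log.Fatal(err)
	}
	if _, err := vc.DeletePod(podID); err != nil {
		log.Fatal(err)
	}
}
```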
An example tool using the `virtcontainers` API is provided in the hack/virtc package.
`virtcontainers` implements two different ways of setting up a pod's network: CNM and CNI.
CNM lifecycle
- RequestPool
- CreateNetwork
- RequestAddress
- CreateEndPoint
- CreateContainer
- Create config.json
- Create PID and network namespace
- ProcessExternalKey
- JoinEndPoint
- LaunchContainer
- Launch
- Run container
Runtime network setup with CNM
- Read config.json
- Create the network namespace (code) (see the sketch after this list)
- Call the prestart hook (from inside the netns) (code)
- Scan network interfaces inside the netns and get the name of the interface created by the prestart hook (code)
- Create bridge, TAP, and link them all together with the network interface previously created (code)
- Start the VM inside the netns and start the container (code)
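The namespace creation and interface scan steps could look roughly like the sketch below. This is not the project's code: it assumes the github.com/vishvananda/netns package and simply returns the first non-loopback interface found in the namespace.

```go
package main

import (
	"fmt"
	"net"
	"runtime"

	"github.com/vishvananda/netns" // assumed dependency for namespace handling
)

// firstNonLoopback enters the target network namespace, lists its interfaces
// and returns the first non-loopback name, presumably the veth end created by
// the prestart hook.
func firstNonLoopback(target netns.NsHandle) (string, error) {
	// Namespace membership is per OS thread, so pin this goroutine.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	origin, err := netns.Get()
	if err != nil {
		return "", err
	}
	defer origin.Close()
	defer netns.Set(origin) // switch back when done

	if err := netns.Set(target); err != nil {
		return "", err
	}

	ifaces, err := net.Interfaces()
	if err != nil {
		return "", err
	}
	for _, iface := range ifaces {
		if iface.Flags&net.FlagLoopback == 0 {
			return iface.Name, nil
		}
	}
	return "", fmt.Errorf("no non-loopback interface found")
}

func main() {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	origin, err := netns.Get()
	if err != nil {
		panic(err)
	}
	defer origin.Close()

	// netns.New() creates a fresh network namespace and enters it on the
	// current thread; go back to the original one right after.
	ns, err := netns.New()
	if err != nil {
		panic(err)
	}
	defer ns.Close()
	if err := netns.Set(origin); err != nil {
		panic(err)
	}

	// The prestart hook would be executed inside ns at this point.
	name, err := firstNonLoopback(ns)
	fmt.Println(name, err)
}
```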
Drawbacks of CNM
There are three drawbacks to using CNM instead of CNI:
- The way we call into it is not very explicit: we have to re-exec the dockerd binary so that it can accept parameters and execute the prestart hook related to network setup.
- Implicit way to designate the network namespace: instead of explicitly giving the netns to dockerd, we give it the PID of our runtime so that it can find the netns from this PID. This means we have to make sure we are in the right netns while calling the hook, otherwise the veth pair will be created in the wrong netns.
- No results come back from the hook: we have to scan the network interfaces to discover which one has been created inside the netns. This introduces more latency because it forces us to scan the network in the CreatePod path, which is critical for starting the VM as quickly as possible.
Runtime network setup with CNI
- Create the network namespace (code)
- Get the CNI plugin information (code)
- Start the plugin (providing the previously created netns) to add a network described in the /etc/cni/net.d/ directory. At that point, the CNI plugin will create the cni0 network interface and a veth pair between the host and the created netns. It links cni0 to the veth pair before exiting. (code) (see the sketch after this list)
- Create bridge, TAP, and link them all together with the network interface previously created (code)
- Start the VM inside the netns and start the container (code)
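The plugin invocation step is typically driven through the libcni package; a rough sketch follows. The exact libcni API differs between versions (newer releases take a context.Context as the first AddNetwork argument, and ConfFiles grew an extensions parameter), so treat the calls and paths below as an approximation rather than the project's actual code.

```go
package main

import (
	"fmt"

	"github.com/containernetworking/cni/libcni" // assumed dependency
)

// addPodNetwork asks the first CNI configuration found in /etc/cni/net.d/ to
// set up networking inside the given namespace.
func addPodNetwork(netnsPath, containerID string) error {
	// Network configurations conventionally live in /etc/cni/net.d/ and
	// plugin binaries in /opt/cni/bin.
	files, err := libcni.ConfFiles("/etc/cni/net.d", []string{".conf"})
	if err != nil {
		return err
	}
	if len(files) == 0 {
		return fmt.Errorf("no CNI configuration found")
	}

	netConf, err := libcni.ConfFromFile(files[0])
	if err != nil {
		return err
	}

	cniConfig := &libcni.CNIConfig{Path: []string{"/opt/cni/bin"}}

	runtimeConf := &libcni.RuntimeConf{
		ContainerID: containerID,
		NetNS:       netnsPath, // e.g. /var/run/netns/<name>
		IfName:      "eth0",
	}

	// ADD the network: with the bridge plugin this creates cni0 and a veth
	// pair whose peer ends up inside the namespace. Recent libcni versions
	// require a context.Context as the first AddNetwork argument.
	result, err := cniConfig.AddNetwork(netConf, runtimeConf)
	if err != nil {
		return err
	}
	fmt.Println("CNI result:", result)
	return nil
}

func main() {
	if err := addPodNetwork("/var/run/netns/pod-ns", "pod-container"); err != nil {
		fmt.Println("network setup failed:", err)
	}
}
```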