In this tutorial we are going to play with containers in order to better understand what a container is, how it differs from a virtual machine, and what container engines such as docker, rkt and lxc do under the hood to create containers. While in real life you will likely be using one of these engines, it is really useful to understand the low-level details in some depth in order to use containers effectively and securely as a development tool or as part of a larger system's architecture.
People have been dealing a lot with computers in the past few decades. This period saw great advancements in technology, both hardware and software, but there have been some recurring patterns. Despite the constant change of technology some problems need to be solved over and over again. Once you finish developing your program and want to publish it to a server you start facing problems. Is it going to work at all? It works on your development machine, but does the server have all dependencies installed? Is it going to be secure? You know that there are other programs running on that server so some of them might mess with your program and its state. Or they might consume all available resources forcing it to crash. Portability, security and isolation have been hot topics in the world of computers almost since the very beginning.
One early way to address these problems was to use multiple OS users. Every user would have limited permissions, preventing it from seeing and modifying files owned by other users. While this model worked, what wasn't so great about it was the fact that all apps were still running on the same host, so a malicious app could still drain all the resources. This model also did not address portability.
So at some point people invented virtual machines. With virtual machines we are giving each app not just a user, but a whole OS. This is way more secure since apps are no longer running side by side on the same host. What's more each VM has its own image, which also solves the portability problem - distribute your program as a VM image and it would run everywhere. The problem with VMs is that they are slow to create and expensive to manage. After all you are starting a whole OS just to run your app. Isn't there a better way?
This is exactly what containers are trying to explore. Just like VMs, they are addressing the problems of isolation, security and portability, but are cheaper, more lightweight and way more flexible.
There are multiple definitions of a container. Some popular ones are 'lightweight virtualization' and the shipping container analogy. I am not going to confuse you with yet another one. Instead, let's create a container using docker and explore what it looks like both from the inside and from the outside.
First of all we need a linux machine. They don't call them linux containers for nothing, right? In case you are running on Windows or MacOS you can spin up an ubuntu virtual machine by executing several simple commands. Open a GIT Bash window as Administrator and run the following commands:
cd workspace/installscript/diycontainers
vagrant destroy -f && vagrant up # recreate the box to start clean
This is going to create an ubuntu VM that is running the docker daemon. We are going to refer to it as the 'container host' or just 'host'.
For the rest of the tutorial we are going to be using two terminal windows tiled next to each other (run both GIT Bash terminals as administrator; to tile them you can use Windows Key + left or right arrow). We are going to refer to them as the 'left terminal' and the 'right terminal'. We are going to open shell sessions to the container host in both terminal windows. We will use the left terminal to run commands in the container and the right terminal to run commands on the host.
After the vm is up and running make sure you open shell sessions by running the following commands in both terminal windows:
vagrant ssh
sudo su -
Now let's create a container using docker, so that we can inspect it. Run this command in your left terminal:
$ docker run -it busybox
This should result in a shell running in the newly created docker container. Now let's run some commands in the container:
$ ls -la /
drwxr-xr-x 1 root root 4096 Feb 11 14:00 .
drwxr-xr-x 1 root root 4096 Feb 11 14:00 ..
-rwxr-xr-x 1 root root 0 Feb 11 14:00 .dockerenv
drwxr-xr-x 2 root root 12288 Dec 31 18:16 bin
drwxr-xr-x 5 root root 360 Feb 11 14:00 dev
drwxr-xr-x 1 root root 4096 Feb 11 14:00 etc
drwxr-xr-x 2 nobody nogroup 4096 Dec 31 18:16 home
dr-xr-xr-x 189 root root 0 Feb 11 14:00 proc
drwx------ 1 root root 4096 Feb 11 14:00 root
dr-xr-xr-x 13 root root 0 Feb 11 12:07 sys
drwxrwxrwt 2 root root 4096 Dec 31 18:16 tmp
drwxr-xr-x 3 root root 4096 Dec 31 18:16 usr
drwxr-xr-x 4 root root 4096 Dec 31 18:16 var
$ ps aux
PID USER TIME COMMAND
1 root 0:00 sh
7 root 0:00 ps aux
$ uname -a
Linux f758f8c61110 4.15.0-33-generic #36-Ubuntu SMP Wed Aug 15 16:00:05 UTC 2018 x86_64 GNU/Linux
Let's run the same commands on the host and compare the output. In your right terminal execute the following:
$ ls -la /
total 96
drwxr-xr-x 24 root root 4096 Feb 11 15:38 .
drwxr-xr-x 24 root root 4096 Feb 11 15:38 ..
-rw-r--r-- 1 root root 0 Feb 11 15:38 I_AM_THE_HOST
drwxr-xr-x 2 root root 4096 Feb 11 15:33 bin
drwxr-xr-x 3 root root 4096 Feb 11 15:35 boot
drwxr-xr-x 16 root root 3660 Feb 11 15:32 dev
drwxr-xr-x 92 root root 4096 Feb 11 15:36 etc
drwxr-xr-x 4 root root 4096 Feb 11 15:32 home
...
drwxr-xr-x 13 root root 4096 Sep 3 16:06 var
lrwxrwxrwx 1 root root 30 Feb 11 15:34 vmlinuz -> boot/vmlinuz-4.15.0-45-generic
lrwxrwxrwx 1 root root 30 Sep 3 16:04 vmlinuz.old -> boot/vmlinuz-4.15.0-33-generic
$ ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.5 0.9 159888 9400 ? Ss 15:32 0:06 /lib/systemd/systemd --system --deserialize 34
root 2 0.0 0.0 0 0 ? S 15:32 0:00 [kthreadd]
...
syslog 12030 0.0 0.3 263036 3356 ? Ssl 15:34 0:00 /usr/sbin/rsyslogd -n
root 12488 0.0 0.0 25376 244 ? Ss 15:34 0:00 /sbin/iscsid
root 12491 0.0 0.5 25880 5300 ? S<Ls 15:34 0:00 /sbin/iscsid
root 12699 0.0 1.6 170936 16316 ? Ssl 15:34 0:00 /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
root 12905 0.0 0.6 288876 6268 ? Ssl 15:34 0:00 /usr/lib/policykit-1/polkitd --no-debug
root 14576 0.0 0.5 72296 5496 ? Ss 15:34 0:00 /usr/sbin/sshd -D
root 30310 0.2 4.1 809224 42004 ? Ssl 15:36 0:01 /usr/bin/containerd
root 31342 0.0 6.7 782564 68016 ? Ssl 15:36 0:00 /usr/bin/dockerd -H fd://
$ uname -a
Linux f758f8c61110 4.15.0-33-generic #36-Ubuntu SMP Wed Aug 15 16:00:05 UTC 2018 x86_64 GNU/Linux
When we compare the outputs, we can say that the container is behaving a lot like a VM. It is seeing its own image and its own set of processes (with much smaller pid numbers) that have nothing to do with those of the host. From the container's point of view it looks like it is running on a different machine. However, the kernel version (displayed by uname -a) looks exactly the same, which is suspicious.
Now let's trigger a long running process in the container. In your left terminal execute:
sleep 9999
While it is sleeping go to your right terminal and list the process tree:
$ ps auxf | grep -C3 "[s]leep 9999"
root 30310 0.2 4.2 957840 43216 ? Ssl 15:36 0:03 /usr/bin/containerd
root 8517 0.0 0.4 9324 4976 ? Sl 16:01 0:00 \_ containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/f758f8c6111068310c25c742624a091a96253bc466c7a1a2fad7f1d720012c13 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
root 8540 0.0 0.0 1296 4 pts/0 Ss 16:01 0:00 \_ sh
root 8596 0.0 0.0 1280 4 pts/0 S+ 16:01 0:00 \_ sleep 9999
root 31342 0.0 8.3 874524 84344 ? Ssl 15:36 0:00 /usr/bin/dockerd -H fd://
vagrant 8346 0.0 0.7 76612 7660 ? Ss 15:43 0:00 /lib/systemd/systemd --user
vagrant 8347 0.0 0.2 193872 2672 ? S 15:43 0:00 \_ (sd-pam)
It looks like the sleep we triggered in the container is actually a process on the host. It is a child of sh, which in turn is a child of a process called containerd-shim, which is a child of another process called containerd. If you google for containerd you will find out that it is nothing but the container runtime that docker uses underneath. Let's see what would happen if we try to kill the sleep process from the host. In the right terminal:
kill 8596
As a result, the sleep in the container exited. This would not be possible if we were running sleep in a VM. What is even more interesting: if you try running reboot in the left terminal (in the container), nothing will happen. That is not what you would expect if you executed this command in a VM.
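The parent chain that we just read off the process tree can also be walked by hand. Here is a minimal sketch that follows the PPid field in /proc/<pid>/status. The ancestors function name is made up for this example, and it assumes a linux host with procfs mounted on /proc.

```shell
#!/bin/sh
# Print a process and each of its ancestors, one "PID NAME" pair per line,
# by following the PPid field in /proc/<pid>/status.
# "ancestors" is a hypothetical helper, not part of any tool used here.
ancestors() {
    pid=$1
    # Stop when we run past the top of the tree (PPid 0) or the
    # process is gone / not readable.
    while [ "$pid" -gt 0 ] 2>/dev/null && [ -r "/proc/$pid/status" ]; do
        name=$(awk '/^Name:/ {print $2}' "/proc/$pid/status")
        echo "$pid $name"
        pid=$(awk '/^PPid:/ {print $2}' "/proc/$pid/status")
    done
}

# Walk up from the current shell.
ancestors $$
```

On the container host you would call it with the pid of your own sleep process instead of $$ (8596 in the output above). Going by the tree shown earlier, the chain should pass through sh, containerd-shim and containerd.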
So to sum up, our container looks a lot like a VM in that it has its own view of the filesystem and the process tree, but it does not quite behave like one. It shares its kernel with the host, processes in the container are visible from the host, and we cannot do things like reboot. A container is a set of processes running in isolation.
Yeah, that's right! In the linux kernel there is no single native object that represents a container. Instead, a container is composed of lower level kernel primitives like processes, namespaces and cgroups. We already saw that a container is essentially a set of processes running on the host. But what are namespaces and cgroups? Put simply these are kernel primitives for process isolation. Each linux namespace isolates a certain aspect of the process by giving it its own view of some part of the system, like the filesystem, the process tree and others. Here are the most important types of linux namespaces:
- Mount: gives the process its own view of the root filesystem (a.k.a. container image)
- Pid: gives the process its own view of the process tree
- User: gives the process its own view of the OS users
- others: There are a few more, but they are not as visible, so we are going to focus on the three above.
Each process in the linux OS is running in a namespace of each type. Namespaces form a hierarchy. Most of the processes on the host are sharing the same set of namespaces at the root of that hierarchy. As a result they have the same view of the system - they are seeing the same files and the same process tree. The root user can create child namespaces using the unshare command. A call to unshare looks like this:
unshare [options] [<program> [<argument>...]]
The options tell it which namespaces to unshare, and the program argument tells it what to run in the unshared namespaces. As a result we get a new process that is running in a new set of namespaces, so it has a different view of the system. It can modify certain aspects of the system without affecting the rest of the processes on the host. When this process exits, the linux kernel is going to deallocate the new namespaces and the changes will go away. That's the basic lifecycle of a linux namespace.
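One way to see which namespaces a process is running in is to look at the links under /proc/<pid>/ns. The sketch below is not part of the tutorial setup, and ns_id is a helper name invented for this example. Each link resolves to a string made of the namespace type and an inode number, and two processes are in the same namespace when those strings match.

```shell
#!/bin/sh
# Print the namespace identifier of a process for one namespace type.
# Usage: ns_id <pid> <type>, where type is e.g. mnt, pid or user.
# "ns_id" is a hypothetical helper name.
ns_id() {
    readlink "/proc/$1/ns/$2"
}

# Show the three namespace types this tutorial focuses on
# for the current shell.
for type in mnt pid user; do
    echo "$type -> $(ns_id $$ "$type")"
done
```

If you run this in two ordinary shells on the host you should get the same identifiers in both, since neither of them has unshared anything yet.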
Cool, what about cgroups? Cgroup stands for 'control group'. Cgroups are another set of primitives which are used to control process resource usage, by setting resource limits. They are completely orthogonal to namespaces. For example you can create a new memory cgroup that sets a memory limit. You can join your process to that cgroup and it won't be able to allocate more memory than the amount specified in the cgroup. Some important types of cgroups are memory, cpu and blkio. Cgroups play an important role in container isolation, but have less visible effects than namespaces, so we are not going to explore them in this tutorial.
Let's finally get our hands dirty and start building our own container using namespaces and the unshare command. We are going to need just a clean linux distro and a root filesystem to use as an image for our container. We are not going to need docker.
In order to prepare your terminals type exit in the left one. This will kill the docker container we created earlier and get you back to the host. In both terminals make sure you are still logged in to the vagrant VM as the root user. Remember, you need to be root in order to create new namespaces.
So we need to create a process and some namespaces. We will start with the mount namespace. According to the kernel docs, "Mount namespaces provide isolation of the list of mount points seen by the processes in each namespace instance."
What does that mean? In the Linux OS the filesystem has a single root - a directory named /. All other files and directories are children and grandchildren of /. The filesystem that starts on / is known as a root filesystem (rootfs) or the machine image. If we want to look at another filesystem, for example a USB stick or some network storage, we need to mount the new filesystem somewhere in the root filesystem, so that we can browse it. For example we might mount a USB stick on /mnt/myusbstick and if it is correctly formatted we will be able to see its contents under that path. /mnt/myusbstick is a mountpoint, since it is the root of a filesystem that is different from the root filesystem. The list of all mount points is known as the mount table. The only thing that the mount namespace does is to provide the new process with a unique copy of the mount table, or in other words - its own view of the mount table. In the beginning it is an exact clone of the parent's mount table, but any mounts that the new process creates will go to its own mount table and will not be visible on the host. Let's see this in action. In your left terminal, run this command:
unshare -m /bin/sh
This is going to start a new shell process in its own mount namespace. Let's inspect the mount tables from both the left and the right terminal. Run cat /proc/mounts in both windows. You will notice that they are identical. The mount table in the new mount namespace starts off as a copy of the host's mount table. We can count the number of mount points by running cat /proc/mounts | wc -l and it will be the same on the host and in the new namespace. On my machine this number happens to be 33.
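Identical mount tables do not tell us whether the two shells really live in different mount namespaces. A small sketch like the following could check that directly. The function names are invented for this example, and it assumes the shell was started straight from a shell in the host's namespaces, as in the steps above.

```shell
#!/bin/sh
# Succeed if the two given pids are in the same mount namespace.
# Returns 2 if one of the namespace links cannot be read.
# (Hypothetical helper, named for this example only.)
same_mount_ns() {
    a=$(readlink "/proc/$1/ns/mnt") || return 2
    b=$(readlink "/proc/$2/ns/mnt") || return 2
    [ "$a" = "$b" ]
}

# Number of entries in the mount table seen by the current process.
mount_count() {
    wc -l < /proc/mounts
}

# Compare the current shell with the process that started it.
if same_mount_ns $$ $PPID; then
    echo "this shell shares its parent's mount namespace"
else
    echo "this shell has its own mount namespace (or the parent is not readable)"
fi
echo "mount points visible here: $(mount_count)"
```

In the left terminal this should report a separate mount namespace, while the right terminal should report a shared one, even though both count the same number of mount points.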
You might wonder what this proc directory that we are looking at is. Let's spend a minute talking about it since it is an important concept. The /proc directory itself is a mount point. We can confirm this by executing this on the host:
$ mountpoint /proc
/proc is a mountpoint
If it is a mountpoint, this means there is a filesystem that is mounted on it. This is the procfs filesystem - it is a special virtual filesystem. It is not associated with a block storage device such as a disk or a USB stick, but it directly exposes runtime information about the state of the system, such as the mount tables and the processes that are currently running. It contains many virtual files that give you real time info about certain aspects of the system. For example /proc/uptime tells you how long the machine has been running and /proc/meminfo gives you detailed information about memory usage.
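Since these are just files, any tool that reads text can use them. Here is a small sketch that pulls one value out of each of the two files mentioned above. The function names are made up for this example, and it assumes the usual layout of these files: the uptime in seconds is the first field of /proc/uptime, and /proc/meminfo has a MemTotal line with a value in kB.

```shell
#!/bin/sh
# Seconds since boot: the first field of /proc/uptime.
uptime_seconds() {
    cut -d' ' -f1 /proc/uptime
}

# Total memory in kB: the numeric value on the MemTotal line
# of /proc/meminfo.
mem_total_kb() {
    awk '/^MemTotal:/ {print $2}' /proc/meminfo
}

echo "up for $(uptime_seconds) seconds"
echo "total memory: $(mem_total_kb) kB"
```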
Back to what we were doing. Now that we have a new mount namespace, let's mount something. In the left terminal navigate to /tmp/playground and list the rootfs directory.
$ cd /tmp/playground
$ ls rootfs
I_AM_THE_CONTAINER bin linuxrc sbin usr
The rootfs directory contains the busybox image that we will be using as a rootfs for all our container experiments. Let's bind mount the rootfs.
mount --bind rootfs/ rootfs/
We just created a new mount point and mounted the contents of the rootfs directory as a new filesystem. The --bind option tells the mount command that the filesystem data is not coming from a block storage device such as a disk or a flash drive, but from a local directory. The first argument is the path to the source directory and the second argument is the path where we want to mount it. In our case we are mounting the rootfs dir onto itself. We are doing this because all we want is to get /tmp/playground/rootfs in the mount table. You will see why we need it there in a second. We could have mounted it on an arbitrary path if we wanted.
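If you want to check whether a given path made it into the mount table of the current namespace, you can search /proc/mounts for it. The following is just a sketch. is_mounted is a name invented for this example, and it relies on the mount point being the second field of every line in /proc/mounts.

```shell
#!/bin/sh
# Succeed if the given path appears as a mount point (second field)
# in the mount table of the current mount namespace.
# "is_mounted" is a hypothetical helper name.
is_mounted() {
    awk -v target="$1" '$2 == target { found = 1 } END { exit !found }' /proc/mounts
}

if is_mounted /tmp/playground/rootfs; then
    echo "rootfs is in this mount table"
else
    echo "rootfs is not in this mount table"
fi
```

Run in both terminals, this should find the rootfs mount only in the left one, because the bind mount was created inside the new mount namespace.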
Now that we have a mount point in the container let's run cat /proc/mounts | wc -l in both terminal windows in order to count the mount points again. This time it is reporting 34 mount points in the container and 33 on the host. We are already witnessing some isolation. Unfortunately listing / still yields the same results in both terminal windows, so the unshared process does not quite look like a container yet. What are we missing?
Well, just unsharing a new mount namespace and bind mounting a rootfs is not enough to give the container its own view of the root filesystem. Technically both shell processes still have the same rootfs and this is the rootfs of the ubuntu VM. In order to change the rootfs of the container we need to run the pivot_root command. Here is how it is used:
pivot_root new_root put_old
This command will change the root filesystem of the calling process. The first argument is the new rootfs that we want to switch to. It needs to be a mountpoint - that's why we needed to bind mount the rootfs. The second argument is a path in the new rootfs where pivot_root is going to put the old ubuntu VM rootfs, so that it's accessible in case we need it. Let's try:
$ mkdir rootfs/old
$ pivot_root rootfs rootfs/old
$ cd /
$ ls /
I_AM_THE_CONTAINER bin linuxrc sbin usr
Now let's list / on the host:
$ ls /
I_AM_THE_HOST boot etc initrd.img lib lost+found mnt proc run snap sys usr var vmlinuz.old
bin dev home initrd.img.old lib64 media opt root sbin srv tmp vagrant vmlinuz
Cool! Our unshared process looks a lot more like the docker container we were playing with in the beginning! However we still have access to the old rootfs - it can be found under /old in the container. Let's check:
$ ls /old
I_AM_THE_HOST etc lib mnt run sys var
bin home lib64 opt sbin tmp vmlinuz
boot initrd.img lost+found proc snap usr vmlinuz.old
dev initrd.img.old media root srv vagrant
It is there. If we want, we can unmount it via umount -l /old.
Now let's list the process tree - run ps aux in both terminals. We see the same set of processes, which is not what we would expect in a container. Let's fix that by moving on to the next namespace. Before we do that, do not forget to exit from the container in the left terminal.
Let's isolate our container even further by unsharing both mount and pid namespaces. Execute the following in the left terminal window:
unshare -m -p -f /bin/sh
We have added two new options to unshare. The -p option is telling unshare to create a new pid namespace. The -f option makes unshare fork a child before execing our program. We need to do that because of how pid namespaces work. According to the man page, 'the first child created by a process after a call to unshare using the CLONE_NEWPID flag has the PID 1, and is the "init" process for the namespace'. So the -f is just making sure that the shell process will be the first process in the new pid namespace. Let's confirm that by printing the container shell pid:
$ echo $$
1
Looks like we are the first process in the container. Pretty cool!
Before we dive into the world of pids let's quickly configure the rootfs as we did before:
cd /tmp/playground
mount --bind rootfs/ rootfs/
pivot_root rootfs/ rootfs/old
umount -l /old
cd /
Now let's list all the processes in the container:
$ ps aux
PID USER TIME COMMAND
ps: can't open '/proc': No such file or directory
It looks weird but it is expected. The ps command is reading the procfs filesystem in order to get information about the running processes. It expects to find this filesystem on /proc but it is not there so it is failing. Let's fix that. Run this in the container:
mkdir /proc
mount -t proc none /proc
Now run ps aux again:
$ ps aux
PID USER TIME COMMAND
1 0 0:00 /bin/sh
7 0 0:00 ps aux
Voila! We have process tree isolation now. Exactly like we saw in the docker container! We are doing pretty well! Our container is still not absolutely safe though. Let's run a long sleep in the container on the left:
sleep 999
While it is sleeping run the following on the host:
$ ps aux | grep "[s]leep 999"
root 7728 0.0 0.0 3224 4 pts/0 S+ 14:55 0:00 sleep 999
As you can see, our sleep process is running as the host root user. This is what we call a privileged container - a container running as host root. Running your program as the root user is a generally discouraged practice in the linux world. The root user is the most privileged user on the system and can do anything, so if someone manages to hack your program they can cause a lot of damage. However, if your program runs as an unprivileged user, even if it gets hacked, the hacker would not be able to easily affect other programs. Let's try to build an unprivileged container. Before that, make sure you exit from the current container.
First of all make sure both terminal windows are logged in to the ubuntu VM as user vagrant:
su vagrant
Then run the unshare command in the left terminal as usual:
unshare -U -m -p -f /bin/sh
Now let's check which user we are running as in the container. Run this on the left:
$ whoami
nobody
Interesting. If you run this on the host you are going to get vagrant as the user name. What happened is that we created a new user namespace, but did not initialize it. That's why we are 'nobody'. User namespaces are a bit different from the others. They are the only type of namespace that you can unshare as an unprivileged user (that's the whole point). The cool thing about running in a user namespace is that you can be root (uid 0) inside the namespace, but vagrant (uid 1000) in the parent user namespace. This way you have privileges only in the container. This is achieved by the so-called user mappings. User mappings need to be written immediately after the user namespace is unshared. They are written to a special file with path /proc/<pid>/uid_map. This is yet another virtual file in the procfs. The procfs keeps a directory for each running process. The name of the directory is the same as the process pid as shown by ps. So let's find out the pid of our container. Run this on the host:
$ ps auxf | grep -A1 [u]nshare
vagrant 7773 0.0 0.0 7912 800 pts/0 S 15:07 0:00 | \_ unshare -U -m -p -f /bin/sh
vagrant 7774 0.0 0.0 4628 788 pts/0 S+ 15:07 0:00 | \_ /bin/sh
Looks like our sh process has a pid of 7774. Let's list the user mappings for this process:
cat /proc/7774/uid_map
It is empty. This is the reason why our container currently thinks it is nobody. The users in the new user namespace are not mapped to the users on the host. Let's write a sensible mapping. Mappings are written to /proc/<pid>/uid_map in the following format:
<uid> <puid> <size>
The first number is the uid in the new userns, the second number is the uid in its parent namespace and the last number is the size of the mapping. For example, a mapping with size 2 is going to map uid to puid and uid+1 to puid+1. In order to map uid 0 in our new user namespace to uid 1000 in its parent (the user namespace of the host) we need to write 0 1000 1 to the mapping file. Let's do it. Run the following command on the host:
echo 0 1000 1 > /proc/7774/uid_map
And now let's check the container user again:
$ whoami
root
Now, that's something. Our shell thinks it is root, but if we look at pid 7774 on the host it is run by user vagrant.
$ ps aux | grep [7]774
vagrant 7774 0.0 0.0 4628 788 pts/0 S+ 15:07 0:00 /bin/sh
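To make the mapping format a bit more concrete, here is a tiny sketch of the arithmetic behind a single mapping line. This is not how the kernel is implemented, just the rule described above written as a shell function. map_uid and its argument order are made up for this example.

```shell
#!/bin/sh
# Translate a uid inside the namespace to the uid in the parent namespace,
# given the three numbers of one mapping line "<uid> <puid> <size>".
# Usage: map_uid <uid to translate> <uid> <puid> <size>
# Fails (returns 1) if the uid falls outside the mapped range.
# "map_uid" is a hypothetical helper name.
map_uid() {
    inner=$1 start=$2 parent_start=$3 size=$4
    if [ "$inner" -ge "$start" ] && [ "$inner" -lt $((start + size)) ]; then
        echo $((parent_start + inner - start))
    else
        return 1
    fi
}

map_uid 0 0 1000 1                      # prints 1000
map_uid 1 0 1000 2                      # prints 1001
map_uid 1 0 1000 1 || echo "unmapped"   # prints unmapped
```

The first call is the mapping we just wrote: uid 0 in the container is uid 1000 on the host. The last call shows why a mapping of size 1 leaves every other uid in the container without a counterpart on the host.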
If we want we can quickly isolate the rootfs and process tree as we did before:
cd /tmp/playground
mount --bind rootfs/ rootfs/
pivot_root rootfs/ rootfs/old
umount -l /old
cd /
mount -t proc none /proc
And we have built a decently isolated container. Sure, there's a lot more to do, but I think you are getting the point, so I am going to stop now.
Here are the steps that we performed all in one place:
# on the host
unshare -U -m -p -f /bin/sh
# replace <pid> with the pid of the container shell
echo 0 1000 1 > /proc/<pid>/uid_map
# in the container
cd /tmp/playground
mount --bind rootfs/ rootfs/
pivot_root rootfs/ rootfs/old
umount -l /old
cd /
mount -t proc none /proc
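As a side note, recent versions of unshare can do some of these steps for us. The sketch below is not what we did in this tutorial. It assumes a util-linux unshare that supports the --map-root-user and --mount-proc options and a kernel that allows unprivileged user namespaces. It does not switch the rootfs, so it covers only the user mapping and the procfs steps. run_isolated is a name made up for this example.

```shell
#!/bin/sh
# Run a command as uid 0 inside fresh user, mount and pid namespaces.
# --map-root-user writes the uid and gid mappings for us, and
# --mount-proc mounts a fresh procfs on /proc in the new mount namespace.
# "run_isolated" is a hypothetical helper name.
run_isolated() {
    unshare --user --map-root-user --mount --pid --fork --mount-proc "$@"
}

# Only try the real thing if the environment allows it.
if run_isolated true 2>/dev/null; then
    run_isolated sh -c 'echo "uid=$(id -u) pid=$$"'
else
    echo "this kind of unprivileged unshare is not available here"
fi
```

Where it works, the command should report uid 0 and pid 1, which are the same two facts we established by hand with whoami and echo $$.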
There we are, building containers! What did we learn in the process?
- Containers don't exist
- A container is just a set of processes running in isolation
- We can isolate processes as little or as much as we like using namespaces.
Creating containers is like playing with legos. You can use the primitive building blocks and you can build whatever you need with them. You do not need to use all namespaces, you can create containers that share namespaces, the options are limitless. Docker is doing just that under the hood. It is just one of the available lego sets. There are others as well, but the important part is that they are all using the same building blocks.