I love understanding the inner workings of systems. I have been working with containers for almost 2 years and I have always wanted to understand their internals. So I took some time to explore the userspace and the Kernel level implementations of the technologies that make containers possible.
I have written extensive notes on my findings and my implementation in my notes collection. The project page is available here, and the implementation is available on my GitHub: sheharyaar/container-runtime.
This blog is about my experience and the things I found interesting. So buckle up!
Experience
The first step was to figure out how the namespace userspace APIs, such as clone(), unshare() and setns(), work. I wrote bite-sized programs and put them under the playground/ folder in the repository. Once I was through with the userspace part, I explored the Kernel code behind these system calls. I also referred to many great blogs and articles, which I have mentioned in the docs.
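One quick way to see these APIs in action is to compare the namespace identifier in /proc/self/ns/ before and after a call to unshare(). The sketch below is only an illustration and is not taken from the playground/ folder; the helper names read_ns_id and unshare_uts_and_compare are made up for this example, and it assumes that /proc is mounted and that unprivileged user namespaces are enabled on the machine.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copies the namespace identifier of the calling
 * process (for example "uts:[4026531838]") into buf.
 * Returns 0 on success, -1 on failure.
 */
int read_ns_id(const char *ns_name, char *buf, size_t buf_len)
{
    char path[64];
    if (buf_len == 0)
        return -1;
    int n = snprintf(path, sizeof(path), "/proc/self/ns/%s", ns_name);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;
    /* Each entry in /proc/self/ns is a symlink whose target names the namespace. */
    ssize_t len = readlink(path, buf, buf_len - 1);
    if (len < 0)
        return -1;
    buf[len] = '\0';
    return 0;
}

/*
 * Hypothetical helper: moves the caller into new user and UTS namespaces.
 * Returns 1 if the UTS identifier changed, 0 if it did not,
 * and -1 if any step failed (for example, user namespaces are disabled).
 */
int unshare_uts_and_compare(void)
{
    char before[64], after[64];
    if (read_ns_id("uts", before, sizeof(before)) != 0)
        return -1;
    /* CLONE_NEWUSER lets an unprivileged process create the UTS namespace. */
    if (unshare(CLONE_NEWUSER | CLONE_NEWUTS) != 0)
        return -1;
    if (read_ns_id("uts", after, sizeof(after)) != 0)
        return -1;
    return strcmp(before, after) != 0;
}
```

setns() works on the same /proc entries: opening /proc/PID/ns/uts of another process and passing the file descriptor to setns() joins that namespace instead of creating a new one.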
After this, I went deeper into namespaces and studied the nsproxy structure and its interaction with the system calls. My prior experience with the Linux Kernel helped me understand the source code with less frustration. Exploring cgroups was the most difficult part: it was very frustrating to understand the architecture and the multiple lists the cgroups subsystem uses to connect controllers, cgroups and tasks.
Once this was done, the rest of the process was easier. To begin my own implementation, I started with the excellent blogs by Hechao Li. For confusing topics and blockers, I referred to the lxc and runc implementations.
Interesting stuff
Prior to this undertaking, I was aware of the theory of containers, but implementing one taught me a lot of new things. I spent a lot of time debugging and understanding the effect of namespace isolation on the child process, the parent process and the synchronization between them. Getting pivot_root to work was another huge task, both to understand and to do correctly.
Parent and child synchronization
When you clone a process with flags like CLONE_NEWNS, CLONE_NEWIPC and CLONE_NEWNET, and with separate virtual memory and filesystem views, you are left with few choices for synchronizing the processes.
What didn’t work:
- IPC mechanisms like semaphores, message queues and shared memory do not work, due to IPC isolation and the lack of common memory between the child and the parent (because the CLONE_VM flag is absent).
- Path-based UNIX sockets could not be used since the file system was also isolated. To make this work, there would need to be a common area of the file system where the socket could be created, leading to complex solutions. Similarly, isolating the network namespace with CLONE_NEWNET makes it difficult to use TCP or other sockets.
What worked:
pipe (verified it myself) and eventfd (not verified) are two mechanisms that would work. I used a pipe to synchronize the parent and the child process.
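The pipe approach can be sketched as follows. This is a minimal illustration rather than the code from the repository: it uses fork() so that it runs without privileges, whereas a container runtime would call clone() with the CLONE_NEW* flags at the same spot. The function name run_with_pipe_sync is made up for this example. A pipe works here because its file descriptors are inherited by the child and do not depend on the IPC, mount or network namespaces.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: creates a child that blocks on a pipe until the
 * parent signals it. Returns the child's exit status (0 if the child
 * received the signal) or -1 on error.
 */
int run_with_pipe_sync(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    /* Assumption: a real runtime would use clone() with namespace flags here. */
    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: keep only the read end and wait for the parent. */
        close(fds[1]);
        char token;
        /* Blocks until one byte arrives; returns 0 if the parent closed the pipe. */
        ssize_t n = read(fds[0], &token, 1);
        close(fds[0]);
        /* A real runtime would exec the user's command at this point. */
        _exit(n == 1 ? 0 : 1);
    }

    /* Parent: keep only the write end. */
    close(fds[0]);
    /* Any setup the child depends on would happen here, before the signal. */
    ssize_t written = write(fds[1], "x", 1);
    close(fds[1]);

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (written != 1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because read() also returns when the write end is closed, the child can tell a deliberate signal (one byte) apart from a parent that exited early (zero bytes).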
Why was synchronization needed?
I was using the CLONE_NEWUSER flag, which creates a user namespace, so the child process has a different user ID in the new namespace. To add to the difficulty, there needs to be a mapping between the host (parent) namespace and the container namespace for programs to make changes to the system.
If UID 0 of the container is mapped to UID 1000, or any other non-zero value, on the host, the container cannot make privileged changes to the host system. But if we map UID 0 of the container to UID 0 of the host, the container is allowed to perform most privileged operations, with a few exceptions. This is similar in effect to the --privileged option in Docker, although Docker implements that option by granting extra capabilities and device access rather than through the UID mapping.
This mapping has to be written by the parent process after the child is cloned. So the child must wait for the parent to set up the mappings in /proc/child_pid/uid_map and the corresponding gid_map file before it can exec the command provided by the user.
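A rough sketch of what the parent writes is shown below. The helper names write_id_map and setup_id_maps are made up for this example, and the sketch assumes a single-entry mapping (container ID 0 to one host ID). Each map line has the form "ID-inside ID-outside count" and should be written with a single write() call. An unprivileged parent also has to write "deny" to /proc/child_pid/setgroups before it is allowed to write gid_map.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: writes one mapping line "inside outside count\n"
 * to the given map file using a single write(). Returns 0 or -1.
 */
int write_id_map(const char *path, unsigned inside_id, unsigned outside_id,
                 unsigned count)
{
    char line[64];
    int len = snprintf(line, sizeof(line), "%u %u %u\n",
                       inside_id, outside_id, count);
    if (len < 0 || (size_t)len >= sizeof(line))
        return -1;

    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t written = write(fd, line, (size_t)len);
    close(fd);
    return written == (ssize_t)len ? 0 : -1;
}

/*
 * Hypothetical helper, called by the parent after the child is cloned:
 * maps ID 0 inside the child's user namespace to host_uid / host_gid.
 */
int setup_id_maps(pid_t child_pid, uid_t host_uid, gid_t host_gid)
{
    char path[64];

    snprintf(path, sizeof(path), "/proc/%d/uid_map", (int)child_pid);
    if (write_id_map(path, 0, (unsigned)host_uid, 1) != 0)
        return -1;

    /* Required for unprivileged parents before gid_map may be written. */
    snprintf(path, sizeof(path), "/proc/%d/setgroups", (int)child_pid);
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t written = write(fd, "deny", 4);
    close(fd);
    if (written != 4)
        return -1;

    snprintf(path, sizeof(path), "/proc/%d/gid_map", (int)child_pid);
    return write_id_map(path, 0, (unsigned)host_gid, 1);
}
```

In this scheme the parent would call setup_id_maps(child_pid, getuid(), getgid()) and only then signal the child through the pipe.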
Cgroups
To deal with cgroup limits, I had two options:
- After cloning the child, move the process into the cgroup by writing the child PID to the cgroup.procs file. This is not the recommended method when we are creating a new process anyway.
- The other method was to use the CLONE_INTO_CGROUP flag for clone3, which requires support for the clone3 syscall in the Kernel. This made it easy to set cgroup limits and have the child created directly in the cgroup.
Mount and pivot_root
One of the toughest tasks was to understand pivot_root and do it correctly. It was made more complex by the need to mount procfs in the child namespace. After much experimentation, I managed to get it right. I have documented this in the Setting up the file-system section of my notes.
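For reference, one commonly used sequence for this step is sketched below. It follows the pivot_root(".", ".") pattern described in the pivot_root(2) manual page and is not necessarily the exact sequence from my notes. The function name enter_new_root is made up for this example. The sketch assumes that it runs in the child, inside new mount and PID namespaces, and that the new root already contains a /proc directory.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: switches the calling process to new_root and
 * mounts a fresh procfs. Returns 0 on success, -1 on failure.
 * Assumption: the caller is already in new mount and PID namespaces.
 */
int enter_new_root(const char *new_root)
{
    struct stat st;
    if (stat(new_root, &st) != 0)
        return -1;
    if (!S_ISDIR(st.st_mode)) {
        errno = ENOTDIR;
        return -1;
    }

    /* 1. Keep mount changes from propagating back to the host. */
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) != 0)
        return -1;

    /* 2. pivot_root needs new_root to be a mount point, so bind it onto itself. */
    if (mount(new_root, new_root, NULL, MS_BIND | MS_REC, NULL) != 0)
        return -1;

    /* 3. Swap the root; glibc has no wrapper, so use syscall(). */
    if (chdir(new_root) != 0)
        return -1;
    if (syscall(SYS_pivot_root, ".", ".") != 0)
        return -1;

    /* 4. The old root is now stacked at "."; detach it lazily. */
    if (umount2(".", MNT_DETACH) != 0)
        return -1;
    if (chdir("/") != 0)
        return -1;

    /* 5. Mount procfs so it reflects the new PID namespace. */
    if (mount("proc", "/proc", "proc",
              MS_NOSUID | MS_NODEV | MS_NOEXEC, NULL) != 0)
        return -1;

    return 0;
}
```

Mounting procfs after the pivot, from inside the new PID namespace, is what makes tools like ps show only the container's processes.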
Network connection with the host
The last part that I wanted to implement was a veth connection between the host network namespace (as veth0) and the container (as veth1). I failed here, due to my lack of experience with netlink, and rtnetlink in particular. I tried using both a raw netlink socket and the libnl library, but I faced issues that I could not debug. I have added it to my TODO list, and I will pick it up once I have enough experience with netlink.
My notes
My notes are available here.
Thank you for reading the blog ❤️. I hope my work helped you!