Recent weeks brought a number of interesting articles on Hacker News regarding Docker – an open-source project to develop lightweight app containers. The idea is quite elegant – isolate user applications together with their run-time environments in deployable packages. Such a setup should make it easy to distribute apps and to host a number of them inside a single environment. The technology to do it has been out there for a couple of years now and is the workhorse behind popular PaaS providers such as Heroku.
The traditional way of securely handling more than one application on one machine was to use tools like chroot or complex virtualization techniques along the lines of VMware ESX, KVM or Xen. Unfortunately, both models have serious limitations – chroot is too simple, offering only filesystem separation, while VMs are complex to maintain and carry substantial overhead from the guest OS.
Around 2007, in Linux kernels 2.6.19 and 2.6.24, work inside the kernel set out to provide two foundation stones for modern OS-level virtualization: namespaces and cgroups. On top of that, LinuX Containers (LXC) were developed to run multiple isolated Linux systems (containers) on a single machine. The most recent addition to the bunch, Docker, provides a nice and easy way to put your app with all its dependencies in a single, standard “container” to be deployed. This series of posts will try to uncover the details behind all of this, starting with the lowest level.
Kernel namespaces first appeared in the Linux kernel as far back as 2002, when mount namespaces were introduced in kernel 2.4.19. The core concept is quite interesting – provide a way to give different processes different views of the system. In practice this means being able to put apps in isolated environments with separate process lists, network devices, user lists and filesystems. Such functionality can be implemented inside the kernel without the need to run a hypervisor or any other virtualization layer. The OS needs to keep separate internal structures and make sure they remain isolated. The mechanism itself looks quite simple on paper, but takes expertise and extra care to implement correctly.
Kernel Namespaces are created and manipulated using 3 basic syscalls:
clone() which creates a new process and allows the child to share parts of the execution context with its parent. This low-level function sits behind the well known fork().
unshare() which allows processes to disassociate parts of their execution context. The main use of this syscall is to modify the execution context of a process without spawning a new child process.
setns() which changes the namespace of the calling process.
While clone() and unshare() are standard Linux syscalls, setns() is a new addition dedicated solely to manipulating namespace membership. Namespace support was added to both clone() and unshare() through a set of additional flags that indicate which namespaces are to be created for the child process (or which are to be disassociated from the caller).
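To make the flag mechanism concrete, here is a minimal Python sketch that combines namespace flags into a single mask and passes it to unshare() or setns() through ctypes. The flag values are the ones defined in <linux/sched.h>. The short dictionary keys and the helper names (namespace_mask, unshare, setns) are this sketch's own choices, not part of any library. Actually calling unshare() or setns() usually requires root privileges, so treat those two helpers as an illustration rather than a finished tool.

```python
import ctypes
import os

# Flag values as defined in <linux/sched.h>; the short keys are illustrative.
CLONE_FLAGS = {
    "mnt": 0x00020000,   # CLONE_NEWNS
    "uts": 0x04000000,   # CLONE_NEWUTS
    "ipc": 0x08000000,   # CLONE_NEWIPC
    "user": 0x10000000,  # CLONE_NEWUSER
    "pid": 0x20000000,   # CLONE_NEWPID
    "net": 0x40000000,   # CLONE_NEWNET
}


def namespace_mask(names):
    """OR together the clone()/unshare() flags for the given namespace names."""
    mask = 0
    for name in names:
        mask |= CLONE_FLAGS[name]
    return mask


def _libc():
    # Load the C library already linked into this process (Linux only);
    # use_errno=True lets us read errno after a failed call.
    return ctypes.CDLL(None, use_errno=True)


def unshare(names):
    """Detach the calling process into new namespaces (usually needs root)."""
    if _libc().unshare(namespace_mask(names)) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def setns(path, name):
    """Join the namespace referred to by a file such as /proc/<pid>/ns/net."""
    fd = os.open(path, os.O_RDONLY)
    try:
        if _libc().setns(fd, CLONE_FLAGS[name]) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
    finally:
        os.close(fd)
```

For example, namespace_mask(["mnt", "pid"]) yields the value a C program would write as CLONE_NEWNS | CLONE_NEWPID.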
At the time of writing, the Linux kernel implements 6 namespaces:
mnt – separates mountpoints and filesystems. Allows processes to have separate root partitions and mounts. Think of it as chroot on steroids. The clone() flag indicating that a new mount namespace needs to be created is CLONE_NEWNS.
pid – separates process trees and environments. Each pid namespace has its own PID 1 init process. A single process can reside in only one pid namespace at a time, but note that pid namespaces form a hierarchy: processes in a given namespace are visible to processes in the pid namespace of the parent process. This also means that a process can have more than one PID number, one for each namespace it is visible in. The clone() flag to create a new pid namespace is CLONE_NEWPID. An important thing to remember is that some programs rely on the contents of /proc to inspect running processes. To make that list reflect the true contents of a pid namespace, a new instance of procfs needs to be mounted over /proc.
net – a separate network stack. The default net namespace contains the loopback device and all physical cards present in the system, while a newly created net namespace contains only a single loopback device. To allow a namespace to reach other networks or use physical devices, a virtual ethernet (veth) device can be set up. A veth device is composed of a pair of network devices that are linked together. Usually both are created in the default net namespace and one member of the pair is then migrated into the newly created namespace to act as a tunnel endpoint. Net namespaces can be easily manipulated with the ip netns command.
ipc – an isolated System V IPC namespace with isolated objects and POSIX message queues.
uts – a hostname namespace where each uts namespace has its own host name and NIS domain name, as returned by the uname() call. Other uname() fields, such as the OS release, come from the kernel and are the same in every namespace. Quite useful when running multiple virtual servers.
user – a mechanism to have separate users, groups and capabilities lists. Each namespace holds a mapping of host system user ids to user ids in the namespace. This is the most recent addition to the Linux Kernel (since 3.8) and not all OS-es are user namespace aware.
Namespaces don’t have names. Instead, each gets a unique inode number at the time of its creation. In Linux Kernel 3.8 and higher, you can use the /proc filesystem to check in which namespace a given process resides. Simply go to /proc/<pid>/ns and you will see links for each namespace type.
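As a small illustration of the /proc/<pid>/ns interface, the following Python sketch reads those links and extracts the inode numbers. It assumes a kernel where the entries are symbolic links whose targets look like net:[4026531956] (the inode value here is only an example). The function names parse_ns_link, namespaces_of and same_namespace are made up for this sketch.

```python
import os
import re

# A link target looks like "net:[4026531956]": namespace type, then inode.
_LINK_RE = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a link target such as 'net:[4026531956]' into (type, inode)."""
    match = _LINK_RE.match(target)
    if match is None:
        raise ValueError("unexpected namespace link: %r" % target)
    return match.group(1), int(match.group(2))


def namespaces_of(pid="self"):
    """Return {entry name: inode number} for every link in /proc/<pid>/ns."""
    ns_dir = "/proc/%s/ns" % pid
    result = {}
    for entry in sorted(os.listdir(ns_dir)):
        _ns_type, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
        result[entry] = inode
    return result


def same_namespace(pid_a, pid_b, entry):
    """True if both processes point at the same namespace inode for entry."""
    return namespaces_of(pid_a)[entry] == namespaces_of(pid_b)[entry]
```

Two processes share a namespace exactly when the inode numbers match, so same_namespace("self", 1, "net") tells you whether the current process lives in the same net namespace as PID 1. Reading another process's links may require sufficient privileges.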
Apart from the above mentioned syscall changes, a number of user-space tools have been provided.
The ip program provides a very elaborate mechanism to manipulate net namespaces from the commandline:
ip netns add <name> – creates a new net namespace and also creates a file under /var/run/netns/<name> which can be used with setns().
ip netns del <name> – deletes a net namespace.
ip netns list – prints a list of all existing net namespaces.
ip netns pids <name> – lists all processes in a given net namespace.
ip netns exec <namespace> <program> – starts a program in a namespace.
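Combining these commands with the veth setup described earlier, a typical sequence for wiring a new net namespace to the host might look like the Python sketch below. The interface names veth0 and veth1 and the helper names are placeholders chosen for this example. The helper only builds the command lines and prints them by default, because actually running them requires root.

```python
import subprocess


def veth_setup_commands(namespace, host_if="veth0", ns_if="veth1"):
    """Return the ip command lines that connect a new net namespace to the host."""
    return [
        # Create the named net namespace.
        ["ip", "netns", "add", namespace],
        # Create both ends of the veth pair in the default namespace ...
        ["ip", "link", "add", host_if, "type", "veth", "peer", "name", ns_if],
        # ... then migrate one end into the new namespace.
        ["ip", "link", "set", ns_if, "netns", namespace],
        # Bring up the loopback device inside the new namespace.
        ["ip", "netns", "exec", namespace, "ip", "link", "set", "lo", "up"],
    ]


def run_all(commands, dry_run=True):
    """Print each command, or execute it when dry_run is False (needs root)."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)
```

Calling run_all(veth_setup_commands("demo")) prints the four commands without touching the system. Assigning addresses and bringing the veth interfaces up is left out to keep the sketch short.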
Another important user-space tool to manipulate namespaces is the unshare command-line utility. It allows you to spawn a program in an isolated environment by specifying which namespaces are to be created, e.g. unshare -n bash will spawn a bash shell in a new net namespace.
Check out the LWN series of articles to learn more: http://lwn.net/Articles/531114/