There’s no such thing as a container.
Not in the Linux kernel, anyway. There’s no “container” system call, no container data structure, no container subsystem. What we call “containers” are just regular Linux processes with some isolation applied. That isolation comes from two kernel features: namespaces and cgroups.
Understanding these primitives demystifies containers, helps you debug them, and explains why certain things work the way they do.
The Two Pillars ¶
Namespaces: Control what a process can see. A process in a PID namespace sees a different set of processes than the host. A process in a network namespace has its own network stack.
Cgroups: Control what a process can use. Limit CPU, memory, IO, and other resources. Account for resource usage.
That’s it. A “container” is a process with namespace isolation and cgroup limits. Everything else — images, layers, runtimes — is tooling built on top of these primitives.
Namespaces: Isolation of View ¶
Linux has eight namespace types. Each isolates a different aspect of the system:
| Namespace | Isolates | Flag |
|---|---|---|
| PID | Process IDs | CLONE_NEWPID |
| Network | Network stack (interfaces, routing, ports) | CLONE_NEWNET |
| Mount | Filesystem mounts | CLONE_NEWNS |
| UTS | Hostname and domain name | CLONE_NEWUTS |
| IPC | Inter-process communication (semaphores, message queues) | CLONE_NEWIPC |
| User | User and group IDs | CLONE_NEWUSER |
| Cgroup | Cgroup root directory | CLONE_NEWCGROUP |
| Time | System clocks (Linux 5.6+) | CLONE_NEWTIME |
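You can list the namespaces in use on a running system with lsns from util-linux. Here is what it might look like with one container running (output abbreviated and illustrative):
$ lsns --type net
        NS TYPE NPROCS   PID USER COMMAND
4026531992 net     243     1 root /sbin/init
4026532289 net       3 12345 root nginx: master process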
PID Namespace ¶
In a new PID namespace, the first process becomes PID 1. It can see only processes in its own namespace and in descendant namespaces:
# Host sees hundreds of processes
$ ps aux | wc -l
247
# Create a new PID namespace
$ sudo unshare --pid --fork --mount-proc bash
# Inside: only see processes in this namespace
$ ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 9032 5120 pts/0 S 10:00 0:00 bash
root 8 0.0 0.0 10768 3328 pts/0 R+ 10:00 0:00 ps aux
PID 1 inside the namespace is not the real init. It’s just bash, but it has PID 1 from its own perspective. The host still sees it with its real PID.
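You can see both views at once from the host: the NSpid field in /proc/&lt;pid&gt;/status lists a process's PID in each nested PID namespace, outermost first (12345 here is a made-up host PID):
# From another shell on the host
$ grep NSpid /proc/12345/status
NSpid:  12345   1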
Network Namespace ¶
Each network namespace has its own interfaces, routing tables, iptables rules, and ports:
# Create a new network namespace
$ sudo ip netns add mycontainer
# The new namespace is isolated — no interfaces except loopback
$ sudo ip netns exec mycontainer ip link
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN
# Create a veth pair to connect namespaces
$ sudo ip link add veth-host type veth peer name veth-container
$ sudo ip link set veth-container netns mycontainer
# Configure the interface inside the namespace
$ sudo ip netns exec mycontainer ip addr add 10.0.0.2/24 dev veth-container
$ sudo ip netns exec mycontainer ip link set veth-container up
$ sudo ip netns exec mycontainer ip link set lo up
# Configure host side
$ sudo ip addr add 10.0.0.1/24 dev veth-host
$ sudo ip link set veth-host up
# Now they can communicate
$ sudo ip netns exec mycontainer ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.
64 bytes from 10.0.0.1: icmp_seq=1 ttl=64 time=0.048 ms
This is exactly how default container bridge networking works — veth pairs connecting container network namespaces to a bridge on the host.
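To sketch the rest of that model, move the host-side address onto a bridge and NAT outbound traffic. The bridge name br0 and the 10.0.0.0/24 subnet are just this article's examples; real runtimes do the equivalent with their own names:
# Create a bridge and attach the host end of the veth pair
$ sudo ip link add br0 type bridge
$ sudo ip link set br0 up
$ sudo ip addr del 10.0.0.1/24 dev veth-host
$ sudo ip addr add 10.0.0.1/24 dev br0
$ sudo ip link set veth-host master br0
# Give the namespace a default route, then NAT it to the outside world
$ sudo ip netns exec mycontainer ip route add default via 10.0.0.1
$ sudo sysctl -w net.ipv4.ip_forward=1
$ sudo iptables -t nat -A POSTROUTING -s 10.0.0.0/24 ! -o br0 -j MASQUERADE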
Mount Namespace ¶
Isolates the filesystem view. Each namespace can have different mounts:
# Create a new mount namespace
$ sudo unshare --mount bash
# Mounts here don't affect the host
$ mount -t tmpfs tmpfs /mnt
$ echo "secret" > /mnt/data
# Exit and check — /mnt is empty on host
$ exit
$ ls /mnt
# Empty
This is how containers get their own root filesystem. The container’s mount namespace has the container image mounted as root, while the host sees its normal filesystem.
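As a rough sketch (assuming you have extracted some root filesystem to /tmp/rootfs; that path is made up for this example), a mount namespace plus chroot approximates what a runtime does. The hand-built container later in this article uses the more robust pivot_root:
$ sudo unshare --mount bash
# Bind-mount the rootfs onto itself so it's a real mount point
$ mount --bind /tmp/rootfs /tmp/rootfs
$ cd /tmp/rootfs
$ mount -t proc proc proc/
$ chroot . /bin/sh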
UTS Namespace ¶
Isolates hostname and domain name:
$ sudo unshare --uts bash
$ hostname container-1
$ hostname
container-1
# Host hostname unchanged
$ exit
$ hostname
myhost.example.com
User Namespace ¶
Maps user IDs between namespaces. Root inside the container can be a non-root user on the host:
# Create user namespace (can be done without root)
$ unshare --user --map-root-user bash
$ whoami
root
$ id
uid=0(root) gid=0(root) groups=0(root)
# But on the host, this process runs as your regular user
This is the basis of “rootless containers” — the process thinks it’s root, but has no real root privileges on the host.
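The mapping is visible in /proc. Assuming your host UID is 1000 (adjust for your system), inside the namespace you'd see:
$ cat /proc/self/uid_map
         0       1000          1
# Container UID 0 maps to host UID 1000, for a range of 1 UID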
Inspecting Namespaces ¶
Every process has namespace references in /proc:
$ ls -la /proc/$$/ns/
lrwxrwxrwx 1 user user 0 Jan 25 10:00 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 user user 0 Jan 25 10:00 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 user user 0 Jan 25 10:00 mnt -> 'mnt:[4026531840]'
lrwxrwxrwx 1 user user 0 Jan 25 10:00 net -> 'net:[4026531992]'
lrwxrwxrwx 1 user user 0 Jan 25 10:00 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 user user 0 Jan 25 10:00 user -> 'user:[4026531837]'
lrwxrwxrwx 1 user user 0 Jan 25 10:00 uts -> 'uts:[4026531838]'
The numbers (like 4026531836) are inode numbers that uniquely identify each namespace. Two processes are in the same namespace exactly when these inode numbers match.
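A quick way to compare two processes: readlink both symlinks and check whether the targets match (output illustrative; here both processes share the host's network namespace):
$ readlink /proc/1/ns/net /proc/$$/ns/net
net:[4026531992]
net:[4026531992]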
Enter a container’s namespace:
# Find container's PID on host
$ docker inspect --format '{{.State.Pid}}' mycontainer
12345
# Enter its namespaces
$ sudo nsenter --target 12345 --mount --uts --ipc --net --pid bash
Cgroups: Resource Control ¶
Namespaces isolate view. Cgroups limit resources.
Cgroups (control groups) allow you to:
- Limit resources (max 1 CPU, max 512MB memory)
- Prioritize resources (this group gets more CPU than that one)
- Account for usage (how much CPU has this group used?)
- Control processes (freeze all processes in a group)
Cgroup Hierarchy ¶
Cgroups are organized in a hierarchy, exposed as a filesystem:
# Cgroups v2 unified hierarchy (modern)
$ ls /sys/fs/cgroup/
cgroup.controllers cgroup.subtree_control cpu.stat memory.current ...
# Cgroups v1 (legacy) — separate hierarchies per controller
$ ls /sys/fs/cgroup/
blkio cpu cpuacct cpuset devices freezer memory net_cls pids ...
Cgroups v1 vs v2 ¶
Cgroups v1 (legacy):
- Separate hierarchy per resource controller
- A process can be in different cgroups for CPU vs memory
- Complex, hard to manage
- Still common in older systems
Cgroups v2 (unified, recommended):
- Single unified hierarchy
- A process is in one cgroup for all controllers
- Simpler, better resource distribution
- Default in recent Linux distributions (Ubuntu 22.04+, Fedora 31+)
Check which version you’re using:
# If this exists and has content, you're on v2
$ cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc
# If this exists, you're on v1 or hybrid
$ ls /sys/fs/cgroup/memory/
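A one-liner alternative: the filesystem type of /sys/fs/cgroup is cgroup2fs on a pure v2 system and tmpfs on v1 or hybrid:
$ stat -fc %T /sys/fs/cgroup/
cgroup2fs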
Creating a Cgroup (v2) ¶
# Create a new cgroup
$ sudo mkdir /sys/fs/cgroup/mycontainer
# Set memory limit (256MB)
$ echo $((256 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/mycontainer/memory.max
# Set CPU limit (50% of one core)
$ echo "50000 100000" | sudo tee /sys/fs/cgroup/mycontainer/cpu.max
# Add a process to the cgroup
$ echo $$ | sudo tee /sys/fs/cgroup/mycontainer/cgroup.procs
# Now this shell and its children are limited
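To verify the memory limit, deliberately blow past it. One classic trick: tail buffers its entire input when the stream has no newlines, so this pipeline allocates memory until the cgroup's OOM killer steps in. (If the earlier writes to memory.max failed, check that the memory controller is enabled in the parent's cgroup.subtree_control.)
# Inside the limited shell: try to buffer ~512MB in a 256MB cgroup
$ head -c 512M /dev/zero | tail
Killed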
Memory Limits ¶
# memory.max — hard limit (OOM kill if exceeded)
$ cat /sys/fs/cgroup/mycontainer/memory.max
268435456
# memory.current — current usage
$ cat /sys/fs/cgroup/mycontainer/memory.current
8462336
# memory.high — throttling threshold (soft limit)
$ cat /sys/fs/cgroup/mycontainer/memory.high
max
When a cgroup's usage hits memory.max and the kernel can't reclaim enough memory, it OOM-kills a process in the group. This is why containers get “OOMKilled” — they hit their cgroup memory limit.
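You can confirm that a kill came from the cgroup by reading memory.events (counts illustrative):
$ cat /sys/fs/cgroup/mycontainer/memory.events
low 0
high 0
max 41
oom 1
oom_kill 1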
CPU Limits ¶
CPU limits use the CFS (Completely Fair Scheduler) bandwidth controller:
# cpu.max — format: "$MAX $PERIOD" (microseconds)
$ cat /sys/fs/cgroup/mycontainer/cpu.max
50000 100000
This means: in every 100ms period, the cgroup can use 50ms of CPU time — effectively 50% of one core.
The CPU throttling problem:
If your application uses all its quota early in the period, it gets throttled until the next period. This causes latency spikes:
Period 1: |████████░░░░░░░░░░░░| Used quota early, throttled for 50ms
Period 2: |████████░░░░░░░░░░░░| Same pattern
This is why Kubernetes apps with CPU limits can have inconsistent latency — they’re being throttled even when the node has spare CPU.
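One common mitigation is to drop the hard quota and use proportional weights instead. cpu.weight (default 100, range 1 to 10000) shares CPU under contention but never throttles a process when the machine is otherwise idle:
# Remove the hard cap, keep a 2x proportional share
$ echo max | sudo tee /sys/fs/cgroup/mycontainer/cpu.max
$ echo 200 | sudo tee /sys/fs/cgroup/mycontainer/cpu.weight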
IO Limits ¶
# io.max — limit IO per device
# Format: "$MAJOR:$MINOR rbps=$READ_BPS wbps=$WRITE_BPS riops=$READ_IOPS wiops=$WRITE_IOPS"
$ echo "8:0 rbps=10485760 wbps=10485760" | sudo tee /sys/fs/cgroup/mycontainer/io.max
PID Limits ¶
Limit the number of processes:
# pids.max — maximum number of processes
$ echo 100 | sudo tee /sys/fs/cgroup/mycontainer/pids.max
This prevents fork bombs from taking down the host.
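Once the limit is hit, fork() fails with EAGAIN instead of spawning; pids.current shows usage and pids.events counts how often the limit was enforced (output illustrative):
$ cat /sys/fs/cgroup/mycontainer/pids.current
4
$ cat /sys/fs/cgroup/mycontainer/pids.events
max 2917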
Viewing Container Cgroups ¶
Find a container’s cgroup:
# Get container PID
$ docker inspect --format '{{.State.Pid}}' mycontainer
12345
# Find its cgroup
$ cat /proc/12345/cgroup
0::/system.slice/docker-abc123.scope
# View its limits
$ cat /sys/fs/cgroup/system.slice/docker-abc123.scope/memory.max
536870912
Building a “Container” by Hand ¶
Let’s create a container using only Linux primitives — no Docker, no containerd:
#!/bin/bash
# mini-container.sh — A container in 50 lines
set -e
ROOTFS="/tmp/container-rootfs"
CGROUP="/sys/fs/cgroup/mini-container"
# Create a minimal rootfs (using Alpine)
mkdir -p "$ROOTFS"
if [ ! -f "$ROOTFS/bin/sh" ]; then
echo "Downloading Alpine rootfs..."
curl -L https://dl-cdn.alpinelinux.org/alpine/v3.19/releases/x86_64/alpine-minirootfs-3.19.0-x86_64.tar.gz | tar xz -C "$ROOTFS"
fi
# Create cgroup with limits
sudo mkdir -p "$CGROUP"
echo $((128 * 1024 * 1024)) | sudo tee "$CGROUP/memory.max" > /dev/null # 128MB
echo "50000 100000" | sudo tee "$CGROUP/cpu.max" > /dev/null # 50% CPU
echo 50 | sudo tee "$CGROUP/pids.max" > /dev/null # 50 processes
# Launch container with namespaces
sudo unshare \
--pid \
--mount \
--uts \
--ipc \
--net \
--fork \
--cgroup \
/bin/bash -c "
# Add ourselves to the cgroup
echo \$\$ > $CGROUP/cgroup.procs
# Set hostname
hostname container
# Setup mount namespace
mount --make-rprivate /
# Mount proc and sys in new root
mkdir -p $ROOTFS/proc $ROOTFS/sys $ROOTFS/dev
# Pivot to new root (pivot_root requires the new root to be a mount
# point, so bind-mount the rootfs onto itself first)
mount --bind $ROOTFS $ROOTFS
cd $ROOTFS
mkdir -p .old-root
pivot_root . .old-root
# Mount essential filesystems
mount -t proc proc /proc
mount -t sysfs sys /sys
mount -t devtmpfs dev /dev
# Unmount old root
umount -l /.old-root
rmdir /.old-root
# Run shell
exec /bin/sh
"
# Cleanup cgroup on exit
sudo rmdir "$CGROUP" 2>/dev/null || true
This creates a process that:
- Has its own PID namespace (PID 1 inside)
- Has its own mount namespace (Alpine rootfs as /)
- Has its own hostname
- Is limited to 128MB memory, 50% CPU, 50 processes
That’s a container. No Docker required.
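If you run the script, a quick check from inside should look something like this (output illustrative; the prompt is Alpine's BusyBox shell):
/ # hostname
container
/ # ps aux
PID   USER     TIME  COMMAND
    1 root      0:00 /bin/sh
    7 root      0:00 ps aux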
How Docker/containerd Use These Primitives ¶
When you run docker run alpine sh, Docker:
- Pulls the image — downloads and extracts layers to a directory
- Creates a rootfs — uses a union filesystem (overlayfs) to combine layers
- Creates namespaces — PID, mount, network, UTS, IPC
- Creates a cgroup — with resource limits from --memory, --cpus, etc.
- Sets up networking — creates veth pair, connects to bridge
- Pivots root — makes the image rootfs the container’s /
- Drops privileges — applies seccomp, capabilities, etc.
- Execs the entrypoint — runs your process
Every container runtime (Docker, containerd, CRI-O, podman) does these same steps, just with different tooling.
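You can even skip the high-level tooling and drive runc, the reference OCI runtime that Docker and containerd delegate to, directly. A minimal sketch, assuming you've extracted a root filesystem into bundle/rootfs:
$ mkdir -p bundle/rootfs
$ cd bundle
# ...extract an image's rootfs into rootfs/ first, then:
$ runc spec                  # writes a default config.json
$ sudo runc run mycontainer  # namespaces + cgroups + pivot_root, as above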
Debugging with Primitives ¶
Knowing the primitives helps you debug:
See a container’s namespaces:
$ sudo ls -la /proc/<container-pid>/ns/
Enter a container’s namespace:
$ sudo nsenter -t <container-pid> -n ip addr # Network namespace only
$ sudo nsenter -t <container-pid> -a bash # All namespaces
Check cgroup limits:
$ cat /sys/fs/cgroup/<cgroup-path>/memory.max
$ cat /sys/fs/cgroup/<cgroup-path>/cpu.max
Check cgroup usage:
$ cat /sys/fs/cgroup/<cgroup-path>/memory.current
$ cat /sys/fs/cgroup/<cgroup-path>/cpu.stat
See what’s throttling:
$ cat /sys/fs/cgroup/<cgroup-path>/cpu.stat
usage_usec 123456789
user_usec 100000000
system_usec 23456789
nr_periods 10000
nr_throttled 500 # <-- Throttled 500 times
throttled_usec 5000000 # <-- 5 seconds total throttle time
Summary ¶
Containers are not magic. They’re Linux processes with:
- Namespaces for isolation (what the process can see)
- Cgroups for resource control (what the process can use)
Understanding these primitives helps you:
- Debug container issues at the source
- Understand resource limits and why things get OOM-killed or throttled
- Demystify container networking
- Build custom isolation when needed
When a container misbehaves, you’re not debugging “container technology” — you’re debugging Linux processes with isolation. The same tools work: /proc, strace, nsenter, and the cgroup filesystem.