What Is a Container, Really? Namespaces, Cgroups, and the Linux Primitives Behind Docker


There’s no such thing as a container.

Not in the Linux kernel, anyway. There’s no “container” system call, no container data structure, no container subsystem. What we call “containers” are just regular Linux processes with some isolation applied. That isolation comes from two kernel features: namespaces and cgroups.

Understanding these primitives demystifies containers, helps you debug them, and explains why certain things work the way they do.

Namespaces: Control what a process can see. A process in a PID namespace sees a different set of processes than the host. A process in a network namespace has its own network stack.

Cgroups: Control what a process can use. Limit CPU, memory, IO, and other resources. Account for resource usage.

That’s it. A “container” is a process with namespace isolation and cgroup limits. Everything else — images, layers, runtimes — is tooling built on top of these primitives.

Linux has eight namespace types. Each isolates a different aspect of the system:

Namespace  Isolates                                                  Flag
PID        Process IDs                                               CLONE_NEWPID
Network    Network stack (interfaces, routing, ports)                CLONE_NEWNET
Mount      Filesystem mounts                                         CLONE_NEWNS
UTS        Hostname and domain name                                  CLONE_NEWUTS
IPC        Inter-process communication (semaphores, message queues)  CLONE_NEWIPC
User       User and group IDs                                        CLONE_NEWUSER
Cgroup     Cgroup root directory                                     CLONE_NEWCGROUP
Time       System clocks (Linux 5.6+)                                CLONE_NEWTIME

In a new PID namespace, the first process becomes PID 1. It can see only processes in its own namespace and in descendant namespaces:

# Host sees hundreds of processes
$ ps aux | wc -l
247

# Create a new PID namespace
$ sudo unshare --pid --fork --mount-proc bash

# Inside: only see processes in this namespace
$ ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0   9032  5120 pts/0    S    10:00   0:00 bash
root         8  0.0  0.0  10768  3328 pts/0    R+   10:00   0:00 ps aux

PID 1 inside the namespace is not the real init. It’s just bash, but it has PID 1 from its own perspective. The host still sees it with its real PID.
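This dual identity is visible from the host: the NSpid line in /proc/&lt;pid&gt;/status lists a process's PID in every PID namespace it belongs to, outermost first, so a containerized process shows two or more entries. For a process in only the root namespace there is just one:

```shell
# NSpid lists this process's PID in each PID namespace, host-side first;
# for a containerized process you'd see something like "NSpid: 12345  1"
grep NSpid /proc/self/status
```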

Each network namespace has its own interfaces, routing tables, iptables rules, and ports:

# Create a new network namespace
$ sudo ip netns add mycontainer

# The new namespace is isolated — no interfaces except loopback
$ sudo ip netns exec mycontainer ip link
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN

# Create a veth pair to connect namespaces
$ sudo ip link add veth-host type veth peer name veth-container
$ sudo ip link set veth-container netns mycontainer

# Configure the interface inside the namespace
$ sudo ip netns exec mycontainer ip addr add 10.0.0.2/24 dev veth-container
$ sudo ip netns exec mycontainer ip link set veth-container up
$ sudo ip netns exec mycontainer ip link set lo up

# Configure host side
$ sudo ip addr add 10.0.0.1/24 dev veth-host
$ sudo ip link set veth-host up

# Now they can communicate
$ sudo ip netns exec mycontainer ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.
64 bytes from 10.0.0.1: icmp_seq=1 ttl=64 time=0.048 ms

This is exactly how container networking works — veth pairs connecting container network namespaces to a bridge on the host.

The mount namespace isolates the filesystem view. Each mount namespace can have different mounts:

# Create a new mount namespace
$ sudo unshare --mount bash

# Mounts here don't affect the host
$ mount -t tmpfs tmpfs /mnt
$ echo "secret" > /mnt/data

# Exit and check — /mnt is empty on host
$ exit
$ ls /mnt
# Empty

This is how containers get their own root filesystem. The container’s mount namespace has the container image mounted as root, while the host sees its normal filesystem.

The UTS namespace isolates the hostname and domain name:

$ sudo unshare --uts bash
$ hostname container-1
$ hostname
container-1

# Host hostname unchanged
$ exit
$ hostname
myhost.example.com

The user namespace maps user IDs between namespaces. Root inside the container can be a non-root user on the host:

# Create user namespace (can be done without root)
$ unshare --user --map-root-user bash

$ whoami
root

$ id
uid=0(root) gid=0(root) groups=0(root)

# But on the host, this process runs as your regular user

This is the basis of “rootless containers” — the process thinks it’s root, but has no real root privileges on the host.
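The mapping itself lives in /proc/&lt;pid&gt;/uid_map — one "inside-uid outside-uid count" range per line, readable without privileges:

```shell
# Each line maps a UID range: <uid inside ns> <uid outside ns> <count>.
# In the initial user namespace this is the identity map over all UIDs.
cat /proc/self/uid_map
```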

Every process has namespace references in /proc:

$ ls -la /proc/$$/ns/
lrwxrwxrwx 1 user user 0 Jan 25 10:00 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 user user 0 Jan 25 10:00 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 user user 0 Jan 25 10:00 mnt -> 'mnt:[4026531840]'
lrwxrwxrwx 1 user user 0 Jan 25 10:00 net -> 'net:[4026531992]'
lrwxrwxrwx 1 user user 0 Jan 25 10:00 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 user user 0 Jan 25 10:00 user -> 'user:[4026531837]'
lrwxrwxrwx 1 user user 0 Jan 25 10:00 uts -> 'uts:[4026531838]'

The numbers (like 4026531836) are inode numbers. Processes in the same namespace share the same inode number.
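This makes namespace membership easy to check from a script: readlink the symlinks and compare. A minimal sketch, comparing a shell with itself (which trivially matches):

```shell
# readlink on /proc/<pid>/ns/* needs no privileges for your own processes
a=$(readlink /proc/$$/ns/uts)
b=$(readlink /proc/self/ns/uts)
[ "$a" = "$b" ] && echo "same UTS namespace ($a)"
```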

Enter a container’s namespace:

# Find container's PID on host
$ docker inspect --format '{{.State.Pid}}' mycontainer
12345

# Enter its namespaces
$ sudo nsenter --target 12345 --mount --uts --ipc --net --pid bash

Namespaces control what a process can see. Cgroups control what it can use.

Cgroups (control groups) allow you to:

  • Limit resources (max 1 CPU, max 512MB memory)
  • Prioritize resources (this group gets more CPU than that one)
  • Account for usage (how much CPU has this group used?)
  • Control processes (freeze all processes in a group)

Cgroups are organized in a hierarchy, exposed as a filesystem:

# Cgroups v2 unified hierarchy (modern)
$ ls /sys/fs/cgroup/
cgroup.controllers  cgroup.subtree_control  cpu.stat  memory.current  ...

# Cgroups v1 (legacy) — separate hierarchies per controller
$ ls /sys/fs/cgroup/
blkio  cpu  cpuacct  cpuset  devices  freezer  memory  net_cls  pids  ...

Cgroups v1 (legacy):

  • Separate hierarchy per resource controller
  • A process can be in different cgroups for CPU vs memory
  • Complex, hard to manage
  • Still common in older systems

Cgroups v2 (unified, recommended):

  • Single unified hierarchy
  • A process is in one cgroup for all controllers
  • Simpler, better resource distribution
  • Default in recent Linux distributions (Ubuntu 22.04+, Fedora 31+)

Check which version you’re using:

# If this exists and has content, you're on v2
$ cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc

# If this exists, you're on v1 or hybrid
$ ls /sys/fs/cgroup/memory/

Creating a cgroup and applying limits is just filesystem operations (v2 shown):

# Create a new cgroup
$ sudo mkdir /sys/fs/cgroup/mycontainer

# Set memory limit (256MB)
$ echo $((256 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/mycontainer/memory.max

# Set CPU limit (50% of one core)
$ echo "50000 100000" | sudo tee /sys/fs/cgroup/mycontainer/cpu.max

# Add a process to the cgroup
$ echo $$ | sudo tee /sys/fs/cgroup/mycontainer/cgroup.procs

# Now this shell and its children are limited

The memory controller's main interface files:

# memory.max — hard limit (OOM kill if exceeded)
$ cat /sys/fs/cgroup/mycontainer/memory.max
268435456

# memory.current — current usage
$ cat /sys/fs/cgroup/mycontainer/memory.current
8462336

# memory.high — throttling threshold (soft limit)
$ cat /sys/fs/cgroup/mycontainer/memory.high
max

When a process exceeds memory.max, the kernel first tries to reclaim memory from the cgroup and, if that fails, OOM-kills a process in it. This is why containers get “OOMKilled” — they hit their cgroup memory limit.
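The values in these files are raw byte counts, so when debugging an OOMKilled container it helps to translate them back to something readable. For example, the 536870912 you'd see for a 512MB limit:

```shell
# A raw memory.max byte count translated back to MB for readability
BYTES=536870912
echo "$(( BYTES / 1024 / 1024 ))MB"   # 512MB
```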

CPU limits use the CFS (Completely Fair Scheduler) bandwidth controller:

# cpu.max — format: "$MAX $PERIOD" (microseconds)
$ cat /sys/fs/cgroup/mycontainer/cpu.max
50000 100000

This means: in every 100ms period, the cgroup can use 50ms of CPU time — effectively 50% of one core.
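Generalizing: quota = cores × period. A sketch in integer shell arithmetic (millicores avoid floating point), assuming the default 100ms period — e.g. an allowance of 1.5 cores:

```shell
# quota = desired cores * period, computed in millicores
PERIOD_US=100000
MILLICORES=1500                               # 1.5 cores
QUOTA_US=$(( MILLICORES * PERIOD_US / 1000 ))
echo "$QUOTA_US $PERIOD_US"                   # cpu.max format: "150000 100000"
```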

The CPU throttling problem:

If your application uses all its quota early in the period, it gets throttled until the next period. This causes latency spikes:

Period 1: |██████████░░░░░░░░░░| Used 50ms quota early, throttled for remaining 50ms
Period 2: |██████████░░░░░░░░░░| Same pattern

This is why Kubernetes apps with CPU limits can have inconsistent latency — they’re being throttled even when the node has spare CPU.

IO bandwidth and IOPS can be capped per device:

# io.max — limit IO per device
# Format: "$MAJOR:$MINOR rbps=$READ_BPS wbps=$WRITE_BPS riops=$READ_IOPS wiops=$WRITE_IOPS"
$ echo "8:0 rbps=10485760 wbps=10485760" | sudo tee /sys/fs/cgroup/mycontainer/io.max

Limit the number of processes:

# pids.max — maximum number of processes
$ echo 100 | sudo tee /sys/fs/cgroup/mycontainer/pids.max

This prevents fork bombs from taking down the host.

Find a container’s cgroup:

# Get container PID
$ docker inspect --format '{{.State.Pid}}' mycontainer
12345

# Find its cgroup
$ cat /proc/12345/cgroup
0::/system.slice/docker-abc123.scope

# View its limits
$ cat /sys/fs/cgroup/system.slice/docker-abc123.scope/memory.max
536870912

Let’s create a container using only Linux primitives — no Docker, no containerd:

#!/bin/bash
# mini-container.sh — A container in 50 lines

set -e

ROOTFS="/tmp/container-rootfs"
CGROUP="/sys/fs/cgroup/mini-container"

# Create a minimal rootfs (using Alpine)
mkdir -p "$ROOTFS"
if [ ! -f "$ROOTFS/bin/sh" ]; then
    echo "Downloading Alpine rootfs..."
    curl -L https://dl-cdn.alpinelinux.org/alpine/v3.19/releases/x86_64/alpine-minirootfs-3.19.0-x86_64.tar.gz | tar xz -C "$ROOTFS"
fi

# Create cgroup with limits
sudo mkdir -p "$CGROUP"
echo $((128 * 1024 * 1024)) | sudo tee "$CGROUP/memory.max" > /dev/null  # 128MB
echo "50000 100000" | sudo tee "$CGROUP/cpu.max" > /dev/null              # 50% CPU
echo 50 | sudo tee "$CGROUP/pids.max" > /dev/null                         # 50 processes

# Launch container with namespaces
sudo unshare \
    --pid \
    --mount \
    --uts \
    --ipc \
    --net \
    --fork \
    --cgroup \
    /bin/bash -c "
        # Add ourselves to the cgroup
        echo \$\$ > $CGROUP/cgroup.procs

        # Set hostname
        hostname container

        # Setup mount namespace
        mount --make-rprivate /
        
        # Mount proc and sys in new root
        mkdir -p $ROOTFS/proc $ROOTFS/sys $ROOTFS/dev
        
        # Pivot to new root — pivot_root requires the new root
        # to be a mount point, so bind-mount it onto itself first
        mount --bind $ROOTFS $ROOTFS
        cd $ROOTFS
        mkdir -p .old-root
        pivot_root . .old-root
        
        # Mount essential filesystems
        mount -t proc proc /proc
        mount -t sysfs sys /sys
        mount -t devtmpfs dev /dev
        
        # Unmount old root
        umount -l /.old-root
        rmdir /.old-root
        
        # Run shell
        exec /bin/sh
    "

# Cleanup cgroup on exit
sudo rmdir "$CGROUP" 2>/dev/null || true

This creates a process that:

  1. Has its own PID namespace (PID 1 inside)
  2. Has its own mount namespace (Alpine rootfs as /)
  3. Has its own hostname
  4. Is limited to 128MB memory, 50% CPU, 50 processes

That’s a container. No Docker required.

When you run docker run alpine sh, Docker:

  1. Pulls the image — downloads and extracts layers to a directory
  2. Creates a rootfs — uses a union filesystem (overlayfs) to combine layers
  3. Creates namespaces — PID, mount, network, UTS, IPC
  4. Creates a cgroup — with resource limits from --memory, --cpus, etc.
  5. Sets up networking — creates veth pair, connects to bridge
  6. Pivots root — makes the image rootfs the container’s /
  7. Drops privileges — applies seccomp, capabilities, etc.
  8. Execs the entrypoint — runs your process

Every container runtime (Docker, containerd, CRI-O, podman) does these same steps, just with different tooling.

Knowing the primitives helps you debug:

See a container’s namespaces:

$ sudo ls -la /proc/<container-pid>/ns/

Enter a container’s namespace:

$ sudo nsenter -t <container-pid> -n ip addr  # Network namespace only
$ sudo nsenter -t <container-pid> -a bash     # All namespaces

Check cgroup limits:

$ cat /sys/fs/cgroup/<cgroup-path>/memory.max
$ cat /sys/fs/cgroup/<cgroup-path>/cpu.max

Check cgroup usage:

$ cat /sys/fs/cgroup/<cgroup-path>/memory.current
$ cat /sys/fs/cgroup/<cgroup-path>/cpu.stat

See what’s throttling:

$ cat /sys/fs/cgroup/<cgroup-path>/cpu.stat
usage_usec 123456789
user_usec 100000000
system_usec 23456789
nr_periods 10000
nr_throttled 500       # <-- Throttled 500 times
throttled_usec 5000000 # <-- 5 seconds total throttle time
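To turn those counters into a signal, compute the fraction of periods that were throttled — a sustained few percent often shows up as tail latency. Using the numbers above:

```shell
# Fraction of scheduling periods in which the cgroup hit its quota
NR_PERIODS=10000
NR_THROTTLED=500
echo "throttled in $(( 100 * NR_THROTTLED / NR_PERIODS ))% of periods"
```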

Containers are not magic. They’re Linux processes with:

  • Namespaces for isolation (what the process can see)
  • Cgroups for resource control (what the process can use)

Understanding these primitives helps you:

  • Debug container issues at the source
  • Understand resource limits and why things get OOM-killed or throttled
  • Demystify container networking
  • Build custom isolation when needed

When a container misbehaves, you’re not debugging “container technology” — you’re debugging Linux processes with isolation. The same tools work: /proc, strace, nsenter, and the cgroup filesystem.