You apply a Pod manifest. Seconds later, your container is running. But what actually happened between kubectl apply and your process starting?
The answer involves six layers: kubelet, CRI, containerd, shim, runc, and finally your process. Each layer exists for a reason, and knowing them helps you debug when things go wrong.
The Stack ¶
kubectl apply
|
v
API Server ----------- stores Pod in etcd
| watch
v
kubelet -------------- node agent, manages pod lifecycle
| CRI gRPC
v
containerd ----------- container runtime, manages images
|
v
containerd-shim ------ per-container process, survives restarts
|
v
runc ----------------- OCI runtime, sets up namespaces/cgroups
| fork/exec
v
Your Container ------- just a Linux process with isolation
Let’s trace a pod creation through each layer.
Layer 1: kubelet ¶
The kubelet is the node agent. It watches the API server for pods assigned to its node and makes them reality.
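If you want to see the same node-scoped view the kubelet watches, you can filter pods by their assigned node; a quick sketch, assuming a node named node-1:
# Pods bound to a specific node (the set kubelet on that node reconciles)
$ kubectl get pods --all-namespaces --field-selector spec.nodeName=node-1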
What kubelet does: ¶
- Watches for pods scheduled to this node
- Computes the desired state (which containers should exist)
- Calls the container runtime via CRI to create/start/stop containers
- Reports pod status back to the API server
- Manages pod lifecycle (liveness probes, restarts, etc.)
Where kubelet runs: ¶
# Usually a systemd service
$ systemctl status kubelet
# Configuration
$ cat /var/lib/kubelet/config.yaml
# Logs
$ journalctl -u kubelet -f
kubelet doesn’t create containers directly ¶
Kubelet originally had Docker-specific code (the dockershim) built in. The Container Runtime Interface (CRI) was introduced around 2016-2017, and the built-in dockershim was finally removed in Kubernetes 1.24; today kubelet talks to any compliant runtime through CRI.
kubelet -> CRI (gRPC) -> containerd
-> CRI-O
-> Docker (via cri-dockerd shim)
Layer 2: CRI (Container Runtime Interface) ¶
CRI is a gRPC API that kubelet uses to communicate with container runtimes. It defines two services:
RuntimeService: Container lifecycle
CreateContainer, StartContainer, StopContainer, RemoveContainer, ListContainers, ContainerStatus, ExecSync, Exec, Attach
ImageService: Image management
PullImage, ListImages, RemoveImage
CRI in action ¶
You can talk to CRI directly using crictl:
# List containers (like docker ps)
$ crictl ps
CONTAINER IMAGE CREATED STATE NAME POD ID
a1b2c3d4e5 nginx 2 hours ago Running nginx x1y2z3
# List pods
$ crictl pods
POD ID CREATED STATE NAME NAMESPACE
x1y2z3 2 hours ago Ready nginx-7d9fc5... default
# Pull an image
$ crictl pull nginx:latest
# Get container logs
$ crictl logs a1b2c3d4e5
# Exec into container
$ crictl exec -it a1b2c3d4e5 sh
Checking CRI endpoint ¶
# See what runtime kubelet is using
$ cat /var/lib/kubelet/kubeadm-flags.env
KUBELET_KUBEADM_ARGS="--container-runtime-endpoint=unix:///run/containerd/containerd.sock"
# Or from kubelet config
$ grep containerRuntime /var/lib/kubelet/config.yaml
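crictl has to be pointed at the same socket. A typical configuration, assuming containerd's default socket path:
# /etc/crictl.yaml tells crictl which CRI endpoint to use
$ cat /etc/crictl.yaml
runtime-endpoint: unix:///run/containerd/containerd.sock
image-endpoint: unix:///run/containerd/containerd.sock
# Or pass the endpoint per invocation
$ crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps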
Layer 3: containerd ¶
containerd is the most common container runtime in Kubernetes. It’s what Docker uses under the hood (Docker = containerd + additional tooling).
What containerd does: ¶
- Manages images — pulls, stores, unpacks OCI images
- Manages containers — creates, starts, stops containers
- Manages snapshots — filesystem layers (overlayfs)
- Manages tasks — running processes within containers
- Spawns shims — one shim per container
containerd architecture ¶
+----------------------------------------------------------+
| containerd |
| +--------+ +----------+ +---------+ +-------------+ |
| | Images | |Containers| |Snapshots| | Tasks | |
| |Service | | Service | | Service | | Service | |
| +--------+ +----------+ +---------+ +-------------+ |
+----------------------------+-----------------------------+
|
+----------------+----------------+
v v v
+--------+ +--------+ +--------+
| shim | | shim | | shim |
| (ctr1) | | (ctr2) | | (ctr3) |
+--------+ +--------+ +--------+
Interacting with containerd ¶
Use ctr (low-level) or nerdctl (Docker-compatible):
# List containerd namespaces (these keep clients like Kubernetes and Docker separate; not Linux kernel namespaces)
$ ctr namespaces ls
NAME LABELS
k8s.io # Kubernetes containers
moby # Docker containers (if Docker is installed)
# List containers in k8s.io namespace
$ ctr -n k8s.io containers ls
# List images
$ ctr -n k8s.io images ls
# List running tasks (processes)
$ ctr -n k8s.io tasks ls
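containerd also works without Kubernetes on top of it. As a sketch (assuming the node can reach Docker Hub), you can pull and run a throwaway container with ctr alone, bypassing kubelet and CRI entirely:
# Pull an image and run an interactive container (uses the "default" namespace unless -n is given)
$ ctr images pull docker.io/library/alpine:latest
$ ctr run --rm -t docker.io/library/alpine:latest test-shell sh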
containerd configuration ¶
$ cat /etc/containerd/config.toml
# Key settings:
[plugins."io.containerd.grpc.v1.cri"]
# CRI plugin configuration
[plugins."io.containerd.grpc.v1.cri".containerd]
# Default runtime
default_runtime_name = "runc"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
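config.toml holds overrides that containerd merges with its built-in defaults. To see the effective configuration (and confirm which runtimes are actually registered), you can ask containerd to dump it; a sketch:
# Print the merged, effective configuration
$ containerd config dump | grep -A 3 'runtimes.runc'
# Print the built-in defaults (a good starting point for edits)
$ containerd config default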
Layer 4: containerd-shim ¶
The shim is a small process that sits between containerd and runc. There’s one shim per container.
Why shims exist: ¶
- Decoupling: Container survives containerd restart
- Stdio handling: Keeps stdin/stdout/stderr open
- Exit status: Reports container exit to containerd
- Reaping: Acts as subreaper for orphaned processes
containerd crash/restart
|
| Containers keep running!
v
+---------------+ +---------------+
| shim | | shim |
| (container1) | | (container2) |
+-------+-------+ +-------+-------+
| |
v v
[container1] [container2]
still running still running
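You can verify this decoupling on a test node (not one you care about): restart containerd and check that the same containers are still running afterwards.
# Note the running container IDs
$ crictl ps
# Restart the runtime
$ sudo systemctl restart containerd
# Same container IDs, still running, uninterrupted
$ crictl ps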
Finding shims: ¶
$ ps aux | grep containerd-shim
root 1234 containerd-shim-runc-v2 -namespace k8s.io -id abc123...
root 1235 containerd-shim-runc-v2 -namespace k8s.io -id def456...
Each shim manages one container. The container ID matches what you see in crictl ps.
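Because the shim detaches from containerd, its parent is typically PID 1 (or a systemd subreaper) rather than containerd itself; that re-parenting is part of what lets containers outlive a containerd restart. A quick check:
# Expect PPID 1, not containerd's PID
$ ps -o pid,ppid,cmd -C containerd-shim-runc-v2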
Layer 5: runc ¶
runc is the OCI (Open Container Initiative) reference runtime. It does the actual work of creating the container: setting up namespaces, cgroups, and executing the process.
What runc does: ¶
- Parses the OCI runtime spec (config.json)
- Creates namespaces (pid, net, mnt, uts, ipc, user)
- Sets up cgroups (CPU, memory, IO limits)
- Mounts filesystems (rootfs, /proc, /sys, volumes)
- Applies security (seccomp, capabilities, SELinux/AppArmor)
- Executes the container entrypoint
The OCI Runtime Spec ¶
runc reads a config.json that defines everything about the container:
{
"ociVersion": "1.0.2",
"process": {
"terminal": true,
"user": { "uid": 0, "gid": 0 },
"args": ["sh"],
"env": ["PATH=/usr/bin:/bin", "TERM=xterm"],
"cwd": "/"
},
"root": {
"path": "rootfs",
"readonly": false
},
"hostname": "container",
"mounts": [
{ "destination": "/proc", "type": "proc", "source": "proc" },
{ "destination": "/dev", "type": "tmpfs", "source": "tmpfs" }
],
"linux": {
"namespaces": [
{ "type": "pid" },
{ "type": "network" },
{ "type": "ipc" },
{ "type": "uts" },
{ "type": "mount" }
],
"resources": {
"memory": { "limit": 536870912 },
"cpu": { "quota": 50000, "period": 100000 }
}
}
}
Using runc directly ¶
You can run runc manually (useful for debugging):
# Create a bundle directory
$ mkdir -p mycontainer/rootfs
# Extract an image to rootfs
$ docker export $(docker create alpine) | tar -C mycontainer/rootfs -xf -
# Generate a spec
$ cd mycontainer
$ runc spec
# Edit config.json if needed, then run
$ runc run mycontainer
Find container’s runc state ¶
# List runc containers
$ runc list
# Get container state
$ runc state <container-id>
{
"ociVersion": "1.0.2",
"id": "abc123",
"pid": 12345,
"status": "running",
"bundle": "/run/containerd/io.containerd.runtime.v2.task/k8s.io/abc123",
"rootfs": "/run/containerd/io.containerd.runtime.v2.task/k8s.io/abc123/rootfs",
"created": "2025-01-25T10:00:00Z"
}
Layer 6: Your Container ¶
After all these layers, your container is just a Linux process. It has:
- Its own PID namespace (PID 1 inside)
- Its own network namespace (separate interfaces)
- Its own mount namespace (container rootfs as /)
- Cgroup limits (CPU, memory, etc.)
- Seccomp filters (restricted syscalls)
- Dropped capabilities (limited root powers)
# From the host, it's just a process
$ ps aux | grep <your-entrypoint>
root 12345 ... /your/entrypoint
# Its namespaces
$ ls -la /proc/12345/ns/
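The isolation is plain kernel functionality, and you can reproduce a piece of it without any container runtime. A minimal sketch using unshare to create a fresh PID namespace:
# Inside the new PID namespace the shell is PID 1 and sees almost no processes
$ sudo unshare --fork --pid --mount-proc sh -c 'echo "my pid: $$"; ps -ef'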
The OCI Image Spec ¶
We’ve covered the runtime spec. The other OCI spec is the image spec — how container images are structured.
Image layers ¶
An OCI image is a stack of filesystem layers:
+-------------------------------------+
| Layer 3: Application code | (your Dockerfile additions)
+-----------+-------------------------+
| Layer 2: Runtime dependencies | (apt-get install ...)
+-----------+-------------------------+
| Layer 1: Base image | (ubuntu:22.04)
+-------------------------------------+
Layers are content-addressed (by SHA256 hash), immutable, and shared between images.
Image manifest ¶
{
"schemaVersion": 2,
"mediaType": "application/vnd.oci.image.manifest.v1+json",
"config": {
"mediaType": "application/vnd.oci.image.config.v1+json",
"digest": "sha256:abc123...",
"size": 1234
},
"layers": [
{
"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
"digest": "sha256:layer1...",
"size": 12345678
},
{
"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
"digest": "sha256:layer2...",
"size": 23456789
}
]
}
How containerd uses images ¶
- Pull: Download manifest and layers from registry
- Unpack: Extract layers to snapshotter (overlayfs)
- Mount: Stack layers using overlayfs for container rootfs
# See image layers
$ ctr -n k8s.io images ls
$ ctr -n k8s.io content ls
# See snapshots (unpacked layers)
$ ctr -n k8s.io snapshots ls
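For a running container you can also inspect the resulting overlay mount from the host: lowerdir points at the read-only image layers, upperdir at the container's writable layer. A sketch:
# Overlay mounts backing container rootfs (lowerdir = image layers, upperdir = writable layer)
$ mount -t overlay
# Or scope it to one container via its host PID
$ grep overlay /proc/<container-pid>/mounts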
Debugging at Each Layer ¶
Layer 1: kubelet ¶
# Check kubelet logs
$ journalctl -u kubelet -f
# Common issues:
# - "failed to pull image" → registry/network issue
# - "failed to create sandbox" → containerd issue
# - "failed to start container" → runtime issue
# Check kubelet is talking to containerd
$ systemctl status containerd
Layer 2: CRI (crictl) ¶
# Check CRI is responding
$ crictl info
# List pods (should match kubectl get pods)
$ crictl pods
# List containers
$ crictl ps -a
# Get container details
$ crictl inspect <container-id>
# Check why a container failed
$ crictl logs <container-id>
# Debug pod sandbox issues
$ crictl inspectp <pod-id>
Layer 3: containerd (ctr) ¶
# Check containerd health
$ ctr -n k8s.io version
# List containers at containerd level
$ ctr -n k8s.io containers ls
# List tasks (running containers)
$ ctr -n k8s.io tasks ls
# Check container bundle
$ ctr -n k8s.io containers info <container-id>
Layer 4-5: shim and runc ¶
# Find shim process
$ ps aux | grep "containerd-shim.*<container-id>"
# Check runc state
$ runc --root /run/containerd/runc/k8s.io state <container-id>
# List all runc containers
$ runc --root /run/containerd/runc/k8s.io list
Layer 6: The Container Process ¶
# Find container's PID on host
$ crictl inspect <container-id> | jq '.info.pid'
12345
# Enter container's namespaces
$ nsenter -t 12345 -a bash
# Or just one namespace
$ nsenter -t 12345 -n ip addr # Network namespace
$ nsenter -t 12345 -m ls / # Mount namespace
$ nsenter -t 12345 -p -r ps aux # PID namespace
# Check cgroup limits
$ cat /proc/12345/cgroup
0::/kubepods/burstable/pod-xyz/container-abc
$ cat /sys/fs/cgroup/kubepods/burstable/pod-xyz/container-abc/memory.max
# Trace syscalls
$ strace -p 12345
Common Debugging Scenarios ¶
Container won’t start ¶
# 1. Check kubelet logs
$ journalctl -u kubelet | grep <pod-name>
# 2. Check container status
$ crictl ps -a | grep <pod-name>
$ crictl logs <container-id>
# 3. Check events
$ kubectl describe pod <pod-name>
# 4. Check runc directly
$ runc --root /run/containerd/runc/k8s.io state <container-id>
Container starts but exits immediately ¶
# Check exit code
$ crictl inspect <container-id> | jq '.status.exitCode'
# Check logs
$ crictl logs <container-id>
# Common causes:
# - Exit 0: Command completed (wrong entrypoint)
# - Exit 1: Application error
# - Exit 137: OOM killed (128 + 9 = SIGKILL)
# - Exit 139: Segfault (128 + 11 = SIGSEGV)
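The same exit code is recorded in the Pod status, so you can read it without node access; a sketch, assuming the container that died is the pod's first one:
# Last termination state, straight from the API server
$ kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'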
Container is slow/throttled ¶
# Find container's cgroup
$ crictl inspect <container-id> | jq '.info.runtimeSpec.linux.cgroupsPath'
# Check CPU throttling
$ cat /sys/fs/cgroup/<cgroup-path>/cpu.stat
nr_throttled 5000 # Throttled 5000 times!
throttled_usec 60000000 # 60 seconds total throttle time
# Check memory pressure
$ cat /sys/fs/cgroup/<cgroup-path>/memory.current
$ cat /sys/fs/cgroup/<cgroup-path>/memory.max
Image pull failures ¶
# Check image pull with crictl
$ crictl pull <image>
# Check containerd logs
$ journalctl -u containerd | grep <image>
# Common issues:
# - Registry auth: check /var/lib/kubelet/config.json
# - Network: can node reach registry?
# - Disk space: df -h /var/lib/containerd
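To rule out network problems you can hit the registry's API endpoint directly from the node; any HTTP response, even a 401, proves DNS, routing, and TLS all work. A sketch using Docker Hub as the example registry:
# A 401 Unauthorized here is fine - the registry is reachable
$ curl -sI https://registry-1.docker.io/v2/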
Putting It Together ¶
The full sequence when you kubectl apply a pod:
- API Server stores Pod in etcd
- Scheduler assigns Pod to a node
- kubelet on that node sees the Pod (via watch)
- kubelet calls containerd via CRI: RunPodSandbox - containerd creates the pause container (the network namespace holder)
- kubelet calls containerd: CreateContainer then StartContainer for each container - containerd prepares the rootfs (overlayfs from image layers)
- containerd spawns a shim for each container
- shim calls runc with the OCI spec
- runc creates namespaces, cgroups, mounts, and security settings
- runc execs your entrypoint
- runc exits; the shim keeps monitoring the container
- kubelet reports status to the API server
When something goes wrong, trace backwards through these layers until you find where it broke.
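A condensed version of that trace, from a Kubernetes-visible name down to the host process; a sketch assuming jq is installed and a hypothetical container named my-app:
# Container ID at the CRI layer
$ CID=$(crictl ps --name my-app -q | head -1)
# Host PID at the runc/process layer
$ PID=$(crictl inspect "$CID" | jq -r '.info.pid')
# Inspect the process and its namespaces directly
$ ps -fp "$PID"
$ ls -la /proc/$PID/ns/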
Summary ¶
Kubernetes doesn’t run containers — it orchestrates a stack of tools that do:
| Layer | Tool | Purpose |
|---|---|---|
| 1 | kubelet | Node agent, pod lifecycle |
| 2 | CRI | gRPC API to runtime |
| 3 | containerd | Image and container management |
| 4 | shim | Per-container daemon |
| 5 | runc | OCI runtime, creates namespaces/cgroups |
| 6 | Your process | Just a Linux process with isolation |
Each layer has its own tools:
| Layer | Debug Tool |
|---|---|
| kubelet | journalctl -u kubelet |
| CRI | crictl |
| containerd | ctr |
| runc | runc state/list |
| Container | nsenter, /proc, cgroup fs |
When debugging, start at the top (kubectl describe, kubelet logs) and work down. By the time you’re running runc state, you’re debugging Linux primitives, not Kubernetes.