You apply a Pod manifest. Seconds later, your container is running. But what actually happened between kubectl apply and your process starting?
The answer involves six layers: kubelet, CRI, containerd, shim, runc, and finally your process. Each layer exists for a reason, and knowing them helps you debug when things go wrong.
The Stack ¶
kubectl apply
|
v
API Server ----------- stores Pod in etcd
| watch
v
kubelet -------------- node agent, manages pod lifecycle
| CRI gRPC
v
containerd ----------- container runtime, manages images
|
v
containerd-shim ------ per-container process, survives restarts
|
v
runc ----------------- OCI runtime, sets up namespaces/cgroups
| fork/exec
v
Your Container ------- just a Linux process with isolation
Let’s trace a pod creation through each layer.
Layer 1: kubelet ¶
The kubelet is the node agent. It watches the API server for pods assigned to its node and makes them reality.
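If you want to see the same node-scoped view the kubelet watches, you can filter pods by their assigned node; a quick sketch, assuming a node named node-1:
# Pods bound to a specific node (the set kubelet on that node reconciles)
$ kubectl get pods --all-namespaces --field-selector spec.nodeName=node-1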
What kubelet does: ¶
- Watches for pods scheduled to this node
- Computes the desired state (which containers should exist)
- Calls the container runtime via CRI to create/start/stop containers
- Reports pod status back to the API server
- Manages pod lifecycle (liveness probes, restarts, etc.)
Where kubelet runs: ¶
# Usually a systemd service
$ systemctl status kubelet
# Configuration
$ cat /var/lib/kubelet/config.yaml
# Logs
$ journalctl -u kubelet -f
kubelet doesn’t create containers directly ¶
Kubelet originally had Docker-specific code (the dockershim) built in. The Container Runtime Interface (CRI) was introduced around 2016-2017, and the built-in dockershim was finally removed in Kubernetes 1.24; today kubelet talks to any compliant runtime through CRI.
kubelet -> CRI (gRPC) -> containerd
-> CRI-O
-> Docker (via cri-dockerd shim)
Layer 2: CRI (Container Runtime Interface) ¶
CRI is a gRPC API that kubelet uses to communicate with container runtimes. It defines two services:
RuntimeService: Container lifecycle
CreateContainer, StartContainer, StopContainer, RemoveContainer, ListContainers, ContainerStatus, ExecSync, Exec, Attach
ImageService: Image management
PullImage, ListImages, RemoveImage
CRI in action ¶
You can talk to CRI directly using crictl:
# List containers (like docker ps)
$ crictl ps
CONTAINER IMAGE CREATED STATE NAME POD ID
a1b2c3d4e5 nginx 2 hours ago Running nginx x1y2z3
# List pods
$ crictl pods
POD ID CREATED STATE NAME NAMESPACE
x1y2z3 2 hours ago Ready nginx-7d9fc5... default
# Pull an image
$ crictl pull nginx:latest
# Get container logs
$ crictl logs a1b2c3d4e5
# Exec into container
$ crictl exec -it a1b2c3d4e5 sh
Checking CRI endpoint ¶
# See what runtime kubelet is using
$ cat /var/lib/kubelet/kubeadm-flags.env
KUBELET_KUBEADM_ARGS="--container-runtime-endpoint=unix:///run/containerd/containerd.sock"
# Or from kubelet config
$ grep containerRuntime /var/lib/kubelet/config.yaml
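crictl has to be pointed at the same socket. A typical configuration, assuming containerd's default socket path:
# /etc/crictl.yaml tells crictl which CRI endpoint to use
$ cat /etc/crictl.yaml
runtime-endpoint: unix:///run/containerd/containerd.sock
image-endpoint: unix:///run/containerd/containerd.sock
# Or pass the endpoint per invocation
$ crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps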
Layer 3: containerd ¶
containerd is the most common container runtime in Kubernetes. It’s what Docker uses under the hood (Docker = containerd + additional tooling).
What containerd does: ¶
- Manages images — pulls, stores, unpacks OCI images
- Manages containers — creates, starts, stops containers
- Manages snapshots — filesystem layers (overlayfs)
- Manages tasks — running processes within containers
- Spawns shims — one shim per container
containerd architecture ¶
+----------------------------------------------------------+
| containerd |
| +--------+ +----------+ +---------+ +-------------+ |
| | Images | |Containers| |Snapshots| | Tasks | |
| |Service | | Service | | Service | | Service | |
| +--------+ +----------+ +---------+ +-------------+ |
+----------------------------+-----------------------------+
|
+----------------+----------------+
v v v
+--------+ +--------+ +--------+
| shim | | shim | | shim |
| (ctr1) | | (ctr2) | | (ctr3) |
+--------+ +--------+ +--------+
Interacting with containerd ¶
Use ctr (low-level) or nerdctl (Docker-compatible):
# List containerd namespaces (these keep clients like Kubernetes and Docker separate; not Linux kernel namespaces)
$ ctr namespaces ls
NAME LABELS
k8s.io # Kubernetes containers
moby # Docker containers (if Docker is installed)
# List containers in k8s.io namespace
$ ctr -n k8s.io containers ls
# List images
$ ctr -n k8s.io images ls
# List running tasks (processes)
$ ctr -n k8s.io tasks ls
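containerd also works without Kubernetes on top of it. As a sketch (assuming the node can reach Docker Hub), you can pull and run a throwaway container with ctr alone, bypassing kubelet and CRI entirely:
# Pull an image and run an interactive container (uses the "default" namespace unless -n is given)
$ ctr images pull docker.io/library/alpine:latest
$ ctr run --rm -t docker.io/library/alpine:latest test-shell sh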
containerd configuration ¶
$ cat /etc/containerd/config.toml
# Key settings:
[plugins."io.containerd.grpc.v1.cri"]
# CRI plugin configuration
[plugins."io.containerd.grpc.v1.cri".containerd]
# Default runtime
default_runtime_name = "runc"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
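config.toml holds overrides that containerd merges with its built-in defaults. To see the effective configuration (and confirm which runtimes are actually registered), you can ask containerd to dump it; a sketch:
# Print the merged, effective configuration
$ containerd config dump | grep -A 3 'runtimes.runc'
# Print the built-in defaults (a good starting point for edits)
$ containerd config default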
Layer 4: containerd-shim ¶
The shim is a small process that sits between containerd and runc. There’s one shim per container.
Why shims exist: ¶
- Decoupling: Container survives containerd restart
- Stdio handling: Keeps stdin/stdout/stderr open
- Exit status: Reports container exit to containerd
- Reaping: Acts as subreaper for orphaned processes
containerd crash/restart
|
| Containers keep running!
v
+---------------+ +---------------+
| shim | | shim |
| (container1) | | (container2) |
+-------+-------+ +-------+-------+
| |
v v
[container1] [container2]
still running still running
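You can verify this decoupling on a test node (not one you care about): restart containerd and check that the same containers are still running afterwards.
# Note the running container IDs
$ crictl ps
# Restart the runtime
$ sudo systemctl restart containerd
# Same container IDs, still running, uninterrupted
$ crictl ps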
Finding shims: ¶
$ ps aux | grep containerd-shim
root 1234 containerd-shim-runc-v2 -namespace k8s.io -id abc123...
root 1235 containerd-shim-runc-v2 -namespace k8s.io -id def456...
Each shim manages one container. The container ID matches what you see in crictl ps.
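Because the shim detaches from containerd, its parent is typically PID 1 (or a systemd subreaper) rather than containerd itself; that re-parenting is part of what lets containers outlive a containerd restart. A quick check:
# Expect PPID 1, not containerd's PID
$ ps -o pid,ppid,cmd -C containerd-shim-runc-v2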
Layer 5: runc ¶
runc is the OCI (Open Container Initiative) reference runtime. It does the actual work of creating the container: setting up namespaces, cgroups, and executing the process.
What runc does: ¶
- Parses the OCI runtime spec (config.json)
- Creates namespaces (pid, net, mnt, uts, ipc, user)
- Sets up cgroups (CPU, memory, IO limits)
- Mounts filesystems (rootfs, /proc, /sys, volumes)
- Applies security (seccomp, capabilities, SELinux/AppArmor)
- Executes the container entrypoint
The OCI Runtime Spec ¶
runc reads a config.json that defines everything about the container:
{
"ociVersion": "1.0.2",
"process": {
"terminal": true,
"user": { "uid": 0, "gid": 0 },
"args": ["sh"],
"env": ["PATH=/usr/bin:/bin", "TERM=xterm"],
"cwd": "/"
},
"root": {
"path": "rootfs",
"readonly": false
},
"hostname": "container",
"mounts": [
{ "destination": "/proc", "type": "proc", "source": "proc" },
{ "destination": "/dev", "type": "tmpfs", "source": "tmpfs" }
],
"linux": {
"namespaces": [
{ "type": "pid" },
{ "type": "network" },
{ "type": "ipc" },
{ "type": "uts" },
{ "type": "mount" }
],
"resources": {
"memory": { "limit": 536870912 },
"cpu": { "quota": 50000, "period": 100000 }
}
}
}
Using runc directly ¶
You can run runc manually (useful for debugging):
# Create a bundle directory
$ mkdir -p mycontainer/rootfs
# Extract an image to rootfs
$ docker export $(docker create alpine) | tar -C mycontainer/rootfs -xf -
# Generate a spec
$ cd mycontainer
$ runc spec
# Edit config.json if needed, then run
$ runc run mycontainer
Find container’s runc state ¶
# List runc containers
$ runc list
# Get container state
$ runc state <container-id>
{
"ociVersion": "1.0.2",
"id": "abc123",
"pid": 12345,
"status": "running",
"bundle": "/run/containerd/io.containerd.runtime.v2.task/k8s.io/abc123",
"rootfs": "/run/containerd/io.containerd.runtime.v2.task/k8s.io/abc123/rootfs",
"created": "2025-01-25T10:00:00Z"
}
Layer 6: Your Container ¶
After all these layers, your container is just a Linux process. It has:
- Its own PID namespace (PID 1 inside)
- Its own network namespace (separate interfaces)
- Its own mount namespace (container rootfs as /)
- Cgroup limits (CPU, memory, etc.)
- Seccomp filters (restricted syscalls)
- Dropped capabilities (limited root powers)
# From the host, it's just a process
$ ps aux | grep <your-entrypoint>
root 12345 ... /your/entrypoint
# Its namespaces
$ ls -la /proc/12345/ns/
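The isolation is plain kernel functionality, and you can reproduce a piece of it without any container runtime. A minimal sketch using unshare to create a fresh PID namespace:
# Inside the new PID namespace the shell is PID 1 and sees almost no processes
$ sudo unshare --fork --pid --mount-proc sh -c 'echo "my pid: $$"; ps -ef'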
The OCI Image Spec ¶
We’ve covered the runtime spec. The other OCI spec is the image spec — how container images are structured.
Image layers ¶
An OCI image is a stack of filesystem layers:
+-------------------------------------+
| Layer 3: Application code | (your Dockerfile additions)
+-----------+-------------------------+
| Layer 2: Runtime dependencies | (apt-get install ...)
+-----------+-------------------------+
| Layer 1: Base image | (ubuntu:22.04)
+-------------------------------------+
Layers are content-addressed (by SHA256 hash), immutable, and shared between images.
Image manifest ¶
{
"schemaVersion": 2,
"mediaType": "application/vnd.oci.image.manifest.v1+json",
"config": {
"mediaType": "application/vnd.oci.image.config.v1+json",
"digest": "sha256:abc123...",
"size": 1234
},
"layers": [
{
"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
"digest": "sha256:layer1...",
"size": 12345678
},
{
"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
"digest": "sha256:layer2...",
"size": 23456789
}
]
}
How containerd uses images ¶
- Pull: Download manifest and layers from registry
- Unpack: Extract layers to snapshotter (overlayfs)
- Mount: Stack layers using overlayfs for container rootfs
# See image layers
$ ctr -n k8s.io images ls
$ ctr -n k8s.io content ls
# See snapshots (unpacked layers)
$ ctr -n k8s.io snapshots ls
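For a running container you can also inspect the resulting overlay mount from the host: lowerdir points at the read-only image layers, upperdir at the container's writable layer. A sketch:
# Overlay mounts backing container rootfs (lowerdir = image layers, upperdir = writable layer)
$ mount -t overlay
# Or scope it to one container via its host PID
$ grep overlay /proc/<container-pid>/mounts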
Debugging at Each Layer ¶
Layer 1: kubelet ¶
# Check kubelet logs
$ journalctl -u kubelet -f
# Common issues:
# - "failed to pull image" → registry/network issue
# - "failed to create sandbox" → containerd issue
# - "failed to start container" → runtime issue
# Check kubelet is talking to containerd
$ systemctl status containerd
Layer 2: CRI (crictl) ¶
# Check CRI is responding
$ crictl info
# List pods (should match kubectl get pods)
$ crictl pods
# List containers
$ crictl ps -a
# Get container details
$ crictl inspect <container-id>
# Check why a container failed
$ crictl logs <container-id>
# Debug pod sandbox issues
$ crictl inspectp <pod-id>
Layer 3: containerd (ctr) ¶
# Check containerd health
$ ctr -n k8s.io version
# List containers at containerd level
$ ctr -n k8s.io containers ls
# List tasks (running containers)
$ ctr -n k8s.io tasks ls
# Check container bundle
$ ctr -n k8s.io containers info <container-id>
Layer 4-5: shim and runc ¶
# Find shim process
$ ps aux | grep "containerd-shim.*<container-id>"
# Check runc state
$ runc --root /run/containerd/runc/k8s.io state <container-id>
# List all runc containers
$ runc --root /run/containerd/runc/k8s.io list
Layer 6: The Container Process ¶
# Find container's PID on host
$ crictl inspect <container-id> | jq '.info.pid'
12345
# Enter container's namespaces
$ nsenter -t 12345 -a bash
# Or just one namespace
$ nsenter -t 12345 -n ip addr # Network namespace
$ nsenter -t 12345 -m ls / # Mount namespace
$ nsenter -t 12345 -p -r ps aux # PID namespace
# Check cgroup limits
$ cat /proc/12345/cgroup
0::/kubepods/burstable/pod-xyz/container-abc
$ cat /sys/fs/cgroup/kubepods/burstable/pod-xyz/container-abc/memory.max
# Trace syscalls
$ strace -p 12345
Common Debugging Scenarios ¶
Container won’t start ¶
# 1. Check kubelet logs
$ journalctl -u kubelet | grep <pod-name>
# 2. Check container status
$ crictl ps -a | grep <pod-name>
$ crictl logs <container-id>
# 3. Check events
$ kubectl describe pod <pod-name>
# 4. Check runc directly
$ runc --root /run/containerd/runc/k8s.io state <container-id>
Container starts but exits immediately ¶
# Check exit code
$ crictl inspect <container-id> | jq '.status.exitCode'
# Check logs
$ crictl logs <container-id>
# Common causes:
# - Exit 0: Command completed (wrong entrypoint)
# - Exit 1: Application error
# - Exit 137: OOM killed (128 + 9 = SIGKILL)
# - Exit 139: Segfault (128 + 11 = SIGSEGV)
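The same exit code is recorded in the Pod status, so you can read it without node access; a sketch, assuming the container that died is the pod's first one:
# Last termination state, straight from the API server
$ kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'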
Container is slow/throttled ¶
# Find container's cgroup
$ crictl inspect <container-id> | jq '.info.runtimeSpec.linux.cgroupsPath'
# Check CPU throttling
$ cat /sys/fs/cgroup/<cgroup-path>/cpu.stat
nr_throttled 5000 # Throttled 5000 times!
throttled_usec 60000000 # 60 seconds total throttle time
# Check memory pressure
$ cat /sys/fs/cgroup/<cgroup-path>/memory.current
$ cat /sys/fs/cgroup/<cgroup-path>/memory.max
Image pull failures ¶
# Check image pull with crictl
$ crictl pull <image>
# Check containerd logs
$ journalctl -u containerd | grep <image>
# Common issues:
# - Registry auth: check /var/lib/kubelet/config.json
# - Network: can node reach registry?
# - Disk space: df -h /var/lib/containerd
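To rule out network problems you can hit the registry's API endpoint directly from the node; any HTTP response, even a 401, proves DNS, routing, and TLS all work. A sketch using Docker Hub as the example registry:
# A 401 Unauthorized here is fine - the registry is reachable
$ curl -sI https://registry-1.docker.io/v2/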
Putting It Together ¶
The full sequence when you kubectl apply a pod:
- API Server stores Pod in etcd
- Scheduler assigns Pod to a node
- kubelet on that node sees the Pod (via watch)
- kubelet calls containerd via CRI: RunPodSandbox - containerd creates the pause container (the network namespace holder)
- kubelet calls containerd: CreateContainer then StartContainer for each container - containerd prepares the rootfs (overlayfs from image layers)
- containerd spawns a shim for each container
- shim calls runc with the OCI spec
- runc creates namespaces, cgroups, mounts, and security settings
- runc execs your entrypoint
- runc exits; the shim keeps monitoring the container
- kubelet reports status to the API server
When something goes wrong, trace backwards through these layers until you find where it broke.
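A condensed version of that trace, from a Kubernetes-visible name down to the host process; a sketch assuming jq is installed and a hypothetical container named my-app:
# Container ID at the CRI layer
$ CID=$(crictl ps --name my-app -q | head -1)
# Host PID at the runc/process layer
$ PID=$(crictl inspect "$CID" | jq -r '.info.pid')
# Inspect the process and its namespaces directly
$ ps -fp "$PID"
$ ls -la /proc/$PID/ns/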
Summary ¶
Kubernetes doesn’t run containers — it orchestrates a stack of tools that do:
| Layer | Tool | Purpose |
|---|---|---|
| 1 | kubelet | Node agent, pod lifecycle |
| 2 | CRI | gRPC API to runtime |
| 3 | containerd | Image and container management |
| 4 | shim | Per-container daemon |
| 5 | runc | OCI runtime, creates namespaces/cgroups |
| 6 | Your process | Just a Linux process with isolation |
Each layer has its own tools:
| Layer | Debug Tool |
|---|---|
| kubelet | journalctl -u kubelet |
| CRI | crictl |
| containerd | ctr |
| runc | runc state/list |
| Container | nsenter, /proc, cgroup fs |
When debugging, start at the top (kubectl describe, kubelet logs) and work down. By the time you’re running runc state, you’re debugging Linux primitives, not Kubernetes.