From kubelet to Process: How Kubernetes Actually Runs Your Container


You apply a Pod manifest. Seconds later, your container is running. But what actually happened between kubectl apply and your process starting?

The answer involves six layers: kubelet, CRI, containerd, shim, runc, and finally your process. Each layer exists for a reason, and knowing them helps you debug when things go wrong.

kubectl apply
     |
     v
API Server ----------- stores Pod in etcd
     | watch
     v
kubelet -------------- node agent, manages pod lifecycle
     | CRI gRPC
     v
containerd ----------- container runtime, manages images
     |
     v
containerd-shim ------ per-container process, survives restarts
     |
     v
runc ----------------- OCI runtime, sets up namespaces/cgroups
     | fork/exec
     v
Your Container ------- just a Linux process with isolation

Let’s trace a pod creation through each layer.

The kubelet is the node agent. It watches the API server for pods assigned to its node and makes them reality.

  1. Watches for pods scheduled to this node
  2. Computes the desired state (which containers should exist)
  3. Calls the container runtime via CRI to create/start/stop containers
  4. Reports pod status back to the API server
  5. Manages pod lifecycle (liveness probes, restarts, etc.)

You can poke at the kubelet directly on the node:

# Usually a systemd service
$ systemctl status kubelet

# Configuration
$ cat /var/lib/kubelet/config.yaml

# Logs
$ journalctl -u kubelet -f
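
The kubelet also exposes a local health endpoint (by default on 127.0.0.1:10248; some distributions change or disable it):

# Quick liveness check of the kubelet itself
$ curl -s http://localhost:10248/healthz
ok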

Early versions of the kubelet had Docker-specific code compiled in (the dockershim, only removed in Kubernetes 1.24). Today it talks to any compliant runtime through the Container Runtime Interface (CRI).

kubelet -> CRI (gRPC) -> containerd
                      -> CRI-O
                      -> Docker (via cri-dockerd shim)
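
From the cluster side, each node reports which runtime it talks to (node names and versions below are just examples):

$ kubectl get nodes -o wide
NAME     STATUS   ...   CONTAINER-RUNTIME
node-1   Ready    ...   containerd://1.7.2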

CRI is a gRPC API that kubelet uses to communicate with container runtimes. It defines two services:

RuntimeService: Container lifecycle

  • CreateContainer, StartContainer, StopContainer, RemoveContainer
  • ListContainers, ContainerStatus
  • ExecSync, Exec, Attach

ImageService: Image management

  • PullImage, ListImages, RemoveImage

You can talk to CRI directly using crictl:

# List containers (like docker ps)
$ crictl ps
CONTAINER    IMAGE     CREATED       STATE    NAME         POD ID
a1b2c3d4e5   nginx     2 hours ago   Running  nginx        x1y2z3

# List pods
$ crictl pods
POD ID       CREATED       STATE   NAME                NAMESPACE
x1y2z3       2 hours ago   Ready   nginx-7d9fc5...     default

# Pull an image
$ crictl pull nginx:latest

# Get container logs
$ crictl logs a1b2c3d4e5

# Exec into container
$ crictl exec -it a1b2c3d4e5 sh

# See what runtime kubelet is using
$ cat /var/lib/kubelet/kubeadm-flags.env
KUBELET_KUBEADM_ARGS="--container-runtime-endpoint=unix:///run/containerd/containerd.sock"

# Or from kubelet config
$ grep containerRuntime /var/lib/kubelet/config.yaml
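
crictl needs to know that endpoint too; you can pass --runtime-endpoint on every invocation, or set it once in /etc/crictl.yaml (the socket below is the common containerd default):

# /etc/crictl.yaml
runtime-endpoint: unix:///run/containerd/containerd.sock
image-endpoint: unix:///run/containerd/containerd.sock
timeout: 10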

containerd is the most common container runtime in Kubernetes. It’s what Docker uses under the hood (Docker = containerd + additional tooling).

  1. Manages images — pulls, stores, unpacks OCI images
  2. Manages containers — creates, starts, stops containers
  3. Manages snapshots — filesystem layers (overlayfs)
  4. Manages tasks — running processes within containers
  5. Spawns shims — one shim per container

+----------------------------------------------------------+
|                       containerd                         |
|  +--------+  +----------+  +---------+  +-------------+  |
|  | Images |  |Containers|  |Snapshots|  |   Tasks     |  |
|  |Service |  | Service  |  | Service |  |  Service    |  |
|  +--------+  +----------+  +---------+  +-------------+  |
+----------------------------+-----------------------------+
                             |
            +----------------+----------------+
            v                v                v
       +--------+       +--------+       +--------+
       |  shim  |       |  shim  |       |  shim  |
       | (ctr1) |       | (ctr2) |       | (ctr3) |
       +--------+       +--------+       +--------+

Use ctr (low-level) or nerdctl (Docker-compatible):

# List namespaces (containerd uses namespaces for isolation)
$ ctr namespaces ls
NAME   LABELS
k8s.io        # Kubernetes containers
moby          # Docker containers (if Docker is installed)

# List containers in k8s.io namespace
$ ctr -n k8s.io containers ls

# List images
$ ctr -n k8s.io images ls

# List running tasks (processes)
$ ctr -n k8s.io tasks ls

containerd's configuration lives in /etc/containerd/config.toml:

$ cat /etc/containerd/config.toml

# Key settings:
[plugins."io.containerd.grpc.v1.cri"]
  # CRI plugin configuration
  
[plugins."io.containerd.grpc.v1.cri".containerd]
  # Default runtime
  default_runtime_name = "runc"
  
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"
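
That file only holds your overrides. To see the configuration containerd actually runs with, and to confirm the CRI plugin loaded cleanly, you can ask containerd itself (the plugin listing format varies a little between versions):

# Full merged configuration (defaults + /etc/containerd/config.toml)
$ containerd config dump | less

# Confirm the CRI plugin is present and healthy
$ ctr plugins ls | grep cri
io.containerd.grpc.v1    cri    linux/amd64    ok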

The shim is a small process that sits between containerd and runc; there is one shim per container. Why add an extra process?

  1. Decoupling: Container survives containerd restart
  2. Stdio handling: Keeps stdin/stdout/stderr open
  3. Exit status: Reports container exit to containerd
  4. Reaping: Acts as subreaper for orphaned processes

containerd crash/restart
        |
        |  Containers keep running!
        v
+---------------+     +---------------+
|     shim      |     |     shim      |
|  (container1) |     |  (container2) |
+-------+-------+     +-------+-------+
        |                     |
        v                     v
   [container1]          [container2]
   still running         still running

You can see the shims on a node:

$ ps aux | grep containerd-shim
root  1234  containerd-shim-runc-v2 -namespace k8s.io -id abc123...
root  1235  containerd-shim-runc-v2 -namespace k8s.io -id def456...

Each shim manages one container. The container ID matches what you see in crictl ps.
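
You can see the decoupling for yourself on a disposable test node: note a container's host PID, restart containerd, and confirm the process survived.

# Host PID of a running container
$ crictl inspect <container-id> | jq '.info.pid'
12345

# Restart containerd; shims and containers keep running
$ systemctl restart containerd

# Still alive, still parented to its shim
$ ps -o pid,ppid,cmd -p 12345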

runc is the OCI (Open Container Initiative) reference runtime. It does the actual work of creating the container: setting up namespaces, cgroups, and executing the process.

  1. Parses the OCI runtime spec (config.json)
  2. Creates namespaces (pid, net, mnt, uts, ipc, user)
  3. Sets up cgroups (CPU, memory, IO limits)
  4. Mounts filesystems (rootfs, /proc, /sys, volumes)
  5. Applies security (seccomp, capabilities, SELinux/AppArmor)
  6. Executes the container entrypoint
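
On a Kubernetes node, the spec and rootfs that runc receives (the OCI "bundle") live under containerd's state directory; the exact path depends on your containerd version and configuration, but it typically looks like this:

# The bundle containerd prepared for a running container
$ ls /run/containerd/io.containerd.runtime.v2.task/k8s.io/<container-id>/
config.json  rootfs  ...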

runc reads a config.json that defines everything about the container:

{
  "ociVersion": "1.0.2",
  "process": {
    "terminal": true,
    "user": { "uid": 0, "gid": 0 },
    "args": ["sh"],
    "env": ["PATH=/usr/bin:/bin", "TERM=xterm"],
    "cwd": "/"
  },
  "root": {
    "path": "rootfs",
    "readonly": false
  },
  "hostname": "container",
  "mounts": [
    { "destination": "/proc", "type": "proc", "source": "proc" },
    { "destination": "/dev", "type": "tmpfs", "source": "tmpfs" }
  ],
  "linux": {
    "namespaces": [
      { "type": "pid" },
      { "type": "network" },
      { "type": "ipc" },
      { "type": "uts" },
      { "type": "mount" }
    ],
    "resources": {
      "memory": { "limit": 536870912 },
      "cpu": { "quota": 50000, "period": 100000 }
    }
  }
}

In the spec above, the memory limit of 536870912 bytes is 512 MiB, and a CPU quota of 50000 µs per 100000 µs period caps the container at half a core. You can run runc manually (useful for debugging):

# Create a bundle directory
$ mkdir -p mycontainer/rootfs

# Extract an image to rootfs
$ docker export $(docker create alpine) | tar -C mycontainer/rootfs -xf -

# Generate a spec
$ cd mycontainer
$ runc spec

# Edit config.json if needed, then run
$ runc run mycontainer

# List runc containers
$ runc list

# Get container state
$ runc state <container-id>
{
  "ociVersion": "1.0.2",
  "id": "abc123",
  "pid": 12345,
  "status": "running",
  "bundle": "/run/containerd/io.containerd.runtime.v2.task/k8s.io/abc123",
  "rootfs": "/run/containerd/io.containerd.runtime.v2.task/k8s.io/abc123/rootfs",
  "created": "2025-01-25T10:00:00Z"
}

After all these layers, your container is just a Linux process. It has:

  • Its own PID namespace (PID 1 inside)
  • Its own network namespace (separate interfaces)
  • Its own mount namespace (container rootfs as /)
  • Cgroup limits (CPU, memory, etc.)
  • Seccomp filters (restricted syscalls)
  • Dropped capabilities (limited root powers)

# From the host, it's just a process
$ ps aux | grep <your-entrypoint>
root 12345 ... /your/entrypoint

# Its namespaces
$ ls -la /proc/12345/ns/
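
To get a feel for how thin the isolation layer is, you can hand-roll a tiny piece of it with unshare. This is a crude sketch of what runc does, minus cgroups, pivot_root, seccomp, and everything else:

# New PID and mount namespaces, with a fresh /proc so ps sees only this namespace
$ sudo unshare --pid --fork --mount-proc sh -c 'ps aux'
USER       PID  ...  COMMAND
root         1  ...  sh -c ps aux
root         2  ...  ps aux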

We’ve covered the runtime spec. The other OCI spec is the image spec — how container images are structured.

An OCI image is a stack of filesystem layers:

+-------------------------------------+
|  Layer 3: Application code          |  (your Dockerfile additions)
+-----------+-------------------------+
|  Layer 2: Runtime dependencies      |  (apt-get install ...)
+-----------+-------------------------+
|  Layer 1: Base image                |  (ubuntu:22.04)
+-------------------------------------+

Layers are content-addressed (by SHA256 hash), immutable, and shared between images. The image manifest ties them together:

{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:abc123...",
    "size": 1234
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:layer1...",
      "size": 12345678
    },
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:layer2...",
      "size": 23456789
    }
  ]
}
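
If an image is already on the node, you can read its manifest straight out of containerd's content store. The digest comes from ctr images ls, and the media types you see depend on how the image was built (Docker vs. OCI):

# Find the image's top-level digest, then dump the manifest blob
$ ctr -n k8s.io images ls | grep nginx
$ ctr -n k8s.io content get sha256:<digest> | jq .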

containerd turns those layers into a container rootfs in three steps:

  1. Pull: Download manifest and layers from registry
  2. Unpack: Extract layers to snapshotter (overlayfs)
  3. Mount: Stack layers using overlayfs for container rootfs

# See image layers
$ ctr -n k8s.io images ls
$ ctr -n k8s.io content ls

# See snapshots (unpacked layers)
$ ctr -n k8s.io snapshots ls
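
The assembled rootfs of a running container shows up as an overlay mount on the host. The mount point below is where containerd typically places it; exact paths depend on your containerd version and snapshotter:

# Overlay mounts backing running containers
$ findmnt -t overlay
TARGET                                                            SOURCE   FSTYPE   OPTIONS
/run/containerd/io.containerd.runtime.v2.task/k8s.io/<id>/rootfs  overlay  overlay  rw,lowerdir=...,upperdir=...,workdir=...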

When a pod won't start, debug layer by layer, starting at the kubelet:

# Check kubelet logs
$ journalctl -u kubelet -f

# Common issues:
# - "failed to pull image" → registry/network issue
# - "failed to create sandbox" → containerd issue
# - "failed to start container" → runtime issue

# Check kubelet is talking to containerd
$ systemctl status containerd

One layer down, check the CRI with crictl:

# Check CRI is responding
$ crictl info

# List pods (should match kubectl get pods)
$ crictl pods

# List containers
$ crictl ps -a

# Get container details
$ crictl inspect <container-id>

# Check why a container failed
$ crictl logs <container-id>

# Debug pod sandbox issues
$ crictl inspectp <pod-id>

Below the CRI, talk to containerd directly with ctr:

# Check containerd health
$ ctr -n k8s.io version

# List containers at containerd level
$ ctr -n k8s.io containers ls

# List tasks (running containers)
$ ctr -n k8s.io tasks ls

# Check container bundle
$ ctr -n k8s.io containers info <container-id>

Further down, check the shim and runc:

# Find shim process
$ ps aux | grep "containerd-shim.*<container-id>"

# Check runc state
$ runc --root /run/containerd/runc/k8s.io state <container-id>

# List all runc containers
$ runc --root /run/containerd/runc/k8s.io list

Finally, inspect the process itself:

# Find container's PID on host
$ crictl inspect <container-id> | jq '.info.pid'
12345

# Enter container's namespaces
$ nsenter -t 12345 -a bash

# Or just one namespace
$ nsenter -t 12345 -n ip addr        # Network namespace
$ nsenter -t 12345 -m ls /           # Mount namespace
$ nsenter -t 12345 -p -r ps aux      # PID namespace

# Check cgroup limits
$ cat /proc/12345/cgroup
0::/kubepods/burstable/pod-xyz/container-abc

$ cat /sys/fs/cgroup/kubepods/burstable/pod-xyz/container-abc/memory.max

# Trace syscalls
$ strace -p 12345

Putting it together, a typical workflow for a pod that won't start:

# 1. Check kubelet logs
$ journalctl -u kubelet | grep <pod-name>

# 2. Check container status
$ crictl ps -a | grep <pod-name>
$ crictl logs <container-id>

# 3. Check events
$ kubectl describe pod <pod-name>

# 4. Check runc directly
$ runc --root /run/containerd/runc/k8s.io state <container-id>

For a container that starts and then keeps crashing (CrashLoopBackOff), start with the exit code:

# Check exit code
$ crictl inspect <container-id> | jq '.status.exitCode'

# Check logs
$ crictl logs <container-id>

# Common causes:
# - Exit 0: Command completed (wrong entrypoint)
# - Exit 1: Application error
# - Exit 137: OOM killed (128 + 9 = SIGKILL)
# - Exit 139: Segfault (128 + 11 = SIGSEGV)
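
A quick way to decode any exit code above 128: subtract 128 and ask the shell for the signal name.

# 137 - 128 = 9
$ kill -l $((137 - 128))
KILL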

For a container that runs but struggles, check its cgroup limits:

# Find container's cgroup
$ crictl inspect <container-id> | jq '.info.runtimeSpec.linux.cgroupsPath'

# Check CPU throttling
$ cat /sys/fs/cgroup/<cgroup-path>/cpu.stat
nr_throttled 5000       # Throttled 5000 times!
throttled_usec 60000000 # 60 seconds total throttle time

# Check memory pressure
$ cat /sys/fs/cgroup/<cgroup-path>/memory.current
$ cat /sys/fs/cgroup/<cgroup-path>/memory.max

For images that won't pull:

# Check image pull with crictl
$ crictl pull <image>

# Check containerd logs
$ journalctl -u containerd | grep <image>

# Common issues:
# - Registry auth: check /var/lib/kubelet/config.json
# - Network: can node reach registry?
# - Disk space: df -h /var/lib/containerd

The full sequence when you kubectl apply a pod:

  1. API Server stores Pod in etcd
  2. Scheduler assigns Pod to a node
  3. kubelet on that node sees the Pod (via watch)
  4. kubelet calls containerd via CRI: CreatePodSandbox
  5. containerd creates the pause container (network namespace holder)
  6. kubelet calls containerd: CreateContainer for each container
  7. containerd prepares the rootfs (overlayfs from image layers)
  8. containerd spawns a shim for each container
  9. shim calls runc with the OCI spec
  10. runc creates namespaces, cgroups, mounts, security settings
  11. runc execs your entrypoint
  12. runc exits, shim monitors the container
  13. kubelet reports status to API server
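
One easy-to-miss actor in this sequence is the pause container from step 5. It never appears in crictl ps, but you can see it at the containerd level (the pause image name and tag vary by cluster, and newer containerd sandbox modes may differ):

# Pod sandboxes are pause containers at the containerd layer
$ ctr -n k8s.io containers ls | grep pause
abc123...    registry.k8s.io/pause:3.9    io.containerd.runc.v2

crictl pods shows the same sandboxes from the CRI's point of view.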

When something goes wrong, trace backwards through these layers until you find where it broke.

Kubernetes doesn’t run containers — it orchestrates a stack of tools that do:

Layer   Tool           Purpose
1       kubelet        Node agent, pod lifecycle
2       CRI            gRPC API to runtime
3       containerd     Image and container management
4       shim           Per-container daemon
5       runc           OCI runtime, creates namespaces/cgroups
6       Your process   Just a Linux process with isolation

Each layer has its own tools:

Layer        Debug Tool
kubelet      journalctl -u kubelet
CRI          crictl
containerd   ctr
runc         runc state / runc list
Container    nsenter, /proc, cgroup fs

When debugging, start at the top (kubectl describe, kubelet logs) and work down. By the time you’re running runc state, you’re debugging Linux primitives, not Kubernetes.