Running AI on Kubernetes: What Breaks and What’s Being Built


Kubernetes was built to orchestrate containerized workloads at scale. But its default abstractions — Deployments, Services, HPA, state-unaware load balancing — are optimized for a specific shape: stateless, ephemeral pods handling roughly uniform requests that complete in milliseconds. The resource model, scheduler, autoscaler, and load balancer were all designed around this pattern.

The unit of compute for AI is a 70-billion-parameter model loaded across 4 GPUs, managing a KV cache larger than most databases, serving requests that vary in cost by 1000x. It’s not stateless, not ephemeral, not uniform — and every K8s abstraction that assumes otherwise breaks in a specific, diagnosable way. Understanding how each one breaks tells you where the infrastructure needs to go next.

Kubernetes models resources as continuous and fungible. You request 500m CPU, 256Mi memory. Any CPU will do. You can slice finely — 100m here, 200m there — and the kernel handles sharing through CFS quotas and cgroup limits.

GPUs break this at every level. GPU memory is the binding constraint for inference, not compute — a 70B model in fp16 needs roughly 140GB just for weights, and it either fits or it doesn’t. There’s no “give me 0.7 of a GPU.” An A100-40GB, A100-80GB, and H100-80GB are not interchangeable, but Kubernetes sees all three as nvidia.com/gpu: 1.
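The arithmetic behind this is easy to sketch. A rough fit check, assuming fp16 at 2 bytes per parameter and some headroom for runtime overhead, shows why allocation is a discrete, per-device decision. The numbers are illustrative; real serving also needs memory for KV cache and activations:

```python
import math

# Sketch: why GPU allocation is discrete, not fractional.
# Memory figures are the nominal device capacities; headroom is an assumption.

GPU_MEMORY_GB = {"A100-40GB": 40, "A100-80GB": 80, "H100-80GB": 80}

def weights_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """fp16 = 2 bytes per parameter."""
    return params_billions * 1e9 * bytes_per_param / 1e9

def min_gpus(model_gb: float, gpu_gb: int, headroom: float = 0.9) -> int:
    """Smallest whole number of GPUs whose usable memory fits the weights."""
    usable = gpu_gb * headroom
    return math.ceil(model_gb / usable)

model = weights_gb(70)  # ~140 GB of weights for a 70B model in fp16
for name, mem in GPU_MEMORY_GB.items():
    print(f"{name}: {min_gpus(model, mem)} GPUs for weights alone")
```

The answer differs per device type (4 on A100-40GB, 2 on the 80GB parts), which is exactly the distinction `nvidia.com/gpu: 1` erases.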

The workarounds each trade something away. MIG physically partitions a GPU into isolated instances, but only on supported hardware, and repartitioning requires first draining every workload from the GPU. Time-slicing lets multiple pods share a GPU but provides no memory isolation — one pod’s allocation spike can OOM-kill another’s inference. Dynamic Resource Allocation (DRA) is the most significant step forward, introducing a claim-based model with rich device attribute matching. But ecosystem adoption is still catching up, and expressing GPU memory capacity and interconnect topology as schedulable attributes requires device plugin support that not all vendors provide yet.

The principle: Kubernetes’s resource model works when resources are continuous and interchangeable. GPU resources are discrete and heterogeneous — a model either fits on a device or it doesn’t, and which specific device matters for performance in ways that CPUs never did. (For a deeper look at DRA and GPU scheduling mechanics, see GPU Scheduling and Dynamic Resource Allocation. For how the CPU resource model itself breaks under CFS quotas, see CPU Throttling.)

Kubernetes assumes pods are ephemeral — kill one, start another in seconds. Rolling deployments, preemption, rescheduling, autoscaling — all assume cheap pod replacement.

For inference, it isn’t cheap. Loading a model into GPU memory — downloading weights, transferring to VRAM, warming up the runtime — takes 30-60 seconds for a 7B model, 2-10 minutes for a 70B model. During loading, the replica serves nothing.
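A back-of-envelope model makes the scale concrete. The bandwidth and warmup figures below are assumptions, not measurements, but they land in the ranges above:

```python
# Back-of-envelope cold-start estimate for bringing up an inference replica:
# download weights, transfer to VRAM, warm up the runtime.
# All bandwidth/warmup figures are illustrative assumptions.

def cold_start_seconds(weights_gb: float,
                       network_gbps: float = 10.0,   # object-store download
                       pcie_gbps: float = 25.0,      # host-to-VRAM transfer
                       warmup_s: float = 15.0) -> float:
    download = weights_gb * 8 / network_gbps   # GB -> gigabits
    transfer = weights_gb * 8 / pcie_gbps
    return download + transfer + warmup_s

for label, gb in [("7B (fp16, ~14 GB)", 14), ("70B (fp16, ~140 GB)", 140)]:
    print(f"{label}: ~{cold_start_seconds(gb):.0f}s before serving a request")
```

Under these assumptions a 7B model costs about half a minute and a 70B model about three minutes, every single time a pod is replaced.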

This makes standard K8s operations actively harmful. Preemption, scale-down, rescheduling — each destroys a loaded model and triggers a multi-minute cold start. Scale-down is particularly bad: HPA has no concept of “this pod’s loaded state is expensive to reconstruct,” so it tears down a warm replica that might be needed again in five minutes, paying the full loading cost twice. The cattle-not-pets assumption breaks when each pod carries minutes of initialization state in GPU memory.

The principle: Kubernetes treats compute as stateless. AI compute is stateful in ways that don’t map to PersistentVolumes — model weights live in GPU memory, KV caches are ephemeral but valuable, and the infrastructure has no concept of “prefer to keep this pod alive.”

Kubernetes load balancing assumes roughly uniform requests. A Service, in the default iptables kube-proxy mode, distributes connections across backends by random selection, blind to what each request actually costs to serve.

In inference, a 100-token request and a 32K-token request can differ by orders of magnitude in compute cost. vLLM’s continuous batching mitigates this within a replica, but across replicas, default routing has no visibility into per-replica state — one replica queues while another idles. Even with IPVS mode or external load balancers that support least-connections, connection count is the wrong proxy: one long-running generation is more load than ten short completions.
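A toy comparison makes the point. The per-replica telemetry below (in-flight token counts) is hypothetical: it is the kind of signal an inference gateway would scrape from the serving engine, not anything K8s exposes natively:

```python
# Sketch: why connection count is the wrong load proxy for inference.
# "inflight_tokens" is hypothetical telemetry from the serving engine.

replicas = [
    {"name": "a", "connections": 1,  "inflight_tokens": 32_000},  # one long generation
    {"name": "b", "connections": 10, "inflight_tokens": 1_500},   # ten short completions
]

def least_connections(reps):
    """What a connection-counting load balancer would pick."""
    return min(reps, key=lambda r: r["connections"])

def least_tokens(reps):
    """What a token-aware router would pick."""
    return min(reps, key=lambda r: r["inflight_tokens"])

print(least_connections(replicas)["name"])  # picks "a", the replica doing far more work
print(least_tokens(replicas)["name"])       # picks "b"
```

Least-connections routes the next request to the replica grinding through a 32K-token generation; a token-aware router sends it where the actual headroom is.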

The metrics that actually matter — tokens/sec, time to first token, KV cache utilization, queue depth — don’t exist in K8s natively. They come from the inference server and require custom metrics adapters to surface. GPU utilization, the metric everyone reaches for first, is misleading: 95% means the hardware is busy, not that it’s making progress. And autoscaling based on these wrong metrics makes the cold start problem from the previous section worse — HPA scales up, a new pod spends minutes loading a model, and the burst passes before it serves a single request.

The principle: Kubernetes optimizes for request count. AI workloads need to optimize for token throughput — a fundamentally different unit of work.

Agents compound all three mismatches at once. An agent mid-task accumulates irreproducible reasoning state with no checkpoint or recovery — kill the pod, lose the chain. Its resource needs are unpredictable — 1K tokens or 500K with 20 tool calls. And tasks run minutes to hours, not milliseconds. K8s quotas, autoscaling, and load balancing all operate on the wrong unit: pods and requests, when what you need is task-level token budgets and state-aware scheduling.
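What task-level budgeting might look like, as a minimal sketch. Every name here is hypothetical; a real agent runtime would also persist this state externally so that a restarted pod doesn’t silently reset the budget:

```python
# Sketch: task-level token budgets, the unit K8s quotas don't have.
# All class and task names are hypothetical illustrations.

class TokenBudgetExceeded(Exception):
    pass

class TaskBudget:
    """Tracks cumulative token spend for one agent task across many calls."""

    def __init__(self, task_id: str, max_tokens: int):
        self.task_id = task_id
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens: int) -> None:
        if self.used + tokens > self.max_tokens:
            raise TokenBudgetExceeded(
                f"task {self.task_id}: {self.used + tokens} > {self.max_tokens}")
        self.used += tokens

budget = TaskBudget("research-task-17", max_tokens=500_000)
budget.charge(120_000)   # a tool-heavy reasoning step
budget.charge(80_000)    # a follow-up call
print(budget.used)       # 200000
```

The enforcement unit is the task, not the pod or the request, which is precisely the mismatch with K8s-native quotas.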

K8s wasn’t built for stateful workloads either. Then StatefulSets, PersistentVolumes, and operators emerged. The question is whether AI workloads can be solved with the same pattern — extend K8s through CRDs, operators, and plugins.

StatefulSets worked because the core scheduling model was still valid for databases. You still placed a pod on a node with CPU and memory — the pod just needed stable identity and persistent storage. For AI workloads, the scheduling model itself is mismatched: the resource abstraction, scaling model, and load balancing model all assume properties that AI workloads don’t have.

The answer is playing out as “both.” K8s is being extended to handle some of these problems, while a new layer emerges above it for the rest.

Three approaches are competing. They’re not mutually exclusive — most production stacks will combine elements of all three.

The first approach extends Kubernetes itself: add AI-specific semantics through its extension points. The bet: K8s won the orchestration war, and the ecosystem benefits are too valuable to abandon.

The Gateway API Inference Extension adds inference-aware routing as a first-class Gateway API concept. Its InferencePool and InferenceModel CRDs let the gateway examine live pod metrics — queue depth, KV cache utilization, loaded adapters — to pick the optimal backend per request. It is already integrated into Envoy Gateway, kgateway, Istio, and NGINX Gateway Fabric. Its partnership with vLLM on llm-d pushes further — splitting inference into separate prefill and decode phases on independent pod pools, with KV-cache-aware scheduling.

On scheduling, Kueue intercepts jobs before pods are created, admitting them only when all required resources are available — solving the partial scheduling problem for multi-GPU workloads. LeaderWorkerSet handles co-scheduling for tensor parallelism. NVIDIA’s KAI Scheduler adds topology-aware bin-packing and hierarchical queues for GPU workloads at scale.
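The admission idea can be sketched as an all-or-nothing fit check in the spirit of Kueue. This is an illustration of the principle, not Kueue’s actual algorithm:

```python
# Sketch: gang (all-or-nothing) admission for a multi-GPU job.
# Admit only if every pod fits simultaneously; otherwise create nothing,
# so partially-scheduled pods never sit holding GPUs while waiting.

def admit(job_gpu_needs: list[int], free_gpus_per_node: list[int]) -> bool:
    """Greedy first-fit check: can every pod be placed at once?"""
    free = sorted(free_gpus_per_node, reverse=True)
    for need in sorted(job_gpu_needs, reverse=True):
        for i, f in enumerate(free):
            if f >= need:
                free[i] -= need
                break
        else:
            return False  # one pod can't fit -> admit nothing
    return True

print(admit([4, 4], [8, 2]))      # True: both 4-GPU pods fit on the 8-GPU node
print(admit([4, 4, 4], [8, 2]))   # False: the third pod would strand the first two
```

The point is the failure mode it prevents: without gang admission, two of the three pods would schedule, pin eight GPUs, and wait forever for a third that can never fit.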

The strength: you keep your existing K8s investment. The weakness: a working inference stack now requires the GPU Operator, a scheduler, a model serving platform, a custom metrics adapter, and an inference gateway — each with its own CRDs and failure modes. And none of this addresses agent orchestration.

The second approach abstracts above Kubernetes: build a higher-level runtime where K8s becomes the node manager underneath.

Ray Serve is the most mature example. Ray’s actor model — a stateful object that lives on a specific node and maintains state across calls — is a much better primitive for an inference server than a K8s pod. Ray handles scheduling, autoscaling, and fault recovery at the application level. Modal goes further: deploy a model with a Python decorator, never think about infrastructure. Managed platforms (Replicate, Baseten, Fireworks) do the same at the endpoint level.
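The difference between the two primitives is easy to see in plain Python, no Ray required. This mirrors the shape of a Ray Serve deployment without using its API; `load_model` is a hypothetical stand-in for an engine’s weight loading:

```python
# Sketch of the actor pattern: state (the loaded model) lives in the object,
# paid for once at construction, reused across calls -- versus a stateless
# handler that would reload per request. load_model is a hypothetical
# stand-in for minutes of real weight loading.

class FakeModel:
    loads = 0  # counts how many times "loading" happened

def load_model(name: str) -> FakeModel:
    FakeModel.loads += 1          # stands in for the expensive cold start
    return FakeModel()

class InferenceActor:
    def __init__(self, model_name: str):
        self.model = load_model(model_name)  # expensive, done exactly once

    def generate(self, prompt: str) -> str:
        return f"completion for: {prompt}"   # reuses the resident self.model

actor = InferenceActor("llama-70b")
for p in ["hello", "world", "again"]:
    actor.generate(p)
print(FakeModel.loads)  # 1 -- three requests, one load
```

A K8s pod gives you the process boundary but none of the "this object is expensive, route to it and keep it alive" semantics; the actor model makes that the default.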

The strength is simplicity — developers think in models, not pods. The weakness is the usual managed-services tradeoff: reduced control, vendor coupling, compliance constraints. Ray occupies a middle ground — a layer above K8s that preserves control while offloading AI-specific orchestration, at the cost of running two orchestration systems.

Neither extending K8s nor abstracting above it solves the third problem, agent orchestration, because the problem is new enough that nobody has a production-grade solution.

Agent frameworks (LangGraph, CrewAI, AutoGen) handle conversation state and tool orchestration but are increasingly absorbing infrastructure concerns — token budgeting, state persistence, error recovery. AI Gateways (Portkey, LiteLLM) handle model routing and cost tracking. MCP standardizes tool discovery.

What’s missing is the layer that manages agent lifecycles, enforces task-level token budgets, persists state across failures, and coordinates multi-agent collaboration. This is 2015-era container orchestration — multiple competing approaches, no dominant abstraction.

The historical pattern: VMs didn’t disappear when containers arrived. Each new abstraction layer coexists with those below it.

+----------------------------------------------------------+
|  Agent Orchestration                                     |
|  task management, tool permissions, token budgets,       |
|  state persistence, multi-agent coordination             |
|  (LangGraph, CrewAI... wide open, nobody has won)        |
+----------------------------------------------------------+
|  Model Serving                                           |
|  inference routing, token-aware load balancing,          |
|  KV cache management, model-specific autoscaling         |
|  (Gateway API Inference Ext, KServe, Ray Serve, vLLM)    |
+----------------------------------------------------------+
|  GPU / Compute Orchestration                             |
|  device scheduling, topology awareness, gang scheduling, |
|  model loading, memory management                        |
|  (Kueue, KAI Scheduler, NVIDIA GPU Operator, DRA)        |
+----------------------------------------------------------+
|  Kubernetes                                              |
|  node management, networking, storage, RBAC,             |
|  observability -- the "distributed OS" substrate         |
+----------------------------------------------------------+
|  Cloud / Bare Metal                                      |
+----------------------------------------------------------+

K8s doesn’t get replaced — it drops down a layer, becoming the distributed OS that handles nodes, networking, storage, and security. The AI infrastructure layer builds on K8s the same way K8s built on container runtimes: use the layer below for what it’s good at, and handle the new concerns above.

If you’re running inference today and the K8s ecosystem is familiar, extend it — Gateway API Inference Extension for routing, Kueue for scheduling, DRA for device management. The components are maturing fast and the integration paths are well-documented.

If you need the model-level abstraction natively, Ray gives you that without abandoning K8s underneath. You trade K8s-native simplicity for a runtime that was designed for stateful AI workloads.

If you’re building agent-native products, you’re defining the orchestration layer as you go — bespoke, opinionated, purpose-built. It’ll look a lot like K8s did in 2015. The gap between what K8s can do and what AI workloads need is closing, just not as fast as the workloads themselves are evolving.