Your ML team needs GPUs. You add nodes with NVIDIA A100s, install the device plugin, and suddenly Kubernetes can schedule GPU workloads. But then the requests start: “Can we share a GPU between pods?” “Why is my training job slow even though I have 8 GPUs?” “Can we request a specific GPU model?”
GPU scheduling in Kubernetes has evolved from a simple device plugin model to the more flexible Dynamic Resource Allocation (DRA). This post covers both, explaining how they work and when to use each.
The Problem: GPUs Aren’t Like CPU or Memory ¶
CPU and memory are fungible. If you request 2 CPUs, any 2 CPUs work. The scheduler doesn’t care which ones.
GPUs are different:
- Heterogeneous: A100 vs V100 vs T4 have vastly different capabilities
- Topology matters: GPU-to-GPU and GPU-to-CPU connectivity affects performance
- Not easily divisible: You can’t give a pod “0.5 GPUs” the way you give it 500m CPU
- State and configuration: GPUs have drivers, compute modes, memory configurations
- Expensive: At $2-10/hour per GPU, idle GPUs hurt
The standard Kubernetes resource model (requests/limits) wasn’t designed for this.
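To make the contrast concrete, here is a hypothetical pod spec (names and image are placeholders): CPU and memory accept arbitrary fractions, while a GPU exposed as an extended resource can only be requested in whole units.
apiVersion: v1
kind: Pod
metadata:
  name: resource-model-demo
spec:
  containers:
  - name: app
    image: nvidia/cuda:12.0-base
    resources:
      requests:
        cpu: 500m          # half a CPU is fine
        memory: 256Mi      # any granularity works
      limits:
        nvidia.com/gpu: 1  # whole devices only; 0.5 would be rejected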
Device Plugins: The Current Model ¶
Since Kubernetes 1.8, device plugins let vendors expose hardware to the scheduler.
How Device Plugins Work ¶
+------------------+          +-------------------+
|     kubelet      |<-------->|   Device Plugin   |
|                  |   gRPC   |  (e.g., NVIDIA)   |
+------------------+          +-------------------+
         |                              |
         |                              v
         |                    +-------------------+
         |                    |   GPU Hardware    |
         v                    +-------------------+
+-------------------+
|    API Server     |
|                   |
|  Node resources:  |
|  nvidia.com/gpu: 4|
+-------------------+
- Device plugin registers with kubelet via gRPC
- Reports available devices (e.g., 4 GPUs)
- kubelet advertises to API server as extended resources
- Scheduler sees nvidia.com/gpu: 4 as allocatable
- When a pod is scheduled, the device plugin tells kubelet which device(s) to assign
Installing NVIDIA Device Plugin ¶
# Add NVIDIA Helm repo
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
# Install device plugin
helm install nvdp nvdp/nvidia-device-plugin \
--namespace nvidia-device-plugin \
--create-namespace
Verify:
$ kubectl describe node gpu-node-1 | grep -A 5 "Allocatable"
Allocatable:
cpu: 32
memory: 128Gi
nvidia.com/gpu: 4
Requesting GPUs ¶
apiVersion: v1
kind: Pod
metadata:
name: gpu-pod
spec:
containers:
- name: cuda-container
image: nvidia/cuda:12.0-base
resources:
limits:
nvidia.com/gpu: 1 # Request 1 GPU
Note: For device plugin resources, limits and requests must be equal. You can’t “burst” GPU usage.
What Happens at Runtime ¶
When the pod is scheduled:
- Device plugin’s Allocate() is called with the assigned device IDs
- Plugin returns environment variables and device mounts:
// Device plugin returns
ContainerAllocateResponse{
Envs: map[string]string{
"NVIDIA_VISIBLE_DEVICES": "GPU-abc123",
},
Mounts: []*Mount{
{ContainerPath: "/dev/nvidia0", HostPath: "/dev/nvidia0"},
},
}
- Container sees only assigned GPU(s)
Limitations of Device Plugins ¶
1. Whole Devices Only ¶
resources:
limits:
nvidia.com/gpu: 1 # OK
nvidia.com/gpu: 0.5 # NOT POSSIBLE
Can’t share a GPU between pods. A $10k A100 sits 90% idle because one pod claimed it.
2. No Device Selection ¶
You can’t say “give me an A100, not a T4.” The scheduler just sees a count:
# What you want
nvidia.com/gpu:
model: A100
memory: 80Gi
# What you can do
nvidia.com/gpu: 1 # Could be anything
Workaround: Use node labels and node selectors:
nodeSelector:
gpu-type: a100
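End to end, the workaround looks something like this (a sketch assuming you have labeled your A100 nodes with a hypothetical gpu-type=a100 label, e.g. kubectl label node gpu-node-1 gpu-type=a100):
apiVersion: v1
kind: Pod
metadata:
  name: training-pod
spec:
  nodeSelector:
    gpu-type: a100               # selects a node, not a specific device
  containers:
  - name: trainer
    image: nvidia/cuda:12.0-base
    resources:
      limits:
        nvidia.com/gpu: 1        # still "any one GPU on that node"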
But this is coarse-grained and doesn’t scale.
3. No Topology Awareness ¶
Multi-GPU training performance depends on GPU interconnects:
Best: NVLink (600 GB/s)
OK: PCIe (64 GB/s)
Bad: Cross-socket PCIe
Pod gets GPU 0 and GPU 3
GPU 0 <--NVLink--> GPU 1
GPU 2 <--NVLink--> GPU 3
GPU 0 <--PCIe-----> GPU 3 ← Slow!
Device plugins don’t consider topology. Your 8-GPU training job might get the worst possible GPU combination.
4. No Preparation or Cleanup ¶
Some devices need setup before use:
- Configure compute mode
- Allocate memory partitions (MIG)
- Load firmware
Device plugins have no lifecycle hooks for this.
5. Scheduling Races ¶
1. Scheduler sees: Node has 2 GPUs free
2. Scheduler assigns Pod A (needs 2 GPUs) to node
3. Before Pod A starts, Pod B (needs 1 GPU) also scheduled to node
4. Conflict!
Extended resources are accounted at scheduling time, but there’s a window for races.
Fractional GPUs: MIG and Time-Slicing ¶
NVIDIA Multi-Instance GPU (MIG) ¶
MIG physically partitions A100/A30/H100 GPUs:
A100 80GB
├── MIG 1g.10gb (instance 1)
├── MIG 1g.10gb (instance 2)
├── MIG 2g.20gb (instance 3)
└── MIG 3g.40gb (instance 4)
Each MIG instance is isolated: separate memory, separate compute units.
Configure MIG with device plugin:
# nvidia-device-plugin config
config:
map:
default: mixed
sharing:
mig:
strategy: mixed
Then request specific MIG profiles:
resources:
limits:
nvidia.com/mig-1g.10gb: 1
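Put together, an inference pod bound to a single MIG slice might look like this (a sketch; the pod name and image are placeholders, and with the mixed strategy each MIG profile is advertised as its own resource name):
apiVersion: v1
kind: Pod
metadata:
  name: inference-pod
spec:
  containers:
  - name: inference
    image: nvidia/cuda:12.0-base   # substitute your inference image
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1  # one 1g.10gb slice, not a whole A100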
Pros: True isolation, guaranteed resources.
Cons: Only certain GPUs support MIG; reconfiguration requires an empty GPU.
Time-Slicing ¶
Multiple pods share one GPU by taking turns:
# Device plugin ConfigMap
sharing:
timeSlicing:
renameByDefault: false
resources:
- name: nvidia.com/gpu
replicas: 4 # Each GPU appears as 4 resources
Now nvidia.com/gpu: 4 becomes nvidia.com/gpu: 16 (4 GPUs × 4 replicas).
# Pod requests "1 GPU" but actually gets 1/4
resources:
limits:
nvidia.com/gpu: 1
Pros: Works on any NVIDIA GPU; no reconfiguration needed.
Cons: No isolation (one pod can starve others); no memory limits.
Comparison ¶
| Feature | MIG | Time-Slicing |
|---|---|---|
| Isolation | Full (memory + compute) | None |
| Supported GPUs | A100, A30, H100 | Any NVIDIA |
| Reconfiguration | Requires empty GPU | Dynamic |
| Memory guarantee | Yes | No |
| Best for | Production inference | Dev/test, bursty workloads |
Dynamic Resource Allocation (DRA) ¶
DRA (alpha since Kubernetes 1.26 and still maturing as of 1.31+) is the next evolution. It addresses device plugin limitations with a claim-based model.
Key Concepts ¶
ResourceClaim: A request for resources (like PVC for storage)
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
name: gpu-claim
spec:
resourceClassName: gpu.nvidia.com
ResourceClass: Defines a type of resource and its driver
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClass
metadata:
name: gpu.nvidia.com
driverName: gpu.nvidia.com
ResourceClaimTemplate: For dynamic claim creation
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata:
name: gpu-template
spec:
spec:
resourceClassName: gpu.nvidia.com
How DRA Works ¶
1. User creates ResourceClaim (or template in Pod)
2. Scheduler finds nodes where claim can be satisfied
3. DRA driver's "allocate" called with node context
4. Driver prepares device (configure MIG, set mode, etc.)
5. Pod starts with device available
6. On pod termination, driver cleans up
DRA vs Device Plugins ¶
| Aspect | Device Plugins | DRA |
|---|---|---|
| Granularity | Whole devices | Flexible (fractions, attributes) |
| Device selection | Count only | Rich selectors |
| Lifecycle | None | Prepare/cleanup hooks |
| Scheduling | Simple counting | Structured parameters |
| State | Stateless | Claim tracks allocation |
Requesting GPUs with DRA ¶
apiVersion: v1
kind: Pod
metadata:
name: gpu-pod
spec:
containers:
- name: cuda-container
image: nvidia/cuda:12.0-base
resources:
claims:
- name: gpu
resourceClaims:
- name: gpu
source:
resourceClaimTemplateName: gpu-template
With vendor claim parameters for fine-grained selection (structured parameters are expected to standardize this in the future):
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
name: specific-gpu
spec:
resourceClassName: gpu.nvidia.com
parametersRef:
apiGroup: gpu.nvidia.com
kind: GpuClaimParameters
name: my-params
---
apiVersion: gpu.nvidia.com/v1alpha1
kind: GpuClaimParameters
metadata:
name: my-params
spec:
selector:
model: A100
memory: 80Gi
sharing:
strategy: MIG
profile: 3g.40gb
Current State (Kubernetes 1.31+) ¶
DRA is maturing but still evolving:
- Core API: Stable enough for testing
- NVIDIA DRA driver: Available, replacing device plugin in some deployments
- Structured parameters: Still developing
- Production readiness: Check your version’s feature gates
# Enable DRA feature gates (if not default)
--feature-gates=DynamicResourceAllocation=true
Topology-Aware Scheduling ¶
The Problem ¶
8-GPU training job needs GPUs that can communicate fast:
Ideal: All 8 GPUs on same NVLink domain
OK: 4+4 across two NVLink domains
Bad: 8 GPUs scattered across PCIe
Topology Manager ¶
kubelet’s Topology Manager aligns resource allocation:
# kubelet configuration
topologyManagerPolicy: best-effort # or: restricted, single-numa-node
topologyManagerScope: container # or: pod
Policies:
- none: No topology alignment
- best-effort: Try to align, but schedule anyway
- restricted: Fail if resources cannot be aligned
- single-numa-node: All resources must come from a single NUMA node
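For reference, the same settings as a complete kubelet configuration file (a sketch: restricted and pod scope are chosen here only as an example, and CPU alignment additionally requires the static CPU manager policy):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static          # needed for CPU pinning to participate in alignment
topologyManagerPolicy: restricted # reject pods that cannot be NUMA-aligned
topologyManagerScope: pod         # align all containers in the pod together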
NVIDIA GPU Operator ¶
The GPU Operator automates GPU node setup and includes topology awareness:
helm install gpu-operator nvidia/gpu-operator \
--set driver.enabled=true \
--set toolkit.enabled=true \
--set devicePlugin.enabled=true \
--set mig.strategy=mixed
It handles:
- Driver installation
- Container toolkit
- Device plugin
- GPU feature discovery
- MIG management
- Monitoring
GPU Feature Discovery ¶
Automatically labels nodes with GPU details:
$ kubectl describe node gpu-node | grep nvidia
nvidia.com/cuda.driver.major=535
nvidia.com/cuda.driver.minor=129
nvidia.com/cuda.runtime.major=12
nvidia.com/gpu.compute.major=8
nvidia.com/gpu.count=4
nvidia.com/gpu.family=ampere
nvidia.com/gpu.memory=81920
nvidia.com/gpu.product=NVIDIA-A100-SXM4-80GB
nvidia.com/mig.capable=true
Now you can select by GPU type:
nodeSelector:
nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
Or with affinity for flexibility:
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
preference:
matchExpressions:
- key: nvidia.com/gpu.product
operator: In
values:
- NVIDIA-A100-SXM4-80GB
- NVIDIA-A100-PCIE-80GB
Best Practices ¶
1. Right-Size GPU Requests ¶
Don’t request 8 GPUs for a job that uses 1. GPUs are expensive.
# Profile your workload first
resources:
limits:
nvidia.com/gpu: 1 # Start small, scale up if needed
2. Use MIG for Inference ¶
Inference workloads often don’t need a full A100:
resources:
limits:
nvidia.com/mig-1g.10gb: 1 # 1/7 of an A100
3. Node Pools by GPU Type ¶
Separate node pools for different GPU types:
# Training pool: A100s
nodeSelector:
gpu-pool: training
# Inference pool: T4s (cheaper)
nodeSelector:
gpu-pool: inference
4. Set Resource Quotas ¶
Prevent GPU hoarding:
apiVersion: v1
kind: ResourceQuota
metadata:
name: gpu-quota
spec:
hard:
requests.nvidia.com/gpu: "10" # Max 10 GPUs requested per namespace
5. Use Priority Classes ¶
Critical training jobs should preempt development workloads:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: training-critical
value: 1000000
preemptionPolicy: PreemptLowerPriority
---
apiVersion: v1
kind: Pod
spec:
priorityClassName: training-critical
# ...
6. Monitor GPU Utilization ¶
Low utilization = wasted money:
# DCGM exporter metrics
DCGM_FI_DEV_GPU_UTIL      # GPU utilization (%)
DCGM_FI_DEV_MEM_COPY_UTIL # memory bandwidth utilization (%)
DCGM_FI_DEV_POWER_USAGE   # power draw (W)
Alert if GPUs are allocated but idle:
# GPU averaged <10% utilization over 30m while GPU limits are allocated
(avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 10)
  and on() (sum(kube_pod_container_resource_limits{resource="nvidia_com_gpu"}) > 0)
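If you run the Prometheus Operator, this can be packaged as a PrometheusRule. The following is a sketch: the alert name, namespace, threshold, and the on() join between DCGM exporter and kube-state-metrics series are assumptions to adapt to your environment.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-utilization-alerts
  namespace: monitoring            # assumed monitoring namespace
spec:
  groups:
  - name: gpu.rules
    rules:
    - alert: GPUAllocatedButIdle   # hypothetical alert name
      expr: |
        (avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 10)
        and on()
        (sum(kube_pod_container_resource_limits{resource="nvidia_com_gpu"}) > 0)
      labels:
        severity: warning
      annotations:
        summary: GPU allocated but under 10% utilized for 30 minutes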
Debugging GPU Scheduling ¶
Pod Stuck Pending ¶
kubectl describe pod gpu-pod
# Common messages:
# "Insufficient nvidia.com/gpu" - No nodes with free GPUs
# "0/10 nodes available: 10 node(s) didn't match node selector" - Wrong labels
Check Node GPU Status ¶
# Available GPUs
kubectl describe node gpu-node | grep -A 5 Allocatable
# Allocated GPUs
kubectl describe node gpu-node | grep -A 5 "Allocated resources"
Verify Device Plugin ¶
# Check device plugin pods
kubectl get pods -n nvidia-device-plugin
# Check logs
kubectl logs -n nvidia-device-plugin -l app=nvidia-device-plugin
Check GPU Health ¶
# On the node
nvidia-smi
# Look for:
# - GPU memory errors
# - Temperature
# - Power state
# - Running processes
MIG Configuration Issues ¶
# Check MIG status
nvidia-smi mig -lgi
# Check device plugin sees MIG devices
kubectl logs -n nvidia-device-plugin <pod> | grep -i mig
Summary ¶
GPU scheduling in Kubernetes has evolved:
| Generation | Mechanism | Capabilities |
|---|---|---|
| 1st | Device Plugins | Whole GPU allocation, simple counting |
| 1.5 | + MIG/Time-slicing | Fractional GPUs, but hacks on top of device plugins |
| 2nd | DRA | Rich selectors, lifecycle hooks, structured parameters |
Current recommendations:
| Use Case | Approach |
|---|---|
| Simple GPU workloads | Device plugin + node selectors |
| Fractional GPUs (A100) | MIG via device plugin |
| Shared dev/test GPUs | Time-slicing |
| Complex requirements | Evaluate DRA (if K8s 1.31+) |
| Multi-GPU training | Topology Manager + GPU Operator |
The device plugin model works but hits walls at scale. DRA is the future—a proper API for hardware allocation. As it matures, expect richer GPU scheduling: specific models, memory requirements, topology constraints, all expressible in standard Kubernetes resources.