Your ML team needs GPUs. You add nodes with NVIDIA A100s, install the device plugin, and suddenly Kubernetes can schedule GPU workloads. But then the requests start: “Can we share a GPU between pods?” “Why is my training job slow even though I have 8 GPUs?” “Can we request a specific GPU model?”
GPU scheduling in Kubernetes has evolved from a simple device plugin model to the more flexible Dynamic Resource Allocation (DRA). This post covers both, explaining how they work and when to use each.
The Problem: GPUs Aren’t Like CPU or Memory ¶
CPU and memory are fungible. If you request 2 CPUs, any 2 CPUs work. The scheduler doesn’t care which ones.
GPUs are different:
- Heterogeneous: A100 vs V100 vs T4 have vastly different capabilities
- Topology matters: GPU-to-GPU and GPU-to-CPU connectivity affects performance
- Not easily divisible: You can’t give a pod “0.5 GPUs” the way you give it 500m CPU
- State and configuration: GPUs have drivers, compute modes, memory configurations
- Expensive: At $2-10/hour per GPU, idle GPUs hurt
The standard Kubernetes resource model (requests/limits) wasn’t designed for this.
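To make the contrast concrete, here is a hypothetical pod spec (names and image are placeholders): CPU and memory accept arbitrary fractions, while a GPU exposed as an extended resource can only be requested in whole units.
apiVersion: v1
kind: Pod
metadata:
  name: resource-model-demo
spec:
  containers:
  - name: app
    image: nvidia/cuda:12.0-base
    resources:
      requests:
        cpu: 500m          # half a CPU is fine
        memory: 256Mi      # any granularity works
      limits:
        nvidia.com/gpu: 1  # whole devices only; 0.5 would be rejected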
Device Plugins: The Current Model ¶
Since Kubernetes 1.8, device plugins let vendors expose hardware to the scheduler.
How Device Plugins Work ¶
+------------------+          +-------------------+
|     kubelet      |<-------->|   Device Plugin   |
|                  |   gRPC   |  (e.g., NVIDIA)   |
+------------------+          +-------------------+
         |                              |
         |                              v
         |                    +-------------------+
         |                    |   GPU Hardware    |
         v                    +-------------------+
+-------------------+
|    API Server     |
|                   |
|  Node resources:  |
|  nvidia.com/gpu: 4|
+-------------------+
- Device plugin registers with kubelet via gRPC
- Reports available devices (e.g., 4 GPUs)
- kubelet advertises to API server as extended resources
- Scheduler sees nvidia.com/gpu: 4 as allocatable
- When a pod is scheduled, the device plugin tells kubelet which device(s) to assign
Installing NVIDIA Device Plugin ¶
# Add NVIDIA Helm repo
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
# Install device plugin
helm install nvdp nvdp/nvidia-device-plugin \
--namespace nvidia-device-plugin \
--create-namespace
Verify:
$ kubectl describe node gpu-node-1 | grep -A 5 "Allocatable"
Allocatable:
cpu: 32
memory: 128Gi
nvidia.com/gpu: 4
Requesting GPUs ¶
apiVersion: v1
kind: Pod
metadata:
name: gpu-pod
spec:
containers:
- name: cuda-container
image: nvidia/cuda:12.0-base
resources:
limits:
nvidia.com/gpu: 1 # Request 1 GPU
Note: For device plugin resources, limits and requests must be equal. You can’t “burst” GPU usage.
What Happens at Runtime ¶
When the pod is scheduled:
- Device plugin’s Allocate() is called with the assigned device IDs
- Plugin returns environment variables and device mounts:
// Device plugin returns
ContainerAllocateResponse{
Envs: map[string]string{
"NVIDIA_VISIBLE_DEVICES": "GPU-abc123",
},
Mounts: []*Mount{
{ContainerPath: "/dev/nvidia0", HostPath: "/dev/nvidia0"},
},
}
- Container sees only assigned GPU(s)
Limitations of Device Plugins ¶
1. Whole Devices Only ¶
resources:
limits:
nvidia.com/gpu: 1 # OK
nvidia.com/gpu: 0.5 # NOT POSSIBLE
Can’t share a GPU between pods. A $10k A100 sits 90% idle because one pod claimed it.
2. No Device Selection ¶
You can’t say “give me an A100, not a T4.” The scheduler just sees a count:
# What you want
nvidia.com/gpu:
model: A100
memory: 80Gi
# What you can do
nvidia.com/gpu: 1 # Could be anything
Workaround: Use node labels and node selectors:
nodeSelector:
gpu-type: a100
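End to end, the workaround looks something like this (a sketch assuming you have labeled your A100 nodes with a hypothetical gpu-type=a100 label, e.g. kubectl label node gpu-node-1 gpu-type=a100):
apiVersion: v1
kind: Pod
metadata:
  name: training-pod
spec:
  nodeSelector:
    gpu-type: a100               # selects a node, not a specific device
  containers:
  - name: trainer
    image: nvidia/cuda:12.0-base
    resources:
      limits:
        nvidia.com/gpu: 1        # still "any one GPU on that node"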
But this is coarse-grained and doesn’t scale.
3. No Topology Awareness ¶
Multi-GPU training performance depends on GPU interconnects:
Best: NVLink (600 GB/s)
OK: PCIe (64 GB/s)
Bad: Cross-socket PCIe
Pod gets GPU 0 and GPU 3
GPU 0 <--NVLink--> GPU 1
GPU 2 <--NVLink--> GPU 3
GPU 0 <--PCIe-----> GPU 3 ← Slow!
Device plugins don’t consider topology. Your 8-GPU training job might get the worst possible GPU combination.
4. No Preparation or Cleanup ¶
Some devices need setup before use:
- Configure compute mode
- Allocate memory partitions (MIG)
- Load firmware
Device plugins have no lifecycle hooks for this.
5. Scheduling Races ¶
1. Scheduler sees: Node has 2 GPUs free
2. Scheduler assigns Pod A (needs 2 GPUs) to node
3. Before Pod A starts, Pod B (needs 1 GPU) also scheduled to node
4. Conflict!
Extended resources are accounted at scheduling time, but there’s a window for races.
Fractional GPUs: MIG and Time-Slicing ¶
NVIDIA Multi-Instance GPU (MIG) ¶
MIG physically partitions A100/A30/H100 GPUs:
A100 80GB
├── MIG 1g.10gb (instance 1)
├── MIG 1g.10gb (instance 2)
├── MIG 2g.20gb (instance 3)
└── MIG 3g.40gb (instance 4)
Each MIG instance is isolated: separate memory, separate compute units.
Configure MIG with device plugin:
# nvidia-device-plugin config
config:
map:
default: mixed
sharing:
mig:
strategy: mixed
Then request specific MIG profiles:
resources:
limits:
nvidia.com/mig-1g.10gb: 1
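Put together, an inference pod bound to a single MIG slice might look like this (a sketch; the pod name and image are placeholders, and with the mixed strategy each MIG profile is advertised as its own resource name):
apiVersion: v1
kind: Pod
metadata:
  name: inference-pod
spec:
  containers:
  - name: inference
    image: nvidia/cuda:12.0-base   # substitute your inference image
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1  # one 1g.10gb slice, not a whole A100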
Pros: True isolation, guaranteed resources.
Cons: Only certain GPUs support MIG; reconfiguration requires an empty GPU.
Time-Slicing ¶
Multiple pods share one GPU by taking turns:
# Device plugin ConfigMap
sharing:
timeSlicing:
renameByDefault: false
resources:
- name: nvidia.com/gpu
replicas: 4 # Each GPU appears as 4 resources
Now nvidia.com/gpu: 4 becomes nvidia.com/gpu: 16 (4 GPUs × 4 replicas).
# Pod requests "1 GPU" but actually gets 1/4
resources:
limits:
nvidia.com/gpu: 1
Pros: Works on any NVIDIA GPU; no reconfiguration needed.
Cons: No isolation (one pod can starve others); no memory limits.
Comparison ¶
| Feature | MIG | Time-Slicing |
|---|---|---|
| Isolation | Full (memory + compute) | None |
| Supported GPUs | A100, A30, H100 | Any NVIDIA |
| Reconfiguration | Requires empty GPU | Dynamic |
| Memory guarantee | Yes | No |
| Best for | Production inference | Dev/test, bursty workloads |
Dynamic Resource Allocation (DRA) ¶
DRA (alpha since Kubernetes 1.26 and still maturing as of 1.31+) is the next evolution. It addresses device plugin limitations with a claim-based model.
Key Concepts ¶
ResourceClaim: A request for resources (like PVC for storage)
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
name: gpu-claim
spec:
resourceClassName: gpu.nvidia.com
ResourceClass: Defines a type of resource and its driver
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClass
metadata:
name: gpu.nvidia.com
driverName: gpu.nvidia.com
ResourceClaimTemplate: For dynamic claim creation
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata:
name: gpu-template
spec:
spec:
resourceClassName: gpu.nvidia.com
How DRA Works ¶
1. User creates ResourceClaim (or template in Pod)
2. Scheduler finds nodes where claim can be satisfied
3. DRA driver's "allocate" called with node context
4. Driver prepares device (configure MIG, set mode, etc.)
5. Pod starts with device available
6. On pod termination, driver cleans up
DRA vs Device Plugins ¶
| Aspect | Device Plugins | DRA |
|---|---|---|
| Granularity | Whole devices | Flexible (fractions, attributes) |
| Device selection | Count only | Rich selectors |
| Lifecycle | None | Prepare/cleanup hooks |
| Scheduling | Simple counting | Structured parameters |
| State | Stateless | Claim tracks allocation |
Requesting GPUs with DRA ¶
apiVersion: v1
kind: Pod
metadata:
name: gpu-pod
spec:
containers:
- name: cuda-container
image: nvidia/cuda:12.0-base
resources:
claims:
- name: gpu
resourceClaims:
- name: gpu
source:
resourceClaimTemplateName: gpu-template
With vendor claim parameters for fine-grained selection (structured parameters are expected to standardize this in the future):
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
name: specific-gpu
spec:
resourceClassName: gpu.nvidia.com
parametersRef:
apiGroup: gpu.nvidia.com
kind: GpuClaimParameters
name: my-params
---
apiVersion: gpu.nvidia.com/v1alpha1
kind: GpuClaimParameters
metadata:
name: my-params
spec:
selector:
model: A100
memory: 80Gi
sharing:
strategy: MIG
profile: 3g.40gb
Current State (Kubernetes 1.31+) ¶
DRA is maturing but still evolving:
- Core API: Stable enough for testing
- NVIDIA DRA driver: Available, replacing device plugin in some deployments
- Structured parameters: Still developing
- Production readiness: Check your version’s feature gates
# Enable DRA feature gates (if not default)
--feature-gates=DynamicResourceAllocation=true
Topology-Aware Scheduling ¶
The Problem ¶
8-GPU training job needs GPUs that can communicate fast:
Ideal: All 8 GPUs on same NVLink domain
OK: 4+4 across two NVLink domains
Bad: 8 GPUs scattered across PCIe
Topology Manager ¶
kubelet’s Topology Manager aligns resource allocation:
# kubelet configuration
topologyManagerPolicy: best-effort # or: restricted, single-numa-node
topologyManagerScope: container # or: pod
Policies:
- none: No topology alignment
- best-effort: Try to align, but schedule anyway
- restricted: Fail if resources cannot be aligned
- single-numa-node: All resources must come from a single NUMA node
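For reference, the same settings as a complete kubelet configuration file (a sketch: restricted and pod scope are chosen here only as an example, and CPU alignment additionally requires the static CPU manager policy):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static          # needed for CPU pinning to participate in alignment
topologyManagerPolicy: restricted # reject pods that cannot be NUMA-aligned
topologyManagerScope: pod         # align all containers in the pod together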
NVIDIA GPU Operator ¶
The GPU Operator automates GPU node setup and includes topology awareness:
helm install gpu-operator nvidia/gpu-operator \
--set driver.enabled=true \
--set toolkit.enabled=true \
--set devicePlugin.enabled=true \
--set mig.strategy=mixed
It handles:
- Driver installation
- Container toolkit
- Device plugin
- GPU feature discovery
- MIG management
- Monitoring
GPU Feature Discovery ¶
Automatically labels nodes with GPU details:
$ kubectl describe node gpu-node | grep nvidia
nvidia.com/cuda.driver.major=535
nvidia.com/cuda.driver.minor=129
nvidia.com/cuda.runtime.major=12
nvidia.com/gpu.compute.major=8
nvidia.com/gpu.count=4
nvidia.com/gpu.family=ampere
nvidia.com/gpu.memory=81920
nvidia.com/gpu.product=NVIDIA-A100-SXM4-80GB
nvidia.com/mig.capable=true
Now you can select by GPU type:
nodeSelector:
nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
Or with affinity for flexibility:
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
preference:
matchExpressions:
- key: nvidia.com/gpu.product
operator: In
values:
- NVIDIA-A100-SXM4-80GB
- NVIDIA-A100-PCIE-80GB
Best Practices ¶
1. Right-Size GPU Requests ¶
Don’t request 8 GPUs for a job that uses 1. GPUs are expensive.
# Profile your workload first
resources:
limits:
nvidia.com/gpu: 1 # Start small, scale up if needed
2. Use MIG for Inference ¶
Inference workloads often don’t need a full A100:
resources:
limits:
nvidia.com/mig-1g.10gb: 1 # 1/7 of an A100
3. Node Pools by GPU Type ¶
Separate node pools for different GPU types:
# Training pool: A100s
nodeSelector:
gpu-pool: training
# Inference pool: T4s (cheaper)
nodeSelector:
gpu-pool: inference
4. Set Resource Quotas ¶
Prevent GPU hoarding:
apiVersion: v1
kind: ResourceQuota
metadata:
name: gpu-quota
spec:
hard:
requests.nvidia.com/gpu: "10" # Max 10 GPUs requested per namespace
5. Use Priority Classes ¶
Critical training jobs should preempt development workloads:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: training-critical
value: 1000000
preemptionPolicy: PreemptLowerPriority
---
apiVersion: v1
kind: Pod
spec:
priorityClassName: training-critical
# ...
6. Monitor GPU Utilization ¶
Low utilization = wasted money:
# DCGM exporter metrics
DCGM_FI_DEV_GPU_UTIL      # GPU utilization (%)
DCGM_FI_DEV_MEM_COPY_UTIL # memory bandwidth utilization (%)
DCGM_FI_DEV_POWER_USAGE   # power draw (W)
Alert if GPUs are allocated but idle:
# GPU averaged <10% utilization over 30m while GPU limits are allocated
(avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 10)
  and on() (sum(kube_pod_container_resource_limits{resource="nvidia_com_gpu"}) > 0)
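If you run the Prometheus Operator, this can be packaged as a PrometheusRule. The following is a sketch: the alert name, namespace, threshold, and the on() join between DCGM exporter and kube-state-metrics series are assumptions to adapt to your environment.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-utilization-alerts
  namespace: monitoring            # assumed monitoring namespace
spec:
  groups:
  - name: gpu.rules
    rules:
    - alert: GPUAllocatedButIdle   # hypothetical alert name
      expr: |
        (avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 10)
        and on()
        (sum(kube_pod_container_resource_limits{resource="nvidia_com_gpu"}) > 0)
      labels:
        severity: warning
      annotations:
        summary: GPU allocated but under 10% utilized for 30 minutes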
Debugging GPU Scheduling ¶
Pod Stuck Pending ¶
kubectl describe pod gpu-pod
# Common messages:
# "Insufficient nvidia.com/gpu" - No nodes with free GPUs
# "0/10 nodes available: 10 node(s) didn't match node selector" - Wrong labels
Check Node GPU Status ¶
# Available GPUs
kubectl describe node gpu-node | grep -A 5 Allocatable
# Allocated GPUs
kubectl describe node gpu-node | grep -A 5 "Allocated resources"
Verify Device Plugin ¶
# Check device plugin pods
kubectl get pods -n nvidia-device-plugin
# Check logs
kubectl logs -n nvidia-device-plugin -l app=nvidia-device-plugin
Check GPU Health ¶
# On the node
nvidia-smi
# Look for:
# - GPU memory errors
# - Temperature
# - Power state
# - Running processes
MIG Configuration Issues ¶
# Check MIG status
nvidia-smi mig -lgi
# Check device plugin sees MIG devices
kubectl logs -n nvidia-device-plugin <pod> | grep -i mig
Summary ¶
GPU scheduling in Kubernetes has evolved:
| Generation | Mechanism | Capabilities |
|---|---|---|
| 1st | Device Plugins | Whole GPU allocation, simple counting |
| 1.5 | + MIG/Time-slicing | Fractional GPUs, but hacks on top of device plugins |
| 2nd | DRA | Rich selectors, lifecycle hooks, structured parameters |
Current recommendations:
| Use Case | Approach |
|---|---|
| Simple GPU workloads | Device plugin + node selectors |
| Fractional GPUs (A100) | MIG via device plugin |
| Shared dev/test GPUs | Time-slicing |
| Complex requirements | Evaluate DRA (if K8s 1.31+) |
| Multi-GPU training | Topology Manager + GPU Operator |
The device plugin model works but hits walls at scale. DRA is the future—a proper API for hardware allocation. As it matures, expect richer GPU scheduling: specific models, memory requirements, topology constraints, all expressible in standard Kubernetes resources.