GPU Scheduling in Kubernetes: From Device Plugins to Dynamic Resource Allocation


Your ML team needs GPUs. You add nodes with NVIDIA A100s, install the device plugin, and suddenly Kubernetes can schedule GPU workloads. But then the requests start: “Can we share a GPU between pods?” “Why is my training job slow even though I have 8 GPUs?” “Can we request a specific GPU model?”

GPU scheduling in Kubernetes has evolved from a simple device plugin model to the more flexible Dynamic Resource Allocation (DRA). This post covers both, explaining how they work and when to use each.

CPU and memory are fungible. If you request 2 CPUs, any 2 CPUs work. The scheduler doesn’t care which ones.

GPUs are different:

  1. Heterogeneous: A100 vs V100 vs T4 have vastly different capabilities
  2. Topology matters: GPU-to-GPU and GPU-to-CPU connectivity affects performance
  3. Not easily divisible: You can’t give a pod “0.5 GPUs” the way you give it 500m CPU
  4. State and configuration: GPUs have drivers, compute modes, memory configurations
  5. Expensive: At $2-10/hour per GPU, idle GPUs hurt

The standard Kubernetes resource model (requests/limits) wasn’t designed for this.
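
For contrast, a minimal resource stanza: fractional CPU is a first-class concept, while the stock GPU resource only accepts whole units (names and values here are illustrative):

resources:
  requests:
    cpu: 500m            # half a CPU is expressible
    memory: 1Gi
  limits:
    nvidia.com/gpu: 1    # GPUs: whole devices only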

Since Kubernetes 1.8, device plugins let vendors expose hardware to the scheduler.

+------------------+       +-------------------+
|    kubelet       |<----->|   Device Plugin   |
|                  |  gRPC |  (e.g., NVIDIA)   |
+------------------+       +-------------------+
         |                          |
         |                          v
         |                 +-------------------+
         |                 |   GPU Hardware    |
         v                 +-------------------+
+-------------------+
|   API Server      |
|                   |
|  Node resources:  |
|  nvidia.com/gpu: 4|
+-------------------+
  1. Device plugin registers with kubelet via gRPC
  2. Reports available devices (e.g., 4 GPUs)
  3. kubelet advertises to API server as extended resources
  4. Scheduler sees nvidia.com/gpu: 4 as allocatable
  5. When pod scheduled, device plugin tells kubelet which device(s) to assign
The NVIDIA device plugin ships as a Helm chart:

# Add NVIDIA Helm repo
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update

# Install device plugin
helm install nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace

Verify:

$ kubectl describe node gpu-node-1 | grep -A 5 "Allocatable"
Allocatable:
  cpu:                32
  memory:             128Gi
  nvidia.com/gpu:     4
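
You can also confirm registration on the node itself: device plugins register with the kubelet over a Unix socket in its device-plugins directory (the socket name varies by plugin and version):

# On the GPU node, using the default kubelet paths
ls /var/lib/kubelet/device-plugins/
# kubelet.sock plus one socket per registered plugin

With the resource advertised, requesting a GPU is just a resource limit on the pod: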
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:12.0-base
      resources:
        limits:
          nvidia.com/gpu: 1  # Request 1 GPU

Note: For device plugin resources, limits and requests must be equal. You can’t “burst” GPU usage.
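
If you do spell out both, they must be equal; a sketch of the explicit form:

resources:
  requests:
    nvidia.com/gpu: 1  # must match the limit
  limits:
    nvidia.com/gpu: 1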

When the pod is scheduled:

  1. Device plugin’s Allocate() called with device IDs
  2. Plugin returns environment variables and device mounts:
// Device plugin returns
ContainerAllocateResponse{
    Envs: map[string]string{
        "NVIDIA_VISIBLE_DEVICES": "GPU-abc123",
    },
    Mounts: []*Mount{
        {ContainerPath: "/dev/nvidia0", HostPath: "/dev/nvidia0"},
    },
}
  3. Container sees only assigned GPU(s)
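
You can check the isolation from inside the pod. Assuming the gpu-pod above is running and the NVIDIA container toolkit is in place, nvidia-smi should list exactly one device:

kubectl exec gpu-pod -- nvidia-smi -L
# GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-...)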

The first limitation: GPUs can only be requested in whole units.

resources:
  limits:
    nvidia.com/gpu: 1    # OK
    nvidia.com/gpu: 0.5  # NOT POSSIBLE

Can’t share a GPU between pods. A $10k A100 sits 90% idle because one pod claimed it.

You can’t say “give me an A100, not a T4.” The scheduler just sees a count:

# What you want
nvidia.com/gpu:
  model: A100
  memory: 80Gi

# What you can do
nvidia.com/gpu: 1  # Could be anything

Workaround: Use node labels and node selectors:

nodeSelector:
  gpu-type: a100
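
Those labels don't exist by default; someone (or some automation) has to apply them to every GPU node:

kubectl label node gpu-node-1 gpu-type=a100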

But this is coarse-grained and doesn’t scale.

Multi-GPU training performance depends on GPU interconnects:

Best: NVLink (600 GB/s)
OK: PCIe (64 GB/s)
Bad: Cross-socket PCIe

Pod gets GPU 0 and GPU 3
GPU 0 <--NVLink--> GPU 1
GPU 2 <--NVLink--> GPU 3
GPU 0 <--PCIe-----> GPU 3  ← Slow!

Device plugins don’t consider topology. Your 8-GPU training job might get the worst possible GPU combination.
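
You can inspect a node's interconnect layout with nvidia-smi; the matrix below is an illustration of an NVLink-paired layout (legend values vary by system):

# On the GPU node
nvidia-smi topo -m
#        GPU0   GPU1   GPU2   GPU3
# GPU0    X     NV12   SYS    SYS
# GPU1   NV12    X     SYS    SYS
# GPU2   SYS    SYS     X     NV12
# GPU3   SYS    SYS    NV12    X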

Some devices need setup before use:

  • Configure compute mode
  • Allocate memory partitions (MIG)
  • Load firmware

Device plugins have no lifecycle hooks for this.
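
Today that preparation happens out of band, via node bootstrapping or an operator, rather than through the plugin. Illustrative commands, run as root on the node:

# Set the compute mode
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS

# Enable MIG mode and carve out instances (A100/A30/H100 only)
nvidia-smi -i 0 -mig 1
nvidia-smi mig -cgi 1g.10gb,3g.40gb -C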

1. Scheduler sees: Node has 2 GPUs free
2. Scheduler assigns Pod A (needs 2 GPUs) to node
3. Before Pod A starts, Pod B (needs 1 GPU) also scheduled to node
4. Conflict!

Extended resources are accounted at scheduling time, but there’s a window for races.

MIG physically partitions A100/A30/H100 GPUs:

A100 80GB
├── MIG 1g.10gb (instance 1)
├── MIG 1g.10gb (instance 2)
├── MIG 2g.20gb (instance 3)
└── MIG 3g.40gb (instance 4)

Each MIG instance is isolated: separate memory, separate compute units.

Configure MIG with device plugin:

# nvidia-device-plugin config file (passed to the Helm chart as a named config)
version: v1
flags:
  migStrategy: mixed  # expose MIG devices alongside any non-MIG GPUs

Then request specific MIG profiles:

resources:
  limits:
    nvidia.com/mig-1g.10gb: 1

Pros: True isolation, guaranteed resources.
Cons: Only certain GPUs support MIG; reconfiguration requires an empty GPU.

With time-slicing, multiple pods share one GPU by taking turns:

# Device plugin config (ConfigMap)
version: v1
sharing:
  timeSlicing:
    renameByDefault: false
    resources:
      - name: nvidia.com/gpu
        replicas: 4  # Each GPU appears as 4 resources

Now nvidia.com/gpu: 4 becomes nvidia.com/gpu: 16 (4 GPUs × 4 replicas).

# Pod requests "1 GPU" but actually gets 1/4
resources:
  limits:
    nvidia.com/gpu: 1

Pros: Works on any NVIDIA GPU, no reconfiguration needed.
Cons: No isolation (one pod can starve others), no memory limits.

Feature             MIG                      Time-Slicing
Isolation           Full (memory + compute)  None
Supported GPUs      A100, A30, H100          Any NVIDIA
Reconfiguration     Requires empty GPU       Dynamic
Memory guarantee    Yes                      No
Best for            Production inference     Dev/test, bursty workloads

DRA (alpha since Kubernetes 1.26, still graduating in newer releases) is the next evolution. It addresses the device plugin limitations with a claim-based model.

ResourceClaim: A request for resources (like PVC for storage)

apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  name: gpu-claim
spec:
  resourceClassName: gpu.nvidia.com

ResourceClass: Defines a type of resource and its driver

apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClass
metadata:
  name: gpu.nvidia.com
driverName: gpu.nvidia.com

ResourceClaimTemplate: For dynamic claim creation

apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata:
  name: gpu-template
spec:
  spec:
    resourceClassName: gpu.nvidia.com
The allocation flow:

1. User creates a ResourceClaim (or references a template in the Pod)
2. Scheduler finds nodes where claim can be satisfied
3. DRA driver's "allocate" called with node context
4. Driver prepares device (configure MIG, set mode, etc.)
5. Pod starts with device available
6. On pod termination, driver cleans up
Aspect            Device Plugins    DRA
Granularity       Whole devices     Flexible (fractions, attributes)
Device selection  Count only        Rich selectors
Lifecycle         None              Prepare/cleanup hooks
Scheduling        Simple counting   Structured parameters
State             Stateless         Claim tracks allocation

A pod consumes a GPU through a claim reference:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:12.0-base
      resources:
        claims:
          - name: gpu
  resourceClaims:
    - name: gpu
      source:
        resourceClaimTemplateName: gpu-template
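
Once the pod exists, the generated claim and its allocation status can be inspected like any other object (assuming the DRA API group is enabled):

kubectl get resourceclaims
kubectl describe resourceclaim <generated-claim-name>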

With structured parameters (future):

apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  name: specific-gpu
spec:
  resourceClassName: gpu.nvidia.com
  parametersRef:
    apiGroup: gpu.nvidia.com
    kind: GpuClaimParameters
    name: my-params
---
apiVersion: gpu.nvidia.com/v1alpha1
kind: GpuClaimParameters
metadata:
  name: my-params
spec:
  selector:
    model: A100
    memory: 80Gi
  sharing:
    strategy: MIG
    profile: 3g.40gb

DRA is maturing but still evolving:

  • Core API: Stable enough for testing
  • NVIDIA DRA driver: Available, replacing device plugin in some deployments
  • Structured parameters: Still developing
  • Production readiness: Check your version’s feature gates
# Enable DRA feature gates (if not default)
--feature-gates=DynamicResourceAllocation=true
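
A quick way to confirm your cluster exposes the DRA API group used above:

kubectl api-resources --api-group=resource.k8s.io
# Expect resourceclaims, resourceclasses, resourceclaimtemplates, ...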

An 8-GPU training job needs GPUs that can communicate fast:

Ideal: All 8 GPUs on same NVLink domain
OK: 4+4 across two NVLink domains  
Bad: 8 GPUs scattered across PCIe

kubelet’s Topology Manager aligns resource allocation:

# kubelet configuration
topologyManagerPolicy: best-effort  # or: restricted, single-numa-node
topologyManagerScope: container     # or: pod

Policies:

  • none: No topology alignment
  • best-effort: Try to align, but schedule anyway
  • restricted: Fail if can’t align
  • single-numa-node: All resources from one NUMA node
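
For reference, these settings live in the kubelet configuration file; a minimal sketch with illustrative values:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
topologyManagerPolicy: restricted  # fail placement rather than accept poor alignment
topologyManagerScope: pod          # align all containers in the pod together
cpuManagerPolicy: static           # lets CPU pinning participate in alignment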

The GPU Operator automates GPU node setup and includes topology awareness:

# Add the NVIDIA Helm repo, then install
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set mig.strategy=mixed

It handles:

  • Driver installation
  • Container toolkit
  • Device plugin
  • GPU feature discovery
  • MIG management
  • Monitoring
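
A quick health check after install (assuming the gpu-operator namespace used above):

kubectl get pods -n gpu-operator
# Expect the operator, driver, container-toolkit, device-plugin,
# GPU feature discovery, and DCGM exporter pods to be Running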

GPU Feature Discovery, deployed by the operator, automatically labels nodes with GPU details:

$ kubectl describe node gpu-node | grep nvidia
  nvidia.com/cuda.driver.major=535
  nvidia.com/cuda.driver.minor=129
  nvidia.com/cuda.runtime.major=12
  nvidia.com/gpu.compute.major=8
  nvidia.com/gpu.count=4
  nvidia.com/gpu.family=ampere
  nvidia.com/gpu.memory=81920
  nvidia.com/gpu.product=NVIDIA-A100-SXM4-80GB
  nvidia.com/mig.capable=true

Now you can select by GPU type:

nodeSelector:
  nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB

Or with affinity for flexibility:

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: nvidia.com/gpu.product
              operator: In
              values:
                - NVIDIA-A100-SXM4-80GB
                - NVIDIA-A100-PCIE-80GB

Don’t request 8 GPUs for a job that uses 1. GPUs are expensive.

# Profile your workload first
resources:
  limits:
    nvidia.com/gpu: 1  # Start small, scale up if needed

Inference workloads often don’t need a full A100:

resources:
  limits:
    nvidia.com/mig-1g.10gb: 1  # 1/7 of an A100

Separate node pools for different GPU types:

# Training pool: A100s
nodeSelector:
  gpu-pool: training
  
# Inference pool: T4s (cheaper)
nodeSelector:
  gpu-pool: inference

Prevent GPU hoarding:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
spec:
  hard:
    requests.nvidia.com/gpu: "10"  # Max 10 GPUs requested per namespace

Critical training jobs should preempt development workloads:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: training-critical
value: 1000000
preemptionPolicy: PreemptLowerPriority
---
apiVersion: v1
kind: Pod
spec:
  priorityClassName: training-critical
  # ...

Low utilization = wasted money:

# DCGM exporter metrics (names used by current dcgm-exporter releases)
DCGM_FI_DEV_GPU_UTIL
DCGM_FI_DEV_MEM_COPY_UTIL
DCGM_FI_DEV_POWER_USAGE

Alert if GPUs are allocated but idle:

# GPU utilization under 10% while GPUs are allocated
# (put the 30m window in the alert rule's `for:` clause; join labels to suit your setup)
(DCGM_FI_DEV_GPU_UTIL < 10) and (kube_pod_container_resource_limits{resource="nvidia_com_gpu"} > 0)
When a GPU pod won't schedule, start with the pod's events:

kubectl describe pod gpu-pod

# Common messages:
# "Insufficient nvidia.com/gpu" - No nodes with free GPUs
# "0/10 nodes available: 10 node(s) didn't match node selector" - Wrong labels
# Available GPUs
kubectl describe node gpu-node | grep -A 5 Allocatable

# Allocated GPUs
kubectl describe node gpu-node | grep -A 5 "Allocated resources"
# Check device plugin pods
kubectl get pods -n nvidia-device-plugin

# Check logs
kubectl logs -n nvidia-device-plugin -l app=nvidia-device-plugin
# On the node
nvidia-smi

# Look for:
# - GPU memory errors
# - Temperature
# - Power state
# - Running processes
# Check MIG status
nvidia-smi mig -lgi

# Check device plugin sees MIG devices
kubectl logs -n nvidia-device-plugin <pod> | grep -i mig

GPU scheduling in Kubernetes has evolved:

Generation  Mechanism             Capabilities
1st         Device plugins        Whole-GPU allocation, simple counting
1.5         + MIG/time-slicing    Fractional GPUs, but hacks on top of device plugins
2nd         DRA                   Rich selectors, lifecycle hooks, structured parameters

Current recommendations:

Use case                Approach
Simple GPU workloads    Device plugin + node selectors
Fractional GPUs (A100)  MIG via device plugin
Shared dev/test GPUs    Time-slicing
Complex requirements    Evaluate DRA (if K8s 1.31+)
Multi-GPU training      Topology Manager + GPU Operator

The device plugin model works but hits walls at scale. DRA is the future—a proper API for hardware allocation. As it matures, expect richer GPU scheduling: specific models, memory requirements, topology constraints, all expressible in standard Kubernetes resources.