How the Kubernetes Scheduler Actually Works


A pod is created. Seconds later, it’s running on a node. But how did Kubernetes decide which node? The scheduler—kube-scheduler—makes this decision hundreds of times per second in large clusters. Understanding how it works helps you debug “Pending” pods and optimize placement.

The scheduler solves a bin-packing problem: given N pods and M nodes, assign each pod to a node such that:

  1. Constraints are satisfied — resource requests fit, taints/tolerations match, affinity rules hold
  2. Resources are balanced — don’t overload some nodes while others sit idle
  3. Preferences are respected — spread pods across zones, colocate related pods

This is NP-hard in the general case, so the scheduler uses heuristics—fast, good-enough decisions rather than optimal ones.

When a pod is created without a nodeName, it enters the scheduling queue. The scheduler processes it through two phases:

Pod created (nodeName empty)
         |
         v
+------------------+
| Scheduling Queue |
+------------------+
         |
         v
+------------------+
| Scheduling Cycle |
|                  |
|  1. Filtering    |  Which nodes CAN run this pod?
|  2. Scoring      |  Which node is BEST?
|  3. Binding      |  Assign pod to chosen node
|                  |
+------------------+
         |
         v
Pod bound to node

Filtering eliminates nodes that cannot run the pod. Each filter plugin checks one constraint:

All Nodes: [node-1, node-2, node-3, node-4, node-5]
                            |
                            v
+--------------------------------------------------+
| NodeResourcesFit: Does node have enough CPU/mem? |
+--------------------------------------------------+
                            |
        Remaining: [node-1, node-2, node-4, node-5]
                            |
                            v
+--------------------------------------------------+
| NodeAffinity: Does node match required affinity? |
+--------------------------------------------------+
                            |
        Remaining: [node-1, node-2, node-5]
                            |
                            v
+--------------------------------------------------+
| TaintToleration: Does pod tolerate node taints?  |
+--------------------------------------------------+
                            |
        Remaining: [node-1, node-5]
                            |
                            v
              Feasible nodes for scoring

Built-in filter plugins:

Plugin             What it checks
------             --------------
NodeResourcesFit   CPU, memory, and ephemeral-storage requests fit
NodePorts          Requested host ports are available
NodeAffinity       Node matches nodeAffinity rules
TaintToleration    Pod tolerates the node's taints
PodTopologySpread  Spread constraints are satisfiable
VolumeBinding      Required PVs can be bound to this node
InterPodAffinity   Pod affinity/anti-affinity constraints hold
NodeUnschedulable  Node isn't cordoned

If no nodes pass filtering, the pod stays Pending.
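
Conceptually, filtering is just a chain of predicates applied to every node. Here is a minimal Go sketch, with simplified pod/node types and a hypothetical filterFunc; the real plugins implement the framework's FilterPlugin interface against *v1.Pod and *framework.NodeInfo:

package main

import "fmt"

// Simplified stand-ins for the real types.
type pod struct {
	name       string
	cpuRequest int64 // millicores
}

type node struct {
	name           string
	allocatableCPU int64 // millicores
	requestedCPU   int64 // sum of requests of pods already on the node
}

// filterFunc is a hypothetical predicate: true means the node can run the pod.
type filterFunc func(pod, node) bool

// feasibleNodes keeps only the nodes that pass every filter.
func feasibleNodes(p pod, nodes []node, filters []filterFunc) []node {
	var out []node
	for _, n := range nodes {
		passes := true
		for _, f := range filters {
			if !f(p, n) { // a single failing filter eliminates the node
				passes = false
				break
			}
		}
		if passes {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	resourcesFit := func(p pod, n node) bool { return n.requestedCPU+p.cpuRequest <= n.allocatableCPU }
	nodes := []node{
		{"node-1", 4000, 1000},
		{"node-2", 2000, 1900},
		{"node-3", 8000, 500},
	}
	p := pod{"web-1", 500}
	fmt.Println(feasibleNodes(p, nodes, []filterFunc{resourcesFit}))
	// [{node-1 4000 1000} {node-3 8000 500}]
}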

Scoring ranks the feasible nodes. Each scoring plugin assigns a score (0-100), and scores are weighted and summed:

Feasible nodes: [node-1, node-5]
                     |
                     v
+----------------------------------------+
| NodeResourcesBalancedAllocation        |
|   node-1: 60  (moderate utilization)   |
|   node-5: 80  (low utilization)        |
+----------------------------------------+
                     |
                     v
+----------------------------------------+
| InterPodAffinity                       |
|   node-1: 100 (preferred pods nearby)  |
|   node-5: 50  (no preferred pods)      |
+----------------------------------------+
                     |
                     v
+----------------------------------------+
| ImageLocality                          |
|   node-1: 70  (some images cached)     |
|   node-5: 30  (need to pull images)    |
+----------------------------------------+
                     |
                     v
Final scores (weighted sum):
  node-1: 230
  node-5: 160

Winner: node-1
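
The weighting is a plain sum. A sketch of that arithmetic in Go, using the example numbers above (the scorer type is a simplification; the real framework runs each plugin's Score and NormalizeScore before applying weights):

package main

import "fmt"

// scorer is a simplified scoring plugin: per-node scores (already normalized
// to 0-100) plus the plugin's weight.
type scorer struct {
	name   string
	weight int64
	scores map[string]int64
}

// rankNodes sums weight*score for each feasible node and returns the winner.
func rankNodes(nodes []string, scorers []scorer) (best string, bestTotal int64) {
	for i, n := range nodes {
		var total int64
		for _, s := range scorers {
			total += s.weight * s.scores[n]
		}
		if i == 0 || total > bestTotal {
			best, bestTotal = n, total
		}
	}
	return best, bestTotal
}

func main() {
	scorers := []scorer{
		{"NodeResourcesBalancedAllocation", 1, map[string]int64{"node-1": 60, "node-5": 80}},
		{"InterPodAffinity", 1, map[string]int64{"node-1": 100, "node-5": 50}},
		{"ImageLocality", 1, map[string]int64{"node-1": 70, "node-5": 30}},
	}
	best, total := rankNodes([]string{"node-1", "node-5"}, scorers)
	fmt.Println(best, total) // node-1 230
}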

Built-in scoring plugins:

Plugin                           What it scores
------                           --------------
NodeResourcesBalancedAllocation  Prefer balanced CPU/memory utilization
NodeResourcesFit                 Prefer nodes per its scoring strategy: LeastAllocated
                                 (most free resources, the default) or MostAllocated (bin packing)
InterPodAffinity                 Prefer nodes matching preferred pod affinity
ImageLocality                    Prefer nodes that already have the container images cached
TaintToleration                  Prefer nodes with fewer intolerable PreferNoSchedule taints
NodeAffinity                     Prefer nodes matching preferred node affinity
PodTopologySpread                Prefer nodes that keep the spread balanced

Once a node is selected, the scheduler “binds” the pod:

  1. Optimistic binding: Scheduler assumes success, updates internal cache
  2. API binding: Sends Binding object to API server
  3. Kubelet takes over: Kubelet sees pod assigned to its node, starts it

// Simplified binding (imports omitted: v1 "k8s.io/api/core/v1",
// metav1 "k8s.io/apimachinery/pkg/apis/meta/v1")
binding := &v1.Binding{
    ObjectMeta: metav1.ObjectMeta{
        Name:      pod.Name,
        Namespace: pod.Namespace,
    },
    Target: v1.ObjectReference{
        Kind: "Node",
        Name: selectedNode,
    },
}
if err := client.CoreV1().Pods(pod.Namespace).Bind(ctx, binding, metav1.CreateOptions{}); err != nil {
    // A failed binding sends the pod back through the scheduling queue.
    return err
}

The scheduler doesn’t process pods in simple FIFO order. It uses a priority queue with three sub-queues:

+------------------+
| ActiveQ          |  Pods ready to schedule (heap ordered by priority)
+------------------+
         |
         | (scheduling attempt fails)
         v
+------------------+
| BackoffQ         |  Pods waiting out an exponential backoff before retry
+------------------+
         ^
         | (cluster state changes: node added, pod deleted, ...)
         |
+------------------+
| UnschedulableQ   |  Pods with no feasible node, parked until the cluster changes
+------------------+

A failed attempt sends the pod to BackoffQ (transient error) or straight to
UnschedulableQ (no feasible node). Pods return to ActiveQ when their backoff
expires; unschedulable pods are moved back when a relevant cluster event occurs.

ActiveQ holds pods that are ready for scheduling, ordered by priority:

# PriorityClass affects queue position
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000        # Higher value = scheduled first
globalDefault: false
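
By default the active queue compares pod priority first and falls back to how long a pod has been waiting (roughly what the PrioritySort queue-sort plugin does). A simplified sketch of that comparison, with a queuedPod type invented for illustration:

package main

import (
	"fmt"
	"time"
)

// queuedPod is a simplified stand-in for the scheduler's queued pod info.
type queuedPod struct {
	name     string
	priority int32     // from the pod's PriorityClass
	addedAt  time.Time // when the pod entered the queue
}

// less reports whether a should be scheduled before b:
// higher priority first, then earlier arrival.
func less(a, b queuedPod) bool {
	if a.priority != b.priority {
		return a.priority > b.priority
	}
	return a.addedAt.Before(b.addedAt)
}

func main() {
	now := time.Now()
	a := queuedPod{"batch-job", 0, now}
	b := queuedPod{"critical-pod", 1000000, now.Add(time.Second)}
	fmt.Println(less(b, a)) // true: higher priority jumps ahead despite arriving later
}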

When a scheduling attempt fails with a transient error (e.g., a race condition between the scheduler's cache and the API server), the pod moves to BackoffQ and waits with exponential backoff:

1st failure: wait 1s
2nd failure: wait 2s
3rd failure: wait 4s
...up to a 10s maximum (the default podMaxBackoffSeconds)
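
The schedule is a plain doubling capped at the maximum, controlled by podInitialBackoffSeconds and podMaxBackoffSeconds in the scheduler configuration. A sketch of the calculation:

package main

import (
	"fmt"
	"time"
)

// backoff returns the wait before the next scheduling attempt:
// initial * 2^(attempts-1), capped at max.
func backoff(attempts int, initial, max time.Duration) time.Duration {
	d := initial
	for i := 1; i < attempts; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	return d
}

func main() {
	for attempts := 1; attempts <= 5; attempts++ {
		fmt.Println(attempts, backoff(attempts, time.Second, 10*time.Second))
	}
	// 1 1s, 2 2s, 3 4s, 4 8s, 5 10s (capped)
}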

When filtering finds no feasible nodes, the pod moves to UnschedulableQ. It's retried when cluster state changes (node added, pod deleted, etc.).

Pod can't fit anywhere → UnschedulableQ
Node added to cluster → Move pods back to ActiveQ

When checking resource fit, the scheduler considers only requests, not limits:

resources:
  requests:
    cpu: 100m      # Scheduler uses this
    memory: 128Mi  # Scheduler uses this
  limits:
    cpu: 500m      # Scheduler ignores this
    memory: 512Mi  # Scheduler ignores this

Why? Requests represent guaranteed resources. Limits allow bursting but aren’t guaranteed. The scheduler ensures the sum of requests fits on the node.

Nodes report both capacity and allocatable:

kubectl describe node worker-1 | grep -A 6 "Capacity\|Allocatable"

Capacity:
  cpu:                4
  memory:             16Gi
  pods:               110
Allocatable:
  cpu:                3800m    # 200m reserved for system
  memory:             15Gi     # 1Gi reserved for system
  pods:               110

The scheduler uses allocatable, which excludes resources reserved for kubelet, OS, etc.
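
Tying those two ideas together, the core feasibility test is: requests already on the node plus the new pod's requests must not exceed allocatable. A minimal sketch with a simplified resource struct (the real NodeResourcesFit check also covers ephemeral storage, extended resources, and the pod-count limit):

package main

import "fmt"

// resources is a simplified view of CPU (millicores) and memory (bytes).
type resources struct {
	milliCPU int64
	memory   int64
}

// fits checks the new pod's requests against allocatable minus what is
// already requested by pods on the node; limits play no part here.
func fits(podRequest, allocatable, alreadyRequested resources) bool {
	return alreadyRequested.milliCPU+podRequest.milliCPU <= allocatable.milliCPU &&
		alreadyRequested.memory+podRequest.memory <= allocatable.memory
}

func main() {
	allocatable := resources{milliCPU: 3800, memory: 15 << 30}      // 3800m, 15Gi
	alreadyRequested := resources{milliCPU: 3500, memory: 12 << 30} // pods already on the node
	podRequest := resources{milliCPU: 100, memory: 128 << 20}       // requests: 100m, 128Mi

	fmt.Println(fits(podRequest, allocatable, alreadyRequested)) // true: 3600m <= 3800m, ~12.1Gi <= 15Gi
}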

Extended resources (GPUs, FPGAs, etc.) work the same way:

# Node advertises GPUs
status:
  allocatable:
    nvidia.com/gpu: 4

# Pod requests GPUs (extended resources must be specified in limits;
# requests, if set, must equal limits)
resources:
  limits:
    nvidia.com/gpu: 2  # Scheduler checks this fits

The simplest placement control is nodeSelector, plain label matching:

spec:
  nodeSelector:
    disktype: ssd
    zone: us-west-2a

Every listed label must match exactly; there is no way to express preferences or set-based rules.
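
Under the hood this is just a subset check of the pod's nodeSelector against the node's labels. A sketch:

package main

import "fmt"

// matchesNodeSelector reports whether every selector key/value pair is
// present, with the same value, in the node's labels.
func matchesNodeSelector(selector, nodeLabels map[string]string) bool {
	for k, v := range selector {
		if nodeLabels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	selector := map[string]string{"disktype": "ssd", "zone": "us-west-2a"}
	node := map[string]string{"disktype": "ssd", "zone": "us-west-2a", "gpu": "none"}
	fmt.Println(matchesNodeSelector(selector, node)) // true: extra node labels are fine
}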

Node affinity is more expressive, with required and preferred rules:

spec:
  affinity:
    nodeAffinity:
      # Hard requirement (like nodeSelector but with operators)
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                  - us-west-2a
                  - us-west-2b
      # Soft preference (try but don't require)
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: node-type
                operator: In
                values:
                  - high-memory

Operators: In, NotIn, Exists, DoesNotExist, Gt, Lt

Taints repel pods; tolerations allow pods to schedule despite taints:

# Taint a node
kubectl taint nodes worker-1 dedicated=ml:NoSchedule

# Pod that tolerates the taint
spec:
  tolerations:
    - key: dedicated
      operator: Equal
      value: ml
      effect: NoSchedule

Taint effects:

  • NoSchedule: Don’t schedule new pods (existing stay)
  • PreferNoSchedule: Try not to schedule (soft)
  • NoExecute: Evict existing pods + don’t schedule new
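
The rule the scheduler applies: every taint with a NoSchedule or NoExecute effect must be matched by at least one of the pod's tolerations. A simplified sketch, with plain structs standing in for the real v1.Taint and v1.Toleration types:

package main

import "fmt"

// Simplified taint/toleration; the real types are v1.Taint and v1.Toleration.
type taint struct{ key, value, effect string }
type toleration struct{ key, operator, value, effect string }

// tolerates reports whether a single toleration matches a single taint.
func tolerates(tol toleration, t taint) bool {
	if tol.effect != "" && tol.effect != t.effect {
		return false
	}
	switch tol.operator {
	case "Exists":
		return tol.key == "" || tol.key == t.key // empty key + Exists tolerates everything
	default: // "Equal" (the default operator)
		return tol.key == t.key && tol.value == t.value
	}
}

// schedulable reports whether every scheduling-relevant taint is tolerated.
func schedulable(taints []taint, tols []toleration) bool {
	for _, t := range taints {
		if t.effect == "PreferNoSchedule" {
			continue // soft: never blocks scheduling outright
		}
		ok := false
		for _, tol := range tols {
			if tolerates(tol, t) {
				ok = true
				break
			}
		}
		if !ok {
			return false
		}
	}
	return true
}

func main() {
	taints := []taint{{"dedicated", "ml", "NoSchedule"}}
	tols := []toleration{{"dedicated", "Equal", "ml", "NoSchedule"}}
	fmt.Println(schedulable(taints, tols)) // true
}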

Pod affinity and anti-affinity place pods relative to other pods:

spec:
  affinity:
    # Run near pods with app=cache
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: cache
          topologyKey: kubernetes.io/hostname

    # Don't run on same node as other app=web pods
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: web
            topologyKey: kubernetes.io/hostname

topologyKey: Defines what “same place” means:

  • kubernetes.io/hostname: Same node
  • topology.kubernetes.io/zone: Same availability zone
  • topology.kubernetes.io/region: Same region

Warning: Pod affinity with requiredDuringScheduling can make pods unschedulable if the target pods don’t exist yet.

Topology spread constraints distribute pods evenly across topology domains:

spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: web

This ensures pods are spread across zones with at most 1 pod difference between any two zones.
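
Skew is the difference between the most- and least-populated eligible domains. A sketch of the check that DoNotSchedule effectively enforces for each candidate node, using illustrative zone counts:

package main

import "fmt"

// skewIfPlaced returns the skew that would result from placing one more
// matching pod in `candidate`, given current matching-pod counts per domain
// (e.g. per zone).
func skewIfPlaced(countsByDomain map[string]int, candidate string) int {
	counts := map[string]int{}
	for d, c := range countsByDomain {
		counts[d] = c
	}
	counts[candidate]++

	min, max, first := 0, 0, true
	for _, c := range counts {
		if first {
			min, max, first = c, c, false
			continue
		}
		if c < min {
			min = c
		}
		if c > max {
			max = c
		}
	}
	return max - min
}

func main() {
	counts := map[string]int{"us-west-2a": 2, "us-west-2b": 1, "us-west-2c": 1}
	maxSkew := 1
	for _, zone := range []string{"us-west-2a", "us-west-2b"} {
		fmt.Println(zone, skewIfPlaced(counts, zone) <= maxSkew)
	}
	// us-west-2a false (placing here would push the skew to 2), us-west-2b true
}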

When a high-priority pod can’t be scheduled, the scheduler may preempt (evict) lower-priority pods:

High-priority pod pending
         |
         v
+------------------+
| Find nodes where |
| preemption would |
| allow scheduling |
+------------------+
         |
         v
+------------------+
| Select victim    |
| pods to evict    |
+------------------+
         |
         v
+------------------+
| Evict victims,   |
| schedule pod     |
+------------------+

Preemption is driven by pod priority, set through a PriorityClass:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority  # or Never
description: "Critical workloads"
---
apiVersion: v1
kind: Pod
metadata:
  name: critical-pod
spec:
  priorityClassName: critical
  # ...

How the scheduler picks victims (sketched in code below):

  1. Identify candidates: Nodes where evicting pods would make room
  2. Minimize disruption: Prefer evicting fewer and lower-priority pods
  3. Respect PDBs: Don't violate PodDisruptionBudgets if possible
  4. Execute: Delete victim pods, schedule the preemptor

Note: Preemption is “graceful”—victims get their termination grace period.
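
A simplified sketch of victim selection on a single candidate node: only lower-priority pods are considered, the lowest-priority ones are evicted first, and eviction stops as soon as the preemptor would fit. Real preemption also weighs PDB violations and all resource types, which this sketch omits:

package main

import (
	"fmt"
	"sort"
)

type runningPod struct {
	name     string
	priority int32
	cpuMilli int64 // requested CPU
}

// selectVictims returns the minimal set of lower-priority pods (lowest first)
// whose eviction frees enough CPU for the preemptor, or nil if preemption on
// this node would not help.
func selectVictims(pods []runningPod, preemptorPriority int32, neededMilli, freeMilli int64) []runningPod {
	var candidates []runningPod
	for _, p := range pods {
		if p.priority < preemptorPriority {
			candidates = append(candidates, p)
		}
	}
	sort.Slice(candidates, func(i, j int) bool { return candidates[i].priority < candidates[j].priority })

	var victims []runningPod
	for _, p := range candidates {
		if freeMilli >= neededMilli {
			break
		}
		victims = append(victims, p)
		freeMilli += p.cpuMilli
	}
	if freeMilli < neededMilli {
		return nil
	}
	return victims
}

func main() {
	pods := []runningPod{{"batch-1", 0, 1000}, {"web-1", 500, 500}, {"db-1", 2000000, 2000}}
	fmt.Println(selectVictims(pods, 1000000, 1200, 300))
	// [{batch-1 0 1000}]: freeing 1000m plus the 300m already free covers the 1200m needed
}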

# Pod that opts out of preempting others (it can still be preempted itself)
spec:
  priorityClassName: high-priority
  preemptionPolicy: Never  # This pod won't preempt lower-priority pods

# Or use PodDisruptionBudget to limit disruption
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: critical

The default scheduler sustains on the order of 100 pods per second. At scale, several factors matter:

For large clusters, filtering and scoring every node for every pod is expensive, so the scheduler stops searching once it has found enough feasible nodes:

# kube-scheduler config
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
percentageOfNodesToScore: 50  # Stop once feasible nodes equal to 50% of the cluster are found

With 5000 nodes and 50%, the scheduler stops after finding 2500 feasible nodes and scores only those.
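
A sketch of that sizing heuristic, approximating the upstream numFeasibleNodesToFind logic: the configured percentage applies once the cluster is large enough, the adaptive default shrinks as the cluster grows, and there are floors of 5% and 100 nodes:

package main

import "fmt"

// numFeasibleNodesToFind approximates how many feasible nodes the scheduler
// collects before it stops searching.
func numFeasibleNodesToFind(numAllNodes, percentage int32) int32 {
	const minNodes, minPercentage = 100, 5
	if numAllNodes < minNodes {
		return numAllNodes
	}
	if percentage == 0 { // adaptive default when percentageOfNodesToScore is unset
		percentage = 50 - numAllNodes/125
		if percentage < minPercentage {
			percentage = minPercentage
		}
	}
	n := numAllNodes * percentage / 100
	if n < minNodes {
		return minNodes
	}
	return n
}

func main() {
	fmt.Println(numFeasibleNodesToFind(5000, 50)) // 2500: the example above
	fmt.Println(numFeasibleNodesToFind(5000, 0))  // 500: the adaptive default lands at 10%
}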

The scheduler's main loop handles one pod at a time (binding happens asynchronously), but it evaluates the nodes for that pod in parallel:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
parallelism: 16  # Goroutines used to filter and score nodes in parallel

The scheduler maintains a cache of node states to avoid hitting the API server:

API Server <--watch-- Scheduler Cache
                           |
                           v
                    Scheduling decisions
                    (reads from cache)

Cache includes: node allocatable, running pods, requested resources.

When a pod is stuck Pending, start with kubectl describe:

kubectl describe pod pending-pod

Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  10s   default-scheduler  0/5 nodes are available:
    2 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate,
    3 node(s) didn't match Pod's node affinity/selector.

Common failure reasons:

Message                                       Meaning                           Fix
-------                                       -------                           ---
Insufficient cpu                              No node has enough CPU            Reduce requests or add nodes
Insufficient memory                           No node has enough memory         Reduce requests or add nodes
node(s) had taint ... didn't tolerate         Taints are blocking scheduling    Add tolerations or remove taints
node(s) didn't match node affinity/selector   Affinity rules too restrictive    Relax affinity or label nodes
node(s) didn't match pod topology spread      Spread constraints unsatisfiable  Add nodes in the needed topologies
persistentvolumeclaim not found               The PVC doesn't exist             Create the PVC
node(s) had volume node affinity conflict     The PV is in a different zone     Create the PV in the correct zone

To dig deeper, compare what each node offers (allocatable) with what is already allocated:

# (True dry-run scheduling needs external tooling, e.g. the kube-scheduler-simulator project)
kubectl describe node worker-1 | grep -A 10 "Allocated resources"

Allocated resources:
  Resource           Requests     Limits
  --------           --------     ------
  cpu                3500m (92%)  7000m (184%)
  memory             12Gi (80%)   20Gi (133%)

# Detailed pod resource usage
kubectl top pods --containers
# View scheduler logs
kubectl logs -n kube-system -l component=kube-scheduler

# Increase verbosity
# Edit kube-scheduler manifest, add --v=4

You can run multiple schedulers or write your own:

apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduled-pod
spec:
  schedulerName: my-custom-scheduler  # Use custom scheduler
  containers:
    - name: app
      image: nginx

Extend the default scheduler with webhook calls:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
extenders:
  - urlPrefix: "http://my-extender:8080"
    filterVerb: "filter"
    prioritizeVerb: "prioritize"
    weight: 5
    enableHTTPS: false

The scheduler calls your extender for additional filtering/scoring.
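
An extender is just an HTTP service the scheduler POSTs to. Below is a minimal sketch of a filter endpoint, using simplified request/response structs in place of the real ExtenderArgs / ExtenderFilterResult types; it assumes the scheduler sends node names (the nodeCacheCapable style), so adapt the types to the actual wire format in k8s.io/kube-scheduler/extender/v1:

package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// Simplified stand-ins for the real extender wire types.
type filterArgs struct {
	Pod       map[string]interface{} `json:"pod"`
	NodeNames *[]string              `json:"nodenames,omitempty"`
}

type filterResult struct {
	NodeNames   *[]string         `json:"nodenames,omitempty"`
	FailedNodes map[string]string `json:"failedNodes,omitempty"`
	Error       string            `json:"error,omitempty"`
}

func main() {
	http.HandleFunc("/filter", func(w http.ResponseWriter, r *http.Request) {
		var args filterArgs
		if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}

		// Keep every node except ones failing a (hypothetical) custom rule.
		kept := []string{}
		failed := map[string]string{}
		if args.NodeNames != nil {
			for _, name := range *args.NodeNames {
				if name == "node-3" { // placeholder for real business logic
					failed[name] = "rejected by custom extender rule"
					continue
				}
				kept = append(kept, name)
			}
		}
		json.NewEncoder(w).Encode(filterResult{NodeNames: &kept, FailedNodes: failed})
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}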

The Scheduling Framework allows writing plugins in Go:

// Custom filter plugin implementing the framework's FilterPlugin interface.
// Imports assumed: "context", v1 "k8s.io/api/core/v1",
// and "k8s.io/kubernetes/pkg/scheduler/framework".
type MyPlugin struct{}

func (p *MyPlugin) Name() string { return "MyPlugin" }

func (p *MyPlugin) Filter(ctx context.Context, state *framework.CycleState,
    pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {

    // myCustomCheck is a placeholder for your own logic.
    if !myCustomCheck(pod, nodeInfo.Node()) {
        return framework.NewStatus(framework.Unschedulable, "custom check failed")
    }
    return nil // a nil Status means Success
}

Build a custom scheduler binary with your plugins included.

Full scheduler configuration example:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: true
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf
percentageOfNodesToScore: 50
podInitialBackoffSeconds: 1
podMaxBackoffSeconds: 10
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated   # bin packing instead of the default LeastAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
    plugins:
      filter:
        enabled:
          - name: NodeResourcesFit
          - name: NodePorts
          - name: TaintToleration

The scheduler’s job is simple: pick a node for each pod. The implementation is sophisticated:

Phase    What happens
-----    ------------
Queue    Pods ordered by priority, with backoff after failures
Filter   Eliminate nodes that can't run the pod
Score    Rank the remaining nodes by preference
Bind     Assign the pod to the winning node
Preempt  Evict lower-priority pods if a high-priority pod can't fit

Key takeaways:

  1. Scheduler uses requests, not limits
  2. Filtering is pass/fail; scoring is best-effort
  3. Pod affinity can create deadlocks—use carefully
  4. Topology spread is the modern way to distribute pods
  5. Preemption respects PDBs when possible
  6. Debug with kubectl describe pod and scheduler logs

When pods are stuck Pending, the answer is almost always in the scheduling failure message. Read it carefully—it tells you exactly which constraint failed.