A pod is created. Seconds later, it’s running on a node. But how did Kubernetes decide which node? The scheduler—kube-scheduler—makes this decision hundreds of times per second in large clusters. Understanding how it works helps you debug “Pending” pods and optimize placement.
The Scheduling Problem ¶
The scheduler solves a bin-packing problem: given N pods and M nodes, assign each pod to a node such that:
- Constraints are satisfied — resource requests fit, taints/tolerations match, affinity rules hold
- Resources are balanced — don’t overload some nodes while others sit idle
- Preferences are respected — spread pods across zones, colocate related pods
This is NP-hard in the general case, so the scheduler uses heuristics—fast, good-enough decisions rather than optimal ones.
The Scheduling Cycle ¶
When a pod is created without a nodeName, it enters the scheduling queue. The scheduler processes it through three phases:
Pod created (nodeName empty)
|
v
+------------------+
| Scheduling Queue |
+------------------+
|
v
+------------------+
| Scheduling Cycle |
| |
| 1. Filtering | Which nodes CAN run this pod?
| 2. Scoring | Which node is BEST?
| 3. Binding | Assign pod to chosen node
| |
+------------------+
|
v
Pod bound to node
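Before digging into each phase, a minimal, self-contained Go sketch of this loop may help. Everything here is a simplified stand-in (a single feasibility check, a single score, toy Pod and Node types), not kube-scheduler's actual code:
package main

import (
	"errors"
	"fmt"
)

// Toy types for illustration only.
type Pod struct {
	Name       string
	CPURequest int64 // millicores
}

type Node struct {
	Name        string
	Allocatable int64 // millicores the scheduler may allocate
	Requested   int64 // sum of requests of pods already bound here
}

// Phase 1: Filtering -- can this node run the pod at all?
func feasible(p Pod, n Node) bool {
	return n.Requested+p.CPURequest <= n.Allocatable
}

// Phase 2: Scoring -- how good is this node? (0-100; more free CPU is better)
func score(p Pod, n Node) int64 {
	return (n.Allocatable - n.Requested - p.CPURequest) * 100 / n.Allocatable
}

// scheduleOne mirrors filter -> score -> pick the winner; the caller binds.
func scheduleOne(p Pod, nodes []Node) (string, error) {
	var candidates []Node
	for _, n := range nodes {
		if feasible(p, n) {
			candidates = append(candidates, n)
		}
	}
	if len(candidates) == 0 {
		return "", errors.New("no feasible nodes: pod stays Pending")
	}
	best := candidates[0]
	for _, n := range candidates[1:] {
		if score(p, n) > score(p, best) {
			best = n
		}
	}
	return best.Name, nil // Phase 3: bind the pod to best.Name
}

func main() {
	nodes := []Node{
		{Name: "node-1", Allocatable: 3800, Requested: 3500},
		{Name: "node-2", Allocatable: 3800, Requested: 1000},
	}
	fmt.Println(scheduleOne(Pod{Name: "web-1", CPURequest: 250}, nodes)) // node-2 <nil>
}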
Phase 1: Filtering ¶
Filtering eliminates nodes that cannot run the pod. Each filter plugin checks one constraint:
All Nodes: [node-1, node-2, node-3, node-4, node-5]
|
v
+--------------------------------------------------+
| NodeResourcesFit: Does node have enough CPU/mem? |
+--------------------------------------------------+
|
Remaining: [node-1, node-2, node-4, node-5]
|
v
+--------------------------------------------------+
| NodeAffinity: Does node match required affinity? |
+--------------------------------------------------+
|
Remaining: [node-1, node-2, node-5]
|
v
+--------------------------------------------------+
| TaintToleration: Does pod tolerate node taints? |
+--------------------------------------------------+
|
Remaining: [node-1, node-5]
|
v
Feasible nodes for scoring
Built-in filter plugins:
| Plugin | What it checks |
|---|---|
| NodeResourcesFit | CPU, memory, ephemeral storage requests fit |
| NodePorts | Requested host ports are available |
| NodeAffinity | Node matches nodeAffinity rules |
| TaintToleration | Pod tolerates node's taints |
| PodTopologySpread | Spread constraints are satisfiable |
| VolumeBinding | Required PVs can be bound to this node |
| InterPodAffinity | Pod affinity/anti-affinity constraints |
| NodeUnschedulable | Node isn't cordoned |
If no nodes pass filtering, the pod stays Pending.
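For a feel of what one filter does, here is a rough sketch in the spirit of NodeResourcesFit (the types and numbers are illustrative, not the plugin's real code): sum the requests already on the node, add the incoming pod's requests, compare against allocatable, and report a per-resource reason on failure. These reasons are the same strings that later show up in FailedScheduling events:
package main

import "fmt"

// Illustrative types only -- not the real NodeResourcesFit implementation.
type Resources struct{ CPU, Memory int64 } // millicores, bytes

type NodeInfo struct {
	Allocatable Resources
	Requested   Resources // sum of requests of pods already on the node
}

// fitReasons returns why a pod's requests do not fit, or nil if they fit.
func fitReasons(req Resources, n NodeInfo) []string {
	var reasons []string
	if n.Requested.CPU+req.CPU > n.Allocatable.CPU {
		reasons = append(reasons, "Insufficient cpu")
	}
	if n.Requested.Memory+req.Memory > n.Allocatable.Memory {
		reasons = append(reasons, "Insufficient memory")
	}
	return reasons
}

func main() {
	node := NodeInfo{
		Allocatable: Resources{CPU: 3800, Memory: 15 << 30},
		Requested:   Resources{CPU: 3500, Memory: 12 << 30},
	}
	pod := Resources{CPU: 500, Memory: 1 << 30}
	fmt.Println(fitReasons(pod, node)) // [Insufficient cpu]
}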
Phase 2: Scoring ¶
Scoring ranks the feasible nodes. Each scoring plugin assigns a score (0-100), and scores are weighted and summed:
Feasible nodes: [node-1, node-5]
|
v
+----------------------------------------+
| NodeResourcesBalancedAllocation |
| node-1: 60 (moderate utilization) |
| node-5: 80 (low utilization) |
+----------------------------------------+
|
v
+----------------------------------------+
| InterPodAffinity |
| node-1: 100 (preferred pods nearby) |
| node-5: 50 (no preferred pods) |
+----------------------------------------+
|
v
+----------------------------------------+
| ImageLocality |
| node-1: 70 (some images cached) |
| node-5: 30 (need to pull images) |
+----------------------------------------+
|
v
Final scores (weighted sum):
node-1: 230
node-5: 160
Winner: node-1
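The arithmetic behind that result is just a weighted sum. Here is a small Go sketch that reproduces it; the per-plugin scores are the ones from the diagram, and the uniform weight of 1 per plugin is an assumption (real weights come from the scheduler configuration):
package main

import "fmt"

// Per-plugin scores (0-100) for the two feasible nodes, matching the diagram.
var scores = map[string]map[string]int{
	"node-1": {"NodeResourcesBalancedAllocation": 60, "InterPodAffinity": 100, "ImageLocality": 70},
	"node-5": {"NodeResourcesBalancedAllocation": 80, "InterPodAffinity": 50, "ImageLocality": 30},
}

// Assumed plugin weights (in practice these come from the scheduler config).
var weights = map[string]int{
	"NodeResourcesBalancedAllocation": 1,
	"InterPodAffinity":                1,
	"ImageLocality":                   1,
}

func main() {
	winner, best := "", -1
	for node, perPlugin := range scores {
		total := 0
		for plugin, s := range perPlugin {
			total += weights[plugin] * s
		}
		fmt.Printf("%s: %d\n", node, total) // prints node-1: 230 and node-5: 160 (map order varies)
		if total > best {
			winner, best = node, total
		}
	}
	fmt.Println("Winner:", winner) // node-1
}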
Built-in scoring plugins:
| Plugin | What it scores |
|---|---|
| NodeResourcesBalancedAllocation | Prefer balanced CPU/memory usage |
| NodeResourcesLeastAllocated | Prefer nodes with most free resources |
| NodeResourcesMostAllocated | Prefer nodes with least free resources (bin packing) |
| InterPodAffinity | Prefer nodes matching pod affinity |
| ImageLocality | Prefer nodes with container images cached |
| TaintToleration | Prefer nodes with fewer taints |
| NodeAffinity | Prefer nodes matching preferred affinity |
| PodTopologySpread | Prefer nodes that balance spread |
In recent Kubernetes releases, the LeastAllocated and MostAllocated behaviors are configured as a scoringStrategy on the NodeResourcesFit plugin rather than as separate score plugins.
Phase 3: Binding ¶
Once a node is selected, the scheduler “binds” the pod:
- Optimistic binding: Scheduler assumes success, updates internal cache
- API binding: Sends Binding object to API server
- Kubelet takes over: Kubelet sees pod assigned to its node, starts it
// Simplified binding (v1 is k8s.io/api/core/v1, metav1 is
// k8s.io/apimachinery/pkg/apis/meta/v1, client is a client-go Clientset)
binding := &v1.Binding{
    ObjectMeta: metav1.ObjectMeta{
        Name:      pod.Name,
        Namespace: pod.Namespace,
    },
    Target: v1.ObjectReference{
        Kind: "Node",
        Name: selectedNode,
    },
}
if err := client.CoreV1().Pods(pod.Namespace).Bind(ctx, binding, metav1.CreateOptions{}); err != nil {
    // Binding failed; the pod goes back through the scheduling queue
}
Scheduling Queue Internals ¶
The scheduler doesn’t process pods in simple FIFO order. It uses a priority queue with three sub-queues:
+------------------+
| ActiveQ | Pods ready to schedule (heap by priority)
+------------------+
|
| (scheduling fails)
v
+------------------+
| BackoffQ | Pods waiting after failure (exponential backoff)
+------------------+
|
| (cluster state changes)
v
+------------------+
| UnschedulableQ | Pods that can't be scheduled (waiting for change)
+------------------+
ActiveQ ¶
Pods ready for scheduling, ordered by priority:
# PriorityClass affects queue position
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000        # Higher value = scheduled first
globalDefault: false
BackoffQ ¶
When scheduling fails (e.g., race condition, transient error), pods go here with exponential backoff:
1st failure: wait 1s
2nd failure: wait 2s
3rd failure: wait 4s
...up to 10s max
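The backoff itself is a simple doubling capped at a maximum, controlled by podInitialBackoffSeconds and podMaxBackoffSeconds. A sketch of the calculation with the defaults (1s initial, 10s max):
package main

import (
	"fmt"
	"time"
)

// backoff doubles per failed attempt, capped at max -- the same shape
// as the scheduler's per-pod backoff.
func backoff(attempts int, initial, max time.Duration) time.Duration {
	d := initial
	for i := 1; i < attempts; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	return d
}

func main() {
	for attempts := 1; attempts <= 5; attempts++ {
		fmt.Printf("failure %d: wait %v\n", attempts, backoff(attempts, time.Second, 10*time.Second))
	}
	// failure 1: 1s, 2: 2s, 3: 4s, 4: 8s, 5: 10s (capped)
}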
UnschedulableQ ¶
When filtering finds no feasible nodes, pods go here. They’re retried when cluster state changes (node added, pod deleted, etc.).
Pod can't fit anywhere → UnschedulableQ
Node added to cluster → Move pods back to ActiveQ
Resource Requests and Limits ¶
The scheduler only considers requests, not limits:
resources:
requests:
cpu: 100m # Scheduler uses this
memory: 128Mi # Scheduler uses this
limits:
cpu: 500m # Scheduler ignores this
memory: 512Mi # Scheduler ignores this
Why? Requests represent guaranteed resources. Limits allow bursting but aren’t guaranteed. The scheduler ensures the sum of requests fits on the node.
Allocatable vs Capacity ¶
Nodes report both capacity and allocatable:
kubectl describe node worker-1 | grep -A 6 "Capacity\|Allocatable"
Capacity:
cpu: 4
memory: 16Gi
pods: 110
Allocatable:
cpu: 3800m # 200m reserved for system
memory: 15Gi # 1Gi reserved for system
pods: 110
The scheduler uses allocatable, which excludes resources reserved for kubelet, OS, etc.
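Roughly, allocatable is capacity minus the slices reserved via kubelet flags such as --kube-reserved and --system-reserved (for memory, the kubelet also subtracts its hard eviction threshold). A sketch of that arithmetic for the node above; the split between the reserved buckets is assumed:
package main

import "fmt"

func main() {
	// allocatable ≈ capacity - kube-reserved - system-reserved
	// (memory also loses the kubelet's hard eviction threshold).
	// The 100m/100m and 1Gi splits below are assumptions for illustration.
	capacityCPU := int64(4000) // 4 cores, in millicores
	kubeReservedCPU := int64(100)
	systemReservedCPU := int64(100)
	fmt.Println("allocatable cpu:", capacityCPU-kubeReservedCPU-systemReservedCPU, "m") // 3800 m

	capacityMem := int64(16 << 30) // 16Gi, in bytes
	reservedMem := int64(1 << 30)  // 1Gi reserved + eviction threshold combined
	fmt.Println("allocatable memory:", (capacityMem-reservedMem)>>30, "Gi") // 15 Gi
}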
Extended Resources ¶
Custom resources (GPUs, FPGAs, etc.) work the same way:
# Node advertises GPUs
status:
allocatable:
nvidia.com/gpu: 4
# Pod requests GPUs
resources:
requests:
nvidia.com/gpu: 2 # Scheduler checks this fits
Node Selection Deep Dive ¶
Node Selectors ¶
Simple label matching:
spec:
nodeSelector:
disktype: ssd
zone: us-west-2a
All labels must match. No flexibility.
Node Affinity ¶
More expressive, with required and preferred rules:
spec:
affinity:
nodeAffinity:
# Hard requirement (like nodeSelector but with operators)
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: topology.kubernetes.io/zone
operator: In
values:
- us-west-2a
- us-west-2b
# Soft preference (try but don't require)
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
preference:
matchExpressions:
- key: node-type
operator: In
values:
- high-memory
Operators: In, NotIn, Exists, DoesNotExist, Gt, Lt
Taints and Tolerations ¶
Taints repel pods; tolerations allow pods to schedule despite taints:
# Taint a node
kubectl taint nodes worker-1 dedicated=ml:NoSchedule
# Pod that tolerates the taint
spec:
tolerations:
- key: dedicated
operator: Equal
value: ml
effect: NoSchedule
Taint effects:
- NoSchedule: Don't schedule new pods (existing pods stay)
- PreferNoSchedule: Try not to schedule (soft)
- NoExecute: Evict existing pods and don't schedule new ones
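The matching rule itself is small. Here is a simplified sketch (stand-in types, not the real TaintToleration plugin or corev1 helpers): a toleration matches a taint when the key and effect line up and, for the Equal operator, the values are identical; the pod can land on the node only if every NoSchedule taint is matched by some toleration:
package main

import "fmt"

// Simplified stand-ins for corev1.Taint and corev1.Toleration.
type Taint struct{ Key, Value, Effect string }
type Toleration struct{ Key, Operator, Value, Effect string } // Operator: "Equal" or "Exists"

// tolerates reports whether a single toleration matches a single taint.
func tolerates(tol Toleration, t Taint) bool {
	if tol.Effect != "" && tol.Effect != t.Effect {
		return false
	}
	if tol.Key != t.Key {
		return false
	}
	return tol.Operator == "Exists" || tol.Value == t.Value
}

// schedulable: every NoSchedule taint must be tolerated by some toleration.
func schedulable(tols []Toleration, taints []Taint) bool {
	for _, t := range taints {
		if t.Effect != "NoSchedule" {
			continue
		}
		ok := false
		for _, tol := range tols {
			if tolerates(tol, t) {
				ok = true
				break
			}
		}
		if !ok {
			return false
		}
	}
	return true
}

func main() {
	taints := []Taint{{Key: "dedicated", Value: "ml", Effect: "NoSchedule"}}
	tols := []Toleration{{Key: "dedicated", Operator: "Equal", Value: "ml", Effect: "NoSchedule"}}
	fmt.Println(schedulable(tols, taints)) // true
}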
Pod Affinity and Anti-Affinity ¶
Place pods relative to other pods:
spec:
affinity:
# Run near pods with app=cache
podAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app: cache
topologyKey: kubernetes.io/hostname
# Don't run on same node as other app=web pods
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
app: web
topologyKey: kubernetes.io/hostname
topologyKey: Defines what “same place” means:
- kubernetes.io/hostname: Same node
- topology.kubernetes.io/zone: Same availability zone
- topology.kubernetes.io/region: Same region
Warning: Pod affinity with requiredDuringScheduling can make pods unschedulable if the target pods don’t exist yet.
Pod Topology Spread ¶
Distribute pods evenly across topology domains:
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: web
This ensures pods are spread across zones with at most 1 pod difference between any two zones.
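A sketch of the skew check (simplified; the real plugin also handles node eligibility and multiple constraints): for each zone, placing the pod there is allowed only if that zone's matching-pod count after placement stays within maxSkew of the smallest zone's count:
package main

import "fmt"

// allowedZones returns the zones where a new matching pod may be placed
// without exceeding maxSkew, under a simplified view of PodTopologySpread.
func allowedZones(countByZone map[string]int, maxSkew int) []string {
	min := -1
	for _, c := range countByZone {
		if min == -1 || c < min {
			min = c
		}
	}
	var allowed []string
	for zone, c := range countByZone {
		// skew after placing here = (count + 1) - min count across zones
		if (c+1)-min <= maxSkew {
			allowed = append(allowed, zone)
		}
	}
	return allowed
}

func main() {
	counts := map[string]int{"us-west-2a": 2, "us-west-2b": 2, "us-west-2c": 1}
	fmt.Println(allowedZones(counts, 1)) // only us-west-2c keeps skew <= 1
}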
Preemption ¶
When a high-priority pod can’t be scheduled, the scheduler may preempt (evict) lower-priority pods:
High-priority pod pending
|
v
+------------------+
| Find nodes where |
| preemption would |
| allow scheduling |
+------------------+
|
v
+------------------+
| Select victim |
| pods to evict |
+------------------+
|
v
+------------------+
| Evict victims, |
| schedule pod |
+------------------+
PriorityClasses ¶
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: critical
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority # or Never
description: "Critical workloads"
---
apiVersion: v1
kind: Pod
metadata:
name: critical-pod
spec:
priorityClassName: critical
# ...
Preemption Algorithm ¶
- Identify candidates: Nodes where evicting pods would make room
- Minimize disruption: Prefer evicting fewer/lower-priority pods
- Respect PDBs: Don’t violate PodDisruptionBudgets if possible
- Execute: Delete victim pods, schedule the preemptor
Note: Preemption is “graceful”—victims get their termination grace period.
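A rough sketch of the victim-selection idea (heavily simplified; the real algorithm simulates removals per node, re-runs filters, and accounts for PDBs): among pods with lower priority than the preemptor, evict the least important ones first and stop as soon as enough resources are freed:
package main

import (
	"fmt"
	"sort"
)

type runningPod struct {
	Name     string
	Priority int32
	CPU      int64 // requested millicores
}

// victims picks the lowest-priority pods to evict until `needed` millicores
// are freed. Returns nil if even evicting every lower-priority pod is not enough.
func victims(pods []runningPod, preemptorPriority int32, needed int64) []runningPod {
	// Only pods with lower priority than the preemptor are candidates.
	var candidates []runningPod
	for _, p := range pods {
		if p.Priority < preemptorPriority {
			candidates = append(candidates, p)
		}
	}
	// Evict the least important pods first.
	sort.Slice(candidates, func(i, j int) bool { return candidates[i].Priority < candidates[j].Priority })

	var chosen []runningPod
	var freed int64
	for _, p := range candidates {
		if freed >= needed {
			break
		}
		chosen = append(chosen, p)
		freed += p.CPU
	}
	if freed < needed {
		return nil
	}
	return chosen
}

func main() {
	pods := []runningPod{
		{Name: "batch-1", Priority: 0, CPU: 500},
		{Name: "web-1", Priority: 1000, CPU: 1000},
	}
	fmt.Println(victims(pods, 1000000, 400)) // [{batch-1 0 500}]
}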
Preventing Preemption ¶
# Pod that never preempts other pods
spec:
  priorityClassName: high-priority
  preemptionPolicy: Never  # High priority, but won't evict others to schedule
# Or use PodDisruptionBudget to limit disruption
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: critical-pdb
spec:
minAvailable: 2
selector:
matchLabels:
app: critical
Scheduler Performance ¶
The default scheduler handles ~100 pods/second. At scale, several factors matter:
Percentage of Nodes to Score ¶
For large clusters, scoring all feasible nodes is expensive. The scheduler samples:
# kube-scheduler config
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
percentageOfNodesToScore: 50 # Stop searching once feasible nodes = 50% of all nodes
With 5000 nodes and 50%, the scheduler finds and scores at most 2500 feasible nodes per pod.
Parallelism ¶
The scheduler can evaluate multiple pods concurrently:
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
parallelism: 16 # Concurrent scheduling goroutines
Cache ¶
The scheduler maintains a cache of node states to avoid hitting the API server:
API Server <--watch-- Scheduler Cache
|
v
Scheduling decisions
(reads from cache)
Cache includes: node allocatable, running pods, requested resources.
Debugging Scheduling Failures ¶
Check Pod Events ¶
kubectl describe pod pending-pod
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 10s default-scheduler 0/5 nodes are available:
2 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate,
3 node(s) didn't match Pod's node affinity/selector.
Understand the Message ¶
Common failure reasons:
| Message | Meaning | Fix |
|---|---|---|
| Insufficient cpu | No node has enough CPU | Reduce requests or add nodes |
| Insufficient memory | No node has enough memory | Reduce requests or add nodes |
| node(s) had taint...didn't tolerate | Taints blocking scheduling | Add tolerations or remove taints |
| node(s) didn't match node affinity | Affinity rules too restrictive | Relax affinity or label nodes |
| node(s) didn't match pod topology spread | Can't satisfy spread constraints | Add nodes in needed topologies |
| persistentvolumeclaim not found | PVC doesn't exist | Create the PVC |
| node(s) had volume node affinity conflict | PV is in different zone | Create PV in correct zone |
Simulate Scheduling ¶
Check why a pod can’t schedule without actually creating it:
# There is no built-in dry-run for scheduling. The kubernetes-sigs
# kube-scheduler-simulator project can replay scheduling against a simulated
# cluster; otherwise, create the pod and read its FailedScheduling events.
Check Node Resources ¶
# See allocatable vs allocated
kubectl describe node worker-1 | grep -A 10 "Allocated resources"
Allocated resources:
Resource Requests Limits
-------- -------- ------
cpu 3500m (92%) 7000m (184%)
memory 12Gi (80%) 20Gi (133%)
# Detailed pod resource usage
kubectl top pods --containers
Scheduler Logs ¶
# View scheduler logs
kubectl logs -n kube-system -l component=kube-scheduler
# Increase verbosity
# Edit kube-scheduler manifest, add --v=4
Custom Schedulers ¶
You can run multiple schedulers or write your own:
Using a Custom Scheduler ¶
apiVersion: v1
kind: Pod
metadata:
name: custom-scheduled-pod
spec:
schedulerName: my-custom-scheduler # Use custom scheduler
containers:
- name: app
image: nginx
Scheduler Extenders (Legacy) ¶
Extend the default scheduler with webhook calls:
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
extenders:
- urlPrefix: "http://my-extender:8080"
filterVerb: "filter"
prioritizeVerb: "prioritize"
weight: 5
enableHTTPS: false
The scheduler calls your extender for additional filtering/scoring.
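A minimal sketch of what the extender side looks like, assuming the ExtenderArgs and ExtenderFilterResult types from k8s.io/kube-scheduler/extender/v1; the approval-label rule is purely illustrative:
package main

import (
	"encoding/json"
	"net/http"

	v1 "k8s.io/api/core/v1"
	extenderv1 "k8s.io/kube-scheduler/extender/v1"
)

// filter implements the extender's filterVerb endpoint: the scheduler POSTs
// ExtenderArgs (the pod plus the nodes that passed built-in filters) and
// expects an ExtenderFilterResult in response.
func filter(w http.ResponseWriter, r *http.Request) {
	var args extenderv1.ExtenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	result := extenderv1.ExtenderFilterResult{
		Nodes:       &v1.NodeList{},
		FailedNodes: extenderv1.FailedNodesMap{},
	}
	if args.Nodes != nil {
		for _, node := range args.Nodes.Items {
			// Illustrative rule: only keep nodes carrying an "approved" label.
			if node.Labels["example.com/approved"] == "true" {
				result.Nodes.Items = append(result.Nodes.Items, node)
			} else {
				result.FailedNodes[node.Name] = "node not approved by extender"
			}
		}
	}
	json.NewEncoder(w).Encode(result)
}

func main() {
	http.HandleFunc("/filter", filter)
	http.ListenAndServe(":8080", nil)
}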
Scheduling Framework (Modern) ¶
The Scheduling Framework allows writing plugins in Go:
// Custom filter plugin implementing framework.FilterPlugin
// (framework is k8s.io/kubernetes/pkg/scheduler/framework)
type MyPlugin struct{}

func (p *MyPlugin) Name() string { return "MyPlugin" }

func (p *MyPlugin) Filter(ctx context.Context, state *framework.CycleState,
	pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	// Custom filtering logic (myCustomCheck is your own predicate)
	if !myCustomCheck(pod, nodeInfo.Node()) {
		return framework.NewStatus(framework.Unschedulable, "custom check failed")
	}
	return nil // a nil Status means Success
}
Build a custom scheduler binary with your plugins included.
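One common way to do that is to wrap the upstream scheduler command and register the plugin by name. A sketch, with the caveat that the plugin factory signature differs slightly across Kubernetes releases (recent ones pass a context, older ones do not):
package main

import (
	"context"
	"os"

	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/kubernetes/cmd/kube-scheduler/app"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

func main() {
	cmd := app.NewSchedulerCommand(
		// The name must match the plugin name enabled in KubeSchedulerConfiguration.
		app.WithPlugin("MyPlugin", func(_ context.Context, _ runtime.Object, _ framework.Handle) (framework.Plugin, error) {
			return &MyPlugin{}, nil // MyPlugin from the snippet above
		}),
	)
	if err := cmd.Execute(); err != nil {
		os.Exit(1)
	}
}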
Scheduler Configuration ¶
Full scheduler configuration example:
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
leaderElection:
leaderElect: true
clientConnection:
kubeconfig: /etc/kubernetes/scheduler.conf
percentageOfNodesToScore: 50
podInitialBackoffSeconds: 1
podMaxBackoffSeconds: 10
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        disabled:
          - name: ImageLocality        # example: turn off a default score plugin
    # Bin packing instead of spreading: in the v1 config API this is a
    # scoringStrategy on NodeResourcesFit rather than a separate plugin
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
Summary ¶
The scheduler’s job is simple: pick a node for each pod. The implementation is sophisticated:
| Phase | What happens |
|---|---|
| Queue | Pods ordered by priority, backoff for failures |
| Filter | Eliminate nodes that can’t run the pod |
| Score | Rank remaining nodes by preference |
| Bind | Assign pod to winning node |
| Preempt | Evict lower-priority pods if needed |
Key takeaways:
- Scheduler uses requests, not limits
- Filtering is pass/fail; scoring is best-effort
- Pod affinity can create deadlocks—use carefully
- Topology spread is the modern way to distribute pods
- Preemption respects PDBs when possible
- Debug with kubectl describe pod and scheduler logs
When pods are stuck Pending, the answer is almost always in the scheduling failure message. Read it carefully—it tells you exactly which constraint failed.