A pod is created. Seconds later, it’s running on a node. But how did Kubernetes decide which node? The scheduler—kube-scheduler—makes this decision hundreds of times per second in large clusters. Understanding how it works helps you debug “Pending” pods and optimize placement.
The Scheduling Problem ¶
The scheduler solves a bin-packing problem: given N pods and M nodes, assign each pod to a node such that:
- Constraints are satisfied — resource requests fit, taints/tolerations match, affinity rules hold
- Resources are balanced — don’t overload some nodes while others sit idle
- Preferences are respected — spread pods across zones, colocate related pods
This is NP-hard in the general case, so the scheduler uses heuristics—fast, good-enough decisions rather than optimal ones.
The Scheduling Cycle ¶
When a pod is created without a nodeName, it enters the scheduling queue. The scheduler processes it through three phases:
Pod created (nodeName empty)
|
v
+------------------+
| Scheduling Queue |
+------------------+
|
v
+------------------+
| Scheduling Cycle |
| |
| 1. Filtering | Which nodes CAN run this pod?
| 2. Scoring | Which node is BEST?
| 3. Binding | Assign pod to chosen node
| |
+------------------+
|
v
Pod bound to node
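Before digging into each phase, a minimal, self-contained Go sketch of this loop may help. Everything here is a simplified stand-in (a single feasibility check, a single score, toy Pod and Node types), not kube-scheduler's actual code:
package main

import (
	"errors"
	"fmt"
)

// Toy types for illustration only.
type Pod struct {
	Name       string
	CPURequest int64 // millicores
}

type Node struct {
	Name        string
	Allocatable int64 // millicores the scheduler may allocate
	Requested   int64 // sum of requests of pods already bound here
}

// Phase 1: Filtering -- can this node run the pod at all?
func feasible(p Pod, n Node) bool {
	return n.Requested+p.CPURequest <= n.Allocatable
}

// Phase 2: Scoring -- how good is this node? (0-100; more free CPU is better)
func score(p Pod, n Node) int64 {
	return (n.Allocatable - n.Requested - p.CPURequest) * 100 / n.Allocatable
}

// scheduleOne mirrors filter -> score -> pick the winner; the caller binds.
func scheduleOne(p Pod, nodes []Node) (string, error) {
	var candidates []Node
	for _, n := range nodes {
		if feasible(p, n) {
			candidates = append(candidates, n)
		}
	}
	if len(candidates) == 0 {
		return "", errors.New("no feasible nodes: pod stays Pending")
	}
	best := candidates[0]
	for _, n := range candidates[1:] {
		if score(p, n) > score(p, best) {
			best = n
		}
	}
	return best.Name, nil // Phase 3: bind the pod to best.Name
}

func main() {
	nodes := []Node{
		{Name: "node-1", Allocatable: 3800, Requested: 3500},
		{Name: "node-2", Allocatable: 3800, Requested: 1000},
	}
	fmt.Println(scheduleOne(Pod{Name: "web-1", CPURequest: 250}, nodes)) // node-2 <nil>
}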
Phase 1: Filtering ¶
Filtering eliminates nodes that cannot run the pod. Each filter plugin checks one constraint:
All Nodes: [node-1, node-2, node-3, node-4, node-5]
|
v
+--------------------------------------------------+
| NodeResourcesFit: Does node have enough CPU/mem? |
+--------------------------------------------------+
|
Remaining: [node-1, node-2, node-4, node-5]
|
v
+--------------------------------------------------+
| NodeAffinity: Does node match required affinity? |
+--------------------------------------------------+
|
Remaining: [node-1, node-2, node-5]
|
v
+--------------------------------------------------+
| TaintToleration: Does pod tolerate node taints? |
+--------------------------------------------------+
|
Remaining: [node-1, node-5]
|
v
Feasible nodes for scoring
Built-in filter plugins:
| Plugin | What it checks |
|---|---|
| NodeResourcesFit | CPU, memory, ephemeral storage requests fit |
| NodePorts | Requested host ports are available |
| NodeAffinity | Node matches nodeAffinity rules |
| TaintToleration | Pod tolerates node's taints |
| PodTopologySpread | Spread constraints are satisfiable |
| VolumeBinding | Required PVs can be bound to this node |
| InterPodAffinity | Pod affinity/anti-affinity constraints |
| NodeUnschedulable | Node isn't cordoned |
If no nodes pass filtering, the pod stays Pending.
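For a feel of what one filter does, here is a rough sketch in the spirit of NodeResourcesFit (the types and numbers are illustrative, not the plugin's real code): sum the requests already on the node, add the incoming pod's requests, compare against allocatable, and report a per-resource reason on failure. These reasons are the same strings that later show up in FailedScheduling events:
package main

import "fmt"

// Illustrative types only -- not the real NodeResourcesFit implementation.
type Resources struct{ CPU, Memory int64 } // millicores, bytes

type NodeInfo struct {
	Allocatable Resources
	Requested   Resources // sum of requests of pods already on the node
}

// fitReasons returns why a pod's requests do not fit, or nil if they fit.
func fitReasons(req Resources, n NodeInfo) []string {
	var reasons []string
	if n.Requested.CPU+req.CPU > n.Allocatable.CPU {
		reasons = append(reasons, "Insufficient cpu")
	}
	if n.Requested.Memory+req.Memory > n.Allocatable.Memory {
		reasons = append(reasons, "Insufficient memory")
	}
	return reasons
}

func main() {
	node := NodeInfo{
		Allocatable: Resources{CPU: 3800, Memory: 15 << 30},
		Requested:   Resources{CPU: 3500, Memory: 12 << 30},
	}
	pod := Resources{CPU: 500, Memory: 1 << 30}
	fmt.Println(fitReasons(pod, node)) // [Insufficient cpu]
}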
Phase 2: Scoring ¶
Scoring ranks the feasible nodes. Each scoring plugin assigns a score (0-100), and scores are weighted and summed:
Feasible nodes: [node-1, node-5]
|
v
+----------------------------------------+
| NodeResourcesBalancedAllocation |
| node-1: 60 (moderate utilization) |
| node-5: 80 (low utilization) |
+----------------------------------------+
|
v
+----------------------------------------+
| InterPodAffinity |
| node-1: 100 (preferred pods nearby) |
| node-5: 50 (no preferred pods) |
+----------------------------------------+
|
v
+----------------------------------------+
| ImageLocality |
| node-1: 70 (some images cached) |
| node-5: 30 (need to pull images) |
+----------------------------------------+
|
v
Final scores (weighted sum):
node-1: 230
node-5: 160
Winner: node-1
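The arithmetic behind that result is just a weighted sum. Here is a small Go sketch that reproduces it; the per-plugin scores are the ones from the diagram, and the uniform weight of 1 per plugin is an assumption (real weights come from the scheduler configuration):
package main

import "fmt"

// Per-plugin scores (0-100) for the two feasible nodes, matching the diagram.
var scores = map[string]map[string]int{
	"node-1": {"NodeResourcesBalancedAllocation": 60, "InterPodAffinity": 100, "ImageLocality": 70},
	"node-5": {"NodeResourcesBalancedAllocation": 80, "InterPodAffinity": 50, "ImageLocality": 30},
}

// Assumed plugin weights (in practice these come from the scheduler config).
var weights = map[string]int{
	"NodeResourcesBalancedAllocation": 1,
	"InterPodAffinity":                1,
	"ImageLocality":                   1,
}

func main() {
	winner, best := "", -1
	for node, perPlugin := range scores {
		total := 0
		for plugin, s := range perPlugin {
			total += weights[plugin] * s
		}
		fmt.Printf("%s: %d\n", node, total) // prints node-1: 230 and node-5: 160 (map order varies)
		if total > best {
			winner, best = node, total
		}
	}
	fmt.Println("Winner:", winner) // node-1
}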
Built-in scoring plugins:
| Plugin | What it scores |
|---|---|
| NodeResourcesBalancedAllocation | Prefer balanced CPU/memory usage |
| NodeResourcesLeastAllocated | Prefer nodes with most free resources |
| NodeResourcesMostAllocated | Prefer nodes with least free resources (bin packing) |
| InterPodAffinity | Prefer nodes matching pod affinity |
| ImageLocality | Prefer nodes with container images cached |
| TaintToleration | Prefer nodes with fewer taints |
| NodeAffinity | Prefer nodes matching preferred affinity |
| PodTopologySpread | Prefer nodes that balance spread |
In recent Kubernetes releases, the LeastAllocated and MostAllocated behaviors are configured as a scoringStrategy on the NodeResourcesFit plugin rather than as separate score plugins.
Phase 3: Binding ¶
Once a node is selected, the scheduler “binds” the pod:
- Optimistic binding: Scheduler assumes success, updates internal cache
- API binding: Sends Binding object to API server
- Kubelet takes over: Kubelet sees pod assigned to its node, starts it
// Simplified binding (v1 is k8s.io/api/core/v1, metav1 is
// k8s.io/apimachinery/pkg/apis/meta/v1, client is a client-go Clientset)
binding := &v1.Binding{
    ObjectMeta: metav1.ObjectMeta{
        Name:      pod.Name,
        Namespace: pod.Namespace,
    },
    Target: v1.ObjectReference{
        Kind: "Node",
        Name: selectedNode,
    },
}
if err := client.CoreV1().Pods(pod.Namespace).Bind(ctx, binding, metav1.CreateOptions{}); err != nil {
    // Binding failed; the pod goes back through the scheduling queue
}
Scheduling Queue Internals ¶
The scheduler doesn’t process pods in simple FIFO order. It uses a priority queue with three sub-queues:
+------------------+
| ActiveQ | Pods ready to schedule (heap by priority)
+------------------+
|
| (scheduling fails)
v
+------------------+
| BackoffQ | Pods waiting after failure (exponential backoff)
+------------------+
|
| (cluster state changes)
v
+------------------+
| UnschedulableQ | Pods that can't be scheduled (waiting for change)
+------------------+
ActiveQ ¶
Pods ready for scheduling, ordered by priority:
# PriorityClass affects queue position
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000        # Higher value = scheduled first
globalDefault: false
BackoffQ ¶
When scheduling fails (e.g., race condition, transient error), pods go here with exponential backoff:
1st failure: wait 1s
2nd failure: wait 2s
3rd failure: wait 4s
...up to 10s max
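The backoff itself is a simple doubling capped at a maximum, controlled by podInitialBackoffSeconds and podMaxBackoffSeconds. A sketch of the calculation with the defaults (1s initial, 10s max):
package main

import (
	"fmt"
	"time"
)

// backoff doubles per failed attempt, capped at max -- the same shape
// as the scheduler's per-pod backoff.
func backoff(attempts int, initial, max time.Duration) time.Duration {
	d := initial
	for i := 1; i < attempts; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	return d
}

func main() {
	for attempts := 1; attempts <= 5; attempts++ {
		fmt.Printf("failure %d: wait %v\n", attempts, backoff(attempts, time.Second, 10*time.Second))
	}
	// failure 1: 1s, 2: 2s, 3: 4s, 4: 8s, 5: 10s (capped)
}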
UnschedulableQ ¶
When filtering finds no feasible nodes, pods go here. They’re retried when cluster state changes (node added, pod deleted, etc.).
Pod can't fit anywhere → UnschedulableQ
Node added to cluster → Move pods back to ActiveQ
Resource Requests and Limits ¶
The scheduler only considers requests, not limits:
resources:
requests:
cpu: 100m # Scheduler uses this
memory: 128Mi # Scheduler uses this
limits:
cpu: 500m # Scheduler ignores this
memory: 512Mi # Scheduler ignores this
Why? Requests represent guaranteed resources. Limits allow bursting but aren’t guaranteed. The scheduler ensures the sum of requests fits on the node.
Allocatable vs Capacity ¶
Nodes report both capacity and allocatable:
kubectl describe node worker-1 | grep -A 6 "Capacity\|Allocatable"
Capacity:
cpu: 4
memory: 16Gi
pods: 110
Allocatable:
cpu: 3800m # 200m reserved for system
memory: 15Gi # 1Gi reserved for system
pods: 110
The scheduler uses allocatable, which excludes resources reserved for kubelet, OS, etc.
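Roughly, allocatable is capacity minus the slices reserved via kubelet flags such as --kube-reserved and --system-reserved (for memory, the kubelet also subtracts its hard eviction threshold). A sketch of that arithmetic for the node above; the split between the reserved buckets is assumed:
package main

import "fmt"

func main() {
	// allocatable ≈ capacity - kube-reserved - system-reserved
	// (memory also loses the kubelet's hard eviction threshold).
	// The 100m/100m and 1Gi splits below are assumptions for illustration.
	capacityCPU := int64(4000) // 4 cores, in millicores
	kubeReservedCPU := int64(100)
	systemReservedCPU := int64(100)
	fmt.Println("allocatable cpu:", capacityCPU-kubeReservedCPU-systemReservedCPU, "m") // 3800 m

	capacityMem := int64(16 << 30) // 16Gi, in bytes
	reservedMem := int64(1 << 30)  // 1Gi reserved + eviction threshold combined
	fmt.Println("allocatable memory:", (capacityMem-reservedMem)>>30, "Gi") // 15 Gi
}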
Extended Resources ¶
Custom resources (GPUs, FPGAs, etc.) work the same way:
# Node advertises GPUs
status:
allocatable:
nvidia.com/gpu: 4
# Pod requests GPUs
resources:
requests:
nvidia.com/gpu: 2 # Scheduler checks this fits
Node Selection Deep Dive ¶
Node Selectors ¶
Simple label matching:
spec:
nodeSelector:
disktype: ssd
zone: us-west-2a
All labels must match. No flexibility.
Node Affinity ¶
More expressive, with required and preferred rules:
spec:
affinity:
nodeAffinity:
# Hard requirement (like nodeSelector but with operators)
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: topology.kubernetes.io/zone
operator: In
values:
- us-west-2a
- us-west-2b
# Soft preference (try but don't require)
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
preference:
matchExpressions:
- key: node-type
operator: In
values:
- high-memory
Operators: In, NotIn, Exists, DoesNotExist, Gt, Lt
Taints and Tolerations ¶
Taints repel pods; tolerations allow pods to schedule despite taints:
# Taint a node
kubectl taint nodes worker-1 dedicated=ml:NoSchedule
# Pod that tolerates the taint
spec:
tolerations:
- key: dedicated
operator: Equal
value: ml
effect: NoSchedule
Taint effects:
- NoSchedule: Don't schedule new pods (existing pods stay)
- PreferNoSchedule: Try not to schedule (soft)
- NoExecute: Evict existing pods and don't schedule new ones
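The matching rule itself is small. Here is a simplified sketch (stand-in types, not the real TaintToleration plugin or corev1 helpers): a toleration matches a taint when the key and effect line up and, for the Equal operator, the values are identical; the pod can land on the node only if every NoSchedule taint is matched by some toleration:
package main

import "fmt"

// Simplified stand-ins for corev1.Taint and corev1.Toleration.
type Taint struct{ Key, Value, Effect string }
type Toleration struct{ Key, Operator, Value, Effect string } // Operator: "Equal" or "Exists"

// tolerates reports whether a single toleration matches a single taint.
func tolerates(tol Toleration, t Taint) bool {
	if tol.Effect != "" && tol.Effect != t.Effect {
		return false
	}
	if tol.Key != t.Key {
		return false
	}
	return tol.Operator == "Exists" || tol.Value == t.Value
}

// schedulable: every NoSchedule taint must be tolerated by some toleration.
func schedulable(tols []Toleration, taints []Taint) bool {
	for _, t := range taints {
		if t.Effect != "NoSchedule" {
			continue
		}
		ok := false
		for _, tol := range tols {
			if tolerates(tol, t) {
				ok = true
				break
			}
		}
		if !ok {
			return false
		}
	}
	return true
}

func main() {
	taints := []Taint{{Key: "dedicated", Value: "ml", Effect: "NoSchedule"}}
	tols := []Toleration{{Key: "dedicated", Operator: "Equal", Value: "ml", Effect: "NoSchedule"}}
	fmt.Println(schedulable(tols, taints)) // true
}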
Pod Affinity and Anti-Affinity ¶
Place pods relative to other pods:
spec:
affinity:
# Run near pods with app=cache
podAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app: cache
topologyKey: kubernetes.io/hostname
# Don't run on same node as other app=web pods
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
app: web
topologyKey: kubernetes.io/hostname
topologyKey: Defines what “same place” means:
- kubernetes.io/hostname: Same node
- topology.kubernetes.io/zone: Same availability zone
- topology.kubernetes.io/region: Same region
Warning: Pod affinity with requiredDuringScheduling can make pods unschedulable if the target pods don’t exist yet.
Pod Topology Spread ¶
Distribute pods evenly across topology domains:
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: web
This ensures pods are spread across zones with at most 1 pod difference between any two zones.
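A sketch of the skew check (simplified; the real plugin also handles node eligibility and multiple constraints): for each zone, placing the pod there is allowed only if that zone's matching-pod count after placement stays within maxSkew of the smallest zone's count:
package main

import "fmt"

// allowedZones returns the zones where a new matching pod may be placed
// without exceeding maxSkew, under a simplified view of PodTopologySpread.
func allowedZones(countByZone map[string]int, maxSkew int) []string {
	min := -1
	for _, c := range countByZone {
		if min == -1 || c < min {
			min = c
		}
	}
	var allowed []string
	for zone, c := range countByZone {
		// skew after placing here = (count + 1) - min count across zones
		if (c+1)-min <= maxSkew {
			allowed = append(allowed, zone)
		}
	}
	return allowed
}

func main() {
	counts := map[string]int{"us-west-2a": 2, "us-west-2b": 2, "us-west-2c": 1}
	fmt.Println(allowedZones(counts, 1)) // only us-west-2c keeps skew <= 1
}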
Preemption ¶
When a high-priority pod can’t be scheduled, the scheduler may preempt (evict) lower-priority pods:
High-priority pod pending
|
v
+------------------+
| Find nodes where |
| preemption would |
| allow scheduling |
+------------------+
|
v
+------------------+
| Select victim |
| pods to evict |
+------------------+
|
v
+------------------+
| Evict victims, |
| schedule pod |
+------------------+
PriorityClasses ¶
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: critical
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority # or Never
description: "Critical workloads"
---
apiVersion: v1
kind: Pod
metadata:
name: critical-pod
spec:
priorityClassName: critical
# ...
Preemption Algorithm ¶
- Identify candidates: Nodes where evicting pods would make room
- Minimize disruption: Prefer evicting fewer/lower-priority pods
- Respect PDBs: Don’t violate PodDisruptionBudgets if possible
- Execute: Delete victim pods, schedule the preemptor
Note: Preemption is “graceful”—victims get their termination grace period.
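A rough sketch of the victim-selection idea (heavily simplified; the real algorithm simulates removals per node, re-runs filters, and accounts for PDBs): among pods with lower priority than the preemptor, evict the least important ones first and stop as soon as enough resources are freed:
package main

import (
	"fmt"
	"sort"
)

type runningPod struct {
	Name     string
	Priority int32
	CPU      int64 // requested millicores
}

// victims picks the lowest-priority pods to evict until `needed` millicores
// are freed. Returns nil if even evicting every lower-priority pod is not enough.
func victims(pods []runningPod, preemptorPriority int32, needed int64) []runningPod {
	// Only pods with lower priority than the preemptor are candidates.
	var candidates []runningPod
	for _, p := range pods {
		if p.Priority < preemptorPriority {
			candidates = append(candidates, p)
		}
	}
	// Evict the least important pods first.
	sort.Slice(candidates, func(i, j int) bool { return candidates[i].Priority < candidates[j].Priority })

	var chosen []runningPod
	var freed int64
	for _, p := range candidates {
		if freed >= needed {
			break
		}
		chosen = append(chosen, p)
		freed += p.CPU
	}
	if freed < needed {
		return nil
	}
	return chosen
}

func main() {
	pods := []runningPod{
		{Name: "batch-1", Priority: 0, CPU: 500},
		{Name: "web-1", Priority: 1000, CPU: 1000},
	}
	fmt.Println(victims(pods, 1000000, 400)) // [{batch-1 0 500}]
}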
Preventing Preemption ¶
# Pod that never preempts other pods
spec:
  priorityClassName: high-priority
  preemptionPolicy: Never  # High priority, but won't evict others to schedule
# Or use PodDisruptionBudget to limit disruption
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: critical-pdb
spec:
minAvailable: 2
selector:
matchLabels:
app: critical
Scheduler Performance ¶
The default scheduler handles ~100 pods/second. At scale, several factors matter:
Percentage of Nodes to Score ¶
For large clusters, scoring all feasible nodes is expensive. The scheduler samples:
# kube-scheduler config
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
percentageOfNodesToScore: 50 # Stop searching once feasible nodes = 50% of all nodes
With 5000 nodes and 50%, the scheduler finds and scores at most 2500 feasible nodes per pod.
Parallelism ¶
The scheduler can evaluate multiple pods concurrently:
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
parallelism: 16 # Concurrent scheduling goroutines
Cache ¶
The scheduler maintains a cache of node states to avoid hitting the API server:
API Server <--watch-- Scheduler Cache
|
v
Scheduling decisions
(reads from cache)
Cache includes: node allocatable, running pods, requested resources.
Debugging Scheduling Failures ¶
Check Pod Events ¶
kubectl describe pod pending-pod
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 10s default-scheduler 0/5 nodes are available:
2 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate,
3 node(s) didn't match Pod's node affinity/selector.
Understand the Message ¶
Common failure reasons:
| Message | Meaning | Fix |
|---|---|---|
| Insufficient cpu | No node has enough CPU | Reduce requests or add nodes |
| Insufficient memory | No node has enough memory | Reduce requests or add nodes |
| node(s) had taint...didn't tolerate | Taints blocking scheduling | Add tolerations or remove taints |
| node(s) didn't match node affinity | Affinity rules too restrictive | Relax affinity or label nodes |
| node(s) didn't match pod topology spread | Can't satisfy spread constraints | Add nodes in needed topologies |
| persistentvolumeclaim not found | PVC doesn't exist | Create the PVC |
| node(s) had volume node affinity conflict | PV is in different zone | Create PV in correct zone |
Simulate Scheduling ¶
Check why a pod can’t schedule without actually creating it:
# There is no built-in dry-run for scheduling. The kubernetes-sigs
# kube-scheduler-simulator project can replay scheduling against a simulated
# cluster; otherwise, create the pod and read its FailedScheduling events.
Check Node Resources ¶
# See allocatable vs allocated
kubectl describe node worker-1 | grep -A 10 "Allocated resources"
Allocated resources:
Resource Requests Limits
-------- -------- ------
cpu 3500m (92%) 7000m (184%)
memory 12Gi (80%) 20Gi (133%)
# Detailed pod resource usage
kubectl top pods --containers
Scheduler Logs ¶
# View scheduler logs
kubectl logs -n kube-system -l component=kube-scheduler
# Increase verbosity
# Edit kube-scheduler manifest, add --v=4
Custom Schedulers ¶
You can run multiple schedulers or write your own:
Using a Custom Scheduler ¶
apiVersion: v1
kind: Pod
metadata:
name: custom-scheduled-pod
spec:
schedulerName: my-custom-scheduler # Use custom scheduler
containers:
- name: app
image: nginx
Scheduler Extenders (Legacy) ¶
Extend the default scheduler with webhook calls:
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
extenders:
- urlPrefix: "http://my-extender:8080"
filterVerb: "filter"
prioritizeVerb: "prioritize"
weight: 5
enableHTTPS: false
The scheduler calls your extender for additional filtering/scoring.
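A minimal sketch of what the extender side looks like, assuming the ExtenderArgs and ExtenderFilterResult types from k8s.io/kube-scheduler/extender/v1; the approval-label rule is purely illustrative:
package main

import (
	"encoding/json"
	"net/http"

	v1 "k8s.io/api/core/v1"
	extenderv1 "k8s.io/kube-scheduler/extender/v1"
)

// filter implements the extender's filterVerb endpoint: the scheduler POSTs
// ExtenderArgs (the pod plus the nodes that passed built-in filters) and
// expects an ExtenderFilterResult in response.
func filter(w http.ResponseWriter, r *http.Request) {
	var args extenderv1.ExtenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	result := extenderv1.ExtenderFilterResult{
		Nodes:       &v1.NodeList{},
		FailedNodes: extenderv1.FailedNodesMap{},
	}
	if args.Nodes != nil {
		for _, node := range args.Nodes.Items {
			// Illustrative rule: only keep nodes carrying an "approved" label.
			if node.Labels["example.com/approved"] == "true" {
				result.Nodes.Items = append(result.Nodes.Items, node)
			} else {
				result.FailedNodes[node.Name] = "node not approved by extender"
			}
		}
	}
	json.NewEncoder(w).Encode(result)
}

func main() {
	http.HandleFunc("/filter", filter)
	http.ListenAndServe(":8080", nil)
}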
Scheduling Framework (Modern) ¶
The Scheduling Framework allows writing plugins in Go:
// Custom filter plugin implementing framework.FilterPlugin
// (framework is k8s.io/kubernetes/pkg/scheduler/framework)
type MyPlugin struct{}

func (p *MyPlugin) Name() string { return "MyPlugin" }

func (p *MyPlugin) Filter(ctx context.Context, state *framework.CycleState,
	pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	// Custom filtering logic (myCustomCheck is your own predicate)
	if !myCustomCheck(pod, nodeInfo.Node()) {
		return framework.NewStatus(framework.Unschedulable, "custom check failed")
	}
	return nil // a nil Status means Success
}
Build a custom scheduler binary with your plugins included.
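One common way to do that is to wrap the upstream scheduler command and register the plugin by name. A sketch, with the caveat that the plugin factory signature differs slightly across Kubernetes releases (recent ones pass a context, older ones do not):
package main

import (
	"context"
	"os"

	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/kubernetes/cmd/kube-scheduler/app"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

func main() {
	cmd := app.NewSchedulerCommand(
		// The name must match the plugin name enabled in KubeSchedulerConfiguration.
		app.WithPlugin("MyPlugin", func(_ context.Context, _ runtime.Object, _ framework.Handle) (framework.Plugin, error) {
			return &MyPlugin{}, nil // MyPlugin from the snippet above
		}),
	)
	if err := cmd.Execute(); err != nil {
		os.Exit(1)
	}
}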
Scheduler Configuration ¶
Full scheduler configuration example:
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
leaderElection:
leaderElect: true
clientConnection:
kubeconfig: /etc/kubernetes/scheduler.conf
percentageOfNodesToScore: 50
podInitialBackoffSeconds: 1
podMaxBackoffSeconds: 10
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        disabled:
          - name: ImageLocality        # example: turn off a default score plugin
    # Bin packing instead of spreading: in the v1 config API this is a
    # scoringStrategy on NodeResourcesFit rather than a separate plugin
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
Summary ¶
The scheduler’s job is simple: pick a node for each pod. The implementation is sophisticated:
| Phase | What happens |
|---|---|
| Queue | Pods ordered by priority, backoff for failures |
| Filter | Eliminate nodes that can’t run the pod |
| Score | Rank remaining nodes by preference |
| Bind | Assign pod to winning node |
| Preempt | Evict lower-priority pods if needed |
Key takeaways:
- Scheduler uses requests, not limits
- Filtering is pass/fail; scoring is best-effort
- Pod affinity can create deadlocks—use carefully
- Topology spread is the modern way to distribute pods
- Preemption respects PDBs when possible
- Debug with kubectl describe pod and scheduler logs
When pods are stuck Pending, the answer is almost always in the scheduling failure message. Read it carefully—it tells you exactly which constraint failed.