Your application has plenty of CPU headroom—at least according to your metrics. Average CPU is 200m, limit is 1000m. But requests are timing out, p99 latency is through the roof, and users are complaining.
The culprit: CPU throttling. Your containers are being throttled even though they’re nowhere near their “limit.” This post explains why, how to detect it, and what to do about it.
The CPU Limit Lie ¶
When you set a CPU limit, you’re not saying “this container can use up to 1 CPU.” You’re saying “in any 100ms period, this container can use up to 100ms of CPU time.”
That’s a very different statement.
resources:
limits:
cpu: "1" # NOT "up to 1 CPU"
# Actually: "100ms of CPU time per 100ms period"
This is the CFS (Completely Fair Scheduler) bandwidth control mechanism. And it’s the source of most CPU throttling pain.
How CFS Bandwidth Control Works ¶
The Linux CFS scheduler allocates CPU time in periods (default: 100ms). Your CPU limit translates to a quota within that period.
| CPU Limit | Period | Quota |
|---|---|---|
| 500m | 100ms | 50ms |
| 1 | 100ms | 100ms |
| 2 | 100ms | 200ms |
In any 100ms period, your container can use up to its quota of CPU time. Once exhausted, it’s throttled—all threads are blocked until the next period starts.
Period 1 (0-100ms)
|████████████████░░░░░░░░░░░░░░░░░░░░░|
^                ^                    ^
|                |                    |
Start    Quota exhausted         Period ends
         at 40ms           (throttled for 60ms!)
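The arithmetic is worth internalizing. Here’s a tiny Go sketch (just the math, nothing Kubernetes actually runs) that prints the quota for a few common limits, assuming the default 100ms period:
package main

import "fmt"

func main() {
	const periodMs = 100.0 // CFS default period: 100ms

	for _, limitMillicores := range []float64{500, 1000, 2000} {
		// quota = limit × period, enforced per period
		quotaMs := limitMillicores / 1000 * periodMs
		fmt.Printf("limit %4.0fm -> quota %5.1fms per %.0fms period\n",
			limitMillicores, quotaMs, periodMs)
	}
}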
The Burstiness Problem ¶
Applications aren’t steady-state. They burst. A web server might idle at 50m CPU, then spike to 2000m (two threads running flat out) when a request arrives.
Actual CPU usage pattern:
Time:   0ms    20ms   40ms   60ms   80ms   100ms
        |------|------|------|------|------|
CPU:    [2000m       ][50m  ][50m  ][50m  ]
         ^
         Burst to handle request
With 500m limit (50ms quota):
        |------|------|------|------|------|
        [████████][███████THROTTLED███████]
                 ^                        ^
                 |                        |
        Quota exhausted at 25ms
                          Throttled until 100ms!
Your average CPU is well under the limit. But the burst exceeded the quota within a single period. Result: 75ms of throttling.
That request that should have taken 30ms? It took 105ms because your container was frozen for 75ms waiting for the next period.
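Here’s the same arithmetic as a small Go sketch, using the hypothetical numbers from the diagram above (a 2000m burst against a 500m limit):
package main

import "fmt"

func main() {
	const periodMs = 100.0
	limitMillicores, burstMillicores := 500.0, 2000.0

	quotaMs := limitMillicores / 1000 * periodMs        // 50ms of CPU time per period
	exhaustedAtMs := quotaMs / (burstMillicores / 1000) // quota gone at 25ms wall clock
	throttledMs := periodMs - exhaustedAtMs             // frozen for the remaining 75ms

	fmt.Printf("quota %.0fms, exhausted at %.0fms, throttled for %.0fms\n",
		quotaMs, exhaustedAtMs, throttledMs)
}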
Measuring Throttling ¶
cgroup Metrics ¶
Throttling is tracked in the cgroup filesystem:
# For cgroups v2
cat /sys/fs/cgroup/<pod-cgroup>/cpu.stat
usage_usec 123456789
user_usec 100000000
system_usec 23456789
nr_periods 50000
nr_throttled 5000 # <-- Throttled 5000 times!
throttled_usec 300000000 # <-- 300 seconds total throttle time
Key metrics:
- nr_periods: Total scheduling periods
- nr_throttled: Periods in which the container was throttled
- throttled_usec: Total time spent throttled (microseconds)
Calculating Throttle Percentage ¶
Throttle % = (nr_throttled / nr_periods) × 100
Example:
nr_throttled = 5000
nr_periods = 50000
Throttle % = 10%
10% throttling means in 10% of all 100ms periods, your container hit its quota and was frozen.
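If you want to compute this from inside a container, here’s a minimal Go sketch, assuming cgroups v2 (where the container sees its own stats at /sys/fs/cgroup/cpu.stat):
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	// cgroup v2: the container's own cpu.stat as seen from inside the container
	data, err := os.ReadFile("/sys/fs/cgroup/cpu.stat")
	if err != nil {
		panic(err)
	}

	stats := map[string]int64{}
	for _, line := range strings.Split(string(data), "\n") {
		fields := strings.Fields(line)
		if len(fields) == 2 {
			v, _ := strconv.ParseInt(fields[1], 10, 64)
			stats[fields[0]] = v
		}
	}

	if stats["nr_periods"] > 0 {
		pct := 100 * float64(stats["nr_throttled"]) / float64(stats["nr_periods"])
		fmt.Printf("throttled in %.1f%% of periods (%.1fs total)\n",
			pct, float64(stats["throttled_usec"])/1e6)
	}
}
These counters are lifetime totals; sample them twice a few seconds apart and diff if you want a rate.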
Prometheus Metrics ¶
If you’re using cAdvisor or the kubelet metrics endpoint:
# Throttling rate
rate(container_cpu_cfs_throttled_periods_total[5m])
/
rate(container_cpu_cfs_periods_total[5m])
# Throttle time per second
rate(container_cpu_cfs_throttled_seconds_total[5m])
kubectl top vs Reality ¶
kubectl top pod my-pod
NAME CPU(cores) MEMORY(bytes)
my-pod 200m 512Mi
This shows average CPU over the sampling window. It doesn’t show:
- Bursts within the window
- Throttling events
- Per-period behavior
You can be at 200m average and still be heavily throttled because of bursts.
The Multi-Core Trap ¶
CPU limits get even more confusing with multi-threaded applications.
Scenario: 4 Threads, 2 CPU Limit ¶
resources:
limits:
cpu: "2" # 200ms quota per 100ms period
Your app has 4 threads that all wake up to handle a request:
Thread 1: |██████████| (50ms CPU)
Thread 2: |██████████| (50ms CPU)
Thread 3: |██████████| (50ms CPU)
Thread 4: |██████████| (50ms CPU)
^ ^
0ms 50ms
Total CPU time used: 200ms
Quota: 200ms per 100ms period
All 4 threads ran for 50ms wall-clock time, consuming 200ms of CPU time total. Quota exhausted at wall-clock 50ms. All threads throttled for remaining 50ms of the period.
Wall clock: 0ms 50ms 100ms
|------|-----------|
Thread 1: [█████][THROTTLED ]
Thread 2: [█████][THROTTLED ]
Thread 3: [█████][THROTTLED ]
Thread 4: [█████][THROTTLED ]
From the application’s perspective, a task that needed 50ms of wall-clock time took 100ms because of throttling.
The Parallel Burst Problem ¶
High-parallelism workloads burn through quota fast:
8 threads × 25ms each = 200ms CPU time
With 2 CPU limit (200ms quota):
Quota exhausted in 25ms wall-clock time!
Even though you’re “only” using 2 CPUs worth of work, you’re using it all at once. The quota doesn’t care about parallelism—it’s a budget of CPU microseconds.
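A short Go sketch of that budget math, using the 2-CPU limit from above and a hypothetical number of threads running flat out:
package main

import "fmt"

func main() {
	const periodMs, quotaMs = 100.0, 200.0 // 2-CPU limit: 200ms of CPU time per 100ms period

	for _, threads := range []float64{1, 2, 4, 8} {
		exhaustedAtMs := quotaMs / threads
		if exhaustedAtMs > periodMs {
			exhaustedAtMs = periodMs // can't exhaust the quota within a single period
		}
		fmt.Printf("%2.0f threads: quota gone after %5.1fms, throttled for %5.1fms\n",
			threads, exhaustedAtMs, periodMs-exhaustedAtMs)
	}
}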
Real-World Impact ¶
Latency Spikes ¶
Throttling causes latency spikes, not average latency increases:
Without throttling:
p50: 10ms, p99: 30ms
With 20% throttling:
p50: 10ms, p99: 130ms ← Tail latency explodes
When you get throttled, you wait for the start of the next period, which can be up to 100ms away. That stall is added to whatever you were doing, and it repeats in every period where you hit the quota.
Cascading Failures ¶
Service A calls Service B with a 100ms timeout. Service B gets throttled for 80ms. Service A times out. Service A retries. Service B gets more requests. More throttling. Cascade.
Health Check Failures ¶
Kubelet sends a health check. Container is throttled. Health check times out. Kubelet kills the pod. Repeat.
livenessProbe:
httpGet:
path: /health
timeoutSeconds: 1 # 1 second might not be enough if throttled
Diagnosing Throttling ¶
Step 1: Check if Throttling is Happening ¶
# Find the container ID (used to locate its cgroup directory)
CONTAINER_ID=$(kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].containerID}' | cut -d'/' -f3)
# SSH to the node and inspect the container's cgroup
# (the exact path depends on your cgroup driver and the pod's QoS class)
cat /sys/fs/cgroup/kubepods/pod<pod-uid>/<container-id>/cpu.stat
Or use Prometheus:
# Top 10 throttled containers
topk(10,
rate(container_cpu_cfs_throttled_periods_total{container!=""}[5m])
/
rate(container_cpu_cfs_periods_total{container!=""}[5m])
)
Step 2: Compare Throttling to Utilization ¶
# CPU utilization
rate(container_cpu_usage_seconds_total{container="my-container"}[5m])
# Throttle rate
rate(container_cpu_cfs_throttled_periods_total{container="my-container"}[5m])
/
rate(container_cpu_cfs_periods_total{container="my-container"}[5m])
If utilization is low but throttling is high, you have bursty workloads hitting quota limits.
Step 3: Look at Request Patterns ¶
Are latency spikes correlated with incoming request bursts? Use tracing or request logs to correlate.
10:00:00 - Request burst (50 concurrent)
10:00:00 - Throttle rate spikes to 40%
10:00:00 - p99 latency spikes to 500ms
Solutions ¶
Option 1: Remove CPU Limits ¶
The nuclear option: don’t set CPU limits at all.
resources:
requests:
cpu: "500m" # Scheduler uses this for placement
# No limits! # Container can burst freely
Pros:
- No throttling, ever
- Bursts are handled gracefully
Cons:
- Noisy neighbor problem: One container can starve others
- Harder capacity planning
- May violate resource quotas
This is appropriate for:
- Latency-sensitive workloads
- Trusted workloads (you control all code on the node)
- When requests are set correctly for bin packing
Option 2: Set Limits = Requests (Guaranteed QoS) ¶
resources:
requests:
cpu: "2"
limits:
cpu: "2"
With Guaranteed QoS, requests and limits match, so the container is never promised more CPU than it can actually get: less contention from neighbors, more predictable behavior. You still get throttled at the limit, but you always know exactly where that line is. (Truly dedicated cores require the CPU Manager static policy; see Option 5.)
Pros:
- Predictable behavior
- Highest priority during resource pressure
Cons:
- Can’t burst above request
- May waste resources if workload is variable
Option 3: Increase Limits ¶
If you’re throttling at 1 CPU limit with 500m average usage, try 2 CPU limit:
resources:
requests:
cpu: "500m" # What you typically use
limits:
cpu: "2" # Headroom for bursts
Rule of thumb: Set limits to 2-3x your p99 CPU usage, not your average.
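As a trivial sketch of that rule of thumb (the 2-3x factor is this post’s heuristic, not anything Kubernetes enforces, and the p99 value is a hypothetical measurement):
package main

import "fmt"

// suggestLimit sizes the limit from observed p99 CPU, not the average.
func suggestLimit(p99Millicores, headroom float64) float64 {
	return p99Millicores * headroom
}

func main() {
	p99 := 650.0 // hypothetical measured p99 CPU usage, in millicores
	fmt.Printf("p99 %.0fm -> limit between %.0fm and %.0fm\n",
		p99, suggestLimit(p99, 2), suggestLimit(p99, 3))
}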
Option 4: Tune CFS Period (Advanced) ¶
The default 100ms period can be changed. Shorter periods reduce the maximum throttle duration but increase scheduling overhead. On Kubernetes the knob is the kubelet’s cpuCFSQuotaPeriod setting (behind the CustomCPUCFSQuotaPeriod feature gate); poking the cgroup files directly, as below, only illustrates where the value lives.
# Check the current period (in microseconds; cgroup v1 path shown)
cat /sys/fs/cgroup/cpu/kubepods/cpu.cfs_period_us
100000
# Shorter period (10ms) - requires node/kubelet configuration to persist
echo 10000 > /sys/fs/cgroup/cpu/kubepods/cpu.cfs_period_us
With 10ms periods and 1 CPU limit:
- Quota: 10ms per period
- Max throttle duration: 10ms instead of 100ms
- More frequent throttling, but shorter each time
Trade-off: Shorter periods have higher scheduling overhead.
Option 5: CPU Manager (Static Policy) ¶
For latency-critical workloads, use CPU Manager with static policy to get dedicated CPUs:
# kubelet configuration
cpuManagerPolicy: static
# Pod with integer CPU request gets dedicated cores
resources:
requests:
cpu: "2" # Must be integer
limits:
cpu: "2" # Must equal requests
With static policy, containers with integer CPU requests get pinned to specific CPU cores. No CFS bandwidth control, no throttling.
Pros:
- Zero throttling
- Best latency
Cons:
- Requires Guaranteed QoS
- Must request whole CPUs
- Fragments node capacity
Option 6: Reduce Parallelism ¶
If your app spawns too many threads:
// Bad: Unbounded parallelism
for _, item := range items {
go process(item)
}
// Better: Bounded parallelism
sem := make(chan struct{}, runtime.NumCPU()) // note: NumCPU reports the node's cores, not your limit (see below)
for _, item := range items {
sem <- struct{}{}
go func(item Item) {
defer func() { <-sem }()
process(item)
}(item)
}
For GOMAXPROCS in Go, or thread pools in other languages, consider setting them based on your CPU limit, not the node’s CPUs:
import _ "go.uber.org/automaxprocs" // Automatically sets GOMAXPROCS based on cgroup
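If you’d rather not pull in a dependency, the core idea is small. Here’s a minimal sketch assuming cgroups v2: read the container’s cpu.max and cap GOMAXPROCS at the CPU limit instead of the node’s core count:
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// limitAwareGOMAXPROCS reads cgroup v2 cpu.max ("<quota> <period>" in
// microseconds, or "max <period>" when no limit is set) and returns the
// CPU limit rounded down, never less than 1.
func limitAwareGOMAXPROCS() int {
	data, err := os.ReadFile("/sys/fs/cgroup/cpu.max")
	if err != nil {
		return runtime.NumCPU() // not cgroup v2 (or not in a container): fall back
	}
	fields := strings.Fields(string(data))
	if len(fields) != 2 || fields[0] == "max" {
		return runtime.NumCPU() // no CPU limit configured
	}
	quota, _ := strconv.ParseFloat(fields[0], 64)
	period, _ := strconv.ParseFloat(fields[1], 64)
	if procs := int(quota / period); procs >= 1 { // e.g. 200000/100000 -> 2
		return procs
	}
	return 1
}

func main() {
	runtime.GOMAXPROCS(limitAwareGOMAXPROCS())
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}
The real library also handles cgroups v1 and various edge cases; treat this as an illustration of the idea, not a replacement.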
Kubernetes Improvements: Throttling Visibility ¶
Kubernetes has been working on improvements to CPU throttling visibility and control. The PodAndContainerStatsFromCRI feature gate, for example, moves container stats collection from cAdvisor to the CRI, which exposes throttling metrics more consistently across runtimes.
Check which feature gates your kubelets enable:
kubectl get cm -n kube-system kubelet-config -o yaml | grep -iA5 featureGates
Practical Recommendations ¶
For Latency-Sensitive Services ¶
resources:
requests:
cpu: "500m"
limits:
cpu: "2000m" # 4x headroom for bursts, or remove entirely
Or remove limits entirely if you trust your workloads.
For Batch Jobs ¶
resources:
requests:
cpu: "1"
limits:
cpu: "1" # Guaranteed QoS, predictable scheduling
Batch jobs care about throughput, not latency. Throttling is acceptable.
For Mixed Workloads ¶
Separate latency-sensitive and batch workloads onto different nodes using taints/tolerations:
# Latency-sensitive nodes: no CPU limits enforced
# Batch nodes: CPU limits enforced, bin-packed
Monitoring Recommendations ¶
Always monitor:
# Alert if any container is throttled more than 25%
avg(
rate(container_cpu_cfs_throttled_periods_total[5m])
/
rate(container_cpu_cfs_periods_total[5m])
) by (namespace, pod, container) > 0.25
Summary ¶
CPU limits in Kubernetes don’t mean what you think:
| What You Think | Reality |
|---|---|
| “Max 1 CPU” | “100ms CPU time per 100ms period” |
| “I’m at 50% utilization” | “I might still be throttled on bursts” |
| “Limit > usage, so no problem” | “Bursts within a period can still throttle” |
Throttling causes:
- Latency spikes (not averages)
- p99 degradation
- Cascading failures
- Health check timeouts
Diagnose with:
- cgroup cpu.stat: nr_throttled, throttled_usec
- Prometheus: container_cpu_cfs_throttled_* metrics
Fix with:
- Remove limits (for trusted, latency-sensitive workloads)
- Increase limits (2-3x peak, not average)
- Guaranteed QoS (limits = requests)
- CPU Manager static policy (dedicated cores)
- Reduce application parallelism
The safest approach for latency-sensitive services: set CPU requests accurately for scheduling, and either remove limits or set them very high. Let the scheduler handle placement; don’t let CFS bandwidth control ruin your latency.