CPU Throttling in Kubernetes: Why Your Limits Are Lying to You


Your application has plenty of CPU headroom—at least according to your metrics. Average CPU is 200m, limit is 1000m. But requests are timing out, p99 latency is through the roof, and users are complaining.

The culprit: CPU throttling. Your containers are being throttled even though they’re nowhere near their “limit.” This post explains why, how to detect it, and what to do about it.

When you set a CPU limit, you’re not saying “this container can use up to 1 CPU.” You’re saying “in any 100ms period, this container can use up to 100ms of CPU time.”

That’s a very different statement.

resources:
  limits:
    cpu: "1"  # NOT "up to 1 CPU"
              # Actually: "100ms of CPU time per 100ms period"

This is the CFS (Completely Fair Scheduler) bandwidth control mechanism. And it’s the source of most CPU throttling pain.

The Linux CFS scheduler allocates CPU time in periods (default: 100ms). Your CPU limit translates to a quota within that period.

CPU Limit    Period    Quota
---------    ------    -----
500m         100ms     50ms
1            100ms     100ms
2            100ms     200ms
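
The conversion is just millicores scaled to the period length. A quick sketch of the arithmetic (this is not a Kubernetes API, just the math behind the table above):

package main

import "fmt"

// quotaMicros converts a CPU limit in millicores into a CFS quota:
// microseconds of CPU time allowed per scheduling period.
func quotaMicros(milliCPU, periodMicros int64) int64 {
    return milliCPU * periodMicros / 1000
}

func main() {
    const period = 100000 // default CFS period: 100ms, in microseconds
    for _, m := range []int64{500, 1000, 2000} {
        fmt.Printf("%5dm -> %6dus quota per %dus period\n", m, quotaMicros(m, period), period)
    }
}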

In any 100ms period, your container can use up to its quota of CPU time. Once exhausted, it’s throttled—all threads are blocked until the next period starts.

Period 1 (0-100ms)
|████████████████░░░░░░░░░░░░░░░░░░░░░|
 ^              ^                     ^
 |              |                     |
 Start          Quota exhausted       Period ends
                at 40ms               (throttled for 60ms!)

Applications aren’t steady-state. They burst. A web server might idle at 50m CPU, then wake ten worker threads at once when a batch of requests arrives, briefly using ten cores’ worth of CPU.

Actual CPU usage pattern:

Time: 0ms    20ms   40ms   60ms   80ms   100ms
      |------|------|------|------|------|
CPU:  [50m   ][10000m][50m  ][50m  ][50m  ]
                ^
                Burst: ten threads handling a batch of requests

With 500m limit (50ms quota):
      |------|------|------|------|------|
      [50m   ][████THROTTLED████][50m  ]
               ^    ^
               |    |
               Quota exhausted at 25ms
               Throttled until 100ms!

Ten busy threads burn CPU time at ten times wall-clock speed, so the 50ms quota is gone about 5ms into the burst, around the 25ms mark. Your average CPU over the monitoring window is still well under the limit, but the burst exhausted the quota within a single period. Result: 75ms of throttling.

That request that should have taken 30ms? It took 105ms because your container was frozen for 75ms waiting for the next period.

Throttling is tracked in the cgroup filesystem:

# For cgroups v2
cat /sys/fs/cgroup/<pod-cgroup>/cpu.stat

usage_usec 123456789
user_usec 100000000
system_usec 23456789
nr_periods 50000
nr_throttled 5000        # <-- Throttled 5000 times!
throttled_usec 300000000 # <-- 300 seconds total throttle time

Key metrics:

  • nr_periods: Total scheduling periods
  • nr_throttled: Periods where container was throttled
  • throttled_usec: Total time spent throttled (microseconds)

Throttle % = (nr_throttled / nr_periods) × 100

Example:
nr_throttled = 5000
nr_periods = 50000
Throttle % = 10%

10% throttling means in 10% of all 100ms periods, your container hit its quota and was frozen.
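
If you’d rather compute this from inside the container, here is a minimal sketch, assuming cgroups v2 with the container’s own cgroup visible at /sys/fs/cgroup (the common default; adjust the path for other setups):

package main

import (
    "fmt"
    "os"
    "strconv"
    "strings"
)

func main() {
    // cgroups v2: the container's own cpu.stat
    data, err := os.ReadFile("/sys/fs/cgroup/cpu.stat")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }

    // Parse "key value" lines into a map
    stats := map[string]uint64{}
    for _, line := range strings.Split(strings.TrimSpace(string(data)), "\n") {
        fields := strings.Fields(line)
        if len(fields) != 2 {
            continue
        }
        if v, err := strconv.ParseUint(fields[1], 10, 64); err == nil {
            stats[fields[0]] = v
        }
    }

    if stats["nr_periods"] == 0 {
        fmt.Println("no CFS periods recorded (no CPU limit set?)")
        return
    }
    throttlePct := 100 * float64(stats["nr_throttled"]) / float64(stats["nr_periods"])
    fmt.Printf("throttled in %.1f%% of periods, %.1fs total throttle time\n",
        throttlePct, float64(stats["throttled_usec"])/1e6)
}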

If you’re using cAdvisor or the kubelet metrics endpoint:

# Throttling rate
rate(container_cpu_cfs_throttled_periods_total[5m]) 
/ 
rate(container_cpu_cfs_periods_total[5m])

# Throttle time per second
rate(container_cpu_cfs_throttled_seconds_total[5m])

Meanwhile, kubectl top looks perfectly healthy:

kubectl top pod my-pod
NAME     CPU(cores)   MEMORY(bytes)
my-pod   200m         512Mi

This shows average CPU over the sampling window. It doesn’t show:

  • Bursts within the window
  • Throttling events
  • Per-period behavior

You can be at 200m average and still be heavily throttled because of bursts.

CPU limits get even more confusing with multi-threaded applications.

resources:
  limits:
    cpu: "2"  # 200ms quota per 100ms period

Your app has 4 threads that all wake up to handle a request:

Thread 1: |██████████|              (50ms CPU)
Thread 2: |██████████|              (50ms CPU)
Thread 3: |██████████|              (50ms CPU)
Thread 4: |██████████|              (50ms CPU)
          ^         ^
          0ms      50ms

Total CPU time used: 200ms
Quota: 200ms per 100ms period

All 4 threads ran for 50ms wall-clock time, consuming 200ms of CPU time total. Quota exhausted at wall-clock 50ms. All threads throttled for remaining 50ms of the period.

Wall clock: 0ms    50ms        100ms
            |------|-----------|
Thread 1:   [█████][THROTTLED ]
Thread 2:   [█████][THROTTLED ]
Thread 3:   [█████][THROTTLED ]
Thread 4:   [█████][THROTTLED ]

From the application’s perspective, a task that needed 50ms of wall-clock time took 100ms because of throttling.

High-parallelism workloads burn through quota fast:

8 threads × 25ms each = 200ms CPU time
With 2 CPU limit (200ms quota):
Quota exhausted in 25ms wall-clock time!

Even though you’re “only” using 2 CPUs worth of work, you’re using it all at once. The quota doesn’t care about parallelism—it’s a budget of CPU microseconds.
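
A back-of-the-envelope sketch of that budget math, assuming every thread stays fully busy until the quota runs out:

package main

import "fmt"

// msUntilThrottled: wall-clock milliseconds until N fully busy threads
// exhaust a CFS quota, ignoring everything else running on the node.
func msUntilThrottled(quotaMs float64, busyThreads int) float64 {
    return quotaMs / float64(busyThreads)
}

func main() {
    // 2-CPU limit = 200ms of quota per 100ms period
    fmt.Printf("4 threads: quota gone after %.0fms\n", msUntilThrottled(200, 4)) // 50ms
    fmt.Printf("8 threads: quota gone after %.0fms\n", msUntilThrottled(200, 8)) // 25ms
}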

Throttling causes latency spikes, not average latency increases:

Without throttling:
p50: 10ms, p99: 30ms

With 20% throttling:
p50: 10ms, p99: 130ms  ← Tail latency explodes

When you get throttled, you wait out the rest of the current period, which can be close to the full 100ms, before any thread runs again. That wait lands directly on whatever request was in flight.

Service A calls Service B with a 100ms timeout. Service B gets throttled for 80ms. Service A times out. Service A retries. Service B gets more requests. More throttling. Cascade.

Kubelet sends a health check. Container is throttled. Health check times out. Kubelet kills the pod. Repeat.

livenessProbe:
  httpGet:
    path: /health
    port: 8080         # example port
  timeoutSeconds: 1    # 1 second might not be enough if throttled

To confirm throttling for a specific pod, read its cgroup stats on the node:

# Find the container ID
CONTAINER_ID=$(kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].containerID}' | cut -d'/' -f3)

# SSH to the node and check (exact path depends on cgroup driver and QoS class)
cat /sys/fs/cgroup/kubepods/pod<pod-uid>/$CONTAINER_ID/cpu.stat

Or use Prometheus:

# Top 10 throttled containers
topk(10, 
  rate(container_cpu_cfs_throttled_periods_total{container!=""}[5m]) 
  / 
  rate(container_cpu_cfs_periods_total{container!=""}[5m])
)

Then compare utilization against throttle rate for the suspect container:

# CPU utilization
rate(container_cpu_usage_seconds_total{container="my-container"}[5m])

# Throttle rate
rate(container_cpu_cfs_throttled_periods_total{container="my-container"}[5m])
/
rate(container_cpu_cfs_periods_total{container="my-container"}[5m])

If utilization is low but throttling is high, you have bursty workloads hitting quota limits.

Are latency spikes correlated with incoming request bursts? Use tracing or request logs to correlate.

10:00:00 - Request burst (50 concurrent)
10:00:00 - Throttle rate spikes to 40%
10:00:00 - p99 latency spikes to 500ms

The nuclear option: don’t set CPU limits at all.

resources:
  requests:
    cpu: "500m"    # Scheduler uses this for placement
  # No limits!     # Container can burst freely

Pros:

  • No throttling, ever
  • Bursts are handled gracefully

Cons:

  • Noisy neighbor problem: One container can starve others
  • Harder capacity planning
  • May violate resource quotas

This is appropriate for:

  • Latency-sensitive workloads
  • Trusted workloads (you control all code on the node)
  • When requests are set correctly for bin packing

Another option: Guaranteed QoS, with limits set equal to requests:

resources:
  requests:
    cpu: "2"
  limits:
    cpu: "2"

With Guaranteed QoS, the pod sits in the top QoS class: it is the last to be evicted under node pressure, and with the static CPU Manager policy (covered below) it can get dedicated cores. Throttling still happens whenever a burst exceeds the quota, but behavior is more predictable.

Pros:

  • Predictable behavior
  • Highest priority during resource pressure

Cons:

  • Can’t burst above request
  • May waste resources if workload is variable

If you’re throttling at a 1 CPU limit with 500m average usage, try a 2 CPU limit:

resources:
  requests:
    cpu: "500m"  # What you typically use
  limits:
    cpu: "2"     # Headroom for bursts

Rule of thumb: Set limits to 2-3x your p99 CPU usage, not your average.

The default 100ms period can be changed. Shorter periods reduce max throttle duration but increase overhead.

# Check the current period (in microseconds; cgroups v1 path shown)
cat /sys/fs/cgroup/cpu/kubepods/cpu.cfs_period_us
100000

# On cgroups v2, the period is the second field of cpu.max
cat /sys/fs/cgroup/<pod-cgroup>/cpu.max
100000 100000

# In Kubernetes, change the period via the kubelet's cpuCFSQuotaPeriod setting
# (behind the CustomCPUCFSQuotaPeriod feature gate), e.g. cpuCFSQuotaPeriod: 10ms
# Writing to the cgroup by hand doesn't change per-container periods; the kubelet manages those.

With 10ms periods and 1 CPU limit:

  • Quota: 10ms per period
  • Max throttle duration: 10ms instead of 100ms
  • More frequent throttling, but shorter each time

Trade-off: Shorter periods have higher scheduling overhead.

For latency-critical workloads, use CPU Manager with static policy to get dedicated CPUs:

# kubelet configuration
cpuManagerPolicy: static
# Pod with integer CPU request gets dedicated cores
resources:
  requests:
    cpu: "2"    # Must be integer
  limits:
    cpu: "2"    # Must equal requests

With static policy, containers with integer CPU requests get pinned to specific CPU cores. No CFS bandwidth control, no throttling.

Pros:

  • Zero throttling
  • Best latency

Cons:

  • Requires Guaranteed QoS
  • Must request whole CPUs
  • Fragments node capacity

If your app spawns too many threads:

// Bad: unbounded parallelism. A large batch wakes far more goroutines
// than the CPU quota covers, so they all stall together when it runs out.
for _, item := range items {
    go process(item)
}

// Better: bounded parallelism. Cap concurrency at GOMAXPROCS, which matches
// the CPU limit when automaxprocs is used (see below), rather than at
// runtime.NumCPU(), which reports the node's cores, not the container's quota.
sem := make(chan struct{}, runtime.GOMAXPROCS(0))
for _, item := range items {
    sem <- struct{}{}                // acquire a slot
    go func(item Item) {
        defer func() { <-sem }()     // release the slot
        process(item)
    }(item)
}

For GOMAXPROCS in Go, or thread pools in other languages, consider setting them based on your CPU limit, not the node’s CPUs:

import _ "go.uber.org/automaxprocs"  // Automatically sets GOMAXPROCS based on cgroup
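
A minimal sketch of wiring that up; the blank import does the work at init, and the log line is only there to verify the value inside the container:

package main

import (
    "log"
    "runtime"

    _ "go.uber.org/automaxprocs" // sets GOMAXPROCS from the container's CPU quota
)

func main() {
    // With limits.cpu: "2", this logs 2 instead of the node's core count.
    log.Printf("GOMAXPROCS=%d", runtime.GOMAXPROCS(0))
}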

Kubernetes has been working on improvements to CPU throttling visibility and control. The PodAndContainerStatsFromCRI feature exposes throttling metrics more consistently.

Check your cluster’s feature gates:

kubectl get cm -n kube-system kubelet-config -o yaml | grep -A5 featureGates

For latency-sensitive services, give bursts generous headroom:

resources:
  requests:
    cpu: "500m"
  limits:
    cpu: "2000m"  # 4x headroom for bursts, or remove entirely

Or remove limits entirely if you trust your workloads.

For batch and background jobs, predictability matters more than burst headroom:

resources:
  requests:
    cpu: "1"
  limits:
    cpu: "1"  # Guaranteed QoS, predictable scheduling

Batch jobs care about throughput, not latency. Throttling is acceptable.

Separate latency-sensitive and batch workloads onto different nodes using taints/tolerations:

# Latency-sensitive nodes: no CPU limits enforced
# Batch nodes: CPU limits enforced, bin-packed

Always monitor:

# Alert if any container is throttled more than 25%
avg(
  rate(container_cpu_cfs_throttled_periods_total[5m])
  /
  rate(container_cpu_cfs_periods_total[5m])
) by (namespace, pod, container) > 0.25

CPU limits in Kubernetes don’t mean what you think:

What You Think                    Reality
--------------                    -------
“Max 1 CPU”                       “100ms CPU time per 100ms period”
“I’m at 50% utilization”          “I might still be throttled on bursts”
“Limit > usage, so no problem”    “Bursts within a period can still throttle”

Throttling causes:

  • Latency spikes (not averages)
  • p99 degradation
  • Cascading failures
  • Health check timeouts

Diagnose with:

  • cgroup cpu.stat: nr_throttled, throttled_usec
  • Prometheus: container_cpu_cfs_throttled_* metrics

Fix with:

  • Remove limits (for trusted, latency-sensitive workloads)
  • Increase limits (2-3x peak, not average)
  • Guaranteed QoS (limits = requests)
  • CPU Manager static policy (dedicated cores)
  • Reduce application parallelism

The safest approach for latency-sensitive services: set CPU requests accurately for scheduling, and either remove limits or set them very high. Let the scheduler handle placement; don’t let CFS bandwidth control ruin your latency.