Every Kubernetes controller depends on watches. Create a Deployment, and within seconds the ReplicaSet controller sees it and creates pods. But how does this actually work? The answer runs from etcd’s data model through the API server’s watch cache to your client’s informer — and understanding this chain explains why watches sometimes fail and how to fix them.
etcd’s Data Model ¶
Kubernetes stores all cluster state in etcd, a distributed key-value store. But etcd isn’t a simple overwrite-in-place store — it uses Multi-Version Concurrency Control (MVCC).
Revisions, Not Overwrites ¶
When you update a key in etcd, it doesn’t overwrite the old value. It creates a new revision:
Revision 100: /registry/pods/default/nginx -> {pod spec v1}
Revision 101: /registry/pods/default/nginx -> {pod spec v2} # Updated
Revision 102: /registry/pods/default/redis -> {pod spec v1} # New pod
Revision 103: /registry/pods/default/nginx -> tombstone # Deleted
Every write operation increments a global revision counter. This revision is monotonically increasing across the entire etcd cluster — not per key.
Key insight: You can ask etcd “what changed after revision 100?” and get a consistent stream of all modifications.
History and Compaction ¶
etcd keeps historical revisions, but not forever. Compaction removes old revisions to reclaim space:
Before compaction (keeping revisions 100-200):
Revision 100: key1 -> value1
Revision 101: key1 -> value2
...
Revision 200: key1 -> value100
After compaction at revision 150:
Revision 150: key1 -> value50 # Oldest available
...
Revision 200: key1 -> value100
After compaction, you cannot watch from revision 100 — that history is gone. This becomes important later.
etcd Watches ¶
etcd natively supports watches:
// Watch all changes to keys with prefix "/registry/pods/" starting from revision 1000
watcher := client.Watch(context.Background(), "/registry/pods/",
clientv3.WithPrefix(),
clientv3.WithRev(1000))
for response := range watcher {
for _, event := range response.Events {
fmt.Printf("Type: %s, Key: %s, Revision: %d\n",
event.Type, event.Kv.Key, event.Kv.ModRevision)
}
}
The watch returns a stream of events: PUT (create/update) and DELETE operations, each tagged with the revision when it occurred.
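Tying this back to compaction: if the revision passed to WithRev has already been compacted, etcd cancels the watch instead of streaming events. A small sketch with the etcd v3 Go client (client setup omitted; the prefix and revision are illustrative):
import (
	"context"
	"fmt"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func watchFromCompactedRevision(ctx context.Context, cli *clientv3.Client) {
	// Ask for history starting at revision 100, which compaction has already removed.
	ch := cli.Watch(ctx, "/registry/pods/", clientv3.WithPrefix(), clientv3.WithRev(100))
	for resp := range ch {
		if resp.Canceled {
			// etcd cancels the watch and reports the oldest revision it still has.
			fmt.Printf("watch canceled: %v (compacted through revision %d)\n",
				resp.Err(), resp.CompactRevision)
			return
		}
		for _, ev := range resp.Events {
			fmt.Printf("%s %s @ rev %d\n", ev.Type, ev.Kv.Key, ev.Kv.ModRevision)
		}
	}
}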
The Problem: etcd Can’t Handle Kubernetes Scale ¶
Here’s the issue: a busy Kubernetes cluster might have thousands of watchers.
- Every kubelet watches pods scheduled to its node
- Every controller watches its relevant resources
- Every client running kubectl get pods -w opens a watch
- Service meshes, monitoring, logging — all watching
A 5,000 node cluster easily has 10,000+ concurrent watches. etcd can handle this in theory, but:
- Memory: Each watch consumes memory in etcd
- Fan-out: A single pod update must be sent to potentially thousands of watchers
- Connection overhead: Each watch is a gRPC stream
Having every Kubernetes component directly watch etcd would kill it.
The API Server Watch Cache ¶
The API server solves this with a watch cache — a layer between etcd and clients.
Architecture ¶
+-------------------+
| Client Watch 1 |
+---------+---------+
^
|
+--------+ +---------------+ +--+--+ +-------------------+
| etcd |---->| Watch Cache |---->| Fan |---->| Client Watch N |
+--------+ +---------------+ +--+--+ +-------------------+
|
One etcd watch Broadcaster fans out
per resource type to many client watches
How it works:
- The API server opens one watch per resource type to etcd (e.g., one watch for all pods)
- Events flow into the watch cache, which stores recent events in memory
- The broadcaster fans out events to all client watches
- Clients watch the API server, not etcd directly
This transforms the problem:
- etcd handles a handful of watches (one per resource type)
- The API server handles thousands of client watches
- The watch cache absorbs the fan-out cost (a toy sketch of the fan-out follows)
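The fan-out itself is conceptually just channel broadcasting. This is a toy illustration of the pattern only, not the apiserver’s actual broadcaster:
import "sync"

// Event is a stand-in for a watch event (type + object key).
type Event struct {
	Type string // "ADDED", "MODIFIED", "DELETED"
	Key  string
}

type fanOut struct {
	mu   sync.Mutex
	subs []chan Event
}

// Subscribe registers a downstream client watch.
func (f *fanOut) Subscribe() <-chan Event {
	f.mu.Lock()
	defer f.mu.Unlock()
	ch := make(chan Event, 100) // buffered so one slow client doesn't stall the others
	f.subs = append(f.subs, ch)
	return ch
}

// Run copies every event from the single upstream (etcd) watch to all subscribers.
func (f *fanOut) Run(upstream <-chan Event) {
	for ev := range upstream {
		f.mu.Lock()
		for _, ch := range f.subs {
			select {
			case ch <- ev:
			default: // slow subscribers need a policy; the apiserver eventually closes watchers that can't keep up
			}
		}
		f.mu.Unlock()
	}
}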
Inside the Watch Cache ¶
The watch cache (k8s.io/apiserver/pkg/storage/cacher) maintains:
A sliding window of recent events:
type watchCache struct {
// Ring buffer of recent events
cache []*watchCacheEvent
startIndex int
endIndex int
// All objects currently in the cache (latest version)
store cache.Indexer
// Current resource version
resourceVersion uint64
}
Event storage:
type watchCacheEvent struct {
Type watch.EventType // ADDED, MODIFIED, DELETED
Object runtime.Object // The object
ObjLabels labels.Set // For filtering
ObjFields fields.Set // For filtering
PrevObject runtime.Object // Previous version (for MODIFIED)
ResourceVersion uint64
}
When a client starts a watch:
- If they request a specific resourceVersion and it’s in the cache window -> replay from that point
- If the version is too old (not in window) -> return “410 Gone”
- If they request resourceVersion=“0” -> start from current state
- Then stream new events as they arrive (a simplified sketch of this decision follows)
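As an illustration of that decision only (this is not the real cacher code; the helpers oldestBufferedRV, replayFrom, and streamFromCurrent are hypothetical):
import apierrors "k8s.io/apimachinery/pkg/api/errors"

// Illustrative sketch; helper functions are hypothetical.
func serveWatch(requestedRV uint64, c *watchCache) error {
	switch {
	case requestedRV == 0:
		// resourceVersion="0": start from whatever state the cache holds right now.
		return streamFromCurrent(c)
	case requestedRV >= oldestBufferedRV(c):
		// Still inside the ring buffer: replay buffered events, then stream live ones.
		return replayFrom(c, requestedRV)
	default:
		// Older than anything buffered: the client has to re-list.
		return apierrors.NewResourceExpired("too old resource version")
	}
}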
The Cacher ¶
The Cacher is the component that ties it together:
type Cacher struct {
// Underlying storage (etcd)
storage storage.Interface
// The watch cache
watchCache *watchCache
// Broadcasts events to watchers
watchers indexedWatchers
// Handles reflector lifecycle
reflector *cache.Reflector
}
The reflector does List+Watch against etcd, feeding events into the watch cache. Client watches subscribe to the broadcaster.
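The same Reflector machinery is exposed by client-go, so you can see the List+Watch loop in isolation. A stand-alone sketch (informers normally wire this up for you; the namespace and resource are just examples):
import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

func runPodReflector(client kubernetes.Interface, stopCh <-chan struct{}) cache.Store {
	// ListerWatcher: how to List and Watch pods from the API server.
	lw := cache.NewListWatchFromClient(
		client.CoreV1().RESTClient(), "pods", "default", fields.Everything())

	// The store the reflector keeps in sync (the Cacher's equivalent is its watchCache).
	store := cache.NewStore(cache.MetaNamespaceKeyFunc)

	// List once, then watch, re-listing when the watch expires; resync period 0 disables periodic resync.
	r := cache.NewReflector(lw, &corev1.Pod{}, store, 0)
	go r.Run(stopCh)
	return store
}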
ResourceVersion Demystified ¶
Every Kubernetes object has a metadata.resourceVersion field:
apiVersion: v1
kind: Pod
metadata:
name: nginx
resourceVersion: "12345678"
This value comes from etcd’s revision system, but with a twist.
What ResourceVersion Actually Is ¶
In most cases, resourceVersion equals the etcd modification revision of that object. When you update a pod and etcd records it at revision 12345678, the pod’s resourceVersion becomes “12345678”.
But it’s not always a direct mapping:
- The API server may encode additional information
- Different storage backends could use different schemes
- It’s intentionally opaque — don’t parse or compare numerically
Treat it as an opaque string that happens to increase over time.
List and Watch ResourceVersion Semantics ¶
When you list or watch resources, resourceVersion has specific meanings:
resourceVersion="" (not specified):
- List: Return the most recent data (a quorum read from etcd, not the cache)
- Watch: Start from “now” (current resource version)
resourceVersion="0":
- List: Return from API server cache (any version)
- Watch: Start from “any” — API server chooses (usually current)
resourceVersion="12345678" (specific value):
- List: Return data at least as fresh as this version
- Watch: Start streaming from this exact point
For controllers, the pattern is:
- List with resourceVersion="" -> get current objects + a resourceVersion
- Watch with that resourceVersion -> see all changes since the list
list, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{})
if err != nil {
	return err
}

// Watch starting from where the list ended
podWatch, err := client.CoreV1().Pods("").Watch(ctx, metav1.ListOptions{
	ResourceVersion: list.ResourceVersion,
})
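From there, events arrive on a channel. A minimal consumption loop (the event types come from k8s.io/apimachinery/pkg/watch):
import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/watch"
)

defer podWatch.Stop()
for event := range podWatch.ResultChan() {
	switch event.Type {
	case watch.Added, watch.Modified:
		pod := event.Object.(*corev1.Pod)
		fmt.Printf("pod %s at rv %s\n", pod.Name, pod.ResourceVersion)
	case watch.Deleted:
		// remove from any local state
	case watch.Error:
		// the server is reporting a problem (often an expired resourceVersion):
		// re-list and start a new watch
	}
}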
The “410 Gone” Error ¶
HTTP 410 Gone
too old resource version: 500 (1000)
This happens when you request a resourceVersion that’s older than anything the server can still serve — it has fallen out of the watch cache window or been compacted from etcd.
The timeline:
Watch cache window: [revision 1000 ... revision 2000]
Client requests: Watch from revision 500
API server: "I don't have revision 500 anymore" -> 410 Gone
Causes:
- Client disconnected too long, missed too many events
- Watch cache is small relative to event rate
- etcd compacted before client reconnected
The proper response:
- Catch the 410 error
- Re-list to get current state and new resourceVersion
- Resume watching from the new resourceVersion
client-go’s informers handle this automatically.
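For clients that manage their own watches rather than using informers, a rough sketch of that list-then-watch recovery loop (error handling trimmed; apierrors is k8s.io/apimachinery/pkg/api/errors):
import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// watchPods re-lists and resumes whenever the resourceVersion expires (410 Gone).
func watchPods(ctx context.Context, client kubernetes.Interface, handle func(watch.Event)) error {
	for {
		list, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{})
		if err != nil {
			return err
		}

		w, err := client.CoreV1().Pods("").Watch(ctx, metav1.ListOptions{
			ResourceVersion: list.ResourceVersion,
		})
		if err != nil {
			if apierrors.IsResourceExpired(err) || apierrors.IsGone(err) {
				continue // our resourceVersion was compacted away: re-list
			}
			return err
		}
		for event := range w.ResultChan() {
			if event.Type == watch.Error {
				break // includes "too old resource version" sent mid-stream
			}
			handle(event)
		}
		w.Stop()
		// Channel closed or an error event arrived: loop back, re-list, resume.
	}
}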
Bookmarks ¶
Watches can go quiet — if nothing changes, no events flow. But the client needs to know: “Am I still connected? What’s the current resourceVersion?”
Bookmarks solve this. They’re synthetic events that communicate the current resourceVersion without an actual object change:
// Enable bookmarks on a pod watch ("watch" is k8s.io/apimachinery/pkg/watch,
// "meta" is k8s.io/apimachinery/pkg/api/meta)
w, err := client.CoreV1().Pods("").Watch(ctx, metav1.ListOptions{
	ResourceVersion:     "12345",
	AllowWatchBookmarks: true, // Request bookmarks
})
if err != nil {
	return err
}
for event := range w.ResultChan() {
	if event.Type == watch.Bookmark {
		// No object change, but we know we're caught up to this version
		obj, _ := meta.Accessor(event.Object)
		fmt.Printf("Bookmark at %s\n", obj.GetResourceVersion())
	}
}
Why bookmarks matter:
- Progress tracking: Client knows how far behind it is
- Faster recovery: After disconnect, client can resume from bookmark’s resourceVersion instead of re-listing
- Preventing 410: Regular bookmarks keep the client’s resourceVersion fresh
The API server sends bookmarks periodically (default: every minute if there’s no activity).
Where It Breaks ¶
Understanding the architecture reveals the failure modes.
Watch Cache Memory Pressure ¶
The watch cache has finite size. Under heavy write load:
Events arriving: 1000/second
Cache size: 1000 events
Cache window: 1 second of history
A client that disconnects for 2 seconds and tries to resume will get 410 Gone.
Symptoms:
- Controllers constantly re-listing
- High API server memory usage
- apiserver_watch_cache_capacity_increase_total metric increasing
Mitigation:
- Increase watch cache size (API server flag)
- Use bookmarks (keeps client resourceVersion fresh)
- Reduce event rate (fewer unnecessary updates)
etcd Compaction vs Slow Clients ¶
etcd compacts old revisions. If a client is watching etcd directly (rare, but some systems do), compaction can remove history the client needs.
For Kubernetes, this manifests indirectly — the watch cache’s underlying reflector hits etcd’s compaction error and falls back to a full re-list.
Symptoms:
- Periodic spikes in API server memory and etcd load
- apiserver_watch_cache_list_total metric spikes
Mitigation:
- Tune etcd compaction interval
- Ensure watch cache is sized for your workload
Watch Storms After API Server Restart ¶
When the API server restarts:
- All client watches disconnect
- API server starts fresh — empty watch cache
- All clients reconnect and re-list
- API server hammers etcd with list requests
A 1,000-node cluster with roughly 10 watching clients per node -> 10,000 simultaneous list requests.
Symptoms:
- API server slow immediately after restart
- etcd latency spikes
- Controllers report sync failures
Mitigation:
- API server caching settings (--watch-cache-sizes)
- Client backoff (client-go does this automatically with jitter)
- Priority and fairness (APF) to protect against thundering herd
The “Too Old Resource Version” Problem ¶
Informers cache the resourceVersion of the last event. If an informer is too slow processing events:
API server at revision: 10000
Informer last saw: 8000
Informer's local cache at: 8000
API server watch cache: [9000 ... 10000] (only 1000 events)
Informer tries to resume from 8000 -> 410 Gone
Symptoms:
- Informer re-lists repeatedly
- Controller appears to “miss” events
- High memory churn (re-list allocates new objects)
Mitigation:
- Speed up event handlers (don’t block the informer; see the sketch after this list)
- Increase watch cache size
- Check for slow reconcilers
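In practice, “don’t block the informer” means handlers should only enqueue a key and return; the slow reconcile work runs in separate workers that drain the queue. A sketch of that wiring (the function name addHandlers is ours; the queue is client-go’s workqueue):
import (
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// Handlers only enqueue keys; reconcile work happens in workers draining the queue,
// so event delivery from the informer never blocks on slow business logic.
func addHandlers(informer cache.SharedIndexInformer, queue workqueue.RateLimitingInterface) {
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
		UpdateFunc: func(_, newObj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(newObj); err == nil {
				queue.Add(key)
			}
		},
		DeleteFunc: func(obj interface{}) {
			if key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
	})
}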
Tuning and Debugging ¶
API Server Watch Cache Flags ¶
--watch-cache=true # Enable watch cache (default: true)
--watch-cache-sizes=pods#1000 # Per-resource cache sizes
--default-watch-cache-size=100 # Default size for resources not specified
Format for --watch-cache-sizes:
resource#size,resource#size,...
Examples:
pods#5000,secrets#1000,configmaps#1000
Larger cache = more memory, but fewer 410 errors.
etcd Settings ¶
# etcd configuration
auto-compaction-mode: periodic
auto-compaction-retention: "1h" # Keep 1 hour of history
Longer retention = more history available = fewer compaction-related 410s. But also more disk and memory usage.
Metrics to Watch ¶
API server:
# Watch cache size by resource
apiserver_watch_cache_capacity{resource="pods"}
# Events in watch cache
apiserver_watch_cache_events_received_total
# 410 errors (client needed older data than available)
apiserver_watch_cache_stale_total
# Watch count by resource
apiserver_registered_watchers{resource="pods"}
etcd:
# Unix time of the last compaction
etcd_debugging_mvcc_db_compaction_last
# Cumulative compaction duration
etcd_debugging_mvcc_db_compaction_total_duration_milliseconds
# Watch count
etcd_debugging_mvcc_watcher_total
Diagnosing Watch Disconnects ¶
Client-side symptoms:
- Informer logs show “watch ended” followed by re-list
- ResourceVersion jumps (indicates re-list happened)
- Events appear “missed”
Check API server logs:
kubectl logs -n kube-system kube-apiserver-<node> | grep -i "watch"
Check etcd health:
etcdctl endpoint health
etcdctl endpoint status
Trace a specific watch:
Enable verbose logging in client-go:
import "k8s.io/klog/v2"
klog.SetOutput(os.Stderr)
klog.InitFlags(nil)
flag.Set("v", "6") // Verbose watch logging
Testing Watch Behavior ¶
Simulate watch cache pressure:
func stressTest(ctx context.Context, client kubernetes.Interface) {
	// Create and delete pods rapidly to churn the watch cache
	for i := 0; i < 10000; i++ {
		pod := &corev1.Pod{
			ObjectMeta: metav1.ObjectMeta{
				Name: fmt.Sprintf("stress-%d", i),
			},
			Spec: corev1.PodSpec{
				Containers: []corev1.Container{{
					Name:  "test",
					Image: "nginx",
				}},
			},
		}
		if _, err := client.CoreV1().Pods("default").Create(ctx, pod, metav1.CreateOptions{}); err != nil {
			continue // creation failed; move on to the next iteration
		}
		_ = client.CoreV1().Pods("default").Delete(ctx, pod.Name, metav1.DeleteOptions{})
	}
}
Monitor apiserver_watch_cache_stale_total during the test.
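If you prefer to sample those metrics from the test itself, the API server’s /metrics endpoint can be read through the client. A sketch (it assumes the client’s credentials are allowed to GET the /metrics non-resource URL):
import (
	"context"
	"fmt"
	"strings"

	"k8s.io/client-go/kubernetes"
)

// dumpWatchCacheMetrics prints the watch-cache related series from /metrics.
func dumpWatchCacheMetrics(ctx context.Context, client kubernetes.Interface) error {
	raw, err := client.CoreV1().RESTClient().Get().AbsPath("/metrics").DoRaw(ctx)
	if err != nil {
		return err
	}
	for _, line := range strings.Split(string(raw), "\n") {
		if strings.Contains(line, "apiserver_watch_cache") {
			fmt.Println(line)
		}
	}
	return nil
}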
The Full Picture ¶
Putting it together:
+-------------------------------------------------------------------+
| Your Controller |
| +--------------------------------------------------------------+ |
| | SharedInformer | |
| | | |
| | Reflector ---> DeltaFIFO ---> Indexer ---> Event Handler | |
| | (List+Watch) (Cache) | |
| +---------|----------------------------------------------------+ |
+------------|------------------------------------------------------+
|
| HTTP Watch (long-lived connection)
v
+-------------------------------------------------------------------+
| API Server |
| +--------------------------------------------------------------+ |
| | Cacher | |
| | | |
| | Reflector ---> Watch Cache ---> Broadcaster (fan-out) | |
| | (etcd watch) | |
| +------|-------------------------------------------------------+ |
+---------|---------------------------------------------------------+
|
| gRPC Watch (one per resource type)
v
+-------------------------------------------------------------------+
| etcd |
| +--------------------------------------------------------------+ |
| | MVCC: Rev 1000 | Rev 1001 | Rev 1002 | Rev 1003 | ... | |
| +--------------------------------------------------------------+ |
+-------------------------------------------------------------------+
- etcd stores versioned data using MVCC, supports native watches
- API server watch cache multiplexes one etcd watch to many clients
- Client informers maintain local caches, synced via watches
- ResourceVersion ties it all together — an opaque token representing a point in time
When something goes wrong:
- 410 errors -> client’s resourceVersion is too old, needs re-list
- Missing events -> look for watch disconnects, slow handlers
- High latency -> check watch cache size, etcd health, event rate
Understanding this chain helps you build robust controllers and diagnose issues that mystify most operators.