From etcd to Watch: How Kubernetes Watches Actually Work


Every Kubernetes controller depends on watches. Create a Deployment, and within seconds the ReplicaSet controller sees it and creates pods. But how does this actually work? The answer runs from etcd’s data model through the API server’s watch cache to your client’s informer — and understanding this chain explains why watches sometimes fail and how to fix them.

Kubernetes stores all cluster state in etcd, a distributed key-value store. But etcd isn't a simple key-value store: it uses Multi-Version Concurrency Control (MVCC) and keeps a versioned history of every key.

When you update a key in etcd, it doesn’t overwrite the old value. It creates a new revision:

Revision 100: /registry/pods/default/nginx -> {pod spec v1}
Revision 101: /registry/pods/default/nginx -> {pod spec v2}  # Updated
Revision 102: /registry/pods/default/redis -> {pod spec v1}  # New pod
Revision 103: /registry/pods/default/nginx -> tombstone       # Deleted

Every write operation increments a global revision counter. This revision is monotonically increasing across the entire etcd cluster — not per key.

Key insight: You can ask etcd “what changed after revision 100?” and get a consistent stream of all modifications.
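
For example, the etcd v3 Go client (clientv3) can read a key as it existed at an older revision, and every response also reports the store's current revision. A sketch, assuming an existing *clientv3.Client named client:

import (
    "context"
    "fmt"

    clientv3 "go.etcd.io/etcd/client/v3"
)

func readAtRevision(client *clientv3.Client) error {
    // Read the pod key as of revision 101 rather than the latest revision.
    resp, err := client.Get(context.Background(),
        "/registry/pods/default/nginx", clientv3.WithRev(101))
    if err != nil {
        return err
    }
    for _, kv := range resp.Kvs {
        // Kubernetes stores protobuf-encoded objects, so the value is binary.
        fmt.Printf("mod revision %d, %d bytes\n", kv.ModRevision, len(kv.Value))
    }
    // The header carries the store's current revision, even for historical reads.
    fmt.Println("current revision:", resp.Header.Revision)
    return nil
}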

etcd keeps historical revisions, but not forever. Compaction removes old revisions to reclaim space:

Before compaction (keeping revisions 100-200):
  Revision 100: key1 -> value1
  Revision 101: key1 -> value2
  ...
  Revision 200: key1 -> value100

After compaction at revision 150:
  Revision 150: key1 -> value50  # Oldest available
  ...
  Revision 200: key1 -> value100

After compaction, you cannot watch from revision 100 — that history is gone. This becomes important later.
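
The same is visible programmatically: once compaction has run, reads and watches below the compaction point fail. A sketch with the etcd Go client, assuming client and ctx exist as above:

// Compact away all history older than revision 150.
if _, err := client.Compact(ctx, 150); err != nil {
    return err
}

// Reading (or watching) at revision 100 now fails with a compaction error.
_, err := client.Get(ctx, "/registry/pods/default/nginx", clientv3.WithRev(100))
fmt.Println(err) // etcdserver: mvcc: required revision has been compacted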

etcd natively supports watches:

// Watch all changes to keys with prefix "/registry/pods/" starting from revision 1000
watcher := client.Watch(context.Background(), "/registry/pods/", 
    clientv3.WithPrefix(),
    clientv3.WithRev(1000))

for response := range watcher {
    for _, event := range response.Events {
        fmt.Printf("Type: %s, Key: %s, Revision: %d\n", 
            event.Type, event.Kv.Key, event.Kv.ModRevision)
    }
}

The watch returns a stream of events: PUT (create/update) and DELETE operations, each tagged with the revision when it occurred.
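
One wrinkle the snippet glosses over: if the requested start revision has already been compacted, etcd cancels the watch rather than silently skipping history, and the response carries the compaction boundary. A sketch extending the loop above:

for response := range watcher {
    if response.CompactRevision != 0 {
        // The requested start revision was compacted away. The only recovery
        // is to re-read current state and start a new watch at or after
        // response.CompactRevision.
        fmt.Printf("history compacted up to revision %d, re-list required\n",
            response.CompactRevision)
        break
    }
    // ... handle response.Events as before ...
}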

Here’s the issue: a busy Kubernetes cluster might have thousands of watchers.

  • Every kubelet watches pods scheduled to its node
  • Every controller watches its relevant resources
  • Every client running kubectl get pods -w opens a watch
  • Service meshes, monitoring, logging — all watching

A 5,000-node cluster easily has 10,000+ concurrent watches. etcd can handle this in theory, but:

  1. Memory: Each watch consumes memory in etcd
  2. Fan-out: A single pod update must be sent to potentially thousands of watchers
  3. Connection overhead: Each watch is a gRPC stream

Having every Kubernetes component directly watch etcd would kill it.

The API server solves this with a watch cache — a layer between etcd and clients.

                              +-------------------+
                              |  Client Watch 1   |
                              +---------+---------+
                                        ^
                                        |
+--------+     +---------------+     +--+--+     +-------------------+
|  etcd  |---->|  Watch Cache  |---->| Fan |---->|  Client Watch N   |
+--------+     +---------------+     +--+--+     +-------------------+
                                        |
   One etcd watch              Broadcaster fans out
   per resource type           to many client watches

How it works:

  1. The API server opens one watch per resource type to etcd (e.g., one watch for all pods)
  2. Events flow into the watch cache, which stores recent events in memory
  3. The broadcaster fans out events to all client watches
  4. Clients watch the API server, not etcd directly

This transforms the problem:

  • etcd handles a handful of watches (one per resource type)
  • The API server handles thousands of client watches
  • The watch cache absorbs the fan-out cost

The watch cache (k8s.io/apiserver/pkg/storage/cacher) maintains:

A sliding window of recent events:

type watchCache struct {
    // Ring buffer of recent events
    cache      []*watchCacheEvent
    startIndex int
    endIndex   int
    
    // All objects currently in the cache (latest version)
    store      cache.Indexer
    
    // Current resource version
    resourceVersion uint64
}

Event storage:

type watchCacheEvent struct {
    Type            watch.EventType  // ADDED, MODIFIED, DELETED
    Object          runtime.Object   // The object
    ObjLabels       labels.Set       // For filtering
    ObjFields       fields.Set       // For filtering
    PrevObject      runtime.Object   // Previous version (for MODIFIED)
    ResourceVersion uint64
}
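
A hypothetical, stripped-down version of that ring buffer (the names are made up; it reuses the watchCacheEvent type above) shows how a fixed-size window drops the oldest event once it fills up:

// ring is a toy sliding window over watch cache events.
type ring struct {
    events     []*watchCacheEvent
    startIndex int // logical index of the oldest buffered event
    endIndex   int // logical index one past the newest buffered event
}

func (r *ring) add(e *watchCacheEvent) {
    if r.endIndex-r.startIndex == len(r.events) {
        r.startIndex++ // buffer full: the oldest event falls out of the window
    }
    r.events[r.endIndex%len(r.events)] = e
    r.endIndex++
}

// oldestResourceVersion is the earliest point a client can resume a watch from.
func (r *ring) oldestResourceVersion() uint64 {
    if r.startIndex == r.endIndex {
        return 0 // empty window
    }
    return r.events[r.startIndex%len(r.events)].ResourceVersion
}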

When a client starts a watch:

  1. If they request a specific resourceVersion and it’s in the cache window -> replay from that point
  2. If the version is too old (not in window) -> return “410 Gone”
  3. If they request resourceVersion=“0” -> start from current state
  4. Then stream new events as they arrive
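
In simplified form, that decision looks something like this (a sketch of the logic, not the actual cacher code):

import "fmt"

// startRevision decides where a new watch can begin, given the window the
// cache currently holds. Returning an error corresponds to "410 Gone".
func startRevision(requestedRV, oldestRV, currentRV uint64) (uint64, error) {
    switch {
    case requestedRV == 0:
        // "0" (or unset): serve the current state, then stream from now.
        return currentRV, nil
    case requestedRV < oldestRV:
        // Older than anything still buffered: the client must re-list.
        return 0, fmt.Errorf("410 Gone: resourceVersion %d is too old (oldest available: %d)",
            requestedRV, oldestRV)
    default:
        // Inside the window: replay buffered events from this point onward.
        return requestedRV, nil
    }
}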

The Cacher is the component that ties it together:

type Cacher struct {
    // Underlying storage (etcd)
    storage     storage.Interface
    
    // The watch cache
    watchCache  *watchCache
    
    // Broadcasts events to watchers
    watchers    indexedWatchers
    
    // Handles reflector lifecycle
    reflector   *cache.Reflector
}

The reflector does List+Watch against etcd, feeding events into the watch cache. Client watches subscribe to the broadcaster.

Every Kubernetes object has a metadata.resourceVersion field:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  resourceVersion: "12345678"

This value comes from etcd’s revision system, but with a twist.

In most cases, resourceVersion equals the etcd modification revision of that object. When you update a pod and etcd records it at revision 12345678, the pod’s resourceVersion becomes “12345678”.

But it’s not always a direct mapping:

  • The API server may encode additional information
  • Different storage backends could use different schemes
  • It’s intentionally opaque — don’t parse or compare numerically

Treat it as an opaque string that happens to increase over time.

When you list or watch resources, resourceVersion has specific meanings:

resourceVersion="" (not specified):

  • List: Return from API server cache (may be slightly stale)
  • Watch: Start from “now” (current resource version)

resourceVersion="0":

  • List: Return from API server cache (any version)
  • Watch: Start from “any” — API server chooses (usually current)

resourceVersion="12345678" (specific value):

  • List: Return data at least as fresh as this version
  • Watch: Start streaming from this exact point

For controllers, the pattern is:

  1. List with resourceVersion="" -> get current objects + a resourceVersion
  2. Watch with that resourceVersion -> see all changes since the list

For example:

list, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{})
if err != nil {
    return err
}

// Watch starting from where the list left off
w, err := client.CoreV1().Pods("").Watch(ctx, metav1.ListOptions{
    ResourceVersion: list.ResourceVersion,
})

Sooner or later, every watch client runs into this error:

HTTP 410 Gone
too old resource version: 500 (1000)

This happens when you request a resourceVersion that's too old: it has already fallen out of the watch cache (or been compacted in etcd).

The timeline:

Watch cache window: [revision 1000 ... revision 2000]

Client requests: Watch from revision 500
API server: "I don't have revision 500 anymore" -> 410 Gone

Causes:

  • Client disconnected too long, missed too many events
  • Watch cache is small relative to event rate
  • etcd compacted before client reconnected

The proper response:

  1. Catch the 410 error
  2. Re-list to get current state and new resourceVersion
  3. Resume watching from the new resourceVersion

client-go’s informers handle this automatically.
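
If you manage a raw watch yourself rather than using an informer, the recovery loop looks roughly like this (a sketch; the helper name is made up and error handling is trimmed):

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/watch"
    "k8s.io/client-go/kubernetes"
)

// watchPodsWithRecovery lists, watches, and falls back to a fresh list
// whenever the watch reports 410 Gone.
func watchPodsWithRecovery(ctx context.Context, client kubernetes.Interface) error {
    for {
        // Re-list to get current state and a fresh resourceVersion.
        list, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{})
        if err != nil {
            return err
        }

        // Resume watching from where the list left off.
        w, err := client.CoreV1().Pods("").Watch(ctx, metav1.ListOptions{
            ResourceVersion:     list.ResourceVersion,
            AllowWatchBookmarks: true,
        })
        if err != nil {
            return err
        }

        for event := range w.ResultChan() {
            if event.Type == watch.Error {
                // A 410 arrives as an Error event carrying a Status object.
                if status, ok := event.Object.(*metav1.Status); ok && status.Code == 410 {
                    break // resourceVersion too old: go back and re-list
                }
            }
            // ... handle Added / Modified / Deleted events here ...
        }
        w.Stop()

        if ctx.Err() != nil {
            return ctx.Err()
        }
    }
}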

Watches can go quiet — if nothing changes, no events flow. But the client needs to know: “Am I still connected? What’s the current resourceVersion?”

Bookmarks solve this. They’re synthetic events that communicate the current resourceVersion without an actual object change:

// Enable bookmarks (watcher is a watch.Interface; meta is k8s.io/apimachinery/pkg/api/meta)
watcher, err := client.CoreV1().Pods("").Watch(ctx, metav1.ListOptions{
    ResourceVersion:     "12345",
    AllowWatchBookmarks: true, // Request bookmarks
})
if err != nil {
    return err
}

for event := range watcher.ResultChan() {
    if event.Type == watch.Bookmark {
        // No object change, but we know we're caught up to this version
        obj, _ := meta.Accessor(event.Object)
        fmt.Printf("Bookmark at %s\n", obj.GetResourceVersion())
    }
}

Why bookmarks matter:

  1. Progress tracking: Client knows how far behind it is
  2. Faster recovery: After disconnect, client can resume from bookmark’s resourceVersion instead of re-listing
  3. Preventing 410: Regular bookmarks keep the client’s resourceVersion fresh

The API server sends bookmarks periodically (default: every minute if there’s no activity).

Understanding the architecture reveals the failure modes.

The watch cache has finite size. Under heavy write load:

Events arriving:    1000/second
Cache size:         1000 events
Cache window:       1 second of history

A client that disconnects for 2 seconds and tries to resume will get 410 Gone.

Symptoms:

  • Controllers constantly re-listing
  • High API server memory usage
  • apiserver_watch_cache_capacity_increase_total metric increasing

Mitigation:

  • Increase watch cache size (API server flag)
  • Use bookmarks (keeps client resourceVersion fresh)
  • Reduce event rate (fewer unnecessary updates)
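
One common source of unnecessary events is writing objects that have not actually changed; skipping no-op updates cuts the event rate directly. A sketch of that pattern (the helper name is made up):

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/equality"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// updateIfChanged only writes the ConfigMap when its data actually differs.
// Every successful Update produces a watch event for every watcher of the
// resource type, so skipping no-op writes reduces watch cache churn.
func updateIfChanged(ctx context.Context, client kubernetes.Interface, existing, desired *corev1.ConfigMap) error {
    if equality.Semantic.DeepEqual(existing.Data, desired.Data) {
        return nil // nothing changed: no write, no event
    }
    existing.Data = desired.Data
    _, err := client.CoreV1().ConfigMaps(existing.Namespace).Update(ctx, existing, metav1.UpdateOptions{})
    return err
}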

etcd compacts old revisions. If a client is watching etcd directly (rare, but some systems do), compaction can remove history the client needs.

For Kubernetes, this manifests indirectly — the watch cache’s underlying reflector gets 410 from etcd, triggering a full re-list.

Symptoms:

  • Periodic spikes in API server memory and etcd load
  • apiserver_watch_cache_list_total metric spikes

Mitigation:

  • Tune etcd compaction interval
  • Ensure watch cache is sized for your workload

When the API server restarts:

  1. All client watches disconnect
  2. API server starts fresh — empty watch cache
  3. All clients reconnect and re-list
  4. API server hammers etcd with list requests

A 1,000-node cluster with 10 watching clients per node -> 10,000 simultaneous list requests.

Symptoms:

  • API server slow immediately after restart
  • etcd latency spikes
  • Controllers report sync failures

Mitigation:

  • API server caching settings (--watch-cache-sizes)
  • Client backoff (client-go does this automatically with jitter)
  • Priority and fairness (APF) to protect against thundering herd

Informers cache the resourceVersion of the last event. If an informer is too slow processing events:

API server at revision:     10000
Informer last saw:          8000
Informer's local cache at:  8000

API server watch cache:     [9000 ... 10000]  (only 1000 events)

Informer tries to resume from 8000 -> 410 Gone

Symptoms:

  • Informer re-lists repeatedly
  • Controller appears to “miss” events
  • High memory churn (re-list allocates new objects)

Mitigation:

  • Speed up event handlers (don't block the informer; see the sketch after this list)
  • Increase watch cache size
  • Check for slow reconcilers
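
Keeping handlers fast usually means doing nothing in them except enqueueing a key, leaving the real work to a separate worker. A sketch using client-go's workqueue (the function name is illustrative):

import (
    "k8s.io/client-go/tools/cache"
    "k8s.io/client-go/util/workqueue"
)

// registerFastHandlers enqueues object keys instead of reconciling inline,
// so the informer's event delivery never blocks on slow work.
func registerFastHandlers(informer cache.SharedIndexInformer, queue workqueue.RateLimitingInterface) {
    informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc: func(obj interface{}) {
            if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
                queue.Add(key)
            }
        },
        UpdateFunc: func(oldObj, newObj interface{}) {
            if key, err := cache.MetaNamespaceKeyFunc(newObj); err == nil {
                queue.Add(key)
            }
        },
        DeleteFunc: func(obj interface{}) {
            // DeletionHandlingMetaNamespaceKeyFunc copes with tombstones.
            if key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj); err == nil {
                queue.Add(key)
            }
        },
    })
}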

The knobs on the API server side:

--watch-cache=true                  # Enable watch cache (default: true)
--watch-cache-sizes=pods#1000       # Per-resource cache sizes
--default-watch-cache-size=100      # Default size for resources not specified

Format for --watch-cache-sizes:

resource#size,resource#size,...

Examples:
pods#5000,secrets#1000,configmaps#1000

Larger cache = more memory, but fewer 410 errors.
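
In a kubeadm-style cluster these flags typically live in the kube-apiserver static pod manifest; an excerpt (sizes and surrounding fields are illustrative):

# /etc/kubernetes/manifests/kube-apiserver.yaml (excerpt)
spec:
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    - --watch-cache=true
    - --default-watch-cache-size=200
    - --watch-cache-sizes=pods#5000,secrets#1000,configmaps#1000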

etcd's compaction window is tuned separately, in etcd's own configuration:

# etcd configuration
auto-compaction-mode: periodic
auto-compaction-retention: "1h"     # Keep 1 hour of history

Longer retention = more history available = fewer compaction-related 410s. But also more disk and memory usage.

API server:

# Watch cache size by resource
apiserver_watch_cache_capacity{resource="pods"}

# Events in watch cache
apiserver_watch_cache_events_received_total

# 410 errors (client needed older data than available)
apiserver_watch_cache_stale_total

# Watch count by resource
apiserver_registered_watchers{resource="pods"}

etcd:

# Current revision
etcd_debugging_mvcc_current_revision

# Compaction stats
etcd_debugging_mvcc_db_compaction_total_duration_milliseconds

# Watch count
etcd_debugging_mvcc_watcher_total

Client-side symptoms:

  • Informer logs show “watch ended” followed by re-list
  • ResourceVersion jumps (indicates re-list happened)
  • Events appear “missed”

Check API server logs:

kubectl logs -n kube-system kube-apiserver-<node> | grep -i "watch"

Check etcd health:

etcdctl endpoint health
etcdctl endpoint status

Trace a specific watch:

Enable verbose logging in client-go:

import "k8s.io/klog/v2"

klog.SetOutput(os.Stderr)
klog.InitFlags(nil)
flag.Set("v", "6")  // Verbose watch logging

Simulate watch cache pressure:

func stressTest(ctx context.Context, client kubernetes.Interface) {
    // Create and delete pods rapidly to churn the watch cache
    for i := 0; i < 10000; i++ {
        pod := &corev1.Pod{
            ObjectMeta: metav1.ObjectMeta{
                Name: fmt.Sprintf("stress-%d", i),
            },
            Spec: corev1.PodSpec{
                Containers: []corev1.Container{{
                    Name:  "test",
                    Image: "nginx",
                }},
            },
        }
        if _, err := client.CoreV1().Pods("default").Create(ctx, pod, metav1.CreateOptions{}); err != nil {
            continue
        }
        _ = client.CoreV1().Pods("default").Delete(ctx, pod.Name, metav1.DeleteOptions{})
    }
}

Monitor apiserver_watch_cache_stale_total during the test.

Putting it together:

+-------------------------------------------------------------------+
|                      Your Controller                              |
|  +--------------------------------------------------------------+ |
|  |                    SharedInformer                            | |
|  |                                                              | |
|  |  Reflector ---> DeltaFIFO ---> Indexer ---> Event Handler    | |
|  |  (List+Watch)                  (Cache)                       | |
|  +---------|----------------------------------------------------+ |
+------------|------------------------------------------------------+
             |
             | HTTP Watch (long-lived connection)
             v
+-------------------------------------------------------------------+
|                        API Server                                 |
|  +--------------------------------------------------------------+ |
|  |                         Cacher                               | |
|  |                                                              | |
|  |  Reflector ---> Watch Cache ---> Broadcaster (fan-out)       | |
|  |  (etcd watch)                                                | |
|  +------|-------------------------------------------------------+ |
+---------|---------------------------------------------------------+
          |
          | gRPC Watch (one per resource type)
          v
+-------------------------------------------------------------------+
|                           etcd                                    |
|  +--------------------------------------------------------------+ |
|  |  MVCC: Rev 1000 | Rev 1001 | Rev 1002 | Rev 1003 | ...       | |
|  +--------------------------------------------------------------+ |
+-------------------------------------------------------------------+

To summarize the chain:

  1. etcd stores versioned data using MVCC and supports native watches
  2. API server watch cache multiplexes one etcd watch to many clients
  3. Client informers maintain local caches, synced via watches
  4. ResourceVersion ties it all together — an opaque token representing a point in time

When something goes wrong:

  • 410 errors -> client’s resourceVersion is too old, needs re-list
  • Missing events -> look for watch disconnects, slow handlers
  • High latency -> check watch cache size, etcd health, event rate

Understanding this chain helps you build robust controllers and diagnose issues that mystify most operators.