From etcd to Watch: How Kubernetes Watches Actually Work


Every Kubernetes controller depends on watches. Create a Deployment, and within seconds the ReplicaSet controller sees it and creates pods. But how does this actually work? The answer runs from etcd’s data model through the API server’s watch cache to your client’s informer — and understanding this chain explains why watches sometimes fail and how to fix them.

Kubernetes stores all cluster state in etcd, a distributed key-value store. But etcd isn't a simple key-value store: it uses Multi-Version Concurrency Control (MVCC) and keeps a versioned history of every key.

When you update a key in etcd, it doesn’t overwrite the old value. It creates a new revision:

Revision 100: /registry/pods/default/nginx -> {pod spec v1}
Revision 101: /registry/pods/default/nginx -> {pod spec v2}  # Updated
Revision 102: /registry/pods/default/redis -> {pod spec v1}  # New pod
Revision 103: /registry/pods/default/nginx -> tombstone       # Deleted

Every write operation increments a global revision counter. This revision is monotonically increasing across the entire etcd cluster — not per key.

Key insight: You can ask etcd “what changed after revision 100?” and get a consistent stream of all modifications.
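
For example, the etcd v3 Go client (clientv3) can read a key as it existed at an older revision, and every response also reports the store's current revision. A sketch, assuming an existing *clientv3.Client named client:

import (
    "context"
    "fmt"

    clientv3 "go.etcd.io/etcd/client/v3"
)

func readAtRevision(client *clientv3.Client) error {
    // Read the pod key as of revision 101 rather than the latest revision.
    resp, err := client.Get(context.Background(),
        "/registry/pods/default/nginx", clientv3.WithRev(101))
    if err != nil {
        return err
    }
    for _, kv := range resp.Kvs {
        // Kubernetes stores protobuf-encoded objects, so the value is binary.
        fmt.Printf("mod revision %d, %d bytes\n", kv.ModRevision, len(kv.Value))
    }
    // The header carries the store's current revision, even for historical reads.
    fmt.Println("current revision:", resp.Header.Revision)
    return nil
}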

etcd keeps historical revisions, but not forever. Compaction removes old revisions to reclaim space:

Before compaction (keeping revisions 100-200):
  Revision 100: key1 -> value1
  Revision 101: key1 -> value2
  ...
  Revision 200: key1 -> value100

After compaction at revision 150:
  Revision 150: key1 -> value50  # Oldest available
  ...
  Revision 200: key1 -> value100

After compaction, you cannot watch from revision 100 — that history is gone. This becomes important later.
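
The same is visible programmatically: once compaction has run, reads and watches below the compaction point fail. A sketch with the etcd Go client, assuming client and ctx exist as above:

// Compact away all history older than revision 150.
if _, err := client.Compact(ctx, 150); err != nil {
    return err
}

// Reading (or watching) at revision 100 now fails with a compaction error.
_, err := client.Get(ctx, "/registry/pods/default/nginx", clientv3.WithRev(100))
fmt.Println(err) // etcdserver: mvcc: required revision has been compacted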

etcd natively supports watches:

// Watch all changes to keys with prefix "/registry/pods/" starting from revision 1000
watcher := client.Watch(context.Background(), "/registry/pods/", 
    clientv3.WithPrefix(),
    clientv3.WithRev(1000))

for response := range watcher {
    for _, event := range response.Events {
        fmt.Printf("Type: %s, Key: %s, Revision: %d\n", 
            event.Type, event.Kv.Key, event.Kv.ModRevision)
    }
}

The watch returns a stream of events: PUT (create/update) and DELETE operations, each tagged with the revision when it occurred.
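
One wrinkle the snippet glosses over: if the requested start revision has already been compacted, etcd cancels the watch rather than silently skipping history, and the response carries the compaction boundary. A sketch extending the loop above:

for response := range watcher {
    if response.CompactRevision != 0 {
        // The requested start revision was compacted away. The only recovery
        // is to re-read current state and start a new watch at or after
        // response.CompactRevision.
        fmt.Printf("history compacted up to revision %d, re-list required\n",
            response.CompactRevision)
        break
    }
    // ... handle response.Events as before ...
}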

Here’s the issue: a busy Kubernetes cluster might have thousands of watchers.

  • Every kubelet watches pods scheduled to its node
  • Every controller watches its relevant resources
  • Every client running kubectl get pods -w opens a watch
  • Service meshes, monitoring, logging — all watching

A 5,000-node cluster easily has 10,000+ concurrent watches. etcd can handle this in theory, but:

  1. Memory: Each watch consumes memory in etcd
  2. Fan-out: A single pod update must be sent to potentially thousands of watchers
  3. Connection overhead: Each watch is a gRPC stream

Having every Kubernetes component directly watch etcd would kill it.

The API server solves this with a watch cache — a layer between etcd and clients.

                              +-------------------+
                              |  Client Watch 1   |
                              +---------+---------+
                                        ^
                                        |
+--------+     +---------------+     +--+--+     +-------------------+
|  etcd  |---->|  Watch Cache  |---->| Fan |---->|  Client Watch N   |
+--------+     +---------------+     +--+--+     +-------------------+
                                        |
   One etcd watch              Broadcaster fans out
   per resource type           to many client watches

How it works:

  1. The API server opens one watch per resource type to etcd (e.g., one watch for all pods)
  2. Events flow into the watch cache, which stores recent events in memory
  3. The broadcaster fans out events to all client watches
  4. Clients watch the API server, not etcd directly

This transforms the problem:

  • etcd handles a handful of watches (one per resource type)
  • The API server handles thousands of client watches
  • The watch cache absorbs the fan-out cost

The watch cache (k8s.io/apiserver/pkg/storage/cacher) maintains:

A sliding window of recent events:

type watchCache struct {
    // Ring buffer of recent events
    cache      []*watchCacheEvent
    startIndex int
    endIndex   int
    
    // All objects currently in the cache (latest version)
    store      cache.Indexer
    
    // Current resource version
    resourceVersion uint64
}

Event storage:

type watchCacheEvent struct {
    Type            watch.EventType  // ADDED, MODIFIED, DELETED
    Object          runtime.Object   // The object
    ObjLabels       labels.Set       // For filtering
    ObjFields       fields.Set       // For filtering
    PrevObject      runtime.Object   // Previous version (for MODIFIED)
    ResourceVersion uint64
}
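
A hypothetical, stripped-down version of that ring buffer (the names are made up; it reuses the watchCacheEvent type above) shows how a fixed-size window drops the oldest event once it fills up:

// ring is a toy sliding window over watch cache events.
type ring struct {
    events     []*watchCacheEvent
    startIndex int // logical index of the oldest buffered event
    endIndex   int // logical index one past the newest buffered event
}

func (r *ring) add(e *watchCacheEvent) {
    if r.endIndex-r.startIndex == len(r.events) {
        r.startIndex++ // buffer full: the oldest event falls out of the window
    }
    r.events[r.endIndex%len(r.events)] = e
    r.endIndex++
}

// oldestResourceVersion is the earliest point a client can resume a watch from.
func (r *ring) oldestResourceVersion() uint64 {
    if r.startIndex == r.endIndex {
        return 0 // empty window
    }
    return r.events[r.startIndex%len(r.events)].ResourceVersion
}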

When a client starts a watch:

  1. If they request a specific resourceVersion and it’s in the cache window -> replay from that point
  2. If the version is too old (not in window) -> return “410 Gone”
  3. If they request resourceVersion=“0” -> start from current state
  4. Then stream new events as they arrive
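
In simplified form, that decision looks something like this (a sketch of the logic, not the actual cacher code):

import "fmt"

// startRevision decides where a new watch can begin, given the window the
// cache currently holds. Returning an error corresponds to "410 Gone".
func startRevision(requestedRV, oldestRV, currentRV uint64) (uint64, error) {
    switch {
    case requestedRV == 0:
        // "0" (or unset): serve the current state, then stream from now.
        return currentRV, nil
    case requestedRV < oldestRV:
        // Older than anything still buffered: the client must re-list.
        return 0, fmt.Errorf("410 Gone: resourceVersion %d is too old (oldest available: %d)",
            requestedRV, oldestRV)
    default:
        // Inside the window: replay buffered events from this point onward.
        return requestedRV, nil
    }
}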

The Cacher is the component that ties it together:

type Cacher struct {
    // Underlying storage (etcd)
    storage     storage.Interface
    
    // The watch cache
    watchCache  *watchCache
    
    // Broadcasts events to watchers
    watchers    indexedWatchers
    
    // Handles reflector lifecycle
    reflector   *cache.Reflector
}

The reflector does List+Watch against etcd, feeding events into the watch cache. Client watches subscribe to the broadcaster.

Every Kubernetes object has a metadata.resourceVersion field:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  resourceVersion: "12345678"

This value comes from etcd’s revision system, but with a twist.

In most cases, resourceVersion equals the etcd modification revision of that object. When you update a pod and etcd records it at revision 12345678, the pod’s resourceVersion becomes “12345678”.

But it’s not always a direct mapping:

  • The API server may encode additional information
  • Different storage backends could use different schemes
  • It’s intentionally opaque — don’t parse or compare numerically

Treat it as an opaque string that happens to increase over time.

When you list or watch resources, resourceVersion has specific meanings:

resourceVersion="" (not specified):

  • List: Return from API server cache (may be slightly stale)
  • Watch: Start from “now” (current resource version)

resourceVersion="0":

  • List: Return from API server cache (any version)
  • Watch: Start from “any” — API server chooses (usually current)

resourceVersion="12345678" (specific value):

  • List: Return data at least as fresh as this version
  • Watch: Start streaming from this exact point

For controllers, the pattern is:

  1. List with resourceVersion="" -> get current objects + a resourceVersion
  2. Watch with that resourceVersion -> see all changes since the list

For example:

list, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{})
if err != nil {
    return err
}

// Watch starting from where the list left off
w, err := client.CoreV1().Pods("").Watch(ctx, metav1.ListOptions{
    ResourceVersion: list.ResourceVersion,
})

Sooner or later, every watch client runs into this error:

HTTP 410 Gone
too old resource version: 500 (1000)

This happens when you request a resourceVersion that's too old: it has already fallen out of the watch cache (or been compacted in etcd).

The timeline:

Watch cache window: [revision 1000 ... revision 2000]

Client requests: Watch from revision 500
API server: "I don't have revision 500 anymore" -> 410 Gone

Causes:

  • Client disconnected too long, missed too many events
  • Watch cache is small relative to event rate
  • etcd compacted before client reconnected

The proper response:

  1. Catch the 410 error
  2. Re-list to get current state and new resourceVersion
  3. Resume watching from the new resourceVersion

client-go’s informers handle this automatically.
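
If you manage a raw watch yourself rather than using an informer, the recovery loop looks roughly like this (a sketch; the helper name is made up and error handling is trimmed):

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/watch"
    "k8s.io/client-go/kubernetes"
)

// watchPodsWithRecovery lists, watches, and falls back to a fresh list
// whenever the watch reports 410 Gone.
func watchPodsWithRecovery(ctx context.Context, client kubernetes.Interface) error {
    for {
        // Re-list to get current state and a fresh resourceVersion.
        list, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{})
        if err != nil {
            return err
        }

        // Resume watching from where the list left off.
        w, err := client.CoreV1().Pods("").Watch(ctx, metav1.ListOptions{
            ResourceVersion:     list.ResourceVersion,
            AllowWatchBookmarks: true,
        })
        if err != nil {
            return err
        }

        for event := range w.ResultChan() {
            if event.Type == watch.Error {
                // A 410 arrives as an Error event carrying a Status object.
                if status, ok := event.Object.(*metav1.Status); ok && status.Code == 410 {
                    break // resourceVersion too old: go back and re-list
                }
            }
            // ... handle Added / Modified / Deleted events here ...
        }
        w.Stop()

        if ctx.Err() != nil {
            return ctx.Err()
        }
    }
}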

Watches can go quiet — if nothing changes, no events flow. But the client needs to know: “Am I still connected? What’s the current resourceVersion?”

Bookmarks solve this. They’re synthetic events that communicate the current resourceVersion without an actual object change:

// Enable bookmarks (watcher is a watch.Interface; meta is k8s.io/apimachinery/pkg/api/meta)
watcher, err := client.CoreV1().Pods("").Watch(ctx, metav1.ListOptions{
    ResourceVersion:     "12345",
    AllowWatchBookmarks: true, // Request bookmarks
})
if err != nil {
    return err
}

for event := range watcher.ResultChan() {
    if event.Type == watch.Bookmark {
        // No object change, but we know we're caught up to this version
        obj, _ := meta.Accessor(event.Object)
        fmt.Printf("Bookmark at %s\n", obj.GetResourceVersion())
    }
}

Why bookmarks matter:

  1. Progress tracking: Client knows how far behind it is
  2. Faster recovery: After disconnect, client can resume from bookmark’s resourceVersion instead of re-listing
  3. Preventing 410: Regular bookmarks keep the client’s resourceVersion fresh

The API server sends bookmarks periodically (default: every minute if there’s no activity).

Understanding the architecture reveals the failure modes.

The watch cache has finite size. Under heavy write load:

Events arriving:    1000/second
Cache size:         1000 events
Cache window:       1 second of history

A client that disconnects for 2 seconds and tries to resume will get 410 Gone.

Symptoms:

  • Controllers constantly re-listing
  • High API server memory usage
  • apiserver_watch_cache_capacity_increase_total metric increasing

Mitigation:

  • Increase watch cache size (API server flag)
  • Use bookmarks (keeps client resourceVersion fresh)
  • Reduce event rate (fewer unnecessary updates)
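
One common source of unnecessary events is writing objects that have not actually changed; skipping no-op updates cuts the event rate directly. A sketch of that pattern (the helper name is made up):

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/equality"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// updateIfChanged only writes the ConfigMap when its data actually differs.
// Every successful Update produces a watch event for every watcher of the
// resource type, so skipping no-op writes reduces watch cache churn.
func updateIfChanged(ctx context.Context, client kubernetes.Interface, existing, desired *corev1.ConfigMap) error {
    if equality.Semantic.DeepEqual(existing.Data, desired.Data) {
        return nil // nothing changed: no write, no event
    }
    existing.Data = desired.Data
    _, err := client.CoreV1().ConfigMaps(existing.Namespace).Update(ctx, existing, metav1.UpdateOptions{})
    return err
}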

etcd compacts old revisions. If a client is watching etcd directly (rare, but some systems do), compaction can remove history the client needs.

For Kubernetes, this manifests indirectly — the watch cache’s underlying reflector gets 410 from etcd, triggering a full re-list.

Symptoms:

  • Periodic spikes in API server memory and etcd load
  • apiserver_watch_cache_list_total metric spikes

Mitigation:

  • Tune etcd compaction interval
  • Ensure watch cache is sized for your workload

When the API server restarts:

  1. All client watches disconnect
  2. API server starts fresh — empty watch cache
  3. All clients reconnect and re-list
  4. API server hammers etcd with list requests

A 1,000-node cluster with 10 watching clients per node -> 10,000 simultaneous list requests.

Symptoms:

  • API server slow immediately after restart
  • etcd latency spikes
  • Controllers report sync failures

Mitigation:

  • API server caching settings (--watch-cache-sizes)
  • Client backoff (client-go does this automatically with jitter)
  • Priority and fairness (APF) to protect against thundering herd

Informers cache the resourceVersion of the last event. If an informer is too slow processing events:

API server at revision:     10000
Informer last saw:          8000
Informer's local cache at:  8000

API server watch cache:     [9000 ... 10000]  (only 1000 events)

Informer tries to resume from 8000 -> 410 Gone

Symptoms:

  • Informer re-lists repeatedly
  • Controller appears to “miss” events
  • High memory churn (re-list allocates new objects)

Mitigation:

  • Speed up event handlers (don't block the informer; see the sketch after this list)
  • Increase watch cache size
  • Check for slow reconcilers
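
Keeping handlers fast usually means doing nothing in them except enqueueing a key, leaving the real work to a separate worker. A sketch using client-go's workqueue (the function name is illustrative):

import (
    "k8s.io/client-go/tools/cache"
    "k8s.io/client-go/util/workqueue"
)

// registerFastHandlers enqueues object keys instead of reconciling inline,
// so the informer's event delivery never blocks on slow work.
func registerFastHandlers(informer cache.SharedIndexInformer, queue workqueue.RateLimitingInterface) {
    informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc: func(obj interface{}) {
            if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
                queue.Add(key)
            }
        },
        UpdateFunc: func(oldObj, newObj interface{}) {
            if key, err := cache.MetaNamespaceKeyFunc(newObj); err == nil {
                queue.Add(key)
            }
        },
        DeleteFunc: func(obj interface{}) {
            // DeletionHandlingMetaNamespaceKeyFunc copes with tombstones.
            if key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj); err == nil {
                queue.Add(key)
            }
        },
    })
}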

The knobs on the API server side:

--watch-cache=true                  # Enable watch cache (default: true)
--watch-cache-sizes=pods#1000       # Per-resource cache sizes
--default-watch-cache-size=100      # Default size for resources not specified

Format for --watch-cache-sizes:

resource#size,resource#size,...

Examples:
pods#5000,secrets#1000,configmaps#1000

Larger cache = more memory, but fewer 410 errors.
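
In a kubeadm-style cluster these flags typically live in the kube-apiserver static pod manifest; an excerpt (sizes and surrounding fields are illustrative):

# /etc/kubernetes/manifests/kube-apiserver.yaml (excerpt)
spec:
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    - --watch-cache=true
    - --default-watch-cache-size=200
    - --watch-cache-sizes=pods#5000,secrets#1000,configmaps#1000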

etcd's compaction window is tuned separately, in etcd's own configuration:

# etcd configuration
auto-compaction-mode: periodic
auto-compaction-retention: "1h"     # Keep 1 hour of history

Longer retention = more history available = fewer compaction-related 410s. But also more disk and memory usage.

API server:

# Watch cache size by resource
apiserver_watch_cache_capacity{resource="pods"}

# Events in watch cache
apiserver_watch_cache_events_received_total

# 410 errors (client needed older data than available)
apiserver_watch_cache_stale_total

# Watch count by resource
apiserver_registered_watchers{resource="pods"}

etcd:

# Current revision
etcd_debugging_mvcc_current_revision

# Compaction stats
etcd_debugging_mvcc_db_compaction_total_duration_milliseconds

# Watch count
etcd_debugging_mvcc_watcher_total

Client-side symptoms:

  • Informer logs show “watch ended” followed by re-list
  • ResourceVersion jumps (indicates re-list happened)
  • Events appear “missed”

Check API server logs:

kubectl logs -n kube-system kube-apiserver-<node> | grep -i "watch"

Check etcd health:

etcdctl endpoint health
etcdctl endpoint status

Trace a specific watch:

Enable verbose logging in client-go:

import "k8s.io/klog/v2"

klog.SetOutput(os.Stderr)
klog.InitFlags(nil)
flag.Set("v", "6")  // Verbose watch logging

Simulate watch cache pressure:

func stressTest(ctx context.Context, client kubernetes.Interface) {
    // Create and delete pods rapidly to churn the watch cache
    for i := 0; i < 10000; i++ {
        pod := &corev1.Pod{
            ObjectMeta: metav1.ObjectMeta{
                Name: fmt.Sprintf("stress-%d", i),
            },
            Spec: corev1.PodSpec{
                Containers: []corev1.Container{{
                    Name:  "test",
                    Image: "nginx",
                }},
            },
        }
        if _, err := client.CoreV1().Pods("default").Create(ctx, pod, metav1.CreateOptions{}); err != nil {
            continue
        }
        _ = client.CoreV1().Pods("default").Delete(ctx, pod.Name, metav1.DeleteOptions{})
    }
}

Monitor apiserver_watch_cache_stale_total during the test.

Putting it together:

+-------------------------------------------------------------------+
|                      Your Controller                              |
|  +--------------------------------------------------------------+ |
|  |                    SharedInformer                            | |
|  |                                                              | |
|  |  Reflector ---> DeltaFIFO ---> Indexer ---> Event Handler    | |
|  |  (List+Watch)                  (Cache)                       | |
|  +---------|----------------------------------------------------+ |
+------------|------------------------------------------------------+
             |
             | HTTP Watch (long-lived connection)
             v
+-------------------------------------------------------------------+
|                        API Server                                 |
|  +--------------------------------------------------------------+ |
|  |                         Cacher                               | |
|  |                                                              | |
|  |  Reflector ---> Watch Cache ---> Broadcaster (fan-out)       | |
|  |  (etcd watch)                                                | |
|  +------|-------------------------------------------------------+ |
+---------|---------------------------------------------------------+
          |
          | gRPC Watch (one per resource type)
          v
+-------------------------------------------------------------------+
|                           etcd                                    |
|  +--------------------------------------------------------------+ |
|  |  MVCC: Rev 1000 | Rev 1001 | Rev 1002 | Rev 1003 | ...       | |
|  +--------------------------------------------------------------+ |
+-------------------------------------------------------------------+

To summarize the chain:

  1. etcd stores versioned data using MVCC and supports native watches
  2. API server watch cache multiplexes one etcd watch to many clients
  3. Client informers maintain local caches, synced via watches
  4. ResourceVersion ties it all together — an opaque token representing a point in time

When something goes wrong:

  • 410 errors -> client’s resourceVersion is too old, needs re-list
  • Missing events -> look for watch disconnects, slow handlers
  • High latency -> check watch cache size, etcd health, event rate

Understanding this chain helps you build robust controllers and diagnose issues that mystify most operators.