Eventual Consistency and Stale Caches in Kubernetes Controllers


Your custom controller watches for a label change. A user adds the label. Your controller does nothing — or worse, does the wrong thing. Thirty seconds later, it finally reacts. What happened?

The answer lies in the informer cache, and understanding it is the difference between controllers that work in demos and controllers that work in production.

Consider a simple scenario:

  1. User runs: kubectl label deployment web feature.example.com/inject-sidecar=true
  2. API server accepts the write, returns success
  3. Your controller’s reconcile loop runs
  4. Controller checks: “Does this Deployment have the sidecar label?”
  5. Cache says: “No”
  6. Controller does nothing

The label exists in etcd. The API server knows about it. But your controller’s local cache hasn’t caught up yet. Your controller just made a decision based on a lie.

This gets worse on busy clusters. When the API server is under load, watch events queue up. When your controller is processing a backlog, event handlers fall behind. The window between “truth in etcd” and “truth in your cache” widens from milliseconds to seconds — sometimes longer.

Symptoms you’ll see:

  • Flickering state: Resource toggles between two states as controllers fight over stale views
  • Unnecessary reconciliations: Controller keeps requeueing because it can’t see its own writes
  • Race conditions: Two controllers both think they need to act, both act, chaos ensues
  • Silent failures: Controller checks a condition, condition appears false, controller exits early

To fix these problems, you need to understand the machinery between the API server and your Reconcile() function.

Kubernetes controllers don’t poll the API server. Instead, they use the watch protocol:

  1. Initial List: On startup, the controller fetches all relevant objects (e.g., all Deployments in the cluster)
  2. Watch: Controller opens a long-lived HTTP/2 stream. The API server pushes events (ADDED, MODIFIED, DELETED) as objects change

This is efficient — you get updates pushed to you rather than polling. But it introduces a fundamental reality: your controller sees an eventually consistent view of the cluster.
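
For concreteness, here is roughly what that list-then-watch sequence looks like against a raw client-go clientset; the informer machinery described next does this for you, plus reconnection and relisting. A hedged sketch: clientset is assumed to be an already-configured kubernetes.Interface, and error handling for dropped watches is omitted.

func watchDeployments(ctx context.Context, clientset kubernetes.Interface) error {
    // 1. Initial list: fetch everything and remember the list's ResourceVersion.
    list, err := clientset.AppsV1().Deployments(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
    if err != nil {
        return err
    }

    // 2. Watch: resume the event stream from that ResourceVersion.
    w, err := clientset.AppsV1().Deployments(metav1.NamespaceAll).Watch(ctx, metav1.ListOptions{
        ResourceVersion: list.ResourceVersion,
    })
    if err != nil {
        return err
    }
    defer w.Stop()

    for event := range w.ResultChan() {
        deploy, ok := event.Object.(*appsv1.Deployment)
        if !ok {
            continue // e.g. an ERROR event carries a *metav1.Status instead
        }
        fmt.Printf("%s %s/%s rv=%s\n", event.Type, deploy.Namespace, deploy.Name, deploy.ResourceVersion)
    }
    return nil
}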

The client-go library provides SharedInformer to manage this watch lifecycle. Here’s what’s actually happening:

API Server
    |
    |  Watch stream (HTTP/2)
    v
Reflector ------- Consumes watch events, handles reconnection
    |
    v
DeltaFIFO ------- Buffers changes, coalesces multiple updates
    |
    v
Indexer --------- The actual cache (thread-safe in-memory store)
    |
    |  Event handlers (OnAdd, OnUpdate, OnDelete)
    v
Work Queue ------ Rate-limited queue of keys to reconcile
    |
    v
Reconcile ------- Your code runs here

Every box in this diagram is a place where delay can accumulate.
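
For reference, here is a hedged sketch of how those boxes get wired together with plain client-go; controller-runtime assembles the same pipeline for you when you register a controller with ctrl.NewControllerManagedBy. The clientset is assumed to be an already-configured kubernetes.Interface.

func startDeploymentInformer(clientset kubernetes.Interface) error {
    // Reflector, DeltaFIFO, and Indexer all live behind the shared informer.
    factory := informers.NewSharedInformerFactory(clientset, 10*time.Minute) // resync period
    informer := factory.Apps().V1().Deployments().Informer()

    // Rate-limited work queue of keys; reconcile workers pop from this.
    queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

    informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc: func(obj interface{}) {
            if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
                queue.Add(key) // only the key is queued, never the object itself
            }
        },
        UpdateFunc: func(_, newObj interface{}) {
            if key, err := cache.MetaNamespaceKeyFunc(newObj); err == nil {
                queue.Add(key)
            }
        },
        DeleteFunc: func(obj interface{}) {
            if key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj); err == nil {
                queue.Add(key)
            }
        },
    })

    stop := make(chan struct{})
    factory.Start(stop) // starts the Reflector and the DeltaFIFO -> Indexer pump

    // Block until the Indexer has been populated by the initial list.
    if !cache.WaitForCacheSync(stop, informer.HasSynced) {
        return fmt.Errorf("cache failed to sync")
    }
    // Reconcile workers would now loop on queue.Get() ... (omitted)
    return nil
}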

Every object in Kubernetes has a metadata.resourceVersion field. This isn’t a version number you control — it’s an opaque string derived from the etcd revision.

metadata:
  name: web
  resourceVersion: "1847293"  # etcd revision when this object was last modified

When you watch resources, your client tracks its position in the event stream using ResourceVersion. When the watch reconnects, it asks the API server to resume from the last ResourceVersion it saw (if the server still has it) or relists everything.

Key insight: If you read an object from cache and it has resourceVersion: "1847293", you’re seeing the state as of etcd revision 1847293. The object might have been modified since then — your cache just hasn’t received the event yet.

The time between “API server accepts a write” and “your Reconcile() sees it” is your lag window. Let’s trace where time goes:

  1. API server watch cache: The API server doesn’t stream directly from etcd. It maintains an in-memory watch cache and flushes events to watchers periodically. Default flush interval: ~100ms.
     Delay contribution: 0-100ms typically

  2. Network transit: Events travel from the API server to your controller over the network.
     Delay contribution: <1ms (same node) to 10-50ms (cross-region)

  3. Reflector and DeltaFIFO: The Reflector pushes events into DeltaFIFO. A separate goroutine pops events and updates the Indexer. If events arrive faster than they’re processed, they queue up.
     Delay contribution: Microseconds normally, can spike to seconds under load

  4. Event handlers: When the cache updates, your event handlers run (OnAdd, OnUpdate, OnDelete). If you do anything slow here — logging, metrics, complex filtering — you block subsequent events.
     Delay contribution: Should be microseconds, but bad code makes this milliseconds or worse

  5. Work queue: Event handlers typically just add a key to the work queue. But if your reconciler is slow, the queue grows. New events wait behind old ones.
     Delay contribution: Depends entirely on your reconciler throughput

  6. Reconcile: Finally, your code runs. But you’re reading from the cache, which reflects state as of when the event handler ran — not when Reconcile runs.

Total lag budget: On a healthy cluster, 100-500ms is typical. On a busy cluster with slow reconcilers, 5-30 seconds is possible.

Want to see this in your cluster? Log the ResourceVersion at write time and compare to what your cache returns:

func (r *MyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    log := log.FromContext(ctx)

    var obj appsv1.Deployment
    if err := r.Get(ctx, req.NamespacedName, &obj); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    log.Info("Reconciling",
        "name", obj.Name,
        "cacheResourceVersion", obj.ResourceVersion,
        // add a "queuedAt" pair here if you record enqueue time yourself
    )

    // ...
    return ctrl.Result{}, nil
}

Compare against kubectl get deployment web -o jsonpath='{.metadata.resourceVersion}' to see how far behind your cache is.

The most common bug:

func (r *MyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    var deploy appsv1.Deployment
    if err := r.Get(ctx, req.NamespacedName, &deploy); err != nil {
        return ctrl.Result{}, err
    }
    
    // Add a label
    if deploy.Labels == nil {
        deploy.Labels = make(map[string]string)
    }
    deploy.Labels["my-controller/processed"] = "true"
    
    if err := r.Update(ctx, &deploy); err != nil {
        return ctrl.Result{}, err
    }
    
    // BUG: Reading immediately after writing
    var updated appsv1.Deployment
    if err := r.Get(ctx, req.NamespacedName, &updated); err != nil {
        return ctrl.Result{}, err
    }
    
    // This might be false! Cache hasn't caught up to our write.
    if updated.Labels["my-controller/processed"] != "true" {
        log.Error(nil, "Label not found after update!")  // This happens.
    }
    
    return ctrl.Result{}, nil
}

The Update() call succeeds and returns the updated object. But r.Get() reads from the cache, which hasn’t received the watch event yet.
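
One narrow fix follows from that last sentence: controller-runtime writes the server's response back into the object you passed to Update(), so you can make follow-up decisions from that in-memory copy instead of issuing a second, cached Get(). A sketch of the tail of the reconciler above:

    if err := r.Update(ctx, &deploy); err != nil {
        return ctrl.Result{}, err
    }

    // After a successful Update, `deploy` already reflects the server's response:
    // the new ResourceVersion and the label we just set. No second Get() against
    // the (possibly stale) cache is needed.
    log.Info("updated",
        "resourceVersion", deploy.ResourceVersion,              // new version from the server
        "processed", deploy.Labels["my-controller/processed"],  // "true"
    )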

This pattern is deceptively dangerous:

func (r *MyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    var deploy appsv1.Deployment
    if err := r.Get(ctx, req.NamespacedName, &deploy); err != nil {
        return ctrl.Result{}, err
    }
    
    // Dangerous: acting on absence
    if _, exists := deploy.Labels["feature.example.com/sidecar"]; !exists {
        log.Info("Sidecar label not present, skipping")
        return ctrl.Result{}, nil
    }
    
    // ... inject sidecar

    return ctrl.Result{}, nil
}

If a user just added the label, your cache might not have it yet. You skip processing, and the user wonders why nothing happened. Worse, you don’t requeue — so you might never process it until the next resync.
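
If you must act on absence, one hedge is to confirm it with an uncached read before giving up. A sketch, assuming an APIReader field like the one introduced in the mitigations below:

    if _, exists := deploy.Labels["feature.example.com/sidecar"]; !exists {
        // The cache says the label is missing; double-check against the API
        // server before acting on absence. r.APIReader is an uncached
        // client.Reader (see the read-through pattern later in this section).
        var fresh appsv1.Deployment
        if err := r.APIReader.Get(ctx, req.NamespacedName, &fresh); err != nil {
            return ctrl.Result{}, client.IgnoreNotFound(err)
        }
        if _, exists := fresh.Labels["feature.example.com/sidecar"]; !exists {
            log.Info("Sidecar label not present, skipping")
            return ctrl.Result{}, nil
        }
        deploy = fresh // the cache was stale; continue with the fresh copy
    }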

Controllers often watch multiple resource types:

func (r *MyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    var deploy appsv1.Deployment
    if err := r.Get(ctx, req.NamespacedName, &deploy); err != nil {
        return ctrl.Result{}, err
    }
    
    // Get the associated ConfigMap
    var configMap corev1.ConfigMap
    configMapName := deploy.Annotations["my-controller/config"]
    if err := r.Get(ctx, types.NamespacedName{
        Namespace: deploy.Namespace,
        Name:      configMapName,
    }, &configMap); err != nil {
        if apierrors.IsNotFound(err) {
            // BUG: ConfigMap might exist but not be in cache yet
            log.Info("ConfigMap not found, waiting...")
            return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
        }
        return ctrl.Result{}, err
    }
    
    // ...

    return ctrl.Result{}, nil
}

You’re watching Deployments. Someone creates a ConfigMap and then annotates a Deployment to reference it. The Deployment event arrives, but the ConfigMap watch might not have received its ADDED event yet. You conclude the ConfigMap doesn’t exist.
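
One way to make this self-correcting is to watch the secondary resource as well and map its events back to the Deployments that reference it, so reconciliation re-fires once the ConfigMap's ADDED event finally lands in the cache. A hedged sketch against a recent controller-runtime (the Watches and MapFunc signatures have shifted between versions); the annotation key matches the example above:

func (r *MyReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&appsv1.Deployment{}).
        // Also watch ConfigMaps: when one changes, enqueue any Deployment in
        // the same namespace that references it via the my-controller/config
        // annotation.
        Watches(&corev1.ConfigMap{}, handler.EnqueueRequestsFromMapFunc(
            func(ctx context.Context, obj client.Object) []reconcile.Request {
                var deployments appsv1.DeploymentList
                if err := r.List(ctx, &deployments, client.InNamespace(obj.GetNamespace())); err != nil {
                    return nil
                }
                var reqs []reconcile.Request
                for _, d := range deployments.Items {
                    if d.Annotations["my-controller/config"] == obj.GetName() {
                        reqs = append(reqs, reconcile.Request{
                            NamespacedName: types.NamespacedName{Namespace: d.Namespace, Name: d.Name},
                        })
                    }
                }
                return reqs
            },
        )).
        Complete(r)
}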

Two controllers watching the same resource, both with stale views:

Controller A cache: Deployment has 3 replicas
Controller B cache: Deployment has 3 replicas

User sets replicas to 5

Controller A sees update event (replicas=5)
Controller A: "I need to create a monitoring config for 5 replicas"
Controller A updates Deployment annotations

Controller B sees the annotation update (but has stale replicas=3 in cache)
Controller B: "Annotations changed, let me process this... replicas=3"
Controller B "fixes" replicas back to 3 based on stale cache

Controller A sees replicas change to 3...

Both controllers are acting rationally based on their view. But their views are inconsistent, and they fight.

Don’t trust a single reconciliation. If conditions aren’t met, requeue and check again:

func (r *MyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    var deploy appsv1.Deployment
    if err := r.Get(ctx, req.NamespacedName, &deploy); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }
    
    // Check if we've processed this
    if deploy.Annotations["my-controller/processed"] != "true" {
        // Do processing...
        if deploy.Annotations == nil {
            deploy.Annotations = make(map[string]string)
        }
        deploy.Annotations["my-controller/processed"] = "true"
        if err := r.Update(ctx, &deploy); err != nil {
            return ctrl.Result{}, err
        }
        // Don't trust the update immediately - requeue to verify
        return ctrl.Result{RequeueAfter: 1 * time.Second}, nil
    }
    
    return ctrl.Result{}, nil
}

Always carry the ResourceVersion you read into your update. The API server rejects updates with a stale ResourceVersion:

func (r *MyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    var deploy appsv1.Deployment
    if err := r.Get(ctx, req.NamespacedName, &deploy); err != nil {
        return ctrl.Result{}, err
    }
    
    // Modify; guard against a nil label map before writing
    if deploy.Labels == nil {
        deploy.Labels = make(map[string]string)
    }
    deploy.Labels["processed"] = "true"
    
    // Update - this uses the ResourceVersion from Get()
    if err := r.Update(ctx, &deploy); err != nil {
        if apierrors.IsConflict(err) {
            // Someone else modified it - requeue and try again
            log.Info("Conflict detected, requeueing")
            return ctrl.Result{Requeue: true}, nil
        }
        return ctrl.Result{}, err
    }
    
    return ctrl.Result{}, nil
}

The conflict error is your friend. It tells you your view was stale and prevents you from clobbering someone else’s changes.
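
client-go also ships a small helper, retry.RetryOnConflict from k8s.io/client-go/util/retry, for retrying the get-mutate-update loop in place rather than requeueing. A sketch:

    err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
        // Re-read on every attempt so each retry starts from the newest version
        // this client can see, then reapply the mutation.
        var deploy appsv1.Deployment
        if err := r.Get(ctx, req.NamespacedName, &deploy); err != nil {
            return err
        }
        if deploy.Labels == nil {
            deploy.Labels = make(map[string]string)
        }
        deploy.Labels["processed"] = "true"
        return r.Update(ctx, &deploy)
    })
    if err != nil {
        return ctrl.Result{}, err
    }

Note that the re-read inside the closure still goes through the cache here, so a retry may spin a few times before the cache catches up; returning the conflict and letting the work queue requeue is often just as effective.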

When you absolutely need fresh data, bypass the cache:

func (r *MyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // Normal cached read
    var deploy appsv1.Deployment
    if err := r.Get(ctx, req.NamespacedName, &deploy); err != nil {
        return ctrl.Result{}, err
    }
    
    // Need fresh data for critical decision? Bypass cache.
    var freshDeploy appsv1.Deployment
    if err := r.APIReader.Get(ctx, req.NamespacedName, &freshDeploy); err != nil {
        return ctrl.Result{}, err
    }
    
    // freshDeploy was read directly from the API server

    return ctrl.Result{}, nil
}

Use this sparingly — it adds API server load and defeats the purpose of caching. But for critical decisions where eventual consistency isn’t acceptable, it’s the right tool.
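
The uncached reader is not something controller-runtime injects for you; you wire it up from the manager, which exposes a client.Reader that talks straight to the API server. A sketch of that setup (APIReader is simply our own field name):

type MyReconciler struct {
    client.Client               // cached client supplied by the manager
    APIReader client.Reader     // uncached reads, straight to the API server
    Scheme    *runtime.Scheme
}

// Wherever you construct the reconciler (e.g. in main.go):
r := &MyReconciler{
    Client:    mgr.GetClient(),    // reads from the informer cache
    APIReader: mgr.GetAPIReader(), // bypasses the cache entirely
    Scheme:    mgr.GetScheme(),
}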

The kube-controller-manager uses an expectations pattern for ReplicaSets: track what you expect to happen, and wait for the cache to confirm it.

type Expectations struct {
    mu       sync.Mutex
    expected map[string]expectation
}

// NewExpectations initializes the map; writing to a nil map would panic.
func NewExpectations() *Expectations {
    return &Expectations{expected: make(map[string]expectation)}
}

type expectation struct {
    add    int  // expecting this many adds
    delete int  // expecting this many deletes
}

func (e *Expectations) ExpectCreations(key string, count int) {
    e.mu.Lock()
    defer e.mu.Unlock()
    exp := e.expected[key]
    exp.add += count
    e.expected[key] = exp
}

func (e *Expectations) CreationObserved(key string) {
    e.mu.Lock()
    defer e.mu.Unlock()
    exp := e.expected[key]
    exp.add--
    e.expected[key] = exp
}

func (e *Expectations) SatisfiedExpectations(key string) bool {
    e.mu.Lock()
    defer e.mu.Unlock()
    exp := e.expected[key]
    return exp.add <= 0 && exp.delete <= 0
}

In your reconciler:

func (r *MyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    key := req.NamespacedName.String()
    
    // Don't reconcile until expectations are met
    if !r.expectations.SatisfiedExpectations(key) {
        log.Info("Expectations not yet satisfied, skipping")
        return ctrl.Result{}, nil
    }
    
    // ... determine we need to create 3 pods
    
    r.expectations.ExpectCreations(key, 3)
    for i := 0; i < 3; i++ {
        if err := r.Create(ctx, &pod); err != nil {
            // Creation failed - adjust expectations
            r.expectations.CreationObserved(key)
            return ctrl.Result{}, err
        }
    }
    
    return ctrl.Result{}, nil
}

Your pod informer’s event handler calls CreationObserved() when it sees new pods. This stops the reconciler from creating duplicates while the cache still can’t show it the pods it just created.
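
The observation side is the other half of the pattern. A hedged sketch of wiring it up with controller-runtime, where ownerKey is a hypothetical helper that maps a Pod back to the namespace/name key of the resource that owns it:

func (r *MyReconciler) setupPodObserver(mgr ctrl.Manager) error {
    podInformer, err := mgr.GetCache().GetInformer(context.Background(), &corev1.Pod{})
    if err != nil {
        return err
    }

    podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc: func(obj interface{}) {
            pod, ok := obj.(*corev1.Pod)
            if !ok {
                return
            }
            // ownerKey (hypothetical helper, not shown) derives the owning
            // resource's "namespace/name" key from the Pod's OwnerReferences.
            if key, ok := ownerKey(pod); ok {
                r.expectations.CreationObserved(key)
            }
        },
    })

    return nil
}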

For tracking whether a controller has processed the latest spec changes, use the Generation pattern:

func (r *MyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    var obj myv1.MyResource
    if err := r.Get(ctx, req.NamespacedName, &obj); err != nil {
        return ctrl.Result{}, err
    }
    
    // Skip if we've already processed this generation
    if obj.Status.ObservedGeneration == obj.Generation {
        return ctrl.Result{}, nil
    }
    
    // Process the spec...
    
    // Update status to record that we've processed this generation
    obj.Status.ObservedGeneration = obj.Generation
    // meta.SetStatusCondition (from k8s.io/apimachinery/pkg/api/meta) replaces the
    // existing "Ready" condition in place instead of appending a duplicate entry
    // on every generation change.
    meta.SetStatusCondition(&obj.Status.Conditions, metav1.Condition{
        Type:               "Ready",
        Status:             metav1.ConditionTrue,
        Reason:             "Reconciled", // illustrative reason string
        ObservedGeneration: obj.Generation,
        LastTransitionTime: metav1.Now(),
    })
    
    if err := r.Status().Update(ctx, &obj); err != nil {
        return ctrl.Result{}, err
    }
    
    return ctrl.Result{}, nil
}

metadata.generation increments only when the spec changes. status.observedGeneration records which generation your controller last processed. This gives you a reliable way to detect whether there is new work to do, even with cache lag.
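
For your own CRD this means the status type needs an observedGeneration field; it is not there by default. A minimal sketch of the Go types, with illustrative names:

// MyResourceStatus is the status subresource for MyResource. The field names
// are illustrative; only the pattern matters.
type MyResourceStatus struct {
    // observedGeneration is the metadata.generation most recently processed
    // by the controller.
    // +optional
    ObservedGeneration int64 `json:"observedGeneration,omitempty"`

    // +optional
    Conditions []metav1.Condition `json:"conditions,omitempty"`
}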

Add metrics to track how stale your cache reads are:

var (
    cacheAgeHistogram = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "controller_cache_age_seconds",
            Help:    "Age of objects read from cache",
            Buckets: []float64{0.01, 0.05, 0.1, 0.5, 1, 5, 10, 30},
        },
        []string{"controller", "resource"},
    )
)

func (r *MyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    
    var deploy appsv1.Deployment
    if err := r.Get(ctx, req.NamespacedName, &deploy); err != nil {
        return ctrl.Result{}, err
    }
    
    // Estimate cache age from the most recent managedFields timestamp
    if n := len(deploy.ManagedFields); n > 0 {
        if lastUpdate := deploy.ManagedFields[n-1].Time; lastUpdate != nil {
            age := time.Since(lastUpdate.Time).Seconds()
            cacheAgeHistogram.WithLabelValues("mycontroller", "deployment").Observe(age)
        }
    }
    
    // ...

    return ctrl.Result{}, nil
}
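
The histogram still has to be registered somewhere. With controller-runtime, one option is the global metrics registry, which exposes it on the manager's /metrics endpoint alongside the built-in metrics:

import (
    "sigs.k8s.io/controller-runtime/pkg/metrics"
)

func init() {
    // Register with controller-runtime's registry so the metric is served
    // from the manager's /metrics endpoint.
    metrics.Registry.MustRegister(cacheAgeHistogram)
}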

For testing, add artificial delay to your event handlers:

func setupEventHandlers(mgr ctrl.Manager) error {
    informer, err := mgr.GetCache().GetInformer(context.Background(), &appsv1.Deployment{})
    if err != nil {
        return err
    }
    
    informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc: func(obj interface{}) {
            if simulateLag { // a package-level test flag, defined elsewhere
                time.Sleep(2 * time.Second)  // Simulate busy cluster
            }
        },
        UpdateFunc: func(old, new interface{}) {
            if simulateLag {
                time.Sleep(2 * time.Second)
            }
        },
    })
    
    return nil
}

Run your controller with this lag injected and watch how it behaves.

Log enough context to debug stale cache issues:

func (r *MyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    log := log.FromContext(ctx)
    
    var deploy appsv1.Deployment
    if err := r.Get(ctx, req.NamespacedName, &deploy); err != nil {
        return ctrl.Result{}, err
    }
    
    log.V(1).Info("Reconciling",
        "resourceVersion", deploy.ResourceVersion,
        "generation", deploy.Generation,
        "observedGeneration", deploy.Status.ObservedGeneration,
        "labels", deploy.Labels,
    )
    
    // ... do work
    
    log.V(1).Info("Reconcile complete",
        "resultingResourceVersion", deploy.ResourceVersion,
    )
    
    return ctrl.Result{}, nil
}

When debugging, compare these ResourceVersions across log entries to trace propagation delays.

The informer cache isn’t broken — it’s working as designed. Kubernetes is an eventually consistent system, and your controller must be too.

Design principles:

  1. Idempotency: Running your reconciler twice with the same input should produce the same result
  2. Convergence: Given enough time without new inputs, the system reaches the desired state
  3. Tolerance: Your controller handles stale data gracefully — it might make suboptimal decisions, but never catastrophically wrong ones
  4. Verification: Don’t trust a single check. Requeue, recheck, confirm

The goal isn’t to eliminate cache lag — it’s to build controllers that work correctly despite it.