Leader Election in Kubernetes Controllers


You deploy your controller with 3 replicas for high availability. But if all 3 try to reconcile simultaneously, you get duplicate actions, race conditions, and chaos. The solution: leader election. Only one replica is active; the others wait on standby.

This post covers how leader election works in Kubernetes, how to implement it, and what happens when things go wrong.

Without leader election, multiple controller replicas all watch the same resources:

Pod created
    |
    +-----> Controller-1: Creates Deployment
    |
    +-----> Controller-2: Creates Deployment (duplicate!)
    |
    +-----> Controller-3: Creates Deployment (duplicate!)

Results:

  • Duplicate resources created
  • Conflicting updates overwrite each other
  • Resource counts are wrong
  • State becomes inconsistent

You need exactly one active controller at a time.

Leader election ensures only one replica (the “leader”) is active:

+-------------+     +-------------+     +-------------+
| Controller-1|     | Controller-2|     | Controller-3|
|   LEADER    |     |   STANDBY   |     |   STANDBY   |
|  (active)   |     |  (waiting)  |     |  (waiting)  |
+------+------+     +------+------+     +------+------+
       |                   |                   |
       v                   v                   v
+--------------------------------------------------+
|              Kubernetes API Server               |
|                                                  |
|  Lock Object (Lease/ConfigMap/Endpoint)          |
|  holder: controller-1                            |
|  renewTime: 2025-01-26T10:00:00Z                 |
+--------------------------------------------------+

The leader periodically renews its lock. If it stops (crash, network partition), the lock expires and another replica becomes leader.

Kubernetes leader election uses a simple lease-based algorithm:

1. Try to create/update lock object with my identity
2. If successful, I'm the leader
3. If lock exists and isn't expired, wait and retry
4. If lock exists but is expired, try to take over
While I'm the leader:
  1. Do controller work
  2. Periodically renew the lock (update renewTime)
  3. If renewal fails, stop doing work and re-enter election
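
The heart of the algorithm is a timestamp comparison. Here is a minimal sketch of that check (illustrative only, not client-go's actual implementation; the lease type is simplified):

package main

import (
    "fmt"
    "time"
)

// lease holds only the fields a candidate needs for the expiry check.
type lease struct {
    holderIdentity string
    renewTime      time.Time
    leaseDuration  time.Duration
}

// canAcquire reports whether candidate "me" may take the lock at time "now".
func canAcquire(l lease, me string, now time.Time) bool {
    if l.holderIdentity == "" || l.holderIdentity == me {
        return true // unheld, or we already hold it: acquire/renew
    }
    // Held by someone else: take over only if the lease has expired.
    return now.After(l.renewTime.Add(l.leaseDuration))
}

func main() {
    l := lease{
        holderIdentity: "controller-1",
        renewTime:      time.Now().Add(-20 * time.Second),
        leaseDuration:  15 * time.Second,
    }
    // The holder last renewed 20s ago with a 15s lease, so a standby may take over.
    fmt.Println(canAcquire(l, "controller-2", time.Now())) // true
}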

The lock is a Kubernetes object—historically ConfigMaps or Endpoints, now preferably Leases:

apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: my-controller
  namespace: kube-system
spec:
  holderIdentity: controller-1-abc123
  leaseDurationSeconds: 15
  acquireTime: "2025-01-26T10:00:00Z"
  renewTime: "2025-01-26T10:00:10Z"
  leaseTransitions: 5

Key fields:

  • holderIdentity: Who holds the lock (usually pod name)
  • leaseDurationSeconds: How long the lock is valid without renewal
  • renewTime: Last time the holder renewed
  • leaseTransitions: How many times leadership changed
On the client side, three timing parameters control this cycle:

LeaseDuration: 15 * time.Second  // Lock valid for this long
RenewDeadline: 10 * time.Second  // Must renew within this time
RetryPeriod:   2 * time.Second   // How often to retry acquiring

Timeline for failover:

0s   - Leader renews lock
2s   - Leader crashes
15s  - Lock expires (LeaseDuration)
15s  - Standby notices expired lock
17s  - Standby acquires lock, becomes leader

Total failover time: ~15 seconds
Here is a minimal implementation using client-go's leaderelection package:

package main

import (
    "context"
    "os"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/leaderelection"
    "k8s.io/client-go/tools/leaderelection/resourcelock"
    "k8s.io/klog/v2"
)

func main() {
    // Build the client config (assumes the controller runs in-cluster)
    config, err := rest.InClusterConfig()
    if err != nil {
        klog.Fatal(err)
    }

    // Create clientset
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        klog.Fatal(err)
    }

    // Get pod identity
    id, err := os.Hostname()
    if err != nil {
        klog.Fatal(err)
    }

    // Create the lock
    lock := &resourcelock.LeaseLock{
        LeaseMeta: metav1.ObjectMeta{
            Name:      "my-controller",
            Namespace: "default",
        },
        Client: clientset.CoordinationV1(),
        LockConfig: resourcelock.ResourceLockConfig{
            Identity: id,
        },
    }

    // Start leader election
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()

    leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
        Lock:            lock,
        LeaseDuration:   15 * time.Second,
        RenewDeadline:   10 * time.Second,
        RetryPeriod:     2 * time.Second,
        Callbacks: leaderelection.LeaderCallbacks{
            OnStartedLeading: func(ctx context.Context) {
                // This is called when we become the leader
                klog.Info("Started leading")
                runController(ctx)
            },
            OnStoppedLeading: func() {
                // This is called when we stop being the leader
                klog.Info("Stopped leading")
                os.Exit(0)  // Exit so Kubernetes restarts us
            },
            OnNewLeader: func(identity string) {
                // This is called when leadership changes
                if identity == id {
                    return  // It's us
                }
                klog.Infof("New leader elected: %s", identity)
            },
        },
        ReleaseOnCancel: true,
    })
}

func runController(ctx context.Context) {
    // Your controller logic here
    // This runs only while we're the leader
    <-ctx.Done()
}

OnStartedLeading: Called when you become leader. Start your controller work here. The context is cancelled when you lose leadership.

OnStoppedLeading: Called when you lose leadership. Usually you should exit so Kubernetes can restart you cleanly.

ReleaseOnCancel: If true, releases the lock when context is cancelled (graceful shutdown).
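
ReleaseOnCancel only helps if the context actually gets cancelled on shutdown. One way to wire that up is with the standard library's signal handling (a sketch; how you construct the context varies by project):

package main

import (
    "context"
    "os"
    "os/signal"
    "syscall"
)

func main() {
    // Cancel the election context when Kubernetes sends SIGTERM during pod
    // shutdown; with ReleaseOnCancel the lease is cleared before exit,
    // so a standby can take over immediately.
    ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, os.Interrupt)
    defer stop()

    // leaderelection.RunOrDie(ctx, cfg) would go here; it returns once
    // ctx is cancelled.
    _ = ctx
}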

controller-runtime (used by Kubebuilder/Operator SDK) has built-in leader election:

package main

import (
    "os"
    "time"

    "sigs.k8s.io/controller-runtime/pkg/client/config"
    "sigs.k8s.io/controller-runtime/pkg/manager"
    "sigs.k8s.io/controller-runtime/pkg/manager/signals"
)

func main() {
    mgr, err := manager.New(config.GetConfigOrDie(), manager.Options{
        // Enable leader election
        LeaderElection:          true,
        LeaderElectionID:        "my-controller.example.com",
        LeaderElectionNamespace: "default",

        // Optional: customize timing
        LeaseDuration: durationPtr(15 * time.Second),
        RenewDeadline: durationPtr(10 * time.Second),
        RetryPeriod:   durationPtr(2 * time.Second),
    })
    if err != nil {
        os.Exit(1)
    }

    // Add your controller to the manager
    if err := (&MyReconciler{}).SetupWithManager(mgr); err != nil {
        os.Exit(1)
    }

    // Start manager - handles leader election automatically
    if err := mgr.Start(signals.SetupSignalHandler()); err != nil {
        os.Exit(1)
    }
}

func durationPtr(d time.Duration) *time.Duration {
    return &d
}

That’s it! The manager handles:

  • Acquiring leadership before starting controllers
  • Renewing the lock periodically
  • Stopping controllers when leadership is lost
  • Graceful shutdown and lock release

In Kubebuilder-generated projects, enable it via a flag in main.go:

var enableLeaderElection bool

func init() {
    flag.BoolVar(&enableLeaderElection, "leader-elect", false,
        "Enable leader election for controller manager.")
}

func main() {
    // ...
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        LeaderElection:   enableLeaderElection,
        LeaderElectionID: "my-operator.example.com",
    })
    // ...
}

Then deploy with --leader-elect=true:

spec:
  containers:
    - name: controller
      args:
        - --leader-elect=true

Kubernetes' own kube-controller-manager uses leader election too:

kubectl get lease -n kube-system kube-controller-manager -o yaml

apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  holderIdentity: master-1_abc123
  leaseDurationSeconds: 15
  renewTime: "2025-01-26T10:00:00.000000Z"

The scheduler follows the same pattern, and you can inspect any controller's lock the same way:

kubectl get lease -n kube-system kube-scheduler -o yaml
# For any controller using Lease
kubectl get lease -n <namespace> <lock-name> \
  -o jsonpath='{.spec.holderIdentity}'

When the leader terminates gracefully (SIGTERM), it should release the lock:

// With ReleaseOnCancel: true
leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
    // ...
    ReleaseOnCancel: true,  // Release lock on graceful shutdown
})

This allows immediate failover instead of waiting for lock expiry.

If the leader crashes (SIGKILL, node failure), it can’t release the lock. Other replicas must wait for expiry:

Leader crashes (no graceful release)
    |
    v
Other replicas see lock still held
    |
    v
Wait for LeaseDuration (15s default)
    |
    v
Lock expires
    |
    v
New leader acquires lock

Trade-off: Shorter LeaseDuration = faster failover but more API server load from frequent renewals.
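
For illustration, a tighter configuration (these values are an assumption, not recommended defaults) trades API load and tolerance for faster failover; client-go requires LeaseDuration to exceed RenewDeadline, which in turn must exceed RetryPeriod:

// Illustrative "fast failover" settings (not defaults): worst-case failover
// drops to ~6s, but every candidate polls the API server every second and
// a brief API hiccup is enough to cost the leader its lease.
LeaseDuration: 6 * time.Second
RenewDeadline: 4 * time.Second
RetryPeriod:   1 * time.Second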

What if the leader can’t reach the API server but is still running?

Leader               API Server          Standby
   |                      |                  |
   |---renew (fails)----->|                  |
   |                      |                  |
   |  (network issue)     |                  |
   |                      |                  |
   |                      |<---acquire lock--|
   |                      |                  |
   |                      |---OK (new leader)|
   |                      |                  |
   v                      v                  v
Leader thinks          Standby becomes
it's still leader?     new leader!

The old leader MUST stop working when it can’t renew. This is why RenewDeadline exists:

RenewDeadline: 10 * time.Second  // Must renew within 10s

If renewal fails for 10 seconds, the leader:

  1. Stops doing work (context cancelled)
  2. Calls OnStoppedLeading
  3. Usually exits

Critical: Your controller must respect the context. If it ignores cancellation, you get split-brain:

// GOOD: Respects context
func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // Check context throughout long operations
    select {
    case <-ctx.Done():
        return ctrl.Result{}, ctx.Err()
    default:
    }
    
    // Do work...
    return ctrl.Result{}, nil
}

// BAD: Ignores context
func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // Long-running work that ignores ctx
    time.Sleep(30 * time.Second)  // Doesn't check ctx!
    doWork()
    return ctrl.Result{}, nil
}

Leader election relies on time. Significant clock skew between nodes can cause issues:

Node A clock: 10:00:00
Node B clock: 10:00:30 (30s ahead)

Node A holds lock, renewTime = 10:00:00
Node B sees lock, thinks it expired 15s ago!
Node B tries to take over...

Mitigation:

  • Use NTP to keep clocks synchronized
  • Kubernetes tolerates small skew (a few seconds)
  • LeaseDuration should be much longer than any expected clock skew

The lock object must be in a namespace the controller can access:

LeaderElectionNamespace: "my-controller-system",

Common patterns:

  • Same namespace as the controller deployment
  • kube-system for cluster-wide controllers
  • Dedicated namespace for all controllers’ locks
To see who currently holds the lock:

# Get current leader
kubectl get lease my-controller -n default -o jsonpath='{.spec.holderIdentity}'
controller-1-abc123

# Get full lease details
kubectl get lease my-controller -n default -o yaml

Check the lease:

kubectl describe lease my-controller -n default

Name:         my-controller
Namespace:    default
...
Spec:
  Holder Identity:        controller-1-abc123
  Lease Duration Seconds: 15
  Renew Time:             2025-01-26T10:00:00.000000Z

If renewTime is recent: Current leader is healthy. Your replica is correctly waiting.

If renewTime is stale: Lock should have expired. Check if your replica has permission to update the lease:

# Check RBAC
kubectl auth can-i update leases.coordination.k8s.io --as=system:serviceaccount:default:my-controller

If leadership bounces between replicas:

kubectl get lease my-controller -o jsonpath='{.spec.leaseTransitions}'

A high leaseTransitions count indicates instability. Common causes:

  • Network instability between controller and API server
  • Controller crash-looping
  • Resource starvation (CPU/memory) causing slow renewals
  • API server overload causing timeout on renewals

To confirm a replica actually stepped down when it lost the lock, check its logs for:

"Stopped leading"

If this message doesn't appear, or the controller keeps reconciling after losing the lease:

  • OnStoppedLeading might not be calling os.Exit()
  • Context isn’t being propagated/respected
  • Long-running operations ignoring cancellation

The model we've discussed so far is active/standby: one replica does the work, the others wait.

Replicas: 3
Active: 1
Failover time: ~15 seconds

For some controllers, you can shard work across replicas:

// Each replica handles different namespaces
func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    if !r.shouldHandle(req.Namespace) {
        return ctrl.Result{}, nil  // Let another replica handle it
    }
    // ...
}

// Assumes totalReplicas and replicaIndex are uint32 fields; needs "hash/fnv".
func (r *Reconciler) shouldHandle(namespace string) bool {
    h := fnv.New32()
    h.Write([]byte(namespace))
    return h.Sum32()%r.totalReplicas == r.replicaIndex
}

Pros: better throughput, no failover delay.
Cons: more complex; you need to handle rebalancing when the replica count changes.

Use leader election for cluster-scoped resources, sharding for namespaced:

// Cluster-scoped: requires leadership
// Namespaced: sharded across replicas
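
A rough sketch of the hybrid guard, assuming an isLeader() helper (hypothetical, fed by the election callbacks) and the shouldHandle() sharding function above:

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // Cluster-scoped objects arrive with an empty namespace.
    clusterScoped := req.Namespace == ""

    if clusterScoped && !r.isLeader() {
        return ctrl.Result{}, nil // only the leader touches cluster-scoped objects
    }
    if !clusterScoped && !r.shouldHandle(req.Namespace) {
        return ctrl.Result{}, nil // another replica owns this namespace's shard
    }

    // ... actual reconciliation ...
    return ctrl.Result{}, nil
}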

Your controller’s ServiceAccount needs permission to manage the lock:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: leader-election-role
  namespace: default
rules:
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "create", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: leader-election-rolebinding
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: leader-election-role
subjects:
  - kind: ServiceAccount
    name: my-controller
    namespace: default

If using ConfigMaps or Endpoints (legacy):

rules:
  - apiGroups: [""]
    resources: ["configmaps"]  # or "endpoints"
    verbs: ["get", "create", "update"]

Leases are purpose-built for leader election. ConfigMaps and Endpoints work but have drawbacks:

  • ConfigMaps: Extra data in etcd
  • Endpoints: Confusion with actual service endpoints

lock := &resourcelock.LeaseLock{...}  // Preferred

Include your controller/operator name to avoid conflicts:

LeaderElectionID: "my-company.my-operator.example.com"

Don’t try to be clever. When you lose leadership, exit:

OnStoppedLeading: func() {
    klog.Info("Lost leadership, exiting")
    os.Exit(0)  // Let Kubernetes restart us
},

Trying to re-acquire in the same process is fragile.

All reconciliation should check context:

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    if err := ctx.Err(); err != nil {
        return ctrl.Result{}, err
    }
    // ...
}

Expose metrics about leadership:

var isLeader = prometheus.NewGauge(prometheus.GaugeOpts{
    Name: "controller_is_leader",
    Help: "1 if this instance is the leader, 0 otherwise",
})

OnStartedLeading: func(ctx context.Context) {
    isLeader.Set(1)
    runController(ctx)
},
OnStoppedLeading: func() {
    isLeader.Set(0)
    os.Exit(0)
},
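
The gauge still has to be registered and served somewhere. A self-contained sketch using the default Prometheus registry (the /metrics path and port are arbitrary assumptions):

package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var isLeader = prometheus.NewGauge(prometheus.GaugeOpts{
    Name: "controller_is_leader",
    Help: "1 if this instance is the leader, 0 otherwise",
})

func main() {
    // Register the gauge and expose it; without registration the metric
    // never shows up in /metrics.
    prometheus.MustRegister(isLeader)
    http.Handle("/metrics", promhttp.Handler())
    _ = http.ListenAndServe(":8080", nil) // port is arbitrary for this sketch
}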

Leader election ensures exactly one controller replica is active:

Component            Purpose
Lock object (Lease)  Stores the current leader's identity
LeaseDuration        How long the lock is valid without renewal
RenewDeadline        Max time to renew before giving up leadership
RetryPeriod          How often standbys check the lock

Key timing:

  • Graceful failover: Immediate (lock released)
  • Ungraceful failover: LeaseDuration (default 15s)

Implementation:

  • client-go: leaderelection.RunOrDie() with callbacks
  • controller-runtime: LeaderElection: true in manager options

Critical rules:

  1. Leader must stop work when it can’t renew
  2. All work must respect context cancellation
  3. Exit on leadership loss—don’t try to recover
  4. Use Leases, not ConfigMaps/Endpoints

Leader election is what makes HA controllers possible. Without it, you get chaos. With it, you get automatic failover with minimal downtime.