You deploy your controller with 3 replicas for high availability. But if all 3 try to reconcile simultaneously, you get duplicate actions, race conditions, and chaos. The solution: leader election. Only one replica is active; the others wait on standby.
This post covers how leader election works in Kubernetes, how to implement it, and what happens when things go wrong.
The Problem: Multiple Active Controllers ¶
Without leader election, multiple controller replicas all watch the same resources:
Pod created
|
+-----> Controller-1: Creates Deployment
|
+-----> Controller-2: Creates Deployment (duplicate!)
|
+-----> Controller-3: Creates Deployment (duplicate!)
Results:
- Duplicate resources created
- Conflicting updates overwrite each other
- Resource counts are wrong
- State becomes inconsistent
You need exactly one active controller at a time.
Leader Election Overview ¶
Leader election ensures only one replica (the “leader”) is active:
+-------------+ +-------------+ +-------------+
| Controller-1| | Controller-2| | Controller-3|
| LEADER | | STANDBY | | STANDBY |
| (active) | | (waiting) | | (waiting) |
+------+------+ +------+------+ +------+------+
| | |
v v v
+--------------------------------------------------+
| Kubernetes API Server |
| |
| Lock Object (Lease/ConfigMap/Endpoint) |
| holder: controller-1 |
| renewTime: 2025-01-26T10:00:00Z |
+--------------------------------------------------+
The leader periodically renews its lock. If it stops (crash, network partition), the lock expires and another replica becomes leader.
How It Works: The Algorithm ¶
Kubernetes leader election uses a simple lease-based algorithm:
Acquiring Leadership ¶
1. Try to create/update lock object with my identity
2. If successful, I'm the leader
3. If lock exists and isn't expired, wait and retry
4. If lock exists but is expired, try to take over
Maintaining Leadership ¶
While I'm the leader:
1. Do controller work
2. Periodically renew the lock (update renewTime)
3. If renewal fails, stop doing work and re-enter election
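To make the decision rule concrete, here is a small, self-contained sketch of the acquire-or-renew check against an in-memory stand-in for the lock object. The real client-go implementation runs the same comparison against the API server and adds jitter, retries, and conflict handling:
package main

import (
	"fmt"
	"time"
)

// lease models only the lock fields the algorithm cares about.
type lease struct {
	holder    string
	renewTime time.Time
	duration  time.Duration
}

// tryAcquireOrRenew captures the core decision: take the lock if it is free or
// expired, renew it if we already hold it, otherwise stay on standby.
func (l *lease) tryAcquireOrRenew(id string, now time.Time) bool {
	expired := now.After(l.renewTime.Add(l.duration))
	if l.holder == "" || expired || l.holder == id {
		l.holder = id
		l.renewTime = now
		return true
	}
	return false
}

func main() {
	lock := &lease{duration: 15 * time.Second}
	for _, id := range []string{"controller-1", "controller-2"} {
		if lock.tryAcquireOrRenew(id, time.Now()) {
			fmt.Printf("%s is the leader\n", id)
		} else {
			fmt.Printf("%s is on standby\n", id)
		}
	}
}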
Lock Object ¶
The lock is a Kubernetes object—historically ConfigMaps or Endpoints, now preferably Leases:
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
name: my-controller
namespace: kube-system
spec:
holderIdentity: controller-1-abc123
leaseDurationSeconds: 15
acquireTime: "2025-01-26T10:00:00Z"
renewTime: "2025-01-26T10:00:10Z"
leaseTransitions: 5
Key fields:
- holderIdentity: Who holds the lock (usually pod name)
- leaseDurationSeconds: How long the lock is valid without renewal
- renewTime: Last time the holder renewed
- leaseTransitions: How many times leadership changed
Timing Parameters ¶
LeaseDuration: 15 * time.Second // Lock valid for this long
RenewDeadline: 10 * time.Second // Must renew within this time
RetryPeriod: 2 * time.Second // How often to retry acquiring
Timeline for failover:
0s - Leader renews lock
2s - Leader crashes
15s - Lock expires (LeaseDuration)
15s - Standby notices expired lock
17s - Standby acquires lock, becomes leader
Total failover time: ~15 seconds
Implementation with client-go ¶
Basic Leader Election ¶
package main
import (
"context"
"os"
"time"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/tools/leaderelection"
"k8s.io/client-go/tools/leaderelection/resourcelock"
"k8s.io/klog/v2"
)
func main() {
	// Build the client config; this example assumes the controller runs in-cluster
	config, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatal(err)
	}

	// Create clientset
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		klog.Fatal(err)
	}
// Get pod identity
id, err := os.Hostname()
if err != nil {
klog.Fatal(err)
}
// Create the lock
lock := &resourcelock.LeaseLock{
LeaseMeta: metav1.ObjectMeta{
Name: "my-controller",
Namespace: "default",
},
Client: clientset.CoordinationV1(),
LockConfig: resourcelock.ResourceLockConfig{
Identity: id,
},
}
// Start leader election
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
Lock: lock,
LeaseDuration: 15 * time.Second,
RenewDeadline: 10 * time.Second,
RetryPeriod: 2 * time.Second,
Callbacks: leaderelection.LeaderCallbacks{
OnStartedLeading: func(ctx context.Context) {
// This is called when we become the leader
klog.Info("Started leading")
runController(ctx)
},
OnStoppedLeading: func() {
// This is called when we stop being the leader
klog.Info("Stopped leading")
os.Exit(0) // Exit so Kubernetes restarts us
},
OnNewLeader: func(identity string) {
// This is called when leadership changes
if identity == id {
return // It's us
}
klog.Infof("New leader elected: %s", identity)
},
},
ReleaseOnCancel: true,
})
}
func runController(ctx context.Context) {
// Your controller logic here
// This runs only while we're the leader
<-ctx.Done()
}
Key Points ¶
OnStartedLeading: Called when you become leader. Start your controller work here. The context is cancelled when you lose leadership.
OnStoppedLeading: Called when you lose leadership. Usually you should exit so Kubernetes can restart you cleanly.
ReleaseOnCancel: If true, releases the lock when context is cancelled (graceful shutdown).
Implementation with controller-runtime ¶
controller-runtime (used by Kubebuilder/Operator SDK) has built-in leader election:
package main
import (
	"os"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
	"sigs.k8s.io/controller-runtime/pkg/manager/signals"
)

func main() {
	mgr, err := manager.New(ctrl.GetConfigOrDie(), manager.Options{
// Enable leader election
LeaderElection: true,
LeaderElectionID: "my-controller.example.com",
LeaderElectionNamespace: "default",
// Optional: customize timing
LeaseDuration: durationPtr(15 * time.Second),
RenewDeadline: durationPtr(10 * time.Second),
RetryPeriod: durationPtr(2 * time.Second),
})
if err != nil {
os.Exit(1)
}
// Add your controller to the manager
if err := (&MyReconciler{}).SetupWithManager(mgr); err != nil {
os.Exit(1)
}
// Start manager - handles leader election automatically
if err := mgr.Start(signals.SetupSignalHandler()); err != nil {
os.Exit(1)
}
}
func durationPtr(d time.Duration) *time.Duration {
return &d
}
That’s it! The manager handles:
- Acquiring leadership before starting controllers
- Renewing the lock periodically
- Stopping controllers when leadership is lost
- Graceful shutdown and lock release
Kubebuilder Projects ¶
In Kubebuilder-generated projects, enable in main.go:
var enableLeaderElection bool
func init() {
flag.BoolVar(&enableLeaderElection, "leader-elect", false,
"Enable leader election for controller manager.")
}
func main() {
// ...
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
LeaderElection: enableLeaderElection,
LeaderElectionID: "my-operator.example.com",
})
// ...
}
Then deploy with --leader-elect=true:
spec:
containers:
- name: controller
args:
- --leader-elect=true
Leader Election in the Wild ¶
kube-controller-manager ¶
The built-in controller manager uses leader election:
kubectl get lease -n kube-system kube-controller-manager -o yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
name: kube-controller-manager
namespace: kube-system
spec:
holderIdentity: master-1_abc123
leaseDurationSeconds: 15
renewTime: "2025-01-26T10:00:00.000000Z"
kube-scheduler ¶
Same pattern:
kubectl get lease -n kube-system kube-scheduler -o yaml
Checking Current Leader ¶
# For any controller using Lease
kubectl get lease -n <namespace> <lock-name> \
-o jsonpath='{.spec.holderIdentity}'
Handling Edge Cases ¶
Graceful Shutdown ¶
When the leader terminates gracefully (SIGTERM), it should release the lock:
// With ReleaseOnCancel: true
leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
// ...
ReleaseOnCancel: true, // Release lock on graceful shutdown
})
This allows immediate failover instead of waiting for lock expiry.
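For ReleaseOnCancel to do anything, the election context has to actually be cancelled on SIGTERM. One way to wire that up is signal.NotifyContext from the standard library; a minimal sketch, assuming the LeaderElectionConfig is built as in the client-go example above (the helper name is purely illustrative):
// Extra imports needed: "os/signal", "syscall".

// runWithGracefulRelease cancels the election context on SIGTERM/SIGINT so that
// ReleaseOnCancel can clear holderIdentity before the process exits.
func runWithGracefulRelease(cfg leaderelection.LeaderElectionConfig) {
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
	defer stop()

	cfg.ReleaseOnCancel = true // only takes effect on clean cancellation, not on a crash
	leaderelection.RunOrDie(ctx, cfg)
}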
Ungraceful Termination ¶
If the leader crashes (SIGKILL, node failure), it can’t release the lock. Other replicas must wait for expiry:
Leader crashes (no graceful release)
|
v
Other replicas see lock still held
|
v
Wait for LeaseDuration (15s default)
|
v
Lock expires
|
v
New leader acquires lock
Trade-off: Shorter LeaseDuration = faster failover but more API server load from frequent renewals.
Network Partition (Split Brain?) ¶
What if the leader can’t reach the API server but is still running?
Leader API Server Standby
| | |
|---renew (fails)----->| |
| | |
| (network issue) | |
| | |
| |<---acquire lock--|
| | |
| |---OK (new leader)|
| | |
v v v
Leader thinks Standby becomes
it's still leader? new leader!
The old leader MUST stop working when it can’t renew. This is why RenewDeadline exists:
RenewDeadline: 10 * time.Second // Must renew within 10s
If renewal fails for 10 seconds, the leader:
- Stops doing work (context cancelled)
- Calls OnStoppedLeading
- Usually exits
Critical: Your controller must respect the context. If it ignores cancellation, you get split-brain:
// GOOD: Respects context
func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
// Check context throughout long operations
select {
case <-ctx.Done():
return ctrl.Result{}, ctx.Err()
default:
}
	// Do work...
	return ctrl.Result{}, nil
}
// BAD: Ignores context
func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
// Long-running work that ignores ctx
time.Sleep(30 * time.Second) // Doesn't check ctx!
	doWork()
	return ctrl.Result{}, nil
}
Clock Skew ¶
Leader election relies on time. Significant clock skew between nodes can cause issues:
Node A clock: 10:00:00
Node B clock: 10:00:30 (30s ahead)
Node A holds lock, renewTime = 10:00:00
Node B sees lock, thinks it expired 15s ago!
Node B tries to take over...
Mitigation:
- Use NTP to keep clocks synchronized
- Kubernetes tolerates small skew (a few seconds)
- LeaseDuration should be much greater than the expected clock skew
Namespace Considerations ¶
The lock object must be in a namespace the controller can access:
LeaderElectionNamespace: "my-controller-system",
Common patterns:
- Same namespace as the controller deployment
- kube-system for cluster-wide controllers
- Dedicated namespace for all controllers’ locks
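Rather than hard-coding the namespace, a common convention is to look it up at startup. A small sketch; the POD_NAMESPACE variable is an assumption you would inject yourself via the downward API, and the fallback path is the standard in-cluster service account mount (controller-runtime applies a similar fallback when LeaderElectionNamespace is left empty):
// Extra imports needed: "os", "strings".

// lockNamespace picks the namespace for the election lock: first a POD_NAMESPACE
// env var (downward API), then the in-cluster service account namespace file,
// then a last-resort default.
func lockNamespace() string {
	if ns := os.Getenv("POD_NAMESPACE"); ns != "" {
		return ns
	}
	if data, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/namespace"); err == nil {
		return strings.TrimSpace(string(data))
	}
	return "default"
}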
Debugging Leader Election ¶
Who’s the Leader? ¶
# Get current leader
kubectl get lease my-controller -n default -o jsonpath='{.spec.holderIdentity}'
controller-1-abc123
# Get full lease details
kubectl get lease my-controller -n default -o yaml
Why Isn’t My Replica Becoming Leader? ¶
Check the lease:
kubectl describe lease my-controller -n default
Name: my-controller
Namespace: default
...
Spec:
Holder Identity: controller-1-abc123
Lease Duration Seconds: 15
Renew Time: 2025-01-26T10:00:00.000000Z
If renewTime is recent: Current leader is healthy. Your replica is correctly waiting.
If renewTime is stale: Lock should have expired. Check if your replica has permission to update the lease:
# Check RBAC
kubectl auth can-i update leases.coordination.k8s.io --as=system:serviceaccount:default:my-controller
Frequent Leadership Changes ¶
If leadership bounces between replicas:
kubectl get lease my-controller -o jsonpath='{.spec.leaseTransitions}'
High leaseTransitions indicates instability. Common causes:
- Network instability between controller and API server
- Controller crash-looping
- Resource starvation (CPU/memory) causing slow renewals
- API server overload causing timeout on renewals
Controller Not Stopping After Losing Leadership ¶
Check logs for:
"Stopped leading"
If this doesn’t appear, or controller continues working:
- OnStoppedLeading might not be calling os.Exit()
- Context isn’t being propagated/respected
- Long-running operations ignoring cancellation
High Availability Patterns ¶
Active-Passive (Leader Election) ¶
What we’ve discussed: one active, others standby.
Replicas: 3
Active: 1
Failover time: ~15 seconds
Active-Active (Sharding) ¶
For some controllers, you can shard work across replicas:
// Each replica handles different namespaces
func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
if !r.shouldHandle(req.Namespace) {
return ctrl.Result{}, nil // Let another replica handle it
}
// ...
}
func (r *Reconciler) shouldHandle(namespace string) bool {
	// assumes replicaIndex and totalReplicas are per-replica configuration,
	// e.g. derived from a StatefulSet ordinal
hash := fnv.New32()
hash.Write([]byte(namespace))
return hash.Sum32() % r.totalReplicas == r.replicaIndex
}
Pros: Better throughput, no failover delay
Cons: More complex, need to handle rebalancing
Hybrid ¶
Use leader election for cluster-scoped resources, sharding for namespaced:
// Cluster-scoped: requires leadership
// Namespaced: sharded across replicas
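With controller-runtime, one way to express this split is the manager's LeaderElectionRunnable interface: runnables whose NeedLeaderElection returns false start on every replica, while controllers registered the usual way keep waiting for leadership. A sketch, with shardedWorker as a hypothetical type:
// Extra import needed: "sigs.k8s.io/controller-runtime/pkg/manager".

// shardedWorker handles the namespaced, sharded work and therefore runs on
// every replica instead of waiting for the leader lock.
type shardedWorker struct{}

// Start runs until the manager shuts down; the per-shard work loop goes here.
func (w *shardedWorker) Start(ctx context.Context) error {
	<-ctx.Done()
	return nil
}

// NeedLeaderElection tells the manager to start this runnable immediately on
// every replica.
func (w *shardedWorker) NeedLeaderElection() bool { return false }

// Compile-time check: the manager treats this as a non-leader-elected runnable.
var _ manager.LeaderElectionRunnable = &shardedWorker{}
Register it with mgr.Add(&shardedWorker{}) before starting the manager; controllers added through SetupWithManager still require leadership by default.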
RBAC for Leader Election ¶
Your controller’s ServiceAccount needs permission to manage the lock:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: leader-election-role
namespace: default
rules:
- apiGroups: ["coordination.k8s.io"]
resources: ["leases"]
verbs: ["get", "create", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: leader-election-rolebinding
namespace: default
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: leader-election-role
subjects:
- kind: ServiceAccount
name: my-controller
namespace: default
If using ConfigMaps or Endpoints (legacy):
rules:
- apiGroups: [""]
resources: ["configmaps"] # or "endpoints"
verbs: ["get", "create", "update"]
Best Practices ¶
1. Use Leases ¶
Leases are purpose-built for leader election. ConfigMaps and Endpoints work but have drawbacks:
- ConfigMaps: Extra data in etcd
- Endpoints: Confusion with actual service endpoints
lock := &resourcelock.LeaseLock{...} // Preferred
2. Unique Lock Names ¶
Include your controller/operator name to avoid conflicts:
LeaderElectionID: "my-company.my-operator.example.com"
3. Exit on Leadership Loss ¶
Don’t try to be clever. When you lose leadership, exit:
OnStoppedLeading: func() {
klog.Info("Lost leadership, exiting")
os.Exit(0) // Let Kubernetes restart us
},
Trying to re-acquire in the same process is fragile.
4. Respect Context Cancellation ¶
All reconciliation should check context:
func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
if err := ctx.Err(); err != nil {
return ctrl.Result{}, err
}
// ...
}
5. Monitor Leadership ¶
Expose metrics about leadership:
var isLeader = prometheus.NewGauge(prometheus.GaugeOpts{
Name: "controller_is_leader",
Help: "1 if this instance is the leader, 0 otherwise",
})
OnStartedLeading: func(ctx context.Context) {
isLeader.Set(1)
runController(ctx)
},
OnStoppedLeading: func() {
isLeader.Set(0)
os.Exit(0)
},
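Note that prometheus.NewGauge does not register the metric by itself; without a registration step the gauge is never exported. A minimal sketch using the default registry (inside a controller-runtime manager you would register against metrics.Registry from sigs.k8s.io/controller-runtime/pkg/metrics instead):
func init() {
	// Register the gauge so it actually shows up on /metrics.
	prometheus.MustRegister(isLeader)
}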
Summary ¶
Leader election ensures exactly one controller replica is active:
| Component | Purpose |
|---|---|
| Lock object (Lease) | Stores current leader identity |
| LeaseDuration | How long lock is valid |
| RenewDeadline | Max time to renew before giving up |
| RetryPeriod | How often standbys check the lock |
Key timing:
- Graceful failover: Immediate (lock released)
- Ungraceful failover: LeaseDuration (default 15s)
Implementation:
- client-go: leaderelection.RunOrDie() with callbacks
- controller-runtime: LeaderElection: true in manager options
Critical rules:
- Leader must stop work when it can’t renew
- All work must respect context cancellation
- Exit on leadership loss—don’t try to recover
- Use Leases, not ConfigMaps/Endpoints
Leader election is what makes HA controllers possible. Without it, you get chaos. With it, you get automatic failover with minimal downtime.