You deploy your controller with 3 replicas for high availability. But if all 3 try to reconcile simultaneously, you get duplicate actions, race conditions, and chaos. The solution: leader election. Only one replica is active; the others wait on standby.
This post covers how leader election works in Kubernetes, how to implement it, and what happens when things go wrong.
The Problem: Multiple Active Controllers ¶
Without leader election, multiple controller replicas all watch the same resources:
Pod created
|
+-----> Controller-1: Creates Deployment
|
+-----> Controller-2: Creates Deployment (duplicate!)
|
+-----> Controller-3: Creates Deployment (duplicate!)
Results:
- Duplicate resources created
- Conflicting updates overwrite each other
- Resource counts are wrong
- State becomes inconsistent
You need exactly one active controller at a time.
Leader Election Overview ¶
Leader election ensures only one replica (the “leader”) is active:
+-------------+ +-------------+ +-------------+
| Controller-1| | Controller-2| | Controller-3|
| LEADER | | STANDBY | | STANDBY |
| (active) | | (waiting) | | (waiting) |
+------+------+ +------+------+ +------+------+
| | |
v v v
+--------------------------------------------------+
| Kubernetes API Server |
| |
| Lock Object (Lease/ConfigMap/Endpoint) |
| holder: controller-1 |
| renewTime: 2025-01-26T10:00:00Z |
+--------------------------------------------------+
The leader periodically renews its lock. If it stops (crash, network partition), the lock expires and another replica becomes leader.
How It Works: The Algorithm ¶
Kubernetes leader election uses a simple lease-based algorithm:
Acquiring Leadership ¶
1. Try to create/update lock object with my identity
2. If successful, I'm the leader
3. If lock exists and isn't expired, wait and retry
4. If lock exists but is expired, try to take over
Maintaining Leadership ¶
While I'm the leader:
1. Do controller work
2. Periodically renew the lock (update renewTime)
3. If renewal fails, stop doing work and re-enter election
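To make the decision rule concrete, here is a small, self-contained sketch of the acquire-or-renew check against an in-memory stand-in for the lock object. The real client-go implementation runs the same comparison against the API server and adds jitter, retries, and conflict handling:
package main

import (
	"fmt"
	"time"
)

// lease models only the lock fields the algorithm cares about.
type lease struct {
	holder    string
	renewTime time.Time
	duration  time.Duration
}

// tryAcquireOrRenew captures the core decision: take the lock if it is free or
// expired, renew it if we already hold it, otherwise stay on standby.
func (l *lease) tryAcquireOrRenew(id string, now time.Time) bool {
	expired := now.After(l.renewTime.Add(l.duration))
	if l.holder == "" || expired || l.holder == id {
		l.holder = id
		l.renewTime = now
		return true
	}
	return false
}

func main() {
	lock := &lease{duration: 15 * time.Second}
	for _, id := range []string{"controller-1", "controller-2"} {
		if lock.tryAcquireOrRenew(id, time.Now()) {
			fmt.Printf("%s is the leader\n", id)
		} else {
			fmt.Printf("%s is on standby\n", id)
		}
	}
}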
Lock Object ¶
The lock is a Kubernetes object—historically ConfigMaps or Endpoints, now preferably Leases:
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
name: my-controller
namespace: kube-system
spec:
holderIdentity: controller-1-abc123
leaseDurationSeconds: 15
acquireTime: "2025-01-26T10:00:00Z"
renewTime: "2025-01-26T10:00:10Z"
leaseTransitions: 5
Key fields:
- holderIdentity: Who holds the lock (usually pod name)
- leaseDurationSeconds: How long the lock is valid without renewal
- renewTime: Last time the holder renewed
- leaseTransitions: How many times leadership changed
Timing Parameters ¶
LeaseDuration: 15 * time.Second // Lock valid for this long
RenewDeadline: 10 * time.Second // Must renew within this time
RetryPeriod: 2 * time.Second // How often to retry acquiring
Timeline for failover:
0s - Leader renews lock
2s - Leader crashes
15s - Lock expires (LeaseDuration)
15s - Standby notices expired lock
17s - Standby acquires lock, becomes leader
Total failover time: ~15 seconds
Implementation with client-go ¶
Basic Leader Election ¶
package main
import (
"context"
"os"
"time"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/tools/leaderelection"
"k8s.io/client-go/tools/leaderelection/resourcelock"
"k8s.io/klog/v2"
)
func main() {
	// Build the client config; this example assumes the controller runs in-cluster
	config, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatal(err)
	}

	// Create clientset
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		klog.Fatal(err)
	}
// Get pod identity
id, err := os.Hostname()
if err != nil {
klog.Fatal(err)
}
// Create the lock
lock := &resourcelock.LeaseLock{
LeaseMeta: metav1.ObjectMeta{
Name: "my-controller",
Namespace: "default",
},
Client: clientset.CoordinationV1(),
LockConfig: resourcelock.ResourceLockConfig{
Identity: id,
},
}
// Start leader election
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
Lock: lock,
LeaseDuration: 15 * time.Second,
RenewDeadline: 10 * time.Second,
RetryPeriod: 2 * time.Second,
Callbacks: leaderelection.LeaderCallbacks{
OnStartedLeading: func(ctx context.Context) {
// This is called when we become the leader
klog.Info("Started leading")
runController(ctx)
},
OnStoppedLeading: func() {
// This is called when we stop being the leader
klog.Info("Stopped leading")
os.Exit(0) // Exit so Kubernetes restarts us
},
OnNewLeader: func(identity string) {
// This is called when leadership changes
if identity == id {
return // It's us
}
klog.Infof("New leader elected: %s", identity)
},
},
ReleaseOnCancel: true,
})
}
func runController(ctx context.Context) {
// Your controller logic here
// This runs only while we're the leader
<-ctx.Done()
}
Key Points ¶
OnStartedLeading: Called when you become leader. Start your controller work here. The context is cancelled when you lose leadership.
OnStoppedLeading: Called when you lose leadership. Usually you should exit so Kubernetes can restart you cleanly.
ReleaseOnCancel: If true, releases the lock when context is cancelled (graceful shutdown).
Implementation with controller-runtime ¶
controller-runtime (used by Kubebuilder/Operator SDK) has built-in leader election:
package main
import (
	"os"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
	"sigs.k8s.io/controller-runtime/pkg/manager/signals"
)

func main() {
	mgr, err := manager.New(ctrl.GetConfigOrDie(), manager.Options{
// Enable leader election
LeaderElection: true,
LeaderElectionID: "my-controller.example.com",
LeaderElectionNamespace: "default",
// Optional: customize timing
LeaseDuration: durationPtr(15 * time.Second),
RenewDeadline: durationPtr(10 * time.Second),
RetryPeriod: durationPtr(2 * time.Second),
})
if err != nil {
os.Exit(1)
}
// Add your controller to the manager
if err := (&MyReconciler{}).SetupWithManager(mgr); err != nil {
os.Exit(1)
}
// Start manager - handles leader election automatically
if err := mgr.Start(signals.SetupSignalHandler()); err != nil {
os.Exit(1)
}
}
func durationPtr(d time.Duration) *time.Duration {
return &d
}
That’s it! The manager handles:
- Acquiring leadership before starting controllers
- Renewing the lock periodically
- Stopping controllers when leadership is lost
- Graceful shutdown and lock release
Kubebuilder Projects ¶
In Kubebuilder-generated projects, enable in main.go:
var enableLeaderElection bool
func init() {
flag.BoolVar(&enableLeaderElection, "leader-elect", false,
"Enable leader election for controller manager.")
}
func main() {
// ...
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
LeaderElection: enableLeaderElection,
LeaderElectionID: "my-operator.example.com",
})
// ...
}
Then deploy with --leader-elect=true:
spec:
containers:
- name: controller
args:
- --leader-elect=true
Leader Election in the Wild ¶
kube-controller-manager ¶
The built-in controller manager uses leader election:
kubectl get lease -n kube-system kube-controller-manager -o yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
name: kube-controller-manager
namespace: kube-system
spec:
holderIdentity: master-1_abc123
leaseDurationSeconds: 15
renewTime: "2025-01-26T10:00:00.000000Z"
kube-scheduler ¶
Same pattern:
kubectl get lease -n kube-system kube-scheduler -o yaml
Checking Current Leader ¶
# For any controller using Lease
kubectl get lease -n <namespace> <lock-name> \
-o jsonpath='{.spec.holderIdentity}'
Handling Edge Cases ¶
Graceful Shutdown ¶
When the leader terminates gracefully (SIGTERM), it should release the lock:
// With ReleaseOnCancel: true
leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
// ...
ReleaseOnCancel: true, // Release lock on graceful shutdown
})
This allows immediate failover instead of waiting for lock expiry.
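For ReleaseOnCancel to do anything, the election context has to actually be cancelled on SIGTERM. One way to wire that up is signal.NotifyContext from the standard library; a minimal sketch, assuming the LeaderElectionConfig is built as in the client-go example above (the helper name is purely illustrative):
// Extra imports needed: "os/signal", "syscall".

// runWithGracefulRelease cancels the election context on SIGTERM/SIGINT so that
// ReleaseOnCancel can clear holderIdentity before the process exits.
func runWithGracefulRelease(cfg leaderelection.LeaderElectionConfig) {
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
	defer stop()

	cfg.ReleaseOnCancel = true // only takes effect on clean cancellation, not on a crash
	leaderelection.RunOrDie(ctx, cfg)
}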
Ungraceful Termination ¶
If the leader crashes (SIGKILL, node failure), it can’t release the lock. Other replicas must wait for expiry:
Leader crashes (no graceful release)
|
v
Other replicas see lock still held
|
v
Wait for LeaseDuration (15s default)
|
v
Lock expires
|
v
New leader acquires lock
Trade-off: Shorter LeaseDuration = faster failover but more API server load from frequent renewals.
Network Partition (Split Brain?) ¶
What if the leader can’t reach the API server but is still running?
Leader API Server Standby
| | |
|---renew (fails)----->| |
| | |
| (network issue) | |
| | |
| |<---acquire lock--|
| | |
| |---OK (new leader)|
| | |
v v v
Leader thinks Standby becomes
it's still leader? new leader!
The old leader MUST stop working when it can’t renew. This is why RenewDeadline exists:
RenewDeadline: 10 * time.Second // Must renew within 10s
If renewal fails for 10 seconds, the leader:
- Stops doing work (context cancelled)
- Calls OnStoppedLeading
- Usually exits
Critical: Your controller must respect the context. If it ignores cancellation, you get split-brain:
// GOOD: Respects context
func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
// Check context throughout long operations
select {
case <-ctx.Done():
return ctrl.Result{}, ctx.Err()
default:
}
	// Do work...
	return ctrl.Result{}, nil
}
// BAD: Ignores context
func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
// Long-running work that ignores ctx
time.Sleep(30 * time.Second) // Doesn't check ctx!
	doWork()
	return ctrl.Result{}, nil
}
Clock Skew ¶
Leader election relies on time. Significant clock skew between nodes can cause issues:
Node A clock: 10:00:00
Node B clock: 10:00:30 (30s ahead)
Node A holds lock, renewTime = 10:00:00
Node B sees lock, thinks it expired 15s ago!
Node B tries to take over...
Mitigation:
- Use NTP to keep clocks synchronized
- Kubernetes tolerates small skew (a few seconds)
- LeaseDuration should be much greater than the expected clock skew
Namespace Considerations ¶
The lock object must be in a namespace the controller can access:
LeaderElectionNamespace: "my-controller-system",
Common patterns:
- Same namespace as the controller deployment
- kube-system for cluster-wide controllers
- Dedicated namespace for all controllers’ locks
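Rather than hard-coding the namespace, a common convention is to look it up at startup. A small sketch; the POD_NAMESPACE variable is an assumption you would inject yourself via the downward API, and the fallback path is the standard in-cluster service account mount (controller-runtime applies a similar fallback when LeaderElectionNamespace is left empty):
// Extra imports needed: "os", "strings".

// lockNamespace picks the namespace for the election lock: first a POD_NAMESPACE
// env var (downward API), then the in-cluster service account namespace file,
// then a last-resort default.
func lockNamespace() string {
	if ns := os.Getenv("POD_NAMESPACE"); ns != "" {
		return ns
	}
	if data, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/namespace"); err == nil {
		return strings.TrimSpace(string(data))
	}
	return "default"
}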
Debugging Leader Election ¶
Who’s the Leader? ¶
# Get current leader
kubectl get lease my-controller -n default -o jsonpath='{.spec.holderIdentity}'
controller-1-abc123
# Get full lease details
kubectl get lease my-controller -n default -o yaml
Why Isn’t My Replica Becoming Leader? ¶
Check the lease:
kubectl describe lease my-controller -n default
Name: my-controller
Namespace: default
...
Spec:
Holder Identity: controller-1-abc123
Lease Duration Seconds: 15
Renew Time: 2025-01-26T10:00:00.000000Z
If renewTime is recent: Current leader is healthy. Your replica is correctly waiting.
If renewTime is stale: Lock should have expired. Check if your replica has permission to update the lease:
# Check RBAC
kubectl auth can-i update leases.coordination.k8s.io --as=system:serviceaccount:default:my-controller
Frequent Leadership Changes ¶
If leadership bounces between replicas:
kubectl get lease my-controller -o jsonpath='{.spec.leaseTransitions}'
High leaseTransitions indicates instability. Common causes:
- Network instability between controller and API server
- Controller crash-looping
- Resource starvation (CPU/memory) causing slow renewals
- API server overload causing timeout on renewals
Controller Not Stopping After Losing Leadership ¶
Check logs for:
"Stopped leading"
If this doesn’t appear, or controller continues working:
- OnStoppedLeading might not be calling os.Exit()
- Context isn’t being propagated/respected
- Long-running operations ignoring cancellation
High Availability Patterns ¶
Active-Passive (Leader Election) ¶
What we’ve discussed: one active, others standby.
Replicas: 3
Active: 1
Failover time: ~15 seconds
Active-Active (Sharding) ¶
For some controllers, you can shard work across replicas:
// Each replica handles different namespaces
func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
if !r.shouldHandle(req.Namespace) {
return ctrl.Result{}, nil // Let another replica handle it
}
// ...
}
func (r *Reconciler) shouldHandle(namespace string) bool {
	// assumes replicaIndex and totalReplicas are per-replica configuration,
	// e.g. derived from a StatefulSet ordinal
hash := fnv.New32()
hash.Write([]byte(namespace))
return hash.Sum32() % r.totalReplicas == r.replicaIndex
}
Pros: Better throughput, no failover delay
Cons: More complex, need to handle rebalancing
Hybrid ¶
Use leader election for cluster-scoped resources, sharding for namespaced:
// Cluster-scoped: requires leadership
// Namespaced: sharded across replicas
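With controller-runtime, one way to express this split is the manager's LeaderElectionRunnable interface: runnables whose NeedLeaderElection returns false start on every replica, while controllers registered the usual way keep waiting for leadership. A sketch, with shardedWorker as a hypothetical type:
// Extra import needed: "sigs.k8s.io/controller-runtime/pkg/manager".

// shardedWorker handles the namespaced, sharded work and therefore runs on
// every replica instead of waiting for the leader lock.
type shardedWorker struct{}

// Start runs until the manager shuts down; the per-shard work loop goes here.
func (w *shardedWorker) Start(ctx context.Context) error {
	<-ctx.Done()
	return nil
}

// NeedLeaderElection tells the manager to start this runnable immediately on
// every replica.
func (w *shardedWorker) NeedLeaderElection() bool { return false }

// Compile-time check: the manager treats this as a non-leader-elected runnable.
var _ manager.LeaderElectionRunnable = &shardedWorker{}
Register it with mgr.Add(&shardedWorker{}) before starting the manager; controllers added through SetupWithManager still require leadership by default.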
RBAC for Leader Election ¶
Your controller’s ServiceAccount needs permission to manage the lock:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: leader-election-role
namespace: default
rules:
- apiGroups: ["coordination.k8s.io"]
resources: ["leases"]
verbs: ["get", "create", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: leader-election-rolebinding
namespace: default
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: leader-election-role
subjects:
- kind: ServiceAccount
name: my-controller
namespace: default
If using ConfigMaps or Endpoints (legacy):
rules:
- apiGroups: [""]
resources: ["configmaps"] # or "endpoints"
verbs: ["get", "create", "update"]
Best Practices ¶
1. Use Leases ¶
Leases are purpose-built for leader election. ConfigMaps and Endpoints work but have drawbacks:
- ConfigMaps: Extra data in etcd
- Endpoints: Confusion with actual service endpoints
lock := &resourcelock.LeaseLock{...} // Preferred
2. Unique Lock Names ¶
Include your controller/operator name to avoid conflicts:
LeaderElectionID: "my-company.my-operator.example.com"
3. Exit on Leadership Loss ¶
Don’t try to be clever. When you lose leadership, exit:
OnStoppedLeading: func() {
klog.Info("Lost leadership, exiting")
os.Exit(0) // Let Kubernetes restart us
},
Trying to re-acquire in the same process is fragile.
4. Respect Context Cancellation ¶
All reconciliation should check context:
func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
if err := ctx.Err(); err != nil {
return ctrl.Result{}, err
}
// ...
}
5. Monitor Leadership ¶
Expose metrics about leadership:
var isLeader = prometheus.NewGauge(prometheus.GaugeOpts{
Name: "controller_is_leader",
Help: "1 if this instance is the leader, 0 otherwise",
})
OnStartedLeading: func(ctx context.Context) {
isLeader.Set(1)
runController(ctx)
},
OnStoppedLeading: func() {
isLeader.Set(0)
os.Exit(0)
},
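Note that prometheus.NewGauge does not register the metric by itself; without a registration step the gauge is never exported. A minimal sketch using the default registry (inside a controller-runtime manager you would register against metrics.Registry from sigs.k8s.io/controller-runtime/pkg/metrics instead):
func init() {
	// Register the gauge so it actually shows up on /metrics.
	prometheus.MustRegister(isLeader)
}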
Summary ¶
Leader election ensures exactly one controller replica is active:
| Component | Purpose |
|---|---|
| Lock object (Lease) | Stores current leader identity |
| LeaseDuration | How long lock is valid |
| RenewDeadline | Max time to renew before giving up |
| RetryPeriod | How often standbys check the lock |
Key timing:
- Graceful failover: Immediate (lock released)
- Ungraceful failover: LeaseDuration (default 15s)
Implementation:
- client-go: leaderelection.RunOrDie() with callbacks
- controller-runtime: LeaderElection: true in manager options
Critical rules:
- Leader must stop work when it can’t renew
- All work must respect context cancellation
- Exit on leadership loss—don’t try to recover
- Use Leases, not ConfigMaps/Endpoints
Leader election is what makes HA controllers possible. Without it, you get chaos. With it, you get automatic failover with minimal downtime.