Scaling Beyond 5,000 Nodes: Anatomy of Kubernetes Limits and Multi-Cluster Architecture


If you’ve operated Kubernetes at scale, you’ve encountered the infamous 5,000 node recommendation. But what actually breaks at that threshold? And when you adopt multi-cluster to scale horizontally, what are you really buying? This post dissects the technical constraints behind the limit and explores how the pull-based multi-cluster model addresses them.

The 5K figure isn’t arbitrary. It marks the largest scale at which Kubernetes is still expected to meet its Service Level Objectives:

  • 99th percentile API call latency < 1 second
  • 99th percentile pod startup latency < 5 seconds

Beyond this, the control plane struggles to keep these promises. But understanding why requires examining what’s actually under pressure.

The first bottleneck is etcd, the distributed key-value store that holds all cluster state. Every object in Kubernetes — Pods, Services, ConfigMaps, Secrets, CRDs — lives here.

Watch channel explosion: The API server maintains watches on etcd for every controller, scheduler, and kubelet. In a 5,000-node cluster with 110 pods per node (the default max), you have approximately 550,000 pod objects (far beyond the 150,000 pods per cluster that upstream scalability testing covers). Each state change generates watch events that must be serialized and fanned out.

The math gets ugly:

5,000 nodes × 110 pods = 550,000 pods
+ Services, Endpoints, ConfigMaps, Secrets, etc.
≈ 1-2 million objects in etcd

etcd’s Raft consensus requires every write to be replicated to a quorum of members. At high write rates, leader election latency increases, compaction falls behind, and eventually the database approaches its storage quota (2 GB by default; 8 GB is the recommended maximum).

What breaks: Watch latency spikes. Controllers see stale state. The scheduler makes decisions on outdated information. Pods get scheduled to nodes that are already full.
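
If you run your own control plane, it’s worth checking etcd’s database size and leader health directly. A minimal check, assuming kubeadm-style certificate paths (adjust endpoints and certs for your setup):

ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status --write-out=table
# Reports DB size, leader, Raft term, and Raft index per endpoint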

The second bottleneck is the API server. It is stateless, but it maintains in-memory watch caches for performance. Each watch consumes memory and CPU for serialization.

Watch fanout: When a pod changes state, that event must be sent to:

  • The scheduler (if pending)
  • The relevant controller (Deployment, ReplicaSet, etc.)
  • The kubelet on the node
  • Any monitoring systems watching pods
  • Service mesh sidecars
  • Network policy controllers

In a large cluster, a single pod update can trigger dozens of watch notifications. Multiply by churn rate (pods starting, stopping, failing) and you get amplification.
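
You can get a rough sense of watch pressure from the API server’s own metrics. A quick check, assuming you have RBAC permission to read the /metrics endpoint:

# Active watches registered on the API server, broken down by resource
kubectl get --raw /metrics | grep apiserver_registered_watchers

# In-flight requests, a leading indicator of API server saturation
kubectl get --raw /metrics | grep apiserver_current_inflight_requests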

Webhook latency: If you’re running admission webhooks (and you probably are — OPA, Pod Security, etc.), every pod creation traverses the webhook. At scale, webhook latency becomes a bottleneck. A webhook that adds 50ms is fine at 10 pods/sec but devastating at 500 pods/sec.
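
Two knobs keep a slow webhook from dragging down admission latency: a tight timeout and a fail-open policy. A sketch (the webhook name, service, and rules are hypothetical, and failing open is a deliberate security trade-off):

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-policy                    # hypothetical
webhooks:
  - name: validate.policy.example.com     # hypothetical
    clientConfig:
      service:
        name: policy-webhook
        namespace: policy-system
        path: /validate
    admissionReviewVersions: ["v1"]
    sideEffects: None
    timeoutSeconds: 2      # cap the latency a single admission call can add
    failurePolicy: Ignore  # fail open instead of blocking pod creation
    rules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["deployments"]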

What breaks: API request queuing. Clients see timeouts. kubectl commands hang. Deployments appear stuck.

The third bottleneck is the scheduler. Under optimal conditions, the default Kubernetes scheduler processes roughly 100 pods per second. This throughput depends on:

  • Filtering: Evaluating which nodes can run a pod (resource requests, taints, affinity)
  • Scoring: Ranking feasible nodes by preference
  • Preemption: Evicting lower-priority pods to make room

With 5,000 nodes, each scheduling decision evaluates thousands of candidates. Enable pod affinity/anti-affinity and the complexity explodes — the scheduler must examine co-located pods across all nodes.
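
One upstream mitigation is to score only a sample of the feasible nodes instead of all of them. The scheduler already lowers this percentage adaptively in large clusters, but you can set it explicitly in the scheduler configuration (a sketch; the value is illustrative and trades placement quality for throughput):

# Passed to kube-scheduler via --config
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
percentageOfNodesToScore: 30   # score roughly 30% of feasible nodes, then pick the best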

What breaks: Pending pod queues grow. Batch jobs that spawn 10,000 pods take hours to schedule. Autoscaling lags behind demand.

The fourth bottleneck is the controllers. Every controller in the system (Deployment, ReplicaSet, Job, DaemonSet, etc.) watches its relevant objects and reconciles state. With more objects, work queues grow deeper.

Garbage collection pressure: The GC controller tracks owner references across all objects. At scale, orphan detection becomes expensive.

What breaks: Controllers fall behind. You delete a Deployment and its pods linger. ReplicaSets don’t scale down properly.
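
If you operate the control plane yourself, the usual levers are the controller manager’s per-controller worker counts and its client-side rate limits. Illustrative flags only; the right values depend on how much load your API server can absorb:

# Raise controller client throughput and worker counts (defaults are conservative)
kube-controller-manager \
  --kube-api-qps=100 \
  --kube-api-burst=150 \
  --concurrent-deployment-syncs=15 \
  --concurrent-replicaset-syncs=15 \
  --concurrent-gc-syncs=30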

The bottlenecks above share a common theme: centralized state and control. One etcd, one API server, one scheduler — all trying to manage hundreds of thousands of objects.

Multi-cluster addresses this by horizontal partitioning. Instead of one 20,000-node cluster, you run four 5,000-node clusters. Each cluster has its own:

  • etcd (handling 1/4 the objects)
  • API server (1/4 the watches)
  • Scheduler (1/4 the pods)
  • Controllers (1/4 the work queues)

But this raises a new problem: how do you manage workloads across multiple clusters without creating a new centralized bottleneck?

KubeFleet (the CNCF project underlying Azure Kubernetes Fleet Manager) uses a hub-spoke architecture with a pull-based model. This design is critical for scalability.

The hub agent runs on a designated “hub” cluster — a lightweight cluster that serves as the control plane for your fleet. It doesn’t run workloads; it coordinates.

Responsibilities:

  • Watches ClusterResourcePlacement objects (your intent: “deploy this namespace to clusters matching these criteria”)
  • Evaluates placement policies against member cluster properties
  • Creates Work objects in per-member namespaces
  • Tracks rollout status across the fleet

The hub agent doesn’t push anything to member clusters. It simply writes Work objects to the hub cluster’s API server. This is crucial — the hub doesn’t need network connectivity to member clusters.
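
Here’s a sketch of what a placement intent looks like: a ClusterResourcePlacement that asks for a namespace to land on two clusters carrying a particular label (the environment label and the PickN count are illustrative):

apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourcePlacement
metadata:
  name: my-app
spec:
  resourceSelectors:
    - group: ""
      kind: Namespace
      name: my-app
      version: v1
  policy:
    placementType: PickN
    numberOfClusters: 2
    affinity:
      clusterAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          clusterSelectorTerms:
            - labelSelector:
                matchLabels:
                  environment: production   # hypothetical member cluster label

The hub agent evaluates this policy against the MemberCluster inventory and writes one Work object into the namespace of each selected cluster.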

Each member cluster runs a member agent that pulls its work from the hub.

Responsibilities:

  • Watches its dedicated namespace on the hub cluster (e.g., fleet-member-cluster-1)
  • Fetches Work objects containing Kubernetes manifests
  • Applies manifests to the local cluster
  • Reports status back to the hub (via Work status updates)

The pull model means:

Hub cluster ← Member agents connect outbound
           ← Member agents poll for Work objects  
           ← Member agents push status updates

No inbound connections to member clusters required.

A Work object is the unit of propagation. When you create a ClusterResourcePlacement, the hub agent translates your intent into concrete Work objects:

apiVersion: placement.kubernetes-fleet.io/v1
kind: Work
metadata:
  name: crp-my-app-0
  namespace: fleet-member-cluster-1
spec:
  workload:
    manifests:
      - apiVersion: v1
        kind: Namespace
        metadata:
          name: my-app
      - apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: web
          namespace: my-app
        spec:
          replicas: 3
          # ... full deployment spec

The member agent watches for these objects, extracts the manifests, and applies them locally using server-side apply.
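
Conceptually, what the agent does per manifest is close to a server-side apply with its own field manager (illustrative only; the agent uses client-go rather than kubectl, and the field manager name here is made up):

kubectl apply --server-side --field-manager=fleet-member-agent -f manifest.yaml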

In a push model, the hub would need to:

  1. Maintain connections to all member clusters
  2. Have credentials/kubeconfig for each member
  3. Handle retries, timeouts, and failures for each push
  4. Deal with network partitions gracefully

With 100 member clusters, that’s 100 connections to manage, 100 failure domains to handle, and significant blast radius if the hub has issues.

In the pull model:

  1. Hub just writes to its local API server
  2. Member agents are responsible for their own connectivity
  3. If a member is temporarily unreachable, it catches up when reconnected
  4. Hub failure means no new placements, but existing workloads keep running

Load distribution: Each member agent does its own reconciliation. The work of applying manifests, tracking status, and handling drift is distributed across the fleet, not centralized on the hub.

Understanding failure scenarios is essential for production deployments.

Scenario: The hub cluster goes down entirely (control plane outage or etcd data loss).

Impact: No new placements. No updates to existing ClusterResourcePlacement objects take effect. No new clusters can join.

What keeps working: Member clusters continue running their workloads. The member agent retries connecting to the hub. Existing Work objects (cached or already applied) remain in effect.

Recovery: Restore hub from etcd backup. Member agents reconnect automatically.
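
If you run the hub’s control plane yourself, that backup is a standard etcd snapshot (certificate flags omitted; paths are illustrative):

SNAPSHOT=/backups/hub-etcd-$(date +%F).db
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 snapshot save "$SNAPSHOT"

# Later: restore into a fresh data directory and point etcd at it
etcdutl snapshot restore "$SNAPSHOT" --data-dir /var/lib/etcd-restored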

Design implication: The hub should be a highly available, multi-zone deployment. But it doesn’t need to be large — it’s only managing fleet metadata, not running workloads.

Scenario: The member agent on a member cluster crashes or stops running.

Impact: That specific member cluster stops receiving updates from the hub. Drift from desired state won’t be corrected.

What keeps working: Workloads on that cluster continue running (Kubernetes controllers are local). Other member clusters are unaffected.

Detection: The MemberCluster object on the hub tracks last heartbeat time:

status:
  agentStatus:
    - type: MemberAgent
      lastReceivedHeartbeat: "2025-01-25T10:30:00Z"
      conditions:
        - type: Joined
          status: "True"
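
To spot stale members across the whole fleet at a glance (the column path assumes the status shape above):

kubectl get membercluster \
  -o custom-columns=NAME:.metadata.name,LAST-HEARTBEAT:.status.agentStatus[0].lastReceivedHeartbeat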

Recovery: Restart the member agent. It will re-sync by fetching current Work objects from the hub.

Scenario: Member cluster loses connectivity to hub but maintains local network.

Impact: Similar to hub failure from that member’s perspective. No new Work objects fetched. Status updates not reported.

Behavior: The member agent uses exponential backoff on reconnection attempts. Once connectivity restores, it fetches the latest Work objects and reconciles.

Risk: If someone modifies manifests on the hub during the partition, the member won’t see changes until reconnection. This isn’t split-brain (there’s only one source of truth — the hub), but it is temporal inconsistency.

What if someone manually edits a resource on a member cluster that’s managed by Fleet?

The member agent can detect drift between the Work spec and actual cluster state. You can configure behavior:

  • Apply mode: Overwrite local changes (eventual consistency with hub)
  • ReportDiff mode: Report the difference but don’t overwrite (useful for debugging)

The apply strategy is set on the ClusterResourcePlacement itself:

apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourcePlacement
metadata:
  name: my-app
spec:
  resourceSelectors:
    - group: ""
      kind: Namespace
      name: my-app
      version: v1
  strategy:
    applyStrategy:
      type: ClientSideApply  # or ServerSideApply
      allowCoOwnership: false

When sizing clusters, don’t default to “max out at 5K nodes per cluster.” Consider:

Blast radius: A cluster-wide outage (bad config push, control plane failure) affects all nodes. Smaller clusters = smaller blast radius.

Upgrade complexity: Upgrading a 5,000-node cluster is an all-day event. Upgrading ten 500-node clusters can be parallelized and staged.

Workload isolation: Different teams, environments, or compliance zones might warrant separate clusters regardless of size.

Practical guidance: Many organizations find 500-2,000 nodes per cluster to be a sweet spot — large enough to be efficient, small enough to be manageable.

The hub is lightweight. It runs:

  • The Kubernetes control plane
  • fleet-hub-agent
  • Standard system components

For most fleets (up to 100 member clusters), a 3-node hub cluster with modest sizing (4 vCPU, 16GB RAM per node) is sufficient. The hub’s etcd stores fleet metadata — MemberCluster, ClusterResourcePlacement, Work objects — not your application workloads.

Hub placement: The hub should be reachable from all member clusters. For on-prem deployments spanning multiple data centers, consider:

  • Hub in a central location with good connectivity to all sites
  • Hub behind a load balancer for HA
  • DNS-based failover if you run hub replicas

Latency budget: Member agents poll the hub for Work objects. Higher latency means slower propagation of changes, but the system tolerates it gracefully. Sub-second latency isn’t required; sub-minute is fine.

Egress-only: Member clusters only need outbound HTTPS to the hub. No firewall rules for inbound traffic to your on-prem clusters.

To try this on your own clusters, you’ll need:

  • 3+ Kubernetes clusters (1 hub, 2+ members)
  • kubectl access to all clusters
  • Container registry accessible from all clusters
  • Helm 3.x

Install the hub agent on the hub cluster first:

# Clone KubeFleet
git clone https://github.com/kubefleet-dev/kubefleet.git
cd kubefleet

# Set your registry
export REGISTRY="your-registry.example.com"
export TAG="v0.10.0"  # Use latest stable

# Build and push hub agent image
make docker-build-hub-agent
make docker-push-hub-agent

# Switch to hub cluster context
kubectl config use-context hub-cluster

# Install via Helm
helm install hub-agent ./charts/hub-agent \
  --set image.repository=${REGISTRY}/hub-agent \
  --set image.tag=${TAG}

Verify the hub agent is running:

kubectl get pods -n fleet-system
# NAME                         READY   STATUS    RESTARTS   AGE
# hub-agent-xxxxxxxxx-xxxxx    1/1     Running   0          30s

KubeFleet provides a script to automate member joining:

# Set member cluster details
export MEMBER_CLUSTER="dc1-cluster"
export MEMBER_CLUSTER_CONTEXT="dc1-cluster-admin"

# Build member agent image
make docker-build-member-agent
make docker-push-member-agent

# Run the join script
./hack/membership/joinMC.sh ${TAG} hub-cluster ${MEMBER_CLUSTER}

Verify the member joined:

kubectl config use-context hub-cluster
kubectl get membercluster

# NAME          JOINED   AGE   MEMBER-AGENT-LAST-SEEN   NODE-COUNT
# dc1-cluster   True     60s   10s                      150

Watch the member agent sync:

# On the hub cluster, watch Work objects for a member
kubectl get work -n fleet-member-dc1-cluster -w

Simulate a placement:

# Create a namespace on the hub
kubectl create namespace demo-app

# Create a ClusterResourcePlacement
cat <<EOF | kubectl apply -f -
apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourcePlacement
metadata:
  name: demo-app
spec:
  resourceSelectors:
    - group: ""
      kind: Namespace
      name: demo-app
      version: v1
  policy:
    placementType: PickAll
EOF

Watch propagation:

# See the Work object created
kubectl get work -n fleet-member-dc1-cluster

# Check status
kubectl describe clusterresourceplacement demo-app

On the member cluster, verify the namespace appeared:

kubectl config use-context dc1-cluster
kubectl get namespace demo-app

If a placement isn’t working, check the ClusterResourcePlacement status:

kubectl describe crp demo-app

Look for conditions:

  • Scheduled: Did the scheduler find matching clusters?
  • Applied: Did the member agent successfully apply the manifests?
  • Available: Are the workloads actually running?
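
For a compact view of the same conditions, a jsonpath query works too (assuming the standard status.conditions layout):

kubectl get clusterresourceplacement demo-app \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'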

Check Work object status on the hub:

kubectl get work -n fleet-member-dc1-cluster -o yaml

Check member agent logs:

kubectl config use-context dc1-cluster
kubectl logs -n fleet-system -l app=member-agent --tail=100

Kill the member agent:

kubectl config use-context dc1-cluster
kubectl delete pod -n fleet-system -l app=member-agent

Watch it restart and re-sync. Workloads on the cluster are unaffected.

Create drift:

# Manually edit a Fleet-managed resource
kubectl config use-context dc1-cluster
kubectl annotate namespace demo-app manual-edit="true"

# Watch the member agent reconcile it back (if drift detection enabled)
kubectl get namespace demo-app -o yaml

The 5,000-node limit isn’t a hard wall — it’s the boundary where Kubernetes SLOs start degrading due to centralized control plane bottlenecks. Multi-cluster architecture addresses this through horizontal partitioning: instead of scaling one control plane vertically, you scale the number of control planes horizontally.

KubeFleet’s pull-based model takes this further by distributing the reconciliation load to member clusters themselves. The hub becomes a lightweight coordination point rather than a bottleneck. Member agents pull their work, apply it locally, and report status back — no push infrastructure, no inbound network requirements, and graceful degradation under failure.

For on-premises deployments where you control the infrastructure but need to scale beyond what a single cluster can handle, this architecture offers a path forward without sacrificing operational simplicity.

Further reading: