Bottlenecks in Large-Scale Multi-Cluster Kubernetes


You went multi-cluster to escape the limits of a single control plane. Now you have 50 clusters, and new bottlenecks have emerged — at the coordination layer, at shared infrastructure, at the tools that manage the fleet. The problems are different, but the pain is familiar.

This post surveys the common bottlenecks in large-scale multi-cluster Kubernetes environments and how to identify them.

In a single cluster, bottlenecks are usually:

  • etcd (write throughput, watch fan-out)
  • API server (request rate, webhook latency)
  • Scheduler (pod throughput)
  • Controllers (reconciliation speed)

Multi-cluster solves these by partitioning. But it introduces new chokepoints:

  • Coordination layer: The hub cluster, fleet management APIs
  • Shared infrastructure: Image registries, secret stores, Git repos
  • Cross-cutting tools: ArgoCD, observability stack, policy engines

The pattern: what worked at 5 clusters breaks at 50. A tool that seemed lightweight becomes a bottleneck when multiplied across the fleet.

If you’re using KubeFleet, Azure Kubernetes Fleet Manager, or similar hub-spoke architectures, the hub cluster coordinates the fleet. It’s lightweight by design, but not infinitely so.

The hub maintains state for every member cluster:

  • MemberCluster objects (one per cluster)
  • ClusterResourcePlacement objects (your placement intents)
  • Work objects (propagated resources, potentially thousands)

Member agents watch the hub for changes. With 50 clusters, that’s 50 agents holding open watches. With 100 placements each generating 10 Work objects per target cluster, 50 clusters puts 50,000 Work objects on the hub.
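
A quick sanity check for that regime is simply counting Work objects on the hub, assuming your fleet implementation exposes the work resource the way the diagnosis snippets below do:

# Total Work objects across all member namespaces on the hub
kubectl get work -A --no-headers | wc -l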

Symptoms:

  • Hub API server latency increases
  • Member agents report slow sync
  • apiserver_request_duration_seconds shows elevated P99

Diagnosis:

# On hub cluster
kubectl top pods -n fleet-system
kubectl get --raw /metrics | grep apiserver_request_duration

# Watch count
kubectl get --raw /metrics | grep apiserver_registered_watchers

Mitigation:

  • Right-size hub cluster (it’s often under-provisioned)
  • Reduce Work object churn (batch changes, avoid frequent updates)
  • Consider multiple hubs for very large fleets (federation of federations)

The hub’s etcd stores all fleet coordination state. More clusters and placements = more objects = more etcd pressure.

Watch for:

  • etcd latency (etcd_request_duration_seconds)
  • Database size (etcd_debugging_mvcc_db_total_size_in_bytes)
  • Compaction falling behind
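
The database-size check is easy to express against the metrics above; this sketch assumes etcd’s standard quota gauge (etcd_server_quota_backend_bytes) is scraped from the hub:

# Hub etcd database size as a fraction of the backend quota
etcd_debugging_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes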

Mitigation:

  • Dedicated etcd nodes with SSDs
  • Increase etcd quota if hitting limits
  • Clean up stale Work objects and completed placements

Each member cluster runs an agent that pulls state from the hub. At scale, these agents become a factor.

Agents sync state from the hub to pick up Work objects. More frequent syncs = fresher state but more hub load. Less frequent = staler state but a lighter hub.

Trade-off:

50 clusters × 1 sync/second = 50 requests/second to hub
50 clusters × 1 sync/10 seconds = 5 requests/second to hub

Most agents use watches (efficient), but reconnections and resyncs generate load.

A Work object contains the full manifest of propagated resources. Propagating a large ConfigMap or a Deployment with lengthy specs means large Work objects.

Symptoms:

  • Slow sync times
  • High memory usage in member agents
  • Rising network transfer between hub and members

Diagnosis:

# Size of Work objects
kubectl get work -n fleet-member-cluster-1 -o json | wc -c
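
To find the worst offenders rather than the total, a per-object breakdown helps (assumes jq is installed; namespace name as above):

# Largest Work objects in a member namespace, by serialized size
kubectl get work -n fleet-member-cluster-1 -o json \
  | jq -r '.items[] | "\(.metadata.name) \(tostring | length)"' \
  | sort -k2 -n | tail -5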

Mitigation:

  • Avoid propagating large ConfigMaps (use external config stores)
  • Propagate references instead of data where possible
  • Compress or chunk large manifests

Member agents report status back to the hub. With many resources across many clusters, status updates can overwhelm the hub.

Symptoms:

  • Hub API server write latency spikes
  • etcd write throughput saturated
  • Agents backing off on status updates

Mitigation:

  • Batch status updates
  • Report status less frequently for stable resources
  • Use conditions efficiently (don’t update if unchanged)

ArgoCD is often the tool managing deployments across multi-cluster fleets. A single ArgoCD instance managing 50+ clusters hits limits.

The application controller reconciles Applications — comparing desired state (Git) with actual state (clusters). Each Application means:

  • Watching the target cluster
  • Generating manifests (calling repo server)
  • Computing diff
  • Optionally syncing

At scale:

500 Applications × 3-minute sync interval = ~3 reconciliations/second

This seems manageable until you account for:

  • Manifest generation time (Helm templates, Kustomize)
  • Target cluster API latency
  • Diff computation for large Applications

Symptoms:

  • Applications stuck in “Progressing”
  • Long sync times
  • Controller CPU pegged

Diagnosis:

# Controller metrics
kubectl port-forward svc/argocd-metrics 8082:8082 -n argocd
curl localhost:8082/metrics | grep argocd_app_reconcile

# Queue depth
curl localhost:8082/metrics | grep workqueue_depth

Mitigation:

  • Increase controller replicas (with sharding)
  • Reduce sync frequency for stable Applications
  • Use Server-Side Apply (faster diffs)
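
For the sync-frequency point above, the refresh interval is controlled by timeout.reconciliation in the argocd-cm ConfigMap (raising it also stretches how often Git is polled). A minimal sketch, assuming the default install layout:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  # Default is 180s; raising it trades freshness for controller and Git load
  timeout.reconciliation: 300s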

The repo server generates manifests from Git repos. It’s CPU and memory intensive:

  • Cloning repos
  • Running Helm template
  • Running Kustomize build
  • Caching results

Symptoms:

  • Slow manifest generation
  • Repo server OOMKilled
  • Applications show “ComparisonError”

Diagnosis:

kubectl top pods -n argocd -l app.kubernetes.io/component=repo-server
kubectl logs -n argocd -l app.kubernetes.io/component=repo-server | grep -i error

Mitigation:

  • Scale repo server horizontally
  • Increase memory limits (Helm/Kustomize can be memory-hungry)
  • Use repo server parallelism settings
  • Cache Helm dependencies (avoid re-downloading)
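
Scaling out is usually the quickest win; with the default install the repo server is a plain Deployment:

# Run more repo-server replicas (deployment name from the standard manifests)
kubectl -n argocd scale deployment argocd-repo-server --replicas=3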

ArgoCD uses Redis for caching. With many Applications and clusters, Redis becomes a factor.

Symptoms:

  • High Redis memory usage
  • Slow cache operations
  • Evictions causing cache misses (re-generating manifests)

Mitigation:

  • Increase Redis memory
  • Run Redis in HA mode (the ArgoCD HA manifests ship a Sentinel-based Redis setup)
  • Tune cache TTLs

ArgoCD polls Git repos for changes. With many Applications:

500 Applications polling every 3 minutes = 167 Git fetches/minute

If using GitHub, you’ll hit rate limits. If using webhooks, you’ll generate storms on every push.

Symptoms:

  • “rate limit exceeded” errors
  • Applications not detecting changes
  • Webhook timeouts

Mitigation:

  • Use webhooks instead of polling (more efficient)
  • Consolidate repos (fewer repos = fewer fetches)
  • Increase polling interval
  • Use GitHub App authentication (higher rate limits)

A single ArgoCD can’t manage thousands of Applications efficiently. Sharding options:

Option 1: Shard by cluster

  • Multiple ArgoCD controller replicas
  • Each controller handles a subset of clusters
  • Set ARGOCD_CONTROLLER_REPLICAS to the shard count; optionally pin clusters to shards via the shard field on their cluster Secrets
# argocd-application-controller StatefulSet
env:
  - name: ARGOCD_CONTROLLER_REPLICAS
    value: "3"

Option 2: Multiple ArgoCD instances

  • Dedicated ArgoCD per environment (prod, staging)
  • Or per team / business unit
  • More operational overhead but better isolation

Option 3: ApplicationSets with progressive sync

  • Generate Applications dynamically
  • Use rolling sync strategies
  • Limit concurrent syncs
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: my-app
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            environment: production
  strategy:
    type: RollingSync
    rollingSync:
      steps:
        - matchExpressions:
            - key: region
              operator: In
              values: [us-east-1]
        - matchExpressions:
            - key: region
              operator: In
              values: [us-west-2]
  template:
    # ...

Every cluster pulls container images. At scale, the registry becomes critical infrastructure.

You push a new image and update 50 clusters. All 50 start pulling simultaneously:

50 clusters × 10 nodes × 500MB image = 250GB of transfer, all at once

Symptoms:

  • Registry timeouts
  • Image pull failures (“429 Too Many Requests”)
  • Slow deployments

Mitigation:

Registry caching / pull-through cache:

# Deploy a registry mirror in each cluster or region
# Configure containerd/docker to use local mirror
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
  endpoint = ["https://registry-cache.internal:5000"]

Geo-distributed registries:

  • Replicate images to regional registries
  • Route clusters to nearest registry

Pre-pulling:

  • DaemonSet that pulls images before deployment
  • Reduces thundering herd by spreading pull over time
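
A minimal pre-pull DaemonSet sketch; the image name, registry host, and pause image are placeholders, and it assumes the target image ships a shell for the no-op command:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: prepull-myapp
spec:
  selector:
    matchLabels:
      app: prepull-myapp
  template:
    metadata:
      labels:
        app: prepull-myapp
    spec:
      initContainers:
        - name: pull
          # The image you want warmed on every node before the real rollout
          image: registry.internal/myapp:v2.3.0
          command: ["sh", "-c", "exit 0"]   # exit immediately; the pull is the point
      containers:
        - name: pause
          # Minimal container that keeps the pod Running so the DaemonSet stays healthy
          image: registry.k8s.io/pause:3.9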

Staggered rollouts:

  • Don’t update all 50 clusters simultaneously
  • Roll out region by region, cluster by cluster

If your single registry goes down, no cluster can pull new images.

Mitigation:

  • HA registry deployment
  • Multiple registry replicas across zones
  • Fallback registries in image pull specs (limited support)

Monitoring 50 clusters generates massive telemetry. The observability stack itself becomes a scaling challenge.

Classic pattern: Prometheus per cluster, federate to central Prometheus.

Problems at scale:

  • Federation scrapes are expensive (pulls all series)
  • Central Prometheus cardinality explodes
  • Query latency increases

Symptoms:

  • Federation scrapes timing out
  • Central Prometheus OOM
  • Slow dashboards

Mitigation:

  • Thanos/Cortex/Mimir: Scalable backends that accept remote-write
  • Remote write: Push metrics instead of federation pull
  • Recording rules: Aggregate at edge, send summaries
  • Reduce cardinality: Drop high-cardinality labels before sending
# Per-cluster Prometheus: remote write to central
remote_write:
  - url: https://thanos-receive.monitoring:19291/api/v1/receive
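
The cardinality point above can be enforced on the same remote_write entry via write_relabel_configs; dropping an entire metric family is the bluntest version, and the name in the regex is purely illustrative:

remote_write:
  - url: https://thanos-receive.monitoring:19291/api/v1/receive
    write_relabel_configs:
      # Drop an example high-cardinality metric family before it leaves the cluster
      - source_labels: [__name__]
        regex: "myapp_request_duration_seconds_bucket"
        action: drop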

Logs are the other firehose: 50 clusters × 1000 pods × 10 log lines/second = 500,000 lines/second.

Symptoms:

  • Log ingestion lag
  • Dropped logs
  • Query timeouts

Mitigation:

  • Sampling: Don’t ship all logs (sample debug, keep errors)
  • Edge aggregation: Aggregate common patterns locally
  • Tiered storage: Hot/cold storage for logs
  • Per-cluster Loki: Query individual clusters, federate on demand

Heavy monitoring can stress clusters:

  • Prometheus scraping thousands of targets
  • Logging agents consuming CPU/memory
  • Tracing overhead on every request

Watch for:

  • Monitoring pods consuming significant cluster resources
  • Scrape intervals too aggressive
  • Overly verbose logging levels

Mitigation:

  • Scrape less frequently for stable metrics
  • Use service discovery efficiently (don’t scrape what you don’t need)
  • Set appropriate resource limits on monitoring components
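
For the scrape-frequency point, a per-job override in Prometheus is enough; the job name and target below are illustrative:

scrape_configs:
  - job_name: kube-state-metrics
    # Slow-moving targets rarely need the 15-30s global default
    scrape_interval: 60s
    static_configs:
      - targets: ["kube-state-metrics.monitoring.svc:8080"]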

Multi-cluster secrets management adds latency and complexity.

Vault is often the central secrets store. Every cluster fetches secrets:

50 clusters × 100 secrets × refresh every 5 minutes = 1000 requests/minute

Symptoms:

  • Vault latency increases
  • Secret sync delays
  • Pod startup blocked waiting for secrets

Mitigation:

  • Vault replication (regional Vault clusters)
  • Caching (external-secrets-operator caches locally)
  • Longer TTLs for stable secrets
  • Batch secret fetches

The external-secrets operator runs in each cluster and syncs secrets from external stores.

At scale:

  • Each cluster runs reconciliation loops
  • All hitting the same Vault/AWS Secrets Manager/etc.

Mitigation:

  • Tune sync intervals (not everything needs 30-second refresh)
  • Use refresh strategies (only refresh on pod restart)
  • Batch requests where possible
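
For the sync-interval point, external-secrets exposes refreshInterval per ExternalSecret; the store name, Vault path, and keys below are placeholders:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
spec:
  # Stable secrets rarely need aggressive refresh
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend          # assumed ClusterSecretStore name
    kind: ClusterSecretStore
  target:
    name: db-credentials
  data:
    - secretKey: password
      remoteRef:
        key: secret/data/db      # assumed Vault KV path
        property: password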

When things slow down, where do you look first?

  1. Start at the symptom

    • Slow deployments? → ArgoCD, registry
    • Stale state in member clusters? → Hub, agent sync
    • Metrics gaps? → Observability stack
  2. Check the coordination layer

    • Hub cluster health (API server, etcd)
    • Member agent logs
    • Work object backlogs
  3. Check shared infrastructure

    • Registry response times
    • Git repo rate limits
    • Vault/secrets latency
  4. Check cross-cutting tools

    • ArgoCD queue depths and reconciliation times
    • Prometheus scrape durations
    • Logging ingestion lag

Key metrics to watch, by layer.

Hub cluster:

# API server latency
histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le, verb))

# etcd latency
histogram_quantile(0.99, sum(rate(etcd_request_duration_seconds_bucket[5m])) by (le, operation))

# Watch count
sum(apiserver_registered_watchers) by (group, kind)

ArgoCD:

# Reconciliation duration
histogram_quantile(0.99, sum(rate(argocd_app_reconcile_bucket[5m])) by (le))

# Queue depth
workqueue_depth{name="app_operation_processing_queue"}

# Sync failures
sum(increase(argocd_app_sync_total{phase="Failed"}[1h])) by (dest_server)

Registry (if you expose metrics):

# Pull latency
histogram_quantile(0.99, sum(rate(registry_http_request_duration_seconds_bucket[5m])) by (le))

# Request rate
sum(rate(registry_http_requests_total[5m])) by (method)

Per-cluster Prometheus (federate or remote-write these):

# API server health across fleet
sum(apiserver_request_total) by (cluster, code)

# Pod startup latency across fleet
histogram_quantile(0.99, sum(rate(kubelet_pod_start_duration_seconds_bucket[5m])) by (le, cluster))

If metrics don’t reveal the bottleneck:

  1. Add tracing: Instrument the slow path (ArgoCD reconciliation, agent sync)
  2. Profile: pprof on Go components (ArgoCD, fleet agents)
  3. Simplify: Reduce fleet size temporarily to isolate
  4. Bisect: Disable half the clusters/Applications, see if problem persists

Multi-cluster Kubernetes trades single-cluster bottlenecks for coordination-layer bottlenecks. At scale, watch for:

Layer             Bottlenecks
Hub cluster       API server load, etcd size, watch fan-out
Member agents     Sync frequency, Work object size, status storms
ArgoCD            Application controller, repo server, Git rate limits
Image registry    Pull thundering herd, single registry SPOF
Observability     Federation limits, logging ingestion, cardinality
Secrets           Vault load, sync latency

The tools that manage your fleet can become the bottleneck your fleet was supposed to escape. Monitor the coordinators, not just the clusters.