You went multi-cluster to escape the limits of a single control plane. Now you have 50 clusters, and new bottlenecks have emerged — at the coordination layer, at shared infrastructure, at the tools that manage the fleet. The problems are different, but the pain is familiar.
This post surveys the common bottlenecks in large-scale multi-cluster Kubernetes environments and how to identify them.
The Shape of Multi-Cluster Bottlenecks ¶
In a single cluster, bottlenecks are usually:
- etcd (write throughput, watch fan-out)
- API server (request rate, webhook latency)
- Scheduler (pod throughput)
- Controllers (reconciliation speed)
Multi-cluster solves these by partitioning. But it introduces new chokepoints:
- Coordination layer: The hub cluster, fleet management APIs
- Shared infrastructure: Image registries, secret stores, Git repos
- Cross-cutting tools: ArgoCD, observability stack, policy engines
The pattern: what worked at 5 clusters breaks at 50. A tool that seemed lightweight becomes a bottleneck when multiplied across the fleet.
Hub Cluster Pressure ¶
If you’re using KubeFleet, Azure Fleet Manager, or similar hub-spoke architectures, the hub cluster coordinates the fleet. It’s lightweight by design — but not infinitely so.
Watch Fan-Out ¶
The hub maintains state for every member cluster:
- MemberCluster objects (one per cluster)
- ClusterResourcePlacement objects (your placement intents)
- Work objects (propagated resources, potentially thousands)
Member agents watch the hub for changes. With 50 clusters, that’s 50 agents maintaining watches. With 100 placements generating 10 Work objects each across 50 clusters, you have 50,000 Work objects.
Symptoms:
- Hub API server latency increases
- Member agents report slow sync
- apiserver_request_duration_seconds shows elevated P99
Diagnosis:
# On hub cluster
kubectl top pods -n fleet-system
kubectl get --raw /metrics | grep apiserver_request_duration
# Watch count
kubectl get --raw /metrics | grep apiserver_registered_watchers
Mitigation:
- Right-size hub cluster (it’s often under-provisioned)
- Reduce Work object churn (batch changes, avoid frequent updates)
- Consider multiple hubs for very large fleets (federation of federations)
Hub etcd Sizing ¶
The hub’s etcd stores all fleet coordination state. More clusters and placements = more objects = more etcd pressure.
Watch for:
- etcd latency (etcd_request_duration_seconds)
- Database size (etcd_debugging_mvcc_db_total_size_in_bytes)
- Compaction falling behind
Mitigation:
- Dedicated etcd nodes with SSDs
- Increase etcd quota if hitting limits
- Clean up stale Work objects and completed placements
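A quick way to check the database size and quota signals above, assuming you can run etcdctl against the hub's etcd (the certificate paths shown are kubeadm defaults; adjust to your setup):
# Run where etcdctl can reach the hub's etcd
etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status -w table   # shows DB size per member
# "etcdctl alarm list" with the same flags reports NOSPACE if the quota has been hit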
Member Agent Sync ¶
Each member cluster runs an agent that pulls state from the hub. At scale, the agents themselves become a meaningful source of hub load.
Pull Frequency vs Freshness ¶
Agents poll the hub for Work objects. More frequent polling = fresher state but more hub load. Less frequent = stale state but lighter load.
Trade-off:
50 clusters × 1 sync/second = 50 requests/second to hub
50 clusters × 1 sync/10 seconds = 5 requests/second to hub
Most agents use watches (efficient), but reconnections and resyncs generate load.
Work Object Size ¶
A Work object contains the full manifest of propagated resources. Propagating a large ConfigMap or a Deployment with lengthy specs means large Work objects.
Symptoms:
- Slow sync times
- High memory usage in member agents
- Rising network transfer between hub and members
Diagnosis:
# Size of Work objects
kubectl get work -n fleet-member-cluster-1 -o json | wc -c
Mitigation:
- Avoid propagating large ConfigMaps (use external config stores)
- Propagate references instead of data where possible
- Compress or chunk large manifests
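To see which Work objects are the heaviest, a rough sketch using jq (the namespace follows the earlier example; the fleet namespace naming may differ in your setup):
# List Work objects by serialized size, largest first
kubectl get work -n fleet-member-cluster-1 -o json \
  | jq -r '.items[] | "\(.metadata.name) \(tostring | length)"' \
  | sort -k2 -nr | head -20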
Status Reporting Storms ¶
Member agents report status back to the hub. With many resources across many clusters, status updates can overwhelm the hub.
Symptoms:
- Hub API server write latency spikes
- etcd write throughput saturated
- Agents backing off on status updates
Mitigation:
- Batch status updates
- Report status less frequently for stable resources
- Use conditions efficiently (don’t update if unchanged)
ArgoCD at Scale ¶
ArgoCD is often the tool managing deployments across multi-cluster fleets. A single ArgoCD instance managing 50+ clusters hits limits.
Application Controller ¶
The application controller reconciles Applications — comparing desired state (Git) with actual state (clusters). Each Application means:
- Watching the target cluster
- Generating manifests (calling repo server)
- Computing diff
- Optionally syncing
At scale:
500 Applications × 3-minute sync interval = ~3 reconciliations/second
This seems manageable until you account for:
- Manifest generation time (Helm templates, Kustomize)
- Target cluster API latency
- Diff computation for large Applications
Symptoms:
- Applications stuck in “Progressing”
- Long sync times
- Controller CPU pegged
Diagnosis:
# Controller metrics
kubectl port-forward svc/argocd-metrics 8082:8082 -n argocd
curl localhost:8082/metrics | grep argocd_app_reconcile
# Queue depth
curl localhost:8082/metrics | grep workqueue_depth
Mitigation:
- Increase controller replicas (with sharding)
- Reduce sync frequency for stable Applications
- Use Server-Side Apply (faster diffs)
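For the sync-frequency lever, the default reconciliation interval lives in the argocd-cm ConfigMap; a minimal sketch raising it from the 3-minute default (the value is illustrative):
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  # How often Applications are re-reconciled against Git (default 180s)
  timeout.reconciliation: 600s
The application controller (and repo server) typically need a restart to pick this change up.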
Repo Server Bottleneck ¶
The repo server generates manifests from Git repos. It’s CPU and memory intensive:
- Cloning repos
- Running Helm template
- Running Kustomize build
- Caching results
Symptoms:
- Slow manifest generation
- Repo server OOMKilled
- Applications show “ComparisonError”
Diagnosis:
kubectl top pods -n argocd -l app.kubernetes.io/component=repo-server
kubectl logs -n argocd -l app.kubernetes.io/component=repo-server | grep -i error
Mitigation:
- Scale repo server horizontally
- Increase memory limits (Helm/Kustomize can be memory-hungry)
- Use repo server parallelism settings
- Cache Helm dependencies (avoid re-downloading)
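One hedged way to express the parallelism setting above, assuming you manage ArgoCD via the argocd-cmd-params-cm ConfigMap (verify the key name against your ArgoCD version):
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  # Cap concurrent manifest-generation requests per repo server replica
  reposerver.parallelism.limit: "10"
Components typically need a restart to pick up argocd-cmd-params-cm changes.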
Redis Pressure ¶
ArgoCD uses Redis to cache generated manifests and application state. With many Applications and clusters, the cache working set grows and Redis becomes a pressure point.
Symptoms:
- High Redis memory usage
- Slow cache operations
- Evictions causing cache misses (re-generating manifests)
Mitigation:
- Increase Redis memory
- Run Redis in HA mode (the ArgoCD HA manifests ship a Sentinel-based replicated Redis)
- Tune cache TTLs
Git Rate Limits ¶
ArgoCD polls Git repos for changes. With many Applications:
500 Applications polling every 3 minutes = 167 Git fetches/minute
Poll GitHub at that rate and you'll hit API rate limits; rely on webhooks and a busy repo can trigger refresh storms on every push.
Symptoms:
- “rate limit exceeded” errors
- Applications not detecting changes
- Webhook timeouts
Mitigation:
- Use webhooks instead of polling (more efficient)
- Consolidate repos (fewer repos = fewer fetches)
- Increase polling interval
- Use GitHub App authentication (higher rate limits)
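If you switch to webhooks, ArgoCD receives them on the argocd-server endpoint /api/webhook and the shared secret lives in argocd-secret. A hedged sketch for GitHub (the secret value is a placeholder):
apiVersion: v1
kind: Secret
metadata:
  name: argocd-secret
  namespace: argocd
stringData:
  # Must match the secret configured on the GitHub webhook
  webhook.github.secret: replace-with-a-random-string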
Sharding Strategies ¶
A single ArgoCD can’t manage thousands of Applications efficiently. Sharding options:
Option 1: Shard by cluster
- Multiple ArgoCD controller replicas
- Each controller handles a subset of clusters
- Use --application-namespaces and cluster labels
# Controller deployment
env:
  - name: ARGOCD_CONTROLLER_REPLICAS
    value: "3"
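For this setting to do anything, the controller workload has to run that many replicas; ArgoCD's HA guidance has ARGOCD_CONTROLLER_REPLICAS match the controller's replica count (recent releases run it as a StatefulSet). A minimal way to line them up:
kubectl scale statefulset argocd-application-controller -n argocd --replicas=3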
Option 2: Multiple ArgoCD instances
- Dedicated ArgoCD per environment (prod, staging)
- Or per team / business unit
- More operational overhead but better isolation
Option 3: ApplicationSets with progressive sync
- Generate Applications dynamically
- Use rolling sync strategies
- Limit concurrent syncs
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: my-app
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            environment: production
  strategy:
    type: RollingSync
    rollingSync:
      steps:
        - matchExpressions:
            - key: region
              operator: In
              values: [us-east-1]
        - matchExpressions:
            - key: region
              operator: In
              values: [us-west-2]
  template:
    # ...
Image Registry ¶
Every cluster pulls container images. At scale, the registry becomes critical infrastructure.
The Thundering Herd ¶
You push a new image and update 50 clusters. All 50 start pulling simultaneously:
50 clusters × 10 nodes × 500MB image = 250GB of transfer, all at once
Symptoms:
- Registry timeouts
- Image pull failures (“429 Too Many Requests”)
- Slow deployments
Mitigation:
Registry caching / pull-through cache:
# Deploy a registry mirror in each cluster or region
# Configure containerd/docker to use local mirror
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
endpoint = ["https://registry-cache.internal:5000"]
Geo-distributed registries:
- Replicate images to regional registries
- Route clusters to nearest registry
Pre-pulling:
- DaemonSet that pulls images before deployment
- Reduces the thundering herd by spreading pulls out over time (see the sketch below)
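A minimal sketch of the pre-pull idea: a DaemonSet that references the new image so every node pulls it ahead of the real rollout (image name and namespace are placeholders; the no-op command assumes the image has a shell):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: image-prepuller
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: image-prepuller
  template:
    metadata:
      labels:
        app: image-prepuller
    spec:
      initContainers:
        # The pull happens here; the init container exits immediately
        - name: prepull
          image: registry.example.com/my-app:v1.2.3
          command: ["sh", "-c", "exit 0"]
      containers:
        # Tiny long-running container so the DaemonSet stays Ready
        - name: pause
          image: registry.k8s.io/pause:3.9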
Staggered rollouts:
- Don’t update all 50 clusters simultaneously
- Roll out region by region, cluster by cluster
Registry as SPOF ¶
If your single registry goes down, no cluster can pull new images.
Mitigation:
- HA registry deployment
- Multiple registry replicas across zones
- Fallback registries in image pull specs (limited support)
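Building on the mirror config shown earlier, containerd tries mirror endpoints in order, which gives a crude fallback path when the cache or primary is unavailable (hostnames are placeholders):
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
  # Tried in order: local cache first, then the upstream registry
  endpoint = ["https://registry-cache.internal:5000", "https://registry-1.docker.io"]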
Observability Overhead ¶
Monitoring 50 clusters generates massive telemetry. The observability stack itself becomes a scaling challenge.
Prometheus Federation Limits ¶
Classic pattern: Prometheus per cluster, federate to central Prometheus.
Problems at scale:
- Federation scrapes are expensive (pulls all series)
- Central Prometheus cardinality explodes
- Query latency increases
Symptoms:
- Federation scrapes timing out
- Central Prometheus OOM
- Slow dashboards
Mitigation:
- Thanos/Cortex/Mimir: Scalable backends that accept remote-write
- Remote write: Push metrics instead of federation pull
- Recording rules: Aggregate at edge, send summaries
- Reduce cardinality: Drop high-cardinality labels before sending
# Per-cluster Prometheus: remote write to central
remote_write:
- url: https://thanos-receive.monitoring:19291/api/v1/receive
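Pairing remote write with the cardinality advice above, write_relabel_configs can drop labels before anything leaves the cluster; the label names here are only examples:
remote_write:
  - url: https://thanos-receive.monitoring:19291/api/v1/receive
    write_relabel_configs:
      # Drop high-cardinality labels before shipping to the central store
      - action: labeldrop
        regex: pod_template_hash|instance_id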
Central Logging ¶
50 clusters × 1000 pods × 10 log lines/second = 500,000 lines/second.
Symptoms:
- Log ingestion lag
- Dropped logs
- Query timeouts
Mitigation:
- Sampling: Don’t ship all logs (sample debug, keep errors)
- Edge aggregation: Aggregate common patterns locally
- Tiered storage: Hot/cold storage for logs
- Per-cluster Loki: Query individual clusters, federate on demand
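One concrete form of the sampling idea, assuming Promtail as the shipper: a drop stage that filters debug-level chatter at the edge before it reaches Loki (the job name and expression are placeholders):
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    pipeline_stages:
      # Drop debug/trace lines locally; warnings and errors still ship
      - drop:
          expression: "level=(debug|trace)"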
When Monitoring Causes the Problem ¶
Heavy monitoring can stress clusters:
- Prometheus scraping thousands of targets
- Logging agents consuming CPU/memory
- Tracing overhead on every request
Watch for:
- Monitoring pods consuming significant cluster resources
- Scrape intervals too aggressive
- Overly verbose logging levels
Mitigation:
- Scrape less frequently for stable metrics
- Use service discovery efficiently (don’t scrape what you don’t need)
- Set appropriate resource limits on monitoring components
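If you run the Prometheus Operator, scrape cadence is set per endpoint on the ServiceMonitor; a hedged example relaxing a stable target to 60s (names are placeholders):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: stable-app
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: stable-app
  endpoints:
    - port: metrics
      # Stable, slow-moving metrics do not need an aggressive scrape interval
      interval: 60s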
Secrets Distribution ¶
Multi-cluster secrets management adds latency and complexity.
Vault at Scale ¶
Vault is often the central secrets store. Every cluster fetches secrets:
50 clusters × 100 secrets × refresh every 5 minutes = 1000 requests/minute
Symptoms:
- Vault latency increases
- Secret sync delays
- Pod startup blocked waiting for secrets
Mitigation:
- Vault replication (regional Vault clusters)
- Caching (external-secrets-operator caches locally)
- Longer TTLs for stable secrets
- Batch secret fetches
External-Secrets-Operator ¶
Runs in each cluster, syncs secrets from external stores.
At scale:
- Each cluster runs reconciliation loops
- All hitting the same Vault/AWS Secrets Manager/etc.
Mitigation:
- Tune sync intervals (not everything needs 30-second refresh)
- Use refresh strategies (only refresh on pod restart)
- Batch requests where possible
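With External Secrets Operator, the per-resource refreshInterval is the main lever; a sketch relaxing a stable secret to an hourly sync (API version, store name, and keys depend on your operator release and setup):
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-credentials
spec:
  # Stable secrets do not need aggressive refresh against Vault
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: app-credentials
  data:
    - secretKey: password
      remoteRef:
        key: secret/data/app
        property: password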
Diagnosis: Finding the Bottleneck ¶
When things slow down, where do you look first?
Systematic Approach ¶
1. Start at the symptom
   - Slow deployments? → ArgoCD, registry
   - Stale state in member clusters? → Hub, agent sync
   - Metrics gaps? → Observability stack
2. Check the coordination layer
   - Hub cluster health (API server, etcd)
   - Member agent logs
   - Work object backlogs
3. Check shared infrastructure
   - Registry response times
   - Git repo rate limits
   - Vault/secrets latency
4. Check cross-cutting tools
   - ArgoCD queue depths and reconciliation times
   - Prometheus scrape durations
   - Logging ingestion lag
Key Metrics Across the Fleet ¶
Hub cluster:
# API server latency
histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le, verb))
# etcd latency
histogram_quantile(0.99, sum(rate(etcd_request_duration_seconds_bucket[5m])) by (le, operation))
# Watch count
sum(apiserver_registered_watchers) by (group, resource)
ArgoCD:
# Reconciliation duration
histogram_quantile(0.99, sum(rate(argocd_app_reconcile_bucket[5m])) by (le))
# Queue depth
workqueue_depth{name="app_operation_processing_queue"}
# Sync failures
sum(increase(argocd_app_sync_total{phase="Failed"}[1h])) by (dest_server)
Registry (if you expose metrics):
# Pull latency
histogram_quantile(0.99, sum(rate(registry_http_request_duration_seconds_bucket[5m])) by (le))
# Request rate
sum(rate(registry_http_requests_total[5m])) by (method)
Per-cluster Prometheus (federate or remote-write these):
# API server health across fleet
sum(apiserver_request_total) by (cluster, code)
# Pod startup latency across fleet
histogram_quantile(0.99, sum(rate(kubelet_pod_start_duration_seconds_bucket[5m])) by (le, cluster))
When You’re Stuck ¶
If metrics don’t reveal the bottleneck:
- Add tracing: Instrument the slow path (ArgoCD reconciliation, agent sync)
- Profile: pprof on Go components (ArgoCD, fleet agents)
- Simplify: Reduce fleet size temporarily to isolate
- Bisect: Disable half the clusters/Applications, see if problem persists
Summary ¶
Multi-cluster Kubernetes trades single-cluster bottlenecks for coordination-layer bottlenecks. At scale, watch for:
| Layer | Bottlenecks |
|---|---|
| Hub cluster | API server load, etcd size, watch fan-out |
| Member agents | Sync frequency, Work object size, status storms |
| ArgoCD | Application controller, repo server, Git rate limits |
| Image registry | Pull thundering herd, single registry SPOF |
| Observability | Federation limits, logging ingestion, cardinality |
| Secrets | Vault load, sync latency |
The tools that manage your fleet can become the bottleneck your fleet was supposed to escape. Monitor the coordinators, not just the clusters.