Bottlenecks in Large-Scale Multi-Cluster Kubernetes


You went multi-cluster to escape the limits of a single control plane. Now you have 50 clusters, and new bottlenecks have emerged — at the coordination layer, at shared infrastructure, at the tools that manage the fleet. The problems are different, but the pain is familiar.

This post surveys the common bottlenecks in large-scale multi-cluster Kubernetes environments and how to identify them.

In a single cluster, bottlenecks are usually:

  • etcd (write throughput, watch fan-out)
  • API server (request rate, webhook latency)
  • Scheduler (pod throughput)
  • Controllers (reconciliation speed)

Multi-cluster solves these by partitioning. But it introduces new chokepoints:

  • Coordination layer: The hub cluster, fleet management APIs
  • Shared infrastructure: Image registries, secret stores, Git repos
  • Cross-cutting tools: ArgoCD, observability stack, policy engines

The pattern: what worked at 5 clusters breaks at 50. A tool that seemed lightweight becomes a bottleneck when multiplied across the fleet.

If you’re using KubeFleet, Azure Kubernetes Fleet Manager, or similar hub-spoke architectures, the hub cluster coordinates the fleet. It’s lightweight by design, but not infinitely so.

The hub maintains state for every member cluster:

  • MemberCluster objects (one per cluster)
  • ClusterResourcePlacement objects (your placement intents)
  • Work objects (propagated resources, potentially thousands)

Member agents watch the hub for changes. With 50 clusters, that’s 50 agents holding open watches. With 100 placements each generating 10 Work objects per target cluster, 50 clusters puts 50,000 Work objects on the hub.
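
A quick sanity check for that regime is simply counting Work objects on the hub, assuming your fleet implementation exposes the work resource the way the diagnosis snippets below do:

# Total Work objects across all member namespaces on the hub
kubectl get work -A --no-headers | wc -l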

Symptoms:

  • Hub API server latency increases
  • Member agents report slow sync
  • apiserver_request_duration_seconds shows elevated P99

Diagnosis:

# On hub cluster
kubectl top pods -n fleet-system
kubectl get --raw /metrics | grep apiserver_request_duration

# Watch count
kubectl get --raw /metrics | grep apiserver_registered_watchers

Mitigation:

  • Right-size hub cluster (it’s often under-provisioned)
  • Reduce Work object churn (batch changes, avoid frequent updates)
  • Consider multiple hubs for very large fleets (federation of federations)

The hub’s etcd stores all fleet coordination state. More clusters and placements = more objects = more etcd pressure.

Watch for:

  • etcd latency (etcd_request_duration_seconds)
  • Database size (etcd_debugging_mvcc_db_total_size_in_bytes)
  • Compaction falling behind
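
The database-size check is easy to express against the metrics above; this sketch assumes etcd’s standard quota gauge (etcd_server_quota_backend_bytes) is scraped from the hub:

# Hub etcd database size as a fraction of the backend quota
etcd_debugging_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes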

Mitigation:

  • Dedicated etcd nodes with SSDs
  • Increase etcd quota if hitting limits
  • Clean up stale Work objects and completed placements

Each member cluster runs an agent that pulls state from the hub. At scale, these agents become a factor.

Agents sync state from the hub to pick up Work objects. More frequent syncs = fresher state but more hub load. Less frequent = staler state but a lighter hub.

Trade-off:

50 clusters × 1 sync/second = 50 requests/second to hub
50 clusters × 1 sync/10 seconds = 5 requests/second to hub

Most agents use watches (efficient), but reconnections and resyncs generate load.

A Work object contains the full manifest of propagated resources. Propagating a large ConfigMap or a Deployment with lengthy specs means large Work objects.

Symptoms:

  • Slow sync times
  • High memory usage in member agents
  • Rising network transfer between hub and members

Diagnosis:

# Size of Work objects
kubectl get work -n fleet-member-cluster-1 -o json | wc -c
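
To find the worst offenders rather than the total, a per-object breakdown helps (assumes jq is installed; namespace name as above):

# Largest Work objects in a member namespace, by serialized size
kubectl get work -n fleet-member-cluster-1 -o json \
  | jq -r '.items[] | "\(.metadata.name) \(tostring | length)"' \
  | sort -k2 -n | tail -5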

Mitigation:

  • Avoid propagating large ConfigMaps (use external config stores)
  • Propagate references instead of data where possible
  • Compress or chunk large manifests

Member agents report status back to the hub. With many resources across many clusters, status updates can overwhelm the hub.

Symptoms:

  • Hub API server write latency spikes
  • etcd write throughput saturated
  • Agents backing off on status updates

Mitigation:

  • Batch status updates
  • Report status less frequently for stable resources
  • Use conditions efficiently (don’t update if unchanged)

ArgoCD is often the tool managing deployments across multi-cluster fleets. A single ArgoCD instance managing 50+ clusters hits limits.

The application controller reconciles Applications — comparing desired state (Git) with actual state (clusters). Each Application means:

  • Watching the target cluster
  • Generating manifests (calling repo server)
  • Computing diff
  • Optionally syncing

At scale:

500 Applications × 3-minute sync interval = ~3 reconciliations/second

This seems manageable until you account for:

  • Manifest generation time (Helm templates, Kustomize)
  • Target cluster API latency
  • Diff computation for large Applications

Symptoms:

  • Applications stuck in “Progressing”
  • Long sync times
  • Controller CPU pegged

Diagnosis:

# Controller metrics
kubectl port-forward svc/argocd-metrics 8082:8082 -n argocd
curl localhost:8082/metrics | grep argocd_app_reconcile

# Queue depth
curl localhost:8082/metrics | grep workqueue_depth

Mitigation:

  • Increase controller replicas (with sharding)
  • Reduce sync frequency for stable Applications
  • Use Server-Side Apply (faster diffs)
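
For the sync-frequency point above, the refresh interval is controlled by timeout.reconciliation in the argocd-cm ConfigMap (raising it also stretches how often Git is polled). A minimal sketch, assuming the default install layout:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  # Default is 180s; raising it trades freshness for controller and Git load
  timeout.reconciliation: 300s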

The repo server generates manifests from Git repos. It’s CPU and memory intensive:

  • Cloning repos
  • Running Helm template
  • Running Kustomize build
  • Caching results

Symptoms:

  • Slow manifest generation
  • Repo server OOMKilled
  • Applications show “ComparisonError”

Diagnosis:

kubectl top pods -n argocd -l app.kubernetes.io/component=repo-server
kubectl logs -n argocd -l app.kubernetes.io/component=repo-server | grep -i error

Mitigation:

  • Scale repo server horizontally
  • Increase memory limits (Helm/Kustomize can be memory-hungry)
  • Use repo server parallelism settings
  • Cache Helm dependencies (avoid re-downloading)
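
Scaling out is usually the quickest win; with the default install the repo server is a plain Deployment:

# Run more repo-server replicas (deployment name from the standard manifests)
kubectl -n argocd scale deployment argocd-repo-server --replicas=3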

ArgoCD uses Redis for caching. With many Applications and clusters, Redis becomes a factor.

Symptoms:

  • High Redis memory usage
  • Slow cache operations
  • Evictions causing cache misses (re-generating manifests)

Mitigation:

  • Increase Redis memory
  • Run Redis in HA mode (the ArgoCD HA manifests ship a Sentinel-based Redis setup)
  • Tune cache TTLs

ArgoCD polls Git repos for changes. With many Applications:

500 Applications polling every 3 minutes = 167 Git fetches/minute

If using GitHub, you’ll hit rate limits. If using webhooks, you’ll generate storms on every push.

Symptoms:

  • “rate limit exceeded” errors
  • Applications not detecting changes
  • Webhook timeouts

Mitigation:

  • Use webhooks instead of polling (more efficient)
  • Consolidate repos (fewer repos = fewer fetches)
  • Increase polling interval
  • Use GitHub App authentication (higher rate limits)

A single ArgoCD can’t manage thousands of Applications efficiently. Sharding options:

Option 1: Shard by cluster

  • Multiple ArgoCD controller replicas
  • Each controller handles a subset of clusters
  • Set ARGOCD_CONTROLLER_REPLICAS to the shard count; optionally pin clusters to shards via the shard field on their cluster Secrets
# argocd-application-controller StatefulSet
env:
  - name: ARGOCD_CONTROLLER_REPLICAS
    value: "3"

Option 2: Multiple ArgoCD instances

  • Dedicated ArgoCD per environment (prod, staging)
  • Or per team / business unit
  • More operational overhead but better isolation

Option 3: ApplicationSets with progressive sync

  • Generate Applications dynamically
  • Use rolling sync strategies
  • Limit concurrent syncs
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: my-app
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            environment: production
  strategy:
    type: RollingSync
    rollingSync:
      steps:
        - matchExpressions:
            - key: region
              operator: In
              values: [us-east-1]
        - matchExpressions:
            - key: region
              operator: In
              values: [us-west-2]
  template:
    # ...

Every cluster pulls container images. At scale, the registry becomes critical infrastructure.

You push a new image and update 50 clusters. All 50 start pulling simultaneously:

50 clusters × 10 nodes × 500MB image = 250GB of transfer, all at once

Symptoms:

  • Registry timeouts
  • Image pull failures (“429 Too Many Requests”)
  • Slow deployments

Mitigation:

Registry caching / pull-through cache:

# Deploy a registry mirror in each cluster or region
# Configure containerd/docker to use local mirror
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
  endpoint = ["https://registry-cache.internal:5000"]

Geo-distributed registries:

  • Replicate images to regional registries
  • Route clusters to nearest registry

Pre-pulling:

  • DaemonSet that pulls images before deployment
  • Reduces thundering herd by spreading pull over time
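
A minimal pre-pull DaemonSet sketch; the image name, registry host, and pause image are placeholders, and it assumes the target image ships a shell for the no-op command:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: prepull-myapp
spec:
  selector:
    matchLabels:
      app: prepull-myapp
  template:
    metadata:
      labels:
        app: prepull-myapp
    spec:
      initContainers:
        - name: pull
          # The image you want warmed on every node before the real rollout
          image: registry.internal/myapp:v2.3.0
          command: ["sh", "-c", "exit 0"]   # exit immediately; the pull is the point
      containers:
        - name: pause
          # Minimal container that keeps the pod Running so the DaemonSet stays healthy
          image: registry.k8s.io/pause:3.9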

Staggered rollouts:

  • Don’t update all 50 clusters simultaneously
  • Roll out region by region, cluster by cluster

If your single registry goes down, no cluster can pull new images.

Mitigation:

  • HA registry deployment
  • Multiple registry replicas across zones
  • Fallback registries in image pull specs (limited support)

Monitoring 50 clusters generates massive telemetry. The observability stack itself becomes a scaling challenge.

Classic pattern: Prometheus per cluster, federate to central Prometheus.

Problems at scale:

  • Federation scrapes are expensive (pulls all series)
  • Central Prometheus cardinality explodes
  • Query latency increases

Symptoms:

  • Federation scrapes timing out
  • Central Prometheus OOM
  • Slow dashboards

Mitigation:

  • Thanos/Cortex/Mimir: Scalable backends that accept remote-write
  • Remote write: Push metrics instead of federation pull
  • Recording rules: Aggregate at edge, send summaries
  • Reduce cardinality: Drop high-cardinality labels before sending
# Per-cluster Prometheus: remote write to central
remote_write:
  - url: https://thanos-receive.monitoring:19291/api/v1/receive
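
The cardinality point above can be enforced on the same remote_write entry via write_relabel_configs; dropping an entire metric family is the bluntest version, and the name in the regex is purely illustrative:

remote_write:
  - url: https://thanos-receive.monitoring:19291/api/v1/receive
    write_relabel_configs:
      # Drop an example high-cardinality metric family before it leaves the cluster
      - source_labels: [__name__]
        regex: "myapp_request_duration_seconds_bucket"
        action: drop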

Logs are the other firehose: 50 clusters × 1000 pods × 10 log lines/second = 500,000 lines/second.

Symptoms:

  • Log ingestion lag
  • Dropped logs
  • Query timeouts

Mitigation:

  • Sampling: Don’t ship all logs (sample debug, keep errors)
  • Edge aggregation: Aggregate common patterns locally
  • Tiered storage: Hot/cold storage for logs
  • Per-cluster Loki: Query individual clusters, federate on demand

Heavy monitoring can stress clusters:

  • Prometheus scraping thousands of targets
  • Logging agents consuming CPU/memory
  • Tracing overhead on every request

Watch for:

  • Monitoring pods consuming significant cluster resources
  • Scrape intervals too aggressive
  • Overly verbose logging levels

Mitigation:

  • Scrape less frequently for stable metrics
  • Use service discovery efficiently (don’t scrape what you don’t need)
  • Set appropriate resource limits on monitoring components
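
For the scrape-frequency point, a per-job override in Prometheus is enough; the job name and target below are illustrative:

scrape_configs:
  - job_name: kube-state-metrics
    # Slow-moving targets rarely need the 15-30s global default
    scrape_interval: 60s
    static_configs:
      - targets: ["kube-state-metrics.monitoring.svc:8080"]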

Multi-cluster secrets management adds latency and complexity.

Vault is often the central secrets store. Every cluster fetches secrets:

50 clusters × 100 secrets × refresh every 5 minutes = 1000 requests/minute

Symptoms:

  • Vault latency increases
  • Secret sync delays
  • Pod startup blocked waiting for secrets

Mitigation:

  • Vault replication (regional Vault clusters)
  • Caching (external-secrets-operator caches locally)
  • Longer TTLs for stable secrets
  • Batch secret fetches

The external-secrets operator runs in each cluster and syncs secrets from external stores.

At scale:

  • Each cluster runs reconciliation loops
  • All hitting the same Vault/AWS Secrets Manager/etc.

Mitigation:

  • Tune sync intervals (not everything needs 30-second refresh)
  • Use refresh strategies (only refresh on pod restart)
  • Batch requests where possible
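
For the sync-interval point, external-secrets exposes refreshInterval per ExternalSecret; the store name, Vault path, and keys below are placeholders:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
spec:
  # Stable secrets rarely need aggressive refresh
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend          # assumed ClusterSecretStore name
    kind: ClusterSecretStore
  target:
    name: db-credentials
  data:
    - secretKey: password
      remoteRef:
        key: secret/data/db      # assumed Vault KV path
        property: password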

When things slow down, where do you look first?

  1. Start at the symptom

    • Slow deployments? → ArgoCD, registry
    • Stale state in member clusters? → Hub, agent sync
    • Metrics gaps? → Observability stack
  2. Check the coordination layer

    • Hub cluster health (API server, etcd)
    • Member agent logs
    • Work object backlogs
  3. Check shared infrastructure

    • Registry response times
    • Git repo rate limits
    • Vault/secrets latency
  4. Check cross-cutting tools

    • ArgoCD queue depths and reconciliation times
    • Prometheus scrape durations
    • Logging ingestion lag

Key metrics to watch, by layer.

Hub cluster:

# API server latency
histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le, verb))

# etcd latency
histogram_quantile(0.99, sum(rate(etcd_request_duration_seconds_bucket[5m])) by (le, operation))

# Watch count
sum(apiserver_registered_watchers) by (group, kind)

ArgoCD:

# Reconciliation duration
histogram_quantile(0.99, sum(rate(argocd_app_reconcile_bucket[5m])) by (le))

# Queue depth
workqueue_depth{name="app_operation_processing_queue"}

# Sync failures
sum(increase(argocd_app_sync_total{phase="Failed"}[1h])) by (dest_server)

Registry (if you expose metrics):

# Pull latency
histogram_quantile(0.99, sum(rate(registry_http_request_duration_seconds_bucket[5m])) by (le))

# Request rate
sum(rate(registry_http_requests_total[5m])) by (method)

Per-cluster Prometheus (federate or remote-write these):

# API server health across fleet
sum(apiserver_request_total) by (cluster, code)

# Pod startup latency across fleet
histogram_quantile(0.99, sum(rate(kubelet_pod_start_duration_seconds_bucket[5m])) by (le, cluster))

If metrics don’t reveal the bottleneck:

  1. Add tracing: Instrument the slow path (ArgoCD reconciliation, agent sync)
  2. Profile: pprof on Go components (ArgoCD, fleet agents)
  3. Simplify: Reduce fleet size temporarily to isolate
  4. Bisect: Disable half the clusters/Applications, see if problem persists

Multi-cluster Kubernetes trades single-cluster bottlenecks for coordination-layer bottlenecks. At scale, watch for:

Layer             Bottlenecks
Hub cluster       API server load, etcd size, watch fan-out
Member agents     Sync frequency, Work object size, status storms
ArgoCD            Application controller, repo server, Git rate limits
Image registry    Pull thundering herd, single registry SPOF
Observability     Federation limits, logging ingestion, cardinality
Secrets           Vault load, sync latency

The tools that manage your fleet can become the bottleneck your fleet was supposed to escape. Monitor the coordinators, not just the clusters.