You’ve got 2,000 Services in your cluster. Every node has 20,000+ iptables rules. Pod startup takes 5 seconds just for iptables programming. Network policy updates take minutes to propagate. Welcome to the limits of kube-proxy.
This post explains why kube-proxy struggles at scale, how eBPF fundamentally changes the game, and how to migrate to Cilium without breaking production.
How kube-proxy Works (And Why It Breaks) ¶
kube-proxy runs on every node and implements Kubernetes Services. When you create a Service with a ClusterIP, kube-proxy makes that virtual IP actually work.
The iptables Implementation ¶
By default, kube-proxy uses iptables. For each Service, it creates a chain of rules:
# Simplified view of what kube-proxy creates for one Service
-A KUBE-SERVICES -d 10.96.45.67/32 -p tcp --dport 80 -j KUBE-SVC-XXXX
-A KUBE-SVC-XXXX -m statistic --mode random --probability 0.33 -j KUBE-SEP-AAA
-A KUBE-SVC-XXXX -m statistic --mode random --probability 0.50 -j KUBE-SEP-BBB
-A KUBE-SVC-XXXX -j KUBE-SEP-CCC
-A KUBE-SEP-AAA -p tcp -j DNAT --to-destination 10.244.1.5:80
-A KUBE-SEP-BBB -p tcp -j DNAT --to-destination 10.244.2.8:80
-A KUBE-SEP-CCC -p tcp -j DNAT --to-destination 10.244.3.2:80
This works. But count the rules:
Per Service:
1 rule in KUBE-SERVICES (match ClusterIP)
N rules in KUBE-SVC-* (one per endpoint, for load balancing)
N rules in KUBE-SEP-* (one per endpoint, for DNAT)
Total: 1 + 2N rules per Service
With 2,000 Services averaging 3 endpoints each:
2,000 × (1 + 2×3) = 14,000 rules minimum
Add NodePort rules, external IPs, and load balancer rules — you’re easily at 20,000+ rules per node.
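You can check where a node actually stands by counting the chains kube-proxy has programmed:
# Count all kube-proxy-managed rules on this node
iptables-save | grep -c 'KUBE-'
# Break the count down by chain type
iptables-save | grep -oE 'KUBE-(SERVICES|SVC|SEP|NODEPORTS)' | sort | uniq -c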
The O(n) Problem ¶
iptables evaluates rules sequentially. When a packet arrives:
Packet arrives
|
v
Rule 1: Does it match? No -> next
Rule 2: Does it match? No -> next
Rule 3: Does it match? No -> next
...
Rule 15,000: Does it match? Yes -> DNAT
Every packet traverses rules until it finds a match. With 20,000 rules, that’s potentially 20,000 comparisons per packet. The first Service in the chain is fast; the last one is slow.
Measured impact:
| Services | Rules | Latency (p99) |
|---|---|---|
| 100 | ~700 | 0.5ms |
| 1,000 | ~7,000 | 2ms |
| 5,000 | ~35,000 | 8ms |
| 10,000 | ~70,000 | 20ms+ |
At scale, iptables becomes the bottleneck, not your application.
Rule Update Storm ¶
When an endpoint changes (pod dies, new pod starts), kube-proxy must update iptables. But iptables doesn’t support atomic updates of individual rules — kube-proxy rewrites large chunks of the rule set.
1. Pod dies
2. Endpoint controller updates Endpoints object
3. kube-proxy sees the change
4. kube-proxy rebuilds iptables rules
5. iptables-restore replaces rules atomically (but slowly)
With 20,000 rules, step 5 can take seconds. During this time:
- CPU spikes on every node
- Network connections may be briefly disrupted
- Other iptables users (CNI, NetworkPolicy) compete for the lock
In large clusters, a rolling deployment can cause cascading iptables updates across all nodes simultaneously.
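You can get a rough sense of the reload cost on your own nodes by saving the current rule set and timing a validation-only restore (iptables-restore --test parses the rules without committing them, so it understates the real swap time):
# Snapshot the current rules and count them
iptables-save > /tmp/kube-rules.txt
wc -l /tmp/kube-rules.txt
# Time a validation-only pass as a rough lower bound
time iptables-restore --test < /tmp/kube-rules.txt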
conntrack Table Exhaustion ¶
iptables uses conntrack to track connection state (needed for DNAT reverse translation). The conntrack table has a default limit:
$ cat /proc/sys/net/netfilter/nf_conntrack_max
131072
131,072 connections. Sounds like a lot until you have:
- 100 pods per node
- Each pod has 50 concurrent connections
- That's 5,000 established connections just from local pods
And entries outlive their connections: a closed TCP connection lingers in conntrack's TIME_WAIT state for 120 seconds by default, so a node churning through a few hundred new connections per second carries tens of thousands of stale entries on top of the established ones. Add in Service traffic, health checks, and monitoring — you hit the limit. When conntrack is full, new connections are dropped silently.
# Check if you're hitting limits
$ dmesg | grep conntrack
nf_conntrack: table full, dropping packet
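Beyond waiting for that message, you can watch the table fill up and raise the ceiling before it becomes a problem:
# Current usage vs. the limit
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
# Raise the limit (the value here is illustrative; size it to your workload)
sysctl -w net.netfilter.nf_conntrack_max=262144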
IPVS Mode: kube-proxy’s Improvement ¶
Kubernetes 1.11+ supports IPVS (IP Virtual Server) mode in kube-proxy. IPVS is a kernel-level load balancer that uses hash tables instead of sequential rule matching.
How IPVS Differs ¶
iptables: Linear search O(n)
Rule 1 -> Rule 2 -> Rule 3 -> ... -> Rule n
IPVS: Hash table lookup O(1)
Hash(ClusterIP:Port) -> Backend pool -> Select backend
IPVS stores Services in a hash table. Lookup is constant time regardless of Service count.
Enabling IPVS Mode ¶
# Edit kube-proxy ConfigMap
kubectl edit configmap kube-proxy -n kube-system
# Change mode from "" (iptables) to "ipvs"
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "rr" # round-robin, or "lc", "dh", "sh", etc.
# Restart kube-proxy
kubectl rollout restart daemonset kube-proxy -n kube-system
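IPVS mode also requires the IPVS kernel modules on every node; kube-proxy generally falls back to iptables mode if they are missing, so verify them before the restart:
# Load the IPVS modules (persist via /etc/modules-load.d/ as needed)
modprobe -a ip_vs ip_vs_rr ip_vs_wrr ip_vs_sh nf_conntrack
# Confirm they are loaded
lsmod | grep -e ip_vs -e nf_conntrack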
IPVS Scheduling Algorithms ¶
IPVS supports multiple load balancing algorithms:
| Algorithm | Flag | Description |
|---|---|---|
| Round Robin | rr | Rotate through backends |
| Least Connections | lc | Send to backend with fewest connections |
| Destination Hashing | dh | Hash destination IP for sticky routing |
| Source Hashing | sh | Hash source IP for sticky routing |
| Shortest Expected Delay | sed | Minimize expected delay |
# Check IPVS rules
$ ipvsadm -Ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 10.96.45.67:80 rr
-> 10.244.1.5:80 Masq 1 3 0
-> 10.244.2.8:80 Masq 1 2 0
-> 10.244.3.2:80 Masq 1 4 0
IPVS Limitations ¶
IPVS solves the O(n) lookup problem but still has issues:
- Still uses iptables for some functions (SNAT, masquerade, NodePort)
- Still uses conntrack — same table exhaustion problems
- Control loop still in userspace — kube-proxy watches the API server and reprograms the kernel on every endpoint change
- No Network Policy — IPVS is for load balancing only; you still need iptables or another solution for policies
IPVS is better than iptables mode, but it’s an incremental improvement, not a fundamental redesign.
Enter eBPF ¶
eBPF (extended Berkeley Packet Filter) is a technology that lets you run sandboxed programs inside the Linux kernel. Instead of configuring kernel behavior through static rules (iptables), you inject custom code that the kernel executes.
What Makes eBPF Different ¶
Traditional approach (iptables):
Userspace: kube-proxy watches API, generates rules
|
v
Kernel: iptables netfilter framework
- Generic rule matching engine
- Not optimized for Kubernetes use case
eBPF approach (Cilium):
Userspace: Cilium agent watches API, compiles eBPF programs
|
v
Kernel: Custom eBPF programs attached to network hooks
- Purpose-built for Kubernetes
- Hash tables, direct routing, no rule chains
eBPF Programs ¶
An eBPF program is C code compiled to bytecode that the kernel verifies and JIT-compiles:
// Simplified example: redirect new connections to a backend.
// The map (services_map), its key/value structs, and select_backend()
// are assumed to be defined elsewhere; this only sketches the lookup path.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
SEC("sk_lookup")
int service_lookup(struct bpf_sk_lookup *ctx) {
    // Look up Service in eBPF map (hash table)
    struct service_key key = {
        .ip = ctx->local_ip4,
        .port = ctx->local_port,
    };
    struct service_value *svc = bpf_map_lookup_elem(&services_map, &key);
    if (!svc)
        return SK_PASS; // Not a Service, let it through
    // Select backend (load balancing)
    struct backend *backend = select_backend(svc);
    // Assign the backend socket, then accept the lookup
    // (bpf_sk_assign returns 0 on success, not an SK_* verdict)
    bpf_sk_assign(ctx, backend->socket, 0);
    return SK_PASS;
}
Key advantages:
- Hash table lookups: O(1) Service resolution via eBPF maps
- No context switches: Code runs in kernel, no userspace round-trips
- Socket-level routing: Can intercept at socket connect(), before any packet is generated
- Programmable: Can implement any logic, not limited to predefined rule types
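To see these objects on a live node, bpftool (shipped with the kernel tools) lists loaded programs and maps. This is generic kernel tooling, not Cilium-specific:
# List loaded eBPF programs and maps
bpftool prog list
bpftool map list
# Dump the contents of a specific map by id
bpftool map dump id <map-id>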
eBPF Hook Points ¶
eBPF programs attach to specific kernel hook points:
Application
|
| connect() syscall
v
+-------------------+
| cgroup/connect4 | <-- eBPF: Intercept before socket connects
+-------------------+
|
v
+-------------------+
| Socket layer |
+-------------------+
|
| Packet created
v
+-------------------+
| tc ingress/egress | <-- eBPF: Modify packets in traffic control
+-------------------+
|
v
+-------------------+
| XDP (driver) | <-- eBPF: Earliest possible hook, in NIC driver
+-------------------+
|
v
Network
Cilium uses multiple hooks:
- cgroup hooks: Intercept socket operations (connect, bind, sendmsg)
- tc hooks: Process packets after they’re created
- XDP: Ultra-fast packet processing at the driver level
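bpftool can also show which of these hooks actually have programs attached on a node, which is useful later for confirming what Cilium has installed:
# tc and XDP attachments per network interface
bpftool net show
# cgroup attachments (connect4, sendmsg4, and so on)
bpftool cgroup tree /sys/fs/cgroup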
Socket-Level Load Balancing ¶
This is Cilium’s killer feature. Traditional kube-proxy works at the packet level:
kube-proxy (packet-level DNAT):
App connects to 10.96.45.67:80 (ClusterIP)
|
v
Packet created: src=10.244.1.10, dst=10.96.45.67
|
v
iptables DNAT: rewrite dst to 10.244.2.8 (backend)
|
v
Packet sent: src=10.244.1.10, dst=10.244.2.8
|
v
Response requires conntrack to reverse the DNAT
Cilium with socket-level LB:
Cilium (socket-level):
App calls connect(10.96.45.67:80)
|
v
eBPF intercepts connect() syscall
|
v
Looks up Service, selects backend 10.244.2.8
|
v
Rewrites socket destination to 10.244.2.8
|
v
Packet created: src=10.244.1.10, dst=10.244.2.8 (already correct!)
|
v
No DNAT needed, no conntrack entry needed for Service
Benefits:
- No conntrack entries for Service traffic (reduces table pressure)
- Lower latency (no packet rewriting in the data path)
- Works with any protocol (not just TCP/UDP)
- Survives backend changes (socket already connected to real backend)
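Once Cilium is installed (installation is covered in the next section), you can confirm socket-level LB is active from the agent's status output. The exact layout varies between Cilium versions, but the kube-proxy replacement details in the verbose status list socket LB among the enabled features:
# Look for Socket LB under the KubeProxyReplacement details
kubectl exec -n kube-system ds/cilium -- cilium status --verbose | grep -A 15 'KubeProxyReplacement'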
Cilium as kube-proxy Replacement ¶
Cilium can fully replace kube-proxy, handling ClusterIP, NodePort, and LoadBalancer Services as well as externalIPs traffic.
Architecture ¶
+------------------+
| Cilium Agent | Runs on every node as DaemonSet
+--------+---------+
|
| Watches K8s API (Services, Endpoints, Pods)
| Compiles eBPF programs
| Loads programs into kernel
v
+------------------+
| eBPF Maps | Hash tables in kernel memory
| - Services | ClusterIP -> backend list
| - Backends | Backend ID -> Pod IP:Port
| - Connections | Connection tracking
+------------------+
|
| eBPF programs query maps
v
+------------------+
| eBPF Programs | Attached to cgroup, tc, XDP hooks
| - Socket LB | Intercept connect()
| - Packet LB | DNAT for NodePort, external
| - Policy | Network Policy enforcement
+------------------+
Installing Cilium with kube-proxy Replacement ¶
Prerequisites:
- Linux kernel 4.19+ (5.4+ recommended for full features)
- Direct routing or tunnel mode configured
- API server address accessible from nodes
# Install Cilium CLI
curl -L --remote-name-all https://github.com/cilium/cilium-cli/releases/latest/download/cilium-linux-amd64.tar.gz
tar xzvf cilium-linux-amd64.tar.gz
sudo mv cilium /usr/local/bin/
# Install Cilium with kube-proxy replacement
cilium install \
--set kubeProxyReplacement=true \
--set k8sServiceHost=${API_SERVER_IP} \
--set k8sServicePort=${API_SERVER_PORT}
Or with Helm:
helm repo add cilium https://helm.cilium.io/
helm install cilium cilium/cilium \
--namespace kube-system \
--set kubeProxyReplacement=true \
--set k8sServiceHost=${API_SERVER_IP} \
--set k8sServicePort=${API_SERVER_PORT} \
--set socketLB.enabled=true \
--set bpf.masquerade=true
Important: You must provide k8sServiceHost and k8sServicePort because Cilium needs to connect to the API server without relying on the kubernetes Service (which would require kube-proxy to work).
Removing kube-proxy ¶
Once Cilium is running and handling Services:
# Verify Cilium is handling Services
cilium status
kubectl exec -n kube-system ds/cilium -- cilium service list
# Delete kube-proxy
kubectl -n kube-system delete ds kube-proxy
kubectl -n kube-system delete cm kube-proxy
# Clean up iptables rules left by kube-proxy (run on each node)
iptables-save | grep -v KUBE | iptables-restore
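If the kube-proxy binary or image is still available on the node, its built-in cleanup mode is a less error-prone way to flush the rules it created:
# Alternative cleanup using kube-proxy itself (run on each node)
kube-proxy --cleanup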
Verifying the Replacement ¶
# Check Cilium's view of Services
kubectl exec -n kube-system ds/cilium -- cilium service list
ID Frontend Service Type Backend
1 10.96.0.1:443 ClusterIP 10.0.0.5:6443
2 10.96.0.10:53 ClusterIP 10.244.0.15:53, 10.244.0.16:53
3 10.96.45.67:80 ClusterIP 10.244.1.5:80, 10.244.2.8:80
# Check eBPF maps
kubectl exec -n kube-system ds/cilium -- cilium bpf lb list
SERVICE ADDRESS BACKEND ADDRESS
10.96.0.1:443 10.0.0.5:6443 (1)
10.96.0.10:53 10.244.0.15:53 (1) 10.244.0.16:53 (2)
10.96.45.67:80 10.244.1.5:80 (1) 10.244.2.8:80 (2)
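It's also worth a quick end-to-end check from inside the cluster. A minimal sketch using a throwaway curl pod (the image and the Service address are placeholders; substitute one of your own Services):
# Hit a ClusterIP Service from a temporary pod
kubectl run lb-check --rm -it --restart=Never --image=curlimages/curl --command -- \
  curl -s -o /dev/null -w '%{http_code}\n' http://<service-name>.<namespace>.svc.cluster.local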
DSR (Direct Server Return) ¶
Cilium supports DSR for external traffic, where response packets go directly from the backend to the client, bypassing the original node:
Without DSR:
Client -> Node1 (NodePort) -> Pod on Node2 -> Node1 -> Client
^
Response goes back through Node1
With DSR:
Client -> Node1 (NodePort) -> Pod on Node2 -> Client
^
Response goes directly to client
Enable DSR:
# Helm values
loadBalancer:
  mode: dsr
DSR reduces latency and load on the ingress node, but requires network support (backends must be able to send packets with source IP of the Service).
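The equivalent Helm --set flags, if you prefer flags over a values file. Note that DSR requires native routing rather than tunneling, and the routing flag name has changed across Cilium versions, so check the values reference for the release you run:
helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --reuse-values \
  --set loadBalancer.mode=dsr \
  --set routingMode=native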
Maglev Load Balancing ¶
Cilium supports Maglev consistent hashing, which provides:
- Better distribution than random selection
- Connection affinity survives backend changes (mostly)
- Used by Google for their production load balancers
# Helm values
loadBalancer:
  algorithm: maglev
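Or via --set. The hash table size is tunable through maglev.tableSize; it must be one of the supported prime sizes, and larger tables give smoother distribution at the cost of memory:
helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --reuse-values \
  --set loadBalancer.algorithm=maglev \
  --set maglev.tableSize=65521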
Performance Comparison ¶
Benchmark results comparing kube-proxy (iptables), kube-proxy (IPVS), and Cilium eBPF:
Service Lookup Latency ¶
| Implementation | 100 Services | 1,000 Services | 10,000 Services |
|---|---|---|---|
| iptables | 0.3ms | 1.5ms | 12ms |
| IPVS | 0.1ms | 0.1ms | 0.15ms |
| Cilium eBPF | 0.05ms | 0.05ms | 0.05ms |
eBPF is constant time regardless of Service count.
Throughput (Requests/sec) ¶
Testing with 1,000 Services, HTTP workload:
| Implementation | RPS | CPU Usage |
|---|---|---|
| iptables | 45,000 | 80% |
| IPVS | 120,000 | 60% |
| Cilium eBPF | 180,000 | 40% |
Connection Setup Time ¶
| Implementation | p50 | p99 |
|---|---|---|
| iptables | 0.8ms | 5ms |
| IPVS | 0.3ms | 1.2ms |
| Cilium (socket LB) | 0.1ms | 0.3ms |
Socket-level LB eliminates packet-level DNAT overhead entirely.
Memory Usage ¶
| Implementation | Per Service | 10,000 Services |
|---|---|---|
| iptables | ~2KB rules | ~20MB iptables rules |
| IPVS | ~0.5KB | ~5MB |
| Cilium eBPF | ~0.2KB | ~2MB eBPF maps |
Migration Strategy ¶
Replacing kube-proxy in production requires care. Here’s a safe migration path:
Phase 1: Install Cilium Alongside kube-proxy ¶
# Install Cilium without kube-proxy replacement
cilium install --set kubeProxyReplacement=false
Cilium handles Network Policy and pod networking. kube-proxy still handles Services. Verify everything works.
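Before moving on, the Cilium CLI's built-in connectivity test is a convenient way to verify pod networking and policy enforcement under this hybrid setup:
# Wait for the agent to be healthy, then run the connectivity suite
cilium status --wait
cilium connectivity test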
Phase 2: Test kube-proxy Replacement in Staging ¶
# In staging cluster
cilium install \
--set kubeProxyReplacement=true \
--set k8sServiceHost=${API_SERVER_IP} \
--set k8sServicePort=${API_SERVER_PORT}
# Remove kube-proxy in staging
kubectl -n kube-system delete ds kube-proxy
Test all Service types:
- ClusterIP (internal services)
- NodePort (external access)
- LoadBalancer (cloud LB integration)
- ExternalName (DNS aliases)
- Headless Services (direct pod access)
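A minimal smoke test, assuming a generic HTTP echo image (a placeholder here), is to expose one deployment both ways and confirm Cilium programmed frontends for it:
# Deploy something to expose (image is a placeholder)
kubectl create deployment echo --image=<http-echo-image> --port=8080
# ClusterIP and NodePort variants of the same backend
kubectl expose deployment echo --name=echo-clusterip --port=80 --target-port=8080
kubectl expose deployment echo --name=echo-nodeport --type=NodePort --port=80 --target-port=8080
# Find their ClusterIPs, then check what Cilium programmed for them
kubectl get svc echo-clusterip echo-nodeport
kubectl exec -n kube-system ds/cilium -- cilium service list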
Phase 3: Production Migration ¶
Option A: Rolling migration (safer)
# 1. Update Cilium to enable kube-proxy replacement
helm upgrade cilium cilium/cilium \
--namespace kube-system \
--reuse-values \
--set kubeProxyReplacement=true \
--set k8sServiceHost=${API_SERVER_IP} \
--set k8sServicePort=${API_SERVER_PORT}
# 2. Wait for Cilium to restart on all nodes
kubectl rollout status ds/cilium -n kube-system
# 3. Verify Services work
kubectl exec -n kube-system ds/cilium -- cilium service list
# 4. Remove kube-proxy
kubectl -n kube-system delete ds kube-proxy
Option B: New cluster (cleanest)
Provision new cluster without kube-proxy from the start:
# kubeadm example
kubeadm init --skip-phases=addon/kube-proxy
Then install Cilium with kube-proxy replacement enabled.
Rollback Plan ¶
If things go wrong:
# Re-deploy kube-proxy the same way it was originally installed
# (e.g. via your kubeadm or managed-cluster tooling)
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/kube-proxy/kube-proxy-ds.yaml
# Disable Cilium's kube-proxy replacement
helm upgrade cilium cilium/cilium \
--namespace kube-system \
--reuse-values \
--set kubeProxyReplacement=false
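After rolling back, confirm kube-proxy is running again and that its chains have been reprogrammed before declaring the rollback complete:
# kube-proxy pods back up on every node
kubectl -n kube-system rollout status ds/kube-proxy
# On a node: the KUBE-* chains should reappear
iptables-save | grep -c 'KUBE-SVC'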
Debugging Cilium Service Routing ¶
Check Service Configuration ¶
# List all Services as Cilium sees them
kubectl exec -n kube-system ds/cilium -- cilium service list
# Get details for a specific Service
kubectl exec -n kube-system ds/cilium -- cilium service get <service-id>
Check eBPF Maps ¶
# List load balancer entries
kubectl exec -n kube-system ds/cilium -- cilium bpf lb list
# Check connection tracking
kubectl exec -n kube-system ds/cilium -- cilium bpf ct list global
Monitor Traffic ¶
# Watch traffic in real-time
kubectl exec -n kube-system ds/cilium -- cilium monitor
# Filter for specific Service
kubectl exec -n kube-system ds/cilium -- cilium monitor --related-to <pod-ip>
Common Issues ¶
Service not working after migration:
# Check if Service is in Cilium's map
cilium service list | grep <cluster-ip>
# If missing, check Cilium agent logs
kubectl logs -n kube-system -l k8s-app=cilium | grep <service-name>
Socket LB not working:
# Verify cgroup eBPF programs are attached
kubectl exec -n kube-system ds/cilium -- cilium bpf cgroup list
# Check if pods are in the Cilium-managed cgroup
# (requires cgroupv2)
mount | grep cgroup2
NodePort not accessible:
# Check if NodePort is configured
cilium service list | grep NodePort
# Verify XDP or tc programs are attached to host interfaces
kubectl exec -n kube-system ds/cilium -- cilium bpf prog list
When to Stick with kube-proxy ¶
eBPF isn’t always the answer. Keep kube-proxy if:
- Old kernels: eBPF features require kernel 4.19+; full features need 5.4+
- Small clusters: Under 500 Services, iptables overhead is negligible
- Simplicity: kube-proxy is battle-tested, well-documented, and “just works”
- Compliance: Some environments require known, auditable networking (iptables rules are more readable than eBPF bytecode)
- Windows nodes: eBPF is Linux-only; Windows nodes need kube-proxy
Summary ¶
kube-proxy served Kubernetes well, but its iptables-based design hits fundamental scaling limits:
| Problem | kube-proxy | Cilium eBPF |
|---|---|---|
| Rule scaling | O(n) linear search | O(1) hash lookup |
| Rule updates | Full table rewrite | Incremental map updates |
| conntrack | Required for all Services | Only for external traffic |
| CPU overhead | High at scale | Minimal |
| Latency | Grows with Services | Constant |
The migration path is well-established:
- Install Cilium alongside kube-proxy
- Test replacement in staging
- Enable replacement in production
- Remove kube-proxy
For clusters beyond a few hundred Services, or where network latency matters, eBPF-based service routing isn’t just faster — it’s a fundamentally better architecture.