Beyond kube-proxy: eBPF Service Routing in Kubernetes


You’ve got 2,000 Services in your cluster. Every node has 20,000+ iptables rules. Pod startup takes 5 seconds just for iptables programming. Network policy updates take minutes to propagate. Welcome to the limits of kube-proxy.

This post explains why kube-proxy struggles at scale, how eBPF fundamentally changes the game, and how to migrate to Cilium without breaking production.

kube-proxy runs on every node and implements Kubernetes Services. When you create a Service with a ClusterIP, kube-proxy makes that virtual IP actually work.

By default, kube-proxy uses iptables. For each Service, it creates a chain of rules:

# Simplified view of what kube-proxy creates for one Service
-A KUBE-SERVICES -d 10.96.45.67/32 -p tcp --dport 80 -j KUBE-SVC-XXXX

-A KUBE-SVC-XXXX -m statistic --mode random --probability 0.33 -j KUBE-SEP-AAA
-A KUBE-SVC-XXXX -m statistic --mode random --probability 0.50 -j KUBE-SEP-BBB
-A KUBE-SVC-XXXX -j KUBE-SEP-CCC

-A KUBE-SEP-AAA -p tcp -j DNAT --to-destination 10.244.1.5:80
-A KUBE-SEP-BBB -p tcp -j DNAT --to-destination 10.244.2.8:80
-A KUBE-SEP-CCC -p tcp -j DNAT --to-destination 10.244.3.2:80

This works. But count the rules:

Per Service:
  1 rule in KUBE-SERVICES (match ClusterIP)
  N rules in KUBE-SVC-* (one per endpoint, for load balancing)
  N rules in KUBE-SEP-* (one per endpoint, for DNAT)

Total: 1 + 2N rules per Service

With 2,000 Services averaging 3 endpoints each:

2,000 × (1 + 2×3) = 14,000 rules minimum

Add NodePort rules, external IPs, and load balancer rules — you’re easily at 20,000+ rules per node.
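
To see where a given node sits, count the rules kube-proxy has programmed (run on the node as root):

# Total KUBE-* rules in the NAT table
iptables-save -t nat | grep -c '^-A KUBE'

# Broken down by chain type
iptables-save -t nat | grep -c '^-A KUBE-SERVICES'
iptables-save -t nat | grep -c '^-A KUBE-SVC'
iptables-save -t nat | grep -c '^-A KUBE-SEP'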

iptables evaluates rules sequentially. When a packet arrives:

Packet arrives
    |
    v
Rule 1: Does it match? No -> next
Rule 2: Does it match? No -> next
Rule 3: Does it match? No -> next
...
Rule 15,000: Does it match? Yes -> DNAT

Every packet traverses rules until it finds a match. With 20,000 rules, that’s potentially 20,000 comparisons per packet. The first Service in the chain is fast; the last one is slow.
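
You can see the chain a packet has to walk; KUBE-SERVICES is just one long ordered list (run on a node, output truncated here):

# Show the first rules of the chain every Service packet enters
iptables -t nat -L KUBE-SERVICES -n --line-numbers | head -20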

Measured impact:

Services   Rules     Latency (p99)
100        ~700      0.5ms
1,000      ~7,000    2ms
5,000      ~35,000   8ms
10,000     ~70,000   20ms+

At scale, iptables becomes the bottleneck, not your application.

When an endpoint changes (pod dies, new pod starts), kube-proxy must update iptables. But iptables doesn’t support atomic updates of individual rules — kube-proxy rewrites large chunks of the rule set.

1. Pod dies
2. Endpoint controller updates Endpoints object
3. kube-proxy sees the change
4. kube-proxy rebuilds iptables rules
5. iptables-restore replaces rules atomically (but slowly)

With 20,000 rules, step 5 can take seconds. During this time:

  • CPU spikes on every node
  • Network connections may be briefly disrupted
  • Other iptables users (CNI, NetworkPolicy) compete for the lock

In large clusters, a rolling deployment can cause cascading iptables updates across all nodes simultaneously.
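
kube-proxy reports how long each sync takes, and on large rule sets this is the number to watch. A quick check from a node (the metrics endpoint listens on 127.0.0.1:10249 by default; metric names can vary slightly across versions):

# Rule-sync latency histogram
curl -s http://localhost:10249/metrics | grep sync_proxy_rules_duration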

iptables uses conntrack to track connection state (needed for DNAT reverse translation). The conntrack table has a default limit:

$ cat /proc/sys/net/netfilter/nf_conntrack_max
131072

131,072 connections. Sounds like a lot until you have:

  • 100 pods per node
  • Each pod has 50 concurrent connections
  • That’s 5,000 connections just from local pods

Add Service traffic, health checks, and monitoring, and remember that entries linger after connections close (TIME_WAIT and conntrack timeouts), so heavy connection churn fills the table far faster than the steady-state count suggests. When conntrack is full, new connections are dropped silently.

# Check if you're hitting limits
$ dmesg | grep conntrack
nf_conntrack: table full, dropping packet
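
A quick way to see how close a node is to the ceiling, and to raise it if needed:

# Current entries vs. the limit
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# Raise the limit at runtime (persist it via /etc/sysctl.d/ to survive reboots)
sudo sysctl -w net.netfilter.nf_conntrack_max=262144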

Kubernetes 1.11+ supports IPVS (IP Virtual Server) mode in kube-proxy. IPVS is a kernel-level load balancer that uses hash tables instead of sequential rule matching.

iptables: Linear search O(n)
    Rule 1 -> Rule 2 -> Rule 3 -> ... -> Rule n

IPVS: Hash table lookup O(1)
    Hash(ClusterIP:Port) -> Backend pool -> Select backend

IPVS stores Services in a hash table. Lookup is constant time regardless of Service count.

# Edit kube-proxy ConfigMap
kubectl edit configmap kube-proxy -n kube-system

# Change mode from "" (iptables) to "ipvs"
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "rr"  # round-robin, or "lc", "dh", "sh", etc.
# Restart kube-proxy
kubectl rollout restart daemonset kube-proxy -n kube-system
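
IPVS mode also needs the ip_vs kernel modules on every node; if they are missing, kube-proxy falls back to iptables mode. A quick check (load the scheduler modules you actually use):

# Verify the IPVS modules are loaded
lsmod | grep ip_vs

# Load them if missing
sudo modprobe -a ip_vs ip_vs_rr ip_vs_wrr ip_vs_sh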

IPVS supports multiple load balancing algorithms:

Algorithm                 Flag   Description
Round Robin               rr     Rotate through backends
Least Connections         lc     Send to backend with fewest connections
Destination Hashing       dh     Hash destination IP for sticky routing
Source Hashing            sh     Hash source IP for sticky routing
Shortest Expected Delay   sed    Minimize expected delay

# Check IPVS rules
$ ipvsadm -Ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.96.45.67:80 rr
  -> 10.244.1.5:80                Masq    1      3          0
  -> 10.244.2.8:80                Masq    1      2          0
  -> 10.244.3.2:80                Masq    1      4          0

IPVS solves the O(n) lookup problem but still has issues:

  1. Still uses iptables for some functions (SNAT, masquerade, NodePort)
  2. Still uses conntrack — same table exhaustion problems
  3. Still driven from userspace: kube-proxy watches the API server and reprograms kernel state on every change
  4. No Network Policy — IPVS is for load balancing only; you still need iptables or another solution for policies

IPVS is better than iptables mode, but it’s an incremental improvement, not a fundamental redesign.

eBPF (extended Berkeley Packet Filter) is a technology that lets you run sandboxed programs inside the Linux kernel. Instead of configuring kernel behavior through static rules (iptables), you inject custom code that the kernel executes.

Traditional approach (iptables):

Userspace: kube-proxy watches API, generates rules
           |
           v
Kernel:    iptables netfilter framework
           - Generic rule matching engine
           - Not optimized for Kubernetes use case

eBPF approach (Cilium):

Userspace: Cilium agent watches API, compiles eBPF programs
           |
           v
Kernel:    Custom eBPF programs attached to network hooks
           - Purpose-built for Kubernetes
           - Hash tables, direct routing, no rule chains

An eBPF program is C code compiled to bytecode that the kernel verifies and JIT-compiles:

// Simplified example: redirect packets to a different destination
SEC("sk_lookup")
int service_lookup(struct bpf_sk_lookup *ctx) {
    // Look up Service in eBPF map (hash table)
    struct service_key key = {
        .ip = ctx->local_ip4,
        .port = ctx->local_port,
    };
    
    struct service_value *svc = bpf_map_lookup_elem(&services_map, &key);
    if (!svc)
        return SK_PASS;  // Not a Service, let it through
    
    // Select backend (load balancing)
    struct backend *backend = select_backend(svc);
    
    // Redirect to backend socket directly
    return bpf_sk_assign(ctx, backend->socket, 0);
}
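
For orientation only: a complete program of this kind is typically compiled with clang and loaded with bpftool (generic eBPF tooling on a recent kernel, not Cilium’s actual build and load pipeline; the file name is illustrative):

# Compile C to eBPF bytecode
clang -O2 -g -target bpf -c service_lookup.c -o service_lookup.o

# Load and pin the program, then confirm it's in the kernel
sudo bpftool prog load service_lookup.o /sys/fs/bpf/service_lookup
sudo bpftool prog list | grep service_lookup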

Key advantages:

  1. Hash table lookups: O(1) Service resolution via eBPF maps
  2. No context switches: Code runs in kernel, no userspace round-trips
  3. Socket-level routing: Can intercept at socket connect(), before any packet is generated
  4. Programmable: Can implement any logic, not limited to predefined rule types

eBPF programs attach to specific kernel hook points:

Application
    |
    | connect() syscall
    v
+-------------------+
| cgroup/connect4   |  <-- eBPF: Intercept before socket connects
+-------------------+
    |
    v
+-------------------+
| Socket layer      |
+-------------------+
    |
    | Packet created
    v
+-------------------+
| tc ingress/egress |  <-- eBPF: Modify packets in traffic control
+-------------------+
    |
    v
+-------------------+
| XDP (driver)      |  <-- eBPF: Earliest possible hook, in NIC driver
+-------------------+
    |
    v
  Network

Cilium uses multiple hooks:

  • cgroup hooks: Intercept socket operations (connect, bind, sendmsg)
  • tc hooks: Process packets after they’re created
  • XDP: Ultra-fast packet processing at the driver level
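
On a node running Cilium, you can see these attachments with bpftool (shipped in most distros’ linux-tools packages; output abbreviated):

# tc and XDP programs attached to network interfaces
sudo bpftool net show

# Programs attached to cgroup hooks (connect4, sendmsg4, ...)
sudo bpftool cgroup tree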

Socket-level load balancing is Cilium’s killer feature. Traditional kube-proxy works at the packet level:

kube-proxy (packet-level DNAT):

App connects to 10.96.45.67:80 (ClusterIP)
    |
    v
Packet created: src=10.244.1.10, dst=10.96.45.67
    |
    v
iptables DNAT: rewrite dst to 10.244.2.8 (backend)
    |
    v
Packet sent: src=10.244.1.10, dst=10.244.2.8
    |
    v
Response requires conntrack to reverse the DNAT

Cilium with socket-level LB:

Cilium (socket-level):

App calls connect(10.96.45.67:80)
    |
    v
eBPF intercepts connect() syscall
    |
    v
Looks up Service, selects backend 10.244.2.8
    |
    v
Rewrites socket destination to 10.244.2.8
    |
    v
Packet created: src=10.244.1.10, dst=10.244.2.8 (already correct!)
    |
    v
No DNAT needed, no conntrack entry needed for Service

Benefits:

  • No conntrack entries for Service traffic (reduces table pressure)
  • Lower latency (no packet rewriting in the data path)
  • Works with any protocol (not just TCP/UDP)
  • Survives backend changes (socket already connected to real backend)
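
One way to observe socket-level LB in action (a rough sketch: <client-pod> is a placeholder, 10.96.45.67 is the example ClusterIP from above, and the pod image must ship nc and ss):

kubectl exec <client-pod> -- sh -c 'sleep 5 | nc 10.96.45.67 80 & sleep 1; ss -tn'
# With socket LB, the established socket's peer is a backend pod IP
# (e.g. 10.244.2.8:80) rather than the ClusterIP, because the translation
# happened inside connect(), before any packet left the pod.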

Cilium can fully replace kube-proxy, handling all Service types: ClusterIP, NodePort, LoadBalancer, and ExternalName.

+------------------+
|   Cilium Agent   |  Runs on every node as DaemonSet
+--------+---------+
         |
         | Watches K8s API (Services, Endpoints, Pods)
         | Compiles eBPF programs
         | Loads programs into kernel
         v
+------------------+
|  eBPF Maps       |  Hash tables in kernel memory
|  - Services      |  ClusterIP -> backend list
|  - Backends      |  Backend ID -> Pod IP:Port
|  - Connections   |  Connection tracking
+------------------+
         |
         | eBPF programs query maps
         v
+------------------+
|  eBPF Programs   |  Attached to cgroup, tc, XDP hooks
|  - Socket LB     |  Intercept connect()
|  - Packet LB     |  DNAT for NodePort, external
|  - Policy        |  Network Policy enforcement
+------------------+
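
You can list these maps on any node (names and layout vary by Cilium version):

# Show the eBPF maps the agent maintains (services, backends, conntrack, ...)
kubectl exec -n kube-system ds/cilium -- cilium map list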

Prerequisites:

  • Linux kernel 4.19+ (5.4+ recommended for full features)
  • Direct routing or tunnel mode configured
  • API server address accessible from nodes
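
A quick pre-flight check on a node (the config path is the common one; adjust for your distro):

# Kernel version: 4.19+ required, 5.4+ recommended
uname -r

# eBPF compiled in, and cgroup v2 mounted (needed for socket LB)
grep CONFIG_BPF= /boot/config-$(uname -r)
mount | grep cgroup2
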
# Install Cilium CLI
curl -L --remote-name-all https://github.com/cilium/cilium-cli/releases/latest/download/cilium-linux-amd64.tar.gz
tar xzvf cilium-linux-amd64.tar.gz
sudo mv cilium /usr/local/bin/

# Install Cilium with kube-proxy replacement
cilium install \
  --set kubeProxyReplacement=true \
  --set k8sServiceHost=${API_SERVER_IP} \
  --set k8sServicePort=${API_SERVER_PORT}

Or with Helm:

helm repo add cilium https://helm.cilium.io/
helm install cilium cilium/cilium \
  --namespace kube-system \
  --set kubeProxyReplacement=true \
  --set k8sServiceHost=${API_SERVER_IP} \
  --set k8sServicePort=${API_SERVER_PORT} \
  --set socketLB.enabled=true \
  --set bpf.masquerade=true

Important: You must provide k8sServiceHost and k8sServicePort because Cilium needs to connect to the API server without relying on the kubernetes Service (which would require kube-proxy to work).
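
If you are unsure of those values, they can usually be pulled from your kubeconfig (assuming it points at the real API server with an explicit port, not a local proxy):

API_SERVER=$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')
API_SERVER_IP=$(echo "$API_SERVER" | sed -E 's#^https?://##; s#:[0-9]+$##')
API_SERVER_PORT=$(echo "$API_SERVER" | sed -E 's#.*:([0-9]+)$#\1#')
echo "$API_SERVER_IP $API_SERVER_PORT"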

Once Cilium is running and handling Services:

# Verify Cilium is handling Services
cilium status
kubectl exec -n kube-system ds/cilium -- cilium service list

# Back up the kube-proxy manifests (for rollback), then delete kube-proxy
kubectl -n kube-system get ds kube-proxy -o yaml > kube-proxy-ds-backup.yaml
kubectl -n kube-system get cm kube-proxy -o yaml > kube-proxy-cm-backup.yaml
kubectl -n kube-system delete ds kube-proxy
kubectl -n kube-system delete cm kube-proxy

# Clean up iptables rules left by kube-proxy (run on each node)
iptables-save | grep -v KUBE | iptables-restore

# Check Cilium's view of Services
kubectl exec -n kube-system ds/cilium -- cilium service list

ID   Frontend            Service Type   Backend
1    10.96.0.1:443       ClusterIP      10.0.0.5:6443
2    10.96.0.10:53       ClusterIP      10.244.0.15:53, 10.244.0.16:53
3    10.96.45.67:80      ClusterIP      10.244.1.5:80, 10.244.2.8:80

# Check eBPF maps
kubectl exec -n kube-system ds/cilium -- cilium bpf lb list

SERVICE ADDRESS     BACKEND ADDRESS
10.96.0.1:443       10.0.0.5:6443 (1)
10.96.0.10:53       10.244.0.15:53 (1) 10.244.0.16:53 (2)
10.96.45.67:80      10.244.1.5:80 (1) 10.244.2.8:80 (2)

Cilium supports Direct Server Return (DSR) for external traffic, where response packets go directly from the backend to the client, bypassing the original ingress node:

Without DSR:
Client -> Node1 (NodePort) -> Pod on Node2 -> Node1 -> Client
                                              ^
                                              Response goes back through Node1

With DSR:
Client -> Node1 (NodePort) -> Pod on Node2 -> Client
                                              ^
                                              Response goes directly to client

Enable DSR:

# Helm values
loadBalancer:
  mode: dsr

DSR reduces latency and load on the ingress node, but it requires network support: the backend node replies using the original frontend address (the NodePort or LoadBalancer IP) as the source, so the underlying network must not drop those packets as spoofed.

Cilium supports Maglev consistent hashing, which provides:

  • Better distribution than random selection
  • Connection affinity survives backend changes (mostly)
  • Used by Google for their production load balancers

Enable Maglev:

# Helm values
loadBalancer:
  algorithm: maglev

Benchmarks comparing kube-proxy (iptables), kube-proxy (IPVS), and Cilium tell a consistent story. Service routing latency as the Service count grows:

Implementation   100 Services   1,000 Services   10,000 Services
iptables         0.3ms          1.5ms            12ms
IPVS             0.1ms          0.1ms            0.15ms
Cilium eBPF      0.05ms         0.05ms           0.05ms

eBPF is constant time regardless of Service count.

Testing with 1,000 Services, HTTP workload:

Throughput and CPU:

Implementation   RPS       CPU Usage
iptables         45,000    80%
IPVS             120,000   60%
Cilium eBPF      180,000   40%

Request latency:

Implementation       p50     p99
iptables             0.8ms   5ms
IPVS                 0.3ms   1.2ms
Cilium (socket LB)   0.1ms   0.3ms

Socket-level LB eliminates packet-level DNAT overhead entirely.

Memory overhead:

Implementation   Per Service     10,000 Services
iptables         ~2KB of rules   ~20MB of iptables rules
IPVS             ~0.5KB          ~5MB
Cilium eBPF      ~0.2KB          ~2MB of eBPF maps

Replacing kube-proxy in production requires care. Here’s a safe migration path:

# Install Cilium without kube-proxy replacement
cilium install --set kubeProxyReplacement=false

Cilium handles Network Policy and pod networking. kube-proxy still handles Services. Verify everything works.
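
The Cilium CLI's built-in checks are handy for the "verify everything works" step:

# Wait until the agent is healthy on every node
cilium status --wait

# End-to-end connectivity checks (deploys test workloads into a cilium-test namespace)
cilium connectivity test

Once that looks clean, enable full replacement in a staging cluster: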

# In staging cluster
cilium install \
  --set kubeProxyReplacement=true \
  --set k8sServiceHost=${API_SERVER_IP} \
  --set k8sServicePort=${API_SERVER_PORT}

# Remove kube-proxy in staging
kubectl -n kube-system delete ds kube-proxy

Test all Service types:

  • ClusterIP (internal services)
  • NodePort (external access)
  • LoadBalancer (cloud LB integration)
  • ExternalName (DNS aliases)
  • Headless Services (direct pod access)
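
A minimal ClusterIP/NodePort smoke test (throwaway resources; names, images, and the node address are placeholders):

kubectl create deployment echo --image=nginx --replicas=2
kubectl expose deployment echo --port=80 --type=NodePort

# ClusterIP path, from inside the cluster
kubectl run client --rm -it --restart=Never --image=curlimages/curl \
  --command -- curl -s http://echo.default.svc.cluster.local

# NodePort path, from outside the cluster
kubectl get svc echo -o jsonpath='{.spec.ports[0].nodePort}'
curl -s http://<node-ip>:<node-port>

# Clean up
kubectl delete svc/echo deployment/echo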

When staging checks out, move to production. There are two approaches.

Option A: Rolling migration (safer)

# 1. Update Cilium to enable kube-proxy replacement
helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --reuse-values \
  --set kubeProxyReplacement=true \
  --set k8sServiceHost=${API_SERVER_IP} \
  --set k8sServicePort=${API_SERVER_PORT}

# 2. Wait for Cilium to restart on all nodes
kubectl rollout status ds/cilium -n kube-system

# 3. Verify Services work
kubectl exec -n kube-system ds/cilium -- cilium service list

# 4. Remove kube-proxy
kubectl -n kube-system delete ds kube-proxy

Option B: New cluster (cleanest)

Provision new cluster without kube-proxy from the start:

# kubeadm example
kubeadm init --skip-phases=addon/kube-proxy

Then install Cilium with kube-proxy replacement enabled.

If things go wrong:

# Re-deploy kube-proxy from the backups taken before removal
kubectl apply -f kube-proxy-cm-backup.yaml
kubectl apply -f kube-proxy-ds-backup.yaml
# (on kubeadm clusters, "kubeadm init phase addon kube-proxy" can also recreate it)

# Disable Cilium's kube-proxy replacement
helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --reuse-values \
  --set kubeProxyReplacement=false
Useful commands for inspecting how Cilium handles Services:

# List all Services as Cilium sees them
kubectl exec -n kube-system ds/cilium -- cilium service list

# Get details for a specific Service
kubectl exec -n kube-system ds/cilium -- cilium service get <service-id>

# List load balancer entries
kubectl exec -n kube-system ds/cilium -- cilium bpf lb list

# Check connection tracking
kubectl exec -n kube-system ds/cilium -- cilium bpf ct list global

# Watch traffic in real-time
kubectl exec -n kube-system ds/cilium -- cilium monitor

# Filter events related to a specific endpoint (takes Cilium endpoint IDs,
# from "cilium endpoint list")
kubectl exec -n kube-system ds/cilium -- cilium monitor --related-to <endpoint-id>

Service not working after migration:

# Check if Service is in Cilium's map
cilium service list | grep <cluster-ip>

# If missing, check Cilium agent logs
kubectl logs -n kube-system -l k8s-app=cilium | grep <service-name>

Socket LB not working:

# Verify cgroup eBPF programs are attached
kubectl exec -n kube-system ds/cilium -- cilium bpf cgroup list

# Check if pods are in the Cilium-managed cgroup
# (requires cgroupv2)
mount | grep cgroup2

NodePort not accessible:

# Check if NodePort is configured
cilium service list | grep NodePort

# Verify XDP or tc programs are attached to host interfaces
kubectl exec -n kube-system ds/cilium -- cilium bpf prog list

eBPF isn’t always the answer. Keep kube-proxy if:

  1. Old kernels: eBPF features require kernel 4.19+; full features need 5.4+
  2. Small clusters: Under 500 Services, iptables overhead is negligible
  3. Simplicity: kube-proxy is battle-tested, well-documented, and “just works”
  4. Compliance: Some environments require known, auditable networking (iptables rules are more readable than eBPF bytecode)
  5. Windows nodes: eBPF is Linux-only; Windows nodes need kube-proxy

kube-proxy served Kubernetes well, but its iptables-based design hits fundamental scaling limits:

Problem        kube-proxy                  Cilium eBPF
Rule scaling   O(n) linear search          O(1) hash lookup
Rule updates   Full table rewrite          Incremental map updates
conntrack      Required for all Services   Only for external traffic
CPU overhead   High at scale               Minimal
Latency        Grows with Services         Constant

The migration path is well-established:

  1. Install Cilium alongside kube-proxy
  2. Test replacement in staging
  3. Enable replacement in production
  4. Remove kube-proxy

For clusters beyond a few hundred Services, or where network latency matters, eBPF-based service routing isn’t just faster — it’s a fundamentally better architecture.