Beyond kube-proxy: eBPF Service Routing in Kubernetes


You’ve got 2,000 Services in your cluster. Every node has 20,000+ iptables rules. Pod startup takes 5 seconds just for iptables programming. Network policy updates take minutes to propagate. Welcome to the limits of kube-proxy.

This post explains why kube-proxy struggles at scale, how eBPF fundamentally changes the game, and how to migrate to Cilium without breaking production.

kube-proxy runs on every node and implements Kubernetes Services. When you create a Service with a ClusterIP, kube-proxy makes that virtual IP actually work.

By default, kube-proxy uses iptables. For each Service, it creates a chain of rules:

# Simplified view of what kube-proxy creates for one Service
-A KUBE-SERVICES -d 10.96.45.67/32 -p tcp --dport 80 -j KUBE-SVC-XXXX

-A KUBE-SVC-XXXX -m statistic --mode random --probability 0.33 -j KUBE-SEP-AAA
-A KUBE-SVC-XXXX -m statistic --mode random --probability 0.50 -j KUBE-SEP-BBB
-A KUBE-SVC-XXXX -j KUBE-SEP-CCC

-A KUBE-SEP-AAA -p tcp -j DNAT --to-destination 10.244.1.5:80
-A KUBE-SEP-BBB -p tcp -j DNAT --to-destination 10.244.2.8:80
-A KUBE-SEP-CCC -p tcp -j DNAT --to-destination 10.244.3.2:80

This works. But count the rules:

Per Service:
  1 rule in KUBE-SERVICES (match ClusterIP)
  N rules in KUBE-SVC-* (one per endpoint, for load balancing)
  N rules in KUBE-SEP-* (one per endpoint, for DNAT)

Total: 1 + 2N rules per Service

With 2,000 Services averaging 3 endpoints each:

2,000 × (1 + 2×3) = 14,000 rules minimum

Add NodePort rules, external IPs, and load balancer rules — you’re easily at 20,000+ rules per node.
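
To see where a given node sits, count the rules kube-proxy has programmed (run on the node as root):

# Total KUBE-* rules in the NAT table
iptables-save -t nat | grep -c '^-A KUBE'

# Broken down by chain type
iptables-save -t nat | grep -c '^-A KUBE-SERVICES'
iptables-save -t nat | grep -c '^-A KUBE-SVC'
iptables-save -t nat | grep -c '^-A KUBE-SEP'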

iptables evaluates rules sequentially. When a packet arrives:

Packet arrives
    |
    v
Rule 1: Does it match? No -> next
Rule 2: Does it match? No -> next
Rule 3: Does it match? No -> next
...
Rule 15,000: Does it match? Yes -> DNAT

Every packet traverses rules until it finds a match. With 20,000 rules, that’s potentially 20,000 comparisons per packet. The first Service in the chain is fast; the last one is slow.
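
You can see the chain a packet has to walk; KUBE-SERVICES is just one long ordered list (run on a node, output truncated here):

# Show the first rules of the chain every Service packet enters
iptables -t nat -L KUBE-SERVICES -n --line-numbers | head -20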

Measured impact:

Services   Rules     Latency (p99)
100        ~700      0.5ms
1,000      ~7,000    2ms
5,000      ~35,000   8ms
10,000     ~70,000   20ms+

At scale, iptables becomes the bottleneck, not your application.

When an endpoint changes (pod dies, new pod starts), kube-proxy must update iptables. But iptables doesn’t support atomic updates of individual rules — kube-proxy rewrites large chunks of the rule set.

1. Pod dies
2. Endpoint controller updates Endpoints object
3. kube-proxy sees the change
4. kube-proxy rebuilds iptables rules
5. iptables-restore replaces rules atomically (but slowly)

With 20,000 rules, step 5 can take seconds. During this time:

  • CPU spikes on every node
  • Network connections may be briefly disrupted
  • Other iptables users (CNI, NetworkPolicy) compete for the lock

In large clusters, a rolling deployment can cause cascading iptables updates across all nodes simultaneously.
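
kube-proxy reports how long each sync takes, and on large rule sets this is the number to watch. A quick check from a node (the metrics endpoint listens on 127.0.0.1:10249 by default; metric names can vary slightly across versions):

# Rule-sync latency histogram
curl -s http://localhost:10249/metrics | grep sync_proxy_rules_duration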

iptables uses conntrack to track connection state (needed for DNAT reverse translation). The conntrack table has a default limit:

$ cat /proc/sys/net/netfilter/nf_conntrack_max
131072

131,072 connections. Sounds like a lot until you have:

  • 100 pods per node
  • Each pod has 50 concurrent connections
  • That’s 5,000 connections just from local pods

Add Service traffic, health checks, and monitoring, and remember that entries linger after connections close (TIME_WAIT and conntrack timeouts), so heavy connection churn fills the table far faster than the steady-state count suggests. When conntrack is full, new connections are dropped silently.

# Check if you're hitting limits
$ dmesg | grep conntrack
nf_conntrack: table full, dropping packet
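
A quick way to see how close a node is to the ceiling, and to raise it if needed:

# Current entries vs. the limit
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# Raise the limit at runtime (persist it via /etc/sysctl.d/ to survive reboots)
sudo sysctl -w net.netfilter.nf_conntrack_max=262144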

Kubernetes 1.11+ supports IPVS (IP Virtual Server) mode in kube-proxy. IPVS is a kernel-level load balancer that uses hash tables instead of sequential rule matching.

iptables: Linear search O(n)
    Rule 1 -> Rule 2 -> Rule 3 -> ... -> Rule n

IPVS: Hash table lookup O(1)
    Hash(ClusterIP:Port) -> Backend pool -> Select backend

IPVS stores Services in a hash table. Lookup is constant time regardless of Service count.

# Edit kube-proxy ConfigMap
kubectl edit configmap kube-proxy -n kube-system

# Change mode from "" (iptables) to "ipvs"
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "rr"  # round-robin, or "lc", "dh", "sh", etc.
# Restart kube-proxy
kubectl rollout restart daemonset kube-proxy -n kube-system
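
IPVS mode also needs the ip_vs kernel modules on every node; if they are missing, kube-proxy falls back to iptables mode. A quick check (load the scheduler modules you actually use):

# Verify the IPVS modules are loaded
lsmod | grep ip_vs

# Load them if missing
sudo modprobe -a ip_vs ip_vs_rr ip_vs_wrr ip_vs_sh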

IPVS supports multiple load balancing algorithms:

Algorithm                 Flag   Description
Round Robin               rr     Rotate through backends
Least Connections         lc     Send to backend with fewest connections
Destination Hashing       dh     Hash destination IP for sticky routing
Source Hashing            sh     Hash source IP for sticky routing
Shortest Expected Delay   sed    Minimize expected delay

# Check IPVS rules
$ ipvsadm -Ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.96.45.67:80 rr
  -> 10.244.1.5:80                Masq    1      3          0
  -> 10.244.2.8:80                Masq    1      2          0
  -> 10.244.3.2:80                Masq    1      4          0

IPVS solves the O(n) lookup problem but still has issues:

  1. Still uses iptables for some functions (SNAT, masquerade, NodePort)
  2. Still uses conntrack — same table exhaustion problems
  3. Still driven from userspace: kube-proxy watches the API server and reprograms kernel state on every change
  4. No Network Policy — IPVS is for load balancing only; you still need iptables or another solution for policies

IPVS is better than iptables mode, but it’s an incremental improvement, not a fundamental redesign.

eBPF (extended Berkeley Packet Filter) is a technology that lets you run sandboxed programs inside the Linux kernel. Instead of configuring kernel behavior through static rules (iptables), you inject custom code that the kernel executes.

Traditional approach (iptables):

Userspace: kube-proxy watches API, generates rules
           |
           v
Kernel:    iptables netfilter framework
           - Generic rule matching engine
           - Not optimized for Kubernetes use case

eBPF approach (Cilium):

Userspace: Cilium agent watches API, compiles eBPF programs
           |
           v
Kernel:    Custom eBPF programs attached to network hooks
           - Purpose-built for Kubernetes
           - Hash tables, direct routing, no rule chains

An eBPF program is C code compiled to bytecode that the kernel verifies and JIT-compiles:

// Simplified example: redirect packets to a different destination
SEC("sk_lookup")
int service_lookup(struct bpf_sk_lookup *ctx) {
    // Look up Service in eBPF map (hash table)
    struct service_key key = {
        .ip = ctx->local_ip4,
        .port = ctx->local_port,
    };
    
    struct service_value *svc = bpf_map_lookup_elem(&services_map, &key);
    if (!svc)
        return SK_PASS;  // Not a Service, let it through
    
    // Select backend (load balancing)
    struct backend *backend = select_backend(svc);
    
    // Redirect to backend socket directly
    return bpf_sk_assign(ctx, backend->socket, 0);
}
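
For orientation only: a complete program of this kind is typically compiled with clang and loaded with bpftool (generic eBPF tooling on a recent kernel, not Cilium’s actual build and load pipeline; the file name is illustrative):

# Compile C to eBPF bytecode
clang -O2 -g -target bpf -c service_lookup.c -o service_lookup.o

# Load and pin the program, then confirm it's in the kernel
sudo bpftool prog load service_lookup.o /sys/fs/bpf/service_lookup
sudo bpftool prog list | grep service_lookup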

Key advantages:

  1. Hash table lookups: O(1) Service resolution via eBPF maps
  2. No context switches: Code runs in kernel, no userspace round-trips
  3. Socket-level routing: Can intercept at socket connect(), before any packet is generated
  4. Programmable: Can implement any logic, not limited to predefined rule types

eBPF programs attach to specific kernel hook points:

Application
    |
    | connect() syscall
    v
+-------------------+
| cgroup/connect4   |  <-- eBPF: Intercept before socket connects
+-------------------+
    |
    v
+-------------------+
| Socket layer      |
+-------------------+
    |
    | Packet created
    v
+-------------------+
| tc ingress/egress |  <-- eBPF: Modify packets in traffic control
+-------------------+
    |
    v
+-------------------+
| XDP (driver)      |  <-- eBPF: Earliest possible hook, in NIC driver
+-------------------+
    |
    v
  Network

Cilium uses multiple hooks:

  • cgroup hooks: Intercept socket operations (connect, bind, sendmsg)
  • tc hooks: Process packets after they’re created
  • XDP: Ultra-fast packet processing at the driver level
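
On a node running Cilium, you can see these attachments with bpftool (shipped in most distros’ linux-tools packages; output abbreviated):

# tc and XDP programs attached to network interfaces
sudo bpftool net show

# Programs attached to cgroup hooks (connect4, sendmsg4, ...)
sudo bpftool cgroup tree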

Socket-level load balancing is Cilium’s killer feature. Traditional kube-proxy works at the packet level:

kube-proxy (packet-level DNAT):

App connects to 10.96.45.67:80 (ClusterIP)
    |
    v
Packet created: src=10.244.1.10, dst=10.96.45.67
    |
    v
iptables DNAT: rewrite dst to 10.244.2.8 (backend)
    |
    v
Packet sent: src=10.244.1.10, dst=10.244.2.8
    |
    v
Response requires conntrack to reverse the DNAT

Cilium with socket-level LB:

Cilium (socket-level):

App calls connect(10.96.45.67:80)
    |
    v
eBPF intercepts connect() syscall
    |
    v
Looks up Service, selects backend 10.244.2.8
    |
    v
Rewrites socket destination to 10.244.2.8
    |
    v
Packet created: src=10.244.1.10, dst=10.244.2.8 (already correct!)
    |
    v
No DNAT needed, no conntrack entry needed for Service

Benefits:

  • No conntrack entries for Service traffic (reduces table pressure)
  • Lower latency (no packet rewriting in the data path)
  • Works with any protocol (not just TCP/UDP)
  • Survives backend changes (socket already connected to real backend)
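
One way to observe socket-level LB in action (a rough sketch: <client-pod> is a placeholder, 10.96.45.67 is the example ClusterIP from above, and the pod image must ship nc and ss):

kubectl exec <client-pod> -- sh -c 'sleep 5 | nc 10.96.45.67 80 & sleep 1; ss -tn'
# With socket LB, the established socket's peer is a backend pod IP
# (e.g. 10.244.2.8:80) rather than the ClusterIP, because the translation
# happened inside connect(), before any packet left the pod.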

Cilium can fully replace kube-proxy, handling all Service types: ClusterIP, NodePort, LoadBalancer, and ExternalName.

+------------------+
|   Cilium Agent   |  Runs on every node as DaemonSet
+--------+---------+
         |
         | Watches K8s API (Services, Endpoints, Pods)
         | Compiles eBPF programs
         | Loads programs into kernel
         v
+------------------+
|  eBPF Maps       |  Hash tables in kernel memory
|  - Services      |  ClusterIP -> backend list
|  - Backends      |  Backend ID -> Pod IP:Port
|  - Connections   |  Connection tracking
+------------------+
         |
         | eBPF programs query maps
         v
+------------------+
|  eBPF Programs   |  Attached to cgroup, tc, XDP hooks
|  - Socket LB     |  Intercept connect()
|  - Packet LB     |  DNAT for NodePort, external
|  - Policy        |  Network Policy enforcement
+------------------+
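
You can list these maps on any node (names and layout vary by Cilium version):

# Show the eBPF maps the agent maintains (services, backends, conntrack, ...)
kubectl exec -n kube-system ds/cilium -- cilium map list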

Prerequisites:

  • Linux kernel 4.19+ (5.4+ recommended for full features)
  • Direct routing or tunnel mode configured
  • API server address accessible from nodes
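
A quick pre-flight check on a node (the config path is the common one; adjust for your distro):

# Kernel version: 4.19+ required, 5.4+ recommended
uname -r

# eBPF compiled in, and cgroup v2 mounted (needed for socket LB)
grep CONFIG_BPF= /boot/config-$(uname -r)
mount | grep cgroup2
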
# Install Cilium CLI
curl -L --remote-name-all https://github.com/cilium/cilium-cli/releases/latest/download/cilium-linux-amd64.tar.gz
tar xzvf cilium-linux-amd64.tar.gz
sudo mv cilium /usr/local/bin/

# Install Cilium with kube-proxy replacement
cilium install \
  --set kubeProxyReplacement=true \
  --set k8sServiceHost=${API_SERVER_IP} \
  --set k8sServicePort=${API_SERVER_PORT}

Or with Helm:

helm repo add cilium https://helm.cilium.io/
helm install cilium cilium/cilium \
  --namespace kube-system \
  --set kubeProxyReplacement=true \
  --set k8sServiceHost=${API_SERVER_IP} \
  --set k8sServicePort=${API_SERVER_PORT} \
  --set socketLB.enabled=true \
  --set bpf.masquerade=true

Important: You must provide k8sServiceHost and k8sServicePort because Cilium needs to connect to the API server without relying on the kubernetes Service (which would require kube-proxy to work).
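
If you are unsure of those values, they can usually be pulled from your kubeconfig (assuming it points at the real API server with an explicit port, not a local proxy):

API_SERVER=$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')
API_SERVER_IP=$(echo "$API_SERVER" | sed -E 's#^https?://##; s#:[0-9]+$##')
API_SERVER_PORT=$(echo "$API_SERVER" | sed -E 's#.*:([0-9]+)$#\1#')
echo "$API_SERVER_IP $API_SERVER_PORT"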

Once Cilium is running and handling Services:

# Verify Cilium is handling Services
cilium status
kubectl exec -n kube-system ds/cilium -- cilium service list

# Back up the kube-proxy manifests (for rollback), then delete kube-proxy
kubectl -n kube-system get ds kube-proxy -o yaml > kube-proxy-ds-backup.yaml
kubectl -n kube-system get cm kube-proxy -o yaml > kube-proxy-cm-backup.yaml
kubectl -n kube-system delete ds kube-proxy
kubectl -n kube-system delete cm kube-proxy

# Clean up iptables rules left by kube-proxy (run on each node)
iptables-save | grep -v KUBE | iptables-restore

# Check Cilium's view of Services
kubectl exec -n kube-system ds/cilium -- cilium service list

ID   Frontend            Service Type   Backend
1    10.96.0.1:443       ClusterIP      10.0.0.5:6443
2    10.96.0.10:53       ClusterIP      10.244.0.15:53, 10.244.0.16:53
3    10.96.45.67:80      ClusterIP      10.244.1.5:80, 10.244.2.8:80

# Check eBPF maps
kubectl exec -n kube-system ds/cilium -- cilium bpf lb list

SERVICE ADDRESS     BACKEND ADDRESS
10.96.0.1:443       10.0.0.5:6443 (1)
10.96.0.10:53       10.244.0.15:53 (1) 10.244.0.16:53 (2)
10.96.45.67:80      10.244.1.5:80 (1) 10.244.2.8:80 (2)

Cilium supports Direct Server Return (DSR) for external traffic, where response packets go directly from the backend to the client, bypassing the original ingress node:

Without DSR:
Client -> Node1 (NodePort) -> Pod on Node2 -> Node1 -> Client
                                              ^
                                              Response goes back through Node1

With DSR:
Client -> Node1 (NodePort) -> Pod on Node2 -> Client
                                              ^
                                              Response goes directly to client

Enable DSR:

# Helm values
loadBalancer:
  mode: dsr

DSR reduces latency and load on the ingress node, but it requires network support: the backend node replies using the original frontend address (the NodePort or LoadBalancer IP) as the source, so the underlying network must not drop those packets as spoofed.

Cilium supports Maglev consistent hashing, which provides:

  • Better distribution than random selection
  • Connection affinity survives backend changes (mostly)
  • Used by Google for their production load balancers

Enable Maglev:

# Helm values
loadBalancer:
  algorithm: maglev

Benchmarks comparing kube-proxy (iptables), kube-proxy (IPVS), and Cilium tell a consistent story. Service routing latency as the Service count grows:

Implementation   100 Services   1,000 Services   10,000 Services
iptables         0.3ms          1.5ms            12ms
IPVS             0.1ms          0.1ms            0.15ms
Cilium eBPF      0.05ms         0.05ms           0.05ms

eBPF is constant time regardless of Service count.

Testing with 1,000 Services, HTTP workload:

Throughput and CPU:

Implementation   RPS       CPU Usage
iptables         45,000    80%
IPVS             120,000   60%
Cilium eBPF      180,000   40%

Request latency:

Implementation       p50     p99
iptables             0.8ms   5ms
IPVS                 0.3ms   1.2ms
Cilium (socket LB)   0.1ms   0.3ms

Socket-level LB eliminates packet-level DNAT overhead entirely.

Memory overhead:

Implementation   Per Service     10,000 Services
iptables         ~2KB of rules   ~20MB of iptables rules
IPVS             ~0.5KB          ~5MB
Cilium eBPF      ~0.2KB          ~2MB of eBPF maps

Replacing kube-proxy in production requires care. Here’s a safe migration path:

# Install Cilium without kube-proxy replacement
cilium install --set kubeProxyReplacement=false

Cilium handles Network Policy and pod networking. kube-proxy still handles Services. Verify everything works.
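
The Cilium CLI's built-in checks are handy for the "verify everything works" step:

# Wait until the agent is healthy on every node
cilium status --wait

# End-to-end connectivity checks (deploys test workloads into a cilium-test namespace)
cilium connectivity test

Once that looks clean, enable full replacement in a staging cluster: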

# In staging cluster
cilium install \
  --set kubeProxyReplacement=true \
  --set k8sServiceHost=${API_SERVER_IP} \
  --set k8sServicePort=${API_SERVER_PORT}

# Remove kube-proxy in staging
kubectl -n kube-system delete ds kube-proxy

Test all Service types:

  • ClusterIP (internal services)
  • NodePort (external access)
  • LoadBalancer (cloud LB integration)
  • ExternalName (DNS aliases)
  • Headless Services (direct pod access)
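
A minimal ClusterIP/NodePort smoke test (throwaway resources; names, images, and the node address are placeholders):

kubectl create deployment echo --image=nginx --replicas=2
kubectl expose deployment echo --port=80 --type=NodePort

# ClusterIP path, from inside the cluster
kubectl run client --rm -it --restart=Never --image=curlimages/curl \
  --command -- curl -s http://echo.default.svc.cluster.local

# NodePort path, from outside the cluster
kubectl get svc echo -o jsonpath='{.spec.ports[0].nodePort}'
curl -s http://<node-ip>:<node-port>

# Clean up
kubectl delete svc/echo deployment/echo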

When staging checks out, move to production. There are two approaches.

Option A: Rolling migration (safer)

# 1. Update Cilium to enable kube-proxy replacement
helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --reuse-values \
  --set kubeProxyReplacement=true \
  --set k8sServiceHost=${API_SERVER_IP} \
  --set k8sServicePort=${API_SERVER_PORT}

# 2. Wait for Cilium to restart on all nodes
kubectl rollout status ds/cilium -n kube-system

# 3. Verify Services work
kubectl exec -n kube-system ds/cilium -- cilium service list

# 4. Remove kube-proxy
kubectl -n kube-system delete ds kube-proxy

Option B: New cluster (cleanest)

Provision new cluster without kube-proxy from the start:

# kubeadm example
kubeadm init --skip-phases=addon/kube-proxy

Then install Cilium with kube-proxy replacement enabled.

If things go wrong:

# Re-deploy kube-proxy from the backups taken before removal
kubectl apply -f kube-proxy-cm-backup.yaml
kubectl apply -f kube-proxy-ds-backup.yaml
# (on kubeadm clusters, "kubeadm init phase addon kube-proxy" can also recreate it)

# Disable Cilium's kube-proxy replacement
helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --reuse-values \
  --set kubeProxyReplacement=false
Useful commands for inspecting how Cilium handles Services:

# List all Services as Cilium sees them
kubectl exec -n kube-system ds/cilium -- cilium service list

# Get details for a specific Service
kubectl exec -n kube-system ds/cilium -- cilium service get <service-id>

# List load balancer entries
kubectl exec -n kube-system ds/cilium -- cilium bpf lb list

# Check connection tracking
kubectl exec -n kube-system ds/cilium -- cilium bpf ct list global

# Watch traffic in real-time
kubectl exec -n kube-system ds/cilium -- cilium monitor

# Filter events related to a specific endpoint (takes Cilium endpoint IDs,
# from "cilium endpoint list")
kubectl exec -n kube-system ds/cilium -- cilium monitor --related-to <endpoint-id>

Service not working after migration:

# Check if Service is in Cilium's map
cilium service list | grep <cluster-ip>

# If missing, check Cilium agent logs
kubectl logs -n kube-system -l k8s-app=cilium | grep <service-name>

Socket LB not working:

# Verify cgroup eBPF programs are attached
kubectl exec -n kube-system ds/cilium -- cilium bpf cgroup list

# Check if pods are in the Cilium-managed cgroup
# (requires cgroupv2)
mount | grep cgroup2

NodePort not accessible:

# Check if NodePort is configured
cilium service list | grep NodePort

# Verify XDP or tc programs are attached to host interfaces
kubectl exec -n kube-system ds/cilium -- cilium bpf prog list

eBPF isn’t always the answer. Keep kube-proxy if:

  1. Old kernels: eBPF features require kernel 4.19+; full features need 5.4+
  2. Small clusters: Under 500 Services, iptables overhead is negligible
  3. Simplicity: kube-proxy is battle-tested, well-documented, and “just works”
  4. Compliance: Some environments require known, auditable networking (iptables rules are more readable than eBPF bytecode)
  5. Windows nodes: eBPF is Linux-only; Windows nodes need kube-proxy

kube-proxy served Kubernetes well, but its iptables-based design hits fundamental scaling limits:

Problem        kube-proxy                  Cilium eBPF
Rule scaling   O(n) linear search          O(1) hash lookup
Rule updates   Full table rewrite          Incremental map updates
conntrack      Required for all Services   Only for external traffic
CPU overhead   High at scale               Minimal
Latency        Grows with Services         Constant

The migration path is well-established:

  1. Install Cilium alongside kube-proxy
  2. Test replacement in staging
  3. Enable replacement in production
  4. Remove kube-proxy

For clusters beyond a few hundred Services, or where network latency matters, eBPF-based service routing isn’t just faster — it’s a fundamentally better architecture.