Kubernetes Networking Demystified: Tracing the Magic (and Debugging the Nightmare)


You run curl my-service:8080 from a pod, and it just works. The request reaches another pod, possibly on a different node, and you get a response. Magic.

Until it doesn’t work. Then you’re staring at iptables rules, tcpdump output, and CNI logs wondering where your packet went.

This post traces a packet through Kubernetes networking — from pod to Service to pod across nodes. Understanding this flow turns debugging from nightmare to systematic diagnosis.

Let’s make this concrete. We have:

Node 1 (192.168.1.10)
  Pod A: 10.244.1.5 (client)

Node 2 (192.168.1.11)  
  Pod B: 10.244.2.8 (server, behind Service)
  Pod C: 10.244.2.9 (server, behind Service)

Service: my-service
  ClusterIP: 10.96.45.67
  Port: 8080
  Endpoints: 10.244.2.8:8080, 10.244.2.9:8080

Pod A runs: curl my-service:8080

What actually happens? Let’s trace it.
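
If you want to pull the same picture from a live cluster, these commands show pod IPs, node placement, and the Service's endpoints (resource names here follow the example above):

kubectl get pods -o wide              # pod IPs and the nodes they run on
kubectl get service my-service        # ClusterIP and port
kubectl get endpoints my-service      # the pod IP:port pairs behind the Service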

Before crossing nodes, let’s understand the simplest case: two pods on the same node.

Each pod gets its own network namespace — an isolated network stack with its own interfaces, routes, and iptables rules. But pods need to communicate. The CNI plugin creates a virtual ethernet pair (veth) connecting each pod to the node:

+-----------------------------------------------------------+
|                          Node                             |
|                                                           |
|   +-----------+                    +-----------+          |
|   |   Pod A   |                    |   Pod B   |          |
|   |   eth0    |                    |   eth0    |          |
|   +-----+-----+                    +-----+-----+          |
|         | veth                           | veth           |
|         |                                |                |
|   +-----+--------------------------------+-----+          |
|   |             Bridge (cni0/cbr0)             |          |
|   +--------------------------------------------+          |
|                                                           |
+-----------------------------------------------------------+

The veth pair: One end (eth0) is inside the pod’s network namespace. The other end is attached to a bridge on the node. The pair acts like a virtual cable.

The bridge: A software switch (commonly named cni0, cbr0, or docker0). All pod veths connect to this bridge. Packets to other pods on the same node go through the bridge.
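
There is nothing CNI-specific about this plumbing. As a rough sketch, you could wire up a fake "pod" by hand with plain iproute2 (the namespace and interface names below are invented for illustration):

# Hand-rolled version of what a CNI plugin does on pod creation
ip netns add demo-pod                               # the "pod's" network namespace
ip link add veth-pod type veth peer name veth-host  # create the veth pair
ip link set veth-pod netns demo-pod                 # one end goes inside the namespace
ip link set veth-host master cni0                   # the other end attaches to the bridge
ip link set veth-host up
ip netns exec demo-pod ip addr add 10.244.1.99/24 dev veth-pod
ip netns exec demo-pod ip link set veth-pod up
ip netns exec demo-pod ip link set lo up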

Pod A (10.244.1.5) sends to Pod B (10.244.1.6), both on Node 1:

1. Pod A sends packet (src: 10.244.1.5, dst: 10.244.1.6)
2. Packet exits via eth0 (inside pod) -> enters veth -> arrives at bridge
3. Bridge looks up MAC for 10.244.1.6 -> forwards to Pod B's veth
4. Packet enters Pod B's eth0
5. Pod B receives packet

No iptables (for basic connectivity), no encapsulation. Just Layer 2 switching on the bridge.

# On the node, list veths
ip link show type veth

# See the bridge
ip link show type bridge
brctl show cni0

# See pod connections
bridge fdb show dev cni0

# tcpdump on the bridge
tcpdump -i cni0 -n host 10.244.1.5

Now for the interesting part. Pod A doesn’t call Pod B directly — it calls my-service:8080. The Service has a ClusterIP (10.96.45.67) that doesn’t exist on any interface. How does this work?

A ClusterIP is a virtual IP. No interface has this address. No ARP entry exists. If you try to ping it from outside the cluster, nothing responds.
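
You can verify this on any node; both commands should print nothing, because nothing owns the address:

# No interface holds the ClusterIP
ip addr | grep 10.96.45.67

# No ARP/neighbor entry exists for it either
ip neigh | grep 10.96.45.67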

Yet from inside a pod, it works. The secret: iptables rewrites the destination before the packet leaves.

kube-proxy runs on every node. It watches Services and Endpoints, then programs iptables rules that:

  1. Intercept packets destined for ClusterIPs
  2. Rewrite the destination to an actual pod IP (DNAT)
  3. Load balance across endpoints

iptables organizes rules into chains. kube-proxy creates a hierarchy:

Packet arrives (dst: 10.96.45.67:8080)
    |
    v
PREROUTING (or OUTPUT, for traffic generated in the node's own network namespace)
    |
    v
KUBE-SERVICES -- matches on ClusterIP:port
    |
    v
KUBE-SVC-ABCD1234 -- the Service's chain, randomly selects an endpoint
    |
    +--- 50% ---> KUBE-SEP-ENDPOINT1 (DNAT to 10.244.2.8)
    |
    +--- 50% ---> KUBE-SEP-ENDPOINT2 (DNAT to 10.244.2.9)

KUBE-SERVICES: The entry point. Has rules for every Service, matching on ClusterIP:port.

KUBE-SVC-*: One chain per Service. Contains probability-based jumps to endpoint chains (this is how load balancing works).

KUBE-SEP-*: One chain per endpoint (pod). Performs the actual DNAT — rewriting destination from ClusterIP to pod IP.

Let’s see what kube-proxy creates:

# Dump all iptables rules
iptables-save | grep my-service

# Or more specifically, find the Service chain
iptables -t nat -L KUBE-SERVICES -n | grep 10.96.45.67

Example output (annotated):

# Entry in KUBE-SERVICES for our Service
-A KUBE-SERVICES -d 10.96.45.67/32 -p tcp -m tcp --dport 8080 \
    -j KUBE-SVC-ABCD1234  # Jump to Service chain

# The Service chain with load balancing
-A KUBE-SVC-ABCD1234 -m statistic --mode random --probability 0.5 \
    -j KUBE-SEP-ENDPOINT1  # 50% to first endpoint
-A KUBE-SVC-ABCD1234 \
    -j KUBE-SEP-ENDPOINT2  # Remaining 50% to second endpoint

# Endpoint chains - the actual DNAT
-A KUBE-SEP-ENDPOINT1 -p tcp \
    -j DNAT --to-destination 10.244.2.8:8080
-A KUBE-SEP-ENDPOINT2 -p tcp \
    -j DNAT --to-destination 10.244.2.9:8080

The probability math: with two endpoints, the first rule matches 50% of the time and the second catches everything else (the other 50%). With three endpoints the first matches 33%, the second matches 50% of what remains (33% overall), and the last catches the rest (33%).
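
As a sketch, a Service with three endpoints might produce a chain like this (chain names invented, probabilities rounded):

-A KUBE-SVC-WXYZ5678 -m statistic --mode random --probability 0.33333 \
    -j KUBE-SEP-EP1   # 1/3 of all traffic
-A KUBE-SVC-WXYZ5678 -m statistic --mode random --probability 0.50000 \
    -j KUBE-SEP-EP2   # 1/2 of the remaining 2/3 = 1/3
-A KUBE-SVC-WXYZ5678 \
    -j KUBE-SEP-EP3   # everything left = 1/3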

Before DNAT:

src: 10.244.1.5 (Pod A)
dst: 10.96.45.67:8080 (Service ClusterIP)

After DNAT:

src: 10.244.1.5 (Pod A)
dst: 10.244.2.8:8080 (Pod B - actual endpoint)

The packet now has a real destination. It can be routed to Pod B.

One problem: Pod B’s response will have:

src: 10.244.2.8:8080 (Pod B)
dst: 10.244.1.5 (Pod A)

Pod A sent to 10.96.45.67 but receives from 10.244.2.8. Won’t it be confused?

conntrack saves us. The kernel tracks connections. When the response arrives, it reverses the DNAT:

Response packet:
  Before reverse DNAT: src=10.244.2.8, dst=10.244.1.5
  After reverse DNAT:  src=10.96.45.67, dst=10.244.1.5

Pod A sees the response coming from the ClusterIP it originally contacted. The illusion holds.

# See connection tracking entries
conntrack -L | grep 10.96.45.67
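
An entry for a DNATed connection records both directions, and you can see the rewrite in it; the output looks roughly like this (ports and timeouts will differ):

tcp 6 86392 ESTABLISHED src=10.244.1.5 dst=10.96.45.67 sport=41892 dport=8080 src=10.244.2.8 dst=10.244.1.5 sport=8080 dport=41892 [ASSURED]

The first tuple is what Pod A sent (to the ClusterIP); the second is the reply direction (from the real pod IP). The kernel uses this pair to reverse the DNAT on the way back.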

Our packet is now destined for 10.244.2.8 (Pod B on Node 2). But there’s a problem.

Pod IPs (10.244.x.x) are internal to Kubernetes. Your physical network doesn’t know how to route them:

Node 1 (192.168.1.10) wants to send to 10.244.2.8
Physical router: "10.244.2.8? Never heard of it. Drop."

Node 1’s routing table doesn’t have a route to 10.244.2.0/24. Neither does your datacenter’s router.

An overlay network encapsulates pod-to-pod packets inside node-to-node packets:

+----------------------------------------------------------+
| Original Packet                                          |
| src: 10.244.1.5   dst: 10.244.2.8   payload: HTTP GET    |
+----------------------------------------------------------+
                         |
                  VXLAN Encapsulation
                         |
                         v
+----------------------------------------------------------+
| Outer Header                                             |
| src: 192.168.1.10  dst: 192.168.1.11  proto: UDP:8472    |
+----------------------------------------------------------+
| VXLAN Header (VNI: 1)                                    |
+----------------------------------------------------------+
| Inner Packet (original)                                  |
| src: 10.244.1.5   dst: 10.244.2.8   payload: HTTP GET    |
+----------------------------------------------------------+

The physical network only sees the outer header: Node 1 sending UDP to Node 2. It routes normally. Node 2 receives, strips the outer header, and delivers the inner packet to Pod B.

VXLAN (Virtual Extensible LAN) is the most common overlay in Kubernetes (used by Flannel, Calico in VXLAN mode, etc.).
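
Under the hood this is an ordinary Linux VXLAN device. A hand-made equivalent of what Flannel sets up might look like this (VNI 1 and UDP 8472 are Flannel's defaults; the addresses and names follow the example and are illustrative):

# Create a VXLAN interface bound to the node's physical NIC
ip link add flannel.1 type vxlan id 1 dev eth0 local 192.168.1.10 dstport 8472 nolearning
ip addr add 10.244.1.0/32 dev flannel.1
ip link set flannel.1 up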

Key components on each node:

  1. VXLAN interface (flannel.1, vxlan.calico): A virtual interface that handles encap/decap
  2. FDB (Forwarding Database): Maps pod IPs/MACs to node IPs
  3. Routes: Direct pod CIDR traffic to the VXLAN interface

# See the VXLAN interface
ip -d link show flannel.1

# See the FDB entries (which node has which pods)
bridge fdb show dev flannel.1

# See routes to other pod CIDRs
ip route | grep 10.244

Example routing table on Node 1:

10.244.1.0/24 dev cni0 proto kernel scope link src 10.244.1.1  # Local pods
10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink              # Node 2's pods
10.244.3.0/24 via 10.244.3.0 dev flannel.1 onlink              # Node 3's pods

Traffic to 10.244.2.x goes through flannel.1, which encapsulates it.
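
You can also ask the kernel which path a specific pod IP would take; on Node 1 the answer should point at the VXLAN interface (output roughly like):

ip route get 10.244.2.8
# 10.244.2.8 via 10.244.2.0 dev flannel.1 src 10.244.1.0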

Full journey for Pod A (10.244.1.5, Node 1) to Pod B (10.244.2.8, Node 2):

Node 1:
  1. Pod A sends: src=10.244.1.5, dst=10.244.2.8
  2. Packet exits pod via veth -> arrives at bridge (cni0)
  3. 10.244.2.8 is not in the local pod subnet (10.244.1.0/24)
  4. Node routing table sends the packet to flannel.1 (VXLAN interface)
  5. VXLAN encapsulates:
     - Outer: src=192.168.1.10, dst=192.168.1.11, UDP:8472
     - Inner: original packet
  6. Encapsulated packet sent on physical network (eth0)

Physical Network:
  7. Packet routed from 192.168.1.10 to 192.168.1.11

Node 2:
  8. eth0 receives packet
  9. Kernel sees UDP:8472 -> hands to VXLAN interface
  10. flannel.1 decapsulates, extracts inner packet
  11. Inner packet: src=10.244.1.5, dst=10.244.2.8
  12. Routes to cni0 (bridge)
  13. Bridge forwards to Pod B's veth
  14. Pod B receives original packet

VXLAN adds ~50 bytes of overhead (outer IP + UDP + VXLAN header). If your physical MTU is 1500:

Physical MTU:   1500
VXLAN overhead:  -50
Pod MTU:        1450

If pods use MTU 1500, packets that need the full size will fail (too big after encapsulation). Symptoms:

  • Small requests work, large requests hang
  • SSH works, SCP fails
  • TCP connections stall

Check MTU configuration:

# On pod
cat /sys/class/net/eth0/mtu

# On node VXLAN interface
cat /sys/class/net/flannel.1/mtu

Most CNIs set pod MTU correctly, but misconfigurations happen.
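
A quick way to confirm an MTU problem is to send pings with the don't-fragment bit set and a payload near the limit (assuming the pod image ships an iputils-style ping that supports -M):

# 1422-byte payload + 28 bytes of ICMP/IP headers = 1450, which should fit
kubectl exec -it pod-a -- ping -c 3 -M do -s 1422 10.244.2.8
# 1472 + 28 = 1500 - too big once VXLAN adds its ~50 bytes; if this hangs or
# errors while the smaller ping works, you are looking at an MTU problem
kubectl exec -it pod-a -- ping -c 3 -M do -s 1472 10.244.2.8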

Overlay isn’t the only option. Calico can run in BGP mode:

  • Each node advertises its pod CIDR to the network
  • Routers learn: “10.244.2.0/24 is behind 192.168.1.11”
  • No encapsulation needed — native routing

Trade-offs:

  • BGP: No overhead, but requires network integration (not all environments support it)
  • VXLAN: Works anywhere, but has overhead

Cloud providers often use their own routing (VPC routes in AWS/GCP) — no overlay, no BGP, just cloud magic.
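
You can usually tell which mode a node is using from its routes for remote pod CIDRs (interface names and route attributes vary by CNI; these lines are illustrative):

# Overlay (VXLAN): remote pod CIDRs point at the VXLAN device
10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink

# BGP (e.g. Calico with BIRD): remote pod CIDRs point at the other node's real IP
10.244.2.0/24 via 192.168.1.11 dev eth0 proto bird

# Cloud-routed: often no per-node route at all - the VPC route table
# sends 10.244.2.0/24 to Node 2's network interface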

Let’s put it all together. Pod A curls my-service:8080:

NODE 1 (192.168.1.10)
======================
Pod A (10.244.1.5)
    | (1) curl my-service:8080
    |     DNS resolves to 10.96.45.67
    |     Packet: src=10.244.1.5, dst=10.96.45.67:8080
    v
iptables (PREROUTING)
    | (2) DNAT: dst 10.96.45.67 -> 10.244.2.8
    |     Packet: src=10.244.1.5, dst=10.244.2.8:8080
    v
Bridge (cni0)
    | (3) 10.244.2.8 not local -> route lookup
    v
VXLAN (flannel.1)
    | (4) Encapsulate
    |     Outer: src=192.168.1.10, dst=192.168.1.11
    v
eth0 (192.168.1.10)
    | (5) Send to physical network
    v
==== Physical Network ====
    |
    v
NODE 2 (192.168.1.11)
======================
eth0 (192.168.1.11)
    | (6) Receive encapsulated packet
    v
VXLAN (flannel.1)
    | (7) Decapsulate
    |     Extract: src=10.244.1.5, dst=10.244.2.8
    v
Bridge (cni0)
    | (8) Forward to Pod B's veth
    v
Pod B (10.244.2.8)
    (9) Receive packet, process HTTP request

The return path:

  1. Pod B responds: src=10.244.2.8, dst=10.244.1.5
  2. VXLAN encapsulates, sends to Node 1
  3. Node 1 decapsulates
  4. conntrack matches the existing connection
  5. Reverse DNAT: src becomes 10.96.45.67 (the ClusterIP)
  6. Pod A receives response from “my-service”

Armed with this knowledge, you can debug systematically.

Step 1: Can the pod reach anything?

# From inside the pod
kubectl exec -it pod-a -- ping 8.8.8.8
kubectl exec -it pod-a -- ping 10.244.1.1  # Node's bridge IP

If this fails, the problem is basic connectivity (CNI, veth, bridge).

Step 2: Can the pod reach other pods on the same node?

kubectl exec -it pod-a -- ping <pod-on-same-node-ip>

If this fails: bridge or veth issue.

Step 3: Can the pod reach pods on other nodes?

kubectl exec -it pod-a -- ping <pod-on-different-node-ip>

If same-node works but cross-node fails: overlay problem.

Step 4: Can the pod reach the ClusterIP?

kubectl exec -it pod-a -- curl -v 10.96.45.67:8080

If direct pod IP works but ClusterIP fails: kube-proxy/iptables problem.

Step 5: Is DNS working?

kubectl exec -it pod-a -- nslookup my-service
kubectl exec -it pod-a -- cat /etc/resolv.conf

If IP works but name doesn’t: DNS problem (CoreDNS, resolv.conf).
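
If the name lookup is the problem, check CoreDNS itself (it conventionally runs in kube-system with the k8s-app=kube-dns label, but verify the label in your cluster):

kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
kubectl get endpoints kube-dns -n kube-system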

# SSH to the node, then:

# Find rules for your Service
iptables-save | grep <service-name>
iptables-save | grep <cluster-ip>

# List the KUBE-SERVICES chain
iptables -t nat -L KUBE-SERVICES -n --line-numbers

# Follow a specific Service chain
iptables -t nat -L KUBE-SVC-XXXXX -n

What to look for:

  • Is there a rule matching your ClusterIP?
  • Does the Service chain have endpoint rules?
  • Are the endpoint IPs correct?

Missing rules? Check if kube-proxy is running:

kubectl get pods -n kube-system -l k8s-app=kube-proxy
kubectl logs -n kube-system -l k8s-app=kube-proxy

Capture traffic to see where packets go (or stop):

# On the pod (if tcpdump available)
kubectl exec -it pod-a -- tcpdump -i eth0 -n host 10.244.2.8

# On node, at the bridge
tcpdump -i cni0 -n host 10.244.1.5

# On node, at the VXLAN interface
tcpdump -i flannel.1 -n host 10.244.2.8

# On node, at the physical interface (see encapsulated packets)
tcpdump -i eth0 -n udp port 8472

# On destination node
tcpdump -i eth0 -n udp port 8472
tcpdump -i flannel.1 -n host 10.244.1.5

Interpret what you see:

  • Packets at cni0 but not flannel.1? Routing problem.
  • Packets at flannel.1 but not remote eth0? Physical network problem.
  • Packets at remote eth0 but not flannel.1? VXLAN decap problem.
  • Packets at remote cni0 but not pod? Bridge/veth problem.

Connection tracking state is the next thing to check:

# See all tracked connections
conntrack -L

# Filter for your Service
conntrack -L | grep 10.96.45.67

# Watch new connections
conntrack -E

Stale conntrack entries can cause weird issues (traffic to old pod IPs). Flushing can help (but disrupts existing connections):

conntrack -F
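
A more surgical option, if you suspect entries pointing at a Service or pod that no longer exists, is to delete only those (destination filters shown; adjust to your IPs):

# Drop tracked connections whose original destination is the ClusterIP
conntrack -D -d 10.96.45.67

# Or connections still aimed at a pod IP that has gone away
conntrack -D -d 10.244.2.8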

kube-proxy not running:

  • Symptom: ClusterIP doesn’t work, direct pod IP works
  • Check: kubectl get pods -n kube-system -l k8s-app=kube-proxy
  • Fix: Restart kube-proxy, check logs

CNI misconfiguration:

  • Symptom: Pods can’t communicate at all, or only on same node
  • Check: kubectl get pods -n kube-system for CNI pods (flannel, calico, etc.)
  • Check: /etc/cni/net.d/ for CNI config
  • Fix: Reinstall CNI, check config

iptables rules missing:

  • Symptom: ClusterIP doesn’t work after Service creation
  • Check: iptables-save | grep <service-ip>
  • Cause: kube-proxy error, RBAC issue
  • Fix: Check kube-proxy logs

MTU mismatch:

  • Symptom: Small packets work, large fail; TCP stalls
  • Check: MTU on pod, bridge, VXLAN interface
  • Fix: Configure CNI with correct MTU

NetworkPolicy blocking:

  • Symptom: Some pods can’t connect, others can
  • Check: kubectl get networkpolicy -A
  • Fix: Add appropriate NetworkPolicy rules

Firewall blocking VXLAN:

  • Symptom: Cross-node fails, same-node works
  • Check: tcpdump -i eth0 udp port 8472 — packets sent but not received?
  • Fix: Open UDP 8472 (VXLAN) between nodes

A quick cheat sheet for node-level inspection:

# Check all network interfaces
ip addr

# Check routes
ip route

# Check iptables NAT rules
iptables -t nat -L -n -v

# Check VXLAN FDB
bridge fdb show dev flannel.1

# Check CNI config
cat /etc/cni/net.d/*

# Check kube-proxy mode
kubectl logs -n kube-system -l k8s-app=kube-proxy | grep "Using"

# Check endpoints for a Service
kubectl get endpoints my-service

The magic of Kubernetes networking is built on:

  1. veths and bridges: Connect pods within a node
  2. iptables (kube-proxy): Implement Services via DNAT
  3. Overlay networks: Carry pod traffic across nodes via encapsulation

When debugging:

  • Trace the path systematically (pod -> bridge -> overlay -> remote node -> pod)
  • Use tcpdump at each hop to see where packets stop
  • Check iptables for Service issues
  • Check conntrack for connection state issues
  • Check MTU for “big packets fail” symptoms

The magic is just routing, NAT, and encapsulation. Once you know the layers, you know where to look.