You run curl my-service:8080 from a pod, and it just works. The request reaches another pod, possibly on a different node, and you get a response. Magic.
Until it doesn’t work. Then you’re staring at iptables rules, tcpdump output, and CNI logs wondering where your packet went.
This post traces a packet through Kubernetes networking — from pod to Service to pod across nodes. Understanding this flow turns debugging from nightmare to systematic diagnosis.
The Setup: What We’re Tracing ¶
Let’s make this concrete. We have:
Node 1 (192.168.1.10)
Pod A: 10.244.1.5 (client)
Node 2 (192.168.1.11)
Pod B: 10.244.2.8 (server, behind Service)
Pod C: 10.244.2.9 (server, behind Service)
Service: my-service
ClusterIP: 10.96.45.67
Port: 8080
Endpoints: 10.244.2.8:8080, 10.244.2.9:8080
Pod A runs: curl my-service:8080
What actually happens? Let’s trace it.
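If you want to follow along in your own cluster, everything above is visible through kubectl (the names below match this example; substitute your own Service):

# Pod IPs and the nodes they were scheduled on
kubectl get pods -o wide
# The Service's ClusterIP and port
kubectl get service my-service
# The endpoint IPs the Service load balances across
kubectl get endpoints my-service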
Part 1: Pod-to-Pod on the Same Node ¶
Before crossing nodes, let’s understand the simplest case: two pods on the same node.
The Virtual Network ¶
Each pod gets its own network namespace — an isolated network stack with its own interfaces, routes, and iptables. But pods need to communicate. Kubernetes creates a virtual ethernet pair (veth) connecting each pod to the node:
+------------------------------------------------------+
|                         Node                         |
|                                                      |
|   +-----------+                 +-----------+        |
|   |   Pod A   |                 |   Pod B   |        |
|   |   eth0    |                 |   eth0    |        |
|   +-----+-----+                 +-----+-----+        |
|         | veth                        | veth         |
|         |                             |              |
|   +-----+-----------------------------+-----+        |
|   |           Bridge (cni0/cbr0)            |        |
|   +-----------------------------------------+        |
|                                                      |
+------------------------------------------------------+
The veth pair: One end (eth0) is inside the pod’s network namespace. The other end is attached to a bridge on the node. The pair acts like a virtual cable.
The bridge: A software switch (commonly named cni0, cbr0, or docker0). All pod veths connect to this bridge. Packets to other pods on the same node go through the bridge.
Packet Flow: Same Node ¶
Suppose Pod A (10.244.1.5) sends to a pod at 10.244.1.6 on the same node; call it Pod B for this example, though it's not the Pod B from our setup, which lives on Node 2:
1. Pod A sends packet (src: 10.244.1.5, dst: 10.244.1.6)
2. Packet exits via eth0 (inside pod) -> enters veth -> arrives at bridge
3. Bridge looks up MAC for 10.244.1.6 -> forwards to Pod B's veth
4. Packet enters Pod B's eth0
5. Pod B receives packet
No iptables (for basic connectivity), no encapsulation. Just Layer 2 switching on the bridge.
See It Yourself ¶
# On the node, list veths
ip link show type veth
# See the bridge
ip link show type bridge
brctl show cni0
# See pod connections
bridge fdb show dev cni0
# tcpdump on the bridge
tcpdump -i cni0 -n host 10.244.1.5
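One more useful trick: mapping a pod's eth0 to its host-side veth. The pod's eth0 records the interface index of its peer, which you can then look up on the node. A minimal sketch, assuming our pod-a and that the index comes back as 12 (both illustrative):

# Inside the pod: print the ifindex of the host-side peer veth
kubectl exec -it pod-a -- cat /sys/class/net/eth0/iflink
# On the node: find the interface with that index (replace 12 with the value above)
ip link show | grep '^12:'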
Part 2: How Services Work (iptables) ¶
Now for the interesting part. Pod A doesn’t call Pod B directly — it calls my-service:8080. The Service has a ClusterIP (10.96.45.67) that doesn’t exist on any interface. How does this work?
The ClusterIP Illusion ¶
A ClusterIP is a virtual IP. No interface has this address. No ARP entry exists. Try to ping it, even from inside the cluster, and you'll typically get no response: only the Service's ports are handled.
Yet from inside a pod, it works. The secret: iptables rewrites the destination before the packet leaves.
kube-proxy’s Job ¶
kube-proxy runs on every node. It watches Services and Endpoints, then programs iptables rules that:
- Intercept packets destined for ClusterIPs
- Rewrite the destination to an actual pod IP (DNAT)
- Load balance across endpoints
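Everything below assumes kube-proxy's iptables mode, still the most common default. A quick, hedged way to confirm what your cluster runs (the ConfigMap check assumes a kubeadm-style install):

# kube-proxy logs which proxier it chose at startup
kubectl logs -n kube-system -l k8s-app=kube-proxy | grep -i proxier
# On kubeadm clusters, the mode also lives in the kube-proxy ConfigMap
kubectl get configmap kube-proxy -n kube-system -o yaml | grep 'mode:'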
The Chain of Chains ¶
iptables organizes rules into chains. kube-proxy creates a hierarchy:
Packet arrives (dst: 10.96.45.67:8080)
        |
        v
PREROUTING (or OUTPUT for traffic generated on the node itself)
        |
        v
KUBE-SERVICES           -- matches on ClusterIP:port
        |
        v
KUBE-SVC-ABCD1234       -- the Service's chain, randomly selects an endpoint
        |
        +--- 50% ---> KUBE-SEP-ENDPOINT1 (DNAT to 10.244.2.8)
        |
        +--- 50% ---> KUBE-SEP-ENDPOINT2 (DNAT to 10.244.2.9)
KUBE-SERVICES: The entry point. Has rules for every Service, matching on ClusterIP:port.
KUBE-SVC-*: One chain per Service. Contains probability-based jumps to endpoint chains (this is how load balancing works).
KUBE-SEP-*: One chain per endpoint (pod). Performs the actual DNAT — rewriting destination from ClusterIP to pod IP.
Reading Real iptables Rules ¶
Let’s see what kube-proxy creates:
# Dump all iptables rules
iptables-save | grep my-service
# Or more specifically, find the Service chain
iptables -t nat -L KUBE-SERVICES -n | grep 10.96.45.67
Example output (annotated):
# Entry in KUBE-SERVICES for our Service
-A KUBE-SERVICES -d 10.96.45.67/32 -p tcp -m tcp --dport 8080 \
-j KUBE-SVC-ABCD1234 # Jump to Service chain
# The Service chain with load balancing
-A KUBE-SVC-ABCD1234 -m statistic --mode random --probability 0.5 \
-j KUBE-SEP-ENDPOINT1 # 50% to first endpoint
-A KUBE-SVC-ABCD1234 \
-j KUBE-SEP-ENDPOINT2 # Remaining 50% to second endpoint
# Endpoint chains - the actual DNAT
-A KUBE-SEP-ENDPOINT1 -p tcp \
-j DNAT --to-destination 10.244.2.8:8080
-A KUBE-SEP-ENDPOINT2 -p tcp \
-j DNAT --to-destination 10.244.2.9:8080
The probability math: With two endpoints, the first rule matches 50%. The second rule catches everything else (also 50%). With three endpoints: 33%, 50% of remaining (33%), then the rest (33%).
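For a concrete picture with three endpoints, the Service chain looks roughly like this (chain names are illustrative, and real kube-proxy emits longer probability values):

# First rule: 1/3 of traffic
-A KUBE-SVC-ABCD1234 -m statistic --mode random --probability 0.33333 \
    -j KUBE-SEP-ENDPOINT1
# Second rule: half of the remaining 2/3, i.e. another 1/3
-A KUBE-SVC-ABCD1234 -m statistic --mode random --probability 0.50000 \
    -j KUBE-SEP-ENDPOINT2
# Last rule: whatever is left, the final 1/3
-A KUBE-SVC-ABCD1234 -j KUBE-SEP-ENDPOINT3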
What DNAT Does ¶
Before DNAT:
src: 10.244.1.5 (Pod A)
dst: 10.96.45.67:8080 (Service ClusterIP)
After DNAT:
src: 10.244.1.5 (Pod A)
dst: 10.244.2.8:8080 (Pod B - actual endpoint)
The packet now has a real destination. It can be routed to Pod B.
Connection Tracking (conntrack) ¶
One problem: Pod B’s response will have:
src: 10.244.2.8:8080 (Pod B)
dst: 10.244.1.5 (Pod A)
Pod A sent to 10.96.45.67 but receives from 10.244.2.8. Won’t it be confused?
conntrack saves us. The kernel tracks connections. When the response arrives, it reverses the DNAT:
Response packet:
Before reverse DNAT: src=10.244.2.8, dst=10.244.1.5
After reverse DNAT: src=10.96.45.67, dst=10.244.1.5
Pod A sees the response coming from the ClusterIP it originally contacted. The illusion holds.
# See connection tracking entries
conntrack -L | grep 10.96.45.67
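For our connection, the entry looks roughly like this (timeouts and the client port are illustrative; output varies slightly by version). The two tuples are the point: the original one toward the ClusterIP and the reply one from the real pod IP, which is the DNAT the kernel recorded:

tcp 6 86390 ESTABLISHED src=10.244.1.5 dst=10.96.45.67 sport=51234 dport=8080 src=10.244.2.8 dst=10.244.1.5 sport=8080 dport=51234 [ASSURED] mark=0 use=1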
Part 3: Crossing Nodes (Overlay Networks) ¶
Our packet is now destined for 10.244.2.8 (Pod B on Node 2). But there’s a problem.
The Problem: Pod CIDRs Aren’t Routable ¶
Pod IPs (10.244.x.x) are internal to Kubernetes. Your physical network doesn’t know how to route them:
Node 1 (192.168.1.10) wants to send to 10.244.2.8
Physical router: "10.244.2.8? Never heard of it. Drop."
Node 1’s routing table doesn’t have a route to 10.244.2.0/24. Neither does your datacenter’s router.
Solution: Overlay Networks ¶
An overlay network encapsulates pod-to-pod packets inside node-to-node packets:
+---------------------------------------------------------+
| Original Packet                                         |
| src: 10.244.1.5   dst: 10.244.2.8   payload: HTTP GET   |
+---------------------------------------------------------+
                             |
                    VXLAN Encapsulation
                             |
                             v
+---------------------------------------------------------+
| Outer Header                                            |
| src: 192.168.1.10   dst: 192.168.1.11   proto: UDP:8472 |
+---------------------------------------------------------+
| VXLAN Header (VNI: 1)                                   |
+---------------------------------------------------------+
| Inner Packet (original)                                 |
| src: 10.244.1.5   dst: 10.244.2.8   payload: HTTP GET   |
+---------------------------------------------------------+
The physical network only sees the outer header: Node 1 sending UDP to Node 2. It routes normally. Node 2 receives, strips the outer header, and delivers the inner packet to Pod B.
VXLAN Deep Dive ¶
VXLAN (Virtual Extensible LAN) is the most common overlay in Kubernetes (used by Flannel, Calico in VXLAN mode, etc.).
Key components on each node:
- VXLAN interface (flannel.1, vxlan.calico): A virtual interface that handles encap/decap
- FDB (Forwarding Database): Maps pod IPs/MACs to node IPs
- Routes: Direct pod CIDR traffic to the VXLAN interface
# See the VXLAN interface
ip -d link show flannel.1
# See the FDB entries (which node has which pods)
bridge fdb show dev flannel.1
# See routes to other pod CIDRs
ip route | grep 10.244
Example routing table on Node 1:
10.244.1.0/24 dev cni0 proto kernel scope link src 10.244.1.1 # Local pods
10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink # Node 2's pods
10.244.3.0/24 via 10.244.3.0 dev flannel.1 onlink # Node 3's pods
Traffic to 10.244.2.x goes through flannel.1, which encapsulates it.
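How does flannel.1 know which node hides a given pod subnet? The FDB maps each remote VTEP's MAC address to that node's real IP. An entry looks roughly like this (the MAC is illustrative; there's one such entry per remote node):

# bridge fdb show dev flannel.1
aa:bb:cc:dd:ee:02 dst 192.168.1.11 self permanent   # Node 2's flannel.1 (VTEP)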
Packet Flow: Crossing Nodes ¶
Full journey for Pod A (10.244.1.5, Node 1) to Pod B (10.244.2.8, Node 2):
Node 1:
1. Pod A sends: src=10.244.1.5, dst=10.244.2.8
2. Packet exits pod via veth -> arrives at bridge (cni0)
3. Bridge checks: 10.244.2.8 not local
4. Packet routes to flannel.1 (VXLAN interface)
5. VXLAN encapsulates:
- Outer: src=192.168.1.10, dst=192.168.1.11, UDP:8472
- Inner: original packet
6. Encapsulated packet sent on physical network (eth0)
Physical Network:
7. Packet routed from 192.168.1.10 to 192.168.1.11
Node 2:
8. eth0 receives packet
9. Kernel sees UDP:8472 -> hands to VXLAN interface
10. flannel.1 decapsulates, extracts inner packet
11. Inner packet: src=10.244.1.5, dst=10.244.2.8
12. Routes to cni0 (bridge)
13. Bridge forwards to Pod B's veth
14. Pod B receives original packet
MTU Matters ¶
VXLAN adds ~50 bytes of overhead (outer IP + UDP + VXLAN header). If your physical MTU is 1500:
Physical MTU: 1500
VXLAN overhead: -50
Pod MTU: 1450
If pods use MTU 1500, packets that need the full size will fail (too big after encapsulation). Symptoms:
- Small requests work, large requests hang
- SSH works, SCP fails
- TCP connections stall
Check MTU configuration:
# On pod
cat /sys/class/net/eth0/mtu
# On node VXLAN interface
cat /sys/class/net/flannel.1/mtu
Most CNIs set pod MTU correctly, but misconfigurations happen.
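A quick smoke test for MTU trouble is a don't-fragment ping between pods on different nodes. The sizes below assume a 1450-byte pod MTU (ICMP payload = MTU minus 28 bytes of IP and ICMP headers) and an image whose ping supports -M (iputils; busybox's does not):

# Should succeed: 1422 + 28 bytes of headers = 1450, which fits
kubectl exec -it pod-a -- ping -M do -s 1422 -c 3 10.244.2.8
# Should be rejected locally with "message too long"; if it silently hangs
# instead, the pod believes its MTU is larger than the path really allows
kubectl exec -it pod-a -- ping -M do -s 1472 -c 3 10.244.2.8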
Alternative: No Overlay (BGP) ¶
Overlay isn’t the only option. Calico can run in BGP mode:
- Each node advertises its pod CIDR to the network
- Routers learn: “10.244.2.0/24 is behind 192.168.1.11”
- No encapsulation needed — native routing
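The visible difference shows up in the node's routing table: instead of routes through flannel.1, remote pod CIDRs point straight at the other node's real IP. A rough sketch (Calico's actual table differs in details such as per-pod routes and /26 blocks):

# ip route on Node 1 in a no-overlay (BGP) setup -- illustrative
10.244.2.0/24 via 192.168.1.11 dev eth0 proto bird   # Node 2's pods, routed natively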
Trade-offs:
- BGP: No overhead, but requires network integration (not all environments support it)
- VXLAN: Works anywhere, but has overhead
Cloud providers often use their own routing (VPC routes in AWS/GCP) — no overlay, no BGP, just cloud magic.
Part 4: The Full Journey ¶
Let’s put it all together. Pod A curls my-service:8080:
NODE 1 (192.168.1.10)
======================
Pod A (10.244.1.5)
| (1) curl my-service:8080
| DNS resolves to 10.96.45.67
| Packet: src=10.244.1.5, dst=10.96.45.67:8080
v
iptables (PREROUTING)
| (2) DNAT: dst 10.96.45.67 -> 10.244.2.8
| Packet: src=10.244.1.5, dst=10.244.2.8:8080
v
Bridge (cni0)
| (3) 10.244.2.8 not local -> route lookup
v
VXLAN (flannel.1)
| (4) Encapsulate
| Outer: src=192.168.1.10, dst=192.168.1.11
v
eth0 (192.168.1.10)
| (5) Send to physical network
v
==== Physical Network ====
|
v
NODE 2 (192.168.1.11)
======================
eth0 (192.168.1.11)
| (6) Receive encapsulated packet
v
VXLAN (flannel.1)
| (7) Decapsulate
| Extract: src=10.244.1.5, dst=10.244.2.8
v
Bridge (cni0)
| (8) Forward to Pod B's veth
v
Pod B (10.244.2.8)
(9) Receive packet, process HTTP request
The return path:
- Pod B responds: src=10.244.2.8, dst=10.244.1.5
- VXLAN encapsulates, sends to Node 1
- Node 1 decapsulates
- conntrack matches the existing connection
- Reverse DNAT: src becomes 10.96.45.67 (the ClusterIP)
- Pod A receives response from “my-service”
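You can watch this round trip being tracked in real time by tailing conntrack events on Node 1 while issuing the request (a sketch; run the two commands in separate terminals):

# Terminal 1, on Node 1: stream connection-tracking events for the ClusterIP
conntrack -E | grep 10.96.45.67
# Terminal 2: make the request from Pod A
kubectl exec -it pod-a -- curl -s my-service:8080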
Part 5: Debugging the Nightmare ¶
Armed with this knowledge, debugging becomes systematic.
“Pod Can’t Reach Service” ¶
Step 1: Can the pod reach anything?
# From inside the pod
kubectl exec -it pod-a -- ping 8.8.8.8
kubectl exec -it pod-a -- ping 10.244.1.1 # Node's bridge IP
If this fails, the problem is basic connectivity (CNI, veth, bridge).
Step 2: Can the pod reach other pods on the same node?
kubectl exec -it pod-a -- ping <pod-on-same-node-ip>
If this fails: bridge or veth issue.
Step 3: Can the pod reach pods on other nodes?
kubectl exec -it pod-a -- ping <pod-on-different-node-ip>
If same-node works but cross-node fails: overlay problem.
Step 4: Can the pod reach the ClusterIP?
kubectl exec -it pod-a -- curl -v 10.96.45.67:8080
If direct pod IP works but ClusterIP fails: kube-proxy/iptables problem.
Step 5: Is DNS working?
kubectl exec -it pod-a -- nslookup my-service
kubectl exec -it pod-a -- cat /etc/resolv.conf
If IP works but name doesn’t: DNS problem (CoreDNS, resolv.conf).
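If the name lookup is the problem, check the DNS pods themselves. CoreDNS deployments typically keep the k8s-app=kube-dns label for compatibility, but verify the label in your cluster:

# Is CoreDNS running, and is it logging errors?
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
# Does cluster DNS resolve anything at all?
kubectl exec -it pod-a -- nslookup kubernetes.default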
Reading iptables Rules ¶
# SSH to the node, then:
# Find rules for your Service
iptables-save | grep <service-name>
iptables-save | grep <cluster-ip>
# List the KUBE-SERVICES chain
iptables -t nat -L KUBE-SERVICES -n --line-numbers
# Follow a specific Service chain
iptables -t nat -L KUBE-SVC-XXXXX -n
What to look for:
- Is there a rule matching your ClusterIP?
- Does the Service chain have endpoint rules?
- Are the endpoint IPs correct?
Missing rules? Check if kube-proxy is running:
kubectl get pods -n kube-system -l k8s-app=kube-proxy
kubectl logs -n kube-system -l k8s-app=kube-proxy
tcpdump at Each Hop ¶
Capture traffic to see where packets go (or stop):
# On the pod (if tcpdump available)
kubectl exec -it pod-a -- tcpdump -i eth0 -n host 10.244.2.8
# On node, at the bridge
tcpdump -i cni0 -n host 10.244.1.5
# On node, at the VXLAN interface
tcpdump -i flannel.1 -n host 10.244.2.8
# On node, at the physical interface (see encapsulated packets)
tcpdump -i eth0 -n udp port 8472
# On destination node
tcpdump -i eth0 -n udp port 8472
tcpdump -i flannel.1 -n host 10.244.1.5
Interpret what you see:
- Packets at cni0 but not flannel.1? Routing problem.
- Packets at flannel.1 but not remote eth0? Physical network problem.
- Packets at remote eth0 but not flannel.1? VXLAN decap problem.
- Packets at remote cni0 but not pod? Bridge/veth problem.
conntrack Inspection ¶
# See all tracked connections
conntrack -L
# Filter for your Service
conntrack -L | grep 10.96.45.67
# Watch new connections
conntrack -E
Stale conntrack entries can cause weird issues (traffic to old pod IPs). Flushing can help (but disrupts existing connections):
conntrack -F
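A less disruptive option is to delete only the entries pointing at a stale destination instead of flushing the whole table (the IPs below are illustrative; use the old pod IP or the ClusterIP you're debugging):

# Delete tracked connections whose original destination is a stale pod IP
conntrack -D -d 10.244.2.8
# Or target entries for the Service's ClusterIP
conntrack -D -d 10.96.45.67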
Common Failures ¶
kube-proxy not running:
- Symptom: ClusterIP doesn’t work, direct pod IP works
- Check: kubectl get pods -n kube-system -l k8s-app=kube-proxy
- Fix: Restart kube-proxy, check logs
CNI misconfiguration:
- Symptom: Pods can’t communicate at all, or only on same node
- Check: kubectl get pods -n kube-system for CNI pods (flannel, calico, etc.)
- Check: /etc/cni/net.d/ for the CNI config
- Fix: Reinstall the CNI, check its config
iptables rules missing:
- Symptom: ClusterIP doesn’t work after Service creation
- Check: iptables-save | grep <service-ip>
- Cause: kube-proxy error, RBAC issue
- Fix: Check kube-proxy logs
MTU mismatch:
- Symptom: Small packets work, large fail; TCP stalls
- Check: MTU on pod, bridge, VXLAN interface
- Fix: Configure CNI with correct MTU
NetworkPolicy blocking:
- Symptom: Some pods can’t connect, others can
- Check: kubectl get networkpolicy -A
- Fix: Add appropriate NetworkPolicy rules
Firewall blocking VXLAN:
- Symptom: Cross-node fails, same-node works
- Check: tcpdump -i eth0 udp port 8472 (packets sent but not received?)
- Fix: Open UDP 8472 (Flannel's VXLAN port; some CNIs use 4789) between nodes
Quick Diagnostic Commands ¶
# Check all network interfaces
ip addr
# Check routes
ip route
# Check iptables NAT rules
iptables -t nat -L -n -v
# Check VXLAN FDB
bridge fdb show dev flannel.1
# Check CNI config
cat /etc/cni/net.d/*
# Check kube-proxy mode
kubectl logs -n kube-system -l k8s-app=kube-proxy | grep "Using"
# Check endpoints for a Service
kubectl get endpoints my-service
Summary ¶
The magic of Kubernetes networking is built on:
- veths and bridges: Connect pods within a node
- iptables (kube-proxy): Implement Services via DNAT
- Overlay networks: Carry pod traffic across nodes via encapsulation
When debugging:
- Trace the path systematically (pod -> bridge -> overlay -> remote node -> pod)
- Use tcpdump at each hop to see where packets stop
- Check iptables for Service issues
- Check conntrack for connection state issues
- Check MTU for “big packets fail” symptoms
The magic is just routing, NAT, and encapsulation. Once you know the layers, you know where to look.