We had gang-scheduled jobs that performed DNS lookups at startup. If DNS resolution failed, the pod failed. If one pod in the gang failed, the entire gang restarted. Hundreds of pods restarting simultaneously meant hundreds of DNS queries hitting CoreDNS at once. CoreDNS couldn’t keep up, more pods failed, more restarts, more DNS queries—a cascading failure that took down our batch processing pipeline.
The fix: NodeLocal DNSCache. But understanding why it works requires understanding how Kubernetes DNS works and why it breaks under load.
How Kubernetes DNS Works ¶
Every pod gets DNS configuration injected via /etc/resolv.conf:
$ cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
Let’s break this down:
The Nameserver ¶
10.96.0.10 is the ClusterIP of the kube-dns service (which points to CoreDNS pods):
$ kubectl get svc -n kube-system kube-dns
NAME       TYPE        CLUSTER-IP   PORT(S)
kube-dns   ClusterIP   10.96.0.10   53/UDP,53/TCP
All DNS queries from all pods go to this single VIP.
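To see which CoreDNS pods actually sit behind that VIP, check the service's endpoints (a quick orientation step; in most clusters the CoreDNS pods carry the k8s-app=kube-dns label):
$ kubectl get endpoints -n kube-system kube-dns
$ kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide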
Search Domains ¶
When you resolve a name like my-service, Kubernetes tries multiple suffixes:
1. my-service.default.svc.cluster.local
2. my-service.svc.cluster.local
3. my-service.cluster.local
4. my-service (absolute)
The ndots Setting ¶
ndots:5 means: if the name has fewer than 5 dots, try search domains first.
"my-service" (0 dots < 5) → try search domains first
"api.example.com" (2 dots < 5) → try search domains first!
"api.example.com." (trailing dot) → absolute, skip search domains
This is where things get expensive. A simple lookup for api.example.com generates:
1. api.example.com.default.svc.cluster.local → NXDOMAIN
2. api.example.com.svc.cluster.local → NXDOMAIN
3. api.example.com.cluster.local → NXDOMAIN
4. api.example.com → SUCCESS
Four queries for one resolution. And each query is a UDP packet through the cluster network.
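You can watch this expansion happen from inside a pod. A rough check, assuming an image with dig available (such as the dnsutils image used later):
# Show each intermediate search-domain attempt
dig +search +showsearch api.example.com
# Compare with an absolute name (trailing dot) that skips the search list
dig api.example.com.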
The DNS Resolution Path ¶
Here’s what happens when a pod resolves a name:
Pod (10.244.1.5)
   |
   | UDP packet to 10.96.0.10:53
   v
iptables/IPVS (kube-proxy rules)
   |
   | DNAT to CoreDNS pod IP
   v
CoreDNS Pod (10.244.0.10)
   |
   | Lookup in cache or forward upstream
   v
Response back through same path
Every DNS query:
- Goes through the pod’s network namespace
- Hits iptables/IPVS rules for the service
- Gets DNAT’d to a CoreDNS pod
- Creates a conntrack entry
- Returns through the same path
At scale, this becomes a bottleneck.
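To see the middle of that path on a node, inspect the kube-proxy rules for the kube-dns service. A quick look, assuming iptables mode (use ipvsadm if kube-proxy runs in IPVS mode):
# iptables mode: service/DNAT rules for the kube-dns VIP
iptables -t nat -S | grep 10.96.0.10
# IPVS mode: virtual server and its CoreDNS backends
ipvsadm -Ln | grep -A 3 10.96.0.10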
Why CoreDNS Becomes a Bottleneck ¶
Single Service VIP ¶
All cluster DNS traffic funnels through one service IP. Even with multiple CoreDNS replicas, every packet hits the same iptables/IPVS rules:
Pod A ──┐
Pod B ──┼──► 10.96.0.10 (kube-dns) ──► CoreDNS Pods
Pod C ──┤              │
Pod D ──┘              │
                       v
                 iptables/IPVS
              (single bottleneck)
Conntrack Table Pressure ¶
Every DNS query creates a conntrack entry (even for UDP). A typical default table size is 131072 entries. With thousands of pods doing DNS lookups:
$ cat /proc/sys/net/netfilter/nf_conntrack_count
128000 # Getting close to limit!
$ dmesg | grep conntrack
nf_conntrack: table full, dropping packet
Dropped packets = failed DNS queries = failed pods.
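Raising the conntrack ceiling buys headroom while you reduce the query volume. A hedged stopgap, not a fix (kube-proxy may manage this sysctl itself, and sensible limits depend on node memory):
# On the node, temporarily
sysctl -w net.netfilter.nf_conntrack_max=262144
# Persist across reboots
echo 'net.netfilter.nf_conntrack_max = 262144' > /etc/sysctl.d/99-conntrack.conf
sysctl --system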
Thundering Herd ¶
Our gang scheduling scenario:
- 100-pod gang starts
- Each pod does 3 DNS lookups at startup
- Each lookup expands to 4 queries (ndots)
- 100 × 3 × 4 = 1,200 DNS queries in milliseconds
If CoreDNS can’t respond fast enough, queries time out (default: 5 seconds). Pods fail, gang restarts, another 1,200 queries. CoreDNS falls further behind. Cascade.
Gang starts
     |
     v
1,200 DNS queries ──► CoreDNS overwhelmed
     |                        |
     v                        v
 Timeouts                Queue grows
     |                        |
     v                        v
 Pods fail             Latency increases
     |                        |
     v                        v
Gang restarts ────────► More queries
     |
     v
 Cascade
UDP Packet Loss ¶
Under load, UDP packets get dropped:
- Kernel socket buffer overflow
- Network interface queue overflow
- iptables processing delays
Unlike TCP, UDP has no built-in retransmission. A dropped query is simply lost until the resolver's timeout expires and it retries, adding seconds of latency.
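Those drops show up in the kernel's UDP counters. A quick check on a node (exact counter names vary by kernel and tooling):
# Look for receive buffer errors / packet receive errors
netstat -su | grep -i error
cat /proc/net/snmp | grep Udp: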
Debugging DNS Issues ¶
Symptoms ¶
- Pod startup failures with DNS errors
- Slow service-to-service communication
- Intermittent connection timeouts
- CoreDNS pods showing high CPU
CoreDNS Metrics ¶
CoreDNS exposes Prometheus metrics:
# Request rate
rate(coredns_dns_requests_total[5m])
# Error rate
rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m])
# Latency
histogram_quantile(0.99, rate(coredns_dns_request_duration_seconds_bucket[5m]))
# Cache hit rate
rate(coredns_cache_hits_total[5m]) /
(rate(coredns_cache_hits_total[5m]) + rate(coredns_cache_misses_total[5m]))
Warning signs:
- Request rate spiking
- Latency p99 > 100ms
- SERVFAIL responses increasing
- Cache hit rate dropping
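If Prometheus isn't scraping CoreDNS yet, you can read the same counters by hand. This assumes the default Corefile enables the prometheus plugin on :9153:
# Port-forward to CoreDNS's metrics port
kubectl -n kube-system port-forward deployment/coredns 9153:9153
# In another terminal
curl -s http://localhost:9153/metrics | grep -E 'coredns_dns_requests_total|coredns_cache'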
Testing from a Pod ¶
# Run a debug pod
kubectl run debug --image=busybox --rm -it -- sh
# Test DNS resolution
nslookup kubernetes.default
nslookup google.com
# Measure timing
time nslookup google.com
# Check resolv.conf
cat /etc/resolv.conf
# Verbose DNS query
nslookup -debug kubernetes.default
Using dig ¶
# Install dig (dnsutils)
kubectl run debug --image=tutum/dnsutils --rm -it -- bash
# Query with timing
dig kubernetes.default.svc.cluster.local
# Query CoreDNS directly
dig @10.96.0.10 kubernetes.default.svc.cluster.local
# See full query expansion
dig +search my-service
# Trace the resolution path
dig +trace google.com
Checking Conntrack ¶
On a node:
# Current connections
cat /proc/sys/net/netfilter/nf_conntrack_count
# Max connections
cat /proc/sys/net/netfilter/nf_conntrack_max
# Conntrack stats (look for drops)
conntrack -S
cpu=0 found=0 invalid=1234 ignore=5678 insert=0 insert_failed=100 drop=50
                                                                  ^^^^^^^
                                                                  Drops!
Packet Capture ¶
# On a node, capture DNS traffic
tcpdump -i any port 53 -nn
# Filter for specific pod
tcpdump -i any port 53 and host 10.244.1.5 -nn
# Save for analysis
tcpdump -i any port 53 -w dns.pcap
Mitigation Options ¶
Scale CoreDNS ¶
The obvious first step:
kubectl -n kube-system scale deployment coredns --replicas=5
Helps: More pods to handle queries.
Doesn’t solve: Traffic still funnels through service VIP. Conntrack pressure remains. Thundering herd still overwhelms.
Tune ndots ¶
Reduce query fan-out by lowering ndots:
apiVersion: v1
kind: Pod
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
With ndots:2, names with 2+ dots resolve directly:
"api.example.com" (2 dots >= 2) → resolve directly, no search domains
"my-service" (0 dots < 2) → still uses search domains
Helps: Reduces queries for external domains.
Doesn’t solve: Internal service lookups still expand. Thundering herd still a problem.
Use FQDNs ¶
Force absolute lookups with trailing dots:
// Instead of
http.Get("http://api.example.com/path")
// Use
http.Get("http://api.example.com./path") // Note trailing dot
Helps: Eliminates search domain expansion for that lookup.
Doesn’t solve: Requires code changes. Internal services still need search domains.
Increase CoreDNS Cache ¶
# CoreDNS Corefile
.:53 {
    cache 300   # Cache for 5 minutes instead of default 30s
    # ...
}
Helps: More cache hits, fewer upstream queries.
Doesn’t solve: Cold start thundering herd (nothing in cache yet).
The Fix: NodeLocal DNSCache ¶
NodeLocal DNSCache runs a DNS cache on every node. Pods query the local cache instead of the CoreDNS service.
Architecture ¶
Before (all traffic to CoreDNS):
Pod ──► kube-dns Service (10.96.0.10) ──► CoreDNS Pods
                      │
               (iptables/IPVS)
After (local cache):
Pod ──► NodeLocal DNS (169.254.20.10) ──► Cache Hit? ──► Response
              │                               │
              │                          Cache Miss
              │                               │
              │                               v
              │                         CoreDNS Pods
              │
     (runs on same node)
How It Works ¶
- DaemonSet: NodeLocal DNSCache runs on every node
- Link-local IP: Listens on 169.254.20.10 (node-local, no network hop)
- iptables rules: Redirect DNS traffic to local cache
- Cache: Serves cached responses instantly
- Upstream: Cache misses go to CoreDNS
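A quick way to poke at these pieces on a node (interface name and exact rules can vary by version, so treat this as a sketch): the listen address usually lives on a dummy interface, and you can query the cache directly:
# The link-local listen address (commonly a dummy interface named nodelocaldns)
ip addr show nodelocaldns
# Query the local cache directly
dig @169.254.20.10 kubernetes.default.svc.cluster.local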
Benefits ¶
No service VIP: Queries don’t go through iptables/IPVS for the kube-dns service.
No cross-node traffic: Cache hits are served locally.
No conntrack for local queries: NodeLocal DNSCache installs NOTRACK rules, so queries to the local cache skip connection tracking.
Survives CoreDNS issues: Cached entries still work if CoreDNS is temporarily unavailable.
Reduces CoreDNS load: Only cache misses reach CoreDNS.
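One rough way to see the conntrack benefit is to count DNS entries in the table before and after the rollout (numbers depend entirely on workload):
# On a node: conntrack entries for DNS traffic
conntrack -L 2>/dev/null | grep -c 'dport=53'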
Deploying NodeLocal DNSCache ¶
Prerequisites ¶
- Kubernetes 1.18+
- Know your cluster DNS IP (usually 10.96.0.10)
- Know your cluster domain (usually cluster.local)
Installation ¶
# Download the manifest
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml
# Replace placeholders
# __PILLAR__DNS__SERVER__ → your kube-dns ClusterIP (e.g., 10.96.0.10)
# __PILLAR__LOCAL__DNS__ → 169.254.20.10
# __PILLAR__DNS__DOMAIN__ → cluster.local
sed -i 's/__PILLAR__DNS__SERVER__/10.96.0.10/g' nodelocaldns.yaml
sed -i 's/__PILLAR__LOCAL__DNS__/169.254.20.10/g' nodelocaldns.yaml
sed -i 's/__PILLAR__DNS__DOMAIN__/cluster.local/g' nodelocaldns.yaml
# Apply
kubectl apply -f nodelocaldns.yaml
Verify DaemonSet ¶
$ kubectl get ds -n kube-system node-local-dns
NAME             DESIRED   CURRENT   READY   NODE SELECTOR
node-local-dns   50        50        50      <none>
$ kubectl get pods -n kube-system -l k8s-app=node-local-dns
NAME                   READY   STATUS    RESTARTS
node-local-dns-abc12   1/1     Running   0
node-local-dns-def34   1/1     Running   0
...
Update kubelet Configuration ¶
Pods need to use the local DNS. Update kubelet’s --cluster-dns flag:
# kubelet configuration
clusterDNS:
- 169.254.20.10 # NodeLocal DNS
Note that the kubelet change only affects pods created afterward. Alternatively, keep the existing kubelet config and let NodeLocal DNSCache's iptables rules intercept traffic destined for 10.96.0.10.
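If you rely on interception, you can spot-check the rules NodeLocal DNSCache installed on a node; it adds rules for both the link-local address and the kube-dns VIP (rule layout varies by version):
# Look for interception/NOTRACK rules referencing either address
iptables-save | grep -e 169.254.20.10 -e 10.96.0.10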
Verify It’s Working ¶
# Check pod's resolv.conf
kubectl run test --image=busybox --rm -it -- cat /etc/resolv.conf
nameserver 169.254.20.10 # Should show local DNS
# Or if using iptables interception:
nameserver 10.96.0.10 # Original, but traffic is redirected
# Test resolution
kubectl run test --image=busybox --rm -it -- nslookup kubernetes.default
Check NodeLocal DNS Metrics ¶
# Port-forward to a node-local-dns pod
kubectl port-forward -n kube-system pod/node-local-dns-abc12 9253:9253
# Check metrics
curl http://localhost:9253/metrics | grep coredns_cache
Results ¶
After deploying NodeLocal DNSCache:
Before:
- Gang scheduling failures due to DNS timeouts
- CoreDNS CPU at 80% during job spikes
- DNS p99 latency: 500ms+ during load
- Cascading failures from DNS-induced restarts
After:
- Gang scheduling stable
- CoreDNS CPU dropped to 20% (only cache misses)
- DNS p99 latency: <5ms (local cache hits)
- No more DNS-induced cascading failures
The local cache absorbs the thundering herd. Even if 100 pods start simultaneously on one node, the local cache serves repeated queries instantly.
Other DNS Optimizations ¶
autopath Plugin ¶
CoreDNS’s autopath plugin reduces search domain queries:
# Corefile
.:53 {
    autopath @kubernetes
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods verified   # autopath requires verified pod records
        fallthrough in-addr.arpa ip6.arpa
    }
}
With autopath, CoreDNS detects the client’s namespace and optimizes the search path. Instead of 4 queries, often just 1-2.
Negative Cache Tuning ¶
Cache NXDOMAIN responses to avoid repeated failed lookups:
.:53 {
    cache {
        success 9984 30   # Cache successful responses for 30s
        denial 9984 5     # Cache NXDOMAIN for 5s
    }
}
Pod DNS Policy ¶
For pods that only need external DNS:
spec:
  dnsPolicy: Default  # Use node's DNS, not cluster DNS
For pods that need no DNS:
spec:
  dnsPolicy: None
  dnsConfig:
    nameservers:
      - 8.8.8.8
Application-Level Caching ¶
For high-frequency lookups, cache at the application level:
// Go: Use a custom resolver with caching
resolver := &net.Resolver{
    PreferGo: true,
    Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
        // Look up an in-process cache here before dialing out (sketch),
        // then fall back to a normal dial to the DNS server address.
        var d net.Dialer
        return d.DialContext(ctx, network, address)
    },
}
Or use a sidecar cache like dnsmasq for legacy applications.
Summary ¶
Kubernetes DNS becomes a bottleneck because:
| Factor | Impact |
|---|---|
| Single service VIP | All traffic through one point |
| ndots expansion | 1 lookup → 4+ queries |
| Conntrack entries | Table exhaustion under load |
| UDP packet loss | No built-in retry |
| Thundering herd | Concurrent startups overwhelm CoreDNS |
NodeLocal DNSCache fixes this by:
| Benefit | How |
|---|---|
| Local resolution | No cross-node traffic for cache hits |
| No service VIP | Bypasses iptables/IPVS bottleneck |
| Reduced conntrack | NOTRACK rules skip connection tracking |
| Resilience | Cached entries survive CoreDNS issues |
For any cluster running batch jobs, gang scheduling, or high pod churn, NodeLocal DNSCache is essential. The thundering herd problem is real, and a local cache is the most effective solution.