We had gang-scheduled jobs that performed DNS lookups at startup. If DNS resolution failed, the pod failed. If one pod in the gang failed, the entire gang restarted. Hundreds of pods restarting simultaneously meant hundreds of DNS queries hitting CoreDNS at once. CoreDNS couldn’t keep up, more pods failed, more restarts, more DNS queries—a cascading failure that took down our batch processing pipeline.
The fix: NodeLocal DNSCache. But understanding why it works requires understanding how Kubernetes DNS works and why it breaks under load.
How Kubernetes DNS Works ¶
Every pod gets DNS configuration injected via /etc/resolv.conf:
$ cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
Let’s break this down:
The Nameserver ¶
10.96.0.10 is the ClusterIP of the kube-dns service (which points to CoreDNS pods):
$ kubectl get svc -n kube-system kube-dns
NAME       TYPE        CLUSTER-IP   PORT(S)
kube-dns   ClusterIP   10.96.0.10   53/UDP,53/TCP
All DNS queries from all pods go to this single VIP.
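To see which CoreDNS pods actually sit behind that VIP, check the service's endpoints (a quick orientation step; in most clusters the CoreDNS pods carry the k8s-app=kube-dns label):
$ kubectl get endpoints -n kube-system kube-dns
$ kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide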
Search Domains ¶
When you resolve a name like my-service, Kubernetes tries multiple suffixes:
1. my-service.default.svc.cluster.local
2. my-service.svc.cluster.local
3. my-service.cluster.local
4. my-service (absolute)
The ndots Setting ¶
ndots:5 means: if the name has fewer than 5 dots, try search domains first.
"my-service" (0 dots < 5) → try search domains first
"api.example.com" (2 dots < 5) → try search domains first!
"api.example.com." (trailing dot) → absolute, skip search domains
This is where things get expensive. A simple lookup for api.example.com generates:
1. api.example.com.default.svc.cluster.local → NXDOMAIN
2. api.example.com.svc.cluster.local → NXDOMAIN
3. api.example.com.cluster.local → NXDOMAIN
4. api.example.com → SUCCESS
Four queries for one resolution. And each query is a UDP packet through the cluster network.
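You can watch this expansion happen from inside a pod. A rough check, assuming an image with dig available (such as the dnsutils image used later):
# Show each intermediate search-domain attempt
dig +search +showsearch api.example.com
# Compare with an absolute name (trailing dot) that skips the search list
dig api.example.com.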
The DNS Resolution Path ¶
Here’s what happens when a pod resolves a name:
Pod (10.244.1.5)
   |
   | UDP packet to 10.96.0.10:53
   v
iptables/IPVS (kube-proxy rules)
   |
   | DNAT to CoreDNS pod IP
   v
CoreDNS Pod (10.244.0.10)
   |
   | Lookup in cache or forward upstream
   v
Response back through same path
Every DNS query:
- Goes through the pod’s network namespace
- Hits iptables/IPVS rules for the service
- Gets DNAT’d to a CoreDNS pod
- Creates a conntrack entry
- Returns through the same path
At scale, this becomes a bottleneck.
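To see the middle of that path on a node, inspect the kube-proxy rules for the kube-dns service. A quick look, assuming iptables mode (use ipvsadm if kube-proxy runs in IPVS mode):
# iptables mode: service/DNAT rules for the kube-dns VIP
iptables -t nat -S | grep 10.96.0.10
# IPVS mode: virtual server and its CoreDNS backends
ipvsadm -Ln | grep -A 3 10.96.0.10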
Why CoreDNS Becomes a Bottleneck ¶
Single Service VIP ¶
All cluster DNS traffic funnels through one service IP. Even with multiple CoreDNS replicas, every packet hits the same iptables/IPVS rules:
Pod A ──┐
Pod B ──┼──► 10.96.0.10 (kube-dns) ──► CoreDNS Pods
Pod C ──┤              │
Pod D ──┘              │
                       v
                 iptables/IPVS
              (single bottleneck)
Conntrack Table Pressure ¶
Every DNS query creates a conntrack entry (even for UDP). A typical default table size is 131072 entries. With thousands of pods doing DNS lookups:
$ cat /proc/sys/net/netfilter/nf_conntrack_count
128000 # Getting close to limit!
$ dmesg | grep conntrack
nf_conntrack: table full, dropping packet
Dropped packets = failed DNS queries = failed pods.
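Raising the conntrack ceiling buys headroom while you reduce the query volume. A hedged stopgap, not a fix (kube-proxy may manage this sysctl itself, and sensible limits depend on node memory):
# On the node, temporarily
sysctl -w net.netfilter.nf_conntrack_max=262144
# Persist across reboots
echo 'net.netfilter.nf_conntrack_max = 262144' > /etc/sysctl.d/99-conntrack.conf
sysctl --system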
Thundering Herd ¶
Our gang scheduling scenario:
- 100-pod gang starts
- Each pod does 3 DNS lookups at startup
- Each lookup expands to 4 queries (ndots)
- 100 × 3 × 4 = 1,200 DNS queries in milliseconds
If CoreDNS can’t respond fast enough, queries time out (default: 5 seconds). Pods fail, gang restarts, another 1,200 queries. CoreDNS falls further behind. Cascade.
Gang starts
     |
     v
1,200 DNS queries ──► CoreDNS overwhelmed
     |                        |
     v                        v
 Timeouts                Queue grows
     |                        |
     v                        v
 Pods fail             Latency increases
     |                        |
     v                        v
Gang restarts ────────► More queries
     |
     v
 Cascade
UDP Packet Loss ¶
Under load, UDP packets get dropped:
- Kernel socket buffer overflow
- Network interface queue overflow
- iptables processing delays
Unlike TCP, UDP has no built-in retransmission. A dropped query is simply lost until the resolver's timeout expires and it retries, adding seconds of latency.
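Those drops show up in the kernel's UDP counters. A quick check on a node (exact counter names vary by kernel and tooling):
# Look for receive buffer errors / packet receive errors
netstat -su | grep -i error
cat /proc/net/snmp | grep Udp: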
Debugging DNS Issues ¶
Symptoms ¶
- Pod startup failures with DNS errors
- Slow service-to-service communication
- Intermittent connection timeouts
- CoreDNS pods showing high CPU
CoreDNS Metrics ¶
CoreDNS exposes Prometheus metrics:
# Request rate
rate(coredns_dns_requests_total[5m])
# Error rate
rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m])
# Latency
histogram_quantile(0.99, rate(coredns_dns_request_duration_seconds_bucket[5m]))
# Cache hit rate
rate(coredns_cache_hits_total[5m]) /
(rate(coredns_cache_hits_total[5m]) + rate(coredns_cache_misses_total[5m]))
Warning signs:
- Request rate spiking
- Latency p99 > 100ms
- SERVFAIL responses increasing
- Cache hit rate dropping
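If Prometheus isn't scraping CoreDNS yet, you can read the same counters by hand. This assumes the default Corefile enables the prometheus plugin on :9153:
# Port-forward to CoreDNS's metrics port
kubectl -n kube-system port-forward deployment/coredns 9153:9153
# In another terminal
curl -s http://localhost:9153/metrics | grep -E 'coredns_dns_requests_total|coredns_cache'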
Testing from a Pod ¶
# Run a debug pod
kubectl run debug --image=busybox --rm -it -- sh
# Test DNS resolution
nslookup kubernetes.default
nslookup google.com
# Measure timing
time nslookup google.com
# Check resolv.conf
cat /etc/resolv.conf
# Verbose DNS query
nslookup -debug kubernetes.default
Using dig ¶
# Install dig (dnsutils)
kubectl run debug --image=tutum/dnsutils --rm -it -- bash
# Query with timing
dig kubernetes.default.svc.cluster.local
# Query CoreDNS directly
dig @10.96.0.10 kubernetes.default.svc.cluster.local
# See full query expansion
dig +search my-service
# Trace the resolution path
dig +trace google.com
Checking Conntrack ¶
On a node:
# Current connections
cat /proc/sys/net/netfilter/nf_conntrack_count
# Max connections
cat /proc/sys/net/netfilter/nf_conntrack_max
# Conntrack stats (look for drops)
conntrack -S
cpu=0 found=0 invalid=1234 ignore=5678 insert=0 insert_failed=100 drop=50
                                                                  ^^^^^^^
                                                                  Drops!
Packet Capture ¶
# On a node, capture DNS traffic
tcpdump -i any port 53 -nn
# Filter for specific pod
tcpdump -i any port 53 and host 10.244.1.5 -nn
# Save for analysis
tcpdump -i any port 53 -w dns.pcap
Mitigation Options ¶
Scale CoreDNS ¶
The obvious first step:
kubectl -n kube-system scale deployment coredns --replicas=5
Helps: More pods to handle queries.
Doesn’t solve: Traffic still funnels through service VIP. Conntrack pressure remains. Thundering herd still overwhelms.
Tune ndots ¶
Reduce query fan-out by lowering ndots:
apiVersion: v1
kind: Pod
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
With ndots:2, names with 2+ dots resolve directly:
"api.example.com" (2 dots >= 2) → resolve directly, no search domains
"my-service" (0 dots < 2) → still uses search domains
Helps: Reduces queries for external domains.
Doesn’t solve: Internal service lookups still expand. Thundering herd still a problem.
Use FQDNs ¶
Force absolute lookups with trailing dots:
// Instead of
http.Get("http://api.example.com/path")
// Use
http.Get("http://api.example.com./path") // Note trailing dot
Helps: Eliminates search domain expansion for that lookup.
Doesn’t solve: Requires code changes. Internal services still need search domains.
Increase CoreDNS Cache ¶
# CoreDNS Corefile
.:53 {
    cache 300   # Cache for 5 minutes instead of default 30s
    # ...
}
Helps: More cache hits, fewer upstream queries.
Doesn’t solve: Cold start thundering herd (nothing in cache yet).
The Fix: NodeLocal DNSCache ¶
NodeLocal DNSCache runs a DNS cache on every node. Pods query the local cache instead of the CoreDNS service.
Architecture ¶
Before (all traffic to CoreDNS):
Pod ──► kube-dns Service (10.96.0.10) ──► CoreDNS Pods
                      │
               (iptables/IPVS)
After (local cache):
Pod ──► NodeLocal DNS (169.254.20.10) ──► Cache Hit? ──► Response
              │                               │
              │                          Cache Miss
              │                               │
              │                               v
              │                         CoreDNS Pods
              │
     (runs on same node)
How It Works ¶
- DaemonSet: NodeLocal DNSCache runs on every node
- Link-local IP: Listens on 169.254.20.10 (node-local, no network hop)
- iptables rules: Redirect DNS traffic to local cache
- Cache: Serves cached responses instantly
- Upstream: Cache misses go to CoreDNS
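A quick way to poke at these pieces on a node (interface name and exact rules can vary by version, so treat this as a sketch): the listen address usually lives on a dummy interface, and you can query the cache directly:
# The link-local listen address (commonly a dummy interface named nodelocaldns)
ip addr show nodelocaldns
# Query the local cache directly
dig @169.254.20.10 kubernetes.default.svc.cluster.local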
Benefits ¶
No service VIP: Queries don’t go through iptables/IPVS for the kube-dns service.
No cross-node traffic: Cache hits are served locally.
No conntrack for local queries: NodeLocal DNSCache installs NOTRACK rules, so queries to the local cache skip connection tracking.
Survives CoreDNS issues: Cached entries still work if CoreDNS is temporarily unavailable.
Reduces CoreDNS load: Only cache misses reach CoreDNS.
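One rough way to see the conntrack benefit is to count DNS entries in the table before and after the rollout (numbers depend entirely on workload):
# On a node: conntrack entries for DNS traffic
conntrack -L 2>/dev/null | grep -c 'dport=53'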
Deploying NodeLocal DNSCache ¶
Prerequisites ¶
- Kubernetes 1.18+
- Know your cluster DNS IP (usually 10.96.0.10)
- Know your cluster domain (usually cluster.local)
Installation ¶
# Download the manifest
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml
# Replace placeholders
# __PILLAR__DNS__SERVER__ → your kube-dns ClusterIP (e.g., 10.96.0.10)
# __PILLAR__LOCAL__DNS__ → 169.254.20.10
# __PILLAR__DNS__DOMAIN__ → cluster.local
sed -i 's/__PILLAR__DNS__SERVER__/10.96.0.10/g' nodelocaldns.yaml
sed -i 's/__PILLAR__LOCAL__DNS__/169.254.20.10/g' nodelocaldns.yaml
sed -i 's/__PILLAR__DNS__DOMAIN__/cluster.local/g' nodelocaldns.yaml
# Apply
kubectl apply -f nodelocaldns.yaml
Verify DaemonSet ¶
$ kubectl get ds -n kube-system node-local-dns
NAME             DESIRED   CURRENT   READY   NODE SELECTOR
node-local-dns   50        50        50      <none>
$ kubectl get pods -n kube-system -l k8s-app=node-local-dns
NAME                   READY   STATUS    RESTARTS
node-local-dns-abc12   1/1     Running   0
node-local-dns-def34   1/1     Running   0
...
Update kubelet Configuration ¶
Pods need to use the local DNS. Update kubelet’s --cluster-dns flag:
# kubelet configuration
clusterDNS:
- 169.254.20.10 # NodeLocal DNS
Note that the kubelet change only affects pods created afterward. Alternatively, keep the existing kubelet config and let NodeLocal DNSCache's iptables rules intercept traffic destined for 10.96.0.10.
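If you rely on interception, you can spot-check the rules NodeLocal DNSCache installed on a node; it adds rules for both the link-local address and the kube-dns VIP (rule layout varies by version):
# Look for interception/NOTRACK rules referencing either address
iptables-save | grep -e 169.254.20.10 -e 10.96.0.10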
Verify It’s Working ¶
# Check pod's resolv.conf
kubectl run test --image=busybox --rm -it -- cat /etc/resolv.conf
nameserver 169.254.20.10 # Should show local DNS
# Or if using iptables interception:
nameserver 10.96.0.10 # Original, but traffic is redirected
# Test resolution
kubectl run test --image=busybox --rm -it -- nslookup kubernetes.default
Check NodeLocal DNS Metrics ¶
# Port-forward to a node-local-dns pod
kubectl port-forward -n kube-system pod/node-local-dns-abc12 9253:9253
# Check metrics
curl http://localhost:9253/metrics | grep coredns_cache
Results ¶
After deploying NodeLocal DNSCache:
Before:
- Gang scheduling failures due to DNS timeouts
- CoreDNS CPU at 80% during job spikes
- DNS p99 latency: 500ms+ during load
- Cascading failures from DNS-induced restarts
After:
- Gang scheduling stable
- CoreDNS CPU dropped to 20% (only cache misses)
- DNS p99 latency: <5ms (local cache hits)
- No more DNS-induced cascading failures
The local cache absorbs the thundering herd. Even if 100 pods start simultaneously on one node, the local cache serves repeated queries instantly.
Other DNS Optimizations ¶
autopath Plugin ¶
CoreDNS’s autopath plugin reduces search domain queries:
# Corefile
.:53 {
    autopath @kubernetes
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods verified   # autopath requires verified pod records
        fallthrough in-addr.arpa ip6.arpa
    }
}
With autopath, CoreDNS detects the client’s namespace and optimizes the search path. Instead of 4 queries, often just 1-2.
Negative Cache Tuning ¶
Cache NXDOMAIN responses to avoid repeated failed lookups:
.:53 {
    cache {
        success 9984 30   # Cache successful responses for 30s
        denial 9984 5     # Cache NXDOMAIN for 5s
    }
}
Pod DNS Policy ¶
For pods that only need external DNS:
spec:
  dnsPolicy: Default  # Use node's DNS, not cluster DNS
For pods that need no DNS:
spec:
  dnsPolicy: None
  dnsConfig:
    nameservers:
      - 8.8.8.8
Application-Level Caching ¶
For high-frequency lookups, cache at the application level:
// Go: Use a custom resolver with caching
resolver := &net.Resolver{
    PreferGo: true,
    Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
        // Look up an in-process cache here before dialing out (sketch),
        // then fall back to a normal dial to the DNS server address.
        var d net.Dialer
        return d.DialContext(ctx, network, address)
    },
}
Or use a sidecar cache like dnsmasq for legacy applications.
Summary ¶
Kubernetes DNS becomes a bottleneck because:
| Factor | Impact |
|---|---|
| Single service VIP | All traffic through one point |
| ndots expansion | 1 lookup → 4+ queries |
| Conntrack entries | Table exhaustion under load |
| UDP packet loss | No built-in retry |
| Thundering herd | Concurrent startups overwhelm CoreDNS |
NodeLocal DNSCache fixes this by:
| Benefit | How |
|---|---|
| Local resolution | No cross-node traffic for cache hits |
| No service VIP | Bypasses iptables/IPVS bottleneck |
| Reduced conntrack | NOTRACK rules skip connection tracking |
| Resilience | Cached entries survive CoreDNS issues |
For any cluster running batch jobs, gang scheduling, or high pod churn, NodeLocal DNSCache is essential. The thundering herd problem is real, and a local cache is the most effective solution.