CoreDNS Under Pressure: How We Fixed DNS Bottlenecks with NodeLocal DNSCache


We had gang-scheduled jobs that performed DNS lookups at startup. If DNS resolution failed, the pod failed. If one pod in the gang failed, the entire gang restarted. Hundreds of pods restarting simultaneously meant hundreds of DNS queries hitting CoreDNS at once. CoreDNS couldn’t keep up, more pods failed, more restarts, more DNS queries—a cascading failure that took down our batch processing pipeline.

The fix: NodeLocal DNSCache. But understanding why it works requires understanding how Kubernetes DNS works and why it breaks under load.

Every pod gets DNS configuration injected via /etc/resolv.conf:

$ cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

Let’s break this down:

10.96.0.10 is the ClusterIP of the kube-dns service (which points to CoreDNS pods):

$ kubectl get svc -n kube-system kube-dns
NAME       TYPE        CLUSTER-IP   PORT(S)
kube-dns   ClusterIP   10.96.0.10   53/UDP,53/TCP

All DNS queries from all pods go to this single VIP.
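
Behind that VIP sit the CoreDNS pod endpoints, which you can list directly (the addresses below are illustrative; yours will differ):

$ kubectl get endpoints -n kube-system kube-dns
NAME       ENDPOINTS
kube-dns   10.244.0.10:53,10.244.0.11:53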

When you resolve a name like my-service, Kubernetes tries multiple suffixes:

1. my-service.default.svc.cluster.local
2. my-service.svc.cluster.local
3. my-service.cluster.local
4. my-service (absolute)

ndots:5 means: if the name has fewer than 5 dots, try search domains first.

"my-service" (0 dots < 5) → try search domains first
"api.example.com" (2 dots < 5) → try search domains first!
"api.example.com." (trailing dot) → absolute, skip search domains

This is where things get expensive. A simple lookup for api.example.com generates:

1. api.example.com.default.svc.cluster.local → NXDOMAIN
2. api.example.com.svc.cluster.local → NXDOMAIN
3. api.example.com.cluster.local → NXDOMAIN
4. api.example.com → SUCCESS

Four queries for one resolution. And each query is a UDP packet through the cluster network.
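
You can watch this expansion happen from inside a pod, assuming an image with dig available (+showsearch prints each intermediate query in the search walk):

# Run from a pod that has dig (see the dnsutils example later)
dig +search +showsearch api.example.com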

Here’s what happens when a pod resolves a name:

Pod (10.244.1.5)
    |
    | UDP packet to 10.96.0.10:53
    v
iptables/IPVS (kube-proxy rules)
    |
    | DNAT to CoreDNS pod IP
    v
CoreDNS Pod (10.244.0.10)
    |
    | Lookup in cache or forward upstream
    v
Response back through same path

Every DNS query:

  1. Goes through the pod’s network namespace
  2. Hits iptables/IPVS rules for the service
  3. Gets DNAT’d to a CoreDNS pod
  4. Creates a conntrack entry
  5. Returns through the same path

At scale, this becomes a bottleneck.
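
You can inspect that per-service plumbing on any node; a quick check, assuming kube-proxy in iptables mode (or IPVS mode for the second command):

# iptables mode: the chains every DNS packet traverses for the kube-dns VIP
iptables -t nat -L KUBE-SERVICES -n | grep 10.96.0.10

# IPVS mode: the virtual server entry for the kube-dns VIP
ipvsadm -Ln | grep -A 3 10.96.0.10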

All cluster DNS traffic funnels through one service IP. Even with multiple CoreDNS replicas, every packet hits the same iptables/IPVS rules:

    Pod A ──┐
    Pod B ──┼──► 10.96.0.10 (kube-dns) ──► CoreDNS Pods
    Pod C ──┤         │
    Pod D ──┘         │
                      v
              iptables/IPVS
              (single bottleneck)

Every DNS query creates a conntrack entry (even for UDP). The default table size is 131072 entries. With thousands of pods doing DNS lookups:

$ cat /proc/sys/net/netfilter/nf_conntrack_count
128000  # Getting close to limit!

$ dmesg | grep conntrack
nf_conntrack: table full, dropping packet

Dropped packets = failed DNS queries = failed pods.

Our gang scheduling scenario:

  • 100-pod gang starts
  • Each pod does 3 DNS lookups at startup
  • Each lookup expands to 4 queries (ndots)
  • 100 × 3 × 4 = 1,200 DNS queries in milliseconds

If CoreDNS can’t respond fast enough, queries time out (default: 5 seconds). Pods fail, gang restarts, another 1,200 queries. CoreDNS falls further behind. Cascade.

Gang starts
    |
    v
1,200 DNS queries ──► CoreDNS overwhelmed
    |                      |
    v                      v
Timeouts              Queue grows
    |                      |
    v                      v
Pods fail             Latency increases
    |                      |
    v                      v
Gang restarts ────────► More queries
    |
    v
  Cascade

Under load, UDP packets get dropped:

  • Kernel socket buffer overflow
  • Network interface queue overflow
  • iptables processing delays

Unlike TCP, UDP has no built-in retry. The application must handle retries, adding latency.

The symptoms of DNS pressure we saw:

  • Pod startup failures with DNS errors
  • Slow service-to-service communication
  • Intermittent connection timeouts
  • CoreDNS pods showing high CPU

CoreDNS exposes Prometheus metrics:

# Request rate
rate(coredns_dns_requests_total[5m])

# Error rate
rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m])

# Latency
histogram_quantile(0.99, rate(coredns_dns_request_duration_seconds_bucket[5m]))

# Cache hit rate
rate(coredns_cache_hits_total[5m]) / 
(rate(coredns_cache_hits_total[5m]) + rate(coredns_cache_misses_total[5m]))
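
If Prometheus isn’t scraping CoreDNS yet, you can spot-check the metrics endpoint directly; the prometheus plugin listens on port 9153 in the default Corefile:

# Port-forward to CoreDNS and pull raw metrics
kubectl port-forward -n kube-system deployment/coredns 9153:9153

# In another terminal
curl -s http://localhost:9153/metrics | grep -E 'coredns_dns_requests_total|coredns_cache'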

Warning signs:

  • Request rate spiking
  • Latency p99 > 100ms
  • SERVFAIL responses increasing
  • Cache hit rate dropping

To test resolution from inside the cluster, start a debug pod:

# Run a debug pod
kubectl run debug --image=busybox --rm -it -- sh

# Test DNS resolution
nslookup kubernetes.default
nslookup google.com

# Measure timing
time nslookup google.com

# Check resolv.conf
cat /etc/resolv.conf

# Verbose DNS query
nslookup -debug kubernetes.default

For more detail, use dig from an image that includes it:

# Run a pod that has dig preinstalled (dnsutils image)
kubectl run debug --image=tutum/dnsutils --rm -it -- bash

# Query with timing
dig kubernetes.default.svc.cluster.local

# Query CoreDNS directly
dig @10.96.0.10 kubernetes.default.svc.cluster.local

# See full query expansion
dig +search my-service

# Trace the resolution path
dig +trace google.com

On a node:

# Current connections
cat /proc/sys/net/netfilter/nf_conntrack_count

# Max connections
cat /proc/sys/net/netfilter/nf_conntrack_max

# Conntrack stats (look for drops)
conntrack -S
cpu=0       found=0 invalid=1234 ignore=5678 insert=0 insert_failed=100 drop=50
                                                                        ^^^^
                                                                        Drops!

To see the DNS traffic itself, capture it on a node with tcpdump:

# On a node, capture DNS traffic
tcpdump -i any port 53 -nn

# Filter for specific pod
tcpdump -i any port 53 and host 10.244.1.5 -nn

# Save for analysis
tcpdump -i any port 53 -w dns.pcap

With the problem confirmed, the obvious first step is to scale CoreDNS:

kubectl -n kube-system scale deployment coredns --replicas=5

Helps: More pods to handle queries.

Doesn’t solve: Traffic still funnels through service VIP. Conntrack pressure remains. Thundering herd still overwhelms.

Reduce query fan-out by lowering ndots:

apiVersion: v1
kind: Pod
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"

With ndots:2, names with 2+ dots resolve directly:

"api.example.com" (2 dots >= 2) → resolve directly, no search domains
"my-service" (0 dots < 2) → still uses search domains

Helps: Reduces queries for external domains.

Doesn’t solve: Internal service lookups still expand. Thundering herd still a problem.

Force absolute lookups with trailing dots:

// Instead of
http.Get("http://api.example.com/path")

// Use
http.Get("http://api.example.com./path")  // Note trailing dot

Helps: Eliminates search domain expansion for that lookup.

Doesn’t solve: Requires code changes. Internal services still need search domains.

Increase the CoreDNS cache TTL:

# CoreDNS Corefile
.:53 {
    cache 300  # Cache for 5 minutes instead of default 30s
    # ...
}
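
On most installations the Corefile lives in the coredns ConfigMap; if the default reload plugin is enabled, CoreDNS picks up the change automatically:

kubectl -n kube-system edit configmap coredns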

Helps: More cache hits, fewer upstream queries.

Doesn’t solve: Cold start thundering herd (nothing in cache yet).

NodeLocal DNSCache runs a DNS cache on every node. Pods query the local cache instead of the CoreDNS service.

Before (all traffic to CoreDNS):

Pod ──► kube-dns Service (10.96.0.10) ──► CoreDNS Pods
                    │
            (iptables/IPVS)


After (local cache):

Pod ──► NodeLocal DNS (169.254.20.10) ──► Cache Hit? ──► Response
             │                                │
             │                           Cache Miss
             │                                │
             │                                v
             │                          CoreDNS Pods
             │
        (runs on same node)

How it works:

  1. DaemonSet: NodeLocal DNSCache runs on every node
  2. Link-local IP: Listens on 169.254.20.10 (node-local, no network hop)
  3. iptables rules: Redirect DNS traffic to local cache
  4. Cache: Serves cached responses instantly
  5. Upstream: Cache misses go to CoreDNS

No service VIP: Queries don’t go through iptables/IPVS for the kube-dns service.

No cross-node traffic: Cache hits are served locally.

No conntrack for local queries: Link-local traffic doesn’t create conntrack entries.

Survives CoreDNS issues: Cached entries still work if CoreDNS is temporarily unavailable.

Reduces CoreDNS load: Only cache misses reach CoreDNS.
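
You can see the conntrack bypass on a node: the standard manifest installs NOTRACK rules in the raw table for the local listen addresses (exact rules vary by version):

# On a node running node-local-dns
iptables -t raw -S | grep -E '169.254.20.10|NOTRACK'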

Prerequisites:

  • Kubernetes 1.18+
  • Know your cluster DNS IP (usually 10.96.0.10; see the lookup below)
  • Know your cluster domain (usually cluster.local)
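
If you’re unsure of the DNS service IP, look it up (kube-dns is the conventional service name even when CoreDNS serves it):

kubectl get svc -n kube-system kube-dns -o jsonpath='{.spec.clusterIP}'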

Deploy the DaemonSet:

# Download the manifest
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml

# Replace placeholders
# __PILLAR__DNS__SERVER__ → your kube-dns ClusterIP (e.g., 10.96.0.10)
# __PILLAR__LOCAL__DNS__ → 169.254.20.10
# __PILLAR__DNS__DOMAIN__ → cluster.local

sed -i 's/__PILLAR__DNS__SERVER__/10.96.0.10/g' nodelocaldns.yaml
sed -i 's/__PILLAR__LOCAL__DNS__/169.254.20.10/g' nodelocaldns.yaml
sed -i 's/__PILLAR__DNS__DOMAIN__/cluster.local/g' nodelocaldns.yaml

# Apply
kubectl apply -f nodelocaldns.yaml

Verify the DaemonSet is running on every node:

$ kubectl get ds -n kube-system node-local-dns
NAME             DESIRED   CURRENT   READY   NODE SELECTOR
node-local-dns   50        50        50      <none>

$ kubectl get pods -n kube-system -l k8s-app=node-local-dns
NAME                   READY   STATUS    RESTARTS
node-local-dns-abc12   1/1     Running   0
node-local-dns-def34   1/1     Running   0
...

Pods need to point at the local cache. Update kubelet’s --cluster-dns flag (clusterDNS in the kubelet config file):

# kubelet configuration
clusterDNS:
  - 169.254.20.10  # NodeLocal DNS

Or for new pods only, keep existing kubelet config and let NodeLocal DNSCache’s iptables rules intercept traffic to 10.96.0.10.
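
In that interception mode, node-local-dns binds both addresses on a local dummy interface (interface name per the standard manifest), which you can check on a node:

ip addr show nodelocaldns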

# Check pod's resolv.conf
kubectl run test --image=busybox --rm -it -- cat /etc/resolv.conf
nameserver 169.254.20.10  # Should show local DNS

# Or if using iptables interception:
nameserver 10.96.0.10  # Original, but traffic is redirected

# Test resolution
kubectl run test --image=busybox --rm -it -- nslookup kubernetes.default

Check the cache metrics exposed by node-local-dns:

# Port-forward to a node-local-dns pod
kubectl port-forward -n kube-system pod/node-local-dns-abc12 9253:9253

# Check metrics
curl http://localhost:9253/metrics | grep coredns_cache

After deploying NodeLocal DNSCache:

Before:

  • Gang scheduling failures due to DNS timeouts
  • CoreDNS CPU at 80% during job spikes
  • DNS p99 latency: 500ms+ during load
  • Cascading failures from DNS-induced restarts

After:

  • Gang scheduling stable
  • CoreDNS CPU dropped to 20% (only cache misses)
  • DNS p99 latency: <5ms (local cache hits)
  • No more DNS-induced cascading failures

The local cache absorbs the thundering herd. Even if 100 pods start simultaneously on one node, the local cache serves repeated queries instantly.

CoreDNS’s autopath plugin reduces search domain queries:

# Corefile
.:53 {
    autopath @kubernetes
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods verified   # autopath requires the "pods verified" mode
        fallthrough in-addr.arpa ip6.arpa
    }
}

With autopath, CoreDNS detects the client’s namespace and optimizes the search path. Instead of 4 queries, often just 1-2.

Cache NXDOMAIN responses to avoid repeated failed lookups:

.:53 {
    cache {
        success 9984 30  # Cache successful responses for 30s
        denial 9984 5    # Cache NXDOMAIN for 5s
    }
}

For pods that only need external DNS:

spec:
  dnsPolicy: Default  # Use node's DNS, not cluster DNS
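
A quick way to see the difference (the pod name is arbitrary; the pod should inherit the node’s resolv.conf):

kubectl run ext-test --image=busybox --rm -it \
  --overrides='{"apiVersion":"v1","spec":{"dnsPolicy":"Default"}}' -- cat /etc/resolv.conf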

For pods that need no DNS:

spec:
  dnsPolicy: None
  dnsConfig:
    nameservers:
      - 8.8.8.8

For high-frequency lookups, cache at the application level:

// Go: a custom resolver gives you a hook into how lookups are performed;
// the caching itself is typically done by memoizing results with a TTL.
resolver := &net.Resolver{
    PreferGo: true,
    Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
        d := net.Dialer{Timeout: 2 * time.Second}
        return d.DialContext(ctx, network, address)
    },
}

Or use a sidecar cache like dnsmasq for legacy applications.

Kubernetes DNS becomes a bottleneck because:

Factor              Impact
------------------  -------------------------------------
Single service VIP  All traffic through one point
ndots expansion     1 lookup → 4+ queries
Conntrack entries   Table exhaustion under load
UDP packet loss     No built-in retry
Thundering herd     Concurrent startups overwhelm CoreDNS

NodeLocal DNSCache fixes this by:

Benefit            How
-----------------  --------------------------------------
Local resolution   No cross-node traffic for cache hits
No service VIP     Bypasses the iptables/IPVS bottleneck
Reduced conntrack  Link-local traffic isn’t tracked
Resilience         Cached entries survive CoreDNS issues

For any cluster running batch jobs, gang scheduling, or high pod churn, NodeLocal DNSCache is essential. The thundering herd problem is real, and a local cache is the most effective solution.