Your cluster has 12 admission webhooks. Pod creation takes 4 seconds. Sometimes it times out. Nobody knows which webhook is the problem, or even what all these webhooks do. Welcome to webhook sprawl.
This post covers how to diagnose webhook problems, harden configurations, and maintain consistency across a multi-cluster fleet.
The Webhook Accumulation Problem ¶
Every tool wants an admission webhook:
- Policy engines: Kyverno, OPA/Gatekeeper, Kubewarden
- Service meshes: Istio, Linkerd (sidecar injection)
- Security: Vault (secret injection), Falco, image scanners
- Certificates: cert-manager
- Platform tooling: Custom mutating webhooks for labels, resource defaults, etc.
Each webhook seems reasonable in isolation. But they accumulate:
$ kubectl get mutatingwebhookconfigurations
NAME                        WEBHOOKS   AGE
cert-manager-webhook        1          180d
istio-sidecar-injector      1          90d
kyverno-resource-mutating   1          60d
vault-agent-injector        1          45d
team-a-defaults             1          30d
team-b-image-rewriter       1          14d

$ kubectl get validatingwebhookconfigurations
NAME                            WEBHOOKS   AGE
cert-manager-webhook            1          180d
gatekeeper-validating-webhook   1          120d
kyverno-resource-validating     1          60d
team-c-compliance-checker       1          21d
Symptoms you’ll see:
- Pod creation latency measured in seconds
- Intermittent API timeouts
- “Connection refused” errors when webhooks are overwhelmed
- Mysterious admission rejections (“admission webhook denied the request” — but which one?)
- 3am pages when a webhook goes down
How Admission Webhooks Actually Work ¶
Understanding the mechanics helps diagnose problems.
The Admission Chain ¶
When you create a resource, the API server processes it through a chain:
Client Request
      |
      v
Authentication
      |
      v
Authorization
      |
      v
Mutating Admission Webhooks (in order)
  Webhook 1 -> Webhook 2 -> Webhook 3
      |
      v
Object Schema Validation
      |
      v
Validating Admission Webhooks (parallel)
  Webhook A    Webhook B    Webhook C
      |
      v
Persist to etcd
Key points:
- Mutating webhooks run serially, in the order defined by their configurations. Each one can modify the object before passing it to the next.
- Validating webhooks run in parallel (mostly). They can only accept or reject, never modify.
- If any webhook rejects, the entire request fails.
- If any webhook times out or errors, behavior depends on failurePolicy.
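To make the mutate-versus-validate distinction concrete, here is a minimal sketch of the AdmissionReview response a mutating webhook sends back. The uid must echo the request's uid, and patch carries a base64-encoded JSONPatch (angle brackets mark placeholders):

{
  "apiVersion": "admission.k8s.io/v1",
  "kind": "AdmissionReview",
  "response": {
    "uid": "<uid copied from request.uid>",
    "allowed": true,
    "patchType": "JSONPatch",
    "patch": "<base64 of a JSONPatch such as [{\"op\":\"add\",\"path\":\"/metadata/labels/team\",\"value\":\"a\"}]>"
  }
}

A validating webhook returns the same envelope without patch and patchType; to reject, it sets allowed to false plus a status.message explaining why.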
The Reinvocation Trap ¶
Here’s a subtle issue: after mutating webhooks run, if the object was modified, validating webhooks see the mutated version. But there’s more.
If a mutating webhook modifies the object, the API server may reinvoke earlier mutating webhooks (those that opted in with reinvocationPolicy: IfNeeded) so they can react to the final state. This can cause:
- Unexpected latency (webhooks called multiple times)
- Ordering surprises (Webhook A runs, then B mutates, then A runs again)
- Mutation ping-pong (A mutates, B reacts, A reacts again; the API server caps this at one reinvocation per webhook per admission, but the end state can still surprise you)
The reinvocationPolicy field controls this:
webhooks:
- name: my-webhook.example.com
  reinvocationPolicy: Never      # Default - never reinvoked
  # or
  reinvocationPolicy: IfNeeded   # May be reinvoked if a later webhook mutates the object
Timeout Behavior ¶
Each webhook has a timeout. The default is 10 seconds (was 30 seconds in older Kubernetes).
webhooks:
- name: my-webhook.example.com
  timeoutSeconds: 5  # Fail fast
If a webhook doesn’t respond in time:
- failurePolicy: Fail → Request rejected
- failurePolicy: Ignore → Webhook skipped, request continues
With 10 webhooks at 10 seconds each, worst case is 100 seconds before timeout. In practice, the API server has its own overall timeout (~60s default), so you’ll hit that first.
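Before tuning, it helps to see what is actually configured. A quick sketch using jsonpath (an empty value means the field is unset and the 10-second default applies):

# Configuration name and each webhook's timeoutSeconds
kubectl get mutatingwebhookconfigurations -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.webhooks[*].timeoutSeconds}{"\n"}{end}'
kubectl get validatingwebhookconfigurations -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.webhooks[*].timeoutSeconds}{"\n"}{end}'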
Diagnosing Webhook Problems ¶
API Server Metrics ¶
The API server exposes detailed webhook metrics. These are your primary diagnostic tool.
Webhook latency:
# P99 latency per webhook
histogram_quantile(0.99,
  sum by (le, name, operation) (
    rate(apiserver_admission_webhook_admission_duration_seconds_bucket[5m])
  )
)
Webhook rejection rate:
# Rejections per webhook
sum(rate(apiserver_admission_webhook_rejection_count[5m])) by (name, error_type)
Webhook fail-open rate (webhooks that timed out or errored and were skipped under failurePolicy: Ignore):
sum(rate(apiserver_admission_webhook_fail_open_count[5m])) by (name)
Create a dashboard with:
- Latency heatmap per webhook
- Rejection rate over time
- Failure/timeout rate
- Request volume per webhook
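To get paged before users notice, wrap the latency query in an alert. A minimal sketch for a standard Prometheus rules file; the 1-second threshold is an assumption to tune against your own latency budget:

groups:
- name: admission-webhooks
  rules:
  - alert: AdmissionWebhookSlow
    expr: |
      histogram_quantile(0.99,
        sum by (le, name) (
          rate(apiserver_admission_webhook_admission_duration_seconds_bucket[5m])
        )
      ) > 1
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Admission webhook {{ $labels.name }} p99 latency above 1s"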
Identifying Slow Webhooks ¶
High P99 latency on a specific webhook? Dig deeper:
# Check webhook endpoint health
kubectl get mutatingwebhookconfiguration <name> -o jsonpath='{.webhooks[*].clientConfig.service}'
# Check the backing service
kubectl get pods -n <namespace> -l app=<webhook-app>
kubectl logs -n <namespace> -l app=<webhook-app> --tail=100
Common causes of slow webhooks:
- Webhook does external calls (API, database) synchronously
- Webhook has insufficient resources (CPU throttling)
- Webhook is overloaded (not enough replicas)
- Network latency to webhook service
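CPU throttling in particular is easy to confirm if you scrape cAdvisor metrics. A sketch; my-webhook is a placeholder for the actual container name:

# Fraction of CFS periods in which the webhook container was throttled
sum(rate(container_cpu_cfs_throttled_periods_total{container="my-webhook"}[5m]))
  /
sum(rate(container_cpu_cfs_periods_total{container="my-webhook"}[5m]))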
“Which Webhook Rejected My Pod?” ¶
The API server error message is often unhelpful:
Error from server: admission webhook "webhook.example.com" denied the request: [error details]
If it doesn’t say which webhook, or the error is generic:
Step 1: Check recent events
kubectl get events --field-selector reason=FailedCreate --sort-by='.lastTimestamp'
Step 2: Enable API server audit logging
Audit logs capture which webhooks were called and their responses:
# Audit policy to log admission decisions
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: RequestResponse
  resources:
  - group: ""
    resources: ["pods"]
  omitStages:
  - RequestReceived
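Once audit logging is enabled, each entry records the final response status, and recent Kubernetes versions also annotate events with which mutating webhooks patched the object. A sketch for picking out denied pod requests, assuming JSON-lines audit logs at the path set by --audit-log-path:

# Show denied pod requests with their messages and admission annotations
jq -c 'select(.objectRef.resource == "pods" and .responseStatus.code >= 400)
       | {message: .responseStatus.message, annotations: .annotations}' \
   /var/log/kubernetes/audit.log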
Step 3: Dry-run the request
Kubernetes 1.18+ supports server-side dry-run:
kubectl apply -f pod.yaml --dry-run=server -v=8
A rejection during dry-run names the offending webhook in the error, and the verbose output includes the full API response.
Step 4: Binary search
If you’re desperate, temporarily disable webhooks one by one to find the culprit:
# Add a namespaceSelector that matches nothing
kubectl patch mutatingwebhookconfiguration suspect-webhook --type='json' -p='[{"op": "add", "path": "/webhooks/0/namespaceSelector", "value": {"matchLabels": {"nonexistent": "label"}}}]'
(Don’t do this in production without understanding the consequences.)
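If you do disable one this way, revert as soon as you have your answer. A sketch of the inverse patch, assuming the webhook had no namespaceSelector before you added one (if it did, restore the original instead):

kubectl patch mutatingwebhookconfiguration suspect-webhook --type='json' \
  -p='[{"op": "remove", "path": "/webhooks/0/namespaceSelector"}]'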
Tracing a Request ¶
For deep debugging, trace a single request through the webhook chain.
If you have distributed tracing (Jaeger, Zipkin), ensure your webhooks propagate trace headers. The API server doesn't initiate traces by default, but your webhooks can create spans.
Quick tracing with curl:
# Get API server address and token
API_SERVER=$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')
TOKEN=$(kubectl create token default)
# Create pod with timing
time curl -X POST \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d @pod.json \
"$API_SERVER/api/v1/namespaces/default/pods?dryRun=All" \
-w "\nTotal time: %{time_total}s\n"
Hardening Webhook Configurations ¶
Timeout: Don’t Default to 10 Seconds ¶
10 seconds is an eternity for an admission decision. If your webhook needs 10 seconds, something is wrong.
webhooks:
- name: fast-webhook.example.com
  timeoutSeconds: 3  # Be aggressive
Guidelines:
- Simple validation: 1-2 seconds
- Mutation with no external calls: 2-3 seconds
- External calls (policy checks, etc.): 3-5 seconds max
- More than 5 seconds: Reconsider your architecture
failurePolicy: The Tradeoff ¶
webhooks:
- name: my-webhook.example.com
  failurePolicy: Fail    # Reject if webhook fails
  # or
  failurePolicy: Ignore  # Skip webhook if it fails
Fail (default):
- Webhook down → API requests rejected
- Safer for security-critical webhooks
- Risk: Webhook failure blocks the entire cluster
Ignore:
- Webhook down → Requests proceed without webhook
- Better for availability
- Risk: Security policies bypassed during outages
Recommendation:
- Security-critical (policy enforcement): Fail, but ensure high availability (see the sketch below)
- Nice-to-have mutations (adding labels): Ignore
- Development/testing webhooks: Ignore
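"Ensure high availability" is concrete work: run at least two or three replicas of the webhook backend, spread them across nodes, and guard against voluntary eviction. A minimal sketch, assuming a hypothetical my-webhook Deployment with 3 replicas:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-webhook
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-webhook  # hypothetical label on the webhook Deployment's pods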
Scope Your Webhooks ¶
Don’t intercept everything:
webhooks:
- name: my-webhook.example.com
  # Only match specific namespaces
  namespaceSelector:
    matchExpressions:
    - key: webhook.example.com/enabled
      operator: In
      values: ["true"]
  # Only match specific resources
  rules:
  - apiGroups: ["apps"]
    apiVersions: ["v1"]
    operations: ["CREATE", "UPDATE"]
    resources: ["deployments"]
    scope: Namespaced
  # Only match objects with specific labels
  objectSelector:
    matchLabels:
      webhook.example.com/process: "true"
Always exclude system namespaces:
namespaceSelector:
  matchExpressions:
  - key: kubernetes.io/metadata.name
    operator: NotIn
    values:
    - kube-system
    - kube-public
    - kube-node-lease
Filter by operation:
If you only care about CREATE, don’t intercept UPDATE:
rules:
- operations: ["CREATE"]  # Not ["CREATE", "UPDATE", "DELETE"]
  resources: ["pods"]
matchPolicy: Exact vs Equivalent ¶
webhooks:
- name: my-webhook.example.com
  matchPolicy: Equivalent  # Default
  # or
  matchPolicy: Exact
Equivalent matches a request even when it arrives via a different API version of the same resource: the API server converts the object to a version your rule lists (e.g., a rule for apps/v1 Deployments also catches requests made through apps/v1beta1). This is usually what you want.
Exact requires an exact API group/version match. Use this if your webhook logic is version-specific.
sideEffects Declaration ¶
Webhooks must declare their side effects:
webhooks:
- name: my-webhook.example.com
  sideEffects: None          # No side effects, safe for dry-run
  # or
  sideEffects: NoneOnDryRun  # Side effects only on real requests
None promises the webhook has no side effects at all, so the API server can safely call it for dry-run requests. NoneOnDryRun means the webhook has side effects but suppresses them itself when the request carries the dryRun flag. The v1 API only permits these two values; a webhook that can't make either guarantee would break dry-run.
Multi-Cluster Webhook Consistency ¶
Discovering What Exists ¶
First problem: knowing what webhooks exist across your fleet.
Quick audit script:
#!/bin/bash
for cluster in $(kubectl config get-contexts -o name); do
  echo "=== $cluster ==="
  kubectl --context="$cluster" get mutatingwebhookconfigurations \
    -o custom-columns=NAME:.metadata.name,WEBHOOKS:.webhooks[*].name
  kubectl --context="$cluster" get validatingwebhookconfigurations \
    -o custom-columns=NAME:.metadata.name,WEBHOOKS:.webhooks[*].name
  echo
done
Structured collection:
# Export webhook configs from all clusters
for cluster in $(kubectl config get-contexts -o name); do
  kubectl --context="$cluster" get mutatingwebhookconfigurations -o yaml > "webhooks-mutating-$cluster.yaml"
  kubectl --context="$cluster" get validatingwebhookconfigurations -o yaml > "webhooks-validating-$cluster.yaml"
done
# Diff them
diff webhooks-mutating-cluster1.yaml webhooks-mutating-cluster2.yaml
Detecting Drift ¶
Webhook drift happens when:
- Someone manually adds a webhook to one cluster
- A Helm upgrade fails on some clusters
- Different teams deploy different versions
Automated drift detection:
# On your hub cluster, define expected webhooks
apiVersion: v1
kind: ConfigMap
metadata:
  name: expected-webhooks
  namespace: fleet-system
data:
  mutating: |
    cert-manager-webhook
    istio-sidecar-injector
    kyverno-resource-mutating
  validating: |
    cert-manager-webhook
    kyverno-resource-validating
Then run periodic jobs that compare actual vs expected.
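The comparison itself can be a small script. A sketch, run from the hub cluster (assumes the expected-webhooks ConfigMap above and a kubeconfig context per cluster):

#!/bin/bash
# Expected names from the hub; actual names per cluster
expected=$(kubectl -n fleet-system get configmap expected-webhooks -o jsonpath='{.data.mutating}')
for cluster in $(kubectl config get-contexts -o name); do
  actual=$(kubectl --context="$cluster" get mutatingwebhookconfigurations \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')
  echo "=== $cluster ==="
  diff <(echo "$expected" | sort) <(echo "$actual" | sort) && echo "OK" || echo "DRIFT DETECTED"
done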
Propagating Webhooks via Fleet ¶
Webhook configurations are cluster-scoped resources. Propagate them like any other:
apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourcePlacement
metadata:
  name: webhook-configs
spec:
  resourceSelectors:
  - group: admissionregistration.k8s.io
    kind: MutatingWebhookConfiguration
    version: v1
    name: my-webhook
  - group: admissionregistration.k8s.io
    kind: ValidatingWebhookConfiguration
    version: v1
    name: my-webhook
  policy:
    placementType: PickAll
Caution: The webhook configuration references a Service (the webhook endpoint). That Service must exist in every cluster. Options:
- Webhook runs in every cluster: The Service is local. Propagate both the webhook workload and the configuration.
- Centralized webhook: All clusters call a central endpoint. Use url instead of service in the webhook config, as shown below. (Not recommended for latency-sensitive or high-volume webhooks.)
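For the centralized option, the clientConfig block takes a url instead of a service reference. A sketch; webhook.central.example.com is a placeholder for your real endpoint, whose CA must be in caBundle:

webhooks:
- name: my-webhook.example.com
  clientConfig:
    url: "https://webhook.central.example.com/validate"  # placeholder endpoint, reachable from every API server
    caBundle: <base64-encoded CA certificate>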
Canary Rollouts for Webhook Changes ¶
Webhook changes are risky. A bad config can break the entire cluster.
Staged rollout:
# Stage 1: Canary cluster only
apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourcePlacement
metadata:
  name: webhook-canary
spec:
  resourceSelectors:
  - group: admissionregistration.k8s.io
    kind: MutatingWebhookConfiguration
    version: v1
    name: new-webhook-v2
  policy:
    placementType: PickN
    numberOfClusters: 1
    affinity:
      clusterAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          clusterSelectorTerms:
          - labelSelector:
              matchLabels:
                environment: canary
Monitor the canary cluster. Check metrics. If healthy, expand:
# Stage 2: All non-prod
policy:
  placementType: PickAll
  affinity:
    clusterAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        clusterSelectorTerms:
        - labelSelector:
            matchExpressions:
            - key: environment
              operator: NotIn
              values: ["production"]
Then production.
Consolidation: Fewer Webhooks, Less Pain ¶
The best webhook is the one you don’t have.
Kyverno/OPA Can Replace Many Single-Purpose Webhooks ¶
Instead of:
- Webhook A: Require labels
- Webhook B: Enforce resource limits
- Webhook C: Disallow privileged pods
- Webhook D: Restrict registries
Use one policy engine:
# One Kyverno installation replaces 4 webhooks
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: all-the-things
spec:
  rules:
  - name: require-labels
    # ...
  - name: require-limits
    # ...
  - name: disallow-privileged
    # ...
  - name: restrict-registries
    # ...
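For a sense of what fills in those ellipses, here is one rule written out, shown as a standalone policy for completeness. This is a sketch using Kyverno's standard pattern syntax, not a drop-in policy:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-labels
spec:
  validationFailureAction: Enforce
  rules:
  - name: require-app-name-label
    match:
      any:
      - resources:
          kinds: ["Pod"]
    validate:
      message: "The label app.kubernetes.io/name is required."
      pattern:
        metadata:
          labels:
            app.kubernetes.io/name: "?*"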
Benefits:
- One webhook call instead of four
- Consistent policy language
- Unified reporting (PolicyReports)
- One thing to monitor and maintain
When to Keep Webhooks Separate ¶
- Different SLAs: A security webhook needs failurePolicy: Fail. A convenience webhook can use Ignore.
- Different ownership: Istio sidecar injection is owned by the platform team. Team-specific mutations should be separate.
- Different lifecycles: cert-manager upgrades shouldn’t require redeploying your custom policies.
Evaluating Webhook Necessity ¶
For each webhook, ask:
- What problem does this solve?
- Can it be solved another way (controller, policy engine)?
- What’s the latency impact?
- What happens if it fails?
- Who owns it?
Kill zombies: Webhooks installed years ago for a use case nobody remembers. If it doesn’t have an owner, it shouldn’t exist.
Summary ¶
Webhook sprawl is a real problem at scale. The fixes:
- Measure: Use API server metrics to understand latency and failure rates per webhook.
- Debug systematically: Know how to trace a request and identify which webhook is the problem.
- Harden configurations: Aggressive timeouts, appropriate failurePolicy, scoped selectors.
- Maintain consistency: Propagate configurations via Fleet, detect drift, canary changes.
- Consolidate: Fewer webhooks doing more beats many webhooks doing little.
Every webhook is a tax on every API request. Make sure each one is paying its way.