Your cluster has 12 admission webhooks. Pod creation takes 4 seconds. Sometimes it times out. Nobody knows which webhook is the problem, or even what all these webhooks do. Welcome to webhook sprawl.
This post covers how to diagnose webhook problems, harden configurations, and maintain consistency across a multi-cluster fleet.
The Webhook Accumulation Problem ¶
Every tool wants an admission webhook:
- Policy engines: Kyverno, OPA/Gatekeeper, Kubewarden
- Service meshes: Istio, Linkerd (sidecar injection)
- Security: Vault (secret injection), Falco, image scanners
- Certificates: cert-manager
- Platform tooling: Custom mutating webhooks for labels, resource defaults, etc.
Each webhook seems reasonable in isolation. But they accumulate:
$ kubectl get mutatingwebhookconfigurations
NAME                        WEBHOOKS   AGE
cert-manager-webhook        1          180d
istio-sidecar-injector      1          90d
kyverno-resource-mutating   1          60d
vault-agent-injector        1          45d
team-a-defaults             1          30d
team-b-image-rewriter       1          14d

$ kubectl get validatingwebhookconfigurations
NAME                            WEBHOOKS   AGE
cert-manager-webhook            1          180d
gatekeeper-validating-webhook   1          120d
kyverno-resource-validating     1          60d
team-c-compliance-checker       1          21d
Symptoms you’ll see:
- Pod creation latency measured in seconds
- Intermittent API timeouts
- “Connection refused” errors when webhooks are overwhelmed
- Mysterious admission rejections (“admission webhook denied the request” — but which one?)
- 3am pages when a webhook goes down
How Admission Webhooks Actually Work ¶
Understanding the mechanics helps diagnose problems.
The Admission Chain ¶
When you create a resource, the API server processes it through a chain:
Client Request
      |
      v
Authentication
      |
      v
Authorization
      |
      v
Mutating Admission Webhooks (in order)
  Webhook 1 -> Webhook 2 -> Webhook 3
      |
      v
Object Schema Validation
      |
      v
Validating Admission Webhooks (parallel)
  Webhook A    Webhook B    Webhook C
      |
      v
Persist to etcd
Key points:
- Mutating webhooks run serially, in the order defined by their configurations. Each one can modify the object before passing it to the next.
- Validating webhooks run in parallel (mostly). They can only accept or reject, never modify.
- If any webhook rejects, the entire request fails.
- If any webhook times out or errors, behavior depends on failurePolicy.
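To make the mutate-versus-validate distinction concrete, here is a minimal sketch of the AdmissionReview response a mutating webhook sends back. The uid must echo the request's uid, and patch carries a base64-encoded JSONPatch (angle brackets mark placeholders):

{
  "apiVersion": "admission.k8s.io/v1",
  "kind": "AdmissionReview",
  "response": {
    "uid": "<uid copied from request.uid>",
    "allowed": true,
    "patchType": "JSONPatch",
    "patch": "<base64 of a JSONPatch such as [{\"op\":\"add\",\"path\":\"/metadata/labels/team\",\"value\":\"a\"}]>"
  }
}

A validating webhook returns the same envelope without patch and patchType; to reject, it sets allowed to false plus a status.message explaining why.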
The Reinvocation Trap ¶
Here’s a subtle issue: after mutating webhooks run, if the object was modified, validating webhooks see the mutated version. But there’s more.
If a mutating webhook modifies the object, the API server may reinvoke earlier mutating webhooks (those that opted in with reinvocationPolicy: IfNeeded) so they can react to the final state. This can cause:
- Unexpected latency (webhooks called multiple times)
- Ordering surprises (Webhook A runs, then B mutates, then A runs again)
- Mutation ping-pong (A mutates, B reacts, A reacts again; the API server caps this at one reinvocation per webhook per admission, but the end state can still surprise you)
The reinvocationPolicy field controls this:
webhooks:
- name: my-webhook.example.com
  reinvocationPolicy: Never      # Default - never reinvoked
  # or
  reinvocationPolicy: IfNeeded   # May be reinvoked if a later webhook mutates the object
Timeout Behavior ¶
Each webhook has a timeout. The default is 10 seconds (was 30 seconds in older Kubernetes).
webhooks:
- name: my-webhook.example.com
  timeoutSeconds: 5  # Fail fast
If a webhook doesn’t respond in time:
- failurePolicy: Fail → Request rejected
- failurePolicy: Ignore → Webhook skipped, request continues
With 10 webhooks at 10 seconds each, worst case is 100 seconds before timeout. In practice, the API server has its own overall timeout (~60s default), so you’ll hit that first.
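Before tuning, it helps to see what is actually configured. A quick sketch using jsonpath (an empty value means the field is unset and the 10-second default applies):

# Configuration name and each webhook's timeoutSeconds
kubectl get mutatingwebhookconfigurations -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.webhooks[*].timeoutSeconds}{"\n"}{end}'
kubectl get validatingwebhookconfigurations -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.webhooks[*].timeoutSeconds}{"\n"}{end}'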
Diagnosing Webhook Problems ¶
API Server Metrics ¶
The API server exposes detailed webhook metrics. These are your primary diagnostic tool.
Webhook latency:
# P99 latency per webhook
histogram_quantile(0.99,
  sum by (le, name, operation) (
    rate(apiserver_admission_webhook_admission_duration_seconds_bucket[5m])
  )
)
Webhook rejection rate:
# Rejections per webhook
sum(rate(apiserver_admission_webhook_rejection_count[5m])) by (name, error_type)
Webhook fail-open rate (webhooks that timed out or errored and were skipped under failurePolicy: Ignore):
sum(rate(apiserver_admission_webhook_fail_open_count[5m])) by (name)
Create a dashboard with:
- Latency heatmap per webhook
- Rejection rate over time
- Failure/timeout rate
- Request volume per webhook
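To get paged before users notice, wrap the latency query in an alert. A minimal sketch for a standard Prometheus rules file; the 1-second threshold is an assumption to tune against your own latency budget:

groups:
- name: admission-webhooks
  rules:
  - alert: AdmissionWebhookSlow
    expr: |
      histogram_quantile(0.99,
        sum by (le, name) (
          rate(apiserver_admission_webhook_admission_duration_seconds_bucket[5m])
        )
      ) > 1
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Admission webhook {{ $labels.name }} p99 latency above 1s"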
Identifying Slow Webhooks ¶
High P99 latency on a specific webhook? Dig deeper:
# Check webhook endpoint health
kubectl get mutatingwebhookconfiguration <name> -o jsonpath='{.webhooks[*].clientConfig.service}'
# Check the backing service
kubectl get pods -n <namespace> -l app=<webhook-app>
kubectl logs -n <namespace> -l app=<webhook-app> --tail=100
Common causes of slow webhooks:
- Webhook does external calls (API, database) synchronously
- Webhook has insufficient resources (CPU throttling)
- Webhook is overloaded (not enough replicas)
- Network latency to webhook service
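CPU throttling in particular is easy to confirm if you scrape cAdvisor metrics. A sketch; my-webhook is a placeholder for the actual container name:

# Fraction of CFS periods in which the webhook container was throttled
sum(rate(container_cpu_cfs_throttled_periods_total{container="my-webhook"}[5m]))
  /
sum(rate(container_cpu_cfs_periods_total{container="my-webhook"}[5m]))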
“Which Webhook Rejected My Pod?” ¶
The API server error message is often unhelpful:
Error from server: admission webhook "webhook.example.com" denied the request: [error details]
If it doesn’t say which webhook, or the error is generic:
Step 1: Check recent events
kubectl get events --field-selector reason=FailedCreate --sort-by='.lastTimestamp'
Step 2: Enable API server audit logging
Audit logs capture which webhooks were called and their responses:
# Audit policy to log admission decisions
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: RequestResponse
  resources:
  - group: ""
    resources: ["pods"]
  omitStages:
  - RequestReceived
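Once audit logging is enabled, each entry records the final response status, and recent Kubernetes versions also annotate events with which mutating webhooks patched the object. A sketch for picking out denied pod requests, assuming JSON-lines audit logs at the path set by --audit-log-path:

# Show denied pod requests with their messages and admission annotations
jq -c 'select(.objectRef.resource == "pods" and .responseStatus.code >= 400)
       | {message: .responseStatus.message, annotations: .annotations}' \
   /var/log/kubernetes/audit.log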
Step 3: Dry-run the request
Kubernetes 1.18+ supports server-side dry-run:
kubectl apply -f pod.yaml --dry-run=server -v=8
A rejection during dry-run names the offending webhook in the error, and the verbose output includes the full API response.
Step 4: Binary search
If you’re desperate, temporarily disable webhooks one by one to find the culprit:
# Add a namespaceSelector that matches nothing
kubectl patch mutatingwebhookconfiguration suspect-webhook --type='json' -p='[{"op": "add", "path": "/webhooks/0/namespaceSelector", "value": {"matchLabels": {"nonexistent": "label"}}}]'
(Don’t do this in production without understanding the consequences.)
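If you do disable one this way, revert as soon as you have your answer. A sketch of the inverse patch, assuming the webhook had no namespaceSelector before you added one (if it did, restore the original instead):

kubectl patch mutatingwebhookconfiguration suspect-webhook --type='json' \
  -p='[{"op": "remove", "path": "/webhooks/0/namespaceSelector"}]'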
Tracing a Request ¶
For deep debugging, trace a single request through the webhook chain.
If you have distributed tracing (Jaeger, Zipkin), ensure your webhooks propagate trace headers. The API server doesn't initiate traces by default, but your webhooks can create spans.
Quick tracing with curl:
# Get API server address and token
API_SERVER=$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')
TOKEN=$(kubectl create token default)
# Create pod with timing
time curl -X POST \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d @pod.json \
"$API_SERVER/api/v1/namespaces/default/pods?dryRun=All" \
-w "\nTotal time: %{time_total}s\n"
Hardening Webhook Configurations ¶
Timeout: Don’t Default to 10 Seconds ¶
10 seconds is an eternity for an admission decision. If your webhook needs 10 seconds, something is wrong.
webhooks:
- name: fast-webhook.example.com
  timeoutSeconds: 3  # Be aggressive
Guidelines:
- Simple validation: 1-2 seconds
- Mutation with no external calls: 2-3 seconds
- External calls (policy checks, etc.): 3-5 seconds max
- More than 5 seconds: Reconsider your architecture
failurePolicy: The Tradeoff ¶
webhooks:
- name: my-webhook.example.com
  failurePolicy: Fail    # Reject if webhook fails
  # or
  failurePolicy: Ignore  # Skip webhook if it fails
Fail (default):
- Webhook down → API requests rejected
- Safer for security-critical webhooks
- Risk: Webhook failure blocks the entire cluster
Ignore:
- Webhook down → Requests proceed without webhook
- Better for availability
- Risk: Security policies bypassed during outages
Recommendation:
- Security-critical (policy enforcement): Fail, but ensure high availability (see the sketch below)
- Nice-to-have mutations (adding labels): Ignore
- Development/testing webhooks: Ignore
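"Ensure high availability" is concrete work: run at least two or three replicas of the webhook backend, spread them across nodes, and guard against voluntary eviction. A minimal sketch, assuming a hypothetical my-webhook Deployment with 3 replicas:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-webhook
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-webhook  # hypothetical label on the webhook Deployment's pods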
Scope Your Webhooks ¶
Don’t intercept everything:
webhooks:
- name: my-webhook.example.com
  # Only match specific namespaces
  namespaceSelector:
    matchExpressions:
    - key: webhook.example.com/enabled
      operator: In
      values: ["true"]
  # Only match specific resources
  rules:
  - apiGroups: ["apps"]
    apiVersions: ["v1"]
    operations: ["CREATE", "UPDATE"]
    resources: ["deployments"]
    scope: Namespaced
  # Only match objects with specific labels
  objectSelector:
    matchLabels:
      webhook.example.com/process: "true"
Always exclude system namespaces:
namespaceSelector:
  matchExpressions:
  - key: kubernetes.io/metadata.name
    operator: NotIn
    values:
    - kube-system
    - kube-public
    - kube-node-lease
Filter by operation:
If you only care about CREATE, don’t intercept UPDATE:
rules:
- operations: ["CREATE"]  # Not ["CREATE", "UPDATE", "DELETE"]
  resources: ["pods"]
matchPolicy: Exact vs Equivalent ¶
webhooks:
- name: my-webhook.example.com
  matchPolicy: Equivalent  # Default
  # or
  matchPolicy: Exact
Equivalent matches a request even when it arrives via a different API version of the same resource: the API server converts the object to a version your rule lists (e.g., a rule for apps/v1 Deployments also catches requests made through apps/v1beta1). This is usually what you want.
Exact requires an exact API group/version match. Use this if your webhook logic is version-specific.
sideEffects Declaration ¶
Webhooks must declare their side effects:
webhooks:
- name: my-webhook.example.com
  sideEffects: None          # No side effects, safe for dry-run
  # or
  sideEffects: NoneOnDryRun  # Side effects only on real requests
None promises the webhook has no side effects at all, so the API server can safely call it for dry-run requests. NoneOnDryRun means the webhook has side effects but suppresses them itself when the request carries the dryRun flag. The v1 API only permits these two values; a webhook that can't make either guarantee would break dry-run.
Multi-Cluster Webhook Consistency ¶
Discovering What Exists ¶
First problem: knowing what webhooks exist across your fleet.
Quick audit script:
#!/bin/bash
for cluster in $(kubectl config get-contexts -o name); do
  echo "=== $cluster ==="
  kubectl --context="$cluster" get mutatingwebhookconfigurations \
    -o custom-columns=NAME:.metadata.name,WEBHOOKS:.webhooks[*].name
  kubectl --context="$cluster" get validatingwebhookconfigurations \
    -o custom-columns=NAME:.metadata.name,WEBHOOKS:.webhooks[*].name
  echo
done
Structured collection:
# Export webhook configs from all clusters
for cluster in $(kubectl config get-contexts -o name); do
  kubectl --context="$cluster" get mutatingwebhookconfigurations -o yaml > "webhooks-mutating-$cluster.yaml"
  kubectl --context="$cluster" get validatingwebhookconfigurations -o yaml > "webhooks-validating-$cluster.yaml"
done
# Diff them
diff webhooks-mutating-cluster1.yaml webhooks-mutating-cluster2.yaml
Detecting Drift ¶
Webhook drift happens when:
- Someone manually adds a webhook to one cluster
- A Helm upgrade fails on some clusters
- Different teams deploy different versions
Automated drift detection:
# On your hub cluster, define expected webhooks
apiVersion: v1
kind: ConfigMap
metadata:
  name: expected-webhooks
  namespace: fleet-system
data:
  mutating: |
    cert-manager-webhook
    istio-sidecar-injector
    kyverno-resource-mutating
  validating: |
    cert-manager-webhook
    kyverno-resource-validating
Then run periodic jobs that compare actual vs expected.
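The comparison itself can be a small script. A sketch, run from the hub cluster (assumes the expected-webhooks ConfigMap above and a kubeconfig context per cluster):

#!/bin/bash
# Expected names from the hub; actual names per cluster
expected=$(kubectl -n fleet-system get configmap expected-webhooks -o jsonpath='{.data.mutating}')
for cluster in $(kubectl config get-contexts -o name); do
  actual=$(kubectl --context="$cluster" get mutatingwebhookconfigurations \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')
  echo "=== $cluster ==="
  diff <(echo "$expected" | sort) <(echo "$actual" | sort) && echo "OK" || echo "DRIFT DETECTED"
done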
Propagating Webhooks via Fleet ¶
Webhook configurations are cluster-scoped resources. Propagate them like any other:
apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourcePlacement
metadata:
  name: webhook-configs
spec:
  resourceSelectors:
  - group: admissionregistration.k8s.io
    kind: MutatingWebhookConfiguration
    version: v1
    name: my-webhook
  - group: admissionregistration.k8s.io
    kind: ValidatingWebhookConfiguration
    version: v1
    name: my-webhook
  policy:
    placementType: PickAll
Caution: The webhook configuration references a Service (the webhook endpoint). That Service must exist in every cluster. Options:
- Webhook runs in every cluster: The Service is local. Propagate both the webhook workload and the configuration.
- Centralized webhook: All clusters call a central endpoint. Use url instead of service in the webhook config, as shown below. (Not recommended for latency-sensitive or high-volume webhooks.)
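For the centralized option, the clientConfig block takes a url instead of a service reference. A sketch; webhook.central.example.com is a placeholder for your real endpoint, whose CA must be in caBundle:

webhooks:
- name: my-webhook.example.com
  clientConfig:
    url: "https://webhook.central.example.com/validate"  # placeholder endpoint, reachable from every API server
    caBundle: <base64-encoded CA certificate>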
Canary Rollouts for Webhook Changes ¶
Webhook changes are risky. A bad config can break the entire cluster.
Staged rollout:
# Stage 1: Canary cluster only
apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourcePlacement
metadata:
  name: webhook-canary
spec:
  resourceSelectors:
  - group: admissionregistration.k8s.io
    kind: MutatingWebhookConfiguration
    version: v1
    name: new-webhook-v2
  policy:
    placementType: PickN
    numberOfClusters: 1
    affinity:
      clusterAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          clusterSelectorTerms:
          - labelSelector:
              matchLabels:
                environment: canary
Monitor the canary cluster. Check metrics. If healthy, expand:
# Stage 2: All non-prod
policy:
  placementType: PickAll
  affinity:
    clusterAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        clusterSelectorTerms:
        - labelSelector:
            matchExpressions:
            - key: environment
              operator: NotIn
              values: ["production"]
Then production.
Consolidation: Fewer Webhooks, Less Pain ¶
The best webhook is the one you don’t have.
Kyverno/OPA Can Replace Many Single-Purpose Webhooks ¶
Instead of:
- Webhook A: Require labels
- Webhook B: Enforce resource limits
- Webhook C: Disallow privileged pods
- Webhook D: Restrict registries
Use one policy engine:
# One Kyverno installation replaces 4 webhooks
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: all-the-things
spec:
  rules:
  - name: require-labels
    # ...
  - name: require-limits
    # ...
  - name: disallow-privileged
    # ...
  - name: restrict-registries
    # ...
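For a sense of what fills in those ellipses, here is one rule written out, shown as a standalone policy for completeness. This is a sketch using Kyverno's standard pattern syntax, not a drop-in policy:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-labels
spec:
  validationFailureAction: Enforce
  rules:
  - name: require-app-name-label
    match:
      any:
      - resources:
          kinds: ["Pod"]
    validate:
      message: "The label app.kubernetes.io/name is required."
      pattern:
        metadata:
          labels:
            app.kubernetes.io/name: "?*"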
Benefits:
- One webhook call instead of four
- Consistent policy language
- Unified reporting (PolicyReports)
- One thing to monitor and maintain
When to Keep Webhooks Separate ¶
- Different SLAs: A security webhook needs failurePolicy: Fail. A convenience webhook can use Ignore.
- Different ownership: Istio sidecar injection is owned by the platform team. Team-specific mutations should be separate.
- Different lifecycles: cert-manager upgrades shouldn’t require redeploying your custom policies.
Evaluating Webhook Necessity ¶
For each webhook, ask:
- What problem does this solve?
- Can it be solved another way (controller, policy engine)?
- What’s the latency impact?
- What happens if it fails?
- Who owns it?
Kill zombies: Webhooks installed years ago for a use case nobody remembers. If it doesn’t have an owner, it shouldn’t exist.
Summary ¶
Webhook sprawl is a real problem at scale. The fixes:
- Measure: Use API server metrics to understand latency and failure rates per webhook.
- Debug systematically: Know how to trace a request and identify which webhook is the problem.
- Harden configurations: Aggressive timeouts, appropriate failurePolicy, scoped selectors.
- Maintain consistency: Propagate configurations via Fleet, detect drift, canary changes.
- Consolidate: Fewer webhooks doing more beats many webhooks doing little.
Every webhook is a tax on every API request. Make sure each one is paying its way.