You’ve got 20 clusters. You want consistent security policies across all of them. You roll out a Kyverno policy, and suddenly deployments are failing in production because someone’s legitimate workload doesn’t comply. Now multiply that panic by 20 clusters.
This post is about avoiding that scenario — deploying Kyverno policies across a multi-cluster fleet without breaking things.
The Problem
In a single cluster, policy management is straightforward: install Kyverno, write policies, done. But multi-cluster introduces real challenges:
Policy drift: Cluster-7 is running an old version of your policies. Cluster-12 has a policy someone added manually. Cluster-3 has an exception you forgot about. Nobody knows the actual state.
Blast radius: A bad policy update doesn’t break one cluster — it breaks all of them. Simultaneously. During business hours.
Exceptions at scale: Team A needs an exception in their namespace. Team B needs a different exception, but only in the staging cluster. How do you manage this without drowning in YAML?
Visibility: “Is this policy actually enforced everywhere?” shouldn’t require SSH’ing into 20 clusters.
Kyverno Essentials
If you’re new to Kyverno, here’s the minimum you need to follow this post.
Kyverno is a Kubernetes-native policy engine. It runs as an admission controller — when someone creates or updates a resource, Kyverno intercepts the request and decides whether to allow, modify, or reject it.
ClusterPolicy: Applies across all namespaces in a cluster.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-labels
spec:
validationFailureAction: Enforce # or Audit
rules:
- name: check-team-label
match:
any:
- resources:
kinds:
- Pod
validate:
message: "The label 'team' is required."
pattern:
metadata:
labels:
team: "?*"
validationFailureAction:
- Enforce: Block non-compliant resources (the admission request is rejected)
- Audit: Allow the resource but report the violation (a PolicyReport is created)
Policy types:
- Validate: Accept or reject based on rules
- Mutate: Modify resources on the fly (inject labels, add sidecars)
- Generate: Create companion resources (NetworkPolicy when namespace created)
- VerifyImages: Check image signatures
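For instance, a minimal mutation rule that adds a default team label to pods that do not set one might look like this (the policy name and default value are illustrative):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-default-team-label
spec:
  rules:
    - name: add-team-label
      match:
        any:
          - resources:
              kinds:
                - Pod
      mutate:
        patchStrategicMerge:
          metadata:
            labels:
              +(team): unassigned  # "+( )" anchor: add only if the label is not already present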
That’s enough background. Let’s talk about not breaking production.
Policies That Don’t Break Production
Start in Audit Mode. Always.
The single most important practice:
spec:
validationFailureAction: Audit # Start here
In Audit mode, Kyverno evaluates resources against your policy but doesn’t block anything. It creates PolicyReports documenting what would have been blocked.
The workflow:
- Deploy policy in Audit mode
- Wait (days, not hours — you need to see real traffic)
- Review PolicyReports for violations
- Fix legitimate workloads or adjust the policy
- Switch to Enforce
spec:
validationFailureAction: Enforce # Only after audit
Skipping this step is how you break production at 2am.
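Reviewing violations (step 3 above) is a kubectl query away: Kyverno writes namespaced PolicyReports and cluster-scoped ClusterPolicyReports.

# Namespaced violations
kubectl get policyreport -A

# Cluster-scoped violations
kubectl get clusterpolicyreport

# Pull out only the failures (the jq filter is illustrative)
kubectl get policyreport -A -o json | jq '.items[].results[]? | select(.result == "fail") | {policy, rule, message}'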
Exclude System Namespaces
Your policy shouldn’t block Kubernetes system components:
spec:
rules:
- name: require-resource-limits
match:
any:
- resources:
kinds:
- Pod
exclude:
any:
- resources:
namespaces:
- kube-system
- kube-public
- kube-node-lease
- kyverno
- fleet-system
- gatekeeper-system
Better yet, use a label-based exclusion that’s consistent across policies:
exclude:
any:
- resources:
selector:
matchLabels:
policy.example.com/exclude: "true"
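A workload opts out by carrying that label on its pods; for a Deployment, that means the pod template (names and image are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: legacy-batch
spec:
  replicas: 1
  selector:
    matchLabels:
      app: legacy-batch
  template:
    metadata:
      labels:
        app: legacy-batch
        policy.example.com/exclude: "true"  # matches the exclude selector above
    spec:
      containers:
        - name: worker
          image: mycompany.azurecr.io/legacy-batch:1.0

The trade-off: anyone who can set pod labels can opt out, so treat use of this label as something to review and alert on.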
PolicyExceptions: The Escape Hatch
Kyverno 1.9+ introduced PolicyException — a way to grant specific exemptions without modifying the policy itself.
apiVersion: kyverno.io/v2beta1
kind: PolicyException
metadata:
name: allow-privileged-monitoring
namespace: monitoring
spec:
exceptions:
- policyName: disallow-privileged
ruleNames:
- deny-privileged-containers
match:
any:
- resources:
kinds:
- Pod
namespaces:
- monitoring
names:
- node-exporter-*
This says: “The disallow-privileged policy doesn’t apply to pods named node-exporter-* in the monitoring namespace.”
Why this is better than policy modification:
- Policies stay clean and universal
- Exceptions are explicit and auditable
- You can track who requested what exception and why
- Deleting the exception re-enables enforcement
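A lightweight way to make that audit trail concrete is to carry the requester and ticket on the exception itself (the annotation keys and values are illustrative):

apiVersion: kyverno.io/v2beta1
kind: PolicyException
metadata:
  name: allow-privileged-monitoring
  namespace: monitoring
  annotations:
    policy.example.com/requested-by: team-observability
    policy.example.com/ticket: SEC-1234
    policy.example.com/review-by: "2025-06-30"
spec:
  # ... same exceptions and match blocks as above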
Staged Rollout
Don’t go from zero to fleet-wide enforcement in one step:
Stage 1: Single namespace in one cluster
spec:
rules:
- name: test-policy
match:
any:
- resources:
namespaces:
- policy-test
Stage 2: Audit mode, all namespaces, one cluster
spec:
validationFailureAction: Audit
rules:
- name: test-policy
match:
any:
- resources:
kinds:
- Pod
exclude:
# ... system namespaces
Stage 3: Enforce mode, one cluster
Stage 4: Audit mode, fleet-wide
Stage 5: Enforce mode, fleet-wide
At each stage, wait and watch. PolicyReports tell you what’s happening.
The “Oh Shit” Recovery
You deployed a bad policy. Deployments are failing. Here’s your emergency playbook:
Option 1: Switch to Audit (fast)
kubectl patch clusterpolicy bad-policy -p '{"spec":{"validationFailureAction":"Audit"}}' --type=merge
Option 2: Delete the policy (faster)
kubectl delete clusterpolicy bad-policy
Option 3: Kyverno failurePolicy (prevents this scenario)
When you install Kyverno, configure the webhook to fail open:
# In Kyverno Helm values
config:
webhooks:
- failurePolicy: Ignore # Fail open if Kyverno is down/slow
With Ignore, if Kyverno can’t evaluate a request (timeout, crash), Kubernetes allows the request. You lose enforcement temporarily, but deployments don’t break because your policy engine is having a bad day.
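You can check what actually landed on the cluster by inspecting the webhook configurations (the grep assumes Kyverno's default webhook naming, which includes "kyverno"):

kubectl get validatingwebhookconfigurations \
  -o custom-columns='NAME:.metadata.name,FAILURE_POLICY:.webhooks[*].failurePolicy' \
  | grep -i kyverno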
Multi-Cluster Propagation
Now for the multi-cluster part. You have policies that work. How do you deploy them consistently across a fleet?
Propagating Policies via KubeFleet
If you’re using KubeFleet (or Azure Kubernetes Fleet Manager), you can treat Kyverno ClusterPolicies like any other Kubernetes resource:
Step 1: Create policies on the hub cluster
# On hub cluster
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-resource-limits
  # ClusterPolicy is cluster-scoped; Fleet can select and place it directly, no namespace needed
spec:
validationFailureAction: Audit
rules:
- name: require-limits
match:
any:
- resources:
kinds:
- Pod
exclude:
any:
- resources:
namespaces:
- kube-system
- kyverno
validate:
message: "CPU and memory limits are required."
pattern:
spec:
containers:
- resources:
limits:
memory: "?*"
cpu: "?*"
Step 2: Create a ClusterResourcePlacement to propagate it
apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourcePlacement
metadata:
name: kyverno-policies
spec:
resourceSelectors:
- group: kyverno.io
kind: ClusterPolicy
version: v1
name: require-resource-limits
policy:
placementType: PickAll # All member clusters
Every member cluster now receives the policy. Updates to the policy on the hub automatically propagate.
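Trust, but verify: the placement's status on the hub reports whether each member cluster has applied the resources.

# On the hub cluster
kubectl get clusterresourceplacement kyverno-policies
kubectl describe clusterresourceplacement kyverno-policies  # per-cluster conditions live in .status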
Wrapping Policies in a Namespace
ClusterPolicies are cluster-scoped, so they cannot literally live inside a namespace. What a dedicated namespace buys you is a single bundle for the namespaced resources that travel with your policies (PolicyExceptions, for example): when a ClusterResourcePlacement selects a namespace, Fleet propagates the namespace together with the namespaced resources inside it, while the ClusterPolicies themselves still need their own resourceSelectors as in the previous step. A common layout:
# Namespace to hold policies
apiVersion: v1
kind: Namespace
metadata:
name: cluster-policies
---
# Cluster-scoped policy; propagated via its own resourceSelector (previous step), not by selecting the namespace
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: disallow-privileged
# Note: ClusterPolicy has no namespace field; only namespaced resources travel with the namespace
spec:
# ...
Then propagate the entire namespace:
apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourcePlacement
metadata:
name: security-policies
spec:
resourceSelectors:
- group: ""
kind: Namespace
version: v1
name: cluster-policies
policy:
placementType: PickAll
Per-Cluster Overrides
What if most clusters need Enforce but your development cluster should use Audit?
Use KubeFleet’s ClusterResourceOverride:
apiVersion: placement.kubernetes-fleet.io/v1alpha1
kind: ClusterResourceOverride
metadata:
name: dev-cluster-audit-mode
spec:
clusterResourceSelectors:
- group: kyverno.io
kind: ClusterPolicy
name: require-resource-limits
version: v1
policy:
overrideRules:
- clusterSelector:
clusterSelectorTerms:
- labelSelector:
matchLabels:
environment: development
jsonPatchOverrides:
- op: replace
path: /spec/validationFailureAction
value: Audit
Now require-resource-limits is enforced everywhere except clusters labeled environment: development, where it runs in Audit mode.
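The override matches labels on the cluster's hub-side MemberCluster object, so the remaining step is to label the dev cluster (the cluster name is illustrative):

# On the hub cluster
kubectl label membercluster dev-cluster-01 environment=development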
PolicyExceptions Across the Fleet
Should PolicyExceptions be centralized or per-cluster?
Centralized exceptions (propagated from hub):
- Good for: Fleet-wide exceptions (monitoring tools, platform components)
- Propagate like any other resource
Local exceptions (created on member clusters):
- Good for: Cluster-specific needs, team autonomy
- Don’t propagate — each cluster manages its own
A reasonable split:
- Platform exceptions (node-exporter, ingress controller) → centralized
- Application exceptions → local, with approval process
Versioning and Rollback
When you update a policy on the hub, it propagates everywhere. How do you handle this safely?
GitOps: Store policies in Git. Changes go through PR review. Fleet syncs from Git (via ArgoCD or Flux on the hub).
Staged rollout with labels:
apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourcePlacement
metadata:
name: policies-canary
spec:
resourceSelectors:
- group: kyverno.io
kind: ClusterPolicy
version: v1
name: new-policy-v2
policy:
placementType: PickN
numberOfClusters: 2
affinity:
clusterAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
clusterSelectorTerms:
- labelSelector:
matchLabels:
policy-canary: "true"
Deploy to canary clusters first. Watch for issues. Then expand to the fleet.
Rollback: Update the hub policy to the previous version. Fleet propagates the rollback.
Observability
PolicyReports
Kyverno creates PolicyReport resources documenting violations:
apiVersion: wgpolicyk8s.io/v1alpha2
kind: ClusterPolicyReport
metadata:
name: clusterpolicy-require-resource-limits
results:
- message: "CPU and memory limits are required."
policy: require-resource-limits
result: fail
rule: require-limits
resources:
- apiVersion: v1
kind: Pod
name: my-app-xyz123
namespace: default
timestamp: "2025-01-28T10:30:00Z"
Aggregating Reports Across Clusters
In a multi-cluster setup, you need to aggregate reports. Options:
Option 1: Policy Reporter UI
Policy Reporter is an open-source tool that aggregates PolicyReports and provides a dashboard. Deploy it on each cluster and point it at a central backend.
Option 2: Export to central logging
Ship PolicyReports to your logging stack (Elasticsearch, Loki, etc.):
# Policy Reporter can push to various targets
target:
loki:
host: http://loki.monitoring:3100
path: /loki/api/v1/push
minimumPriority: warning
Option 3: Metrics
Kyverno exposes Prometheus metrics:
kyverno_policy_results_total{policy_name="require-resource-limits", rule_name="require-limits", result="fail"}
Aggregate with Thanos or Cortex across clusters. Alert on violation spikes.
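A sketch of such an alert, assuming the Prometheus Operator's PrometheusRule CRD and the metric labels shown above (the threshold is arbitrary):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kyverno-violations
  namespace: monitoring
spec:
  groups:
    - name: kyverno
      rules:
        - alert: KyvernoViolationSpike
          expr: |
            sum by (policy_name) (
              increase(kyverno_policy_results_total{result="fail"}[15m])
            ) > 50
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: 'Policy {{ $labels.policy_name }} violations are spiking'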
Debugging Admission Failures
User: “My deployment won’t create pods!”
Step 1: Check events
kubectl describe deployment my-app
# Look for admission webhook errors in events
Step 2: Check Kyverno logs
kubectl logs -n kyverno -l app.kubernetes.io/name=kyverno --tail=100 | grep my-app
Step 3: Check PolicyReports
kubectl get policyreport -A | grep my-app
kubectl get clusterpolicyreport -o yaml | grep -A20 my-app
Step 4: Dry-run the resource
Kyverno CLI lets you test locally:
kyverno apply policy.yaml --resource pod.yaml
Step 5: Check for exceptions
kubectl get policyexception -A
# Is there an exception that should apply but doesn't?
Real Policies That Matter
Here are battle-tested policies, not toy examples.
1. Require Resource Limits
Pods without limits can starve nodes. This is non-negotiable in shared clusters.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-resource-limits
annotations:
policies.kyverno.io/title: Require Resource Limits
policies.kyverno.io/description: >-
Pods must specify CPU and memory limits to prevent resource starvation.
spec:
validationFailureAction: Enforce
background: true
rules:
- name: require-limits
match:
any:
- resources:
kinds:
- Pod
exclude:
any:
- resources:
namespaces:
- kube-system
- kube-node-lease
- kyverno
- resources:
selector:
matchLabels:
policy.example.com/exclude: "true"
validate:
message: "CPU and memory limits are required for all containers."
pattern:
spec:
containers:
- resources:
limits:
memory: "?*"
cpu: "?*"
=(initContainers):
- resources:
limits:
memory: "?*"
cpu: "?*"
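For reference, here is a pod that passes this policy, with both limits declared on every container (names and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: my-app
  labels:
    team: payments
spec:
  containers:
    - name: app
      image: mycompany.azurecr.io/my-app:1.4.2
      resources:
        requests:
          cpu: 100m
          memory: 128Mi
        limits:
          cpu: 500m
          memory: 256Mi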
2. Restrict Image Registries
Only allow images from your approved registries:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: restrict-image-registries
annotations:
policies.kyverno.io/title: Restrict Image Registries
policies.kyverno.io/description: >-
Images must come from approved registries.
spec:
validationFailureAction: Enforce
background: true
rules:
- name: validate-registries
match:
any:
- resources:
kinds:
- Pod
exclude:
any:
- resources:
namespaces:
- kube-system
- kyverno
validate:
message: "Images must be from approved registries: gcr.io/mycompany, mycompany.azurecr.io"
pattern:
spec:
containers:
- image: "gcr.io/mycompany/* | mycompany.azurecr.io/*"
=(initContainers):
- image: "gcr.io/mycompany/* | mycompany.azurecr.io/*"
=(ephemeralContainers):
- image: "gcr.io/mycompany/* | mycompany.azurecr.io/*"
3. Disallow Privileged Containers
Privileged containers can escape to the host. Block them unless explicitly exempted.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: disallow-privileged
annotations:
policies.kyverno.io/title: Disallow Privileged Containers
policies.kyverno.io/description: >-
Privileged containers are not allowed.
spec:
validationFailureAction: Enforce
background: true
rules:
- name: deny-privileged
match:
any:
- resources:
kinds:
- Pod
exclude:
any:
- resources:
namespaces:
- kube-system
validate:
message: "Privileged containers are not allowed."
pattern:
spec:
containers:
- =(securityContext):
=(privileged): false
=(initContainers):
- =(securityContext):
=(privileged): false
=(ephemeralContainers):
- =(securityContext):
=(privileged): false
4. Auto-Generate NetworkPolicy for New Namespaces
When a namespace is created, automatically generate a default-deny NetworkPolicy:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: generate-default-network-policy
annotations:
policies.kyverno.io/title: Generate Default NetworkPolicy
policies.kyverno.io/description: >-
Creates a default-deny NetworkPolicy for new namespaces.
spec:
rules:
- name: generate-default-deny
match:
any:
- resources:
kinds:
- Namespace
exclude:
any:
- resources:
names:
- kube-*
- default
- kyverno
generate:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
name: default-deny-all
namespace: "{{request.object.metadata.name}}"
data:
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
New namespace → automatic NetworkPolicy. Teams must explicitly open traffic.
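Opening traffic back up is then an explicit, per-namespace decision. For example, a team can allow traffic between pods in its own namespace (the namespace name is illustrative):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: team-a
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}  # any pod in the same namespace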
Summary
Kyverno at scale isn’t about writing clever policies — it’s about deploying them safely:
- Audit before enforce — always
- Exclude system namespaces — don’t break the cluster
- Use PolicyExceptions — keep policies clean, exceptions explicit
- Stage rollouts — one cluster before fifty
- Propagate via Fleet — single source of truth
- Override where needed — development clusters are different
- Aggregate observability — know your policy posture across the fleet
The goal is consistent, enforceable security across all your clusters — without being the person who broke production.