Kyverno at Scale: Multi-Cluster Policy Without the Pain


You’ve got 20 clusters. You want consistent security policies across all of them. You roll out a Kyverno policy, and suddenly deployments are failing in production because someone’s legitimate workload doesn’t comply. Now multiply that panic by 20 clusters.

This post is about avoiding that scenario — deploying Kyverno policies across a multi-cluster fleet without breaking things.

In a single cluster, policy management is straightforward: install Kyverno, write policies, done. But multi-cluster introduces real challenges:

Policy drift: Cluster-7 is running an old version of your policies. Cluster-12 has a policy someone added manually. Cluster-3 has an exception you forgot about. Nobody knows the actual state.

Blast radius: A bad policy update doesn’t break one cluster — it breaks all of them. Simultaneously. During business hours.

Exceptions at scale: Team A needs an exception in their namespace. Team B needs a different exception, but only in the staging cluster. How do you manage this without drowning in YAML?

Visibility: “Is this policy actually enforced everywhere?” shouldn’t require checking 20 clusters one at a time.

If you’re new to Kyverno, here’s the minimum you need to follow this post.

Kyverno is a Kubernetes-native policy engine. It runs as an admission controller — when someone creates or updates a resource, Kyverno intercepts the request and decides whether to allow, modify, or reject it.

ClusterPolicy: Cluster-scoped, so it applies across all namespaces in a cluster. (There’s also a namespaced Policy kind if you only need rules in a single namespace.)

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-labels
spec:
  validationFailureAction: Enforce  # or Audit
  rules:
    - name: check-team-label
      match:
        any:
        - resources:
            kinds:
              - Pod
      validate:
        message: "The label 'team' is required."
        pattern:
          metadata:
            labels:
              team: "?*"

validationFailureAction:

  • Enforce: Block non-compliant resources (admission rejected)
  • Audit: Allow but report the violation (PolicyReport created)

Policy types:

  • Validate: Accept or reject based on rules
  • Mutate: Modify resources on the fly (inject labels, add sidecars)
  • Generate: Create companion resources (NetworkPolicy when namespace created)
  • VerifyImages: Check image signatures
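
The validate example above covers the first type. For contrast, here is a minimal mutate rule as a sketch (the policy name and default label value are illustrative): it adds a team label to Pods that don’t already have one.

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-default-team-label
spec:
  rules:
    - name: add-team-label
      match:
        any:
        - resources:
            kinds:
              - Pod
      mutate:
        patchStrategicMerge:
          metadata:
            labels:
              +(team): unassigned  # the +() anchor adds the label only if it is missing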

That’s enough background. Let’s talk about not breaking production.

The single most important practice:

spec:
  validationFailureAction: Audit  # Start here

In Audit mode, Kyverno evaluates resources against your policy but doesn’t block anything. It creates PolicyReports documenting what would have been blocked.
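
Reviewing those reports is just a kubectl query away:

# Namespaced reports (one set per namespace)
kubectl get policyreport -A
# Cluster-wide reports for cluster-scoped resources
kubectl get clusterpolicyreport
# Inspect the individual failures in a namespace (placeholder namespace)
kubectl get policyreport -n <namespace> -o yaml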

The workflow:

  1. Deploy policy in Audit mode
  2. Wait (days, not hours — you need to see real traffic)
  3. Review PolicyReports for violations
  4. Fix legitimate workloads or adjust the policy
  5. Switch to Enforce
spec:
  validationFailureAction: Enforce  # Only after audit

Skipping this step is how you break production at 2am.

Your policy shouldn’t block Kubernetes system components:

spec:
  rules:
    - name: require-resource-limits
      match:
        any:
        - resources:
            kinds:
              - Pod
      exclude:
        any:
        - resources:
            namespaces:
              - kube-system
              - kube-public
              - kube-node-lease
              - kyverno
              - fleet-system
              - gatekeeper-system

Better yet, use a label-based exclusion that’s consistent across policies:

exclude:
  any:
  - resources:
      selector:
        matchLabels:
          policy.example.com/exclude: "true"
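
Note that the selector matches labels on the admitted resource itself, so for Pods the label has to be on the pod template. A quick sketch of opting a workload out (the deployment name and namespace are hypothetical):

# Add the opt-out label to a workload's pod template
kubectl patch deployment legacy-batch-worker -n legacy --type=merge \
  -p '{"spec":{"template":{"metadata":{"labels":{"policy.example.com/exclude":"true"}}}}}'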

Kyverno 1.9+ introduced PolicyException — a way to grant specific exemptions without modifying the policy itself. Note that exceptions are an opt-in feature in most versions, so make sure they’re enabled in your Kyverno configuration.

apiVersion: kyverno.io/v2beta1
kind: PolicyException
metadata:
  name: allow-privileged-monitoring
  namespace: monitoring
spec:
  exceptions:
    - policyName: disallow-privileged
      ruleNames:
        - deny-privileged-containers
  match:
    any:
    - resources:
        kinds:
          - Pod
        namespaces:
          - monitoring
        names:
          - node-exporter-*

This says: “The disallow-privileged policy doesn’t apply to pods named node-exporter-* in the monitoring namespace.”

Why this is better than policy modification:

  • Policies stay clean and universal
  • Exceptions are explicit and auditable
  • You can track who requested what exception and why
  • Deleting the exception re-enables enforcement

Don’t go from zero to fleet-wide enforcement in one step:

Stage 1: Single namespace in one cluster

spec:
  rules:
    - name: test-policy
      match:
        any:
        - resources:
            kinds:
              - Pod
            namespaces:
              - policy-test

Stage 2: Audit mode, all namespaces, one cluster

spec:
  validationFailureAction: Audit
  rules:
    - name: test-policy
      match:
        any:
        - resources:
            kinds:
              - Pod
      exclude:
        # ... system namespaces

Stage 3: Enforce mode, one cluster

Stage 4: Audit mode, fleet-wide

Stage 5: Enforce mode, fleet-wide

At each stage, wait and watch. PolicyReports tell you what’s happening.
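
A quick way to keep score at each stage is to pull the pass/fail summaries out of the reports (run against the cluster in question; the column paths assume the standard PolicyReport schema):

# Per-policy pass/fail counts from cluster-wide reports
kubectl get clusterpolicyreport -o custom-columns='NAME:.metadata.name,PASS:.summary.pass,FAIL:.summary.fail,WARN:.summary.warn'
# Same idea for namespaced reports
kubectl get policyreport -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,PASS:.summary.pass,FAIL:.summary.fail'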

You deployed a bad policy. Deployments are failing. Here’s your emergency playbook:

Option 1: Switch to Audit (fast)

kubectl patch clusterpolicy bad-policy -p '{"spec":{"validationFailureAction":"Audit"}}' --type=merge

Option 2: Delete the policy (faster)

kubectl delete clusterpolicy bad-policy

Option 3: Kyverno failurePolicy (prevents this scenario)

When you install Kyverno, configure the webhook to fail open:

# In Kyverno Helm values (the exact key varies by chart version; check your chart's values.yaml)
config:
  webhooks:
    - failurePolicy: Ignore  # Fail open if Kyverno is down/slow

With Ignore, if Kyverno can’t evaluate a request (timeout, crash), Kubernetes allows the request. You lose enforcement temporarily, but deployments don’t break because your policy engine is having a bad day.
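
You can confirm what actually got registered with the API server. The webhook configuration name below is what recent Kyverno versions create; it may differ in yours:

# Check the failure policy on Kyverno's resource validating webhook
kubectl get validatingwebhookconfiguration kyverno-resource-validating-webhook-cfg \
  -o jsonpath='{range .webhooks[*]}{.name}{": "}{.failurePolicy}{"\n"}{end}'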

Now for the multi-cluster part. You have policies that work. How do you deploy them consistently across a fleet?

If you’re using KubeFleet (or Azure Kubernetes Fleet Manager, which builds on it), you can treat Kyverno ClusterPolicies like any other Kubernetes resource:

Step 1: Create policies on the hub cluster

# On hub cluster
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Audit
  rules:
    - name: require-limits
      match:
        any:
        - resources:
            kinds:
              - Pod
      exclude:
        any:
        - resources:
            namespaces:
              - kube-system
              - kyverno
      validate:
        message: "CPU and memory limits are required."
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    memory: "?*"
                    cpu: "?*"

Step 2: Create a ClusterResourcePlacement to propagate it

apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourcePlacement
metadata:
  name: kyverno-policies
spec:
  resourceSelectors:
    - group: kyverno.io
      kind: ClusterPolicy
      version: v1
      name: require-resource-limits
  policy:
    placementType: PickAll  # All member clusters

Every member cluster now receives the policy. Updates to the policy on the hub automatically propagate.
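
You can verify the placement from the hub before trusting it (run against the hub cluster’s kubeconfig):

# Per-cluster placement status, including scheduling and apply failures
kubectl get clusterresourceplacement kyverno-policies
kubectl describe clusterresourceplacement kyverno-policies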

ClusterPolicies are cluster-scoped, which is exactly what a ClusterResourcePlacement selects, so the policy above can be placed directly with no wrapping. Namespace-scoped resources (PolicyExceptions, for example) can’t be selected individually; for those, the common pattern is to group them in a dedicated namespace and propagate the whole namespace:

# Namespace to hold fleet-wide, namespace-scoped policy resources
apiVersion: v1
kind: Namespace
metadata:
  name: cluster-policies
---
# Namespaced resources inside it (like this PolicyException) travel with the namespace
apiVersion: kyverno.io/v2beta1
kind: PolicyException
metadata:
  name: allow-platform-components
  namespace: cluster-policies
spec:
  # ...

Then propagate the entire namespace:

apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourcePlacement
metadata:
  name: security-policies
spec:
  resourceSelectors:
    - group: ""
      kind: Namespace
      version: v1
      name: cluster-policies
  policy:
    placementType: PickAll

What if most clusters need Enforce but your development cluster should use Audit?

Use KubeFleet’s ClusterResourceOverride:

apiVersion: placement.kubernetes-fleet.io/v1alpha1
kind: ClusterResourceOverride
metadata:
  name: dev-cluster-audit-mode
spec:
  clusterResourceSelectors:
    - group: kyverno.io
      kind: ClusterPolicy
      name: require-resource-limits
      version: v1
  policy:
    overrideRules:
      - clusterSelector:
          clusterSelectorTerms:
            - labelSelector:
                matchLabels:
                  environment: development
        jsonPatchOverrides:
          - op: replace
            path: /spec/validationFailureAction
            value: Audit

Now require-resource-limits is enforced everywhere except clusters labeled environment: development, where it runs in Audit mode.
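
A spot check on one of the development clusters confirms the override landed:

# Should print "Audit" on clusters labeled environment: development
kubectl get clusterpolicy require-resource-limits \
  -o jsonpath='{.spec.validationFailureAction}'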

Should PolicyExceptions be centralized or per-cluster?

Centralized exceptions (propagated from hub):

  • Good for: Fleet-wide exceptions (monitoring tools, platform components)
  • Propagate like any other resource

Local exceptions (created on member clusters):

  • Good for: Cluster-specific needs, team autonomy
  • Don’t propagate — each cluster manages its own

A reasonable split:

  • Platform exceptions (node-exporter, ingress controller) → centralized
  • Application exceptions → local, with approval process
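
For the local, approval-driven exceptions, a lightweight convention is to record the owner, the reason, and an expiry directly on the object so reviews are self-service. The annotation keys, names, and namespace below are illustrative, not a Kyverno convention:

apiVersion: kyverno.io/v2beta1
kind: PolicyException
metadata:
  name: allow-privileged-legacy-agent
  namespace: legacy
  annotations:
    exceptions.example.com/requested-by: "team-payments"
    exceptions.example.com/ticket: "SEC-1234"
    exceptions.example.com/expires: "2025-06-30"
spec:
  exceptions:
    - policyName: disallow-privileged
      ruleNames:
        - deny-privileged
  match:
    any:
    - resources:
        kinds:
          - Pod
        namespaces:
          - legacy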

When you update a policy on the hub, it propagates everywhere. How do you handle this safely?

GitOps: Store policies in Git. Changes go through PR review. Fleet syncs from Git (via ArgoCD or Flux on the hub).
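
On the hub, that can be as simple as a Flux source plus a Kustomization (repo URL and path are hypothetical; an Argo CD Application has the same shape):

apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: kyverno-policies
  namespace: flux-system
spec:
  interval: 5m
  url: https://github.com/example-org/kyverno-policies  # hypothetical repo
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: kyverno-policies
  namespace: flux-system
spec:
  interval: 10m
  prune: true
  sourceRef:
    kind: GitRepository
    name: kyverno-policies
  path: ./policies  # ClusterPolicies and ClusterResourcePlacements live here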

Staged rollout with labels:

apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourcePlacement
metadata:
  name: policies-canary
spec:
  resourceSelectors:
    - group: kyverno.io
      kind: ClusterPolicy
      version: v1
      name: new-policy-v2
  policy:
    placementType: PickN
    numberOfClusters: 2
    affinity:
      clusterAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          clusterSelectorTerms:
            - labelSelector:
                matchLabels:
                  policy-canary: "true"

Deploy to canary clusters first. Watch for issues. Then expand to the fleet.

Rollback: Update the hub policy to the previous version. Fleet propagates the rollback.

Kyverno creates PolicyReport resources documenting violations:

apiVersion: wgpolicyk8s.io/v1alpha2
kind: ClusterPolicyReport
metadata:
  name: clusterpolicy-require-resource-limits
results:
  - message: "CPU and memory limits are required."
    policy: require-resource-limits
    result: fail
    rule: require-limits
    resources:
      - apiVersion: v1
        kind: Pod
        name: my-app-xyz123
        namespace: default
    timestamp: "2025-01-28T10:30:00Z"

In a multi-cluster setup, you need to aggregate reports. Options:

Option 1: Policy Reporter UI

Policy Reporter is an open-source tool that aggregates PolicyReports and provides a dashboard. Deploy it on each cluster and point to a central backend.
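
A typical per-cluster install looks something like this sketch; the repo URL matches the upstream project, but values keys differ between Policy Reporter major versions, so check the chart you are on:

helm repo add policy-reporter https://kyverno.github.io/policy-reporter
helm install policy-reporter policy-reporter/policy-reporter \
  --namespace policy-reporter --create-namespace \
  --set ui.enabled=true  # per-cluster dashboard; omit if you only push to a central backend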

Option 2: Export to central logging

Ship PolicyReports to your logging stack (Elasticsearch, Loki, etc.):

# Policy Reporter can push to various targets
target:
  loki:
    host: http://loki.monitoring:3100
    path: /loki/api/v1/push
    minimumPriority: warning

Option 3: Metrics

Kyverno exposes Prometheus metrics:

kyverno_policy_results_total{policy_name="require-resource-limits", rule_name="require-limits", result="fail"}

Aggregate with Thanos or Cortex across clusters. Alert on violation spikes.
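
If you run the Prometheus Operator, an alert on that metric might look like the following sketch (thresholds and severity labels are placeholders to tune):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kyverno-policy-violations
  namespace: monitoring
spec:
  groups:
    - name: kyverno
      rules:
        - alert: KyvernoPolicyFailureSpike
          # Sustained rate of failing results for any policy on this cluster
          expr: sum by (policy_name) (rate(kyverno_policy_results_total{result="fail"}[5m])) > 1
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Policy {{ $labels.policy_name }} is failing frequently"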

User: “My deployment won’t create pods!”

Step 1: Check events

kubectl describe deployment my-app        # look for a ReplicaFailure condition
kubectl describe replicaset <my-app-replicaset>
# The denial ("admission webhook ... denied the request") shows up in the ReplicaSet's events

Step 2: Check Kyverno logs

kubectl logs -n kyverno -l app.kubernetes.io/name=kyverno --tail=100 | grep my-app

Step 3: Check PolicyReports

kubectl get policyreport -A | grep my-app
kubectl get clusterpolicyreport -o yaml | grep -A20 my-app

Step 4: Dry-run the resource

Kyverno CLI lets you test locally:

kyverno apply policy.yaml --resource pod.yaml

Step 5: Check for exceptions

kubectl get policyexception -A
# Is there an exception that should apply but doesn't?

Here are battle-tested policies, not toy examples.

Pods without limits can starve nodes. This is non-negotiable in shared clusters.

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
  annotations:
    policies.kyverno.io/title: Require Resource Limits
    policies.kyverno.io/description: >-
      Pods must specify CPU and memory limits to prevent resource starvation.
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: require-limits
      match:
        any:
        - resources:
            kinds:
              - Pod
      exclude:
        any:
        - resources:
            namespaces:
              - kube-system
              - kube-node-lease
              - kyverno
        - resources:
            selector:
              matchLabels:
                policy.example.com/exclude: "true"
      validate:
        message: "CPU and memory limits are required for all containers."
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    memory: "?*"
                    cpu: "?*"
            =(initContainers):
              - resources:
                  limits:
                    memory: "?*"
                    cpu: "?*"

Only allow images from your approved registries:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-image-registries
  annotations:
    policies.kyverno.io/title: Restrict Image Registries
    policies.kyverno.io/description: >-
      Images must come from approved registries.
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: validate-registries
      match:
        any:
        - resources:
            kinds:
              - Pod
      exclude:
        any:
        - resources:
            namespaces:
              - kube-system
              - kyverno
      validate:
        message: "Images must be from approved registries: gcr.io/mycompany, mycompany.azurecr.io"
        pattern:
          spec:
            containers:
              - image: "gcr.io/mycompany/* | mycompany.azurecr.io/*"
            =(initContainers):
              - image: "gcr.io/mycompany/* | mycompany.azurecr.io/*"
            =(ephemeralContainers):
              - image: "gcr.io/mycompany/* | mycompany.azurecr.io/*"

Privileged containers can escape to the host. Block them unless explicitly exempted.

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged
  annotations:
    policies.kyverno.io/title: Disallow Privileged Containers
    policies.kyverno.io/description: >-
      Privileged containers are not allowed.
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: deny-privileged
      match:
        any:
        - resources:
            kinds:
              - Pod
      exclude:
        any:
        - resources:
            namespaces:
              - kube-system
      validate:
        message: "Privileged containers are not allowed."
        pattern:
          spec:
            containers:
              - =(securityContext):
                  =(privileged): false
            =(initContainers):
              - =(securityContext):
                  =(privileged): false
            =(ephemeralContainers):
              - =(securityContext):
                  =(privileged): false

When a namespace is created, automatically generate a default-deny NetworkPolicy:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: generate-default-network-policy
  annotations:
    policies.kyverno.io/title: Generate Default NetworkPolicy
    policies.kyverno.io/description: >-
      Creates a default-deny NetworkPolicy for new namespaces.
spec:
  rules:
    - name: generate-default-deny
      match:
        any:
        - resources:
            kinds:
              - Namespace
      exclude:
        any:
        - resources:
            names:
              - kube-*
              - default
              - kyverno
      generate:
        apiVersion: networking.k8s.io/v1
        kind: NetworkPolicy
        name: default-deny-all
        namespace: "{{request.object.metadata.name}}"
        synchronize: true  # re-create the NetworkPolicy if it is deleted or modified
        data:
          spec:
            podSelector: {}
            policyTypes:
              - Ingress
              - Egress

New namespace → automatic NetworkPolicy. Teams must explicitly open traffic.
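
Verifying the behavior is as simple as creating a namespace and looking for the generated object (the namespace name is just an example):

kubectl create namespace team-payments
kubectl get networkpolicy -n team-payments
# Expect to see "default-deny-all" listed within a few seconds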

Kyverno at scale isn’t about writing clever policies — it’s about deploying them safely:

  1. Audit before enforce — always
  2. Exclude system namespaces — don’t break the cluster
  3. Use PolicyExceptions — keep policies clean, exceptions explicit
  4. Stage rollouts — one cluster before fifty
  5. Propagate via Fleet — single source of truth
  6. Override where needed — development clusters are different
  7. Aggregate observability — know your policy posture across the fleet

The goal is consistent, enforceable security across all your clusters — without being the person who broke production.