If you’ve spent any time managing stateful applications on Kubernetes, you know the pain: manually scaling, handling backups, managing upgrades, recovering from failures. Kubernetes Operators encode this operational knowledge into software, turning your cluster into a self-managing system. In this post, we’ll build one from scratch.
Why Operators?
Kubernetes is great at managing stateless workloads—Deployments, ReplicaSets, Services. But what about databases, message queues, or your custom distributed system that needs coordinated rolling upgrades? That’s where operators shine.
An operator is essentially a custom controller that watches your custom resources and reconciles the actual state of the world with the desired state you’ve declared. Think of it as a robot SRE that never sleeps.
Real-world examples:
- Prometheus Operator — manages Prometheus instances, alerting rules, and ServiceMonitors
- cert-manager — automates TLS certificate issuance and renewal
- Database operators (PostgreSQL, MySQL, Redis) — handle replication, failover, backups
When you need an operator:
- Your application requires complex lifecycle management
- You’re tired of writing the same runbooks over and over
- You want to offer a self-service platform to your developers
When you don’t:
- A simple Deployment + ConfigMap does the job
- Helm charts with hooks are sufficient
- You’re not ready to maintain custom Go code
Core Concepts
Before we write code, let’s understand the building blocks.
Custom Resource Definitions (CRDs)
CRDs extend the Kubernetes API. Instead of just Pods, Services, and Deployments, you can create your own resource types like WebApp, DatabaseCluster, or MLPipeline. Once you register a CRD, kubectl can interact with it just like any built-in resource.
apiVersion: apps.example.com/v1
kind: WebApp
metadata:
name: my-app
spec:
image: nginx:latest
replicas: 3
port: 80
Controllers & The Reconciliation Loop
A controller watches resources and continuously reconciles actual state with desired state. The pattern is simple:
Observe → Diff → Act → Repeat
- Observe: Watch for changes to your custom resource (and any resources it owns)
- Diff: Compare what exists vs what should exist
- Act: Create, update, or delete resources to close the gap
- Repeat: Requeue and check again
The key insight: your reconcile function should be idempotent. Running it 100 times with the same input should produce the same result as running it once.
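In controller-runtime terms, that whole loop collapses into a single Reconcile function. Here is a minimal sketch of the shape (the Widget type and ensureChildren helper are placeholders for illustration, not something we build below):
func (r *WidgetReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Observe: fetch the object that triggered this reconcile.
	widget := &myv1.Widget{}
	if err := r.Get(ctx, req.NamespacedName, widget); err != nil {
		// A deleted object shows up as NotFound; nothing left to do.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Diff + Act: converge owned resources toward widget.Spec.
	if err := r.ensureChildren(ctx, widget); err != nil {
		// Repeat: returning an error requeues with exponential backoff.
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}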
Desired State vs Actual State
Kubernetes is declarative. You don’t say “create 3 pods”—you say “I want 3 pods” and the controller makes it happen. Your operator follows the same philosophy: users declare what they want, your controller figures out how to get there.
Hands-On: Building a WebApp Operator
Let’s build something real. Our WebApp operator will manage a Deployment and a Service from a single custom resource. Users create a WebApp, and our operator handles the rest.
Prerequisites
- Go 1.21+
- Docker
- kubectl
- A Kubernetes cluster (Kind or Minikube works great)
- Kubebuilder installed
# Install kubebuilder
curl -L -o kubebuilder "https://go.kubebuilder.io/dl/latest/$(go env GOOS)/$(go env GOARCH)"
chmod +x kubebuilder && sudo mv kubebuilder /usr/local/bin/
Step 1: Scaffold the Project
mkdir webapp-operator && cd webapp-operator
kubebuilder init --domain example.com --repo github.com/yourusername/webapp-operator
This creates the boilerplate: main.go, Makefile, Dockerfile, and config manifests.
Step 2: Create the API (CRD + Controller)
kubebuilder create api --group apps --version v1 --kind WebApp
Say yes to both prompts (create resource and controller). Kubebuilder generates:
- api/v1/webapp_types.go — your CRD's Go types
- internal/controller/webapp_controller.go — your controller logic
Step 3: Define the CRD Schema
Edit api/v1/webapp_types.go to define what a WebApp looks like:
package v1
import (
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)
// WebAppSpec defines the desired state of WebApp
type WebAppSpec struct {
// Image is the container image to deploy
// +kubebuilder:validation:Required
Image string `json:"image"`
// Replicas is the number of pod replicas
// +kubebuilder:validation:Minimum=1
// +kubebuilder:default=1
Replicas int32 `json:"replicas,omitempty"`
// Port is the container port to expose
// +kubebuilder:validation:Minimum=1
// +kubebuilder:validation:Maximum=65535
// +kubebuilder:default=80
Port int32 `json:"port,omitempty"`
}
// WebAppStatus defines the observed state of WebApp
type WebAppStatus struct {
// AvailableReplicas is the number of ready pods
AvailableReplicas int32 `json:"availableReplicas,omitempty"`
// Conditions represent the latest available observations
Conditions []metav1.Condition `json:"conditions,omitempty"`
}
// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:printcolumn:name="Image",type=string,JSONPath=`.spec.image`
// +kubebuilder:printcolumn:name="Replicas",type=integer,JSONPath=`.spec.replicas`
// +kubebuilder:printcolumn:name="Available",type=integer,JSONPath=`.status.availableReplicas`
// WebApp is the Schema for the webapps API
type WebApp struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
Spec WebAppSpec `json:"spec,omitempty"`
Status WebAppStatus `json:"status,omitempty"`
}
// +kubebuilder:object:root=true
// WebAppList contains a list of WebApp
type WebAppList struct {
metav1.TypeMeta `json:",inline"`
metav1.ListMeta `json:"metadata,omitempty"`
Items []WebApp `json:"items"`
}
func init() {
SchemeBuilder.Register(&WebApp{}, &WebAppList{})
}
The +kubebuilder comments are markers that controller-gen (run by make manifests) reads to generate OpenAPI validation rules and kubectl output columns.
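With the printcolumn markers in place, kubectl get webapps will render those fields as columns. Once the CRD is installed and a resource exists, the output looks roughly like:
NAME       IMAGE        REPLICAS   AVAILABLE
demo-app   nginx:1.25   3          3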
Regenerate the manifests:
make manifests
Step 4: Implement the Controller
This is where the magic happens. Edit internal/controller/webapp_controller.go:
package controller
import (
"context"
"fmt"
appsv1 "k8s.io/api/apps/v1"
corev1 "k8s.io/api/core/v1"
"k8s.io/apimachinery/pkg/api/errors"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/runtime"
"k8s.io/apimachinery/pkg/types"
"k8s.io/apimachinery/pkg/util/intstr"
ctrl "sigs.k8s.io/controller-runtime"
"sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
"sigs.k8s.io/controller-runtime/pkg/log"
webappv1 "github.com/yourusername/webapp-operator/api/v1"
)
// WebAppReconciler reconciles a WebApp object
type WebAppReconciler struct {
client.Client
Scheme *runtime.Scheme
}
// +kubebuilder:rbac:groups=apps.example.com,resources=webapps,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=apps.example.com,resources=webapps/status,verbs=get;update;patch
// +kubebuilder:rbac:groups=apps.example.com,resources=webapps/finalizers,verbs=update
// +kubebuilder:rbac:groups=apps,resources=deployments,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups="",resources=services,verbs=get;list;watch;create;update;patch;delete
func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
logger := log.FromContext(ctx)
// Fetch the WebApp instance
webapp := &webappv1.WebApp{}
if err := r.Get(ctx, req.NamespacedName, webapp); err != nil {
if errors.IsNotFound(err) {
// Resource deleted - nothing to do
logger.Info("WebApp resource not found, ignoring")
return ctrl.Result{}, nil
}
logger.Error(err, "Failed to get WebApp")
return ctrl.Result{}, err
}
// Reconcile Deployment
if err := r.reconcileDeployment(ctx, webapp); err != nil {
logger.Error(err, "Failed to reconcile Deployment")
return ctrl.Result{}, err
}
// Reconcile Service
if err := r.reconcileService(ctx, webapp); err != nil {
logger.Error(err, "Failed to reconcile Service")
return ctrl.Result{}, err
}
// Update status
if err := r.updateStatus(ctx, webapp); err != nil {
logger.Error(err, "Failed to update status")
return ctrl.Result{}, err
}
logger.Info("Successfully reconciled WebApp", "name", webapp.Name)
return ctrl.Result{}, nil
}
func (r *WebAppReconciler) reconcileDeployment(ctx context.Context, webapp *webappv1.WebApp) error {
deploy := &appsv1.Deployment{
ObjectMeta: metav1.ObjectMeta{
Name: webapp.Name,
Namespace: webapp.Namespace,
},
}
op, err := controllerutil.CreateOrUpdate(ctx, r.Client, deploy, func() error {
// Set the deployment spec
labels := map[string]string{
"app": webapp.Name,
"app.kubernetes.io/managed-by": "webapp-operator",
}
deploy.Spec = appsv1.DeploymentSpec{
Replicas: &webapp.Spec.Replicas,
Selector: &metav1.LabelSelector{
MatchLabels: labels,
},
Template: corev1.PodTemplateSpec{
ObjectMeta: metav1.ObjectMeta{
Labels: labels,
},
Spec: corev1.PodSpec{
Containers: []corev1.Container{
{
Name: "app",
Image: webapp.Spec.Image,
Ports: []corev1.ContainerPort{
{
ContainerPort: webapp.Spec.Port,
},
},
},
},
},
},
}
// Set WebApp as the owner - enables garbage collection
return controllerutil.SetControllerReference(webapp, deploy, r.Scheme)
})
if err != nil {
return fmt.Errorf("failed to reconcile deployment: %w", err)
}
log.FromContext(ctx).Info("Reconciled Deployment", "operation", op)
return nil
}
func (r *WebAppReconciler) reconcileService(ctx context.Context, webapp *webappv1.WebApp) error {
svc := &corev1.Service{
ObjectMeta: metav1.ObjectMeta{
Name: webapp.Name,
Namespace: webapp.Namespace,
},
}
op, err := controllerutil.CreateOrUpdate(ctx, r.Client, svc, func() error {
labels := map[string]string{
"app": webapp.Name,
}
svc.Spec = corev1.ServiceSpec{
Selector: labels,
Ports: []corev1.ServicePort{
{
Port: webapp.Spec.Port,
TargetPort: intstr.FromInt32(webapp.Spec.Port),
},
},
}
return controllerutil.SetControllerReference(webapp, svc, r.Scheme)
})
if err != nil {
return fmt.Errorf("failed to reconcile service: %w", err)
}
log.FromContext(ctx).Info("Reconciled Service", "operation", op)
return nil
}
func (r *WebAppReconciler) updateStatus(ctx context.Context, webapp *webappv1.WebApp) error {
deploy := &appsv1.Deployment{}
if err := r.Get(ctx, types.NamespacedName{Name: webapp.Name, Namespace: webapp.Namespace}, deploy); err != nil {
return err
}
webapp.Status.AvailableReplicas = deploy.Status.AvailableReplicas
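	// The Conditions field we declared in WebAppStatus can be populated here
	// too. A sketch (assumes adding k8s.io/apimachinery/pkg/api/meta to the
	// imports), not something the scaffold generates for you:
	//
	//	meta.SetStatusCondition(&webapp.Status.Conditions, metav1.Condition{
	//		Type:   "Available",
	//		Status: metav1.ConditionTrue,
	//		Reason: "ReplicasReady",
	//	})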
return r.Status().Update(ctx, webapp)
}
// SetupWithManager sets up the controller with the Manager
func (r *WebAppReconciler) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewControllerManagedBy(mgr).
For(&webappv1.WebApp{}).
Owns(&appsv1.Deployment{}).
Owns(&corev1.Service{}).
Complete(r)
}
Key things happening here:
- CreateOrUpdate — this helper creates the resource if it doesn't exist, or updates it if it does. Idempotency is built in.
- SetControllerReference — sets the WebApp as the owner of the Deployment/Service. When the WebApp is deleted, Kubernetes automatically garbage collects the child resources.
- Owns(&appsv1.Deployment{}) — tells the controller to also watch Deployments it owns. If someone manually edits the Deployment, the controller will reconcile it back.
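One refinement worth knowing about: because updateStatus writes to the WebApp, every reconcile triggers another watch event on the WebApp itself. If that feedback gets noisy, controller-runtime ships a GenerationChangedPredicate that ignores updates which don't change the spec. A sketch of the adjusted setup, using the builder and predicate packages from sigs.k8s.io/controller-runtime:
import (
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

func (r *WebAppReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		// Only reconcile when the WebApp's generation changes (i.e. its
		// spec), not on status-only updates.
		For(&webappv1.WebApp{}, builder.WithPredicates(predicate.GenerationChangedPredicate{})).
		Owns(&appsv1.Deployment{}).
		Owns(&corev1.Service{}).
		Complete(r)
}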
Step 5: Test Locally
Start a local cluster and run the operator:
# Start a Kind cluster
kind create cluster
# Install the CRD
make install
# Run the operator locally (outside the cluster)
make run
In another terminal, create a WebApp:
kubectl apply -f - <<EOF
apiVersion: apps.example.com/v1
kind: WebApp
metadata:
name: demo-app
spec:
image: nginx:1.25
replicas: 3
port: 80
EOF
Watch the magic:
kubectl get webapps
kubectl get deployments
kubectl get services
kubectl get pods
Try updating the replicas or image—the operator will reconcile automatically.
Step 6: Build and Deploy to Cluster
# Build and push the operator image
make docker-build docker-push IMG=yourusername/webapp-operator:v0.1.0
# Deploy to cluster
make deploy IMG=yourusername/webapp-operator:v0.1.0
Best Practices & Gotchas
Idempotency is Everything
Your reconcile function will be called multiple times—on create, on update, on resync, on restart. It must handle all cases gracefully. Use CreateOrUpdate or check-before-act patterns.
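If you'd prefer the explicit version over CreateOrUpdate, check-before-act looks like this inside a helper that returns error. A sketch only: desiredDeployment is a hypothetical function that builds the Deployment from the WebApp spec, and errors/types are the apimachinery packages already imported in our controller.
	found := &appsv1.Deployment{}
	err := r.Get(ctx, types.NamespacedName{Name: webapp.Name, Namespace: webapp.Namespace}, found)
	if errors.IsNotFound(err) {
		// Doesn't exist yet: create it from scratch.
		return r.Create(ctx, r.desiredDeployment(webapp)) // desiredDeployment is hypothetical
	} else if err != nil {
		return err
	}
	// Exists: update only the fields this operator owns.
	found.Spec.Replicas = &webapp.Spec.Replicas
	return r.Update(ctx, found)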
Finalizers for Cleanup
If your operator creates resources outside Kubernetes (cloud resources, external databases), use finalizers to ensure cleanup:
const finalizerName = "apps.example.com/finalizer"
// In Reconcile():
if webapp.ObjectMeta.DeletionTimestamp.IsZero() {
// Not being deleted - add finalizer if missing
if !controllerutil.ContainsFinalizer(webapp, finalizerName) {
controllerutil.AddFinalizer(webapp, finalizerName)
if err := r.Update(ctx, webapp); err != nil {
return ctrl.Result{}, err
}
}
} else {
// Being deleted - cleanup external resources
if controllerutil.ContainsFinalizer(webapp, finalizerName) {
if err := r.cleanupExternalResources(webapp); err != nil {
return ctrl.Result{}, err
}
controllerutil.RemoveFinalizer(webapp, finalizerName)
if err := r.Update(ctx, webapp); err != nil {
return ctrl.Result{}, err
}
}
return ctrl.Result{}, nil
}
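The cleanupExternalResources helper referenced above is whatever your operator needs it to be; a hypothetical stub:
func (r *WebAppReconciler) cleanupExternalResources(webapp *webappv1.WebApp) error {
	// Deregister from an external system, delete cloud resources, drop
	// database users, and so on. This must tolerate being called more than
	// once, since reconciles can repeat during deletion.
	return nil
}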
Error Handling and Requeue
Return errors to trigger a requeue with exponential backoff. For expected temporary failures, requeue explicitly:
// Requeue after 30 seconds
return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
// Requeue immediately (with rate limiting)
return ctrl.Result{Requeue: true}, nil
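In context, a polling pattern might look like this (checkExternal is a hypothetical helper, and time needs to be imported):
	ready, err := r.checkExternal(ctx, webapp)
	if err != nil {
		// Unexpected failure: return the error and let backoff handle it.
		return ctrl.Result{}, err
	}
	if !ready {
		// Expected and temporary: check again in 30 seconds.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	return ctrl.Result{}, nil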
Testing
Kubebuilder generates a test suite with envtest that spins up a real API server:
make test
Write tests for your reconcile logic—it’s the most critical code path.
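A sketch of what one such test might look like, assuming the k8sClient and the Ginkgo/Gomega dot-imports from the generated suite_test.go, and assuming the suite starts a manager running our WebAppReconciler (the Kubebuilder book shows how to wire that up):
var _ = Describe("WebApp controller", func() {
	It("creates a Deployment matching the WebApp spec", func() {
		ctx := context.Background()
		webapp := &webappv1.WebApp{
			ObjectMeta: metav1.ObjectMeta{Name: "test-app", Namespace: "default"},
			Spec:       webappv1.WebAppSpec{Image: "nginx:1.25", Replicas: 2, Port: 80},
		}
		Expect(k8sClient.Create(ctx, webapp)).To(Succeed())

		// The reconciler should eventually create the owned Deployment.
		deploy := &appsv1.Deployment{}
		Eventually(func() error {
			return k8sClient.Get(ctx, types.NamespacedName{Name: "test-app", Namespace: "default"}, deploy)
		}, "10s", "250ms").Should(Succeed())
		Expect(*deploy.Spec.Replicas).To(Equal(int32(2)))
	})
})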
Wrapping Up
We’ve built a functional Kubernetes operator that manages Deployments and Services from a single custom resource. The patterns here—CRDs, reconciliation loops, owner references—apply to operators of any complexity.
Full source code: github.com/yourusername/webapp-operator
Further reading:
- The Kubebuilder Book — the definitive guide
- Operator SDK — alternative scaffolding tool
- Programming Kubernetes — deep dive into client-go
Happy operating 🤖