Building a Kubernetes Operator from Scratch


If you’ve spent any time managing stateful applications on Kubernetes, you know the pain: manually scaling, handling backups, managing upgrades, recovering from failures. Kubernetes Operators encode this operational knowledge into software, turning your cluster into a self-managing system. In this post, we’ll build one from scratch.

Kubernetes is great at managing stateless workloads—Deployments, ReplicaSets, Services. But what about databases, message queues, or your custom distributed system that needs coordinated rolling upgrades? That’s where operators shine.

An operator is essentially a custom controller that watches your custom resources and reconciles the actual state of the world with the desired state you’ve declared. Think of it as a robot SRE that never sleeps.

Real-world examples:

  • Prometheus Operator — manages Prometheus instances, alerting rules, and ServiceMonitors
  • cert-manager — automates TLS certificate issuance and renewal
  • Database operators (PostgreSQL, MySQL, Redis) — handle replication, failover, backups

When you need an operator:

  • Your application requires complex lifecycle management
  • You’re tired of writing the same runbooks over and over
  • You want to offer a self-service platform to your developers

When you don’t:

  • A simple Deployment + ConfigMap does the job
  • Helm charts with hooks are sufficient
  • You’re not ready to maintain custom Go code

Before we write code, let’s understand the building blocks.

CRDs (Custom Resource Definitions) extend the Kubernetes API. Instead of just Pods, Services, and Deployments, you can create your own resource types like WebApp, DatabaseCluster, or MLPipeline. Once you register a CRD, kubectl can work with resources of that type just like any built-in resource. An instance of the WebApp type we’ll build might look like this:

apiVersion: apps.example.com/v1
kind: WebApp
metadata:
  name: my-app
spec:
  image: nginx:latest
  replicas: 3
  port: 80

A controller watches resources and continuously reconciles actual state with desired state. The pattern is simple:

Observe → Diff → Act → Repeat
  1. Observe: Watch for changes to your custom resource (and any resources it owns)
  2. Diff: Compare what exists vs what should exist
  3. Act: Create, update, or delete resources to close the gap
  4. Repeat: Requeue and check again

The key insight: your reconcile function should be idempotent. Running it 100 times with the same input should produce the same result as running it once.
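
To make the loop and the idempotency requirement concrete, here is a minimal sketch with no Kubernetes dependencies. The `cluster` map and the `reconcile` function are stand-ins invented for this illustration; a real controller would talk to the API server instead.

```go
package main

import "fmt"

// cluster is a hypothetical stand-in for the API server: it maps a
// workload name to the number of replicas that currently exist.
type cluster map[string]int

// reconcile drives the actual replica count for name toward desired and
// returns the size of the gap it closed (0 means nothing had to change).
func reconcile(c cluster, name string, desired int) int {
	// Observe: read the current state.
	actual := c[name]

	// Diff: compare it with the desired state.
	delta := desired - actual

	// Act: only touch the world when there is a gap to close.
	if delta != 0 {
		c[name] = desired
	}
	return delta
}

func main() {
	c := cluster{}
	fmt.Println(reconcile(c, "my-app", 3)) // first pass closes the gap
	fmt.Println(reconcile(c, "my-app", 3)) // second pass finds nothing to do
	fmt.Println(c["my-app"])
}
```

Because the function compares before it acts, calling it again with the same input leaves the state untouched, which is exactly the property a real Reconcile needs.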

Kubernetes is declarative. You don’t say “create 3 pods”—you say “I want 3 pods” and the controller makes it happen. Your operator follows the same philosophy: users declare what they want, your controller figures out how to get there.

Let’s build something real. Our WebApp operator will manage a Deployment and a Service from a single custom resource. Users create a WebApp, and our operator handles the rest.

Prerequisites:

  • Go 1.21+
  • Docker
  • kubectl
  • A Kubernetes cluster (Kind or Minikube works great)
  • Kubebuilder installed

# Install kubebuilder
curl -L -o kubebuilder "https://go.kubebuilder.io/dl/latest/$(go env GOOS)/$(go env GOARCH)"
chmod +x kubebuilder && sudo mv kubebuilder /usr/local/bin/

# Scaffold the project
mkdir webapp-operator && cd webapp-operator
kubebuilder init --domain example.com --repo github.com/yourusername/webapp-operator

This creates the boilerplate: cmd/main.go, Makefile, Dockerfile, and config manifests. Next, scaffold the API type and its controller:

kubebuilder create api --group apps --version v1 --kind WebApp

Say yes to both prompts (create resource and controller). Kubebuilder generates:

  • api/v1/webapp_types.go — your CRD’s Go types
  • internal/controller/webapp_controller.go — your controller logic

Edit api/v1/webapp_types.go to define what a WebApp looks like:

package v1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// WebAppSpec defines the desired state of WebApp
type WebAppSpec struct {
	// Image is the container image to deploy
	// +kubebuilder:validation:Required
	Image string `json:"image"`

	// Replicas is the number of pod replicas
	// +kubebuilder:validation:Minimum=1
	// +kubebuilder:default=1
	Replicas int32 `json:"replicas,omitempty"`

	// Port is the container port to expose
	// +kubebuilder:validation:Minimum=1
	// +kubebuilder:validation:Maximum=65535
	// +kubebuilder:default=80
	Port int32 `json:"port,omitempty"`
}

// WebAppStatus defines the observed state of WebApp
type WebAppStatus struct {
	// AvailableReplicas is the number of ready pods
	AvailableReplicas int32 `json:"availableReplicas,omitempty"`

	// Conditions represent the latest available observations
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:printcolumn:name="Image",type=string,JSONPath=`.spec.image`
// +kubebuilder:printcolumn:name="Replicas",type=integer,JSONPath=`.spec.replicas`
// +kubebuilder:printcolumn:name="Available",type=integer,JSONPath=`.status.availableReplicas`

// WebApp is the Schema for the webapps API
type WebApp struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   WebAppSpec   `json:"spec,omitempty"`
	Status WebAppStatus `json:"status,omitempty"`
}

// +kubebuilder:object:root=true

// WebAppList contains a list of WebApp
type WebAppList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata,omitempty"`
	Items           []WebApp `json:"items"`
}

func init() {
	SchemeBuilder.Register(&WebApp{}, &WebAppList{})
}

The +kubebuilder comments are markers that generate OpenAPI validation and kubectl output columns.
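
If the markers feel abstract, it can help to read them as ordinary checks. The sketch below spells out, in plain Go, roughly what the markers on WebAppSpec ask the API server to enforce. It is only an illustration: the real enforcement happens through the generated OpenAPI schema, not through code like this, and the `spec`, `applyDefaults`, and `validate` names are made up for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// spec is a local stand-in for WebAppSpec so the example compiles
// without the generated API package.
type spec struct {
	Image    string
	Replicas int32
	Port     int32
}

// applyDefaults mimics the +kubebuilder:default markers. Simplification:
// it treats a zero value as "unset", whereas the API server only
// defaults fields that are absent from the submitted object.
func applyDefaults(s *spec) {
	if s.Replicas == 0 {
		s.Replicas = 1
	}
	if s.Port == 0 {
		s.Port = 80
	}
}

// validate mimics the +kubebuilder:validation markers. Simplification:
// an empty Image stands in for a missing field; the generated schema
// checks presence rather than emptiness.
func validate(s spec) error {
	if s.Image == "" {
		return errors.New("spec.image is required")
	}
	if s.Replicas < 1 {
		return errors.New("spec.replicas must be at least 1")
	}
	if s.Port < 1 || s.Port > 65535 {
		return errors.New("spec.port must be between 1 and 65535")
	}
	return nil
}

func main() {
	s := spec{Image: "nginx:1.25"}
	applyDefaults(&s)
	fmt.Println(s.Replicas, s.Port, validate(s))

	// A port outside the allowed range is rejected.
	fmt.Println(validate(spec{Image: "nginx:1.25", Replicas: 1, Port: 70000}))
}
```

The practical upshot is that invalid objects are rejected at admission time, so the controller never has to reconcile a WebApp with zero replicas or an impossible port.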

Regenerate the DeepCopy code and the CRD manifests:

make generate manifests

This is where the magic happens. Edit internal/controller/webapp_controller.go:

package controller

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/intstr"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
	"sigs.k8s.io/controller-runtime/pkg/log"

	webappv1 "github.com/yourusername/webapp-operator/api/v1"
)

// WebAppReconciler reconciles a WebApp object
type WebAppReconciler struct {
	client.Client
	Scheme *runtime.Scheme
}

// +kubebuilder:rbac:groups=apps.example.com,resources=webapps,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=apps.example.com,resources=webapps/status,verbs=get;update;patch
// +kubebuilder:rbac:groups=apps.example.com,resources=webapps/finalizers,verbs=update
// +kubebuilder:rbac:groups=apps,resources=deployments,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups="",resources=services,verbs=get;list;watch;create;update;patch;delete

func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	logger := log.FromContext(ctx)

	// Fetch the WebApp instance
	webapp := &webappv1.WebApp{}
	if err := r.Get(ctx, req.NamespacedName, webapp); err != nil {
		if errors.IsNotFound(err) {
			// Resource deleted - nothing to do
			logger.Info("WebApp resource not found, ignoring")
			return ctrl.Result{}, nil
		}
		logger.Error(err, "Failed to get WebApp")
		return ctrl.Result{}, err
	}

	// Reconcile Deployment
	if err := r.reconcileDeployment(ctx, webapp); err != nil {
		logger.Error(err, "Failed to reconcile Deployment")
		return ctrl.Result{}, err
	}

	// Reconcile Service
	if err := r.reconcileService(ctx, webapp); err != nil {
		logger.Error(err, "Failed to reconcile Service")
		return ctrl.Result{}, err
	}

	// Update status
	if err := r.updateStatus(ctx, webapp); err != nil {
		logger.Error(err, "Failed to update status")
		return ctrl.Result{}, err
	}

	logger.Info("Successfully reconciled WebApp", "name", webapp.Name)
	return ctrl.Result{}, nil
}

func (r *WebAppReconciler) reconcileDeployment(ctx context.Context, webapp *webappv1.WebApp) error {
	deploy := &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{
			Name:      webapp.Name,
			Namespace: webapp.Namespace,
		},
	}

	op, err := controllerutil.CreateOrUpdate(ctx, r.Client, deploy, func() error {
		// Set the deployment spec
		labels := map[string]string{
			"app":                          webapp.Name,
			"app.kubernetes.io/managed-by": "webapp-operator",
		}

		// Set only the fields this operator owns; replacing the whole spec
		// would discard values defaulted by the API server.
		deploy.Spec.Replicas = &webapp.Spec.Replicas
		deploy.Spec.Selector = &metav1.LabelSelector{MatchLabels: labels}
		deploy.Spec.Template.ObjectMeta.Labels = labels
		deploy.Spec.Template.Spec.Containers = []corev1.Container{
			{
				Name:  "app",
				Image: webapp.Spec.Image,
				Ports: []corev1.ContainerPort{
					{ContainerPort: webapp.Spec.Port},
				},
			},
		}

		// Set WebApp as the owner - enables garbage collection
		return controllerutil.SetControllerReference(webapp, deploy, r.Scheme)
	})

	if err != nil {
		return fmt.Errorf("failed to reconcile deployment: %w", err)
	}

	log.FromContext(ctx).Info("Reconciled Deployment", "operation", op)
	return nil
}

func (r *WebAppReconciler) reconcileService(ctx context.Context, webapp *webappv1.WebApp) error {
	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      webapp.Name,
			Namespace: webapp.Namespace,
		},
	}

	op, err := controllerutil.CreateOrUpdate(ctx, r.Client, svc, func() error {
		labels := map[string]string{
			"app": webapp.Name,
		}

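		// Set only the fields this operator owns so that server-assigned
		// values such as spec.clusterIP are preserved on update.
		svc.Spec.Selector = labels
		svc.Spec.Ports = []corev1.ServicePort{
			{
				Port:       webapp.Spec.Port,
				TargetPort: intstr.FromInt32(webapp.Spec.Port),
			},
		}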

		return controllerutil.SetControllerReference(webapp, svc, r.Scheme)
	})

	if err != nil {
		return fmt.Errorf("failed to reconcile service: %w", err)
	}

	log.FromContext(ctx).Info("Reconciled Service", "operation", op)
	return nil
}

func (r *WebAppReconciler) updateStatus(ctx context.Context, webapp *webappv1.WebApp) error {
	deploy := &appsv1.Deployment{}
	if err := r.Get(ctx, types.NamespacedName{Name: webapp.Name, Namespace: webapp.Namespace}, deploy); err != nil {
		return err
	}

	webapp.Status.AvailableReplicas = deploy.Status.AvailableReplicas
	return r.Status().Update(ctx, webapp)
}

// SetupWithManager sets up the controller with the Manager
func (r *WebAppReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&webappv1.WebApp{}).
		Owns(&appsv1.Deployment{}).
		Owns(&corev1.Service{}).
		Complete(r)
}

Key things happening here:

  1. CreateOrUpdate — This helper creates the resource if it doesn’t exist, or updates it if it does. Idempotency built-in.

  2. SetControllerReference — Sets the WebApp as the owner of the Deployment/Service. When the WebApp is deleted, Kubernetes automatically garbage collects the child resources.

  3. Owns(&appsv1.Deployment{}) — Tells the controller to also watch Deployments it owns. If someone manually edits the Deployment, the controller will reconcile it back.
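
To see why the create-or-update pattern gives you idempotency, here is a simplified, dependency-free model of it. This is not the controllerutil implementation; `object`, `store`, and `createOrUpdate` are hypothetical names for this sketch, and the returned strings only loosely mirror the operation results the real helper reports.

```go
package main

import "fmt"

// object is a hypothetical stand-in for a Kubernetes resource.
type object struct {
	Name     string
	Replicas int
	Owner    string
}

// store is a hypothetical stand-in for the API server, keyed by name.
type store map[string]object

// createOrUpdate fetches the object if it exists, lets mutate apply the
// desired state, and writes back only when something actually changed.
func createOrUpdate(s store, name string, mutate func(*object)) string {
	existing, found := s[name]
	obj := existing
	obj.Name = name
	mutate(&obj)

	switch {
	case !found:
		s[name] = obj
		return "created"
	case obj != existing:
		s[name] = obj
		return "updated"
	default:
		return "unchanged"
	}
}

func main() {
	s := store{}
	desired := func(o *object) {
		o.Replicas = 3
		o.Owner = "demo-app"
	}

	fmt.Println(createOrUpdate(s, "demo-app", desired)) // created
	fmt.Println(createOrUpdate(s, "demo-app", desired)) // unchanged

	// Simulate someone editing the child by hand; the next pass reverts it.
	edited := s["demo-app"]
	edited.Replicas = 10
	s["demo-app"] = edited
	fmt.Println(createOrUpdate(s, "demo-app", desired)) // updated
}
```

The mutate function describes the desired state rather than a sequence of steps, so the same function handles first creation, no-op passes, and drift correction.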

Start a local cluster and run the operator:

# Start a Kind cluster
kind create cluster

# Install the CRD
make install

# Run the operator locally (outside the cluster)
make run

In another terminal, create a WebApp:

kubectl apply -f - <<EOF
apiVersion: apps.example.com/v1
kind: WebApp
metadata:
  name: demo-app
spec:
  image: nginx:1.25
  replicas: 3
  port: 80
EOF

Watch the magic:

kubectl get webapps
kubectl get deployments
kubectl get services
kubectl get pods

Try updating the replicas or image—the operator will reconcile automatically.

# Build and push the operator image
make docker-build docker-push IMG=yourusername/webapp-operator:v0.1.0

# Deploy to cluster
make deploy IMG=yourusername/webapp-operator:v0.1.0

Your reconcile function will be called multiple times—on create, on update, on resync, on restart. It must handle all cases gracefully. Use CreateOrUpdate or check-before-act patterns.

If your operator creates resources outside Kubernetes (cloud resources, external databases), use finalizers to ensure cleanup:

const finalizerName = "apps.example.com/finalizer"

// In Reconcile():
if webapp.ObjectMeta.DeletionTimestamp.IsZero() {
    // Not being deleted - add finalizer if missing
    if !controllerutil.ContainsFinalizer(webapp, finalizerName) {
        controllerutil.AddFinalizer(webapp, finalizerName)
        if err := r.Update(ctx, webapp); err != nil {
            return ctrl.Result{}, err
        }
    }
} else {
    // Being deleted - clean up external resources.
    // cleanupExternalResources is a helper you write; it must be safe to retry.
    if controllerutil.ContainsFinalizer(webapp, finalizerName) {
        if err := r.cleanupExternalResources(ctx, webapp); err != nil {
            return ctrl.Result{}, err
        }
        controllerutil.RemoveFinalizer(webapp, finalizerName)
        if err := r.Update(ctx, webapp); err != nil {
            return ctrl.Result{}, err
        }
    }
    return ctrl.Result{}, nil
}

Return errors to trigger a requeue with exponential backoff. For expected temporary failures, requeue explicitly:

// Requeue after 30 seconds
return ctrl.Result{RequeueAfter: 30 * time.Second}, nil

// Requeue immediately (with rate limiting)
return ctrl.Result{Requeue: true}, nil

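Kubebuilder generates a test suite built on envtest, which starts a real API server and etcd locally. No kubelet or built-in controllers run, so a Deployment your operator creates is stored but never produces running Pods: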

make test

Write tests for your reconcile logic—it’s the most critical code path.

We’ve built a functional Kubernetes operator that manages Deployments and Services from a single custom resource. The patterns here—CRDs, reconciliation loops, owner references—apply to operators of any complexity.

Full source code: github.com/yourusername/webapp-operator

Happy operating 🤖