Lean Thinking for Platform Engineering: Finding the Muda in Your Platform


Toyota revolutionized manufacturing by relentlessly eliminating waste—muda in Japanese.

Your platform has waste too. Developers waiting for CI. Unnecessary approval steps. Defects that cause rollbacks. Features nobody uses.

Lean thinking provides a framework to find and eliminate this waste. Let’s apply the seven wastes to platform engineering.

Toyota identified seven categories of waste. Each has a direct parallel in platform engineering.

| Waste | Manufacturing | Platform Engineering |
| --- | --- | --- |
| Transport | Moving parts unnecessarily | Moving data between systems |
| Inventory | Excess stock | Over-provisioned resources |
| Motion | Unnecessary worker movement | Context switching, navigation |
| Waiting | Idle time | Queue time, approval delays |
| Overproduction | Making too much | Building unused features |
| Overprocessing | Unnecessary work | Excessive compliance, redundant checks |
| Defects | Rework, scrap | Incidents, rollbacks, debugging |

Let’s examine each in detail.

Waste 1: Transport

Manufacturing: Moving parts between workstations adds time but no value.

Platform engineering: Moving data, artifacts, or requests between systems unnecessarily.

Code artifact journey:
  Git → Jenkins → Artifactory → Jenkins → Kubernetes → Registry → Kubernetes

Why does the artifact make six hops across five different systems?
Could it go: Git → Build → Registry → Kubernetes?

Log data transport:
  App → Fluentd → Kafka → Logstash → Elasticsearch → Kibana

Each hop adds latency, complexity, and failure risk.
Request routing:
  User → CDN → Load Balancer → API Gateway → Service Mesh → Service

Is every hop necessary?
Exercise: Draw your deployment pipeline

For each arrow between systems:
  - Why does this transition exist?
  - What value does it add?
  - Could we eliminate or combine steps?
Before: Git → Jenkins → Artifactory → Spinnaker → Kubernetes
After:  Git → GitHub Actions → Kubernetes (direct deploy)

Removed: 2 system transitions, 3 integration points
Result:  Faster deploys, fewer failure modes

Waste 2: Inventory

Manufacturing: Excess stock ties up capital and hides problems.

Platform engineering: Over-provisioned resources, unused capacity, accumulated queues.

Resource inventory:
  Reserved instances at 30% utilization
  Kubernetes nodes with 20% pod density
  10TB of "just in case" storage
  Environments that nobody uses anymore
Work inventory (WIP):
  50 open PRs waiting for review
  20 tickets "in progress" for weeks
  12 half-finished platform features
Data inventory:
  Logs retained for 2 years (policy requires 90 days)
  Backups of decommissioned systems
  Metrics at 10-second granularity stored forever
Resource audit:
  - What's the utilization of each resource type?
  - What's the oldest unused environment?
  - How much data is past retention policy?

Work audit:
  - How many items are in WIP?
  - What's the average age of WIP?
  - What's blocked and why?
Little's Law: Lead Time = WIP / Throughput

To reduce lead time, reduce WIP:
  - PR limit: Max 3 open PRs per developer
  - Feature limit: Max 2 features in progress per team
  - Environment cleanup: Delete after 7 days of inactivity
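
A quick way to make this concrete is to plug your own numbers into Little's Law. A minimal Python sketch, with made-up PR counts and throughput:

```python
# Little's Law: lead time = WIP / throughput
def lead_time_days(wip_items: int, throughput_per_day: float) -> float:
    """Average lead time implied by current WIP and completion rate."""
    return wip_items / throughput_per_day

# 50 open PRs, team merges 10 PRs/day -> reviews sit about 5 days on average.
print(lead_time_days(50, 10))   # 5.0

# Cap WIP at 15 open PRs at the same throughput -> about 1.5 days.
print(lead_time_days(15, 10))   # 1.5
```

Nothing about the team got faster; the queue just got shorter.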
Resource right-sizing:
  Before: 20 nodes at 20% utilization
  After:  8 nodes at 50% utilization
  Savings: 60% compute cost

Waste 3: Motion

Manufacturing: Workers walking to get tools or materials.

Platform engineering: Developers navigating systems, context switching, hunting for information.

Deployment motion:
  1. Open Jenkins (find the right job)
  2. Check the logs (scroll, scroll, scroll)
  3. Open Kubernetes dashboard (find the namespace)
  4. Check pod status (wait for it to load)
  5. Open Datadog (create the right query)
  6. Verify metrics (adjust time range)

6 different systems, 6 context switches.
Debugging motion:
  "Where are the logs for this service?"
  → Slack the on-call
  → They say "check Kibana"
  → Can't find the right index
  → Ask again
  → "Oh, that service uses CloudWatch"
  → Find CloudWatch
  → Wrong region
Information hunting:
  "What's the config for this service?"
  → Check Git (which repo?)
  → Check wiki (outdated)
  → Check Confluence (wrong version)
  → Ask in Slack (wait for response)
Shadow a developer for a day:
  - How many systems do they touch?
  - How many times do they context switch?
  - How much time finding vs doing?
  - What questions do they ask repeatedly?
Single pane of glass:
  Before: 6 tools to check deployment status
  After:  1 dashboard with deployment, logs, metrics, alerts

Reduced motion: 5 context switches eliminated
Self-service answers:
  Before: Slack questions about config, access, status
  After:  Internal developer portal with search

Reduced motion: No more hunting, asking, waiting

Waste 4: Waiting

Manufacturing: Workers or machines idle, waiting for inputs.

Platform engineering: Developers waiting for builds, tests, approvals, environments.

CI/CD waiting:
  Build queue time:       15 minutes
  Build execution:        10 minutes
  Test queue time:        10 minutes
  Test execution:         20 minutes
  Deploy approval wait:   4 hours
  Deploy execution:       5 minutes

Total: 5 hours
Active work: 35 minutes
Waiting: 4 hours 25 minutes (nearly 90% of the elapsed time)

Environment waiting:
  "I need a staging environment"
  → Submit request ticket
  → Wait for approval (1 day)
  → Wait for provisioning (1 day)
  → Environment ready (2 days later)
Human bottleneck waiting:
  Code review:            2 days average
  Security review:        1 week average
  Architecture review:    2 weeks average
Measure queue times at each step:
  Time in queue / Time being processed = Wait Ratio

Wait ratio > 1 = More waiting than working
Wait ratio > 5 = Severe waiting waste
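
As a sketch of how to measure this, here is a short Python script using the illustrative pipeline timings from above; substitute numbers pulled from your own CI system:

```python
# Queue vs processing time per pipeline stage, in minutes (illustrative numbers).
stages = {
    "build":  {"queue": 15, "process": 10},
    "test":   {"queue": 10, "process": 20},
    "deploy": {"queue": 240, "process": 5},  # 4-hour approval wait counted as queue time
}

for name, times in stages.items():
    ratio = times["queue"] / times["process"]
    label = "severe" if ratio > 5 else ("waste" if ratio > 1 else "ok")
    print(f"{name:7s} wait ratio = {ratio:5.1f}  ({label})")

total_queue = sum(t["queue"] for t in stages.values())   # 265 minutes
total_work = sum(t["process"] for t in stages.values())  # 35 minutes
print(f"waiting is {total_queue / (total_queue + total_work):.0%} of elapsed time")
```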
Build queue time:
  Before: 15 minutes (shared build agents)
  After:  0 minutes (auto-scaling build agents)

Test parallelization:
  Before: 20 minutes (sequential)
  After:  5 minutes (parallelized)

Self-service environments:
  Before: 2 days (ticket + approval + provision)
  After:  10 minutes (automated provisioning)
Async approvals:
  Before: Block on human approval
  After:  Deploy to staging immediately, require approval for prod

Reduced wait without reducing safety.

Waste 5: Overproduction

Manufacturing: Making more than customers need.

Platform engineering: Building features that aren’t used, over-engineering solutions.

Platform feature graveyard:
  - Custom deployment strategies (nobody uses)
  - Advanced caching layer (one team tried once)
  - Multi-region support (never activated)
  - Plugin system (no plugins built)
Premature optimization:
  "We built this to handle 10x our current scale"
  (Scale never came)
  (Complexity remains)
Documentation overproduction:
  100-page architecture doc (never read)
  Detailed runbooks (outdated before finished)
  Video tutorials (nobody watches)
Feature usage audit:
  For each platform capability:
    - How many teams use it?
    - How often is it used?
    - If removed, who would notice?
Build only what's needed:
  Before: Design for hypothetical scale
  After:  Build for current needs + clear extension points

Kill unused features:
  If usage < 5%:  Deprecate it
  If usage = 0%:  Remove it
  
Sunsets reduce maintenance burden.
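
Here is a minimal sketch of that usage audit in Python, applying the thresholds above. The feature names and adoption counts are invented; in practice they would come from your own telemetry:

```python
# Per-feature adoption across teams (made-up numbers for illustration).
usage = {
    "custom-deploy-strategies": {"teams_using": 0,  "total_teams": 40},
    "advanced-cache-layer":     {"teams_using": 1,  "total_teams": 40},
    "standard-ci-templates":    {"teams_using": 35, "total_teams": 40},
}

for feature, stats in usage.items():
    adoption = stats["teams_using"] / stats["total_teams"]
    if adoption == 0:
        action = "remove"      # usage = 0%
    elif adoption < 0.05:
        action = "deprecate"   # usage < 5%
    else:
        action = "keep"
    print(f"{feature:26s} {adoption:5.1%}  -> {action}")
```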
Minimal documentation:
  Before: Comprehensive docs (never read, always stale)
  After:  README + runbook + examples (actually maintained)

Waste 6: Overprocessing

Manufacturing: More precision or steps than needed.

Platform engineering: Excessive process, redundant checks, unnecessary rigor.

Approval overhead:
  Deploy to dev:     Requires approval (why?)
  Change config:     Requires CAB ticket (really?)
  Add team member:   Requires 3 sign-offs (necessary?)
Compliance theater:
  Security scan on every commit (same code, same result)
  Vulnerability report nobody reads
  Audit logs nobody audits
  Checklist nobody checks
Process for process's sake:
  "We need a design doc for this 10-line change"
  "Let's schedule a review meeting" (for trivial changes)
  "Fill out this template" (fields don't apply)
For each process step:
  - What risk does this mitigate?
  - Has that risk ever materialized?
  - Is there a lighter-weight alternative?
  - Who would notice if we skipped it?
Risk-based approvals:
  Before: All changes require approval
  After:  
    Low risk (dev, config):  Auto-approve
    Medium risk (staging):   Peer approve
    High risk (prod):        Lead approve
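
One way to encode that policy is as a small routing function in your deploy tooling. This Python sketch uses illustrative environment names and rules:

```python
from enum import Enum

class Approval(Enum):
    AUTO = "auto-approve"
    PEER = "peer approval"
    LEAD = "lead approval"

def required_approval(environment: str, change_type: str) -> Approval:
    """Route each change to the lightest control that still covers its risk."""
    if environment == "prod":
        return Approval.LEAD
    if environment == "dev" or change_type == "config":
        return Approval.AUTO
    if environment == "staging":
        return Approval.PEER
    return Approval.LEAD  # anything unrecognized takes the strictest path

print(required_approval("dev", "code").value)     # auto-approve
print(required_approval("prod", "schema").value)  # lead approval
```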
Smart scanning:
  Before: Full security scan on every commit
  After:  Full scan on changed files only
          Full scan nightly
          
  Reduced: 80% scan time with same coverage
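
A sketch of the changed-files half of that policy, assuming a Git repository; the base branch name is an example, and the resulting list would be handed to whatever scanner you already run:

```python
import subprocess

def changed_files(base: str = "origin/main") -> list[str]:
    """Files modified between the base branch and HEAD."""
    result = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in result.stdout.splitlines() if line]

if __name__ == "__main__":
    for path in changed_files():
        print(path)  # feed this list to your scanner instead of the whole repo
```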

Waste 7: Defects

Manufacturing: Scrap, rework, quality failures.

Platform engineering: Incidents, rollbacks, debugging, incorrect configurations.

Deployment defects:
  Failed deployments:     15% of deploys
  Rollbacks:              5% of deploys
  Time to fix:            2 hours average
  
  If 100 deploys/week:
    15 failures × 2 hours = 30 hours of rework/week
Configuration defects:
  "Why is this environment broken?"
  → Config drift from production
  → 4 hours debugging
  → Manual fix
  → (Will happen again)
Platform defects:
  CI randomly fails (flaky tests)
  Environments randomly break (resource limits)
  Deploys randomly timeout (network issues)
  
  "Random" = Unresolved defects
Track defect metrics:
  - Deployment failure rate
  - Mean time to recovery
  - Rollback frequency
  - Repeat incidents (same cause)
  - Escaped defects (caught in production)
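
As a sketch, most of those metrics fall out of a simple aggregation over deploy records. The records below are invented; in practice you would pull them from your CD system:

```python
# One record per deploy (made-up data).
deploys = [
    {"ok": True,  "rolled_back": False, "fix_hours": 0.0},
    {"ok": False, "rolled_back": True,  "fix_hours": 2.0},
    {"ok": False, "rolled_back": False, "fix_hours": 1.5},
    {"ok": True,  "rolled_back": False, "fix_hours": 0.0},
]

failures = [d for d in deploys if not d["ok"]]
failure_rate = len(failures) / len(deploys)
rollback_rate = sum(d["rolled_back"] for d in deploys) / len(deploys)
rework_hours = sum(d["fix_hours"] for d in failures)

print(f"deployment failure rate: {failure_rate:.0%}")
print(f"rollback frequency:      {rollback_rate:.0%}")
print(f"rework hours:            {rework_hours}")
```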
Build quality in:
  Before: Test in production, fix defects
  After:  Fail fast in CI, prevent defects

Poka-yoke (mistake-proofing):
  Before: Config as free-form YAML (easy to break)
  After:  Config as typed schema (invalid = won't deploy)
  
Reduced: 80% of config-related incidents
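
A minimal Python sketch of that mistake-proofing idea, assuming a few illustrative config fields: the config is parsed into a typed object, and an invalid value raises before any deploy can start.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ServiceConfig:
    replicas: int
    memory_mb: int
    environment: str

    def __post_init__(self):
        if self.replicas < 1:
            raise ValueError("replicas must be >= 1")
        if self.memory_mb < 128:
            raise ValueError("memory_mb must be >= 128")
        if self.environment not in {"dev", "staging", "prod"}:
            raise ValueError(f"unknown environment: {self.environment!r}")

ServiceConfig(replicas=3, memory_mb=512, environment="prod")        # valid, deploy proceeds
ServiceConfig(replicas=0, memory_mb=512, environment="production")  # raises, nothing deploys
```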
Root cause elimination:
  Before: Fix the incident, move on
  After:  Fix the incident, fix the system that allowed it
  
  Incident → Post-mortem → Prevention → No repeat

Walk through your platform with the waste lens:

For each developer workflow:

1. Map the value stream (every step from code to production)

2. Categorize each step:
   - Value-add: Customer would pay for this
   - Non-value but necessary: Required (compliance, safety)
   - Waste: Neither valuable nor necessary

3. For each waste, identify the type:
   - Transport: Unnecessary movement of artifacts/data
   - Inventory: Excess resources, WIP, data
   - Motion: Developer context switching, hunting
   - Waiting: Queue time, approval delays
   - Overproduction: Unused features, premature optimization
   - Overprocessing: Unnecessary rigor, redundant checks
   - Defects: Failures, rework, debugging

4. Prioritize by impact:
   - How much time/money does this waste?
   - How hard is it to eliminate?
   - Focus on high-impact, low-effort first
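
To make step 4 concrete, here is a small Python sketch that ranks waste items by impact divided by effort; every number below is made up:

```python
# Rough impact/effort scoring for the waste backlog (illustrative numbers).
wastes = [
    {"name": "4-hour prod approval wait", "hours_lost_per_week": 40, "effort_days": 5},
    {"name": "flaky integration tests",   "hours_lost_per_week": 12, "effort_days": 10},
    {"name": "manual env provisioning",   "hours_lost_per_week": 16, "effort_days": 3},
]

for item in sorted(wastes, key=lambda w: w["hours_lost_per_week"] / w["effort_days"], reverse=True):
    score = item["hours_lost_per_week"] / item["effort_days"]
    print(f'{item["name"]:28s} score = {score:.1f}')
```

Tackle the list from the top.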

Lean isn’t a one-time audit. It’s a culture.

Daily:
  - Notice waste when you see it
  - Quick fixes for small waste

Weekly:
  - Team discusses biggest waste encountered
  - Prioritize improvement items

Monthly:
  - Measure waste metrics (lead time, failure rate, utilization)
  - Track improvement trends

Quarterly:
  - Value stream mapping exercise
  - Major waste elimination initiatives

The seven wastes in platform engineering:

| Waste | Platform Symptoms | Elimination |
| --- | --- | --- |
| Transport | Artifacts touching too many systems | Simplify the pipeline |
| Inventory | Over-provisioned resources, excess WIP | Right-size, limit WIP |
| Motion | Context switching, hunting | Single pane of glass |
| Waiting | Queue time, approval delays | Parallelize, self-service |
| Overproduction | Unused features | Build only what’s needed |
| Overprocessing | Excessive process | Risk-based controls |
| Defects | Failures, rollbacks | Build quality in |

The lean mindset:

Ask constantly:
  - Is this step adding value?
  - Who is waiting? Why?
  - What did we build that nobody uses?
  - What failed? How do we prevent it?

Toyota didn’t become Toyota in a day. They improved relentlessly for decades.

Your platform can too. One waste at a time.

Start today: Pick one workflow. Map it. Find the muda. Eliminate it.

Then do it again tomorrow.