Lean Thinking for Platform Engineering: Finding the Muda in Your Platform


Toyota revolutionized manufacturing by relentlessly eliminating waste—muda in Japanese.

Your platform has waste too. Developers waiting for CI. Unnecessary approval steps. Defects that cause rollbacks. Features nobody uses.

Lean thinking provides a framework to find and eliminate this waste. Let’s apply the seven wastes to platform engineering.

Toyota identified seven categories of waste. Each has a direct parallel in platform engineering.

| Waste | Manufacturing | Platform Engineering |
| --- | --- | --- |
| Transport | Moving parts unnecessarily | Moving data between systems |
| Inventory | Excess stock | Over-provisioned resources |
| Motion | Unnecessary worker movement | Context switching, navigation |
| Waiting | Idle time | Queue time, approval delays |
| Overproduction | Making too much | Building unused features |
| Overprocessing | Unnecessary work | Excessive compliance, redundant checks |
| Defects | Rework, scrap | Incidents, rollbacks, debugging |

Let’s examine each in detail.

Waste 1: Transport

Manufacturing: Moving parts between workstations adds time but no value.

Platform engineering: Moving data, artifacts, or requests between systems unnecessarily.

Code artifact journey:
  Git → Jenkins → Artifactory → Jenkins → Kubernetes → Registry → Kubernetes

Why does the artifact make six hops across five different systems?
Could it go: Git → Build → Registry → Kubernetes?

Log data transport:
  App → Fluentd → Kafka → Logstash → Elasticsearch → Kibana

Each hop adds latency, complexity, and failure risk.
Request routing:
  User → CDN → Load Balancer → API Gateway → Service Mesh → Service

Is every hop necessary?
Exercise: Draw your deployment pipeline

For each arrow between systems:
  - Why does this transition exist?
  - What value does it add?
  - Could we eliminate or combine steps?
Before: Git → Jenkins → Artifactory → Spinnaker → Kubernetes
After:  Git → GitHub Actions → Kubernetes (direct deploy)

Removed: 2 system transitions, 3 integration points
Result:  Faster deploys, fewer failure modes

Waste 2: Inventory

Manufacturing: Excess stock ties up capital and hides problems.

Platform engineering: Over-provisioned resources, unused capacity, accumulated queues.

Resource inventory:
  Reserved instances at 30% utilization
  Kubernetes nodes with 20% pod density
  10TB of "just in case" storage
  Environments that nobody uses anymore
Work inventory (WIP):
  50 open PRs waiting for review
  20 tickets "in progress" for weeks
  12 half-finished platform features
Data inventory:
  Logs retained for 2 years (policy requires 90 days)
  Backups of decommissioned systems
  Metrics at 10-second granularity stored forever
Resource audit:
  - What's the utilization of each resource type?
  - What's the oldest unused environment?
  - How much data is past retention policy?

Work audit:
  - How many items are in WIP?
  - What's the average age of WIP?
  - What's blocked and why?
Little's Law: Lead Time = WIP / Throughput

To reduce lead time, reduce WIP:
  - PR limit: Max 3 open PRs per developer
  - Feature limit: Max 2 features in progress per team
  - Environment cleanup: Delete after 7 days of inactivity
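
A quick way to make this concrete is to plug your own numbers into Little's Law. A minimal Python sketch, with made-up PR counts and throughput:

```python
# Little's Law: lead time = WIP / throughput
def lead_time_days(wip_items: int, throughput_per_day: float) -> float:
    """Average lead time implied by current WIP and completion rate."""
    return wip_items / throughput_per_day

# 50 open PRs, team merges 10 PRs/day -> reviews sit about 5 days on average.
print(lead_time_days(50, 10))   # 5.0

# Cap WIP at 15 open PRs at the same throughput -> about 1.5 days.
print(lead_time_days(15, 10))   # 1.5
```

Nothing about the team got faster; the queue just got shorter.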
Resource right-sizing:
  Before: 20 nodes at 20% utilization
  After:  8 nodes at 50% utilization
  Savings: 60% compute cost

Waste 3: Motion

Manufacturing: Workers walking to get tools or materials.

Platform engineering: Developers navigating systems, context switching, hunting for information.

Deployment motion:
  1. Open Jenkins (find the right job)
  2. Check the logs (scroll, scroll, scroll)
  3. Open Kubernetes dashboard (find the namespace)
  4. Check pod status (wait for it to load)
  5. Open Datadog (create the right query)
  6. Verify metrics (adjust time range)

6 different systems, 6 context switches.
Debugging motion:
  "Where are the logs for this service?"
  → Slack the on-call
  → They say "check Kibana"
  → Can't find the right index
  → Ask again
  → "Oh, that service uses CloudWatch"
  → Find CloudWatch
  → Wrong region
Information hunting:
  "What's the config for this service?"
  → Check Git (which repo?)
  → Check wiki (outdated)
  → Check Confluence (wrong version)
  → Ask in Slack (wait for response)
Shadow a developer for a day:
  - How many systems do they touch?
  - How many times do they context switch?
  - How much time finding vs doing?
  - What questions do they ask repeatedly?
Single pane of glass:
  Before: 6 tools to check deployment status
  After:  1 dashboard with deployment, logs, metrics, alerts

Reduced motion: 5 context switches eliminated
Self-service answers:
  Before: Slack questions about config, access, status
  After:  Internal developer portal with search

Reduced motion: No more hunting, asking, waiting

Waste 4: Waiting

Manufacturing: Workers or machines idle, waiting for inputs.

Platform engineering: Developers waiting for builds, tests, approvals, environments.

CI/CD waiting:
  Build queue time:       15 minutes
  Build execution:        10 minutes
  Test queue time:        10 minutes
  Test execution:         20 minutes
  Deploy approval wait:   4 hours
  Deploy execution:       5 minutes

Total: 5 hours
Active work: 35 minutes
Waiting: 4 hours 25 minutes (nearly 90% of the elapsed time)

Environment waiting:
  "I need a staging environment"
  → Submit request ticket
  → Wait for approval (1 day)
  → Wait for provisioning (1 day)
  → Environment ready (2 days later)
Human bottleneck waiting:
  Code review:            2 days average
  Security review:        1 week average
  Architecture review:    2 weeks average
Measure queue times at each step:
  Time in queue / Time being processed = Wait Ratio

Wait ratio > 1 = More waiting than working
Wait ratio > 5 = Severe waiting waste
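
As a sketch of how to measure this, here is a short Python script using the illustrative pipeline timings from above; substitute numbers pulled from your own CI system:

```python
# Queue vs processing time per pipeline stage, in minutes (illustrative numbers).
stages = {
    "build":  {"queue": 15, "process": 10},
    "test":   {"queue": 10, "process": 20},
    "deploy": {"queue": 240, "process": 5},  # 4-hour approval wait counted as queue time
}

for name, times in stages.items():
    ratio = times["queue"] / times["process"]
    label = "severe" if ratio > 5 else ("waste" if ratio > 1 else "ok")
    print(f"{name:7s} wait ratio = {ratio:5.1f}  ({label})")

total_queue = sum(t["queue"] for t in stages.values())   # 265 minutes
total_work = sum(t["process"] for t in stages.values())  # 35 minutes
print(f"waiting is {total_queue / (total_queue + total_work):.0%} of elapsed time")
```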
Build queue time:
  Before: 15 minutes (shared build agents)
  After:  0 minutes (auto-scaling build agents)

Test parallelization:
  Before: 20 minutes (sequential)
  After:  5 minutes (parallelized)

Self-service environments:
  Before: 2 days (ticket + approval + provision)
  After:  10 minutes (automated provisioning)
Async approvals:
  Before: Block on human approval
  After:  Deploy to staging immediately, require approval for prod

Reduced wait without reducing safety.

Waste 5: Overproduction

Manufacturing: Making more than customers need.

Platform engineering: Building features that aren’t used, over-engineering solutions.

Platform feature graveyard:
  - Custom deployment strategies (nobody uses)
  - Advanced caching layer (one team tried once)
  - Multi-region support (never activated)
  - Plugin system (no plugins built)
Premature optimization:
  "We built this to handle 10x our current scale"
  (Scale never came)
  (Complexity remains)
Documentation overproduction:
  100-page architecture doc (never read)
  Detailed runbooks (outdated before finished)
  Video tutorials (nobody watches)
Feature usage audit:
  For each platform capability:
    - How many teams use it?
    - How often is it used?
    - If removed, who would notice?
Build only what's needed:
  Before: Design for hypothetical scale
  After:  Build for current needs + clear extension points

Kill unused features:
  If usage < 5%:  Deprecate it
  If usage = 0%:  Remove it
  
Sunsets reduce maintenance burden.
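
Here is a minimal sketch of that usage audit in Python, applying the thresholds above. The feature names and adoption counts are invented; in practice they would come from your own telemetry:

```python
# Per-feature adoption across teams (made-up numbers for illustration).
usage = {
    "custom-deploy-strategies": {"teams_using": 0,  "total_teams": 40},
    "advanced-cache-layer":     {"teams_using": 1,  "total_teams": 40},
    "standard-ci-templates":    {"teams_using": 35, "total_teams": 40},
}

for feature, stats in usage.items():
    adoption = stats["teams_using"] / stats["total_teams"]
    if adoption == 0:
        action = "remove"      # usage = 0%
    elif adoption < 0.05:
        action = "deprecate"   # usage < 5%
    else:
        action = "keep"
    print(f"{feature:26s} {adoption:5.1%}  -> {action}")
```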
Minimal documentation:
  Before: Comprehensive docs (never read, always stale)
  After:  README + runbook + examples (actually maintained)

Waste 6: Overprocessing

Manufacturing: More precision or steps than needed.

Platform engineering: Excessive process, redundant checks, unnecessary rigor.

Approval overhead:
  Deploy to dev:     Requires approval (why?)
  Change config:     Requires CAB ticket (really?)
  Add team member:   Requires 3 sign-offs (necessary?)
Compliance theater:
  Security scan on every commit (same code, same result)
  Vulnerability report nobody reads
  Audit logs nobody audits
  Checklist nobody checks
Process for process's sake:
  "We need a design doc for this 10-line change"
  "Let's schedule a review meeting" (for trivial changes)
  "Fill out this template" (fields don't apply)
For each process step:
  - What risk does this mitigate?
  - Has that risk ever materialized?
  - Is there a lighter-weight alternative?
  - Who would notice if we skipped it?
Risk-based approvals:
  Before: All changes require approval
  After:  
    Low risk (dev, config):  Auto-approve
    Medium risk (staging):   Peer approve
    High risk (prod):        Lead approve
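
One way to encode that policy is as a small routing function in your deploy tooling. This Python sketch uses illustrative environment names and rules:

```python
from enum import Enum

class Approval(Enum):
    AUTO = "auto-approve"
    PEER = "peer approval"
    LEAD = "lead approval"

def required_approval(environment: str, change_type: str) -> Approval:
    """Route each change to the lightest control that still covers its risk."""
    if environment == "prod":
        return Approval.LEAD
    if environment == "dev" or change_type == "config":
        return Approval.AUTO
    if environment == "staging":
        return Approval.PEER
    return Approval.LEAD  # anything unrecognized takes the strictest path

print(required_approval("dev", "code").value)     # auto-approve
print(required_approval("prod", "schema").value)  # lead approval
```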
Smart scanning:
  Before: Full security scan on every commit
  After:  Full scan on changed files only
          Full scan nightly
          
  Reduced: 80% scan time with same coverage
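
A sketch of the changed-files half of that policy, assuming a Git repository; the base branch name is an example, and the resulting list would be handed to whatever scanner you already run:

```python
import subprocess

def changed_files(base: str = "origin/main") -> list[str]:
    """Files modified between the base branch and HEAD."""
    result = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in result.stdout.splitlines() if line]

if __name__ == "__main__":
    for path in changed_files():
        print(path)  # feed this list to your scanner instead of the whole repo
```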

Waste 7: Defects

Manufacturing: Scrap, rework, quality failures.

Platform engineering: Incidents, rollbacks, debugging, incorrect configurations.

Deployment defects:
  Failed deployments:     15% of deploys
  Rollbacks:              5% of deploys
  Time to fix:            2 hours average
  
  If 100 deploys/week:
    15 failures × 2 hours = 30 hours of rework/week
Configuration defects:
  "Why is this environment broken?"
  → Config drift from production
  → 4 hours debugging
  → Manual fix
  → (Will happen again)
Platform defects:
  CI randomly fails (flaky tests)
  Environments randomly break (resource limits)
  Deploys randomly timeout (network issues)
  
  "Random" = Unresolved defects
Track defect metrics:
  - Deployment failure rate
  - Mean time to recovery
  - Rollback frequency
  - Repeat incidents (same cause)
  - Escaped defects (caught in production)
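
As a sketch, most of those metrics fall out of a simple aggregation over deploy records. The records below are invented; in practice you would pull them from your CD system:

```python
# One record per deploy (made-up data).
deploys = [
    {"ok": True,  "rolled_back": False, "fix_hours": 0.0},
    {"ok": False, "rolled_back": True,  "fix_hours": 2.0},
    {"ok": False, "rolled_back": False, "fix_hours": 1.5},
    {"ok": True,  "rolled_back": False, "fix_hours": 0.0},
]

failures = [d for d in deploys if not d["ok"]]
failure_rate = len(failures) / len(deploys)
rollback_rate = sum(d["rolled_back"] for d in deploys) / len(deploys)
rework_hours = sum(d["fix_hours"] for d in failures)

print(f"deployment failure rate: {failure_rate:.0%}")
print(f"rollback frequency:      {rollback_rate:.0%}")
print(f"rework hours:            {rework_hours}")
```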
Build quality in:
  Before: Test in production, fix defects
  After:  Fail fast in CI, prevent defects

Poka-yoke (mistake-proofing):
  Before: Config as free-form YAML (easy to break)
  After:  Config as typed schema (invalid = won't deploy)
  
Reduced: 80% of config-related incidents
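
A minimal Python sketch of that mistake-proofing idea, assuming a few illustrative config fields: the config is parsed into a typed object, and an invalid value raises before any deploy can start.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ServiceConfig:
    replicas: int
    memory_mb: int
    environment: str

    def __post_init__(self):
        if self.replicas < 1:
            raise ValueError("replicas must be >= 1")
        if self.memory_mb < 128:
            raise ValueError("memory_mb must be >= 128")
        if self.environment not in {"dev", "staging", "prod"}:
            raise ValueError(f"unknown environment: {self.environment!r}")

ServiceConfig(replicas=3, memory_mb=512, environment="prod")        # valid, deploy proceeds
ServiceConfig(replicas=0, memory_mb=512, environment="production")  # raises, nothing deploys
```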
Root cause elimination:
  Before: Fix the incident, move on
  After:  Fix the incident, fix the system that allowed it
  
  Incident → Post-mortem → Prevention → No repeat

Walk through your platform with the waste lens:

For each developer workflow:

1. Map the value stream (every step from code to production)

2. Categorize each step:
   - Value-add: Customer would pay for this
   - Non-value but necessary: Required (compliance, safety)
   - Waste: Neither valuable nor necessary

3. For each waste, identify the type:
   - Transport: Unnecessary movement of artifacts/data
   - Inventory: Excess resources, WIP, data
   - Motion: Developer context switching, hunting
   - Waiting: Queue time, approval delays
   - Overproduction: Unused features, premature optimization
   - Overprocessing: Unnecessary rigor, redundant checks
   - Defects: Failures, rework, debugging

4. Prioritize by impact:
   - How much time/money does this waste?
   - How hard is it to eliminate?
   - Focus on high-impact, low-effort first
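
To make step 4 concrete, here is a small Python sketch that ranks waste items by impact divided by effort; every number below is made up:

```python
# Rough impact/effort scoring for the waste backlog (illustrative numbers).
wastes = [
    {"name": "4-hour prod approval wait", "hours_lost_per_week": 40, "effort_days": 5},
    {"name": "flaky integration tests",   "hours_lost_per_week": 12, "effort_days": 10},
    {"name": "manual env provisioning",   "hours_lost_per_week": 16, "effort_days": 3},
]

for item in sorted(wastes, key=lambda w: w["hours_lost_per_week"] / w["effort_days"], reverse=True):
    score = item["hours_lost_per_week"] / item["effort_days"]
    print(f'{item["name"]:28s} score = {score:.1f}')
```

Tackle the list from the top.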

Lean isn’t a one-time audit. It’s a culture.

Daily:
  - Notice waste when you see it
  - Quick fixes for small waste

Weekly:
  - Team discusses biggest waste encountered
  - Prioritize improvement items

Monthly:
  - Measure waste metrics (lead time, failure rate, utilization)
  - Track improvement trends

Quarterly:
  - Value stream mapping exercise
  - Major waste elimination initiatives

The seven wastes in platform engineering:

| Waste | Platform Symptoms | Elimination |
| --- | --- | --- |
| Transport | Artifacts touching too many systems | Simplify the pipeline |
| Inventory | Over-provisioned resources, excess WIP | Right-size, limit WIP |
| Motion | Context switching, hunting | Single pane of glass |
| Waiting | Queue time, approval delays | Parallelize, self-service |
| Overproduction | Unused features | Build only what’s needed |
| Overprocessing | Excessive process | Risk-based controls |
| Defects | Failures, rollbacks | Build quality in |

The lean mindset:

Ask constantly:
  - Is this step adding value?
  - Who is waiting? Why?
  - What did we build that nobody uses?
  - What failed? How do we prevent it?

Toyota didn’t become Toyota in a day. They improved relentlessly for decades.

Your platform can too. One waste at a time.

Start today: Pick one workflow. Map it. Find the muda. Eliminate it.

Then do it again tomorrow.