Toyota revolutionized manufacturing by relentlessly eliminating waste—muda in Japanese.
Your platform has waste too. Developers waiting for CI. Unnecessary approval steps. Defects that cause rollbacks. Features nobody uses.
Lean thinking provides a framework to find and eliminate this waste. Let’s apply the seven wastes to platform engineering.
## The Seven Wastes
Toyota identified seven categories of waste. Each has a direct parallel in platform engineering.
| Waste | Manufacturing | Platform Engineering |
|---|---|---|
| Transport | Moving parts unnecessarily | Moving data between systems |
| Inventory | Excess stock | Over-provisioned resources |
| Motion | Unnecessary worker movement | Context switching, navigation |
| Waiting | Idle time | Queue time, approval delays |
| Overproduction | Making too much | Building unused features |
| Overprocessing | Unnecessary work | Excessive compliance, redundant checks |
| Defects | Rework, scrap | Incidents, rollbacks, debugging |
Let’s examine each in detail.
## Waste 1: Transport
Manufacturing: Moving parts between workstations adds time but no value.
Platform engineering: Moving data, artifacts, or requests between systems unnecessarily.
### Examples
Code artifact journey:
Git → Jenkins → Artifactory → Jenkins → Kubernetes → Registry → Kubernetes
Why does the artifact make six hops across five systems?
Could it go: Git → Build → Registry → Kubernetes?
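A quick way to quantify transport waste is to model a pipeline as the ordered list of systems an artifact touches, then count hops and distinct systems. This is a minimal sketch using the example above; the system names are illustrative:

```python
# Model a delivery pipeline as the ordered list of systems an artifact touches.
pipeline = ["Git", "Jenkins", "Artifactory", "Jenkins",
            "Kubernetes", "Registry", "Kubernetes"]

hops = len(pipeline) - 1       # each arrow is one transport step
systems = len(set(pipeline))   # distinct systems the artifact touches

print(f"{hops} hops across {systems} systems")  # → 6 hops across 5 systems

# The simplified pipeline from the text:
simplified = ["Git", "Build", "Registry", "Kubernetes"]
print(f"{len(simplified) - 1} hops across {len(set(simplified))} systems")
```

Drawing the pipeline as data like this makes the audit repeatable: re-run it after each simplification and watch the hop count fall.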
Log data transport:
App → Fluentd → Kafka → Logstash → Elasticsearch → Kibana
Each hop adds latency, complexity, and failure risk.
Request routing:
User → CDN → Load Balancer → API Gateway → Service Mesh → Service
Is every hop necessary?
### Finding Transport Waste
Exercise: Draw your deployment pipeline
For each arrow between systems:
- Why does this transition exist?
- What value does it add?
- Could we eliminate or combine steps?
### Eliminating Transport Waste
Before: Git → Jenkins → Artifactory → Spinnaker → Kubernetes
After: Git → GitHub Actions → Kubernetes (direct deploy)
Removed: two pipeline hops; three intermediate tools (Jenkins, Artifactory, Spinnaker) replaced by one
Result: Faster deploys, fewer failure modes
## Waste 2: Inventory
Manufacturing: Excess stock ties up capital and hides problems.
Platform engineering: Over-provisioned resources, unused capacity, accumulated queues.
### Examples
Resource inventory:
Reserved instances at 30% utilization
Kubernetes nodes with 20% pod density
10TB of "just in case" storage
Environments that nobody uses anymore
Work inventory (WIP):
50 open PRs waiting for review
20 tickets "in progress" for weeks
12 half-finished platform features
Data inventory:
Logs retained for 2 years (policy requires 90 days)
Backups of decommissioned systems
Metrics at 10-second granularity stored forever
### Finding Inventory Waste
Resource audit:
- What's the utilization of each resource type?
- What's the oldest unused environment?
- How much data is past retention policy?
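The retention question in the audit above is simple arithmetic. A sketch using the log-retention example (2 years kept, 90 days required); the daily ingest rate is a hypothetical figure:

```python
# Estimate how much retained log data exceeds the retention policy.
retained_days = 730     # logs kept for 2 years, as in the example above
policy_days = 90        # policy requires only 90 days
daily_volume_gb = 50    # hypothetical ingest rate

excess_days = retained_days - policy_days
excess_gb = excess_days * daily_volume_gb
waste_fraction = excess_days / retained_days

print(f"{excess_gb} GB past policy ({waste_fraction:.0%} of retained data)")
```

At these numbers, nearly 90% of the stored log data exists only because nobody enforced the policy.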
Work audit:
- How many items are in WIP?
- What's the average age of WIP?
- What's blocked and why?
### Eliminating Inventory Waste
Little's Law: Lead Time = WIP / Throughput
To reduce lead time, reduce WIP:
- PR limit: Max 3 open PRs per developer
- Feature limit: Max 2 features in progress per team
- Environment cleanup: Delete after 7 days of inactivity
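Little's Law makes the WIP limits above concrete: at constant throughput, lead time scales linearly with WIP. A minimal sketch with hypothetical team numbers:

```python
# Little's Law: lead time = WIP / throughput.
# Hypothetical team: 30 items in progress, finishing 10 items per week.
wip = 30
throughput_per_week = 10

lead_time_weeks = wip / throughput_per_week
print(f"Lead time: {lead_time_weeks:.1f} weeks")  # → Lead time: 3.0 weeks

# Halving WIP halves lead time at the same throughput.
assert (wip / 2) / throughput_per_week == lead_time_weeks / 2
```

This is why WIP limits work: they don't make anyone faster, they stop work from aging in queues.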
Resource right-sizing:
Before: 20 nodes at 20% utilization
After: 8 nodes at 50% utilization
Savings: 60% compute cost
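The right-sizing numbers above follow from holding the actual workload constant while raising target utilization. A sketch of the arithmetic:

```python
import math

# Right-sizing arithmetic from the example above.
nodes_before = 20
util_before = 0.20       # 20% average utilization
target_util = 0.50       # run hotter, keep headroom

work = nodes_before * util_before             # node-equivalents of real load
nodes_after = math.ceil(work / target_util)   # nodes needed at target utilization
savings = 1 - nodes_after / nodes_before

print(f"{nodes_after} nodes, {savings:.0%} compute saved")  # → 8 nodes, 60% compute saved
```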
## Waste 3: Motion
Manufacturing: Workers walking to get tools or materials.
Platform engineering: Developers navigating systems, context switching, hunting for information.
### Examples
Deployment motion:
1. Open Jenkins (find the right job)
2. Check the logs (scroll, scroll, scroll)
3. Open Kubernetes dashboard (find the namespace)
4. Check pod status (wait for it to load)
5. Open Datadog (create the right query)
6. Verify metrics (adjust time range)
Six steps across three different tools, with a context switch at every one.
Debugging motion:
"Where are the logs for this service?"
→ Slack the on-call
→ They say "check Kibana"
→ Can't find the right index
→ Ask again
→ "Oh, that service uses CloudWatch"
→ Find CloudWatch
→ Wrong region
Information hunting:
"What's the config for this service?"
→ Check Git (which repo?)
→ Check wiki (outdated)
→ Check Confluence (wrong version)
→ Ask in Slack (wait for response)
### Finding Motion Waste
Shadow a developer for a day:
- How many systems do they touch?
- How many times do they context switch?
- How much time finding vs doing?
- What questions do they ask repeatedly?
### Eliminating Motion Waste
Single pane of glass:
Before: three tools and six steps to check deployment status
After: 1 dashboard with deployment, logs, metrics, alerts
Reduced motion: five context switches eliminated
Self-service answers:
Before: Slack questions about config, access, status
After: Internal developer portal with search
Reduced motion: No more hunting, asking, waiting
## Waste 4: Waiting
Manufacturing: Workers or machines idle, waiting for inputs.
Platform engineering: Developers waiting for builds, tests, approvals, environments.
### Examples
CI/CD waiting:
Build queue time: 15 minutes
Build execution: 10 minutes
Test queue time: 10 minutes
Test execution: 20 minutes
Deploy approval wait: 4 hours
Deploy execution: 5 minutes
Total: 5 hours
Active work: 35 minutes
Waiting: 4 hours 25 minutes (nearly 90% of elapsed time is waste)
Environment waiting:
"I need a staging environment"
→ Submit request ticket
→ Wait for approval (1 day)
→ Wait for provisioning (1 day)
→ Environment ready (2 days later)
Human bottleneck waiting:
Code review: 2 days average
Security review: 1 week average
Architecture review: 2 weeks average
### Finding Waiting Waste
Measure queue times at each step:
Time in queue / Time being processed = Wait Ratio
Wait ratio > 1 = More waiting than working
Wait ratio > 5 = Severe waiting waste
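Applying the wait ratio to the CI/CD example above only takes a few lines. Each stage is a (queue minutes, processing minutes) pair:

```python
# Wait ratio = time in queue / time being processed.
# Stage timings (minutes) from the CI/CD example above.
stages = {
    "build queue": (15, 0),
    "build":       (0, 10),
    "test queue":  (10, 0),
    "test":        (0, 20),
    "approval":    (240, 0),   # the 4-hour approval wait dominates
    "deploy":      (0, 5),
}

waiting = sum(q for q, _ in stages.values())
working = sum(w for _, w in stages.values())
wait_ratio = waiting / working

print(f"Wait ratio: {wait_ratio:.1f}")  # > 5 means severe waiting waste
```

At roughly 7.6, this pipeline is deep in the "severe" range, and nearly all of it comes from one approval step.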
### Eliminating Waiting Waste
Build queue time:
Before: 15 minutes (shared build agents)
After: 0 minutes (auto-scaling build agents)
Test parallelization:
Before: 20 minutes (sequential)
After: 5 minutes (parallelized)
Self-service environments:
Before: 2 days (ticket + approval + provision)
After: 10 minutes (automated provisioning)
Async approvals:
Before: Block on human approval
After: Deploy to staging immediately, require approval for prod
Reduced wait without reducing safety.
## Waste 5: Overproduction
Manufacturing: Making more than customers need.
Platform engineering: Building features that aren’t used, over-engineering solutions.
### Examples
Platform feature graveyard:
- Custom deployment strategies (nobody uses)
- Advanced caching layer (one team tried once)
- Multi-region support (never activated)
- Plugin system (no plugins built)
Premature optimization:
"We built this to handle 10x our current scale"
(Scale never came)
(Complexity remains)
Documentation overproduction:
100-page architecture doc (never read)
Detailed runbooks (outdated before finished)
Video tutorials (nobody watches)
### Finding Overproduction Waste
Feature usage audit:
For each platform capability:
- How many teams use it?
- How often is it used?
- If removed, who would notice?
### Eliminating Overproduction Waste
Build only what's needed:
Before: Design for hypothetical scale
After: Build for current needs + clear extension points
Kill unused features:
If usage = 0%: Remove it
If usage < 5%: Deprecate it
Sunsetting unused features reduces maintenance burden.
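The sunset thresholds can be encoded as a small policy function so the audit produces decisions, not just numbers. The feature names and usage percentages below are illustrative:

```python
def sunset_decision(usage_pct: float) -> str:
    """Apply the usage thresholds above: remove at 0%, deprecate under 5%."""
    if usage_pct == 0:
        return "remove"
    if usage_pct < 5:
        return "deprecate"
    return "keep"

# Hypothetical results from a feature usage audit.
audit = {"custom deploy strategies": 0, "caching layer": 2, "multi-region": 40}
decisions = {feature: sunset_decision(pct) for feature, pct in audit.items()}
print(decisions)
```

Note the zero check comes first: 0% is a subset of "< 5%", and removal is the stronger action.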
Minimal documentation:
Before: Comprehensive docs (never read, always stale)
After: README + runbook + examples (actually maintained)
## Waste 6: Overprocessing
Manufacturing: More precision or steps than needed.
Platform engineering: Excessive process, redundant checks, unnecessary rigor.
### Examples
Approval overhead:
Deploy to dev: Requires approval (why?)
Change config: Requires CAB ticket (really?)
Add team member: Requires 3 sign-offs (necessary?)
Compliance theater:
Security scan on every commit (same code, same result)
Vulnerability report nobody reads
Audit logs nobody audits
Checklist nobody checks
Process for process's sake:
"We need a design doc for this 10-line change"
"Let's schedule a review meeting" (for trivial changes)
"Fill out this template" (fields don't apply)
### Finding Overprocessing Waste
For each process step:
- What risk does this mitigate?
- Has that risk ever materialized?
- Is there a lighter-weight alternative?
- Who would notice if we skipped it?
### Eliminating Overprocessing Waste
Risk-based approvals:
Before: All changes require approval
After:
Low risk (dev, config): Auto-approve
Medium risk (staging): Peer approve
High risk (prod): Lead approve
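Risk-based approvals are easy to automate once the tiers are explicit. A minimal sketch; the environment names map to the tiers above, and "lead" is the safe default for anything unrecognized:

```python
def approval_required(environment: str) -> str:
    """Map deployment targets to the approval tiers above."""
    if environment == "prod":
        return "lead"        # high risk: lead approval
    if environment == "staging":
        return "peer"        # medium risk: peer approval
    if environment == "dev":
        return "auto-approve"  # low risk: no human in the loop
    return "lead"            # unknown target: fall back to the strictest tier

print(approval_required("dev"))   # auto-approve
print(approval_required("prod"))  # lead
```

Defaulting unknown environments to the strictest tier is the safety valve: new targets get scrutiny until someone classifies them.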
Smart scanning:
Before: Full security scan on every commit
After: Full scan on changed files only
Full scan nightly
Result: 80% less scan time with the same coverage
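One way to sketch the smart-scanning policy: scan only changed files on each commit, and fall back to a full scan on the nightly run. The file extensions are an illustrative filter, not a real scanner's rules:

```python
def scan_scope(changed_files: list[str], nightly: bool = False):
    """Scan only what changed per commit; run the full scan nightly."""
    if nightly:
        return "full"
    # Illustrative filter: only source files are worth scanning per commit.
    return [f for f in changed_files if f.endswith((".py", ".js", ".go"))]

print(scan_scope(["api/handler.py", "README.md"]))  # → ['api/handler.py']
print(scan_scope([], nightly=True))                 # → full
```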
## Waste 7: Defects
Manufacturing: Scrap, rework, quality failures.
Platform engineering: Incidents, rollbacks, debugging, incorrect configurations.
### Examples
Deployment defects:
Failed deployments: 15% of deploys
Rollbacks: 5% of deploys
Time to fix: 2 hours average
If 100 deploys/week:
15 failures × 2 hours = 30 hours of rework/week
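The rework calculation above generalizes to any deploy volume and failure rate:

```python
# Rework arithmetic from the example above.
deploys_per_week = 100
failure_rate = 0.15     # 15% of deploys fail
hours_to_fix = 2        # average time to fix a failed deploy

failures = deploys_per_week * failure_rate
rework_hours = failures * hours_to_fix
print(f"{rework_hours:.0f} hours of rework per week")  # → 30 hours of rework per week
```

Thirty hours is most of a full-time engineer spent every week redoing work that should have succeeded the first time.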
Configuration defects:
"Why is this environment broken?"
→ Config drift from production
→ 4 hours debugging
→ Manual fix
→ (Will happen again)
Platform defects:
CI randomly fails (flaky tests)
Environments randomly break (resource limits)
Deploys randomly timeout (network issues)
"Random" = Unresolved defects
### Finding Defect Waste
Track defect metrics:
- Deployment failure rate
- Mean time to recovery
- Rollback frequency
- Repeat incidents (same cause)
- Escaped defects (caught in production)
### Eliminating Defect Waste
Build quality in:
Before: Test in production, fix defects
After: Fail fast in CI, prevent defects
Poka-yoke (mistake-proofing):
Before: Config as free-form YAML (easy to break)
After: Config as typed schema (invalid = won't deploy)
Result: 80% fewer config-related incidents
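A typed-schema poka-yoke can be sketched with a plain dataclass that validates itself on construction: an invalid config fails loudly before anything deploys. Field names and limits here are illustrative:

```python
from dataclasses import dataclass

@dataclass
class DeployConfig:
    """Typed deploy config: invalid values fail before anything ships."""
    service: str
    replicas: int
    memory_mb: int

    def __post_init__(self):
        # Mistake-proofing: reject impossible configs at construction time.
        if not self.service:
            raise ValueError("service name is required")
        if not 1 <= self.replicas <= 50:
            raise ValueError("replicas must be between 1 and 50")
        if self.memory_mb < 64:
            raise ValueError("memory_mb must be at least 64")

cfg = DeployConfig(service="checkout", replicas=3, memory_mb=512)   # valid
# DeployConfig(service="checkout", replicas=0, memory_mb=512)  → ValueError
```

In practice the same idea shows up as JSON Schema validation, typed CRDs, or admission controllers; the principle is identical: make the invalid state unrepresentable.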
Root cause elimination:
Before: Fix the incident, move on
After: Fix the incident, fix the system that allowed it
Incident → Post-mortem → Prevention → No repeat
## The Lean Platform Audit
Walk through your platform with the waste lens:
For each developer workflow:
1. Map the value stream (every step from code to production)
2. Categorize each step:
- Value-add: Customer would pay for this
- Non-value but necessary: Required (compliance, safety)
- Waste: Neither valuable nor necessary
3. For each waste, identify the type:
- Transport: Unnecessary movement of artifacts/data
- Inventory: Excess resources, WIP, data
- Motion: Developer context switching, hunting
- Waiting: Queue time, approval delays
- Overproduction: Unused features, premature optimization
- Overprocessing: Unnecessary rigor, redundant checks
- Defects: Failures, rework, debugging
4. Prioritize by impact:
- How much time/money does this waste?
- How hard is it to eliminate?
- Focus on high-impact, low-effort first
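The prioritization step above can be made mechanical: score each waste item for impact and effort, then rank by impact per unit of effort. The items and scores below are illustrative:

```python
# Rank waste items by impact per unit of effort (both on a simple 1-5 scale).
wastes = [
    {"name": "4h deploy approval wait", "impact": 5, "effort": 2},
    {"name": "flaky CI tests",          "impact": 4, "effort": 4},
    {"name": "unused caching layer",    "impact": 2, "effort": 1},
]

ranked = sorted(wastes, key=lambda w: w["impact"] / w["effort"], reverse=True)
for w in ranked:
    print(f"{w['name']}: score {w['impact'] / w['effort']:.1f}")
```

A crude score, but it makes the "high-impact, low-effort first" rule explicit and arguable in a team discussion.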
## Continuous Improvement (Kaizen)
Lean isn’t a one-time audit. It’s a culture.
Daily:
- Notice waste when you see it
- Quick fixes for small waste
Weekly:
- Team discusses biggest waste encountered
- Prioritize improvement items
Monthly:
- Measure waste metrics (lead time, failure rate, utilization)
- Track improvement trends
Quarterly:
- Value stream mapping exercise
- Major waste elimination initiatives
## Summary
The seven wastes in platform engineering:
| Waste | Platform Symptoms | Elimination |
|---|---|---|
| Transport | Artifacts touching too many systems | Simplify pipeline |
| Inventory | Over-provisioned, excess WIP | Right-size, limit WIP |
| Motion | Context switching, hunting | Single pane of glass |
| Waiting | Queue time, approval delays | Parallelize, self-service |
| Overproduction | Unused features | Build only what’s needed |
| Overprocessing | Excessive process | Risk-based controls |
| Defects | Failures, rollbacks | Build quality in |
The lean mindset:
Ask constantly:
- Is this step adding value?
- Who is waiting? Why?
- What did we build that nobody uses?
- What failed? How do we prevent it?
Toyota didn’t become Toyota in a day. They improved relentlessly for decades.
Your platform can too. One waste at a time.
Start today: Pick one workflow. Map it. Find the muda. Eliminate it.
Then do it again tomorrow.