Architecture Decision Records: Template and Best Practices
Implement Architecture Decision Records to document why you made important technical decisions. ADR template, examples, and workflow for software teams.
Developer Productivity
Write effective runbooks that guide engineers through incidents and operations. Template, best practices, and maintenance workflow for operational documentation.
Your system breaks at 2am.
On-call engineer gets paged.
Wakes up.
Confused.
Logs in.
"What do I do?"
They scramble.
Without documentation:
With a runbook:
A runbook is the ops documentation that saves you at 2am.
This guide covers writing runbooks that work.
A runbook is a procedure document.
It tells someone exactly what to do in a specific situation.
Not:
Yes:
When used:
When stress and panic hit, people forget things.
They can't think clearly.
A runbook removes the need to think.
Just follow the steps.
With runbook: 15 minutes
Without runbook: 2+ hours
Savings: 105 minutes per incident
Under pressure, engineers make mistakes.
Runbook provides guardrails.
"Did you check X?"
Yes → next step.
Mistakes prevented.
Without runbook: Only the expert can handle the incident.
With runbook: Anyone can follow it.
The team isn't bottlenecked on one expert.
On-call engineer: "I know what to do."
They follow steps.
They recover the system.
Confidence + competence.
Use this template for every critical procedure:
# [Procedure Title]
## When to Use This Runbook
[What event triggers this? What symptoms?]
## Prerequisites
[What access/permissions needed?]
[What tools should be available?]
## Severity
[Critical / High / Medium / Low]
## Estimated Time
[How long does this typically take?]
## Quick Overview
[30-second summary of what you'll do]
## Diagnosis Steps
### Check 1: [What to verify?]
Command: [Exact command]
Expected result: [What should you see if working?]
If not: Go to Troubleshooting
### Check 2: [What to verify?]
Command: [Exact command]
Expected result: [What should you see?]
## Solution Steps
### Step 1: [Action]
Command: [Exact command]
Expected result: [What you should see]
### Step 2: [Action]
Command: [Exact command]
Expected result: [What you should see]
## Verification
After fix, verify:
- [ ] [Verification check 1]
- [ ] [Verification check 2]
- [ ] [Verification check 3]
## If Something Goes Wrong
### Issue: [Possible problem]
Solution: [What to do]
### Issue: [Possible problem]
Solution: [What to do]
## Rollback
If fix doesn't work:
Step 1: [Rollback action]
Step 2: [Verify rollback]
## Escalation
If this doesn't work:
1. Page: [Person/team]
2. Slack: [Channel]
3. Last resort: [Executive]
## Reference
- Related runbook: [Link]
- Architecture doc: [Link]
- Monitoring: [URL to dashboard]
- Previous incidents: [Links]
## Last Updated
[Date] by [Name]
## Review Date
[Next quarterly review]
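A template only helps if runbooks actually follow it. As a minimal sketch, assuming each runbook is a markdown file that uses the `##` headings above verbatim, a small shell function can flag missing sections. The function name `check_runbook_sections` is hypothetical, not part of any tool.

```shell
#!/bin/sh
# Sketch: report which template sections a runbook file is missing.
# Assumes headings are written exactly as "## <Section name>".

# Prints each missing section; returns non-zero if any are missing.
check_runbook_sections() {
  crs_file="$1"
  crs_missing=0
  while IFS= read -r crs_section; do
    # -x matches the whole line, -F treats the heading as a literal string
    if ! grep -qxF "## ${crs_section}" "$crs_file"; then
      echo "missing: ${crs_section}"
      crs_missing=1
    fi
  done <<'EOF'
When to Use This Runbook
Prerequisites
Severity
Estimated Time
Quick Overview
Diagnosis Steps
Solution Steps
Verification
If Something Goes Wrong
Rollback
Escalation
Reference
Last Updated
Review Date
EOF
  return "$crs_missing"
}

# Example (hypothetical path): check_runbook_sections runbooks/api-cpu.md
```

Running a check like this in CI or a pre-commit hook is one way to keep new runbooks from skipping Rollback or Escalation.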
# API Service CPU Exhaustion Recovery
## When to Use This Runbook
- Alert: "API Service CPU > 90%"
- Symptom: API requests timing out
- Effect: Customer-facing feature broken
## Prerequisites
- SSH access to prod
- AWS credentials configured
- kubectl installed locally
- Monitoring dashboard access
## Severity
Critical (customer-facing feature down)
## Estimated Time
15–30 minutes
## Quick Overview
1. Confirm issue (CPU high)
2. Check for runaway processes
3. Identify cause (query? leak? bot?)
4. Kill process or restart service
5. Verify recovery
## Diagnosis Steps
### Check 1: Verify CPU high
ssh -t prod-api-1 top
Expected: One process using lots of CPU
### Check 2: Identify process
ps aux | grep -E "node|python|java"
Expected: See which process is the culprit
### Check 3: Check query logs (if database-heavy)
tail -f /var/log/api.log | grep "SLOW QUERY"
Expected: See slow queries if issue is database
## Solution Steps
### Step 1: Identify root cause
Is it:
- Runaway query? (database)
- Memory leak? (service restart needed)
- Bot attack? (rate limiter)
- Invalid code? (rollback)
### Step 2a: If it's a slow query
mysql -u admin -p"$MYSQL_PASS" -e "SHOW FULL PROCESSLIST;"
# Note the Id of the long-running query in the output, then kill it:
mysql -u admin -p"$MYSQL_PASS" -e "KILL <query_id>;"
### Step 2b: If it's a memory leak
kubectl rollout restart deployment/api
kubectl get pods -l app=api
### Step 2c: If it's a bot attack
kubectl set env deployment/api RATE_LIMIT_ENABLED=true
kubectl rollout status deployment/api
## Verification
After fix, verify:
- [ ] CPU drops below 80%: `top` shows CPU < 80%
- [ ] API responds: `curl https://api.prod/health` returns 200
- [ ] No new customer reports: check the Slack feedback channel
## If Something Goes Wrong
### Issue: CPU still high after restart
Solution: Don't restart again. Page senior engineer immediately.
### Issue: Can't SSH to server
Solution: Try backup server: `ssh prod-api-2`
### Issue: kubectl unavailable
Solution: SSH to server directly and use docker commands
## Rollback
If fix makes things worse:
1. Confirm the previous version was healthy
2. `kubectl rollout undo deployment/api`
3. Verify: `curl https://api.prod/health`
## Escalation
If this doesn't work:
1. Page: #oncall-api channel
2. Slack: @api-team
3. Last resort: Call VP Engineering
## Reference
- Architecture: [Company wiki: API Service]
- Previous incidents: [Incident 2025-01-15]
- Monitoring: [Grafana dashboard: API CPU]
- Database slow logs: [/var/log/mysql-slow.log]
## Last Updated
2025-02-10 by Alice Chen
## Review Date
2025-05-10 (quarterly)
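The first two Verification checks in the example above can be scripted so the on-call engineer gets a clear pass or fail. This is a sketch only: the function names are hypothetical, the 80% threshold and health URL come from the example runbook, and the CPU figure is assumed to be read off `top` and passed in by hand.

```shell
#!/bin/sh
# Sketch: scripted version of the CPU and health checks.

# Returns 0 when the CPU percentage is strictly below the threshold.
cpu_below_threshold() {
  awk -v cpu="$1" -v max="$2" 'BEGIN { exit !(cpu + 0 < max + 0) }'
}

# Usage: verify_recovery <cpu-percent> <health-url>
verify_recovery() {
  vr_cpu="$1"
  vr_url="$2"
  if ! cpu_below_threshold "$vr_cpu" 80; then
    echo "FAIL: CPU at ${vr_cpu}%"
    return 1
  fi
  # -w '%{http_code}' prints only the HTTP status; the body is discarded
  vr_status=$(curl -s -o /dev/null -w '%{http_code}' "$vr_url")
  if [ "$vr_status" != "200" ]; then
    echo "FAIL: health check returned ${vr_status}"
    return 1
  fi
  echo "OK: CPU ${vr_cpu}%, health check 200"
}

# Example: verify_recovery 42.5 https://api.prod/health
```

The third check, customer feedback, stays manual; a script cannot judge it.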
WRONG: "Query the database and find slow queries"
RIGHT: "Run: mysql -u admin -p$MYSQL_PASS -e 'SHOW FULL PROCESSLIST;'"
The on-call engineer might not know the exact MySQL syntax.
Be specific. Provide exact commands.
WRONG: "Fix the database issue"
RIGHT: "Prerequisites: SSH access to prod-db-1, MySQL CLI installed, credentials in ~/.my.cnf"
Don't assume the reader already has access.
List prerequisites up front so they can verify them first.
WRONG: "Restart the service"
RIGHT: "Is it a database query or service memory leak?
- Database: Kill the query
- Memory: Restart service
- Network: Check firewall"
Different symptoms need different fixes.
Provide decision tree.
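A decision tree like the one above can also be written down as a small lookup, which forces each branch to name one concrete action. A minimal sketch, assuming the three causes from the example; the function name `next_action` and the fallback branch for unknown causes are assumptions.

```shell
#!/bin/sh
# Sketch: map a diagnosed cause to the action the runbook prescribes.

next_action() {
  case "$1" in
    database) echo "Kill the query" ;;
    memory)   echo "Restart service" ;;
    network)  echo "Check firewall" ;;
    *)
      # Assumption: anything outside the tree goes to escalation.
      echo "Unknown cause: escalate"
      return 1
      ;;
  esac
}

# Example: next_action memory
```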
WRONG: [Procedure ends without checking if it worked]
RIGHT: [After each step, verify result. "Expected to see: X"]
How does the on-call engineer know whether the fix worked?
Include verification steps.
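One way to build "Expected to see: X" into every step is a helper that runs a command and compares its output with the expected text. This is a hedged sketch; `run_step` is a hypothetical name, and matching on a substring is a simplifying assumption.

```shell
#!/bin/sh
# Sketch: run one runbook step and check its output for expected text.
# Usage: run_step "description" "expected substring" command [args...]

run_step() {
  rs_desc="$1"
  rs_expected="$2"
  shift 2
  # Capture stdout and stderr; the exit status is ignored on purpose
  # because only the output is compared here.
  rs_output=$("$@" 2>&1) || true
  case "$rs_output" in
    *"$rs_expected"*)
      echo "PASS: ${rs_desc}"
      ;;
    *)
      echo "FAIL: ${rs_desc} (expected to see: ${rs_expected})"
      return 1
      ;;
  esac
}

# Example: run_step "pods listed" "Running" kubectl get pods -l app=api
```

A failed step stops being silent: the engineer sees which expectation was not met before moving on.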
Runbook written 6 months ago.
System changed.
Commands don't work.
Fix: Set quarterly review date. Update before then.
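A review date is easy to set and easy to forget. As a sketch, assuming runbooks are markdown files whose `## Review Date` heading is followed by an ISO date (YYYY-MM-DD) as in the example runbook, a script can list the overdue ones. The function name `review_overdue` is hypothetical.

```shell
#!/bin/sh
# Sketch: flag a runbook whose review date has passed.
# Usage: review_overdue <file> <today as YYYY-MM-DD>
# Returns 0 if overdue, 1 if still current, 2 if no date was found.

review_overdue() {
  ro_file="$1"
  ro_today="$2"
  # The date is the first word on the line after the heading.
  ro_date=$(awk '/^## Review Date$/ { getline; print $1; exit }' "$ro_file")
  if [ -z "$ro_date" ]; then
    echo "${ro_file}: no review date"
    return 2
  fi
  # ISO dates sort correctly as strings; appending "" forces string compare.
  if awk -v d="$ro_date" -v t="$ro_today" 'BEGIN { exit !((d "") < (t "")) }'; then
    echo "${ro_file}: review overdue since ${ro_date}"
    return 0
  fi
  return 1
}

# Example (hypothetical directory layout):
# for f in runbooks/*.md; do review_overdue "$f" "$(date +%Y-%m-%d)"; done
```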
What could break?
Next time incident happens:
Don't use runbook.
Go through procedure.
Solve the problem.
Document every step.
Use what you just did.
Exact commands.
Exact output.
Decision points.
Have someone else follow it.
Do they succeed?
Any confusing parts?
Update.
Next incident, have on-call follow runbook.
Does it work?
What's missing?
Update immediately.
Pros: Searchable, linked, organized
Cons: Need to remember to check wiki during incident
Pros: In same place as code
Cons: Not easily found if you're in production server
Pros: Accessible when no network
Cons: Gets outdated
/runbook [name]
Before runbooks: 2+ hours
After runbooks: 15–30 min
Savings: 90+ minutes per incident
Track: % of incidents where runbook was used
Track: % of runbooks that successfully resolve issue
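Both percentages can be computed from a simple incident log. The sketch below assumes a hypothetical CSV with a header row and the columns `incident_id,runbook_used,resolved_by_runbook`, each flag being `yes` or `no`. It reads the second metric as the share of runbook-assisted incidents that the runbook resolved, which is one possible interpretation.

```shell
#!/bin/sh
# Sketch: compute runbook usage and success rates from an incident log.
# Usage: runbook_metrics <csv-file>

runbook_metrics() {
  awk -F, '
    NR > 1 {                       # skip the header row
      total++
      if ($2 == "yes") {
        used++
        if ($3 == "yes") resolved++
      }
    }
    END {
      if (total == 0) { print "no incidents"; exit 1 }
      printf "runbook used: %d%%\n", 100 * used / total
      if (used > 0)
        printf "resolved when used: %d%%\n", 100 * resolved / used
    }
  ' "$1"
}

# Example (hypothetical file): runbook_metrics incidents.csv
```

Percentages are truncated to whole numbers, which is precise enough for a trend line.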
A runbook is the procedure that saves you at 2am.
Template elements:
How to start:
In 6 months, every critical procedure will have a runbook.
On-call incidents will be faster and less stressful.
For team wiki, see Team Wiki Setup Guide. For onboarding, check Onboarding Documentation System.
Write runbooks. Test before you need them. Recover fast.