Runbook Documentation: Write Ops Guides That Work at 2am

Write effective runbooks that guide engineers through incidents and operations. Template, best practices, and maintenance workflow for operational documentation.

April 16, 2026 · 4 min read
Tags: operations, documentation, incident-response, runbooks

Your system breaks at 2am.

On-call engineer gets paged.

Wakes up.

Confused.

Logs in.

"What do I do?"

They scramble.

Without documentation:

  • They don't know what to check
  • They try random fixes
  • 2 hours to recover
  • System down for paying customers

With a runbook:

  • They know exactly what to do
  • Follow steps
  • 15 minutes to recover
  • Minimal customer impact

A runbook is the ops documentation that saves you at 2am.

This guide covers writing runbooks that work.


What a Runbook Is

A runbook is a procedure document.

It tells someone exactly what to do in a specific situation.

Not:

  • General explanation
  • System overview
  • Theory

Yes:

  • Step-by-step procedure
  • Specific commands
  • Decision trees
  • Rollback procedures

When used:

  • During production incidents
  • For recurring operational tasks
  • For critical systems
  • For high-stakes decisions

Why Runbooks Matter

Reason 1: Pressure Breaks Memory

When stress/panic hits, people forget things.

They can't think clearly.

Runbook removes need to think.

Just follow steps.

Reason 2: Reduces Recovery Time

With runbook: 15 minutes

Without runbook: 2+ hours

Savings: 105+ minutes per incident

Reason 3: Reduces Mistakes

Under pressure, engineers make mistakes.

Runbook provides guardrails.

"Did you check X?"

Yes → next step.

Mistakes prevented.

Reason 4: Distributes Knowledge

Without runbook: Only expert can handle incident.

With runbook: Anyone can follow it.

Team isn't bottlenecked on one expert.

Reason 5: Builds Confidence

On-call engineer: "I know what to do."

They follow steps.

They recover the system.

Confidence + competence.


Runbook Template

Use this template for every critical procedure:

# [Procedure Title]

## When to Use This Runbook
[What event triggers this? What symptoms?]

## Prerequisites
[What access/permissions needed?]
[What tools should be available?]

## Severity
[Critical / High / Medium / Low]

## Estimated Time
[How long does this typically take?]

## Quick Overview
[30-second summary of what you'll do]

## Diagnosis Steps

### Check 1: [What to verify?]
Command: [Exact command]
Expected result: [What should you see if working?]
If not: Go to "If Something Goes Wrong"

### Check 2: [What to verify?]
Command: [Exact command]
Expected result: [What should you see?]

## Solution Steps

### Step 1: [Action]
Command: [Exact command]
Expected result: [What you should see]

### Step 2: [Action]
Command: [Exact command]
Expected result: [What you should see]

## Verification

After fix, verify:
- [ ] [Verification check 1]
- [ ] [Verification check 2]
- [ ] [Verification check 3]

## If Something Goes Wrong

### Issue: [Possible problem]
Solution: [What to do]

### Issue: [Possible problem]
Solution: [What to do]

## Rollback

If fix doesn't work:
Step 1: [Rollback action]
Step 2: [Verify rollback]

## Escalation

If this doesn't work:
1. Page: [Person/team]
2. Slack: [Channel]
3. Last resort: [Executive]

## Reference

- Related runbook: [Link]
- Architecture doc: [Link]
- Monitoring: [URL to dashboard]
- Previous incidents: [Links]

## Last Updated
[Date] by [Name]

## Review Date
[Next quarterly review]
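
A template only helps if every runbook actually fills it in. Here is a minimal sketch of a check for that, assuming runbooks are stored as Markdown and use the `## ` headings from the template above. The function name `missing_sections` is hypothetical, not part of any tool.

```python
import re

# Section headings taken from the template above.
REQUIRED_SECTIONS = [
    "When to Use This Runbook",
    "Prerequisites",
    "Severity",
    "Estimated Time",
    "Quick Overview",
    "Diagnosis Steps",
    "Solution Steps",
    "Verification",
    "If Something Goes Wrong",
    "Rollback",
    "Escalation",
    "Reference",
    "Last Updated",
    "Review Date",
]


def missing_sections(markdown_text: str) -> list[str]:
    """Return the template sections that have no matching '## ' heading."""
    found = {
        match.group(1).strip()
        for match in re.finditer(r"^##\s+(.+?)\s*$", markdown_text, flags=re.MULTILINE)
    }
    return [name for name in REQUIRED_SECTIONS if name not in found]
```

Run it over each runbook file during review and treat a non-empty result as "not ready for on-call".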

Concrete Example

# API Service CPU Exhaustion Recovery

## When to Use This Runbook
- Alert: "API Service CPU > 90%"
- Symptom: API requests timing out
- Effect: Customer-facing feature broken

## Prerequisites
- SSH access to prod
- AWS credentials configured
- kubectl installed locally
- Monitoring dashboard access

## Severity
Critical (customer-facing feature down)

## Estimated Time
15–30 minutes

## Quick Overview
1. Confirm issue (CPU high)
2. Check for runaway processes
3. Identify cause (query? leak? bot?)
4. Kill process or restart service
5. Verify recovery

## Diagnosis Steps

### Check 1: Verify CPU high

ssh prod-api-1 'top -b -n 1 | head -20'

Look for: CPU usage > 90%

Expected: One process using lots of CPU

### Check 2: Identify process

ssh prod-api-1 'ps aux --sort=-%cpu | head -10'

Note PID of the process at the top of the list

Expected: See which process is the culprit

### Check 3: Check query logs (if database-heavy)

ssh prod-api-1 'grep "SLOW QUERY" /var/log/api.log | tail -20'

Expected: See slow queries if issue is database

## Solution Steps

### Step 1: Identify root cause
Is it:
- Runaway query? (database)
- Memory leak? (service restart needed)
- Bot attack? (rate limiter)
- Bad deploy? (go to Rollback)

### Step 2a: If it's a slow query

List running queries and note the Id of the long-running one

mysql -u admin -p"$MYSQL_PASS" -e "SHOW FULL PROCESSLIST;"

Kill that query (replace <ID> with the Id from the list)

mysql -u admin -p"$MYSQL_PASS" -e "KILL QUERY <ID>;"


### Step 2b: If it's a memory leak

Restart service

kubectl rollout restart deployment/api

Verify

kubectl get pods -l app=api

Wait for new pods to be Running


### Step 2c: If it's a bot attack

Activate rate limiter (already deployed)

kubectl set env deployment/api RATE_LIMIT_ENABLED=true

Wait for the rollout to finish (set env already triggers a rolling restart)

kubectl rollout status deployment/api


## Verification

After fix, verify:
- [ ] CPU drops below 80%: `top` shows CPU < 80%
- [ ] API responds: `curl -s -o /dev/null -w "%{http_code}" https://api.prod/health` prints 200
- [ ] No new customer reports: Check Slack feedback channel

## If Something Goes Wrong

### Issue: CPU still high after restart
Solution: Don't restart again. Page senior engineer immediately.

### Issue: Can't SSH to server
Solution: Try backup server: `ssh prod-api-2`

### Issue: kubectl unavailable
Solution: SSH to server directly and use docker commands

## Rollback

If the fix makes things worse (and the previous version was fine):
1. `kubectl rollout undo deployment/api`
2. Verify: `curl https://api.prod/health`

## Escalation

If this doesn't work:
1. Page: @api-team
2. Slack: #oncall-api channel
3. Last resort: Call VP Engineering

## Reference

- Architecture: [Company wiki: API Service]
- Previous incidents: [Incident 2025-01-15]
- Monitoring: [Grafana dashboard: API CPU]
- Database slow logs: [/var/log/mysql-slow.log]

## Last Updated
2025-02-10 by Alice Chen

## Review Date
2025-05-10 (quarterly)

Common Runbook Mistakes

Mistake 1: Assumptions About User Knowledge

WRONG: "Query the database and find slow queries"

RIGHT: "Run: mysql -u admin -p$MYSQL_PASS -e 'SHOW FULL PROCESSLIST;'"

On-call engineer might not know exact MySQL syntax.

Be specific. Provide exact commands.

Mistake 2: Missing Prerequisites

WRONG: "Fix the database issue"

RIGHT: "Prerequisites: SSH access to prod-db-1, MySQL CLI installed, credentials in ~/.my.cnf"

A runbook without prerequisites assumes the reader already has access.

Verify prerequisites first.
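
Part of that check can be automated. A minimal sketch, assuming the prerequisites are command-line tools that should be on `PATH`. The helper name `missing_tools` and the tool list are illustrative, and the sketch does not check credentials or network access.

```python
import shutil


def missing_tools(tools: list[str]) -> list[str]:
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # Tool names are examples; list whatever the Prerequisites section names.
    missing = missing_tools(["ssh", "kubectl", "mysql"])
    print("Missing tools:", ", ".join(missing) if missing else "none")
```

Run it at the start of an on-call shift, not at 2am.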

Mistake 3: No Decision Tree

WRONG: "Restart the service"

RIGHT: "Is it a database query or service memory leak?
- Database: Kill the query
- Memory: Restart service
- Network: Check firewall"

Different symptoms need different fixes.

Provide decision tree.
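
A decision tree can be as simple as a lookup table. The sketch below encodes the three branches from the RIGHT example above. The dictionary keys and the `next_action` function are hypothetical names chosen for illustration, and the fallback to escalation is an assumption.

```python
# Causes and actions copied from the decision tree above; the keys are
# hypothetical labels chosen for this sketch.
DECISION_TREE = {
    "database": "Kill the query",
    "memory": "Restart service",
    "network": "Check firewall",
}


def next_action(cause: str) -> str:
    """Map a diagnosed cause to its fix, or escalate when the cause is unknown."""
    key = cause.strip().lower()
    return DECISION_TREE.get(key, "Escalate: cause not covered by this runbook")
```

The useful property is the fallback: every path ends in an action, even when the diagnosis fits none of the known branches.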

Mistake 4: No Verification

WRONG: [Procedure ends without checking if it worked]

RIGHT: [After each step, verify result. "Expected to see: X"]

How does on-call engineer know if they fixed it?

Include verification steps.
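
The "command plus expected result" pairing can be expressed directly. A minimal sketch, assuming a step passes when the command exits with status 0 and its output contains an expected string. `run_step` is a hypothetical helper, and the command in the demo is a stand-in so the sketch runs anywhere.

```python
import subprocess
import sys


def run_step(command: list[str], expected: str, timeout_s: int = 60) -> bool:
    """Run one runbook step and report whether `expected` appears in its output."""
    result = subprocess.run(command, capture_output=True, text=True, timeout=timeout_s)
    output = result.stdout + result.stderr
    return result.returncode == 0 and expected in output


if __name__ == "__main__":
    # Stand-in command; a real step might be
    # ["kubectl", "get", "pods", "-l", "app=api"] with expected="Running".
    ok = run_step([sys.executable, "-c", "print('Running')"], expected="Running")
    print("step verified" if ok else "step failed: go to 'If Something Goes Wrong'")
```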

Mistake 5: Never Updated

Runbook written 6 months ago.

System changed.

Commands don't work.

Fix: Set quarterly review date. Update before then.
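
Overdue reviews are easy to detect if the date is machine-readable. A minimal sketch, assuming each runbook has a `## Last Updated` heading followed by an ISO date (YYYY-MM-DD), as in the example runbook, and treating "quarterly" as 90 days. Both function names are hypothetical.

```python
import re
from datetime import date, timedelta
from typing import Optional

REVIEW_INTERVAL = timedelta(days=90)  # "quarterly", approximated as 90 days


def last_updated(markdown_text: str) -> Optional[date]:
    """Read the ISO date on the line after '## Last Updated', if present."""
    match = re.search(
        r"^## Last Updated\s*\n(\d{4}-\d{2}-\d{2})", markdown_text, flags=re.MULTILINE
    )
    return date.fromisoformat(match.group(1)) if match else None


def is_overdue(markdown_text: str, today: date) -> bool:
    """True when the runbook has no readable date or is past the review interval."""
    updated = last_updated(markdown_text)
    return updated is None or today - updated > REVIEW_INTERVAL
```

A runbook with no date at all counts as overdue, which is the safer default.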


How to Write Runbooks

Step 1: Identify Critical Procedures

What could break?

  • Database connection pool exhausted
  • API service down
  • Memory leak
  • Disk full
  • Security incident
  • Deployment failure

Step 2: Solve It Yourself (First Time)

The next time the incident happens, there is no runbook yet.

Work through the incident by hand.

Solve the problem.

Document every step.
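
Notes taken during an incident are more useful with timestamps. Here is a minimal sketch of one way to capture them, assuming a plain-text log file is acceptable. The `log_step` helper and its line format are hypothetical; a shared doc or chat thread works just as well.

```python
from datetime import datetime
from pathlib import Path


def log_step(log_path: Path, note: str, when: datetime) -> str:
    """Append one timestamped line to the incident log and return that line."""
    line = f"{when.isoformat(timespec='seconds')} | {note}"
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(line + "\n")
    return line
```

Log the exact command and what it printed. Those lines become the "Command" and "Expected result" fields in the template.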

Step 3: Write From Experience

Use what you just did.

Exact commands.

Exact output.

Decision points.

Step 4: Test the Runbook

Have someone else follow it.

Do they succeed?

Any confusing parts?

Update.

Step 5: Use During Incident

Next incident, have on-call follow runbook.

Does it work?

What's missing?

Update immediately.


Runbook Maintenance

Weekly

  • Any incidents? Update relevant runbooks
  • New issues found? Create runbook

Monthly

  • Review runbooks used in incidents
  • Test one runbook (actually follow it)
  • Update commands if anything changed

Quarterly

  • Full review of all runbooks
  • Are they still accurate?
  • Do they need updating?

Where to Store Runbooks

Option 1: Wiki (GitHub, Confluence)

Pros: Searchable, linked, organized

Cons: Need to remember to check wiki during incident

Option 2: README in Repo

Pros: In same place as code

Cons: Hard to find when you're logged in to a production server

Option 3: Printed

Pros: Accessible when no network

Cons: Gets outdated

Option 4: Multiple Places

  • Main: Wiki
  • Quick reference: Slack channel /runbook [name]
  • Emergency: Printed copy at desk

Realistic Implementation

Week 1: Identify Top 5

  • Database connection pool exhausted
  • API service down
  • Disk full
  • Memory leak
  • Deployment failed

Week 2: Write Runbooks

  • Write one runbook for each
  • Test with team

Week 3: Use in Training

  • New on-call person uses runbooks
  • Gives feedback
  • Improve runbooks

Week 4+: Maintain

  • Update after each incident
  • Monthly review
  • Quarterly full review

Metrics That Matter

Metric 1: Time to Recovery

Before runbooks: 2+ hours

After runbooks: 15–30 min

Savings: 90+ minutes per incident

Metric 2: Runbook Usage

Track: % of incidents where runbook was used

  • Low: <30% (runbooks not trusted)
  • Good: 50%+
  • Excellent: 80%+ (runbooks are standard)

Metric 3: Runbook Accuracy

Track: % of incidents where following the runbook resolved the issue

  • Low: <70% (runbooks outdated)
  • Good: 85%+
  • Excellent: 95%+ (well-maintained)
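
All three metrics fall out of a simple incident log. A minimal sketch, assuming you record recovery time, whether a runbook was used, and whether it resolved the issue. The `Incident` fields and `runbook_metrics` function are hypothetical, and accuracy is computed here over incidents where a runbook was used.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    recovery_minutes: float
    runbook_used: bool
    runbook_resolved: bool  # only meaningful when runbook_used is True


def runbook_metrics(incidents: list[Incident]) -> dict[str, float]:
    """Compute mean recovery time, runbook usage %, and runbook accuracy %."""
    if not incidents:
        raise ValueError("need at least one incident")
    used = [i for i in incidents if i.runbook_used]
    resolved = [i for i in used if i.runbook_resolved]
    return {
        "mean_recovery_minutes": sum(i.recovery_minutes for i in incidents) / len(incidents),
        "usage_pct": 100 * len(used) / len(incidents),
        "accuracy_pct": 100 * len(resolved) / len(used) if used else 0.0,
    }
```

Compare the results against the thresholds above, and recompute after each quarterly review.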

Conclusion

A runbook is the procedure that saves you at 2am.

Template elements:

  1. When to use (trigger)
  2. Prerequisites (requirements)
  3. Diagnosis (how to confirm issue)
  4. Solution (exact steps)
  5. Verification (did it work?)
  6. Troubleshooting (if it didn't)
  7. Rollback (how to undo)
  8. Escalation (who to call)

How to start:

  1. Identify 5 critical procedures
  2. Next incident, document every step
  3. Write runbook from that documentation
  4. Test with another engineer
  5. Use in production

Keep this up and, within about 6 months, every critical procedure can have a runbook.

On-call incidents will be faster and less stressful.

For team wiki, see Team Wiki Setup Guide. For onboarding, check Onboarding Documentation System.

Write runbooks. Test before you need them. Recover fast.
