Incident Runbook Templates

Name: Incident Runbook Templates
Author: wshobson

wshobson/agents

8.4k installs
38.3k repo stars
Updated July 22, 2026

incident-runbook-templates is an agent skill that creates structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions. Use this skill when building a service outage runbook.

About

Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions. Use this skill when building a service outage runbook for a payment processing system; creating database incident procedures covering connection pool exhaustion, replication lag, and disk space alerts; onboarding new on-call engineers who need step-by-step recovery guides written for a 3 AM brain; or standardizing escalation matrices across multiple engineering teams.

Incident Runbook Templates by the numbers

8,422 all-time installs (skills.sh)
+168 installs in the week ending Jul 28, 2026 (Skillselion tracking)
Ranked #183 of 2,184 Testing & QA skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

At a glance

incident-runbook-templates capabilities & compatibility

Capabilities: incident runbook templates · creating incident response procedures · building service specific runbooks · establishing escalation paths · documenting recovery procedures
Use cases: documentation

npx skills add https://github.com/wshobson/agents --skill incident-runbook-templates

Installs	8.4k
repo stars	★ 38.3k
Security audit	2 / 3 scanners passed
Last updated	July 22, 2026
Repository	wshobson/agents ↗

Who is it for?

Developers and on-call engineers who need the incident runbook patterns documented in this skill's SKILL.md.

Skip if: the skill's docs are empty or the task falls outside the skill's documented scope.

What you get

Actionable workflows and conventions from SKILL.md for incident-runbook-templates.

incident runbook markdown
alert definitions
escalation checklists

Files

SKILL.md · Markdown · GitHub ↗

Incident Runbook Templates

Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.

When to Use This Skill

Creating incident response procedures
Building service-specific runbooks
Establishing escalation paths
Documenting recovery procedures
Responding to active incidents
Onboarding on-call engineers

Core Concepts

1. Incident Severity Levels

Severity	Impact	Response Time	Example
SEV1	Complete outage, data loss	15 min	Production down
SEV2	Major degradation	30 min	Critical feature broken
SEV3	Minor impact	2 hours	Non-critical bug
SEV4	Minimal impact	Next business day	Cosmetic issue
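As a small illustration, the severity table can be encoded in a lookup helper so that scripts and responders read the same targets. This is only a sketch: the function name `sev_response_time` is hypothetical and not part of the skill, and the values are copied from the table above.

```shell
# Map a severity label to its response-time target from the severity table.
# sev_response_time is a hypothetical helper name used for this sketch.
sev_response_time() {
  case "$1" in
    SEV1) echo "15 min" ;;
    SEV2) echo "30 min" ;;
    SEV3) echo "2 hours" ;;
    SEV4) echo "Next business day" ;;
    *)    echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# Example: look up the target for a SEV2 incident.
sev_response_time SEV2
```

Returning a non-zero status for an unknown label lets a calling script stop instead of silently paging with the wrong deadline.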

2. Runbook Structure

1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix
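The nine sections above can be turned into a starting file with a few lines of shell. The sketch below assumes the runbook is written in Markdown; `runbook_skeleton` and its service-name argument are hypothetical names chosen for illustration.

```shell
# Print a Markdown skeleton containing the nine runbook sections, in order.
# runbook_skeleton is a hypothetical helper name used for this sketch.
runbook_skeleton() {
  service="$1"
  printf '# %s Outage Runbook\n\n' "$service"
  n=1
  for section in \
    "Overview & Impact" \
    "Detection & Alerts" \
    "Initial Triage" \
    "Mitigation Steps" \
    "Root Cause Investigation" \
    "Resolution Procedures" \
    "Verification & Rollback" \
    "Communication Templates" \
    "Escalation Matrix"
  do
    printf '## %d. %s\n\n' "$n" "$section"
    n=$((n + 1))
  done
}

# Example: write the skeleton to stdout (redirect to a file to save it).
runbook_skeleton "Payment Processing Service"
```

Keeping the section numbers in the headings means cross-references such as "Section 4.1" stay stable as the runbook is filled in.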

Detailed patterns and worked examples

Detailed pattern documentation lives in references/details.md. Read that file when the navigation tier above is insufficient.

Best Practices

Do's

Keep runbooks updated - Review after every incident
Test runbooks regularly - Game days, chaos engineering
Include rollback steps - Always have an escape hatch
Document assumptions - What must be true for steps to work
Link to dashboards - Quick access during stress

Don'ts

Don't assume knowledge - Write for 3 AM brain
Don't skip verification - Confirm each step worked
Don't forget communication - Keep stakeholders informed
Don't work alone - Escalate early
Don't skip postmortems - Learn from every incident

Troubleshooting

Runbook steps work in staging but fail during a real incident

Steps often assume preconditions that are true in a healthy environment but not during an outage. For each command in your runbook, add a prerequisite check and a "what to do if this command fails" note:

# Step: Check pod status
kubectl get pods -n payments

# Prerequisites: kubectl configured, kubeconfig points to correct cluster
# If this fails: run `aws eks update-kubeconfig --name prod-cluster --region us-east-1`
# Expected output: pods in Running state
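One way to make the prerequisite check and the failure note hard to skip is to wrap each command in a small helper. The following is a minimal sketch, assuming a POSIX shell; `run_step` is a hypothetical wrapper and not something the skill provides.

```shell
# run_step DESCRIPTION HINT COMMAND [ARGS...]
# Checks that COMMAND is installed, runs it, and prints HINT if it fails.
# run_step is a hypothetical wrapper, not something the skill ships.
run_step() {
  description="$1"
  hint="$2"
  shift 2
  # Prerequisite: the tool itself must be on PATH.
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "MISSING: '$1' is not installed ($description)" >&2
    return 127
  fi
  if "$@"; then
    status=0
    echo "OK: $description"
  else
    status=$?
    echo "FAILED: $description (exit $status)" >&2
    echo "If this fails: $hint" >&2
  fi
  return "$status"
}

# Usage with the step above (commented out so the sketch runs without a cluster):
# run_step "Check pod status" \
#   "run aws eks update-kubeconfig --name prod-cluster --region us-east-1" \
#   kubectl get pods -n payments
```

The hint is printed only on failure, so a healthy run stays short while a failing one shows the documented fallback right next to the error.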

On-call engineer panics and skips steps out of order

Add a numbered checklist at the top of the runbook that mirrors the section numbers, so responders can track progress under stress without reading the full document:

## Quick Checklist
- [ ] 1. Declare incident severity and open war room
- [ ] 2. Check service health (Section 4.1)
- [ ] 3. Check recent deployments (Section 4.1)
- [ ] 4. Roll back if deploy is suspect (Section 4.1)
- [ ] 5. Post initial notification to #payments-incidents
- [ ] 6. Escalate if > 15 min unresolved
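If the checklist is kept in a Markdown file and responders tick items as `- [x]`, the open items can be listed with standard tools. This is a sketch under that assumption; `remaining_steps` is a hypothetical helper name.

```shell
# Print checklist items that are still unchecked ("- [ ] ...") in a runbook file.
# remaining_steps is a hypothetical helper name used for this sketch.
remaining_steps() {
  grep '^- \[ \] ' "$1" | sed 's/^- \[ \] //'
}

# Usage (the path is an example):
# remaining_steps runbooks/payments-outage.md
```

Posting this output in the incident channel gives everyone the same view of what is left without reading the full document.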

Runbook is outdated — commands reference old cluster names or endpoints

Runbooks rot because they're updated manually. Include a "Last Verified" date and owner at the top, and add a CI check that validates all curl endpoints and kubectl context names are still valid:

## Runbook Metadata
| Field | Value |
|---|---|
| Last verified | 2024-11-15 |
| Owner | @platform-team |
| Review cadence | After every SEV1/SEV2 |
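A CI check for the endpoint half of this could look like the sketch below. It assumes the runbook is a text file and that the CI runner can reach the listed hosts; `runbook_urls` and `check_runbook_urls` are hypothetical names. Context names could be compared in a similar way against the output of `kubectl config get-contexts -o name`.

```shell
# List the unique http(s) URLs mentioned in a runbook file.
# runbook_urls and check_runbook_urls are hypothetical names for this sketch.
runbook_urls() {
  grep -oE 'https?://[A-Za-z0-9._~:/?#@!$&*+,;=%-]+' "$1" | sort -u
}

# Probe every URL; return non-zero if any of them cannot be fetched.
check_runbook_urls() {
  failed=0
  while IFS= read -r url; do
    [ -n "$url" ] || continue
    if ! curl -fsS -o /dev/null --max-time 5 "$url"; then
      echo "STALE: $url" >&2
      failed=1
    fi
  done <<EOF
$(runbook_urls "$1")
EOF
  return "$failed"
}

# Usage in CI (the path is an example):
# check_runbook_urls runbooks/payments-outage.md || exit 1
```

The pattern is deliberately simple and may capture trailing punctuation from prose, so URLs are best kept on their own line or inside code spans.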

Stakeholder communication is delayed while engineers are heads-down

Assign a dedicated incident communicator role (separate from the incident commander) whose only job is to post status updates. Add a standing agenda in the communication template:

Update every 15 minutes (even if no new information):
- Current status (Investigating / Mitigating / Monitoring)
- Impact (what is broken, who is affected, % of traffic)
- What we are doing right now
- Next update in: 15 minutes
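To keep updates consistent under pressure, the standing agenda can be filled in by a tiny formatter. This is a sketch only: `status_update` is a hypothetical helper, and it accepts just the three status values named in the agenda above.

```shell
# status_update STATUS IMPACT ACTION
# Prints one update in the standing-agenda format; rejects unknown statuses.
# status_update is a hypothetical helper name used for this sketch.
status_update() {
  case "$1" in
    Investigating|Mitigating|Monitoring) ;;
    *) echo "status must be Investigating, Mitigating, or Monitoring" >&2; return 1 ;;
  esac
  printf 'Current status: %s\n' "$1"
  printf 'Impact: %s\n' "$2"
  printf 'What we are doing right now: %s\n' "$3"
  printf 'Next update in: 15 minutes\n'
}

# Example update:
status_update "Mitigating" "~5% of payment requests failing" "Rolling back the latest deploy"
```

Because the last line is always printed, the communicator commits to the next update even when there is nothing new to report.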

Database runbook commands cause additional downtime when run incorrectly

Add explicit warnings before destructive SQL commands and require a dry-run output check before executing:

-- WARNING: This terminates active connections. Verify count first.
-- DRY RUN (check count before terminating):
SELECT count(*) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '10 minutes';

-- EXECUTE only after verifying count is reasonable (< 50):
SELECT pg_terminate_backend(pid) FROM pg_stat_activity
WHERE state = 'idle' AND query_start < now() - interval '10 minutes';
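The "verify the count first" rule can also be enforced in a wrapper script rather than left to memory. The sketch below assumes the dry-run count has already been captured as a plain number, for example with `psql -At`; `confirm_terminate` is a hypothetical helper, and the default limit of 50 comes from the comment above.

```shell
# confirm_terminate COUNT [LIMIT]
# Succeeds only when the dry-run COUNT is a number below LIMIT (default 50).
# confirm_terminate is a hypothetical helper name used for this sketch.
confirm_terminate() {
  count="$1"
  limit="${2:-50}"
  case "$count" in
    ''|*[!0-9]*)
      echo "REFUSE: dry-run count '$count' is not a number" >&2
      return 2 ;;
  esac
  if [ "$count" -lt "$limit" ]; then
    echo "OK: $count idle connections, below limit of $limit"
  else
    echo "REFUSE: $count idle connections, limit is $limit; escalate instead" >&2
    return 1
  fi
}

# Usage (commented out; needs database access):
# count=$(psql -At -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '10 minutes';")
# confirm_terminate "$count" && psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '10 minutes';"
```

An empty or non-numeric count is treated as a refusal, which covers the case where the dry-run query itself failed.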

Related Skills

postmortem-writing - After resolving an incident, use postmortem templates to capture root cause and preventive actions
on-call-handoff-patterns - Structure shift handoffs so the incoming responder has full context on active incidents

incident-runbook-templates — detailed patterns and worked examples

Runbook Templates

Template 1: Service Outage Runbook


[Service Name] Outage Runbook

Overview

Service: Payment Processing Service
Owner: Platform Team
Slack: #payments-incidents
PagerDuty: payments-oncall

Impact Assessment

[ ] Which customers are affected?
[ ] What percentage of traffic is impacted?
[ ] Are there financial implications?
[ ] What's the blast radius?

Detection

Alerts

payment_error_rate > 5% (PagerDuty)
payment_latency_p99 > 2s (Slack)
payment_success_rate < 95% (PagerDuty)

Dashboards

Initial Triage (First 5 Minutes)

1. Assess Scope

# Check service health
kubectl get pods -n payments -l app=payment-service

# Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Check error rates
# (-G with --data-urlencode keeps curl from globbing the {} and [] in the query)
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))"


2. Quick Health Checks

[ ] Can you reach the service? curl -I https://api.company.com/payments/health
[ ] Database connectivity? Check connection pool metrics
[ ] External dependencies? Check Stripe, bank API status
[ ] Recent changes? Check deploy history

3. Initial Classification

Symptom	Likely Cause	Go To Section
All requests failing	Service down	Section 4.1
High latency	Database/dependency	Section 4.2
Partial failures	Code bug	Section 4.3
Spike in errors	Traffic surge	Section 4.4

Mitigation Procedures

4.1 Service Completely Down

# Step 1: Check pod status
kubectl get pods -n payments

# Step 2: If pods are crash-looping, check logs
kubectl logs -n payments -l app=payment-service --tail=100

# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Step 4: ROLLBACK if recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments

# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10

# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments

4.2 High Latency

# Step 1: Check database connections
kubectl exec -n payments deploy/payment-service -- \
  curl localhost:8080/metrics | grep db_pool

# Step 2: Check slow queries (if DB issue)
# (the filter repeats the expression because a SELECT alias cannot be used in WHERE)
psql -h $DB_HOST -U $DB_USER -c "
  SELECT pid, now() - query_start AS duration, query
  FROM pg_stat_activity
  WHERE state = 'active' AND now() - query_start > interval '5 seconds'
  ORDER BY duration DESC;"

# Step 3: Kill long-running queries if needed
# (set TARGET_PID to a pid returned by Step 2)
psql -h $DB_HOST -U $DB_USER -c "SELECT pg_terminate_backend($TARGET_PID);"

# Step 4: Check external dependency latency
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health

# Step 5: Enable circuit breaker if dependency is slow
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments

4.3 Partial Failures (Specific Errors)

# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500 | \
  grep -i error | sort | uniq -c | sort -rn | head -20

# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments

# Step 3: If specific endpoint, enable feature flag to disable
curl -X POST https://api.company.com/internal/feature-flags \
  -H 'Content-Type: application/json' \
  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'

# Step 4: If data issue, check recent data changes
psql -h $DB_HOST -c "
  SELECT * FROM audit_log
  WHERE table_name = 'payment_methods'
  AND created_at > now() - interval '1 hour';"

4.4 Traffic Surge

# Step 1: Check current load (kubectl top reports CPU and memory per pod, not request rate)
kubectl top pods -n payments

# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20

# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
  RATE_LIMIT_ENABLED=true \
  RATE_LIMIT_RPS=1000 -n payments

# Step 4: If attack, block suspicious IPs
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  ingress:
  - from:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 192.168.1.0/24  # Suspicious range
EOF

Verification Steps

# Verify service is healthy
curl -s https://api.company.com/payments/health | jq

# Verify error rate is back to normal
# (-G with --data-urlencode keeps curl from globbing the {} and [] in the query)
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))" | jq '.data.result[0].value[1]'

# Verify latency is acceptable
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))" | jq

# Smoke test critical flows
./scripts/smoke-test-payments.sh

Rollback Procedures

# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments

# Rollback database migration (if applicable)
./scripts/db-rollback.sh $MIGRATION_VERSION

# Rollback feature flag
curl -X POST https://api.company.com/internal/feature-flags \
  -H 'Content-Type: application/json' \
  -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'

Escalation Matrix

Condition	Escalate To	Contact
> 15 min unresolved SEV1	Engineering Manager	@manager (Slack)
Data breach suspected	Security Team	#security-incidents
Financial impact > $10k	Finance + Legal	@finance-oncall
Customer communication needed	Support Lead	@support-lead
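The time-based row of the matrix is the easiest one to automate. The sketch below encodes only that row (SEV1 unresolved for more than 15 minutes); the other rows depend on human judgment. `escalation_target` is a hypothetical helper name.

```shell
# escalation_target SEVERITY MINUTES_UNRESOLVED
# Prints who to escalate to for the time-based rule, or "none".
# escalation_target is a hypothetical helper name used for this sketch.
escalation_target() {
  case "$2" in
    ''|*[!0-9]*) echo "minutes must be a whole number" >&2; return 2 ;;
  esac
  if [ "$1" = "SEV1" ] && [ "$2" -gt 15 ]; then
    echo "Engineering Manager"
  else
    echo "none"
  fi
}

# Example: a SEV1 that has been open for 20 minutes.
escalation_target SEV1 20
```

A bot or timer that calls this every few minutes removes the need for the responder to watch the clock while debugging.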

Communication Templates

Initial Notification (Internal)

🚨 INCIDENT: Payment Service Degradation

Severity: SEV2
Status: Investigating
Impact: ~20% of payment requests failing
Start Time: [TIME]
Incident Commander: [NAME]

Current Actions:
- Investigating root cause
- Scaling up service
- Monitoring dashboards

Updates in #payments-incidents

Status Update

📊 UPDATE: Payment Service Incident

Status: Mitigating
Impact: Reduced to ~5% failure rate
Duration: 25 minutes

Actions Taken:
- Rolled back deployment v2.3.4 → v2.3.3
- Scaled service from 5 → 10 replicas

Next Steps:
- Continuing to monitor
- Root cause analysis in progress

ETA to Resolution: ~15 minutes

Resolution Notification

✅ RESOLVED: Payment Service Incident

Duration: 45 minutes
Impact: ~5,000 affected transactions
Root Cause: Memory leak in v2.3.4

Resolution:
- Rolled back to v2.3.3
- Transactions auto-retried successfully

Follow-up:
- Postmortem scheduled for [DATE]
- Bug fix in progress


Template 2: Database Incident Runbook

Database Incident Runbook

Quick Reference

Issue	Command
Check connections	SELECT count(*) FROM pg_stat_activity;
Kill query	SELECT pg_terminate_backend(<pid>);
Check replication lag	SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));
Check locks	SELECT * FROM pg_locks WHERE NOT granted;

Connection Pool Exhaustion

-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;

-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

-- Terminate idle connections
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND query_start < now() - interval '10 minutes';

Replication Lag

-- Check lag on replica
SELECT
  CASE
    WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
    ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
  END AS lag_seconds;

-- If lag > 60s, consider:
-- 1. Check network between primary/replica
-- 2. Check replica disk I/O
-- 3. Consider failover if unrecoverable
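The 60-second threshold in the comments can be applied to the query result in a script. This is a sketch, assuming the lag value is captured as plain text (for example with `psql -At`) and may be fractional or empty; `lag_status` is a hypothetical helper name.

```shell
# lag_status LAG_SECONDS
# Prints "ok", "investigate" (lag above 60s), or "unknown" (empty result).
# lag_status is a hypothetical helper name used for this sketch.
lag_status() {
  lag="$1"
  if [ -z "$lag" ]; then
    # An empty value usually means the query returned NULL.
    echo "unknown"
    return 0
  fi
  # awk handles fractional seconds, which shell integer tests cannot.
  awk -v lag="$lag" 'BEGIN { if (lag + 0 > 60) print "investigate"; else print "ok" }'
}

# Usage (commented out; REPLICA_HOST is an assumed variable):
# lag_status "$(psql -h "$REPLICA_HOST" -At -c "SELECT extract(epoch from now() - pg_last_xact_replay_timestamp());")"
```

Treating an empty result as "unknown" rather than "ok" avoids reporting a healthy replica when the query was run against the wrong host.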

Disk Space Critical

# Check disk usage
df -h /var/lib/postgresql/data

# Find large tables
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;"

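# VACUUM FULL to reclaim space
# Caution: VACUUM FULL rewrites the table, so it holds an exclusive lock for the
# duration and needs enough free disk for the new copy.
psql -c "VACUUM FULL large_table;"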

# If emergency, delete old data or expand disk
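To decide when a disk is "critical", the `df` output can be reduced to a single percentage and compared with a threshold. The sketch below is illustrative: `df_used_pct` and `disk_status` are hypothetical helpers, and the 90% default is an assumption, not a value taken from the runbook.

```shell
# Read `df -P` output on stdin and print the Capacity column without the % sign.
# df_used_pct and disk_status are hypothetical helper names for this sketch.
df_used_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# disk_status PCT [LIMIT]: "critical" at or above LIMIT (default 90), else "ok".
# The 90% default is an assumed threshold for illustration only.
disk_status() {
  if [ "$1" -ge "${2:-90}" ]; then
    echo "critical"
  else
    echo "ok"
  fi
}

# Usage (commented out; the path must exist on the host):
# disk_status "$(df -P /var/lib/postgresql/data | df_used_pct)"
```

Using `df -P` keeps each filesystem on one line with fixed columns, which makes the parsing step predictable across systems.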

How it compares

Use incident-runbook-templates over generic documentation skills when the output must be a paging-ready outage playbook with alert and dashboard sections.
