
Incident Runbook Templates
Generate and adapt production incident runbooks—with triage checklists, alert references, and kubectl probes—for services you operate as a solo builder.
Install
npx skills add https://github.com/wshobson/agents --skill incident-runbook-templatesWhat is this skill?
- Service Outage Runbook markdown template with owner, Slack, and PagerDuty fields
- Impact assessment checklist: customers, traffic %, financial risk, blast radius
- Detection section wiring alerts, Grafana dashboards, Sentry, and dependency status pages
- Initial Triage (First 5 Minutes): scope commands, health checks, classification table
- Kubernetes-oriented examples: pod health, rollout history, Prometheus error-rate queries
Adoption & trust: 7k installs on skills.sh; 36.5k GitHub stars; 2/3 security scanners passed (skills.sh audits).
Recommended Skills
Journey fit
Runbooks are shelved under operate/errors because they guide live incident response, though the same templates help during build when you document services and during ship when you define on-call expectations. Errors is the right subphase for outage playbooks, impact assessment, and first-five-minute triage—not generic infra provisioning or post-release analytics.
Common Questions / FAQ
Is Incident Runbook Templates safe to install?
skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md
READMESKILL.md - Incident Runbook Templates
# incident-runbook-templates — detailed patterns and worked examples ## Runbook Templates ### Template 1: Service Outage Runbook ````markdown # [Service Name] Outage Runbook ## Overview **Service**: Payment Processing Service **Owner**: Platform Team **Slack**: #payments-incidents **PagerDuty**: payments-oncall ## Impact Assessment - [ ] Which customers are affected? - [ ] What percentage of traffic is impacted? - [ ] Are there financial implications? - [ ] What's the blast radius? ## Detection ### Alerts - `payment_error_rate > 5%` (PagerDuty) - `payment_latency_p99 > 2s` (Slack) - `payment_success_rate < 95%` (PagerDuty) ### Dashboards - [Payment Service Dashboard](https://grafana/d/payments) - [Error Tracking](https://sentry.io/payments) - [Dependency Status](https://status.stripe.com) ## Initial Triage (First 5 Minutes) ### 1. Assess Scope ```bash # Check service health kubectl get pods -n payments -l app=payment-service # Check recent deployments kubectl rollout history deployment/payment-service -n payments # Check error rates curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))" ``` ```` ### 2. Quick Health Checks - [ ] Can you reach the service? `curl -I https://api.company.com/payments/health` - [ ] Database connectivity? Check connection pool metrics - [ ] External dependencies? Check Stripe, bank API status - [ ] Recent changes? Check deploy history ### 3. Initial Classification | Symptom | Likely Cause | Go To Section | | -------------------- | ------------------- | ------------- | | All requests failing | Service down | Section 4.1 | | High latency | Database/dependency | Section 4.2 | | Partial failures | Code bug | Section 4.3 | | Spike in errors | Traffic surge | Section 4.4 | ## Mitigation Procedures ### 4.1 Service Completely Down ```bash # Step 1: Check pod status kubectl get pods -n payments # Step 2: If pods are crash-looping, check logs kubectl logs -n payments -l app=payment-service --tail=100 # Step 3: Check recent deployments kubectl rollout history deployment/payment-service -n payments # Step 4: ROLLBACK if recent deploy is suspect kubectl rollout undo deployment/payment-service -n payments # Step 5: Scale up if resource constrained kubectl scale deployment/payment-service -n payments --replicas=10 # Step 6: Verify recovery kubectl rollout status deployment/payment-service -n payments ``` ### 4.2 High Latency ```bash # Step 1: Check database connections kubectl exec -n payments deploy/payment-service -- \ curl localhost:8080/metrics | grep db_pool # Step 2: Check slow queries (if DB issue) psql -h $DB_HOST -U $DB_USER -c " SELECT pid, now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND duration > interval '5 seconds' ORDER BY duration DESC;" # Step 3: Kill long-running queries if needed psql -h $DB_HOST -U $DB_USER -c "SELECT pg_terminate_backend(pid);" # Step 4: Check external dependency latency curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health # Step 5: Enable circuit breaker if dependency is slow kubectl set env deployment/payment-service \ STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments ``` ### 4.3 Partial Failures (Specific Errors) ```bash # Step 1: Identify error pattern kubectl logs -n payments -l app=payment-service --tail=500 | \ grep -i error | sort | uniq -c | sort -rn | head -20 # Step 2: Check error tracking # Go to Sentry: https://sentry.io/payments # Step 3: If specific endpoint, enable feature flag to disable curl -X POST https://api.company.com/internal/feature-flags \ -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}' # Step 4: If data issue, check recent data changes psql -h $DB_HOST -c " SELECT * FROM audit_log WHERE table_name = 'payment_methods' AND created_at > now() - interval '1 hour';" ``` ### 4.4 Traffic Surge ```bash # Step