Root Cause Analysis

Root cause work belongs after something breaks in production or staging, when reliability and learning matter more than feature velocity. Errors is the canonical shelf because RCA starts from alerts, outages, and failure signals rather than greenfield design.

Where it fits

Example use

Operate: Monitoring & observability

Draft an RCA after connection-pool exhaustion raised error rates for thousands of users.

Example use

Define alert and metric checks to verify a fix and catch early warnings.

Example use

Document a bad deploy rollback with timeline before the next release gate.

Example use

Turn repeat support tickets into contributing factors and procedural updates.

How it compares

Use for structured incident learning instead of generic debugging tips with no report or prevention loop.

Common Questions / FAQ

Who is root-cause-analysis for?

Solo founders and small teams who ship their own APIs or SaaS and need disciplined incident write-ups without a dedicated reliability org.

When should I use root-cause-analysis?

In Operate after alerts or user-reported failures; during Ship when a launch regression needs a formal timeline; and in Grow when support escalations reveal a systemic defect worth documenting.

Is root-cause-analysis safe to install?

It is prompt and checklist content. Avoid pasting live secrets into reports, and review the Security Audits panel on this page before granting the agent broad repo access.

SKILL.md - Root Cause Analysis

# Follow-Up & Prevention

```yaml
After RCA:

1. Track Action Items
  - Assign owner
  - Set deadline
  - Follow up in retrospective

2. Prevent Recurrence
  - Automated tests
  - Monitoring/alerts
  - Procedural changes
  - Training

3. Monitor Metrics
  - Track similar incidents
  - Verify fix effectiveness
  - Monitor preventive measures
  - Catch early warnings

4. Share Learnings
  - Document incident
  - Share with team
  - Industry sharing if relevant
  - Update procedures

---

Checklist:

[ ] Incident details documented
[ ] Timeline established
[ ] Logs reviewed
[ ] Metrics analyzed
[ ] Root cause identified (via 5 Whys)
[ ] Contributing factors listed
[ ] Immediate actions completed
[ ] Short-term solutions planned
[ ] Long-term solutions identified
[ ] Solutions prioritized
[ ] RCA report written
[ ] Team debriefing scheduled
[ ] Action items assigned
[ ] Prevention measures planned
[ ] Follow-up scheduled
```
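The follow-up loop above (assign an owner, set a deadline, follow up in the retrospective) can be sketched as a small tracker. This is a minimal illustration, not part of the skill itself: the `ActionItem` fields, the owner names, and the dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    """One RCA follow-up item; the field names are illustrative."""
    title: str
    owner: str
    deadline: date
    done: bool = False


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Return open items whose deadline has passed, oldest deadline first."""
    late = [item for item in items if not item.done and item.deadline < today]
    return sorted(late, key=lambda item: item.deadline)


# Hypothetical owners and deadlines, for illustration only.
items = [
    ActionItem("Add query duration alert", "alice", date(2024, 1, 22)),
    ActionItem("Add index for slow query", "bob", date(2024, 1, 22), done=True),
    ActionItem("Performance benchmarks in CI/CD", "carol", date(2024, 2, 15)),
]

for item in overdue(items, today=date(2024, 1, 25)):
    print(f"OVERDUE: {item.title} (owner: {item.owner}, due {item.deadline})")
```

Running a check like this before each retrospective is one way to make sure "Follow up in retrospective" covers every open item rather than only the ones people remember.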


# RCA Report Template

```yaml
RCA Report:

Incident: Database connection failure (2024-01-15, 14:30-15:15)

Impact:
  - Duration: 45 minutes
  - Users affected: 5,000 (10% of user base)
  - Revenue lost: ~$2,000
  - Severity: P1 (Critical)

Timeline:
  14:30: Automated monitoring alert: High error rate (20%)
  14:32: On-call engineer notified
  14:35: Identified database connection error in logs
  14:40: Restarted database connection pool
  14:42: Service recovered, error rate returned to 0.1%
  14:50: Incident declared resolved
  15:15: Full recovery verified

Root Cause:
  Poorly optimized query introduced in release 2.5.0 caused
  queries to take 10x longer. Connection pool exhausted as
  connections weren't released quickly.

Contributing Factors:
  1. No query performance testing pre-deployment
  2. Load testing environment doesn't match production volume
  3. No alerting on query duration
  4. Connection pool timeout set too high

Solutions:
  Immediate (Done):
    - Rolled back problematic query optimization

  Short-term (1 week):
    - Add query performance alerts (>1s)
    - Add index for slow query
    - Set query timeout to 5 seconds

  Long-term (1 month):
    - Update load testing with production-like data
    - Implement performance benchmarks in CI/CD
    - Improve monitoring for connection pool health
    - Training on query optimization

Prevention:
  - Query performance regression tests
  - Load testing with production-like data
  - Connection pool metrics monitoring
  - Code review of database changes
```
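The Impact block in the template can be derived from the incident window and user counts rather than typed by hand. The sketch below is one hedged way to do that. `impact_summary` is a hypothetical helper, and the total of 50,000 users is inferred from the template's "5,000 (10% of user base)", not stated in it.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"


def impact_summary(start: str, end: str, users_affected: int, total_users: int) -> dict:
    """Compute duration in whole minutes and the share of users affected."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    minutes = int(delta.total_seconds() // 60)
    pct = round(100 * users_affected / total_users, 1)
    return {"duration_minutes": minutes, "users_affected_pct": pct}


# Incident window taken from the template; total_users is an inferred assumption.
summary = impact_summary("2024-01-15 14:30", "2024-01-15 15:15", 5_000, 50_000)
print(summary)  # {'duration_minutes': 45, 'users_affected_pct': 10.0}
```

Computing these fields from the first alert and the verified-recovery timestamp keeps the Impact block consistent with the Timeline.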


# Root Cause Analysis Techniques

```yaml
Fishbone Diagram:

Main problem: Slow API Response

Branches:

  Code:
    - Inefficient algorithm
    - Missing cache
    - Unnecessary queries

  Data:
    - Large dataset
    - Missing index
    - Slow database

  Infrastructure:
    - Low CPU capacity
    - Slow network
    - Disk I/O bottleneck

  Process:
    - No monitoring
    - No load testing
    - Manual deployments

  People:
    - Lack of knowledge
    - Lack of tools
    - No peer review

---

Systemic vs. Individual Causes:

Individual: "Developer used inefficient code"
  Fix: Training
  Risk: Happens again with a different person

Systemic: "No code review process"
  Fix: Implement mandatory code review
  Benefit: Prevents similar issues

Prefer systemic solutions for prevention
```
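A fishbone diagram is also easy to hold as plain data, which makes it simple to walk every candidate cause and see which branch the confirmed ones cluster in. The sketch below copies the branches from the diagram above. The `confirmed` list is a made-up investigation result for illustration, and both helper functions are hypothetical.

```python
from collections import Counter

# Branches copied from the fishbone diagram: branch -> candidate causes.
FISHBONE = {
    "Code": ["Inefficient algorithm", "Missing cache", "Unnecessary queries"],
    "Data": ["Large dataset", "Missing index", "Slow database"],
    "Infrastructure": ["Low CPU capacity", "Slow network", "Disk I/O bottleneck"],
    "Process": ["No monitoring", "No load testing", "Manual deployments"],
    "People": ["Lack of knowledge", "Lack of tools", "No peer review"],
}


def hypotheses(diagram: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Flatten the diagram into (branch, cause) pairs to check one by one."""
    return [(branch, cause) for branch, causes in diagram.items() for cause in causes]


def confirmed_by_branch(confirmed: list[tuple[str, str]]) -> Counter:
    """Count confirmed causes per branch to show where problems cluster."""
    return Counter(branch for branch, _ in confirmed)


# Hypothetical outcome of an investigation, for illustration only.
confirmed = [
    ("Data", "Missing index"),
    ("Process", "No load testing"),
    ("Process", "No monitoring"),
]

print(len(hypotheses(FISHBONE)), "candidate causes to check")
print(confirmed_by_branch(confirmed).most_common(1))
```

When confirmed causes cluster in a branch such as Process, that is a hint the fix should be systemic rather than aimed at one person.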


# Systematic RCA Process

```yaml
Step 1: Gather Facts
  - When did the issue occur?
  - Who detected it?
  - How many users were affected?
  - What error messages appeared?
  - What system changes were deployed?
  - Check logs, metrics, alerts
  - Determine impact scope

Step 2: Reproduce
  - Can we reproduce consistently?
  - What are the exact steps?
  - What environment (prod, staging)?
  - Can we isolate to component?
  - Set up test case

Step 3: Identify Contributing Factors
  - Direct cause
  - Indirect/enabling factors
  - System vulnerabilities
  - Procedural gaps
  - Knowledge gaps

Step 4: Determine Root Cause
```

What is this skill?

End-to-end RCA checklist from documentation through debrief and assigned action items

5 Whys and contributing-factors framing with example incident timeline and impact block

Follow-up loop: owners, deadlines, retros, and effectiveness monitoring

Prevention bundle: tests, alerts, procedures, and training hooks

RCA report template with severity, duration, and revenue-style impact fields

RCA checklist with timeline, 5 Whys, and follow-up sections

Example P1 incident template with duration and user-impact fields

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Adoption & trust: 514 installs on skills.sh; 251 GitHub stars; 3/3 security scanners passed (skills.sh audits).

Journey fit

Spans multiple journey phases - primary shelf plus alternate fits below.
