
Postmortem Writing
Draft blameless incident postmortems with timelines, impact, and action items after outages or SEV events.
Overview
Postmortem Writing is an agent skill most often used in Operate (also Ship and Grow) that supplies standard incident postmortem templates and a filled timeline example.
Install
npx skills add https://github.com/wshobson/agents --skill postmortem-writingWhat is this skill?
- Standard postmortem markdown template with Executive Summary, Impact, and Timeline sections.
- Worked example includes SEV2 severity, duration in minutes, and UTC timeline table rows.
- Status workflow: Draft, In Review, Final for team review before publishing.
- Captures customer impact metrics (users affected, revenue, support tickets) alongside technical root cause.
- Structured narrative suitable for agents to fill from incident notes and pager logs.
- Standard template includes Executive Summary, Impact, and UTC Timeline sections.
- Worked example documents a 47-minute SEV2 incident with an 8-row timeline table.
Adoption & trust: 6.9k installs on skills.sh; 36.5k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
After an outage you have scattered alerts and chat logs but no consistent postmortem doc stakeholders can review.
Who is it for?
Solo founders or two-person teams documenting production incidents for themselves, investors, or support-heavy SaaS products.
Skip if: Live incident response runbooks or automated alerting setup—use those before you write the retrospective.
When should I use this skill?
A production incident has stabilized and you need a structured blameless postmortem document.
What do I get? / Deliverables
You produce a structured postmortem with timeline, impact, and root-cause narrative ready for review and follow-up action items.
- Markdown postmortem with metadata, executive summary, and impact
- UTC timeline table from detection through mitigation
- Draft or Final status doc ready for review
Recommended Skills
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Canonical shelf is operate/iterate because postmortems convert production pain into process fixes after the incident ends. Writing and reviewing the report is operational learning and follow-up, not initial monitoring setup or pre-launch security review.
Where it fits
Turn launch-week payment errors into a dated timeline before you close the launch checklist.
Summarize alert-to-rollback steps after error-rate spikes so the next deploy avoids the same config mistake.
Convert incident notes into Final-status postmortem with action items for the next sprint.
Align support ticket volume and customer impact bullets with the public status narrative.
How it compares
Markdown template pack for blameless postmortems—not a monitoring MCP or on-call paging integration.
Common Questions / FAQ
Who is postmortem-writing for?
Builders and agents who need repeatable incident write-ups after production failures, especially small teams without a dedicated SRE docs team.
When should I use postmortem-writing?
After ship or operate incidents when error rates spike—fill the template during operate iterate; reuse in grow when publishing transparency content from the same doc.
Is postmortem-writing safe to install?
Review the Security Audits panel on this page; redact customer PII and secrets before sharing generated postmortems externally.
SKILL.md
READMESKILL.md - Postmortem Writing
# postmortem-writing — templates and worked examples ## Templates ### Template 1: Standard Postmortem ```markdown # Postmortem: [Incident Title] **Date**: 2024-01-15 **Authors**: @alice, @bob **Status**: Draft | In Review | Final **Incident Severity**: SEV2 **Incident Duration**: 47 minutes ## Executive Summary On January 15, 2024, the payment processing service experienced a 47-minute outage affecting approximately 12,000 customers. The root cause was a database connection pool exhaustion triggered by a configuration change in deployment v2.3.4. The incident was resolved by rolling back to v2.3.3 and increasing connection pool limits. **Impact**: - 12,000 customers unable to complete purchases - Estimated revenue loss: $45,000 - 847 support tickets created - No data loss or security implications ## Timeline (All times UTC) | Time | Event | | ----- | ----------------------------------------------- | | 14:23 | Deployment v2.3.4 completed to production | | 14:31 | First alert: `payment_error_rate > 5%` | | 14:33 | On-call engineer @alice acknowledges alert | | 14:35 | Initial investigation begins, error rate at 23% | | 14:41 | Incident declared SEV2, @bob joins | | 14:45 | Database connection exhaustion identified | | 14:52 | Decision to rollback deployment | | 14:58 | Rollback to v2.3.3 initiated | | 15:10 | Rollback complete, error rate dropping | | 15:18 | Service fully recovered, incident resolved | ## Root Cause Analysis ### What Happened The v2.3.4 deployment included a change to the database query pattern that inadvertently removed connection pooling for a frequently-called endpoint. Each request opened a new database connection instead of reusing pooled connections. ### Why It Happened 1. **Proximate Cause**: Code change in `PaymentRepository.java` replaced pooled `DataSource` with direct `DriverManager.getConnection()` calls. 2. **Contributing Factors**: - Code review did not catch the connection handling change - No integration tests specifically for connection pool behavior - Staging environment has lower traffic, masking the issue - Database connection metrics alert threshold was too high (90%) 3. **5 Whys Analysis**: - Why did the service fail? → Database connections exhausted - Why were connections exhausted? → Each request opened new connection - Why did each request open new connection? → Code bypassed connection pool - Why did code bypass connection pool? → Developer unfamiliar with codebase patterns - Why was developer unfamiliar? → No documentation on connection management patterns ### System Diagram ``` [Client] → [Load Balancer] → [Payment Service] → [Database] ↓ Connection Pool (broken) ↓ Direct connections (cause) ``` ## Detection ### What Worked - Error rate alert fired within 8 minutes of deployment - Grafana dashboard clearly showed connection spike - On-call response was swift (2 minute acknowledgment) ### What Didn't Work - Database connection metric alert threshold too high - No deployment-correlated alerting - Canary deployment would have caught this earlier ### Detection Gap The deployment completed at 14:23, but the first alert didn't fire until 14:31 (8 minutes). A deployment-aware alert could have detected the issue faster. ## Response ### What Worked - On-call engineer quickly identified database as the issue - Rollback decision was made decisively - Clear communication in incident channel ### What Could Be Improved - Took 10 minutes to correlate issue with recent deployment - Had to manually check deployment history - Rollback took 12 minutes (could be faster) ## Impact ### Customer Impact - 12,000 unique customers affected - Average impact duration: 35 minutes - 847 support tickets (23% of affected users) - Customer satisfaction score dropped 12 points ### Business Impact - Estimated revenue loss: $45,000 -