Postmortem Writing

Name: Postmortem Writing
Author: wshobson

wshobson/agents

8.3k installs
38.3k repo stars
Updated July 22, 2026
wshobson/agents

postmortem-writing is an agent skill that Write effective blameless postmortems with root cause analysis, timelines, and action items. Use when conducting incident reviews, writing postmortem documents,.

About

Write effective blameless postmortems with root cause analysis timelines and action items Use when conducting incident reviews writing postmortem documents or improving incident response processes name postmortem-writing description Write effective blameless postmortems with root cause analysis timelines and action items Use when conducting incident reviews writing postmortem documents or improving incident response processes Postmortem Writing Comprehensive guide to writing effective blameless postmortems that drive organizational learning and prevent incident recurrence When to Use This Skill Conducting post-incident reviews Writing postmortem documents Facilitating blameless postmortem meetings Identifying root causes and contributing factors Creating actionable follow-up items Building organizational learning culture Core Concepts 1 Blameless Culture Blame-Focused Blameless Who caused this What conditions allowed this Someone made a mistake The system allowed this mistake Punish individuals Improve systems Hide information Share learnings Fear of speaking up Psychological safety 2 Postmortem Triggers SEV1 or SEV2 incidents Customer-facing outages 15 minutes Data loss or securi.

Conducting post-incident reviews
Writing postmortem documents
Facilitating blameless postmortem meetings
Identifying root causes and contributing factors
Creating actionable follow-up items

Postmortem Writing by the numbers

8,281 all-time installs (skills.sh)
+169 installs in the week ending Jul 28, 2026 (Skillselion tracking)
Ranked #195 of 2,184 Testing & QA skills by installs in the Skillselion catalog
Security screen: LOW risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

At a glance

postmortem-writing capabilities & compatibility

Capabilities: conducting post incident reviews · writing postmortem documents · facilitating blameless postmortem meetings · identifying root causes and contributing factors · creating actionable follow up items
Use cases: documentation

From the docs

What postmortem-writing says it does

--- name: postmortem-writing description: Write effective blameless postmortems with root cause analysis, timelines, and action items.

SKILL.md

Use when conducting incident reviews, writing postmortem documents, or improving incident response processes.

SKILL.md

--- # Postmortem Writing Comprehensive guide to writing effective, blameless postmortems that drive organizational learning and prevent incident recurrence.

SKILL.md

Read that file when you need the concrete templates.

SKILL.md

npx skills add https://github.com/wshobson/agents --skill postmortem-writing

Add your badge

Show developers this skill is listed on Skillselion. Paste this into your README.

[![Listed on Skillselion](https://skillselion.com/badge/skills/wshobson/agents/postmortem-writing.svg)](https://skillselion.com/skills/wshobson/agents/postmortem-writing)

Installs	8.3k
repo stars	★ 38.3k
Security audit	3 / 3 scanners passed
Last updated	July 22, 2026
Repository	wshobson/agents ↗

What problem does postmortem-writing solve for developers using this skill?

Write effective blameless postmortems with root cause analysis, timelines, and action items. Use when conducting incident reviews, writing postmortem documents, or improving incident response processe

Who is it for?

Developers who need postmortem-writing patterns described in the cached skill documentation.

Skip if: Skip when docs are empty or the task is outside the skill's documented scope.

When should I use this skill?

What you get

Actionable workflows and conventions from SKILL.md for postmortem-writing.

Blameless postmortem markdown document

By the numbers

Worked example documents a 47-minute SEV2 outage affecting 12,000 customers

Files

SKILL.mdMarkdownGitHub ↗

Postmortem Writing

Comprehensive guide to writing effective, blameless postmortems that drive organizational learning and prevent incident recurrence.

When to Use This Skill

Conducting post-incident reviews
Writing postmortem documents
Facilitating blameless postmortem meetings
Identifying root causes and contributing factors
Creating actionable follow-up items
Building organizational learning culture

Core Concepts

1. Blameless Culture

Blame-Focused	Blameless
"Who caused this?"	"What conditions allowed this?"
"Someone made a mistake"	"The system allowed this mistake"
Punish individuals	Improve systems
Hide information	Share learnings
Fear of speaking up	Psychological safety

2. Postmortem Triggers

SEV1 or SEV2 incidents
Customer-facing outages > 15 minutes
Data loss or security incidents
Near-misses that could have been severe
Novel failure modes
Incidents requiring unusual intervention

Quick Start

Postmortem Timeline

Day 0: Incident occurs
Day 1-2: Draft postmortem document
Day 3-5: Postmortem meeting
Day 5-7: Finalize document, create tickets
Week 2+: Action item completion
Quarterly: Review patterns across incidents

Templates and detailed worked examples

Full template library and detailed worked examples live in references/details.md. Read that file when you need the concrete templates.

References

Connection Pool Best Practices
Deployment Runbook


### Template 2: 5 Whys Analysis

5 Whys Analysis: [Incident]

Problem Statement

Payment service experienced 47-minute outage due to database connection exhaustion.

Analysis

Why #1: Why did the service fail?

Answer: Database connections were exhausted, causing all new requests to fail.

Evidence: Metrics showed connection count at 100/100 (max), with 500+ pending requests.

---

Why #2: Why were database connections exhausted?

Answer: Each incoming request opened a new database connection instead of using the connection pool.

Evidence: Code diff shows direct DriverManager.getConnection() instead of pooled DataSource.

---

Why #3: Why did the code bypass the connection pool?

Answer: A developer refactored the repository class and inadvertently changed the connection acquisition method.

Evidence: PR #1234 shows the change, made while fixing a different bug.

---

Why #4: Why wasn't this caught in code review?

Answer: The reviewer focused on the functional change (the bug fix) and didn't notice the infrastructure change.

Evidence: Review comments only discuss business logic.

---

Why #5: Why isn't there a safety net for this type of change?

Answer: We lack automated tests that verify connection pool behavior and lack documentation about our connection patterns.

Evidence: Test suite has no tests for connection handling; wiki has no article on database connections.

Root Causes Identified

1. Primary: Missing automated tests for infrastructure behavior 2. Secondary: Insufficient documentation of architectural patterns 3. Tertiary: Code review checklist doesn't include infrastructure considerations

Systemic Improvements

Root Cause	Improvement	Type
Missing tests	Add infrastructure behavior tests	Prevention
Missing docs	Document connection patterns	Prevention
Review gaps	Update review checklist	Detection
No canary	Implement canary deployments	Mitigation


### Template 3: Quick Postmortem (Minor Incidents)

Quick Postmortem: [Brief Title]

Date: 2024-01-15 | Duration: 12 min | Severity: SEV3

What Happened

API latency spiked to 5s due to cache miss storm after cache flush.

Timeline

10:00 - Cache flush initiated for config update
10:02 - Latency alerts fire
10:05 - Identified as cache miss storm
10:08 - Enabled cache warming
10:12 - Latency normalized

Root Cause

Full cache flush for minor config update caused thundering herd.

Fix

Immediate: Enabled cache warming
Long-term: Implement partial cache invalidation (ENG-999)

Lessons

Don't full-flush cache in production; use targeted invalidation.


## Facilitation Guide

### Running a Postmortem Meeting

Meeting Structure (60 minutes)

1. Opening (5 min)

Remind everyone of blameless culture
"We're here to learn, not to blame"
Review meeting norms

2. Timeline Review (15 min)

Walk through events chronologically
Ask clarifying questions
Identify gaps in timeline

3. Analysis Discussion (20 min)

What failed?
Why did it fail?
What conditions allowed this?
What would have prevented it?

4. Action Items (15 min)

Brainstorm improvements
Prioritize by impact and effort
Assign owners and due dates

5. Closing (5 min)

Summarize key learnings
Confirm action item owners
Schedule follow-up if needed

Facilitation Tips

Keep discussion on track
Redirect blame to systems
Encourage quiet participants
Document dissenting views
Time-box tangents


## Anti-Patterns to Avoid

| Anti-Pattern            | Problem                    | Better Approach                 |
| ----------------------- | -------------------------- | ------------------------------- |
| **Blame game**          | Shuts down learning        | Focus on systems                |
| **Shallow analysis**    | Doesn't prevent recurrence | Ask "why" 5 times               |
| **No action items**     | Waste of time              | Always have concrete next steps |
| **Unrealistic actions** | Never completed            | Scope to achievable tasks       |
| **No follow-up**        | Actions forgotten          | Track in ticketing system       |

## Best Practices

### Do's

- **Start immediately** - Memory fades fast
- **Be specific** - Exact times, exact errors
- **Include graphs** - Visual evidence
- **Assign owners** - No orphan action items
- **Share widely** - Organizational learning

### Don'ts

- **Don't name and shame** - Ever
- **Don't skip small incidents** - They reveal patterns
- **Don't make it a blame doc** - That kills learning
- **Don't create busywork** - Actions should be meaningful
- **Don't skip follow-up** - Verify actions completed

postmortem-writing — templates and worked examples

Templates

Template 1: Standard Postmortem

# Postmortem: [Incident Title]

**Date**: 2024-01-15
**Authors**: @alice, @bob
**Status**: Draft | In Review | Final
**Incident Severity**: SEV2
**Incident Duration**: 47 minutes

## Executive Summary

On January 15, 2024, the payment processing service experienced a 47-minute outage affecting approximately 12,000 customers. The root cause was a database connection pool exhaustion triggered by a configuration change in deployment v2.3.4. The incident was resolved by rolling back to v2.3.3 and increasing connection pool limits.

**Impact**:

- 12,000 customers unable to complete purchases
- Estimated revenue loss: $45,000
- 847 support tickets created
- No data loss or security implications

## Timeline (All times UTC)

| Time  | Event                                           |
| ----- | ----------------------------------------------- |
| 14:23 | Deployment v2.3.4 completed to production       |
| 14:31 | First alert: `payment_error_rate > 5%`          |
| 14:33 | On-call engineer @alice acknowledges alert      |
| 14:35 | Initial investigation begins, error rate at 23% |
| 14:41 | Incident declared SEV2, @bob joins              |
| 14:45 | Database connection exhaustion identified       |
| 14:52 | Decision to rollback deployment                 |
| 14:58 | Rollback to v2.3.3 initiated                    |
| 15:10 | Rollback complete, error rate dropping          |
| 15:18 | Service fully recovered, incident resolved      |

## Root Cause Analysis

### What Happened

The v2.3.4 deployment included a change to the database query pattern that inadvertently removed connection pooling for a frequently-called endpoint. Each request opened a new database connection instead of reusing pooled connections.

### Why It Happened

1. **Proximate Cause**: Code change in `PaymentRepository.java` replaced pooled `DataSource` with direct `DriverManager.getConnection()` calls.

2. **Contributing Factors**:
   - Code review did not catch the connection handling change
   - No integration tests specifically for connection pool behavior
   - Staging environment has lower traffic, masking the issue
   - Database connection metrics alert threshold was too high (90%)

3. **5 Whys Analysis**:
   - Why did the service fail? → Database connections exhausted
   - Why were connections exhausted? → Each request opened new connection
   - Why did each request open new connection? → Code bypassed connection pool
   - Why did code bypass connection pool? → Developer unfamiliar with codebase patterns
   - Why was developer unfamiliar? → No documentation on connection management patterns

### System Diagram

[Client] → [Load Balancer] → [Payment Service] → [Database] ↓ Connection Pool (broken) ↓ Direct connections (cause)


## Detection

### What Worked
- Error rate alert fired within 8 minutes of deployment
- Grafana dashboard clearly showed connection spike
- On-call response was swift (2 minute acknowledgment)

### What Didn't Work
- Database connection metric alert threshold too high
- No deployment-correlated alerting
- Canary deployment would have caught this earlier

### Detection Gap
The deployment completed at 14:23, but the first alert didn't fire until 14:31 (8 minutes). A deployment-aware alert could have detected the issue faster.

## Response

### What Worked
- On-call engineer quickly identified database as the issue
- Rollback decision was made decisively
- Clear communication in incident channel

### What Could Be Improved
- Took 10 minutes to correlate issue with recent deployment
- Had to manually check deployment history
- Rollback took 12 minutes (could be faster)

## Impact

### Customer Impact
- 12,000 unique customers affected
- Average impact duration: 35 minutes
- 847 support tickets (23% of affected users)
- Customer satisfaction score dropped 12 points

### Business Impact
- Estimated revenue loss: $45,000
- Support cost: ~$2,500 (agent time)
- Engineering time: ~8 person-hours

### Technical Impact
- Database primary experienced elevated load
- Some replica lag during incident
- No permanent damage to systems

## Lessons Learned

### What Went Well
1. Alerting detected the issue before customer reports
2. Team collaborated effectively under pressure
3. Rollback procedure worked smoothly
4. Communication was clear and timely

### What Went Wrong
1. Code review missed critical change
2. Test coverage gap for connection pooling
3. Staging environment doesn't reflect production traffic
4. Alert thresholds were not tuned properly

### Where We Got Lucky
1. Incident occurred during business hours with full team available
2. Database handled the load without failing completely
3. No other incidents occurred simultaneously

## Action Items

| Priority | Action | Owner | Due Date | Ticket |
|----------|--------|-------|----------|--------|
| P0 | Add integration test for connection pool behavior | @alice | 2024-01-22 | ENG-1234 |
| P0 | Lower database connection alert threshold to 70% | @bob | 2024-01-17 | OPS-567 |
| P1 | Document connection management patterns | @alice | 2024-01-29 | DOC-89 |
| P1 | Implement deployment-correlated alerting | @bob | 2024-02-05 | OPS-568 |
| P2 | Evaluate canary deployment strategy | @charlie | 2024-02-15 | ENG-1235 |
| P2 | Load test staging with production-like traffic | @dave | 2024-02-28 | QA-123 |

## Appendix

### Supporting Data

#### Error Rate Graph
[Link to Grafana dashboard snapshot]

#### Database Connection Graph
[Link to metrics]

### Related Incidents
- 2023-11-02: Similar connection issue in User Service (POSTMORTEM-42)

#

Related skills

TddFollow test-driven development with a strict red-green-refactor loop when creating reliable features or fixing bugs.510k185k

Test Driven DevelopmentEnforce writing failing tests before any production implementation code.176k260k

QaRun conversational QA sessions that turn user-reported bugs into well-written, domain-aware GitHub issues without manual ticket writing.164k185k

Migrate To ShoehornAutomatically update TypeScript test files that rely on unsafe `as` type assertions by replacing them with type-safe partial objects from @total-typescript/shoehorn.151k185k

Webapp TestingVerify frontend behavior, debug UI issues, capture screenshots, and inspect logs of a running local web application using Playwright.121k164k

Playwright CliRun browser automation, generate element snapshots, inspect DOM attributes, and execute Playwright tests from the terminal.96.3k12.2k

How it compares

Choose postmortem-writing when you need incident narrative templates; use runbook or monitoring skills for prevention automation.

About

Postmortem Writing by the numbers

postmortem-writing capabilities & compatibility

What postmortem-writing says it does

Add your badge

What problem does postmortem-writing solve for developers using this skill?

Who is it for?

When should I use this skill?

What you get

By the numbers

Files

Postmortem Writing

When to Use This Skill

Core Concepts

1. Blameless Culture

2. Postmortem Triggers

Quick Start

Postmortem Timeline

Templates and detailed worked examples

References

5 Whys Analysis: [Incident]

Problem Statement

Analysis

Why #1: Why did the service fail?

Why #2: Why were database connections exhausted?

Why #3: Why did the code bypass the connection pool?

Why #4: Why wasn't this caught in code review?

Why #5: Why isn't there a safety net for this type of change?

Root Causes Identified

Systemic Improvements

Quick Postmortem: [Brief Title]

What Happened

Timeline

Root Cause

Fix

Lessons

Meeting Structure (60 minutes)

1. Opening (5 min)

2. Timeline Review (15 min)

3. Analysis Discussion (20 min)

4. Action Items (15 min)

5. Closing (5 min)

Facilitation Tips

postmortem-writing — templates and worked examples

Templates

Template 1: Standard Postmortem

Related skills

How it compares

FAQ

What does postmortem-writing do?

When should I use postmortem-writing?

Is postmortem-writing safe to install?

This week in AI coding