Incident Commander

Name: Incident Commander
Author: alirezarezvani

alirezarezvani/claude-skills

581 installs
23.5k repo stars
Updated July 17, 2026

incident-commander is an agent skill that generates structured SEV1–SEV4 incident reports, timelines, and post-incident reviews for developers coordinating production outage response and SRE postmortems.

About

incident-commander is an engineering-team agent skill (version 1.0.0) for availability and reliability incidents, not security forensics. It provides an incident response framework with SEV1 through SEV4 severity definitions, executive-ready report templates, and three Python utilities: incident_classifier.py for triage, timeline_reconstructor.py for chronological narratives, and pir_generator.py for post-incident reviews using the 5 Whys, Fishbone, and timeline RCA frameworks. SEV1 criteria include complete customer-facing outages, and the SEV1 response requirements call for commander assignment within 5 minutes and executive notification within 15 minutes. The skill scores operational impact and leaves intrusion and data exfiltration, handled under NIST SP 800-61, to the separate incident-response security skill.

On-call engineers and platform leads use incident-commander when declaring incidents, coordinating war rooms, or drafting PIR documents after mitigation. It standardizes stakeholder communication frequency and impact tables, so postmortems do not end up as ad-hoc Slack threads without owners or timelines. Templates include communication cadence guidance by severity level.

Standardized incident report template with severity, status, and commander fields
Executive Summary optimized for non-technical stakeholders
Comprehensive Impact Statement table covering duration, users, revenue, SLA, and regions
Detailed Timeline with phases, key decision points and rationales
Customer-Facing Impact section focused on user journeys

Incident Commander by the numbers

581 all-time installs (skills.sh)
Ranked #73 of 598 Debugging skills by installs in the Skillselion catalog
Security screen: LOW risk (skills.sh audit)
Data as of Jul 31, 2026 (Skillselion catalog sync)

npx skills add https://github.com/alirezarezvani/claude-skills --skill incident-commander


Security audit	3 / 3 scanners passed

How do you write an executive incident report?

Generate structured, executive-ready incident reports when production issues occur.

Who is it for?

SREs and on-call engineers who need SEV1–SEV4 triage templates, timelines, and post-incident reviews during production outages.

Skip if: you are handling a security intrusion or forensic investigation. Those call for the separate incident-response security skill, not availability postmortems.

When should I use this skill?

Production is degraded or down and the user needs incident commander templates, severity classification, or PIR generation.

What you get

SEV-classified incident report, reconstructed timeline, stakeholder comms templates, and post-incident review document

Executive incident report
Chronological timeline
Post-incident review with action items

By the numbers

Defines 4 operational severity levels from SEV1 through SEV4
Includes 3 Python utilities: classifier, timeline reconstructor, and PIR generator

Files

assets/
expected_outputs/
references/
scripts/

SKILL.md

Incident Commander Skill

Category: Engineering Team
Tier: POWERFUL
Author: Claude Skills Team
Version: 1.0.0
Last Updated: February 2026

Overview

Incident response framework for availability/reliability incidents (outages, degradations, failed deploys): severity classification, timeline reconstruction, and post-incident review.

This is NOT security incident triage. For security events (ransomware, intrusion, data exfiltration, IOC analysis, NIST SP 800-61 forensics), route to incident-response. Both skills use SEV1-SEV4 labels; this one scores operational impact (users, revenue, SLA), while incident-response classifies attack types and forensic handling.

Key Features

Automated Severity Classification - Intelligent incident triage based on impact and urgency metrics
Timeline Reconstruction - Transform scattered logs and events into coherent incident narratives
Post-Incident Review Generation - Structured PIRs with multiple RCA frameworks
Communication Templates - Pre-built templates for stakeholder updates and escalations
Runbook Integration - Generate actionable runbooks from incident patterns

Skills Included

Core Tools

1. Incident Classifier (incident_classifier.py)

Analyzes incident descriptions and outputs severity levels
Recommends response teams and initial actions
Generates communication templates based on severity

2. Timeline Reconstructor (timeline_reconstructor.py)

Processes timestamped events from multiple sources
Reconstructs chronological incident timeline
Identifies gaps and provides duration analysis

3. PIR Generator (pir_generator.py)

Creates comprehensive Post-Incident Review documents
Applies multiple RCA frameworks (5 Whys, Fishbone, Timeline)
Generates actionable follow-up items

Incident Response Framework

Severity Classification System

SEV1 - Critical Outage

Definition: Complete service failure affecting all users or critical business functions

Characteristics:

Customer-facing services completely unavailable
Data loss or corruption affecting users
Security breaches with customer data exposure
Revenue-generating systems down
SLA violations with financial penalties

Response Requirements:

Immediate escalation to on-call engineer
Incident Commander assigned within 5 minutes
Executive notification within 15 minutes
Public status page update within 15 minutes
War room established
All hands on deck if needed

Communication Frequency: Every 15 minutes until resolution

SEV2 - Major Impact

Definition: Significant degradation affecting subset of users or non-critical functions

Characteristics:

Partial service degradation (>25% of users affected)
Performance issues causing user frustration
Non-critical features unavailable
Internal tools impacting productivity
Data inconsistencies not affecting user experience

Response Requirements:

On-call engineer response within 15 minutes
Incident Commander assigned within 30 minutes
Status page update within 30 minutes
Stakeholder notification within 1 hour
Regular team updates

Communication Frequency: Every 30 minutes during active response

SEV3 - Minor Impact

Definition: Limited impact with workarounds available

Characteristics:

Single feature or component affected
<25% of users impacted
Workarounds available
Performance degradation not significantly impacting UX
Non-urgent monitoring alerts

Response Requirements:

Response within 2 hours during business hours
Next business day response acceptable outside hours
Internal team notification
Optional status page update

Communication Frequency: At key milestones only

SEV4 - Low Impact

Definition: Minimal impact, cosmetic issues, or planned maintenance

Characteristics:

Cosmetic bugs
Documentation issues
Logging or monitoring gaps
Performance issues with no user impact
Development/test environment issues

Response Requirements:

Response within 1-2 business days
Standard ticket/issue tracking
No special escalation required

Communication Frequency: Standard development cycle updates
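The severity definitions above can be read as a small decision procedure. The following is a minimal sketch under stated assumptions, not the logic of incident_classifier.py: the `Impact` fields and the `classify` function are hypothetical names introduced for illustration, and the handling of exactly 25% is an assumption because the text only defines ">25%" and "<25%".

```python
from dataclasses import dataclass


@dataclass
class Impact:
    """Hypothetical impact summary; field names are illustrative only."""
    affected_pct: float          # share of users affected, 0-100
    full_outage: bool = False    # customer-facing service completely unavailable
    data_loss: bool = False      # data loss or corruption affecting users
    revenue_down: bool = False   # revenue-generating system down
    user_visible: bool = True    # False for cosmetic, docs, or dev/test issues


def classify(impact: Impact) -> str:
    """Map an impact summary to SEV1-SEV4 using the definitions above."""
    if impact.full_outage or impact.data_loss or impact.revenue_down:
        return "SEV1"
    if not impact.user_visible:
        return "SEV4"
    # Assumption: exactly 25% falls into SEV3, since the text only
    # defines ">25%" (SEV2) and "<25%" (SEV3).
    if impact.affected_pct > 25:
        return "SEV2"
    return "SEV3"


# Update interval in minutes; None means no fixed interval
# ("at key milestones" for SEV3, "standard development cycle" for SEV4).
UPDATE_INTERVAL_MIN = {"SEV1": 15, "SEV2": 30, "SEV3": None, "SEV4": None}

print(classify(Impact(affected_pct=80, revenue_down=True)))
print(classify(Impact(affected_pct=40)))
```

Checking the SEV1 conditions first means a partial outage that takes down a revenue-generating system is still treated as critical, whatever the user percentage.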

Incident Commander Role

Primary Responsibilities

1. Command and Control

Own the incident response process
Make critical decisions about resource allocation
Coordinate between technical teams and stakeholders
Maintain situational awareness across all response streams

2. Communication Hub

Provide regular updates to stakeholders
Manage external communications (status pages, customer notifications)
Facilitate effective communication between response teams
Shield responders from external distractions

3. Process Management

Ensure proper incident tracking and documentation
Drive toward resolution while maintaining quality
Coordinate handoffs between team members
Plan and execute rollback strategies if needed

4. Post-Incident Leadership

Ensure thorough post-incident reviews are conducted
Drive implementation of preventive measures
Share learnings with broader organization

Decision-Making Framework

Emergency Decisions (SEV1/2):

Incident Commander has full authority
Bias toward action over analysis
Document decisions for later review
Consult subject matter experts but don't get blocked

Resource Allocation:

Can pull in any necessary team members
Authority to escalate to senior leadership
Can approve emergency spend for external resources
Make the call on communication channels and timing

Technical Decisions:

Lean on technical leads for implementation details
Make final calls on trade-offs between speed and risk
Approve rollback vs. fix-forward strategies
Coordinate testing and validation approaches

Communication Templates

Initial Incident Notification (SEV1/2)

Subject: [SEV{severity}] {Service Name} - {Brief Description}

Incident Details:
- Start Time: {timestamp}
- Severity: SEV{level}
- Impact: {user impact description}
- Current Status: {investigating/mitigating/resolved}

Technical Details:
- Affected Services: {service list}
- Symptoms: {what users are experiencing}
- Initial Assessment: {suspected root cause if known}

Response Team:
- Incident Commander: {name}
- Technical Lead: {name}
- SMEs Engaged: {list}

Next Update: {timestamp}
Status Page: {link}
War Room: {bridge/chat link}

---
{Incident Commander Name}
{Contact Information}
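One way to fill a template like this from structured incident data is sketched below. It is only an illustration: `InitialNotification` and `render` are hypothetical names, the sample values are invented, and only a subset of the template fields is rendered.

```python
from dataclasses import dataclass


@dataclass
class InitialNotification:
    """Hypothetical container for a subset of the template fields."""
    severity: int
    service: str
    summary: str
    start_time: str
    impact: str
    status: str
    commander: str
    next_update: str


def render(n: InitialNotification) -> str:
    """Fill in the subject line, incident details, and commander fields."""
    return "\n".join([
        f"Subject: [SEV{n.severity}] {n.service} - {n.summary}",
        "",
        "Incident Details:",
        f"- Start Time: {n.start_time}",
        f"- Severity: SEV{n.severity}",
        f"- Impact: {n.impact}",
        f"- Current Status: {n.status}",
        "",
        f"Incident Commander: {n.commander}",
        f"Next Update: {n.next_update}",
    ])


# Illustrative values only.
note = InitialNotification(
    severity=1,
    service="payment-gateway",
    summary="Elevated error rate on payments",
    start_time="2024-01-15T14:15:00Z",
    impact="Checkout failing for a large share of users",
    status="investigating",
    commander="Jane Smith",
    next_update="2024-01-15T14:45:00Z",
)
print(render(note))
```

Using explicit fields instead of free-form string substitution makes a missing value fail at construction time, before a half-filled notification goes out.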

Executive Summary (SEV1)

Subject: URGENT - Customer-Impacting Outage - {Service Name}

Executive Summary:
{2-3 sentence description of customer impact and business implications}

Key Metrics:
- Time to Detection: {X minutes}
- Time to Engagement: {X minutes} 
- Estimated Customer Impact: {number/percentage}
- Current Status: {status}
- ETA to Resolution: {time or "investigating"}

Leadership Actions Required:
- [ ] Customer communication approval
- [ ] PR/Communications coordination  
- [ ] Resource allocation decisions
- [ ] External vendor engagement

Incident Commander: {name} ({contact})
Next Update: {time}

---
This is an automated alert from our incident response system.

Customer Communication Template

We are currently experiencing {brief description of issue} affecting {scope of impact}. 

Our engineering team was alerted at {time} and is actively working to resolve the issue. We will provide updates every {frequency} until resolved.

What we know:
- {factual statement of impact}
- {factual statement of scope}
- {brief status of response}

What we're doing:
- {primary response action}
- {secondary response action}

Workaround (if available):
{workaround steps or "No workaround currently available"}

We apologize for the inconvenience and will share more information as it becomes available.

Next update: {time}
Status page: {link}

Stakeholder Management

Stakeholder Classification

Internal Stakeholders:

Engineering Leadership - Technical decisions and resource allocation
Product Management - Customer impact assessment and feature implications
Customer Support - User communication and support ticket management
Sales/Account Management - Customer relationship management for enterprise clients
Executive Team - Business impact decisions and external communication approval
Legal/Compliance - Regulatory reporting and liability assessment

External Stakeholders:

Customers - Service availability and impact communication
Partners - API availability and integration impacts
Vendors - Third-party service dependencies and support escalation
Regulators - Compliance reporting for regulated industries
Public/Media - Transparency for public-facing outages

Communication Cadence by Stakeholder

Stakeholder	SEV1	SEV2	SEV3	SEV4
Engineering Leadership	Real-time	30min	4hrs	Daily
Executive Team	15min	1hr	EOD	Weekly
Customer Support	Real-time	30min	2hrs	As needed
Customers	15min	1hr	Optional	None
Partners	30min	2hrs	Optional	None
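The interval-based cells of this table can be turned into a lookup that answers "when is the next update due?". This is a sketch under stated assumptions: `CADENCE_MIN` and `next_update_due` are hypothetical names, and cells that are not fixed intervals ("Real-time", "EOD", "Daily", "Weekly", "As needed", "Optional", "None") are deliberately left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Interval-based cells from the cadence table, in minutes.
CADENCE_MIN = {
    ("Engineering Leadership", "SEV2"): 30,
    ("Engineering Leadership", "SEV3"): 240,
    ("Executive Team", "SEV1"): 15,
    ("Executive Team", "SEV2"): 60,
    ("Customer Support", "SEV2"): 30,
    ("Customer Support", "SEV3"): 120,
    ("Customers", "SEV1"): 15,
    ("Customers", "SEV2"): 60,
    ("Partners", "SEV1"): 30,
    ("Partners", "SEV2"): 120,
}


def next_update_due(stakeholder: str, severity: str,
                    last_update: datetime) -> Optional[datetime]:
    """Return when the next update is due, or None if no fixed interval applies."""
    minutes = CADENCE_MIN.get((stakeholder, severity))
    if minutes is None:
        return None
    return last_update + timedelta(minutes=minutes)


last = datetime(2024, 1, 15, 14, 30, tzinfo=timezone.utc)
print(next_update_due("Customers", "SEV1", last))
print(next_update_due("Customers", "SEV4", last))
```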

Runbook Generation Framework

Dynamic Runbook Components

1. Detection Playbooks

Monitoring alert definitions
Triage decision trees
Escalation trigger points
Initial response actions

2. Response Playbooks

Step-by-step mitigation procedures
Rollback instructions
Validation checkpoints
Communication checkpoints

3. Recovery Playbooks

Service restoration procedures
Data consistency checks
Performance validation
User notification processes

Runbook Template Structure

# {Service/Component} Incident Response Runbook

## Quick Reference
- **Severity Indicators:** {list of conditions for each severity level}
- **Key Contacts:** {on-call rotations and escalation paths}
- **Critical Commands:** {list of emergency commands with descriptions}

## Detection
### Monitoring Alerts
- {Alert name}: {description and thresholds}
- {Alert name}: {description and thresholds}

### Manual Detection Signs
- {Symptom}: {what to look for and where}
- {Symptom}: {what to look for and where}

## Initial Response (0-15 minutes)
1. **Assess Severity**
   - [ ] Check {primary metric}
   - [ ] Verify {secondary indicator}
   - [ ] Classify as SEV{level} based on {criteria}

2. **Establish Command**
   - [ ] Page Incident Commander if SEV1/2
   - [ ] Create incident tracking ticket
   - [ ] Join war room: {link/bridge info}

3. **Initial Investigation**
   - [ ] Check recent deployments: {deployment log location}
   - [ ] Review error logs: {log location and queries}
   - [ ] Verify dependencies: {dependency check commands}

## Mitigation Strategies
### Strategy 1: {Name}
**Use when:** {conditions}
**Steps:**
1. {detailed step with commands}
2. {detailed step with expected outcomes}
3. {validation step}

**Rollback Plan:**
1. {rollback step}
2. {verification step}

### Strategy 2: {Name}
{similar structure}

## Recovery and Validation
1. **Service Restoration**
   - [ ] {restoration step}
   - [ ] Wait for {metric} to return to normal
   - [ ] Validate end-to-end functionality

2. **Communication**
   - [ ] Update status page
   - [ ] Notify stakeholders
   - [ ] Schedule PIR

## Common Pitfalls
- **{Pitfall}:** {description and how to avoid}
- **{Pitfall}:** {description and how to avoid}

## Reference Information
→ See references/reference-information.md for details

## Usage Examples

### Example 1: Database Connection Pool Exhaustion

Classify the incident

echo '{"description": "Users reporting 500 errors, database connections timing out", "affected_users": "80%", "business_impact": "high"}' | python scripts/incident_classifier.py

Reconstruct timeline from logs

python scripts/timeline_reconstructor.py --input assets/sample_timeline_events.json --output timeline.md

Generate PIR after resolution

python scripts/pir_generator.py --incident assets/sample_incident_data.json --timeline timeline.md --output pir.md


### Example 2: API Rate Limiting Incident

Quick classification from stdin

echo "API rate limits causing customer API calls to fail" | python scripts/incident_classifier.py --format text

Build timeline from multiple sources

python scripts/timeline_reconstructor.py --input assets/simple_timeline_events.json --detect-phases --gap-analysis

Generate comprehensive PIR

python scripts/pir_generator.py --incident assets/sample_incident_pir_data.json --rca-method fishbone --action-items


## Best Practices

### During Incident Response

1. **Maintain Calm Leadership**
   - Stay composed under pressure
   - Make decisive calls with incomplete information
   - Communicate confidence while acknowledging uncertainty

2. **Document Everything**
   - All actions taken and their outcomes
   - Decision rationale, especially for controversial calls
   - Timeline of events as they happen

3. **Effective Communication**
   - Use clear, jargon-free language
   - Provide regular updates even when there's no new information
   - Manage stakeholder expectations proactively

4. **Technical Excellence**
   - Prefer rollbacks to risky fixes under pressure
   - Validate fixes before declaring resolution
   - Plan for secondary failures and cascading effects

### Post-Incident

1. **Blameless Culture**
   - Focus on system failures, not individual mistakes
   - Encourage honest reporting of what went wrong
   - Celebrate learning and improvement opportunities

2. **Action Item Discipline**
   - Assign specific owners and due dates
   - Track progress publicly
   - Prioritize based on risk and effort

3. **Knowledge Sharing**
   - Share PIRs broadly within the organization
   - Update runbooks based on lessons learned
   - Conduct training sessions for common failure modes

4. **Continuous Improvement**
   - Look for patterns across multiple incidents
   - Invest in tooling and automation
   - Regularly review and update processes

## Integration with Existing Tools

### Monitoring and Alerting
- PagerDuty/Opsgenie integration for escalation
- Datadog/Grafana for metrics and dashboards
- ELK/Splunk for log analysis and correlation

### Communication Platforms
- Slack/Teams for war room coordination
- Zoom/Meet for video bridges
- Status page providers (Statuspage.io, etc.)

### Documentation Systems
- Confluence/Notion for PIR storage
- GitHub/GitLab for runbook version control
- JIRA/Linear for action item tracking

### Change Management
- CI/CD pipeline integration
- Deployment tracking systems
- Feature flag platforms for quick rollbacks

Incident Report: [INC-YYYY-NNNN] [Title]

Severity: SEV[1-4]
Status: [Active | Mitigated | Resolved]
Incident Commander: [Name]
Date: [YYYY-MM-DD]

---

Executive Summary

[2-3 sentence summary of the incident: what happened, impact scope, resolution status. Written for executive audience — no jargon, focus on business impact.]

---

Impact Statement

Metric	Value
Duration	[X hours Y minutes]
Affected Users	[number or percentage]
Failed Transactions	[number]
Revenue Impact	$[amount]
Data Loss	[Yes/No — if yes, detail below]
SLA Impact	[X.XX% availability for period]
Affected Regions	[list regions]
Affected Services	[list services]

Customer-Facing Impact

[Describe what customers experienced: error messages, degraded functionality, complete outage. Be specific about which user journeys were affected.]

---

Timeline

Time (UTC)	Phase	Event
HH:MM	Detection	[First alert or report]
HH:MM	Declaration	[Incident declared, channel created]
HH:MM	Investigation	[Key investigation findings]
HH:MM	Mitigation	[Mitigation action taken]
HH:MM	Resolution	[Permanent fix applied]
HH:MM	Closure	[Incident closed, monitoring confirmed stable]

Key Decision Points

1. [HH:MM] [Decision] - [Rationale and outcome]
2. [HH:MM] [Decision] - [Rationale and outcome]

Timeline Gaps

[Note any periods >15 minutes without logged events. These represent potential blind spots in the response.]
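The 15-minute rule above is easy to check mechanically. The sketch below is not the gap analysis in timeline_reconstructor.py; `parse_utc` and `find_gaps` are hypothetical helpers, and the timestamps are illustrative.

```python
from datetime import datetime, timedelta


def parse_utc(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp; 'Z' is rewritten for older Python versions."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def find_gaps(timestamps, threshold=timedelta(minutes=15)):
    """Return (start, end, duration) for each silence longer than threshold."""
    ordered = sorted(parse_utc(ts) for ts in timestamps)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > threshold:
            gaps.append((earlier, later, later - earlier))
    return gaps


# Illustrative event times.
events = [
    "2024-01-15T14:15:00Z",
    "2024-01-15T14:20:00Z",
    "2024-01-15T14:40:00Z",
    "2024-01-15T15:10:00Z",
]
for start, end, duration in find_gaps(events):
    print(f"{start:%H:%M} -> {end:%H:%M} ({duration})")
```

Sorting first keeps the check correct when events from several sources arrive out of order.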

---

Root Cause Analysis

Root Cause

[Clear, specific statement of the root cause. Not "human error" — describe the systemic failure.]

Contributing Factors

1. [Factor Category: Process/Tooling/Human/Environment] - [Description]
2. [Factor Category] - [Description]
3. [Factor Category] - [Description]

5-Whys Analysis

Why did the service degrade? → [Answer]

Why did [answer above] happen? → [Answer]

Why did [answer above] happen? → [Answer]

Why did [answer above] happen? → [Answer]

Why did [answer above] happen? → [Root systemic cause]

---

Response Metrics

Metric	Value	Target	Status
MTTD (Mean Time to Detect)	[X min]	<5 min	[Met/Missed]
Time to Declare	[X min]	<10 min	[Met/Missed]
Time to Mitigate	[X min]	<60 min (SEV1)	[Met/Missed]
MTTR (Mean Time to Resolve)	[X min]	<4 hr (SEV1)	[Met/Missed]
Postmortem Timeliness	[X hours]	<72 hr	[Met/Missed]
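The Met/Missed column can be derived from event timestamps. The following sketch makes two assumptions the template does not state: time to detect runs from the triggering event to the first alert, and time to declare runs from the first alert to the declaration. `TARGETS_MIN`, `minutes_between`, and `score` are hypothetical names.

```python
from datetime import datetime

# Targets from the table above, in minutes (SEV1 values where the table says so).
TARGETS_MIN = {"detect": 5, "declare": 10, "mitigate": 60, "resolve": 240}


def minutes_between(start: str, end: str) -> float:
    """Elapsed minutes between two UTC timestamps like 2024-01-15T14:15:00Z."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60


def score(metric: str, start: str, end: str):
    """Return (elapsed minutes, 'Met' or 'Missed'); targets are strict '<'."""
    elapsed = minutes_between(start, end)
    return elapsed, "Met" if elapsed < TARGETS_MIN[metric] else "Missed"


print(score("detect", "2024-01-15T14:15:00Z", "2024-01-15T14:20:00Z"))
print(score("declare", "2024-01-15T14:20:00Z", "2024-01-15T14:23:00Z"))
```

Because the targets are written as "<5 min" and "<10 min", a detection that takes exactly five minutes counts as Missed in this sketch.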

---

Action Items

#	Priority	Action	Owner	Deadline	Type	Status
1	P1	[Action description]	[owner]	[date]	Detection	Open
2	P1	[Action description]	[owner]	[date]	Prevention	Open
3	P2	[Action description]	[owner]	[date]	Prevention	Open
4	P2	[Action description]	[owner]	[date]	Process	Open

Action Item Types

Detection: Improve ability to detect this class of issue faster
Prevention: Prevent this class of issue from occurring
Mitigation: Reduce impact when this class of issue occurs
Process: Improve response process and coordination

---

Lessons Learned

What Went Well

[Specific positive outcome from the response]
[Specific positive outcome]

What Didn't Go Well

[Specific area for improvement]
[Specific area for improvement]

Where We Got Lucky

[Things that could have made this worse but didn't]

---

Communication Log

Time (UTC)	Channel	Audience	Summary
HH:MM	Status Page	External	[Summary of update]
HH:MM	Slack #exec	Internal	[Summary of update]
HH:MM	Email	Customers	[Summary of notification]

---

Participants

Name	Role
[Name]	Incident Commander
[Name]	Operations Lead
[Name]	Communications Lead
[Name]	Subject Matter Expert

---

Appendix

Related Incidents

[INC-YYYY-NNNN] — [Brief description of related incident]

Reference Links

[Link to monitoring dashboard]
[Link to deployment logs]
[Link to incident channel archive]

---

This report follows the blameless postmortem principle. The goal is systemic improvement, not individual accountability. All contributing factors should trace to process, tooling, or environmental gaps that can be addressed with concrete action items.

Runbook: [Service/Component Name]

Owner: [Team Name]
Last Updated: [YYYY-MM-DD]
Reviewed By: [Name]
Review Cadence: Quarterly

---

Service Overview

Property	Value
Service	[service-name]
Repository	[repo URL]
Dashboard	[monitoring dashboard URL]
On-Call Rotation	[PagerDuty/OpsGenie schedule URL]
SLA Tier	[Tier 1/2/3]
Availability Target	[99.9% / 99.95% / 99.99%]
Dependencies	[list upstream/downstream services]
Owner Team	[team name]
Escalation Contact	[name/email]

Architecture Summary

[2-3 sentence description of the service architecture. Include key components, data stores, and external dependencies.]

---

Alert Response Decision Tree

High Error Rate (>5%)

Error Rate Alert Fired
├── Check: Is this a deployment-related issue?
│   ├── YES → Go to "Recent Deployment Rollback" section
│   └── NO → Continue
├── Check: Is a downstream dependency failing?
│   ├── YES → Go to "Dependency Failure" section
│   └── NO → Continue
├── Check: Is there unusual traffic volume?
│   ├── YES → Go to "Traffic Spike" section
│   └── NO → Continue
└── Escalate: Engage on-call secondary + service owner

High Latency (p99 > [threshold]ms)

Latency Alert Fired
├── Check: Database query latency elevated?
│   ├── YES → Go to "Database Performance" section
│   └── NO → Continue
├── Check: Connection pool utilization >80%?
│   ├── YES → Go to "Connection Pool Exhaustion" section
│   └── NO → Continue
├── Check: Memory/CPU pressure on service instances?
│   ├── YES → Go to "Resource Exhaustion" section
│   └── NO → Continue
└── Escalate: Engage on-call secondary + service owner

Service Unavailable (Health Check Failing)

Health Check Alert Fired
├── Check: Are all instances down?
│   ├── YES → Go to "Complete Outage" section
│   └── NO → Continue
├── Check: Is only one AZ affected?
│   ├── YES → Go to "AZ Failure" section
│   └── NO → Continue
├── Check: Can instances be restarted?
│   ├── YES → Go to "Instance Restart" section
│   └── NO → Continue
└── Escalate: Declare incident, engage IC
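Each tree above is an ordered list of checks with a fallback, which maps naturally onto data plus a small routing function. This is a sketch of the error-rate tree only; the check keys (`deployment_related` and so on) and the `route` function are hypothetical names, and how each check is answered is left to the responder.

```python
# Ordered (check, runbook section) pairs for the "High Error Rate" tree.
ERROR_RATE_TREE = [
    ("deployment_related", "Recent Deployment Rollback"),
    ("dependency_failing", "Dependency Failure"),
    ("unusual_traffic", "Traffic Spike"),
]
ERROR_RATE_FALLBACK = "Escalate: Engage on-call secondary + service owner"


def route(tree, fallback, findings):
    """Return the first section whose check is true, else the fallback."""
    for check, section in tree:
        if findings.get(check, False):
            return section
    return fallback


print(route(ERROR_RATE_TREE, ERROR_RATE_FALLBACK,
            {"deployment_related": False, "dependency_failing": True}))
print(route(ERROR_RATE_TREE, ERROR_RATE_FALLBACK, {}))
```

Keeping the order in the list preserves the tree's priority: a deployment-related cause is considered before a dependency or traffic cause.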

---

Common Scenarios

Recent Deployment Rollback

Symptoms: Error rate spike or latency increase within 60 minutes of a deployment.

Diagnosis:
1. Check deployment history: kubectl rollout history deployment/[service-name]
2. Compare error rate timing with deployment timestamp
3. Review deployment diff for risky changes

Mitigation:
1. Initiate rollback: kubectl rollout undo deployment/[service-name]
2. Verify rollback: kubectl rollout status deployment/[service-name]
3. Confirm error rate returns to baseline (allow 5 minutes)
4. If rollback fails: escalate immediately

Communication: If customer-impacting, update status page within 5 minutes of confirming impact.

---

Database Performance

Symptoms: Elevated query latency, connection pool saturation, timeout errors.

Diagnosis:
1. Check active queries: SELECT * FROM pg_stat_activity WHERE state = 'active';
2. Check for long-running queries: SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC;
3. Check connection count: SELECT count(*) FROM pg_stat_activity;
4. Check table bloat and vacuum status

Mitigation:
1. Kill long-running queries if identified: SELECT pg_terminate_backend([pid]);
2. If connection pool exhausted: increase pool size via config (requires restart)
3. If read replica available: redirect read traffic
4. If write-heavy: identify and defer non-critical writes

Escalation Trigger: If query latency >10s for >5 minutes, escalate to DBA on-call.
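A trigger of the form "above X for longer than Y" needs a sustained-breach check, not a single-sample comparison. The sketch below assumes latency samples arrive in time order as (timestamp, seconds) pairs; `breached_for` is a hypothetical helper, not part of the skill's scripts.

```python
from datetime import datetime, timedelta


def breached_for(samples, limit_s=10.0, window=timedelta(minutes=5)):
    """True if latency stayed above limit_s for longer than window.

    samples: iterable of (datetime, latency_seconds), in time order.
    """
    breach_start = None
    for ts, latency in samples:
        if latency > limit_s:
            if breach_start is None:
                breach_start = ts
            if ts - breach_start > window:
                return True
        else:
            breach_start = None  # any recovery resets the clock
    return False


base = datetime(2024, 1, 15, 14, 0)
# One sample per minute, all at 12s: a sustained breach.
steady = [(base + timedelta(minutes=i), 12.0) for i in range(7)]
# Same series with a brief recovery at minute 3.
dipped = [(ts, 2.0 if i == 3 else lat) for i, (ts, lat) in enumerate(steady)]

print(breached_for(steady))
print(breached_for(dipped))
```

Resetting on any recovery is a conservative choice for paging; a noisier signal might instead tolerate a small number of good samples inside the window.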

---

Connection Pool Exhaustion

Symptoms: Connection timeout errors, pool utilization >90%, requests queuing.

Diagnosis:
1. Check pool metrics: current size, active connections, waiting requests
2. Check for connection leaks: connections held >30s without activity
3. Review recent config changes or deployments

Mitigation:
1. Increase pool size (if infrastructure allows): update config, rolling restart
2. Kill idle connections exceeding timeout
3. If caused by leak: identify and restart affected instances
4. Enable connection pool auto-scaling if available

Prevention: Pool utilization alerting at 70% (warning) and 85% (critical).

---

Dependency Failure

Symptoms: Errors correlated with downstream service failures, circuit breakers tripping.

Diagnosis:
1. Check dependency status dashboards
2. Verify circuit breaker state: open/half-open/closed
3. Check for correlation with dependency deployments or incidents
4. Test dependency health endpoints directly

Mitigation:
1. If circuit breaker not tripping: verify timeout/threshold configuration
2. Enable graceful degradation (serve cached/default responses)
3. If critical path: engage dependency team via incident process
4. If non-critical path: disable feature flag for affected functionality

Communication: Coordinate with dependency team IC if both services have active incidents.

---

Traffic Spike

Symptoms: Sudden traffic increase beyond normal patterns, resource saturation.

Diagnosis:
1. Check traffic source: organic growth vs. bot traffic vs. DDoS
2. Review rate limiting effectiveness
3. Check auto-scaling status and capacity

Mitigation:
1. If bot/DDoS: enable rate limiting, engage security team
2. If organic: trigger manual scale-up, increase auto-scaling limits
3. Enable request queuing or load shedding if at capacity
4. Consider feature flag toggles to reduce per-request cost

---

Complete Outage

Symptoms: All instances unreachable, health checks failing across AZs.

Diagnosis:
1. Check infrastructure status (AWS/GCP status page)
2. Verify network connectivity and DNS resolution
3. Check for infrastructure-level incidents (region outage)
4. Review recent infrastructure changes (Terraform, network config)

Mitigation:
1. If infra provider issue: activate disaster recovery plan
2. If DNS issue: update DNS records, reduce TTL
3. If deployment corruption: redeploy last known good version
4. If data corruption: engage data recovery procedures

Escalation: Immediately declare SEV1 incident. Engage infrastructure team and management.

---

Instance Restart

Symptoms: Individual instances unhealthy, OOM kills, process crashes.

Diagnosis:
1. Check instance logs for crash reason
2. Review memory/CPU usage patterns before crash
3. Check for memory leaks or resource exhaustion
4. Verify configuration consistency across instances

Mitigation:
1. Restart unhealthy instances: kubectl delete pod [pod-name]
2. If recurring: cordon node and migrate workloads
3. If memory leak: schedule immediate patch with increased memory limit
4. Monitor for recurrence after restart

---

AZ Failure

Symptoms: All instances in one availability zone failing, others healthy.

Diagnosis:
1. Confirm AZ-specific failure vs. instance-specific issues
2. Check cloud provider AZ status
3. Verify load balancer is routing around failed AZ

Mitigation:
1. Ensure load balancer marks AZ instances as unhealthy
2. Scale up remaining AZs to handle redirected traffic
3. If auto-scaling: verify it's responding to increased load
4. Monitor remaining AZs for cascade effects

---

Key Metrics & Dashboards

Metric	Normal Range	Warning	Critical	Dashboard
Error Rate	<0.1%	>1%	>5%	[link]
p99 Latency	<200ms	>500ms	>2000ms	[link]
CPU Usage	<60%	>75%	>90%	[link]
Memory Usage	<70%	>80%	>90%	[link]
DB Pool Usage	<50%	>70%	>85%	[link]
Request Rate	[baseline]±20%	±50%	±100%	[link]
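The warning and critical columns can be applied with a simple two-threshold comparison. This sketch uses hypothetical metric keys and a hypothetical `level` function; Request Rate is omitted because its thresholds are deviations from a baseline, not fixed values, and any value at or below the warning threshold is reported as "ok" even if it sits above the normal range.

```python
# (warning, critical) thresholds from the table above; a value must
# exceed a threshold to trigger it.
THRESHOLDS = {
    "error_rate_pct": (1.0, 5.0),
    "p99_latency_ms": (500, 2000),
    "cpu_pct": (75, 90),
    "memory_pct": (80, 90),
    "db_pool_pct": (70, 85),
}


def level(metric: str, value: float) -> str:
    """Classify a metric reading as 'ok', 'warning', or 'critical'."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


print(level("error_rate_pct", 8.2))
print(level("db_pool_pct", 72))
print(level("cpu_pct", 50))
```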

---

Escalation Contacts

Level	Contact	When
L1: On-Call Primary	[name/rotation]	First responder
L2: On-Call Secondary	[name/rotation]	Primary unavailable or needs help
L3: Service Owner	[name]	Complex issues, architectural decisions
L4: Engineering Manager	[name]	SEV1/SEV2, customer impact, resource needs
L5: VP Engineering	[name]	SEV1 >30 min, major customer/revenue impact

---

Maintenance Procedures

Planned Maintenance Checklist

[ ] Maintenance window scheduled and communicated (72 hours advance for Tier 1)
[ ] Status page updated with planned maintenance notice
[ ] Rollback plan documented and tested
[ ] On-call notified of maintenance window
[ ] Customer notification sent (if SLA-impacting)
[ ] Post-maintenance verification plan ready

Health Verification After Changes

1. Check all health endpoints return 200
2. Verify error rate returns to baseline within 5 minutes
3. Confirm latency within normal range
4. Run synthetic transaction test
5. Monitor for 15 minutes before declaring success

---

Revision History

Date	Author	Change
[YYYY-MM-DD]	[Name]	Initial version
[YYYY-MM-DD]	[Name]	[Description of update]

---

This runbook should be reviewed quarterly and updated after every incident that reveals missing procedures. The on-call engineer should be able to follow this document without prior context about the service. If any section requires tribal knowledge to execute, it needs to be expanded.

{
  "description": "Database connection timeouts causing 500 errors for payment processing API. Users unable to complete checkout. Error rate spiked from 0.1% to 45% starting at 14:30 UTC. Database monitoring shows connection pool exhaustion with 200/200 connections active.",
  "service": "payment-api",
  "affected_users": "80%",
  "business_impact": "high",
  "duration_minutes": 95,
  "metadata": {
    "error_rate": "45%",
    "connection_pool_utilization": "100%",
    "affected_regions": ["us-west", "us-east", "eu-west"],
    "detection_method": "monitoring_alert",
    "customer_escalations": 12
  }
}

{
  "incident": {
    "id": "INC-2024-0142",
    "title": "Payment Service Degradation",
    "severity": "SEV1",
    "status": "resolved",
    "declared_at": "2024-01-15T14:23:00Z",
    "resolved_at": "2024-01-15T16:45:00Z",
    "commander": "Jane Smith",
    "service": "payment-gateway",
    "affected_services": ["checkout", "subscription-billing"]
  },
  "events": [
    {
      "timestamp": "2024-01-15T14:15:00Z",
      "type": "trigger",
      "actor": "system",
      "description": "Database connection pool utilization reaches 95% on payment-gateway primary",
      "metadata": {"metric": "db_pool_utilization", "value": 95, "threshold": 90}
    },
    {
      "timestamp": "2024-01-15T14:20:00Z",
      "type": "detection",
      "actor": "monitoring",
      "description": "PagerDuty alert fired: payment-gateway error rate >5% (current: 8.2%)",
      "metadata": {"alert_id": "PD-98765", "source": "datadog", "error_rate": 8.2}
    },
    {
      "timestamp": "2024-01-15T14:21:00Z",
      "type": "detection",
      "actor": "monitoring",
      "description": "Datadog alert: p99 latency on /api/payments exceeds 5000ms (current: 8500ms)",
      "metadata": {"alert_id": "DD-54321", "source": "datadog", "latency_p99_ms": 8500}
    },
    {
      "timestamp": "2024-01-15T14:23:00Z",
      "type": "declaration",
      "actor": "Jane Smith",
      "description": "SEV1 declared. Incident channel #inc-20240115-payment-degradation created. Bridge call started.",
      "metadata": {"channel": "#inc-20240115-payment-degradation", "severity": "SEV1"}
    },
    {
      "timestamp": "2024-01-15T14:25:00Z",
      "type": "investigation",
      "actor": "Alice Chen",
      "description": "Confirmed: database connection pool at 100% utilization. All new connections being rejected.",
      "metadata": {"pool_size": 20, "active_connections": 20, "waiting_requests": 147}
    },
    {
      "timestamp": "2024-01-15T14:28:00Z",
      "type": "investigation",
      "actor": "Carol Davis",
      "description": "Identified recent deployment of user-api v2.4.1 at 13:45 UTC. New ORM version (3.2.0) changed connection handling behavior.",
      "metadata": {"deployment": "user-api-v2.4.1", "deployed_at": "2024-01-15T13:45:00Z"}
    },
    {
      "timestamp": "2024-01-15T14:30:00Z",
      "type": "communication",
      "actor": "Bob Kim",
      "description": "Status page updated: Investigating - We are investigating increased error rates affecting payment processing.",
      "metadata": {"channel": "status_page", "status": "investigating"}
    },
    {
      "timestamp": "2024-01-15T14:35:00Z",
      "type": "escalation",
      "actor": "Jane Smith",
      "description": "Escalated to VP Engineering. Customer impact confirmed: 12,500+ users affected, failed transactions accumulating.",
      "metadata": {"escalated_to": "VP Engineering", "reason": "revenue_impact"}
    },
    {
      "timestamp": "2024-01-15T14:40:00Z",
      "type": "mitigation",
      "actor": "Alice Chen",
      "description": "Attempting mitigation: increasing connection pool size from 20 to 50 via config override.",
      "metadata": {"action": "pool_resize", "old_value": 20, "new_value": 50}
    },
    {
      "timestamp": "2024-01-15T14:45:00Z",
      "type": "communication",
      "actor": "Bob Kim",
      "description": "Status page updated: Identified - The issue has been identified as a database configuration problem. We are implementing a fix.",
      "metadata": {"channel": "status_page", "status": "identified"}
    },
    {
      "timestamp": "2024-01-15T14:50:00Z",
      "type": "investigation",
      "actor": "Carol Davis",
      "description": "Pool resize partially effective. Error rate dropped from 23% to 12%. ORM 3.2.0 opens 3x more connections per request than 3.1.2.",
      "metadata": {"error_rate_before": 23.5, "error_rate_after": 12.1}
    },
    {
      "timestamp": "2024-01-15T15:00:00Z",
      "type": "mitigation",
      "actor": "Alice Chen",
      "description": "Decision: roll back ORM version to 3.1.2. Initiating rollback deployment of user-api v2.3.9.",
      "metadata": {"action": "rollback", "target_version": "2.3.9", "rollback_reason": "orm_connection_leak"}
    },
    {
      "timestamp": "2024-01-15T15:15:00Z",
      "type": "mitigation",
      "actor": "Alice Chen",
      "description": "Rollback deployment complete. user-api v2.3.9 running in production. Connection pool utilization dropping.",
      "metadata": {"deployment_duration_minutes": 15, "pool_utilization": 45}
    },
    {
      "timestamp": "2024-01-15T15:20:00Z",
      "type": "communication",
      "actor": "Bob Kim",
      "description": "Status page updated: Monitoring - A fix has been implemented and we are monitoring the results.",
      "metadata": {"channel": "status_page", "status": "monitoring"}
    },
    {
      "timestamp": "2024-01-15T15:30:00Z",
      "type": "mitigation",
      "actor": "Jane Smith",
      "description": "Error rate back to baseline (<0.1%). Payment processing fully restored. Entering monitoring phase.",
      "metadata": {"error_rate": 0.08, "pool_utilization": 32}
    },
    {
      "timestamp": "2024-01-15T16:30:00Z",
      "type": "investigation",
      "actor": "Carol Davis",
      "description": "Confirmed stable for 60 minutes. No degradation detected. Root cause documented: ORM 3.2.0 connection pooling incompatibility.",
      "metadata": {"monitoring_duration_minutes": 60, "stable": true}
    },
    {
      "timestamp": "2024-01-15T16:45:00Z",
      "type": "resolution",
      "actor": "Jane Smith",
      "description": "Incident resolved. All services nominal. Postmortem scheduled for 2024-01-17 10:00 UTC.",
      "metadata": {"postmortem_scheduled": "2024-01-17T10:00:00Z"}
    },
    {
      "timestamp": "2024-01-15T16:50:00Z",
      "type": "communication",
      "actor": "Bob Kim",
      "description": "Status page updated: Resolved - The issue has been resolved. Payment processing is operating normally.",
      "metadata": {"channel": "status_page", "status": "resolved"}
    }
  ],
  "communications": [
    {
      "timestamp": "2024-01-15T14:30:00Z",
      "channel": "status_page",
      "audience": "external",
      "message": "Investigating - We are investigating increased error rates affecting payment processing. Some transactions may fail. We will provide an update within 15 minutes."
    },
    {
      "timestamp": "2024-01-15T14:35:00Z",
      "channel": "slack_exec",
      "audience": "internal",
      "message": "SEV1 ACTIVE: Payment service degradation. ~12,500 users affected. Failed transactions accumulating. IC: Jane Smith. Bridge: [link]. ETA for mitigation: investigating."
    },
    {
      "timestamp": "2024-01-15T14:45:00Z",
      "channel": "status_page",
      "audience": "external",
      "message": "Identified - The issue has been identified as a database configuration problem following a recent deployment. We are implementing a fix. Next update in 15 minutes."
    },
    {
      "timestamp": "2024-01-15T15:20:00Z",
      "channel": "status_page",
      "audience": "external",
      "message": "Monitoring - A fix has been implemented and we are monitoring the results. Payment processing is recovering. We will provide a final update once we confirm stability."
    },
    {
      "timestamp": "2024-01-15T16:50:00Z",
      "channel": "status_page",
      "audience": "external",
      "message": "Resolved - The issue affecting payment processing has been resolved. All systems are operating normally. We will publish a full incident report within 48 hours."
    }
  ],
  "impact": {
    "revenue_impact": "high",
    "affected_users_percentage": 45,
    "affected_regions": ["us-east-1", "eu-west-1"],
    "data_integrity_risk": false,
    "security_breach": false,
    "customer_facing": true,
    "degradation_type": "partial",
    "workaround_available": false
  },
  "signals": {
    "error_rate_percentage": 23.5,
    "latency_p99_ms": 8500,
    "affected_endpoints": ["/api/payments", "/api/checkout", "/api/subscriptions"],
    "dependent_services": ["checkout", "subscription-billing", "order-service"],
    "alert_count": 12,
    "customer_reports": 8
  },
  "context": {
    "recent_deployments": [
      {
        "service": "user-api",
        "deployed_at": "2024-01-15T13:45:00Z",
        "version": "2.4.1",
        "changes": "Upgraded ORM from 3.1.2 to 3.2.0"
      }
    ],
    "ongoing_incidents": [],
    "maintenance_windows": [],
    "on_call": {
      "primary": "alice@company.com",
      "secondary": "bob@company.com",
      "escalation_manager": "director-eng@company.com"
    }
  },
  "resolution": {
    "root_cause": "Database connection pool exhaustion caused by ORM 3.2.0 opening 3x more connections per request than previous version 3.1.2, exceeding the pool size of 20",
    "contributing_factors": [
      "Insufficient load testing of new ORM version under production-scale connection patterns",
      "Connection pool monitoring alert threshold set too high (90%) with no warning at 70%",
      "No canary deployment process for database configuration or ORM changes",
      "Missing connection pool sizing documentation for service dependencies"
    ],
    "mitigation_steps": [
      "Increased connection pool size from 20 to 50 as temporary relief",
      "Rolled back user-api from v2.4.1 (ORM 3.2.0) to v2.3.9 (ORM 3.1.2)"
    ],
    "permanent_fix": "Load test ORM 3.2.0 with production connection patterns, update pool sizing, implement canary deployment for ORM changes",
    "customer_impact": {
      "affected_users": 12500,
      "failed_transactions": 342,
      "revenue_impact_usd": 28500,
      "data_loss": false
    }
  },
  "action_items": [
    {
      "title": "Add connection pool utilization alerting at 70% warning and 85% critical thresholds",
      "owner": "alice@company.com",
      "priority": "P1",
      "deadline": "2024-01-22",
      "type": "detection",
      "status": "open"
    },
    {
      "title": "Implement canary deployment pipeline for database configuration and ORM changes",
      "owner": "bob@company.com",
      "priority": "P1",
      "deadline": "2024-02-01",
      "type": "prevention",
      "status": "open"
    },
    {
      "title": "Load test ORM v3.2.0 with production-scale connection patterns before re-deployment",
      "owner": "carol@company.com",
      "priority": "P2",
      "deadline": "2024-01-29",
      "type": "prevention",
      "status": "open"
    },
    {
      "title": "Document connection pool sizing requirements for all services in runbook",
      "owner": "alice@company.com",
      "priority": "P2",
      "deadline": "2024-02-05",
      "type": "process",
      "status": "open"
    },
    {
      "title": "Add ORM connection behavior to integration test suite",
      "owner": "carol@company.com",
      "priority": "P3",
      "deadline": "2024-02-15",
      "type": "prevention",
      "status": "open"
    }
  ],
  "participants": [
    {"name": "Jane Smith", "role": "Incident Commander"},
    {"name": "Alice Chen", "role": "Operations Lead"},
    {"name": "Bob Kim", "role": "Communications Lead"},
    {"name": "Carol Davis", "role": "Database SME"}
  ]
}
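The typed events in the record above carry enough information to derive basic response intervals. The following is a minimal sketch, not part of the skill's scripts: the helper names (`parse_ts`, `first_timestamp`, `response_milestones`) are hypothetical, and the event list is a trimmed copy of the sample keeping only the `timestamp` and `type` fields.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def first_timestamp(events, event_type):
    """Return the earliest timestamp among events of the given type, or None."""
    stamps = [parse_ts(e["timestamp"]) for e in events if e["type"] == event_type]
    return min(stamps) if stamps else None


def response_milestones(events):
    """Minutes between trigger, detection, declaration, and resolution."""
    order = ["trigger", "detection", "declaration", "resolution"]
    stamps = {t: first_timestamp(events, t) for t in order}
    result = {}
    for earlier, later in zip(order, order[1:]):
        # Skip a pair when either milestone is missing from the event list.
        if stamps[earlier] is None or stamps[later] is None:
            continue
        delta = stamps[later] - stamps[earlier]
        result[f"{earlier}_to_{later}_minutes"] = delta.total_seconds() / 60
    return result


# Trimmed copy of the sample events (timestamp and type only).
events = [
    {"timestamp": "2024-01-15T14:15:00Z", "type": "trigger"},
    {"timestamp": "2024-01-15T14:20:00Z", "type": "detection"},
    {"timestamp": "2024-01-15T14:21:00Z", "type": "detection"},
    {"timestamp": "2024-01-15T14:23:00Z", "type": "declaration"},
    {"timestamp": "2024-01-15T16:45:00Z", "type": "resolution"},
]

milestones = response_milestones(events)
print(milestones)
```

For the sample data this gives 5 minutes from trigger to first detection, 3 minutes from detection to declaration, and 142 minutes from declaration to resolution.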

{
  "incident_id": "INC-2024-0315-001",
  "title": "Payment API Database Connection Pool Exhaustion",
  "description": "Database connection pool exhaustion caused widespread 500 errors in payment processing API, preventing users from completing purchases. Root cause was an inefficient database query introduced in deployment v2.3.1.",
  "severity": "sev2",
  "start_time": "2024-03-15T14:30:00Z",
  "end_time": "2024-03-15T15:35:00Z",
  "duration": "1h 5m",
  "affected_services": ["payment-api", "checkout-service", "subscription-billing"],
  "customer_impact": "80% of users unable to complete payments or checkout. Approximately 2,400 failed payment attempts during the incident. Users experienced immediate 500 errors when attempting to pay.",
  "business_impact": "Estimated revenue loss of $45,000 during outage period. No SLA breaches as resolution was within 2-hour window. 12 customer escalations through support channels.",
  "incident_commander": "Mike Rodriguez",
  "responders": [
    "Sarah Chen - On-call Engineer, Primary Responder",
    "Tom Wilson - Database Team Lead",
    "Lisa Park - Database Engineer",
    "Mike Rodriguez - Incident Commander",
    "David Kumar - DevOps Engineer"
  ],
  "status": "resolved",
  "detection_details": {
    "detection_method": "automated_monitoring",
    "detection_time": "2024-03-15T14:30:00Z",
    "alert_source": "Datadog error rate threshold",
    "time_to_detection": "immediate"
  },
  "response_details": {
    "time_to_response": "5 minutes",
    "time_to_escalation": "10 minutes",
    "time_to_resolution": "65 minutes",
    "war_room_established": "2024-03-15T14:45:00Z",
    "executives_notified": false,
    "status_page_updated": true
  },
  "technical_details": {
    "root_cause": "Inefficient database query introduced in deployment v2.3.1 caused each payment validation to take 15 seconds instead of normal 0.1 seconds, exhausting the 200-connection database pool",
    "affected_regions": ["us-west", "us-east", "eu-west"],
    "error_metrics": {
      "peak_error_rate": "45%",
      "normal_error_rate": "0.1%",
      "connection_pool_max": 200,
      "connections_exhausted_at": "100%"
    },
    "resolution_method": "rollback",
    "rollback_target": "v2.2.9",
    "rollback_duration": "7 minutes"
  },
  "communication_log": [
    {
      "timestamp": "2024-03-15T14:50:00Z",
      "type": "status_page",
      "message": "Investigating payment processing issues",
      "audience": "customers"
    },
    {
      "timestamp": "2024-03-15T15:35:00Z", 
      "type": "status_page",
      "message": "Payment processing issues resolved",
      "audience": "customers"
    }
  ],
  "lessons_learned_preview": [
    "Deployment v2.3.1 code review missed performance implications of query change",
    "Load testing didn't include realistic database query patterns",
    "Connection pool monitoring could have provided earlier warning",
    "Rollback procedure worked effectively - 7 minute rollback time"
  ],
  "preliminary_action_items": [
    "Fix inefficient query for v2.3.2 deployment",
    "Add database query performance checks to CI pipeline", 
    "Improve load testing to include database performance scenarios",
    "Add connection pool utilization alerts"
  ]
}
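The `duration` string in the record above can be cross-checked against `start_time` and `end_time`. A minimal sketch, assuming the "1h 5m" style shown in the sample; `format_duration` is a hypothetical helper, not a function from the skill's scripts.

```python
from datetime import datetime


def format_duration(start_time: str, end_time: str) -> str:
    """Format the span between two ISO 8601 timestamps as 'Xh Ym'."""
    start = datetime.fromisoformat(start_time.replace("Z", "+00:00"))
    end = datetime.fromisoformat(end_time.replace("Z", "+00:00"))
    total_minutes = int((end - start).total_seconds() // 60)
    if total_minutes < 0:
        raise ValueError("end_time is earlier than start_time")
    hours, minutes = divmod(total_minutes, 60)
    # Omit the hour component for incidents shorter than one hour.
    return f"{hours}h {minutes}m" if hours else f"{minutes}m"


duration = format_duration("2024-03-15T14:30:00Z", "2024-03-15T15:35:00Z")
print(duration)  # 1h 5m
```

The result agrees with both the `duration` field and the 65-minute `time_to_resolution` in the sample.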

[
  {
    "timestamp": "2024-03-15T14:30:00Z",
    "source": "datadog",
    "type": "alert",
    "message": "High error rate detected on payment-api: 45% error rate (threshold: 5%)",
    "severity": "critical",
    "actor": "monitoring-system",
    "metadata": {
      "alert_id": "ALT-001",
      "metric_value": "45%",
      "threshold": "5%"
    }
  },
  {
    "timestamp": "2024-03-15T14:32:00Z",
    "source": "pagerduty",
    "type": "escalation",
    "message": "Paged on-call engineer Sarah Chen for payment-api alerts",
    "severity": "high",
    "actor": "pagerduty-system",
    "metadata": {
      "incident_id": "PD-12345",
      "responder": "sarah.chen@company.com"
    }
  },
  {
    "timestamp": "2024-03-15T14:35:00Z",
    "source": "slack",
    "type": "communication",
    "message": "Sarah Chen acknowledged the alert and is investigating payment-api issues",
    "severity": "medium",
    "actor": "sarah.chen",
    "metadata": {
      "channel": "#incidents",
      "message_id": "1234567890.123456"
    }
  },
  {
    "timestamp": "2024-03-15T14:38:00Z",
    "source": "application_logs",
    "type": "log",
    "message": "Database connection pool exhausted: 200/200 connections active, unable to acquire new connections",
    "severity": "critical",
    "actor": "payment-api",
    "metadata": {
      "log_level": "ERROR",
      "component": "database_pool",
      "connection_count": 200,
      "max_connections": 200
    }
  },
  {
    "timestamp": "2024-03-15T14:40:00Z",
    "source": "slack",
    "type": "escalation",
    "message": "Sarah Chen: Escalating to incident commander - database connection pool exhausted, need database team",
    "severity": "high",
    "actor": "sarah.chen",
    "metadata": {
      "channel": "#incidents",
      "escalation_reason": "database_expertise_needed"
    }
  },
  {
    "timestamp": "2024-03-15T14:42:00Z",
    "source": "pagerduty",
    "type": "escalation",
    "message": "Incident commander Mike Rodriguez assigned to incident PD-12345",
    "severity": "high",
    "actor": "pagerduty-system",
    "metadata": {
      "incident_commander": "mike.rodriguez@company.com",
      "role": "incident_commander"
    }
  },
  {
    "timestamp": "2024-03-15T14:45:00Z",
    "source": "slack",
    "type": "communication",
    "message": "Mike Rodriguez: War room established in #war-room-payment-api. Engaging database team.",
    "severity": "high",
    "actor": "mike.rodriguez",
    "metadata": {
      "channel": "#incidents",
      "war_room": "#war-room-payment-api"
    }
  },
  {
    "timestamp": "2024-03-15T14:47:00Z",
    "source": "pagerduty",
    "type": "escalation",
    "message": "Database team engineers paged: Tom Wilson, Lisa Park",
    "severity": "medium",
    "actor": "pagerduty-system",
    "metadata": {
      "team": "database-team",
      "responders": ["tom.wilson@company.com", "lisa.park@company.com"]
    }
  },
  {
    "timestamp": "2024-03-15T14:50:00Z",
    "source": "statuspage",
    "type": "communication",
    "message": "Status page updated: Investigating payment processing issues",
    "severity": "medium",
    "actor": "mike.rodriguez",
    "metadata": {
      "status": "investigating",
      "affected_systems": ["payment-api"]
    }
  },
  {
    "timestamp": "2024-03-15T14:52:00Z",
    "source": "slack",
    "type": "communication",
    "message": "Tom Wilson: Joining war room. Looking at database metrics now. Seeing unusual query patterns from recent deployment.",
    "severity": "medium",
    "actor": "tom.wilson",
    "metadata": {
      "channel": "#war-room-payment-api",
      "investigation_focus": "database_metrics"
    }
  },
  {
    "timestamp": "2024-03-15T14:55:00Z",
    "source": "database_monitoring",
    "type": "log",
    "message": "Identified slow query introduced in deployment v2.3.1: payment validation taking 15s per request",
    "severity": "critical",
    "actor": "database-monitor",
    "metadata": {
      "deployment_version": "v2.3.1",
      "query_time": "15s",
      "normal_query_time": "0.1s"
    }
  },
  {
    "timestamp": "2024-03-15T15:00:00Z",
    "source": "slack",
    "type": "communication",
    "message": "Tom Wilson: Root cause identified - inefficient query in v2.3.1 deployment. Recommending immediate rollback.",
    "severity": "high",
    "actor": "tom.wilson",
    "metadata": {
      "channel": "#war-room-payment-api",
      "root_cause": "inefficient_query",
      "recommendation": "rollback"
    }
  },
  {
    "timestamp": "2024-03-15T15:02:00Z",
    "source": "slack",
    "type": "communication",
    "message": "Mike Rodriguez: Approved rollback to v2.2.9. Sarah initiating rollback procedure.",
    "severity": "high",
    "actor": "mike.rodriguez",
    "metadata": {
      "channel": "#war-room-payment-api",
      "decision": "rollback_approved",
      "target_version": "v2.2.9"
    }
  },
  {
    "timestamp": "2024-03-15T15:05:00Z",
    "source": "deployment_system",
    "type": "action",
    "message": "Rollback initiated: payment-api v2.3.1 → v2.2.9",
    "severity": "medium",
    "actor": "sarah.chen",
    "metadata": {
      "from_version": "v2.3.1",
      "to_version": "v2.2.9",
      "deployment_type": "rollback"
    }
  },
  {
    "timestamp": "2024-03-15T15:12:00Z",
    "source": "deployment_system",
    "type": "action",
    "message": "Rollback completed successfully: payment-api now running v2.2.9 across all regions",
    "severity": "medium",
    "actor": "deployment-system",
    "metadata": {
      "deployment_status": "completed",
      "regions": ["us-west", "us-east", "eu-west"]
    }
  },
  {
    "timestamp": "2024-03-15T15:15:00Z",
    "source": "datadog",
    "type": "log",
    "message": "Error rate decreasing: payment-api error rate dropped to 8% and continuing to decline",
    "severity": "medium",
    "actor": "monitoring-system",
    "metadata": {
      "error_rate": "8%",
      "trend": "decreasing"
    }
  },
  {
    "timestamp": "2024-03-15T15:18:00Z",
    "source": "database_monitoring",
    "type": "log",
    "message": "Connection pool utilization normalizing: 45/200 connections active",
    "severity": "low",
    "actor": "database-monitor",
    "metadata": {
      "connection_count": 45,
      "max_connections": 200,
      "utilization": "22.5%"
    }
  },
  {
    "timestamp": "2024-03-15T15:25:00Z",
    "source": "datadog",
    "type": "log",
    "message": "Error rate returned to normal: payment-api error rate now 0.2% (within normal range)",
    "severity": "low",
    "actor": "monitoring-system",
    "metadata": {
      "error_rate": "0.2%",
      "status": "normal"
    }
  },
  {
    "timestamp": "2024-03-15T15:30:00Z",
    "source": "slack",
    "type": "communication",
    "message": "Mike Rodriguez: All metrics returned to normal. Declaring incident resolved. Thanks to all responders.",
    "severity": "low",
    "actor": "mike.rodriguez",
    "metadata": {
      "channel": "#war-room-payment-api",
      "status": "resolved"
    }
  },
  {
    "timestamp": "2024-03-15T15:35:00Z",
    "source": "statuspage",
    "type": "communication",
    "message": "Status page updated: Payment processing issues resolved. All systems operational.",
    "severity": "low",
    "actor": "mike.rodriguez",
    "metadata": {
      "status": "resolved",
      "duration": "65 minutes"
    }
  },
  {
    "timestamp": "2024-03-15T15:40:00Z",
    "source": "slack",
    "type": "communication",
    "message": "Mike Rodriguez: PIR scheduled for tomorrow 10am. Action item: fix the inefficient query in v2.3.2",
    "severity": "low",
    "actor": "mike.rodriguez",
    "metadata": {
      "channel": "#incidents",
      "pir_time": "2024-03-16T10:00:00Z",
      "action_item": "fix_query_v2.3.2"
    }
  }
]

[
  {
    "timestamp": "2024-03-10T09:00:00Z",
    "source": "monitoring",
    "message": "High CPU utilization detected on web servers",
    "severity": "medium",
    "actor": "system"
  },
  {
    "timestamp": "2024-03-10T09:05:00Z",
    "source": "slack",
    "message": "Engineer investigating high CPU alerts",
    "severity": "medium", 
    "actor": "john.doe"
  },
  {
    "timestamp": "2024-03-10T09:15:00Z",
    "source": "deployment",
    "message": "Deployed hotfix to reduce CPU usage",
    "severity": "low",
    "actor": "john.doe"
  },
  {
    "timestamp": "2024-03-10T09:25:00Z",
    "source": "monitoring",
    "message": "CPU utilization returned to normal levels",
    "severity": "low",
    "actor": "system"
  }
]
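A short event list like the one above is enough to illustrate gap identification: sort the events, then flag any interval between consecutive events that exceeds a threshold. This is a sketch under stated assumptions. The `find_gaps` helper and the 8-minute threshold are hypothetical and are not taken from `timeline_reconstructor.py`.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def find_gaps(events, threshold_minutes=8.0):
    """Return (from, to, minutes) for consecutive events further apart than the threshold."""
    ordered = sorted(events, key=lambda e: parse_ts(e["timestamp"]))
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        delta = parse_ts(current["timestamp"]) - parse_ts(previous["timestamp"])
        minutes = delta.total_seconds() / 60
        if minutes > threshold_minutes:
            gaps.append((previous["timestamp"], current["timestamp"], minutes))
    return gaps


# Timestamps copied from the four-event sample above.
events = [
    {"timestamp": "2024-03-10T09:00:00Z"},
    {"timestamp": "2024-03-10T09:05:00Z"},
    {"timestamp": "2024-03-10T09:15:00Z"},
    {"timestamp": "2024-03-10T09:25:00Z"},
]

gaps = find_gaps(events)
print(gaps)
```

With the assumed threshold, the two 10-minute intervals are flagged and the initial 5-minute interval is not.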

============================================================
INCIDENT CLASSIFICATION REPORT
============================================================

CLASSIFICATION:
  Severity: SEV1
  Confidence: 100.0%
  Reasoning: Classified as SEV1 based on: keywords: timeout, 500 error; user impact: 80%
  Timestamp: 2026-02-16T12:41:46.644096+00:00

RECOMMENDED RESPONSE:
  Primary Team: Analytics Team
  Supporting Teams: SRE, API Team, Backend Engineering, Finance Engineering, Payments Team, DevOps, Compliance Team, Database Team, Platform Team, Data Engineering
  Response Time: 5 minutes

INITIAL ACTIONS:
  1. Establish incident command (Priority 1)
     Timeout: 5 minutes
     Page incident commander and establish war room

  2. Create incident ticket (Priority 1)
     Timeout: 2 minutes
     Create tracking ticket with all known details

  3. Update status page (Priority 2)
     Timeout: 15 minutes
     Post initial status page update acknowledging incident

  4. Notify executives (Priority 2)
     Timeout: 15 minutes
     Alert executive team of customer-impacting outage

  5. Engage subject matter experts (Priority 3)
     Timeout: 10 minutes
     Page relevant SMEs based on affected systems

COMMUNICATION:
  Subject: 🚨 [SEV1] payment-api - Database connection timeouts causing 500 errors fo...
  Urgency: SEV1
  Recipients: on-call, engineering-leadership, executives, customer-success
  Channels: pager, phone, slack, email, status-page
  Update Frequency: Every 15 minutes

============================================================

Post-Incident Review: Payment API Database Connection Pool Exhaustion

Executive Summary

On March 15, 2024, we experienced a sev2 incident affecting ['payment-api', 'checkout-service', 'subscription-billing']. The incident lasted 1h 5m and had the following impact: 80% of users unable to complete payments or checkout. Approximately 2,400 failed payment attempts during the incident. Users experienced immediate 500 errors when attempting to pay. The incident has been resolved and we have identified specific actions to prevent recurrence.

Incident Overview

Incident ID: INC-2024-0315-001
Date & Time: 2024-03-15 14:30:00 UTC
Duration: 1h 5m
Severity: SEV2
Status: Resolved
Incident Commander: Mike Rodriguez
Responders: Sarah Chen - On-call Engineer, Primary Responder, Tom Wilson - Database Team Lead, Lisa Park - Database Engineer, Mike Rodriguez - Incident Commander, David Kumar - DevOps Engineer

Customer Impact

80% of users unable to complete payments or checkout. Approximately 2,400 failed payment attempts during the incident. Users experienced immediate 500 errors when attempting to pay.

Business Impact

Estimated revenue loss of $45,000 during outage period. No SLA breaches as resolution was within 2-hour window. 12 customer escalations through support channels.

Timeline

No detailed timeline available.

Root Cause Analysis

Analysis Method: 5 Whys Analysis

Why Analysis

Why 1: Why did Database connection pool exhaustion caused widespread 500 errors in payment processing API, preventing users from completing purchases. Root cause was an inefficient database query introduced in deployment v2.3.1.? Answer: New deployment introduced a regression

Why 2: Why wasn't this detected earlier? Answer: Code review process missed the issue

Why 3: Why didn't existing safeguards prevent this? Answer: Testing environment didn't match production

Why 4: Why wasn't there a backup mechanism? Answer: Further investigation needed

Why 5: Why wasn't this scenario anticipated? Answer: Further investigation needed

What Went Well

The incident was successfully resolved
Incident command was established
Multiple team members collaborated on resolution

What Didn't Go Well

Analysis in progress

Lessons Learned

Lessons learned to be documented following detailed analysis.

Action Items

Action items to be defined.

Follow-up and Prevention

Prevention Measures

Based on the root cause analysis, the following preventive measures have been identified:

Implement comprehensive testing for similar scenarios
Improve monitoring and alerting coverage
Enhance error handling and resilience patterns

Follow-up Schedule

1 week: Review action item progress
1 month: Evaluate effectiveness of implemented changes
3 months: Conduct follow-up assessment and update preventive measures

Appendix

Additional Information

Incident ID: INC-2024-0315-001
Severity Classification: sev2
Affected Services: payment-api, checkout-service, subscription-billing

References

Incident tracking ticket: [Link TBD]
Monitoring dashboards: [Link TBD]
Communication thread: [Link TBD]

--- Generated on 2026-02-16 by PIR Generator

============================================================
INCIDENT CLASSIFICATION REPORT
============================================================

CLASSIFICATION:
  Severity: SEV2
  Confidence: 100.0%
  Reasoning: Classified as SEV2 based on: keywords: slow; user impact: 25%
  Timestamp: 2026-02-16T12:42:41.889774+00:00

RECOMMENDED RESPONSE:
  Primary Team: UX Engineering
  Supporting Teams: Product Engineering, Frontend Team
  Response Time: 15 minutes

INITIAL ACTIONS:
  1. Assign incident commander (Priority 1)
     Timeout: 30 minutes
     Assign IC and establish coordination channel

  2. Create incident tracking (Priority 1)
     Timeout: 5 minutes
     Create incident ticket with details and timeline

  3. Assess customer impact (Priority 2)
     Timeout: 15 minutes
     Determine scope and severity of user impact

  4. Engage response team (Priority 2)
     Timeout: 30 minutes
     Page appropriate technical responders

  5. Begin investigation (Priority 3)
     Timeout: 15 minutes
     Start technical analysis and debugging

COMMUNICATION:
  Subject: ⚠️ [SEV2] web-frontend - Users reporting slow page loads on the main websit...
  Urgency: SEV2
  Recipients: on-call, engineering-leadership, product-team
  Channels: pager, slack, email
  Update Frequency: Every 30 minutes

============================================================
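The two sample reports imply a per-severity response policy: a 5-minute response time with updates every 15 minutes for SEV1, and a 15-minute response time with updates every 30 minutes for SEV2. A minimal sketch of that lookup follows. The `RESPONSE_POLICY` table and `next_update_due` helper are hypothetical names, and only the two severities shown in the sample reports are filled in.

```python
from datetime import datetime, timedelta

# Values copied from the sample SEV1 and SEV2 reports; other levels are not shown there.
RESPONSE_POLICY = {
    "SEV1": {"response_minutes": 5, "update_minutes": 15},
    "SEV2": {"response_minutes": 15, "update_minutes": 30},
}


def next_update_due(severity: str, last_update: str) -> datetime:
    """Return when the next stakeholder update is due for the given severity."""
    policy = RESPONSE_POLICY[severity.upper()]  # KeyError for levels not in the table
    last = datetime.fromisoformat(last_update.replace("Z", "+00:00"))
    return last + timedelta(minutes=policy["update_minutes"])


due = next_update_due("sev1", "2024-01-15T14:30:00Z")
print(due.isoformat())  # 2024-01-15T14:45:00+00:00
```

Keeping the cadence in a table rather than in branching logic makes it straightforward to extend to further severity levels once their values are defined.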

================================================================================
INCIDENT TIMELINE RECONSTRUCTION
================================================================================

OVERVIEW:
  Time Range: 2024-03-15T14:30:00+00:00 to 2024-03-15T15:40:00+00:00
  Total Duration: 70 minutes
  Total Events: 21
  Phases Detected: 12

PHASES:
  DETECTION:
    Start: 2024-03-15T14:30:00+00:00
    Duration: 0.0 minutes
    Events: 1
    Description: Initial detection of the incident through monitoring or observation

  ESCALATION:
    Start: 2024-03-15T14:32:00+00:00
    Duration: 0.0 minutes
    Events: 1
    Description: Escalation to additional resources or higher severity response

  TRIAGE:
    Start: 2024-03-15T14:35:00+00:00
    Duration: 0.0 minutes
    Events: 1
    Description: Assessment and initial investigation of the incident

  ESCALATION:
    Start: 2024-03-15T14:38:00+00:00
    Duration: 9.0 minutes
    Events: 5
    Description: Escalation to additional resources or higher severity response

  TRIAGE:
    Start: 2024-03-15T14:50:00+00:00
    Duration: 0.0 minutes
    Events: 1
    Description: Assessment and initial investigation of the incident

  ESCALATION:
    Start: 2024-03-15T14:52:00+00:00
    Duration: 10.0 minutes
    Events: 4
    Description: Escalation to additional resources or higher severity response

  TRIAGE:
    Start: 2024-03-15T15:05:00+00:00
    Duration: 7.0 minutes
    Events: 2
    Description: Assessment and initial investigation of the incident

  DETECTION:
    Start: 2024-03-15T15:15:00+00:00
    Duration: 0.0 minutes
    Events: 1
    Description: Initial detection of the incident through monitoring or observation

  RESOLUTION:
    Start: 2024-03-15T15:18:00+00:00
    Duration: 0.0 minutes
    Events: 1
    Description: Confirmation that the incident has been resolved

  DETECTION:
    Start: 2024-03-15T15:25:00+00:00
    Duration: 0.0 minutes
    Events: 1
    Description: Initial detection of the incident through monitoring or observation

  RESOLUTION:
    Start: 2024-03-15T15:30:00+00:00
    Duration: 5.0 minutes
    Events: 2
    Description: Confirmation that the incident has been resolved

  TRIAGE:
    Start: 2024-03-15T15:40:00+00:00
    Duration: 0.0 minutes
    Events: 1
    Description: Assessment and initial investigation of the incident

KEY METRICS:
  Time to Mitigation: 0 minutes
  Time to Resolution: 48.0 minutes
  Events per Hour: 18.0
  Unique Sources: 7

INCIDENT NARRATIVE:
Incident Timeline Summary:
The incident began at 2024-03-15 14:30:00 UTC and concluded at 2024-03-15 15:40:00 UTC, lasting approximately 70 minutes.

The incident progressed through 12 distinct phases: detection, escalation, triage, escalation, triage, escalation, triage, detection, resolution, detection, resolution, triage.

Key milestones:
- Detection: 14:30 (0 min)
- Escalation: 14:32 (0 min)
- Triage: 14:35 (0 min)
- Escalation: 14:38 (9 min)
- Triage: 14:50 (0 min)
- Escalation: 14:52 (10 min)
- Triage: 15:05 (7 min)
- Detection: 15:15 (0 min)
- Resolution: 15:18 (0 min)
- Detection: 15:25 (0 min)
- Resolution: 15:30 (5 min)
- Triage: 15:40 (0 min)

================================================================================
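The overview figures in the report above can be reproduced by hand: 14:30 to 15:40 is 70 minutes, and 21 events over 70 minutes is 21 × 60 / 70 = 18.0 events per hour. The sketch below shows that arithmetic. `minutes_between` and `overview_metrics` are hypothetical helpers, and the script's own implementation may differ.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> float:
    """Minutes elapsed between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 60


def overview_metrics(first_ts: str, last_ts: str, event_count: int) -> dict:
    """Total duration and event rate for a timeline."""
    duration = minutes_between(first_ts, last_ts)
    per_hour = event_count * 60 / duration if duration > 0 else 0.0
    return {"total_duration_minutes": duration, "events_per_hour": round(per_hour, 1)}


metrics = overview_metrics("2024-03-15T14:30:00+00:00", "2024-03-15T15:40:00+00:00", 21)
print(metrics)

# Gap between the first event and the start of the first RESOLUTION phase.
to_first_resolution = minutes_between("2024-03-15T14:30:00+00:00", "2024-03-15T15:18:00+00:00")
print(to_first_resolution)
```

The second calculation yields 48 minutes, which matches the reported Time to Resolution. That suggests, though the report does not state it, that the metric is measured to the start of the first phase labeled RESOLUTION rather than to the final event.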

Incident Commander Skill

A comprehensive incident response framework providing structured tools for managing technology incidents from detection through resolution and post-incident review.

Overview

This skill packages common SRE and DevOps incident-response practices into the following components:

Automated Severity Classification - Intelligent incident triage
Timeline Reconstruction - Transform scattered events into coherent narratives
Post-Incident Review Generation - Structured PIRs with RCA frameworks
Communication Templates - Pre-built stakeholder communication
Comprehensive Documentation - Reference guides for incident response

Quick Start

Classify an Incident

# From JSON file
python scripts/incident_classifier.py --input incident.json --format text

# From stdin text
echo "Database is down affecting all users" | python scripts/incident_classifier.py --format text

# Interactive mode
python scripts/incident_classifier.py --interactive

Reconstruct Timeline

# Analyze event timeline
python scripts/timeline_reconstructor.py --input events.json --format text

# With gap analysis
python scripts/timeline_reconstructor.py --input events.json --gap-analysis --format markdown

Generate PIR Document

# Basic PIR
python scripts/pir_generator.py --incident incident.json --format markdown

# Comprehensive PIR with timeline
python scripts/pir_generator.py --incident incident.json --timeline timeline.json --rca-method fishbone

Scripts

incident_classifier.py

Purpose: Analyzes incident descriptions and provides severity classification, team recommendations, and response templates.

Input: JSON object with incident details or plain text description
Output: JSON + human-readable classification report

Example Input:

{
  "description": "Database connection timeouts causing 500 errors",
  "service": "payment-api",
  "affected_users": "80%",
  "business_impact": "high"
}

Key Features:

SEV1-4 severity classification
Recommended response teams
Initial action prioritization
Communication templates
Response timelines

timeline_reconstructor.py

Purpose: Reconstructs incident timelines from timestamped events, identifies phases, and performs gap analysis.

Input: JSON array of timestamped events
Output: Formatted timeline with phase analysis and metrics

Example Input:

[
  {
    "timestamp": "2024-01-01T12:00:00Z",
    "source": "monitoring",
    "message": "High error rate detected",
    "severity": "critical",
    "actor": "system"
  }
]

Key Features:

Phase detection (detection → triage → mitigation → resolution)
Duration analysis
Gap identification
Communication effectiveness analysis
Response metrics
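
The gap identification listed above can be illustrated with a short sketch. This is not the implementation inside timeline_reconstructor.py; the `find_gaps` helper, the `parse_ts` helper, and the 10-minute threshold are assumptions chosen for illustration, and only the `timestamp` field from the example input is used.

```python
from datetime import datetime, timedelta


def parse_ts(value):
    """Parse an ISO 8601 timestamp; a trailing "Z" is rewritten for older Pythons."""
    # datetime.fromisoformat requires Python 3.7+, and only accepts "Z" from 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def find_gaps(events, threshold=timedelta(minutes=10)):
    """Return (start, end, minutes) for each silent stretch longer than threshold.

    The threshold default is a hypothetical value, not one taken from the skill.
    """
    ordered = sorted(events, key=lambda e: parse_ts(e["timestamp"]))
    gaps = []
    for prev, curr in zip(ordered, ordered[1:]):
        delta = parse_ts(curr["timestamp"]) - parse_ts(prev["timestamp"])
        if delta > threshold:
            gaps.append((prev["timestamp"], curr["timestamp"],
                         delta.total_seconds() / 60))
    return gaps


events = [
    {"timestamp": "2024-01-01T12:20:00Z", "message": "Rollback started"},
    {"timestamp": "2024-01-01T12:00:00Z", "message": "High error rate detected"},
    {"timestamp": "2024-01-01T12:03:00Z", "message": "On-call paged"},
]
print(find_gaps(events))
```

Sorting before comparing neighbours matters because events collected from several sources rarely arrive in chronological order.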

pir_generator.py

Purpose: Generates comprehensive Post-Incident Review documents with multiple RCA frameworks.

Input: Incident data JSON, optional timeline data
Output: Structured PIR document with RCA analysis

Key Features:

Multiple RCA methods (5 Whys, Fishbone, Timeline, Bow Tie)
Automated action item generation
Lessons learned categorization
Follow-up planning
Completeness assessment

Sample Data

The assets/ directory contains sample data files for testing:

sample_incident_classification.json - Database connection pool exhaustion incident
sample_timeline_events.json - Complete timeline with 21 events across phases
sample_incident_pir_data.json - Comprehensive incident data for PIR generation
simple_incident.json - Minimal incident for basic testing
simple_timeline_events.json - Simple 4-event timeline

Expected Outputs

The expected_outputs/ directory contains reference outputs showing what each script produces:

incident_classification_text_output.txt - Detailed classification report
timeline_reconstruction_text_output.txt - Complete timeline analysis
pir_markdown_output.md - Full PIR document
simple_incident_classification.txt - Basic classification example

Reference Documentation

references/incident_severity_matrix.md

Complete severity classification system with:

SEV1-4 definitions and criteria
Response requirements and timelines
Escalation paths
Communication requirements
Decision trees and examples

references/rca_frameworks_guide.md

Detailed guide for root cause analysis:

5 Whys methodology
Fishbone (Ishikawa) diagram analysis
Timeline analysis techniques
Bow Tie analysis for high-risk incidents
Framework selection guidelines

references/communication_templates.md

Standardized communication templates:

Severity-specific notification templates
Stakeholder-specific messaging
Escalation communications
Resolution notifications
Customer communication guidelines

Usage Patterns

End-to-End Incident Workflow

1. Initial Classification

echo "Payment API returning 500 errors for 70% of requests" | \
  python scripts/incident_classifier.py --format text

2. Timeline Reconstruction (after collecting events)

python scripts/timeline_reconstructor.py \
  --input events.json \
  --gap-analysis \
  --format json \
  --output timeline.json

3. PIR Generation (after incident resolution)

python scripts/pir_generator.py \
  --incident incident.json \
  --timeline timeline.json \
  --rca-method fishbone \
  --output pir.md

Integration Examples

CI/CD Pipeline Integration:

# Classify deployment issues
cat deployment_error.log | python scripts/incident_classifier.py --format json

Monitoring Integration:

# Process alert events
curl -s "monitoring-api/events" | python scripts/timeline_reconstructor.py --format text

Runbook Generation: Use classification output to automatically select appropriate runbooks and escalation procedures.

Quality Standards

Zero External Dependencies - All scripts use only Python standard library
Dual Output Format - Both JSON (machine-readable) and text (human-readable)
Robust Input Handling - Graceful handling of missing or malformed data
Professional Defaults - Opinionated, battle-tested configurations
Comprehensive Testing - Sample data and expected outputs included

Technical Requirements

Python 3.6+
No external dependencies required
Works with standard Unix tools (pipes, redirection)
Cross-platform compatible

Severity Classification Reference

Severity	Description	Response Time	Update Frequency
SEV1	Complete outage	5 minutes	Every 15 minutes
SEV2	Major degradation	15 minutes	Every 30 minutes
SEV3	Minor impact	2 hours	At milestones
SEV4	Low impact	1-2 business days	Weekly
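
The update frequencies in the table above can be turned into a small lookup for computing when the next stakeholder update is due. This is a minimal sketch, not code from the skill: `UPDATE_INTERVAL` and `next_update_due` are hypothetical names, and SEV3 is treated as having no fixed interval because its updates are sent at milestones.

```python
from datetime import datetime, timedelta, timezone

# Fixed cadences from the reference table. SEV3 ("At milestones") has no timer.
UPDATE_INTERVAL = {
    "SEV1": timedelta(minutes=15),
    "SEV2": timedelta(minutes=30),
    "SEV4": timedelta(weeks=1),
}


def next_update_due(severity, last_update):
    """Return the time the next update is due, or None for milestone-driven levels."""
    interval = UPDATE_INTERVAL.get(severity.upper())
    if interval is None:
        return None
    return last_update + interval


last = datetime(2024, 3, 15, 14, 30, tzinfo=timezone.utc)
print(next_update_due("SEV1", last))
```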

Getting Help

Each script includes comprehensive help:

python scripts/incident_classifier.py --help
python scripts/timeline_reconstructor.py --help  
python scripts/pir_generator.py --help

For methodology questions, refer to the reference documentation in the references/ directory.

Contributing

When adding new features:

1. Maintain zero external dependencies
2. Add comprehensive examples to assets/
3. Update expected outputs in expected_outputs/
4. Follow the established patterns for argument parsing and output formatting

License

This skill is part of the claude-skills repository. See the main repository LICENSE for details.

Incident Communication Templates

Overview

This document provides standardized communication templates for incident response. These templates ensure consistent, clear communication across different severity levels and stakeholder groups.

Template Usage Guidelines

General Principles

1. Be Clear and Concise - Use simple language, avoid jargon
2. Be Factual - Only state what is known, avoid speculation
3. Be Timely - Send updates at committed intervals
4. Be Actionable - Include next steps and expected timelines
5. Be Accountable - Include contact information for follow-up

Template Selection

Choose templates based on incident severity and audience
Customize templates with specific incident details
Always include next update time and contact information
Escalate template types as severity increases

---

SEV1 Templates

Initial Alert - Internal Teams

Subject: 🚨 [SEV1] CRITICAL: {Service} Complete Outage - Immediate Response Required

CRITICAL INCIDENT ALERT - IMMEDIATE ATTENTION REQUIRED

Incident Summary:
- Service: {Service Name}
- Status: Complete Outage
- Start Time: {Timestamp}
- Customer Impact: {Impact Description}
- Estimated Affected Users: {Number/Percentage}

Immediate Actions Needed:
✓ Incident Commander: {Name} - ASSIGNED
✓ War Room: {Bridge/Chat Link} - JOIN NOW
✓ On-Call Response: {Team} - PAGED
⏳ Executive Notification: In progress
⏳ Status Page Update: Within 15 minutes

Current Situation:
{Brief description of what we know}

What We're Doing:
{Immediate response actions being taken}

Next Update: {Timestamp - 15 minutes from now}

Incident Commander: {Name}
Contact: {Phone/Slack}

THIS IS A CUSTOMER-IMPACTING INCIDENT REQUIRING IMMEDIATE ATTENTION

Executive Notification - SEV1

Subject: 🚨 URGENT: Customer-Impacting Outage - {Service}

EXECUTIVE ALERT: Critical customer-facing incident

Service: {Service Name}
Impact: {Customer impact description}
Duration: {Current duration} (started {start time})
Business Impact: {Revenue/SLA/compliance implications}

Customer Impact Summary:
- Affected Users: {Number/percentage}
- Revenue Impact: {$ amount if known}
- SLA Status: {Breach status}
- Customer Escalations: {Number if any}

Response Status:
- Incident Commander: {Name} ({contact})
- Response Team Size: {Number of engineers}
- Root Cause: {If known, otherwise "Under investigation"}
- ETA to Resolution: {If known, otherwise "Investigating"}

Executive Actions Required:
- [ ] Customer communication approval needed
- [ ] Legal/compliance notification: {If applicable}
- [ ] PR/Media response preparation: {If needed}
- [ ] Resource allocation decisions: {If escalation needed}

War Room: {Link}
Next Update: {15 minutes from now}

This incident meets SEV1 criteria and requires executive oversight.

{Incident Commander contact information}

Customer Communication - SEV1

Subject: Service Disruption - Immediate Action Being Taken

We are currently experiencing a service disruption affecting {service description}.

What's Happening:
{Clear, customer-friendly description of the issue}

Impact:
{What customers are experiencing - be specific}

What We're Doing:
We detected this issue at {time} and immediately mobilized our engineering team. We are actively working to resolve this issue and will provide updates every 15 minutes.

Current Actions:
• {Action 1 - customer-friendly description}
• {Action 2 - customer-friendly description}
• {Action 3 - customer-friendly description}

Workaround:
{If available, provide clear steps}
{If not available: "We are working on alternative solutions and will share them as soon as available."}

Next Update: {Timestamp}
Status Page: {Link}
Support: {Contact information if different from usual}

We sincerely apologize for the inconvenience and are committed to resolving this as quickly as possible.

{Company Name} Team

Status Page Update - SEV1

Status: Major Outage

{Timestamp} - Investigating

We are currently investigating reports of {service} being unavailable. Our team has been alerted and is actively investigating the cause.

Affected Services: {List of affected services}
Impact: {Customer-facing impact description}

We will provide an update within 15 minutes.

{Timestamp} - Identified

We have identified the cause of the {service} outage. Our engineering team is implementing a fix.

Root Cause: {Brief, customer-friendly explanation}
Expected Resolution: {Timeline if known}

Next update in 15 minutes.

{Timestamp} - Monitoring

The fix has been implemented and we are monitoring the service recovery. 

Current Status: {Recovery progress}
Next Steps: {What we're monitoring}

We expect full service restoration within {timeframe}.

{Timestamp} - Resolved

{Service} is now fully operational. We have confirmed that all functionality is working as expected.

Total Duration: {Duration}
Root Cause: {Brief summary}

We apologize for the inconvenience. A full post-incident review will be conducted and shared within 24 hours.
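
The four status page entries above follow a fixed progression (Investigating, Identified, Monitoring, Resolved). The sketch below models that progression so tooling can format an entry and suggest the next stage. `StatusStage`, `format_update`, and `next_stage` are hypothetical names, and the strictly linear ordering is an assumption; a real incident may move back to an earlier stage.

```python
from enum import Enum


class StatusStage(Enum):
    INVESTIGATING = "Investigating"
    IDENTIFIED = "Identified"
    MONITORING = "Monitoring"
    RESOLVED = "Resolved"


# Enum iteration preserves definition order, which matches the template order.
_ORDER = list(StatusStage)


def format_update(timestamp, stage, message):
    """Render one entry using the "{Timestamp} - {Stage}" heading from the template."""
    return f"{timestamp} - {stage.value}\n\n{message}"


def next_stage(stage):
    """Return the stage that normally follows, or None once resolved."""
    idx = _ORDER.index(stage)
    return _ORDER[idx + 1] if idx + 1 < len(_ORDER) else None


print(format_update("2024-03-15 14:35 UTC", StatusStage.INVESTIGATING,
                    "We will provide an update within 15 minutes."))
```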

---

SEV2 Templates

Team Notification - SEV2

Subject: ⚠️ [SEV2] {Service} Performance Issues - Response Team Mobilizing

SEV2 INCIDENT: Performance degradation requiring active response

Incident Details:
- Service: {Service Name}
- Issue: {Description of performance issue}
- Start Time: {Timestamp}
- Affected Users: {Percentage/description}
- Business Impact: {Impact on business operations}

Current Status:
{What we know about the issue}

Response Team:
- Incident Commander: {Name} ({contact})
- Primary Responder: {Name} ({team})
- Supporting Teams: {List of engaged teams}

Immediate Actions:
✓ {Action 1 - completed}
⏳ {Action 2 - in progress}
⏳ {Action 3 - next step}

Metrics:
- Error Rate: {Current vs normal}
- Response Time: {Current vs normal}  
- Throughput: {Current vs normal}

Communication Plan:
- Internal Updates: Every 30 minutes
- Stakeholder Notification: {If needed}
- Status Page Update: {Planned/not needed}

Coordination Channel: {Slack channel}
Next Update: {30 minutes from now}

Incident Commander: {Name} | {Contact}

Stakeholder Update - SEV2

Subject: [SEV2] Service Performance Update - {Service}

Service Performance Incident Update

Service: {Service Name}
Duration: {Current duration}
Impact: {Description of user impact}

Current Status:
{Brief status of the incident and response efforts}

What We Know:
• {Key finding 1}
• {Key finding 2}
• {Key finding 3}

What We're Doing:
• {Response action 1}
• {Response action 2}
• {Monitoring/verification steps}

Customer Impact:
{Realistic assessment of what users are experiencing}

Workaround:
{If available, provide steps}

Expected Resolution:
{Timeline if known, otherwise "Continuing investigation"}

Next Update: {30 minutes}
Contact: {Incident Commander information}

This incident is being actively managed and does not currently require escalation.

Customer Communication - SEV2 (Optional)

Subject: Temporary Service Performance Issues

We are currently experiencing performance issues with {service name} that may affect your experience.

What You Might Notice:
{Specific symptoms users might experience}

What We're Doing:
Our team identified this issue at {time} and is actively working on a resolution. We expect to have this resolved within {timeframe}.

Workaround:
{If applicable, provide simple workaround steps}

We will update our status page at {link} with progress information.

Thank you for your patience as we work to resolve this issue quickly.

{Company Name} Support Team

---

SEV3 Templates

Team Assignment - SEV3

Subject: [SEV3] Issue Assignment - {Component} Issue

SEV3 Issue Assignment

Service/Component: {Affected component}
Issue: {Description}
Reported: {Timestamp}
Reporter: {Person/system that reported}

Issue Details:
{Detailed description of the problem}

Impact Assessment:
- Affected Users: {Scope}
- Business Impact: {Assessment}
- Urgency: {Business hours response appropriate}

Assignment:
- Primary: {Engineer name}
- Team: {Responsible team}
- Expected Response: {Within 2-4 hours}

Investigation Plan:
1. {Investigation step 1}
2. {Investigation step 2}
3. {Communication checkpoint}

Workaround:
{If known, otherwise "Investigating alternatives"}

This issue will be tracked in {ticket system} as {ticket number}.

Team Lead: {Name} | {Contact}

Status Update - SEV3

Subject: [SEV3] Progress Update - {Component}

SEV3 Issue Progress Update

Issue: {Brief description}
Assigned to: {Engineer/Team}
Investigation Status: {Current progress}

Findings So Far:
{What has been discovered during investigation}

Next Steps:
{Planned actions and timeline}

Impact Update:
{Any changes to scope or urgency}

Expected Resolution:
{Timeline if known}

This issue continues to be tracked as SEV3 with no escalation required.

Contact: {Assigned engineer} | {Team lead}

---

SEV4 Templates

Issue Documentation - SEV4

Subject: [SEV4] Issue Documented - {Description}

SEV4 Issue Logged

Description: {Clear description of the issue}
Reporter: {Name/system}
Date: {Date reported}

Impact:
{Minimal impact description}

Priority Assessment:
This issue has been classified as SEV4 and will be addressed in the normal development cycle.

Assignment:
- Team: {Responsible team}
- Sprint: {Target sprint}
- Estimated Effort: {Story points/hours}

This issue is tracked as {ticket number} in {system}.

Product Owner: {Name}

---

Escalation Templates

Severity Escalation

Subject: ESCALATION: {Original Severity} → {New Severity} - {Service}

SEVERITY ESCALATION NOTIFICATION

Original Classification: {Original severity}
New Classification: {New severity}  
Escalation Time: {Timestamp}
Escalated By: {Name and role}

Escalation Reasons:
• {Reason 1 - scope expansion/duration/impact}
• {Reason 2}
• {Reason 3}

Updated Impact:
{New assessment of customer/business impact}

Updated Response Requirements:
{New response team, communication frequency, etc.}

Previous Response Actions:
{Summary of actions taken under previous severity}

New Incident Commander: {If changed}
Updated Communication Plan: {New frequency/recipients}

All stakeholders should adjust response according to {new severity} protocols.

Incident Commander: {Name} | {Contact}

Management Escalation

Subject: MANAGEMENT ESCALATION: Extended {Severity} Incident - {Service}

Management Escalation Required

Incident: {Service} {brief description}
Original Severity: {Severity}
Duration: {Current duration}
Escalation Trigger: {Duration threshold/scope change/customer escalation}

Current Status:
{Brief status of incident response}

Challenges Encountered:
• {Challenge 1}
• {Challenge 2}
• {Resource/expertise needs}

Business Impact:
{Updated assessment of business implications}

Management Decision Required:
• {Decision 1 - resource allocation/external expertise/communication}
• {Decision 2}

Recommended Actions:
{Incident Commander's recommendations}

This escalation follows standard procedures for {trigger type}.

Incident Commander: {Name}
Contact: {Phone/Slack}
War Room: {Link}

---

Resolution Templates

Resolution Confirmation - All Severities

Subject: RESOLVED: [{Severity}] {Service} Incident - {Brief Description}

INCIDENT RESOLVED

Service: {Service Name}
Issue: {Brief description}
Duration: {Total duration}
Resolution Time: {Timestamp}

Resolution Summary:
{Brief description of how the issue was resolved}

Root Cause:
{Brief explanation - detailed PIR to follow}

Impact Summary:
- Users Affected: {Final count/percentage}
- Business Impact: {Final assessment}
- Services Affected: {List}

Resolution Actions Taken:
• {Action 1}
• {Action 2}
• {Verification steps}

Monitoring:
We will continue monitoring {service} for {duration} to ensure stability.

Next Steps:
• Post-incident review scheduled for {date}
• Action items to be tracked in {system}
• Follow-up communication: {If needed}

Thank you to everyone who participated in the incident response.

Incident Commander: {Name}

Customer Resolution Communication

Subject: Service Restored - Thank You for Your Patience

Service Update: Issue Resolved

We're pleased to report that the {service} issues have been fully resolved as of {timestamp}.

What Was Fixed:
{Customer-friendly explanation of the resolution}

Duration:
The issue lasted {duration} from {start time} to {end time}.

What We Learned:
{Brief, high-level takeaway}

Our Commitment:
We are conducting a thorough review of this incident and will implement improvements to prevent similar issues in the future. A summary of our findings and improvements will be shared {timeframe}.

We sincerely apologize for any inconvenience this may have caused and appreciate your patience while we worked to resolve the issue.

If you continue to experience any problems, please contact our support team at {contact information}.

Thank you,
{Company Name} Team

---

Template Customization Guidelines

Placeholders to Always Replace

{Service} / {Service Name} - Specific service or component
{Timestamp} - Specific date/time in consistent format
{Name} / {Contact} - Actual names and contact information
{Duration} - Actual time durations
{Link} - Real URLs to war rooms, status pages, etc.
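
Because the templates use placeholders that contain spaces and slashes (for example `{Service Name}` or `{Phone/Slack}`), Python's `str.format` is a poor fit. The sketch below uses a regular expression instead and reports any placeholder left unreplaced, which helps catch a message sent with a literal `{Timestamp}` in it. `fill_template` is a hypothetical helper and not part of the skill's scripts.

```python
import re

# Matches a single-level {placeholder}, including names with spaces or slashes.
PLACEHOLDER = re.compile(r"\{([^{}]+)\}")


def fill_template(template, values):
    """Substitute known placeholders and return (text, list_of_missing_names)."""
    missing = []

    def substitute(match):
        key = match.group(1)
        if key in values:
            return str(values[key])
        missing.append(key)
        return match.group(0)  # leave the placeholder visible in the output

    return PLACEHOLDER.sub(substitute, template), missing


template = "Service: {Service Name}\nStart Time: {Timestamp}"
text, missing = fill_template(template, {"Service Name": "payment-api"})
print(text)
print("Unreplaced:", missing)
```

Refusing to send while `missing` is non-empty is one way to enforce the replacement rule listed above.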

Language Guidelines

Use active voice ("We are investigating" not "The issue is being investigated")
Be specific about timelines ("within 30 minutes" not "soon")
Avoid technical jargon in customer communications
Include empathy in customer-facing messages
Use consistent terminology throughout incident lifecycle

Timing Guidelines

Severity	Initial Notification	Update Frequency	Resolution Notification
SEV1	Immediate (< 5 min)	Every 15 minutes	Immediate
SEV2	Within 15 minutes	Every 30 minutes	Within 15 minutes
SEV3	Within 2 hours	At milestones	Within 1 hour
SEV4	Within 1 business day	Weekly	When resolved

Audience-Specific Considerations

Engineering Teams

Include technical details
Provide specific metrics and logs
Include coordination channels
List specific actions and owners

Executive/Business

Focus on business impact
Include customer and revenue implications
Provide clear timeline and resource needs
Highlight any external factors (PR, legal, compliance)

Customers

Use plain language
Focus on customer impact and workarounds
Provide realistic timelines
Include support contact information
Show empathy and accountability

---

Last Updated: February 2026
Next Review: May 2026
Owner: Incident Management Team

Incident Severity Classification Matrix

Overview

This document defines the severity classification system used for incident response. The classification determines response requirements, escalation paths, and communication frequency.

Severity Levels

SEV1 - Critical Outage

Definition: Complete service failure affecting all users or critical business functions

Impact Criteria

Customer-facing services completely unavailable
Data loss or corruption affecting users
Security breaches with customer data exposure
Revenue-generating systems down
SLA violations with financial penalties
> 75% of users affected

Response Requirements

Metric	Requirement
Response Time	Immediate (0-5 minutes)
Incident Commander	Assigned within 5 minutes
War Room	Established within 10 minutes
Executive Notification	Within 15 minutes
Public Status Page	Updated within 15 minutes
Customer Communication	Within 30 minutes

Escalation Path

1. Immediate: On-call Engineer → Incident Commander
2. 15 minutes: VP Engineering + Customer Success VP
3. 30 minutes: CTO
4. 60 minutes: CEO + Full Executive Team
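
The SEV1 escalation path above is a set of elapsed-time thresholds, which can be sketched as a lookup table. The names `SEV1_ESCALATION` and `escalation_targets` are illustrative assumptions, and the sketch assumes elapsed time is measured in minutes from incident declaration.

```python
# (minutes since declaration, who must be engaged by then), from the SEV1 path.
SEV1_ESCALATION = [
    (0, "On-call Engineer -> Incident Commander"),
    (15, "VP Engineering + Customer Success VP"),
    (30, "CTO"),
    (60, "CEO + Full Executive Team"),
]


def escalation_targets(elapsed_minutes):
    """Return every escalation tier that should already be engaged."""
    return [who for threshold, who in SEV1_ESCALATION
            if elapsed_minutes >= threshold]


print(escalation_targets(20))
```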

Communication Requirements

Frequency: Every 15 minutes until resolution
Channels: PagerDuty, Phone, Slack, Email, Status Page
Recipients: All engineering, executives, customer success
Template: SEV1 Executive Alert Template

---

SEV2 - Major Impact

Definition: Significant degradation affecting a subset of users or non-critical functions

Impact Criteria

Partial service degradation (25-75% of users affected)
Performance issues causing user frustration
Non-critical features unavailable
Internal tools impacting productivity
Data inconsistencies not affecting user experience
API errors affecting integrations

Response Requirements

Metric	Requirement
Response Time	15 minutes
Incident Commander	Assigned within 30 minutes
Status Page Update	Within 30 minutes
Stakeholder Notification	Within 1 hour
Team Assembly	Within 30 minutes

Escalation Path

1. Immediate: On-call Engineer → Team Lead
2. 30 minutes: Engineering Manager
3. 2 hours: VP Engineering
4. 4 hours: CTO (if unresolved)

Communication Requirements

Frequency: Every 30 minutes during active response
Channels: PagerDuty, Slack, Email
Recipients: Engineering team, product team, relevant stakeholders
Template: SEV2 Major Impact Template

---

SEV3 - Minor Impact

Definition: Limited impact with workarounds available

Impact Criteria

Single feature or component affected
< 25% of users impacted
Workarounds available
Performance degradation not significantly impacting UX
Non-urgent monitoring alerts
Development/test environment issues

Response Requirements

Metric	Requirement
Response Time	2 hours (business hours)
After Hours Response	Next business day
Team Assignment	Within 4 hours
Status Page Update	Optional
Internal Notification	Within 2 hours

Escalation Path

1. Immediate: Assigned Engineer
2. 4 hours: Team Lead
3. 1 business day: Engineering Manager (if needed)

Communication Requirements

Frequency: At key milestones only
Channels: Slack, Email
Recipients: Assigned team, team lead
Template: SEV3 Minor Impact Template

---

SEV4 - Low Impact

Definition: Minimal impact, cosmetic issues, or planned maintenance

Impact Criteria

Cosmetic bugs
Documentation issues
Logging or monitoring gaps
Performance issues with no user impact
Development/test environment issues
Feature requests or enhancements

Response Requirements

Metric	Requirement
Response Time	1-2 business days
Assignment	Next sprint planning
Tracking	Standard ticket system
Escalation	None required

Communication Requirements

Frequency: Standard development cycle updates
Channels: Ticket system
Recipients: Product owner, assigned developer
Template: Standard issue template

Classification Guidelines

User Impact Assessment

Impact Scope	Description	Typical Severity
All Users	100% of users affected	SEV1
Major Subset	50-75% of users affected	SEV1/SEV2
Significant Subset	25-50% of users affected	SEV2
Limited Users	5-25% of users affected	SEV2/SEV3
Few Users	< 5% of users affected	SEV3/SEV4
No User Impact	Internal only	SEV4

Business Impact Assessment

Business Impact	Description	Severity Boost
Revenue Loss	Direct revenue impact	+1 severity level
SLA Breach	Contract violations	+1 severity level
Regulatory	Compliance implications	+1 severity level
Brand Damage	Public-facing issues	+1 severity level
Security	Data or system security	+2 severity levels

Duration Considerations

Duration	Impact on Classification
< 15 minutes	May reduce severity by 1 level
15-60 minutes	Standard classification
1-4 hours	May increase severity by 1 level
> 4 hours	Significant severity increase

Decision Tree

1. Is this a security incident with data exposure?
   → YES: SEV1 (regardless of user count)
   → NO: Continue to step 2

2. Are revenue-generating services completely down?
   → YES: SEV1
   → NO: Continue to step 3

3. What percentage of users are affected?
   → > 75%: SEV1
   → 25-75%: SEV2
   → 5-25%: SEV3
   → < 5%: SEV4

4. Apply business impact modifiers
5. Consider duration factors
6. When in doubt, err on the side of higher severity
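
Steps 1 to 3 of the decision tree are mechanical and can be sketched as a function. This is an illustration, not the logic inside incident_classifier.py. The function name and parameters are assumptions, and because the ranges in step 3 share their endpoints, the sketch assigns exactly 25% to SEV2 and exactly 5% to SEV3. Steps 4 to 6 (business impact modifiers, duration, and judgment) are left to the incident commander.

```python
def classify(data_exposure, revenue_services_down, percent_users_affected):
    """Apply steps 1-3 of the decision tree and return a baseline severity."""
    # Step 1: security incident with data exposure, regardless of user count.
    if data_exposure:
        return "SEV1"
    # Step 2: revenue-generating services completely down.
    if revenue_services_down:
        return "SEV1"
    # Step 3: percentage of users affected (boundary handling is an assumption).
    if percent_users_affected > 75:
        return "SEV1"
    if percent_users_affected >= 25:
        return "SEV2"
    if percent_users_affected >= 5:
        return "SEV3"
    return "SEV4"


print(classify(False, False, 40))
```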

Examples

SEV1 Examples

Payment processing system completely down
All user authentication failing
Database corruption causing data loss
Security breach with customer data exposed
Website returning 500 errors for all users

SEV2 Examples

Payment processing slow (30-second delays)
Search functionality returning incomplete results
API rate limits causing partner integration issues
Dashboard displaying stale data (> 1 hour old)
Mobile app crashing for 40% of users

SEV3 Examples

Single feature in admin panel not working
Email notifications delayed by 1 hour
Non-critical API endpoint returning errors
Cosmetic UI bug in settings page
Development environment deployment failing

SEV4 Examples

Typo in help documentation
Log format change needed for analysis
Non-critical performance optimization
Internal tool enhancement request
Test data cleanup needed

Escalation Triggers

Automatic Escalation

SEV1 incidents automatically escalate every 30 minutes if unresolved
SEV2 incidents escalate after 2 hours without significant progress
Any incident with expanding scope increases severity
Customer escalation to support triggers severity review
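
The first two automatic escalation rules above are time-based and can be sketched as a check. `escalations_due` is a hypothetical helper, and the sketch assumes the caller tracks minutes unresolved and minutes since the last significant progress. The scope-expansion and customer-escalation triggers need human judgment and are not modelled.

```python
def escalations_due(severity, minutes_unresolved, minutes_since_progress=0):
    """Return how many automatic escalation steps should have fired so far."""
    if severity == "SEV1":
        # SEV1 escalates every 30 minutes while unresolved.
        return minutes_unresolved // 30
    if severity == "SEV2":
        # SEV2 escalates once after 2 hours without significant progress.
        return 1 if minutes_since_progress >= 120 else 0
    return 0


print(escalations_due("SEV1", 65))
```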

Manual Escalation

Incident Commander can escalate at any time
Technical leads can request escalation
Business stakeholders can request severity review
External factors (media attention, regulatory) trigger escalation

Communication Templates

SEV1 Executive Alert

Subject: 🚨 CRITICAL INCIDENT - [Service] Complete Outage

URGENT: Customer-facing service outage requiring immediate attention

Service: [Service Name]
Start Time: [Timestamp]
Impact: [Description of customer impact]
Estimated Affected Users: [Number/Percentage]
Business Impact: [Revenue/SLA/Brand implications]

Incident Commander: [Name] ([Contact])
Response Team: [Team members engaged]

Current Status: [Brief status update]
Next Update: [Timestamp - 15 minutes from now]
War Room: [Bridge/Chat link]

This is a customer-impacting incident requiring executive awareness.

SEV2 Major Impact

Subject: ⚠️ [SEV2] [Service] - Major Performance Impact

Major service degradation affecting user experience

Service: [Service Name]
Start Time: [Timestamp] 
Impact: [Description of user impact]
Scope: [Affected functionality/users]

Response Team: [Team Lead] + [Team members]
Status: [Current mitigation efforts]
Workaround: [If available]

Next Update: 30 minutes
Status Page: [Link if updated]

Review and Updates

This severity matrix should be reviewed quarterly and updated based on:

Incident response learnings
Business priority changes
Service architecture evolution
Regulatory requirement changes
Customer feedback and SLA updates

Last Updated: February 2026
Next Review: May 2026
Owner: Engineering Leadership

Incident Response Framework Reference

Production-grade incident management knowledge base synthesizing PagerDuty, Google SRE, and Atlassian methodologies into a unified, opinionated framework. This document is the source of truth for incident commanders operating under pressure.

---

1. Industry Framework Comparison

PagerDuty Incident Response Model

PagerDuty's open-source incident response process defines four core roles and six process phases. The model prioritizes speed of mobilization over process perfection.

Roles:

Incident Commander (IC): Owns the incident end-to-end. Does NOT perform technical investigation. Delegates, coordinates, and makes final escalation decisions. The IC is the single point of authority; conflicting opinions are resolved by the IC, not by committee.
Scribe: Captures timestamped decisions, actions, and findings in the incident channel. The scribe never participates in technical work. A good scribe reduces postmortem preparation time by 70%.
Subject Matter Expert (SME): Pulled in on-demand for specific subsystems. SMEs report findings to the IC, not to each other. Parallel SME investigations must be coordinated through the IC to avoid duplicated effort.
Customer Liaison: Owns all outbound customer communication. Drafts status page updates for IC approval. Shields the technical team from inbound customer inquiries during active incidents.

Process Phases: Detect, Triage, Mobilize, Mitigate, Resolve, Postmortem.

Communication Protocol: PagerDuty mandates a dedicated Slack channel per incident, a bridge call for SEV1/SEV2, and status updates at fixed cadences (every 15 min for SEV1, every 30 min for SEV2). All decisions are announced in the channel, never in DMs or side threads.

Google SRE: Managing Incidents (Chapter 14)

Google's SRE model, documented in Site Reliability Engineering (O'Reilly, 2016), emphasizes role separation and clear handoffs as the primary mechanisms for preventing incident chaos.

Key Principles:

Operational vs. Communication Tracks: Google splits incident work into two parallel tracks. The operational track handles technical mitigation. The communication track handles stakeholder updates, executive briefings, and customer notifications. These tracks run independently with the IC bridging them.
Role Separation is Non-Negotiable: The person debugging the system must never be the person updating stakeholders. Cognitive load from context-switching between technical work and communication degrades both outputs. Google measured a 40% increase in mean-time-to-resolution (MTTR) when a single person attempted both.
Clear Handoffs: When an IC rotates out (recommended every 60-90 minutes for SEV1), the handoff includes: current status summary, active hypotheses, pending actions, and escalation state. Handoffs happen on the bridge call, not asynchronously.
Defined Command Post: All communication flows through a single channel. Google uses the term "command post" -- a virtual or physical location where all incident participants converge.
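
The handoff contents named under Clear Handoffs (current status summary, active hypotheses, pending actions, and escalation state) can be captured as a small record so nothing is dropped when an IC rotates out. This is a sketch under assumptions: the `ICHandoff` class and its field names are hypothetical, and treating an empty field as "needs explicit confirmation" is a design choice rather than something the source prescribes. It requires Python 3.7+ for `dataclasses`.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ICHandoff:
    status_summary: str
    active_hypotheses: List[str] = field(default_factory=list)
    pending_actions: List[str] = field(default_factory=list)
    escalation_state: str = ""

    def missing_fields(self):
        """List handoff items that are empty and should be confirmed on the bridge."""
        missing = []
        if not self.status_summary.strip():
            missing.append("status_summary")
        if not self.active_hypotheses:
            missing.append("active_hypotheses")
        if not self.pending_actions:
            missing.append("pending_actions")
        if not self.escalation_state.strip():
            missing.append("escalation_state")
        return missing


handoff = ICHandoff(
    status_summary="Database failover in progress",
    active_hypotheses=["Connection pool exhaustion"],
    escalation_state="VP Engineering notified",
)
print(handoff.missing_fields())
```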

Atlassian Incident Management Model

Atlassian's model, published in their Incident Management Handbook, is severity-driven and template-heavy. It favors structured playbooks over improvisation.

Key Characteristics:

Severity Levels Drive Everything: The assigned severity determines who gets paged, what communication templates are used, response time SLAs, and postmortem requirements. Severity is assigned at triage and reassessed every 30 minutes.
Handbook-Driven Approach: Atlassian maintains runbooks for every known failure mode. During incidents, responders follow documented playbooks before improvising. This reduces MTTR for known issues by 50-60% but requires significant upfront investment in documentation.
Communication Templates: Pre-written templates for status page updates, customer emails, and executive summaries. Templates include severity-specific language and are reviewed quarterly. This eliminates wordsmithing during active incidents.
Values-Based Decisions: When runbooks do not cover the situation, Atlassian defaults to a decision hierarchy: (1) protect customer data, (2) restore service, (3) preserve evidence for root cause analysis.

Framework Comparison Table

| Dimension | PagerDuty | Google SRE | Atlassian |
|-----------|-----------|------------|-----------|
| Primary strength | Speed of mobilization | Role separation discipline | Structured playbooks |
| IC authority model | IC has final say | IC coordinates, escalates to VP if blocked | IC follows handbook, escalates if off-script |
| Communication style | Dedicated channel + bridge | Command post with dual tracks | Template-driven status updates |
| Handoff protocol | Informal | Formal on-call handoff script | Rotation policy in handbook |
| Postmortem requirement | All SEV1/SEV2 | All incidents | SEV1/SEV2 mandatory, SEV3 optional |
| Best for | Fast-moving startups | Large-scale distributed systems | Regulated or process-heavy orgs |
| Weakness | Under-documented for edge cases | Heavyweight for small teams | Rigid, slow to adapt to novel failures |

When to Use Which Framework

Teams under 20 engineers: Start with PagerDuty's model. It is lightweight and prescriptive enough to work without heavy process investment. Add Atlassian-style runbooks as you identify recurring failure modes.
Teams running 50+ microservices: Adopt Google SRE's dual-track model. The operational/communication split becomes critical when incidents span multiple teams and subsystems.
Regulated industries (finance, healthcare, government): Use Atlassian's handbook-driven approach as the foundation. Regulatory auditors expect documented procedures, and templates satisfy compliance requirements for incident communication records.
Hybrid (recommended for most teams at scale): Use PagerDuty's role definitions, Google's track separation, and Atlassian's template library. This is the approach codified in the rest of this document.

---

2. Severity Definitions

Severity Classification Matrix

| Severity | Impact | Response Time | Update Cadence | Escalation Trigger | Example |
|----------|--------|---------------|----------------|--------------------|---------|
| SEV1 | Total service outage or data breach affecting all users. Revenue loss exceeding $10K/hour. Security incident with active exfiltration. | Page IC + on-call within 5 min. All hands mobilized within 15 min. | Every 15 min to stakeholders. Continuous updates in incident channel. | Immediate executive notification. Board notification for data breaches. | Primary database cluster down. Payment processing system offline. Active ransomware attack. |
| SEV2 | Major feature degraded for >30% of users. Revenue impact $1K-$10K/hour. Data integrity concerns without confirmed loss. | IC assigned within 15 min. Responders mobilized within 30 min. | Every 30 min to stakeholders. Every 15 min in incident channel. | Executive notification if unresolved after 1 hour. Upgrade to SEV1 if impact expands. | Search functionality returning errors for 40% of queries. Checkout flow failing intermittently. Authentication latency exceeding 10s. |
| SEV3 | Minor feature degraded or non-critical service impaired. Workaround available. No direct revenue impact. | Acknowledged within 1 hour. Investigation started within 4 hours. | Every 2 hours to stakeholders if actively worked. Daily if deferred. | Escalate to SEV2 if workaround fails or user complaints exceed 50 in 1 hour. | Admin dashboard loading slowly. Email notifications delayed by 30+ minutes. Non-critical API endpoint returning 5xx for <5% of requests. |
| SEV4 | Cosmetic issue, minor bug, or internal tooling degradation. No user-facing impact or negligible impact. | Acknowledged within 1 business day. Prioritized against backlog. | No scheduled updates. Tracked in issue tracker. | Escalate to SEV3 if internal productivity impact exceeds 2 hours/day across team. | Logging pipeline dropping non-critical debug logs. Internal metrics dashboard showing stale data. Minor UI alignment issue on one browser. |
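
The matrix thresholds can be sketched as a small triage function. This is a minimal illustration, not the skill's incident_classifier.py: the `ImpactSnapshot` fields and the `classify_severity` name are hypothetical, and the SEV3/SEV4 split is reduced to a single "user-facing" flag as a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class ImpactSnapshot:
    """Hypothetical triage inputs; field names are illustrative only."""
    pct_users_affected: float      # 0-100
    revenue_loss_per_hour: float   # USD per hour
    total_outage: bool = False
    user_facing: bool = True


def classify_severity(impact: ImpactSnapshot) -> str:
    """Map an impact snapshot to SEV1-SEV4 using the matrix thresholds."""
    # SEV1: total outage, or revenue loss exceeding $10K/hour.
    if impact.total_outage or impact.revenue_loss_per_hour > 10_000:
        return "SEV1"
    # SEV2: >30% of users degraded, or revenue impact of $1K-$10K/hour.
    if impact.pct_users_affected > 30 or impact.revenue_loss_per_hour >= 1_000:
        return "SEV2"
    # SEV3: some user-facing degradation without direct revenue impact.
    if impact.user_facing:
        return "SEV3"
    # SEV4: cosmetic or internal-only issues.
    return "SEV4"
```

When the inputs are uncertain, the triage guidance in this document still applies: assign the higher severity and downgrade later.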

Customer-Facing Signals by Severity

SEV1 Signals: Support ticket volume spikes >500% of baseline within 15 minutes. Social media mentions of outage trend upward. Revenue dashboards show >95% drop in transaction volume. Multiple monitoring systems alarm simultaneously.

SEV2 Signals: Support ticket volume spikes 100-500% of baseline. Specific feature-related complaints cluster in support channels. Partial transaction failures visible in payment dashboards. Single monitoring system shows sustained alerting.

SEV3 Signals: Sporadic support tickets with a common pattern (under 20/hour). Users report intermittent issues with workarounds. Monitoring shows degraded but not critical metrics.

SEV4 Signals: Internal team notices issue during routine work. Occasional user mention with no pattern or urgency. Monitoring shows minor anomaly within acceptable thresholds.

Severity Upgrade and Downgrade Criteria

Upgrade from SEV2 to SEV1: Impact expands to >80% of users, revenue impact confirmed above $10K/hour, data integrity compromise confirmed, or mitigation attempt fails after 45 minutes.

Downgrade from SEV1 to SEV2: Partial mitigation restores service for >70% of users, revenue impact drops below $10K/hour, and no ongoing data integrity concern.

Downgrade from SEV2 to SEV3: Workaround deployed and communicated, impact limited to <10% of users, and no revenue impact.

Severity changes must be announced by the IC in the incident channel with justification. The scribe logs the timestamp and rationale.
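
The upgrade and downgrade criteria above can be expressed as a reassessment helper. This is a sketch under stated assumptions: the function name and parameters are hypothetical, "restores service for >70% of users" is interpreted as fewer than 30% of users still affected, and "mitigation attempt fails after 45 minutes" is interpreted as 45 or more minutes of unsuccessful mitigation.

```python
def reassess_severity(current: str,
                      pct_users_affected: float,
                      revenue_loss_per_hour: float,
                      integrity_compromised: bool = False,
                      minutes_mitigation_failing: int = 0,
                      workaround_communicated: bool = False) -> str:
    """Return the severity the IC should consider announcing next."""
    if current == "SEV2":
        # Any single upgrade condition is enough to move to SEV1.
        if (pct_users_affected > 80
                or revenue_loss_per_hour > 10_000
                or integrity_compromised
                or minutes_mitigation_failing >= 45):
            return "SEV1"
        # All downgrade conditions must hold to move to SEV3.
        if (workaround_communicated
                and pct_users_affected < 10
                and revenue_loss_per_hour == 0):
            return "SEV3"
    elif current == "SEV1":
        # All downgrade conditions must hold to move to SEV2.
        if (pct_users_affected < 30
                and revenue_loss_per_hour < 10_000
                and not integrity_compromised):
            return "SEV2"
    return current
```

The helper only suggests a level; the IC still announces the change with justification and the scribe logs it.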

---

3. Role Definitions

Incident Commander (IC)

The IC is the single decision-maker during an incident. This role exists to eliminate decision-by-committee, which adds 20-40 minutes to MTTR in measured studies.

Responsibilities:

Assign severity level at triage (reassess every 30 minutes)
Assign all other incident roles
Approve status page updates before publication
Make go/no-go decisions on mitigation strategies (rollback, feature flag, scaling)
Decide when to escalate to executive leadership
Declare incident resolved and initiate postmortem scheduling

Decision Authority: The IC can authorize rollbacks, page any team member regardless of org chart, approve customer communications, and override objections from individual contributors during active mitigation. The IC cannot approve financial expenditures above $50K or public press statements -- those require VP/C-level approval.

What the IC Must NOT Do: Debug code, write queries, SSH into production servers, or perform any hands-on technical work. The moment an IC starts debugging, incident coordination degrades. If the IC is the only person with domain expertise, they must hand off IC duties before engaging technically.

Communications Lead

Responsibilities:

Draft all status page updates using severity-appropriate templates
Coordinate with Customer Liaison on outbound customer messaging
Maintain the executive summary document (updated every 30 min for SEV1/SEV2)
Manage the stakeholder notification list and delivery
Post scheduled updates even when there is no new information ("We are continuing to investigate" is a valid update)

Operations Lead

Responsibilities:

Coordinate technical investigation across SMEs
Maintain the running hypothesis list and assign investigation tasks
Report technical findings to the IC in plain language
Execute mitigation actions approved by the IC
Track parallel workstreams and prevent duplicated effort

Scribe

Responsibilities:

Maintain a timestamped log of all decisions, actions, and findings
Document who said what and when in the incident channel
Capture rollback decisions, hypothesis changes, and escalation triggers
Produce the initial postmortem timeline (saves 2-4 hours of postmortem prep)

Subject Matter Experts (SMEs)

SMEs are paged on-demand by the IC for specific subsystems. They report findings to the Operations Lead, not directly to stakeholders. An SME who identifies a potential fix proposes it to the IC for approval before executing. SMEs are released from the incident explicitly by the IC when their subsystem is cleared.

Customer Liaison

Owns the customer-facing voice during the incident. Monitors support channels for inbound customer reports. Drafts customer notification emails. Updates the public status page (after IC approval). Shields the technical team from direct customer inquiries during active mitigation.

---

4. Communication Protocols

Incident Channel Naming Convention

Format: #inc-YYYYMMDD-brief-desc

Examples:

#inc-20260216-payment-api-timeout
#inc-20260216-db-primary-failover
#inc-20260216-auth-service-degraded

Channel topic must include: severity, IC name, bridge call link, status page link. Example topic: SEV1 | IC: @jane.smith | Bridge: https://meet.example.com/inc-20260216 | Status: https://status.example.com
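
A small helper can keep channel names and topics consistent with the convention above. The function names are hypothetical, and the slug rule (lowercase, non-alphanumeric runs collapsed to hyphens) is an assumption inferred from the examples.

```python
import re
from datetime import date


def incident_channel_name(day: date, description: str) -> str:
    """Build a channel name in the #inc-YYYYMMDD-brief-desc format."""
    # Collapse anything that is not a lowercase letter or digit into a hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"#inc-{day:%Y%m%d}-{slug}"


def channel_topic(severity: str, ic_handle: str,
                  bridge_url: str, status_url: str) -> str:
    """Build the channel topic with the four required fields."""
    return (f"{severity} | IC: @{ic_handle} | "
            f"Bridge: {bridge_url} | Status: {status_url}")
```

For example, `incident_channel_name(date(2026, 2, 16), "Payment API timeout")` produces the first channel name listed above.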

Internal Status Update Templates

SEV1/SEV2 Update Template (posted in incident channel and executive Slack channel):

INCIDENT UPDATE - [SEV1/SEV2] - [HH:MM UTC]
Status: [Investigating | Identified | Mitigating | Resolved]
Impact: [Specific user-facing impact in plain language]
Current Action: [What is actively being done right now]
Next Update: [HH:MM UTC]
IC: @[name]
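
The template above can be filled in programmatically so the "Next Update" time always follows the stakeholder cadence from Section 2 (15 minutes for SEV1, 30 minutes for SEV2). This is an illustrative sketch; `render_update` and its parameters are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

# Stakeholder update cadence in minutes, taken from the severity matrix.
STAKEHOLDER_CADENCE_MIN = {"SEV1": 15, "SEV2": 30}


def render_update(severity: str, status: str, impact: str,
                  action: str, ic: str, now: datetime) -> str:
    """Render the SEV1/SEV2 internal update; `now` is expected in UTC."""
    next_update = now + timedelta(minutes=STAKEHOLDER_CADENCE_MIN[severity])
    return "\n".join([
        f"INCIDENT UPDATE - {severity} - {now:%H:%M} UTC",
        f"Status: {status}",
        f"Impact: {impact}",
        f"Current Action: {action}",
        f"Next Update: {next_update:%H:%M} UTC",
        f"IC: @{ic}",
    ])


if __name__ == "__main__":
    print(render_update("SEV1", "Investigating",
                        "Checkout requests are failing",
                        "Rolling back the latest deploy",
                        "jane.smith",
                        datetime(2026, 2, 16, 14, 50, tzinfo=timezone.utc)))
```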

Executive Summary Template (for SEV1, updated every 30 min):

EXECUTIVE SUMMARY - [Incident Title] - [HH:MM UTC]
Severity: SEV1
Duration: [X hours Y minutes]
Customer Impact: [Number of affected users/transactions]
Revenue Impact: [Estimated $ if known, "assessing" if not]
Current Status: [One sentence]
Mitigation ETA: [Estimated time or "unknown"]
Next Escalation Point: [What triggers executive action]

Status Page Update Templates

SEV1 Initial Post:

Title: [Service Name] - Service Disruption
Body: We are currently experiencing a disruption affecting [service/feature].
Users may encounter [specific symptom: errors, timeouts, inability to access].
Our engineering team has been mobilized and is actively investigating.
We will provide an update within 15 minutes.

SEV1 Update (mitigation in progress):

Title: [Service Name] - Service Disruption (Update)
Body: We have identified the cause of the disruption affecting [service/feature]
and are implementing a fix. Some users may continue to experience [symptom].
We expect to have an update on resolution within [X] minutes.

SEV1 Resolution:

Title: [Service Name] - Resolved
Body: The disruption affecting [service/feature] has been resolved as of [HH:MM UTC].
Service has been restored to normal operation. Users should no longer experience
[symptom]. We will publish a full incident report within 48 hours.
We apologize for the inconvenience.

SEV2 Initial Post:

Title: [Service Name] - Degraded Performance
Body: We are investigating reports of degraded performance affecting [feature].
Some users may experience [specific symptom]. A workaround is [available/not yet available].
Our team is actively investigating and we will provide an update within 30 minutes.

Bridge Call / War Room Etiquette

1. Mute by default. Unmute only when speaking to the IC or Operations Lead.
2. Identify yourself before speaking. "This is [name] from [team]." Every time.
3. State findings, then recommendations. "Database replication lag is 45 seconds and climbing. I recommend we fail over to the secondary cluster."
4. IC confirms before action. No unilateral action on production systems during an incident. The IC says "approved" or "hold" before anyone executes.
5. No side conversations. If two SMEs need to discuss a hypothesis, they take it to a breakout channel and report back findings to the main bridge.
6. Time-box debugging. The IC sets 15-minute timers for investigation threads. If a hypothesis is not confirmed or denied in 15 minutes, pivot to the next hypothesis or escalate.

Customer Notification Templates

SEV1 Customer Email (B2B, enterprise accounts):

Subject: [Company Name] Service Incident - [Date]

Dear [Customer Name],

We are writing to inform you of a service incident affecting [product/service]
that began at [HH:MM UTC] on [date].

Impact: [Specific impact to this customer's usage]
Current Status: [Brief status]
Expected Resolution: [ETA if known, or "We are working to resolve this as quickly as possible"]

We will continue to provide updates every [15/30] minutes until resolution.
Your dedicated account team is available at [contact info] for any questions.

Sincerely,
[Name], [Title]

---

5. Escalation Matrix

Escalation Tiers

Tier 1 - Within Team (0-15 minutes): On-call engineer investigates. If the issue is within the team's domain and matches a known runbook, resolve without escalation. Page the IC if severity is SEV2 or higher, or if the issue is not resolved within 15 minutes.

Tier 2 - Cross-Team (15-45 minutes): IC pages SMEs from adjacent teams. Common cross-team escalations: database team for replication issues, networking team for connectivity failures, security team for suspicious activity. Cross-team SMEs join the incident channel and bridge call.

Tier 3 - Executive (45+ minutes or immediate for SEV1): VP of Engineering notified for all SEV1 incidents immediately. CTO notified if SEV1 exceeds 1 hour without mitigation progress. CEO notified if SEV1 involves data breach or regulatory implications. Executive involvement is for resource allocation and external communication decisions, not technical direction.

Time-Based Escalation Triggers

| Elapsed Time | SEV1 Action | SEV2 Action |
|--------------|-------------|-------------|
| 0 min | Page IC + all on-call. Notify VP Eng. | Page IC + primary on-call. |
| 15 min | Confirm all roles staffed. Open bridge call. | IC assesses if additional SMEs needed. |
| 30 min | If no mitigation path identified, page backup on-call for all related services. | First stakeholder update. Reassess severity. |
| 45 min | Escalate to CTO if no progress. Consider customer notification. | If no progress, consider escalating to SEV1. |
| 60 min | CTO briefing. Initiate customer notification if not already done. | Notify VP Eng. Page cross-team SMEs. |
| 90 min | IC rotation (fresh IC takes over). Reassess all hypotheses. | IC rotation if needed. |
| 120 min | CEO briefing if data breach or regulatory risk. External PR team engaged. | Escalate to SEV1 if impact has not decreased. |
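
The time-based triggers lend themselves to a lookup that tells the IC which actions are already due and which comes next. The sketch below transcribes the table rows; the `ESCALATION_TRIGGERS`, `due_actions`, and `next_trigger` names are hypothetical.

```python
# (elapsed minutes, action) pairs transcribed from the table above.
ESCALATION_TRIGGERS = {
    "SEV1": [
        (0, "Page IC + all on-call. Notify VP Eng."),
        (15, "Confirm all roles staffed. Open bridge call."),
        (30, "If no mitigation path identified, page backup on-call for all related services."),
        (45, "Escalate to CTO if no progress. Consider customer notification."),
        (60, "CTO briefing. Initiate customer notification if not already done."),
        (90, "IC rotation (fresh IC takes over). Reassess all hypotheses."),
        (120, "CEO briefing if data breach or regulatory risk. External PR team engaged."),
    ],
    "SEV2": [
        (0, "Page IC + primary on-call."),
        (15, "IC assesses if additional SMEs needed."),
        (30, "First stakeholder update. Reassess severity."),
        (45, "If no progress, consider escalating to SEV1."),
        (60, "Notify VP Eng. Page cross-team SMEs."),
        (90, "IC rotation if needed."),
        (120, "Escalate to SEV1 if impact has not decreased."),
    ],
}


def due_actions(severity: str, elapsed_min: int) -> list:
    """Return every action whose trigger time has already passed."""
    return [action for threshold, action in ESCALATION_TRIGGERS[severity]
            if threshold <= elapsed_min]


def next_trigger(severity: str, elapsed_min: int):
    """Return the next (minutes, action) pair, or None after the last one."""
    for threshold, action in ESCALATION_TRIGGERS[severity]:
        if threshold > elapsed_min:
            return threshold, action
    return None
```

A scribe bot or reminder job could call `next_trigger` to post a heads-up a few minutes before each threshold.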

Escalation Path Examples

Database failover failure: On-call DBA (Tier 1, 0-15 min) -> IC + DBA team lead (Tier 2, 15 min) -> Infrastructure VP + cloud provider support (Tier 3, 45 min)

Payment processing outage: On-call payments engineer (Tier 1, 0-5 min) -> IC + payments team lead + payment provider liaison (Tier 2, 5 min, immediate due to revenue impact) -> CFO + VP Eng (Tier 3, 15 min if provider-side issue confirmed)

Security incident (suspected breach): Security on-call (Tier 1, 0-5 min) -> CISO + IC + legal counsel (Tier 2, immediate) -> CEO + external incident response firm (Tier 3, within 1 hour if breach confirmed)

On-Call Rotation Best Practices

Primary + secondary on-call for every critical service. Secondary is paged automatically if primary does not acknowledge within 5 minutes.
On-call shifts are 7 days maximum. Longer rotations degrade alertness and response quality.
Handoff checklist: Current open issues, recent deploys in the last 48 hours, known risks or maintenance windows, escalation contacts for dependent services.
On-call load budget: No more than 2 pages per night on average, measured weekly. Exceeding this indicates systemic reliability issues that must be addressed with engineering investment, not heroic on-call effort.
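
Two of these practices reduce to simple checks: the 5-minute acknowledgement timeout and the weekly page budget. The sketch below assumes the paging tool exposes page timestamps and nightly page counts; the function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Sequence


def should_page_secondary(paged_at: datetime, now: datetime,
                          acknowledged: bool,
                          ack_timeout: timedelta = timedelta(minutes=5)) -> bool:
    """True once the primary has left a page unacknowledged for 5 minutes."""
    return not acknowledged and now - paged_at >= ack_timeout


def exceeds_page_budget(pages_per_night: Sequence[int],
                        budget: float = 2.0) -> bool:
    """True if the average nightly page count for the week is over budget."""
    if not pages_per_night:
        return False
    return sum(pages_per_night) / len(pages_per_night) > budget
```

A week that trips `exceeds_page_budget` is a signal for engineering investment rather than a reason to ask more of the on-call engineer.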

---

6. Incident Lifecycle Phases

Phase 1: Detection

Detection comes from three sources, in order of preference:

1. Automated monitoring (preferred): Alerting rules on latency (p99 > 2x baseline), error rates (5xx > 1% of requests), saturation (CPU > 85%, memory > 90%, disk > 80%), and business metrics (transaction volume drops > 20% from 15-minute rolling average). Alerts should fire within 60 seconds of threshold breach.
2. Internal reports: An engineer notices anomalous behavior during routine work. Internal detection typically adds 5-15 minutes to response time compared to automated monitoring.
3. Customer reports: Customers contact support about issues. This is the worst detection source. If customers detect incidents before monitoring, the monitoring coverage has a gap that must be closed in the postmortem.
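
The automated-monitoring thresholds can be sketched as a single evaluation function. This is an illustration only: the metric keys, the function name, and the idea of passing baselines as arguments are assumptions, and a real deployment would express these rules in the monitoring system's own alerting language.

```python
def breached_thresholds(metrics: dict, baseline_p99_ms: float,
                        rolling_txn_avg: float) -> list:
    """Return the names of alert rules breached by one metrics sample."""
    breaches = []
    # Latency: p99 more than 2x baseline.
    if metrics["p99_ms"] > 2 * baseline_p99_ms:
        breaches.append("latency")
    # Error rate: 5xx responses above 1% of requests (expressed as a fraction).
    if metrics["error_rate_5xx"] > 0.01:
        breaches.append("error_rate")
    # Saturation: CPU > 85%, memory > 90%, disk > 80%.
    limits = {"cpu_pct": 85, "mem_pct": 90, "disk_pct": 80}
    for name, limit in limits.items():
        if metrics[name] > limit:
            breaches.append(f"saturation:{name}")
    # Business: transaction volume more than 20% below the rolling average.
    if rolling_txn_avg > 0 and metrics["txn_per_min"] < 0.8 * rolling_txn_avg:
        breaches.append("business:txn_volume")
    return breaches
```

Several simultaneous breaches from one sample match the "multiple monitoring systems alarm simultaneously" signal described for SEV1.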

Detection SLA: SEV1 incidents must be detected within 5 minutes of impact onset. If detection latency exceeds this, the postmortem must include a monitoring improvement action item.

Phase 2: Triage

The first responder performs initial triage within 5 minutes of detection:

1. Scope assessment: How many users, services, or regions are affected? Check dashboards, not assumptions.
2. Severity assignment: Use the severity matrix in Section 2. When in doubt, assign higher severity. Downgrading is cheap; delayed escalation is expensive.
3. IC assignment: For SEV1/SEV2, page the on-call IC immediately. For SEV3, the first responder may self-assign IC duties.
4. Initial hypothesis: What changed in the last 2 hours? Check deploy logs, config changes, upstream dependency status, and traffic patterns. 70% of incidents correlate with a change deployed in the prior 2 hours.

Phase 3: Mobilization

The IC executes mobilization within 10 minutes of assignment:

1. Create incident channel: #inc-YYYYMMDD-brief-desc. Set topic with severity, IC name, bridge link.
2. Assign roles: Communications Lead, Operations Lead, Scribe. For SEV3/SEV4, the IC may cover multiple roles.
3. Open bridge call (SEV1/SEV2): Share link in incident channel. All responders join within 5 minutes.
4. Post initial summary: Current understanding, affected services, assigned roles, first actions.
5. Notify stakeholders: Page dependent teams. Notify customer support leadership. For SEV1, notify executive chain per escalation matrix.

Phase 4: Investigation

Investigation runs as parallel workstreams coordinated by the Operations Lead:

Workstream discipline: Each SME investigates one hypothesis at a time. The Operations Lead tracks active hypotheses on a shared list. Completed investigations report: confirmed, denied, or inconclusive.
Hypothesis testing priority: (1) Recent changes (deploys, configs, feature flags), (2) Upstream dependency failures, (3) Capacity exhaustion, (4) Data corruption, (5) Security compromise.
15-minute rule: If a hypothesis is not confirmed or denied within 15 minutes, the IC decides whether to continue, pivot, or escalate. Unbounded investigation is the leading cause of extended MTTR.
Evidence collection: Screenshots, log snippets, metric graphs, and query results are posted in the incident channel, not described verbally. The scribe tags evidence with timestamps.
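
The shared hypothesis list and the 15-minute rule can be modeled with a small record type. This is a minimal sketch; the `Hypothesis` class and `needs_ic_decision` helper are hypothetical names, and the outcome values mirror the "confirmed, denied, or inconclusive" reporting described above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Hypothesis:
    """One entry on the Operations Lead's shared hypothesis list."""
    description: str
    owner: str
    started_at: datetime
    outcome: str = "open"  # open | confirmed | denied | inconclusive


def needs_ic_decision(hypotheses, now: datetime,
                      time_box: timedelta = timedelta(minutes=15)):
    """Return open hypotheses that have used up their 15-minute time box."""
    return [h for h in hypotheses
            if h.outcome == "open" and now - h.started_at >= time_box]
```

Each hypothesis returned is one the IC should explicitly continue, pivot away from, or escalate.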

Phase 5: Mitigation

Mitigation prioritizes restoring service over finding root cause:

Rollback first: If a deploy correlates with the incident, roll it back before investigating further. A 5-minute rollback beats a 45-minute investigation. Rollback authority rests with the IC.
Feature flags: Disable the suspected feature via feature flag if available. This is faster and less risky than a full rollback.
Scaling: If the issue is capacity-related, scale horizontally before investigating the traffic source.
Failover: If a primary system is unrecoverable, fail over to the secondary. Test failover procedures quarterly so that failing over is routine, not a gamble.
Customer workaround: If mitigation will take time, publish a workaround for customers (e.g., "Use the mobile app while we restore web access").

Mitigation verification: After applying mitigation, monitor key metrics for 15 minutes before declaring the issue mitigated. Declaring mitigation prematurely and then seeing the issue recur damages team credibility and customer trust.

Phase 6: Resolution

Resolution is declared when the root cause is addressed and service is operating normally:

Verification checklist: Error rates returned to baseline, latency returned to baseline, no ongoing customer reports, monitoring confirms stability for 30+ minutes.
Incident channel update: IC posts final status with resolution summary, total duration, and next steps.
Status page update: Post resolution notice within 15 minutes of declaring resolved.
Stand down: IC explicitly releases all responders. SMEs return to normal work. Bridge call is closed.
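
The verification checklist can be encoded as a gate the IC consults before declaring resolution. This is a sketch under assumptions: the function name is hypothetical, and the 10% tolerance used to decide that a metric has "returned to baseline" is an illustrative choice that the checklist itself does not specify.

```python
def resolution_verified(error_rate: float, baseline_error_rate: float,
                        p99_ms: float, baseline_p99_ms: float,
                        open_customer_reports: int,
                        stable_minutes: int,
                        tolerance: float = 1.10) -> bool:
    """True when every item on the verification checklist is satisfied."""
    return (
        # Error rate back to baseline (within the assumed tolerance).
        error_rate <= baseline_error_rate * tolerance
        # Latency back to baseline (within the assumed tolerance).
        and p99_ms <= baseline_p99_ms * tolerance
        # No ongoing customer reports.
        and open_customer_reports == 0
        # Monitoring confirms stability for 30+ minutes.
        and stable_minutes >= 30
    )
```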

Phase 7: Postmortem

Postmortem is mandatory for SEV1 and SEV2. Optional but recommended for SEV3. Never conducted for SEV4.

Timeline: Postmortem document drafted within 24 hours. Postmortem meeting held within 72 hours (3 business days). Action items assigned and tracked in the team's issue tracker.
Blameless standard: The postmortem examines systems, processes, and tools -- not individual performance. "Why did the system allow this?" not "Why did [person] do this?"
Required sections: Timeline (from scribe's log), root cause analysis (using 5 Whys or fault tree), impact summary (users, revenue, duration), what went well, what went poorly, action items with owners and due dates.
Action items and recurrence: Every postmortem produces 3-7 concrete action items. Items without owners and due dates are not action items. Teams should close 80%+ within 30 days. If the same root cause appears in two postmortems within 6 months, escalate to engineering leadership as a systemic reliability investment area.
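
The action-item rules above (3-7 items, each with an owner and due date, 80%+ closed within 30 days) can be checked mechanically. The sketch below is illustrative and is not the skill's pir_generator.py; the `ActionItem` class and both helper functions are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class ActionItem:
    title: str
    owner: Optional[str]
    due: Optional[date]
    closed_on: Optional[date] = None


def validate_action_items(items) -> list:
    """Return a list of problems; an empty list means the set is acceptable."""
    problems = []
    if not 3 <= len(items) <= 7:
        problems.append(f"expected 3-7 action items, found {len(items)}")
    for item in items:
        # Items without an owner and a due date are not action items.
        if not item.owner or item.due is None:
            problems.append(f"'{item.title}' has no owner or due date")
    return problems


def closure_rate(items, postmortem_date: date, window_days: int = 30) -> float:
    """Fraction of items closed within the window after the postmortem."""
    if not items:
        return 0.0
    deadline = postmortem_date + timedelta(days=window_days)
    closed = sum(1 for i in items
                 if i.closed_on is not None and i.closed_on <= deadline)
    return closed / len(items)
```

A `closure_rate` below 0.8 at the 30-day mark indicates the team is missing the target stated above.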

incident-commander reference

Reference Information

Architecture Diagram: {link}
Monitoring Dashboard: {link}
Related Runbooks: {links to dependent service runbooks}


### Post-Incident Review (PIR) Framework

#### PIR Timeline and Ownership

**Timeline:**
- **24 hours:** Initial PIR draft completed by Incident Commander
- **3 business days:** Final PIR published with all stakeholder input
- **1 week:** Action items assigned with owners and due dates
- **4 weeks:** Follow-up review on action item progress

**Roles:**
- **PIR Owner:** Incident Commander (can delegate writing but owns completion)
- **Technical Contributors:** All engineers involved in response
- **Review Committee:** Engineering leadership, affected product teams
- **Action Item Owners:** Assigned based on expertise and capacity

#### Root Cause Analysis Frameworks

#### 1. Five Whys Method

The Five Whys technique involves asking "why" repeatedly to drill down to root causes:

**Example Application:**
- **Problem:** Database became unresponsive during peak traffic
- **Why 1:** Why did the database become unresponsive? → Connection pool was exhausted
- **Why 2:** Why was the connection pool exhausted? → Application was creating more connections than usual
- **Why 3:** Why was the application creating more connections? → New feature was not using the connection pool correctly
- **Why 4:** Why was the feature not using the connection pool correctly? → Code review missed this pattern
- **Why 5:** Why did code review miss this? → No automated checks for connection pooling patterns

**Best Practices:**
- Ask "why" at least 3 times; 5 or more iterations are often needed
- Focus on process failures, not individual blame
- Each "why" should point to an actionable system improvement
- Consider multiple root cause paths, not just one linear chain

#### 2. Fishbone (Ishikawa) Diagram

Systematic analysis across multiple categories of potential causes:

**Categories:**
- **People:** Training, experience, communication, handoffs
- **Process:** Procedures, change management, review processes
- **Technology:** Architecture, tooling, monitoring, automation
- **Environment:** Infrastructure, dependencies, external factors

**Application Method:**
1. State the problem clearly at the "head" of the fishbone
2. For each category, brainstorm potential contributing factors
3. For each factor, ask what caused that factor (sub-causes)
4. Identify the factors most likely to be root causes
5. Validate root causes with evidence from the incident

#### 3. Timeline Analysis

Reconstruct the incident chronologically to identify decision points and missed opportunities:

**Timeline Elements:**
- **Detection:** When was the issue first observable? When was it first detected?
- **Notification:** How quickly were the right people informed?
- **Response:** What actions were taken and how effective were they?
- **Communication:** When were stakeholders updated?
- **Resolution:** What finally resolved the issue?

**Analysis Questions:**
- Where were there delays and what caused them?
- What decisions would we make differently with perfect information?
- Where did communication break down?
- What automation could have detected or resolved the issue faster?

### Escalation Paths

#### Technical Escalation

**Level 1:** On-call engineer
- **Responsibility:** Initial response and common issue resolution
- **Escalation Trigger:** Issue not resolved within SLA timeframe
- **Timeframe:** 15 minutes (SEV1), 30 minutes (SEV2)

**Level 2:** Senior engineer/Team lead
- **Responsibility:** Complex technical issues requiring deeper expertise
- **Escalation Trigger:** Level 1 requests help or timeout occurs
- **Timeframe:** 30 minutes (SEV1), 1 hour (SEV2)

**Level 3:** Engineering Manager/Staff Engineer
- **Responsibility:** Cross-team coordination and architectural decisions
- **Escalation Trigger:** Issue spans multiple systems or teams
- **Timeframe:** 45 minutes (SEV1), 2 hours (SEV2)

**Level 4:** Director of Engineering/CTO
- **Responsibility:** Resource allocation and business impact decisions
- **Escalation Trigger:** Extended outage or significant business impact
- **Timeframe:** 1 hour (SEV1), 4 hours (SEV2)

#### Business Escalation

**Customer Impact Assessment:**
- **High:** Revenue loss, SLA breaches, customer churn risk
- **Medium:** User experience degradation, support ticket volume
- **Low:** Internal tools, development impact only

**Escalation Matrix:**

| Severity | Duration | Business Escalation |
|----------|----------|-------------------|
| SEV1 | Immediate | VP Engineering |
| SEV1 | 30 minutes | CTO + Customer Success VP |
| SEV1 | 1 hour | CEO + Full Executive Team |
| SEV2 | 2 hours | VP Engineering |
| SEV2 | 4 hours | CTO |
| SEV3 | 1 business day | Engineering Manager |

### Status Page Management

#### Update Principles

1. **Transparency:** Provide factual information without speculation
2. **Timeliness:** Update within committed timeframes
3. **Clarity:** Use customer-friendly language, avoid technical jargon
4. **Completeness:** Include impact scope, status, and next update time

#### Status Categories

- **Operational:** All systems functioning normally
- **Degraded Performance:** Some users may experience slowness
- **Partial Outage:** Subset of features unavailable
- **Major Outage:** Service unavailable for most/all users
- **Under Maintenance:** Planned maintenance window

#### Update Template

{Timestamp} - {Status Category}

{Brief description of current state}

Impact: {who is affected and how}
Cause: {root cause if known, "under investigation" if not}
Resolution: {what's being done to fix it}

Next update: {specific time}

We apologize for any inconvenience this may cause.


### Action Item Framework

#### Action Item Categories

1. **Immediate Fixes**
   - Critical bugs discovered during incident
   - Security vulnerabilities exposed
   - Data integrity issues

2. **Process Improvements**
   - Communication gaps
   - Escalation procedure updates
   - Runbook additions/updates

3. **Technical Debt**
   - Architecture improvements
   - Monitoring enhancements
   - Automation opportunities

4. **Organizational Changes**
   - Team structure adjustments
   - Training requirements
   - Tool/platform investments

#### Action Item Template

Title: {Concise description of the action}
Priority: {Critical/High/Medium/Low}
Category: {Fix/Process/Technical/Organizational}
Owner: {Assigned person}
Due Date: {Specific date}
Success Criteria: {How will we know this is complete}
Dependencies: {What needs to happen first}
Related PIRs: {Links to other incidents this addresses}

Description: {Detailed description of what needs to be done and why}

Implementation Plan:
1. {Step 1}
2. {Step 2}
3. {Validation step}

Progress Updates:

{Date}: {Progress update}
{Date}: {Progress update}

How it compares

Use incident-commander for outage command and PIRs; route security breaches to incident-response instead.

FAQ

What severities does incident-commander define?

incident-commander uses SEV1 through SEV4 labels for operational availability incidents, with SEV1 requiring incident commander assignment within 5 minutes and executive notification within 15 minutes for complete outages.

Is incident-commander for security breaches?

incident-commander targets availability and reliability outages, not security forensics. Security events such as intrusion or data exfiltration should route to the separate incident-response skill aligned with NIST SP 800-61.

Is Incident Commander safe to install?

skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
