
Chaos Engineer
Design controlled failure experiments, game days, and injection frameworks so production resilience is tested before customers find the cracks.
Overview
Chaos Engineer is an agent skill used mainly in the Operate phase, and in Ship for resilience validation. It designs chaos experiments, failure injection, and game days, complete with runbooks and rollback controls.
Install
npx skills add https://github.com/jeffallan/claude-skills --skill chaos-engineer
What is this skill?
- Core workflow: system analysis → experiment design → failure injection → observation → learning loop
- Outputs include runbooks, experiment manifests, rollback procedures, and post-mortem templates
- Blast-radius controls and safety mechanisms for Chaos Monkey, Litmus, and similar tooling
- Game day planning and continuous chaos hooks for CI/CD pipelines
- Maps architecture, dependencies, critical paths, and failure modes before injecting faults
- Five workflow stages in SKILL.md: system analysis, experiment design, chaos execution, learning and improvement, and CI/CD automation
- Skill metadata version 1.1.0
Adoption & trust: 2.2k installs on skills.sh; 9.7k GitHub stars; 1/3 security scanners passed (skills.sh audits).
What problem does it solve?
You do not know whether your distributed system survives dependency outages, latency spikes, or regional failures until real users trigger them.
Who is it for?
Indie SaaS or API operators on Kubernetes or multi-service stacks preparing game days or continuous resilience testing.
Skip if: Single-process local apps with no production footprint, or teams without staging/canary paths where blast radius cannot be bounded.
When should I use this skill?
Use it when designing chaos experiments, failure injection frameworks, or game day exercises; when setting blast radius controls; when working with Chaos Monkey or Litmus Chaos; or when doing other antifragile resilience work.
What do I get? / Deliverables
You leave with documented experiments, safe injection steps, rollback paths, and post-mortem-ready findings that make production behavior more resilient.
- Chaos experiment manifests and hypotheses
- Runbooks with rollback procedures
- Post-mortem and game day templates
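As a rough illustration of what an experiment manifest with a hypothesis and rollback path might capture, here is a minimal Python sketch. The `ChaosExperiment` class, its field names, and the 10% default blast radius cap are hypothetical and chosen for illustration only; they are not part of the skill or of any chaos tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChaosExperiment:
    """Hypothetical record of one chaos experiment (illustrative fields only)."""
    name: str
    hypothesis: str
    target: str
    fault: str
    blast_radius_pct: float  # share of target instances the fault may touch
    rollback_steps: List[str] = field(default_factory=list)

    def validate(self, max_blast_radius_pct: float = 10.0) -> List[str]:
        """Return a list of problems; an empty list means the record looks runnable."""
        problems = []
        if not self.hypothesis.strip():
            problems.append("missing hypothesis")
        if not 0 < self.blast_radius_pct <= max_blast_radius_pct:
            problems.append(
                f"blast radius {self.blast_radius_pct}% exceeds cap of {max_blast_radius_pct}%"
                if self.blast_radius_pct > max_blast_radius_pct
                else "blast radius must be greater than 0%"
            )
        if not self.rollback_steps:
            problems.append("no rollback steps defined")
        return problems
```

Calling `validate()` before any injection gives a cheap gate: an experiment with an empty hypothesis, an oversized blast radius, or no rollback steps is rejected before it can touch a cluster.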
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Chaos engineering is how you run and harden live distributed systems after ship—not a one-off build task. Experiments touch deployment topology, blast radius, and infra dependencies—the core of production infrastructure operations.
Where it fits
Run a pre-launch game day that kills a dependency pod and verifies checkout still degrades gracefully.
Author an experiment manifest with max blast radius before injecting network latency in staging.
Define steady-state SLIs and alert thresholds to judge experiment success or automatic abort.
Turn post-mortem findings into backlog items for circuit breakers and retry policy fixes.
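The steady-state item above can be sketched as a small abort guard. This is a minimal Python sketch, assuming the baseline quoted in the skill's own example (p99 latency under 200 ms, error rate under 0.1%). `SteadyState`, `should_abort`, and `first_breach` are hypothetical names, and in practice the samples would come from your monitoring system rather than a hard-coded list.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class SteadyState:
    """Thresholds that define 'healthy'; defaults mirror the skill's example baseline."""
    max_p99_latency_ms: float = 200.0
    max_error_rate: float = 0.001  # 0.1% expressed as a fraction


def should_abort(p99_latency_ms: float, error_rate: float, steady: SteadyState) -> bool:
    """True when either SLI reaches or exceeds its steady-state threshold."""
    return (
        p99_latency_ms >= steady.max_p99_latency_ms
        or error_rate >= steady.max_error_rate
    )


def first_breach(
    samples: Iterable[Tuple[float, float]], steady: SteadyState
) -> Optional[int]:
    """Return the index of the first (p99_ms, error_rate) sample that breaches, or None."""
    for index, (p99_latency_ms, error_rate) in enumerate(samples):
        if should_abort(p99_latency_ms, error_rate, steady):
            return index
    return None
```

A runner could poll metrics during the experiment and trigger the scripted rollback as soon as `should_abort` returns true; `first_breach` is handy afterwards for locating the moment steady state was lost.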
How it compares
Use for planned fault injection and experiment design—not everyday application debugging or generic uptime monitoring setup alone.
Common Questions / FAQ
Who is chaos-engineer for?
Builders and small teams responsible for production distributed systems who need structured chaos experiments, game days, and resilience artifacts—not ad-hoc “break prod and see.”
When should I use chaos-engineer?
In Operate when hardening infra and failure modes; in Ship when validating resilience before major launches; and when implementing Chaos Monkey, Litmus, or CI/CD fault injection.
Is chaos-engineer safe to install?
Review the Security Audits panel on this Prism page first. The skill describes destructive testing workflows, so always enforce blast-radius limits and staging gates in your own environment, and never rely on the skill alone for safety.
Workflow Chain
Then invoke: sre engineer, kubernetes specialist
SKILL.md
# Chaos Engineer

## When to Use This Skill

- Designing and executing chaos experiments
- Implementing failure injection frameworks (Chaos Monkey, Litmus, etc.)
- Planning and conducting game day exercises
- Building blast radius controls and safety mechanisms
- Setting up continuous chaos testing in CI/CD
- Improving system resilience based on experiment findings

## Core Workflow

1. **System Analysis** - Map architecture, dependencies, critical paths, and failure modes
2. **Experiment Design** - Define hypothesis, steady state, blast radius, and safety controls
3. **Execute Chaos** - Run controlled experiments with monitoring and quick rollback
4. **Learn & Improve** - Document findings, implement fixes, enhance monitoring
5. **Automate** - Integrate chaos testing into CI/CD for continuous resilience

## Reference Guide

Load detailed guidance based on context:

| Topic | Reference | Load When |
|-------|-----------|-----------|
| Experiments | `references/experiment-design.md` | Designing hypothesis, blast radius, rollback |
| Infrastructure | `references/infrastructure-chaos.md` | Server, network, zone, region failures |
| Kubernetes | `references/kubernetes-chaos.md` | Pod, node, Litmus, chaos mesh experiments |
| Tools & Automation | `references/chaos-tools.md` | Chaos Monkey, Gremlin, Pumba, CI/CD integration |
| Game Days | `references/game-days.md` | Planning, executing, learning from game days |

## Safety Checklist

Non-obvious constraints that must be enforced on every experiment:

- **Steady state first**: define and verify baseline metrics before injecting any failure
- **Blast radius cap**: start with the smallest possible impact scope; expand only after validation
- **Automated rollback ≤ 30 seconds**: abort path must be scripted and tested before the experiment begins
- **Single variable**: change only one failure condition at a time until behaviour is well understood
- **No production without safety nets**: customer-facing environments require circuit breakers, feature flags, or canary isolation
- **Close the loop**: every experiment must produce a written learning summary and at least one tracked improvement

## Output Templates

When implementing chaos engineering, provide:

1. Experiment design document (hypothesis, metrics, blast radius)
2. Implementation code (failure injection scripts/manifests)
3. Monitoring setup and alert configuration
4. Rollback procedures and safety controls
5. Learning summary and improvement recommendations

## Concrete Example: Pod Failure Experiment (Litmus Chaos)

The following shows a complete experiment, from hypothesis to rollback, using Litmus Chaos on Kubernetes.

### Step 1: Define steady state and apply the experiment

```bash
# Verify baseline: p99 latency < 200ms, error rate < 0.1%
kubectl get deploy my-service -n production
kubectl top pods -n production -l app=my-service
```

### Step 2: Create and apply a Litmus ChaosEngine manifest

```yaml
# chaos-pod-delete.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: my-servic