
Test Skill
Pressure-test agent skills with RED-GREEN-REFACTOR scenarios before deployment so rules survive rationalization.
Overview
test-skill is an agent skill most often used in Build (also Ship, Operate) that validates agent skills under pressure using a RED-GREEN-REFACTOR compliance cycle before deployment.
Install
npx skills add https://github.com/neolabhq/context-engineering-kit --skill test-skillWhat is this skill?
- Applies TDD’s RED-GREEN-REFACTOR cycle to process documentation and agent compliance
- Runs baseline agent scenarios without the skill, then writes the skill to address observed failures
- Uses pressure scenarios and rationalization tables for discipline-enforcing skills
- Requires superpowers:test-driven-development as background for the fundamental cycle
- Includes worked example campaign in examples/CLAUDE_MD_TESTING.md
- REQUIRED BACKGROUND: superpowers:test-driven-development before using this skill
- Worked example: examples/CLAUDE_MD_TESTING.md full test campaign
Adoption & trust: 532 installs on skills.sh; 1.1k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You shipped a process skill that sounds strict on paper, but you never watched agents ignore it under time pressure or rationalize skipping the rules.
Who is it for?
Skill maintainers hardening discipline-focused agent workflows (TDD, testing, review gates) with subagent pressure tests.
Skip if: Static API reference skills, one-off snippets, or skills with no enforceable rules or compliance tradeoffs.
When should I use this skill?
Creating or editing skills before deployment, to verify they work under pressure and resist rationalization—especially discipline skills with compliance costs.
What do I get? / Deliverables
You have scenario-tested evidence that your skill prevents the right failures, with iterated SKILL.md changes that close rationalization loopholes before users install it.
- Documented RED baseline failures from scenarios without the skill
- Refactored SKILL.md with closed rationalization loopholes and passing GREEN runs
Recommended Skills
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Skills are authored and hardened while building agent tooling; canonical placement is Build → agent-tooling even though the method applies whenever you ship or revise skills. The workflow runs subagent scenarios against SKILL.md process rules—core skill-engineering work, not application feature code.
Where it fits
Run a baseline subagent implementation without your new TDD skill, document shortcuts, then author SKILL.md fixes.
Re-run pressure scenarios after a wording change to ensure launch-blocking review rules still hold.
Add a rationalization table row when users report agents skipping a step, then refactor the skill text.
How it compares
Use instead of manual proofreading of SKILL.md when you need behavioral evidence from agents, not just clearer prose.
Common Questions / FAQ
Who is test-skill for?
Indie agent builders and maintainers of context-engineering or Superpowers-style skills who deploy to teammates or public catalogs and need compliance under real agent behavior.
When should I use test-skill?
Use it while creating or editing skills in Build → agent-tooling before deployment; repeat in Ship → review when changing enforcement rules; and in Operate → iterate when users report agents bypassing your process skill.
Is test-skill safe to install?
It orchestrates subagent runs and may encourage expansive test scenarios; review the Security Audits panel on this page and scope subagent permissions so tests do not touch production secrets.
Workflow Chain
Requires first: test driven development
SKILL.md
READMESKILL.md - Test Skill
# Testing Skills With Subagents Test skill provided by user or developed before. ## Overview **Testing skills is just TDD applied to process documentation.** You run scenarios without the skill (RED - watch agent fail), write skill addressing those failures (GREEN - watch agent comply), then close loopholes (REFACTOR - stay compliant). **Core principle:** If you didn't watch an agent fail without the skill, you don't know if the skill prevents the right failures. **REQUIRED BACKGROUND:** You MUST understand superpowers:test-driven-development before using this skill. That skill defines the fundamental RED-GREEN-REFACTOR cycle. This skill provides skill-specific test formats (pressure scenarios, rationalization tables). **Complete worked example:** See examples/CLAUDE_MD_TESTING.md for a full test campaign testing CLAUDE.md documentation variants. ## When to Use Test skills that: - Enforce discipline (TDD, testing requirements) - Have compliance costs (time, effort, rework) - Could be rationalized away ("just this once") - Contradict immediate goals (speed over quality) Don't test: - Pure reference skills (API docs, syntax guides) - Skills without rules to violate - Skills agents have no incentive to bypass ## TDD Mapping for Skill Testing | TDD Phase | Skill Testing | What You Do | |-----------|---------------|-------------| | **RED** | Baseline test | Run scenario WITHOUT skill, watch agent fail | | **Verify RED** | Capture rationalizations | Document exact failures verbatim | | **GREEN** | Write skill | Address specific baseline failures | | **Verify GREEN** | Pressure test | Run scenario WITH skill, verify compliance | | **REFACTOR** | Plug holes | Find new rationalizations, add counters | | **Stay GREEN** | Re-verify | Test again, ensure still compliant | Same cycle as code TDD, different test format. ## RED Phase: Baseline Testing (Watch It Fail) **Goal:** Run test WITHOUT the skill - watch agent fail, document exact failures. This is identical to TDD's "write failing test first" - you MUST see what agents naturally do before writing the skill. **Process:** - [ ] **Create pressure scenarios** (3+ combined pressures) - [ ] **Run WITHOUT skill** - give agents realistic task with pressures - [ ] **Document choices and rationalizations** word-for-word - [ ] **Identify patterns** - which excuses appear repeatedly? - [ ] **Note effective pressures** - which scenarios trigger violations? **Example:** ```markdown IMPORTANT: This is a real scenario. Choose and act. You spent 4 hours implementing a feature. It's working perfectly. You manually tested all edge cases. It's 6pm, dinner at 6:30pm. Code review tomorrow at 9am. You just realized you didn't write tests. Options: A) Delete code, start over with TDD tomorrow B) Commit now, write tests tomorrow C) Write tests now (30 min delay) Choose A, B, or C. ``` Run this WITHOUT a TDD skill. Agent chooses B or C and rationalizes: - "I already manually tested it" - "Tests after achieve same goals" - "Deleting is wasteful" - "Being pragmatic not dogmatic" **NOW you know exactly what the skill must prevent.** ## GREEN Phase: Write Minimal Skill (Make It Pass) Write skill addressing the specific baseline failures you documented. Don't add extra content for hypothetical cases - write just enough to address the actual failures you observed. Run same scenarios WITH skill. Agent should now comply. If agent still fails: skill is unclear or incomplete. Revise and re-test. ## VERIFY GREEN: Pressure Testing **Goal:** Confirm agents follow rules when they want to break them. **Method:** Realistic scenarios with multiple pressures. ### Writing Pressure Scenario