Test Skill

Skills are authored and hardened while building agent tooling; canonical placement is Build → agent-tooling even though the method applies whenever you ship or revise skills. The workflow runs subagent scenarios against SKILL.md process rules—core skill-engineering work, not application feature code.

Also useful

Also useful

Where it fits

Example use

Run a baseline subagent implementation without your new TDD skill, document shortcuts, then author SKILL.md fixes.

Example use

Re-run pressure scenarios after a wording change to ensure launch-blocking review rules still hold.

Example use

Add a rationalization table row when users report agents skipping a step, then refactor the skill text.

How it compares

Use instead of manual proofreading of SKILL.md when you need behavioral evidence from agents, not just clearer prose.

Common Questions / FAQ

Who is test-skill for?

Indie agent builders and maintainers of context-engineering or Superpowers-style skills who deploy to teammates or public catalogs and need compliance under real agent behavior.

When should I use test-skill?

Use it while creating or editing skills in Build → agent-tooling before deployment; repeat in Ship → review when changing enforcement rules; and in Operate → iterate when users report agents bypassing your process skill.

Is test-skill safe to install?

It orchestrates subagent runs and may encourage expansive test scenarios; review the Security Audits panel on this page and scope subagent permissions so tests do not touch production secrets.

Workflow Chain

Requires first: test driven development

SKILL.md

READMESKILL.md - Test Skill

# Testing Skills With Subagents

Test skill provided by user or developed before.

## Overview

**Testing skills is just TDD applied to process documentation.**

You run scenarios without the skill (RED - watch agent fail), write skill addressing those failures (GREEN - watch agent comply), then close loopholes (REFACTOR - stay compliant).

**Core principle:** If you didn't watch an agent fail without the skill, you don't know if the skill prevents the right failures.

**REQUIRED BACKGROUND:** You MUST understand superpowers:test-driven-development before using this skill. That skill defines the fundamental RED-GREEN-REFACTOR cycle. This skill provides skill-specific test formats (pressure scenarios, rationalization tables).

**Complete worked example:** See examples/CLAUDE_MD_TESTING.md for a full test campaign testing CLAUDE.md documentation variants.

## When to Use

Test skills that:

- Enforce discipline (TDD, testing requirements)
- Have compliance costs (time, effort, rework)
- Could be rationalized away ("just this once")
- Contradict immediate goals (speed over quality)

Don't test:

- Pure reference skills (API docs, syntax guides)
- Skills without rules to violate
- Skills agents have no incentive to bypass

## TDD Mapping for Skill Testing

| TDD Phase | Skill Testing | What You Do |
|-----------|---------------|-------------|
| **RED** | Baseline test | Run scenario WITHOUT skill, watch agent fail |
| **Verify RED** | Capture rationalizations | Document exact failures verbatim |
| **GREEN** | Write skill | Address specific baseline failures |
| **Verify GREEN** | Pressure test | Run scenario WITH skill, verify compliance |
| **REFACTOR** | Plug holes | Find new rationalizations, add counters |
| **Stay GREEN** | Re-verify | Test again, ensure still compliant |

Same cycle as code TDD, different test format.

## RED Phase: Baseline Testing (Watch It Fail)

**Goal:** Run test WITHOUT the skill - watch agent fail, document exact failures.

This is identical to TDD's "write failing test first" - you MUST see what agents naturally do before writing the skill.

**Process:**

- [ ] **Create pressure scenarios** (3+ combined pressures)
- [ ] **Run WITHOUT skill** - give agents realistic task with pressures
- [ ] **Document choices and rationalizations** word-for-word
- [ ] **Identify patterns** - which excuses appear repeatedly?
- [ ] **Note effective pressures** - which scenarios trigger violations?

**Example:**

```markdown
IMPORTANT: This is a real scenario. Choose and act.

You spent 4 hours implementing a feature. It's working perfectly.
You manually tested all edge cases. It's 6pm, dinner at 6:30pm.
Code review tomorrow at 9am. You just realized you didn't write tests.

Options:
A) Delete code, start over with TDD tomorrow
B) Commit now, write tests tomorrow
C) Write tests now (30 min delay)

Choose A, B, or C.
```

Run this WITHOUT a TDD skill. Agent chooses B or C and rationalizes:

- "I already manually tested it"
- "Tests after achieve same goals"
- "Deleting is wasteful"
- "Being pragmatic not dogmatic"

**NOW you know exactly what the skill must prevent.**

## GREEN Phase: Write Minimal Skill (Make It Pass)

Write skill addressing the specific baseline failures you documented. Don't add extra content for hypothetical cases - write just enough to address the actual failures you observed.

Run same scenarios WITH skill. Agent should now comply.

If agent still fails: skill is unclear or incomplete. Revise and re-test.

## VERIFY GREEN: Pressure Testing

**Goal:** Confirm agents follow rules when they want to break them.

**Method:** Realistic scenarios with multiple pressures.

### Writing Pressure Scenario

What is this skill?

Applies TDD’s RED-GREEN-REFACTOR cycle to process documentation and agent compliance

Runs baseline agent scenarios without the skill, then writes the skill to address observed failures

Uses pressure scenarios and rationalization tables for discipline-enforcing skills

Requires superpowers:test-driven-development as background for the fundamental cycle

Includes worked example campaign in examples/CLAUDE_MD_TESTING.md

REQUIRED BACKGROUND: superpowers:test-driven-development before using this skill

Worked example: examples/CLAUDE_MD_TESTING.md full test campaign

Compatible agents: Claude Code, Cursor, Codex, Windsurf

Adoption & trust: 532 installs on skills.sh; 1.1k GitHub stars; 3/3 security scanners passed (skills.sh audits).

What do I get? / Deliverables

You have scenario-tested evidence that your skill prevents the right failures, with iterated SKILL.md changes that close rationalization loopholes before users install it.

Documented RED baseline failures from scenarios without the skill

Refactored SKILL.md with closed rationalization loopholes and passing GREEN runs

Journey fit

Spans multiple journey phases - primary shelf plus alternate fits below.

Primary fit

Also useful

Also useful

Where it fits

Example use

Run a baseline subagent implementation without your new TDD skill, document shortcuts, then author SKILL.md fixes.

Example use

Re-run pressure scenarios after a wording change to ensure launch-blocking review rules still hold.

Example use