
Test Prompt
Prove commands, hooks, skills, and subagent instructions behave correctly before you ship them, using RED-GREEN-REFACTOR on prompts with subagents.
Overview
test-prompt is an agent skill most often used in Build (also Ship, Operate) that verifies prompts via subagent RED-GREEN-REFACTOR before deployment.
Install
npx skills add https://github.com/neolabhq/context-engineering-kit --skill test-prompt
What is this skill?
- Applies TDD RED-GREEN-REFACTOR cycle to prompt engineering
- Tests commands, hooks, skills, subagent instructions, and production LLM prompts
- Uses subagents for isolated RED runs without the prompt installed
- Requires watching agent failure without the prompt before writing fixes
- Documents related test-skill for discipline-enforcing skills specifically
Adoption & trust: 529 installs on skills.sh; 1.1k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You ship prompts and skills that look fine in chat but fail inconsistently because you never observed agent behavior without them.
Who is it for?
Builders shipping new hooks, discipline skills, subagent briefs, and production features where prompt drift or loopholes are expensive.
Skip if: you are writing throwaway one-off chat messages, or prompts you will delete before the next session and never reuse.
When should I use this skill?
Creating or editing any prompt (commands, hooks, skills, subagent instructions) to verify desired behavior before deployment.
What do I get? / Deliverables
You deploy prompts backed by observed failures and fixes, with refactor passes that close compliance loopholes.
- Documented RED failure observations from unprompted agent runs
- Revised prompt text with GREEN compliance and REFACTOR hardening notes
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Agent-tooling is the primary shelf because the skill targets authoring and hardening LLM instructions inside your agent stack. Prompt validation is part of building reliable agent capabilities, not a one-off frontend or backend task.
Where it fits
Run RED scenarios on a new commit hook before adding guardrail text in GREEN.
Test subagent instructions for doc-generation tasks so outputs stay on template.
Refactor a production user prompt after GREEN passes to block known jailbreak phrasings.
Re-test an updated skill when users report the agent ignores a formerly enforced rule.
How it compares
Use instead of manual “try it once in the UI” checks that skip isolated RED runs and refactor verification.
Common Questions / FAQ
Who is test-prompt for?
Builders maintaining Claude Code, Cursor, or Codex skills and prompts who want TDD-style evidence before agents rely on new instructions.
When should I use test-prompt?
While building agent-tooling and docs prompts; during ship testing when consistency is required; and during operate when you revise hooks or commands after production misses.
Is test-prompt safe to install?
Check the Security Audits panel on this page. Subagent test runs may need shell or network access, so scope permissions to sandboxed scenarios only.
Workflow Chain
Requires first: tdd:test-driven-development
SKILL.md
# Testing Prompts With Subagents

Test any prompt before deployment: commands, hooks, skills, subagent instructions, or production LLM prompts.

## Overview

**Testing prompts is TDD applied to LLM instructions.** Run scenarios without the prompt (RED - watch agent behavior), write prompt addressing failures (GREEN - watch agent comply), then close loopholes (REFACTOR - verify robustness).

**Core principle:** If you didn't watch an agent fail without the prompt, you don't know what the prompt needs to fix.

**REQUIRED BACKGROUND:**

- You MUST understand `tdd:test-driven-development` - defines RED-GREEN-REFACTOR cycle
- You SHOULD understand `prompt-engineering` skill - provides prompt optimization techniques

**Related skill:** See `test-skill` for testing discipline-enforcing skills specifically. This command covers ALL prompts.

## When to Use

Test prompts that:

- Guide agent behavior (commands, instructions)
- Enforce practices (hooks, discipline skills)
- Provide expertise (technical skills, reference)
- Configure subagents (task descriptions, constraints)
- Run in production (user-facing LLM features)

Test before deployment when:

- Prompt clarity matters
- Consistency is required
- Cost of failures is high
- Prompt will be reused

## Prompt Types & Testing Strategies

| Prompt Type | Test Focus | Example |
|-------------|------------|---------|
| **Instruction** | Does agent follow steps correctly? | Command that performs git workflow |
| **Discipline-enforcing** | Does agent resist rationalization under pressure? | Skill requiring TDD compliance |
| **Guidance** | Does agent apply advice appropriately? | Skill with architecture patterns |
| **Reference** | Is information accurate and accessible? | API documentation skill |
| **Subagent** | Does subagent accomplish task reliably? | Task tool prompt for code review |

Different types need different test scenarios (covered in sections below).
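
As a rough illustration of the cycle described in the Overview, the sketch below runs one scenario through RED, GREEN, and a re-check after REFACTOR. Everything in it is hypothetical: `run_cycle`, `PhaseResult`, and the `run_subagent` callable are illustrative names, not part of this skill, and `fake_runner` only stands in for launching a fresh subagent.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical signature: (scenario, prompt or None) -> transcript of agent behavior.
# A real runner would launch a fresh subagent; that wiring is assumed, not shown.
SubagentRunner = Callable[[str, Optional[str]], str]


@dataclass
class PhaseResult:
    phase: str       # "RED", "GREEN", or "STAY GREEN"
    transcript: str  # agent behavior, captured verbatim
    passed: bool     # whether the transcript satisfied the check


def run_cycle(
    scenario: str,
    prompt: str,
    refactored_prompt: str,
    run_subagent: SubagentRunner,
    check: Callable[[str], bool],
) -> List[PhaseResult]:
    """Run one scenario through RED, GREEN, and the post-REFACTOR re-check."""
    results: List[PhaseResult] = []

    # RED: run WITHOUT the prompt and record what the agent does on its own.
    red = run_subagent(scenario, None)
    results.append(PhaseResult("RED", red, check(red)))
    if results[-1].passed:
        # No observed failure means no evidence of what the prompt must fix.
        return results

    # GREEN: run WITH the prompt written against the observed failure.
    green = run_subagent(scenario, prompt)
    results.append(PhaseResult("GREEN", green, check(green)))

    # STAY GREEN: re-run with the refactored prompt to confirm it still works.
    again = run_subagent(scenario, refactored_prompt)
    results.append(PhaseResult("STAY GREEN", again, check(again)))
    return results


def fake_runner(scenario: str, prompt: Optional[str]) -> str:
    # Stand-in for a real subagent: it complies only when a prompt is supplied.
    return "wrote tests first" if prompt else "skipped tests"


results = run_cycle(
    scenario="Add a feature under deadline pressure",
    prompt="Always write a failing test before any implementation code.",
    refactored_prompt="Write a failing test first. No exceptions.",
    run_subagent=fake_runner,
    check=lambda transcript: "wrote tests" in transcript,
)
print([(r.phase, r.passed) for r in results])
# → [('RED', False), ('GREEN', True), ('STAY GREEN', True)]
```

The early return after a passing RED run reflects the core principle: if the agent already behaves correctly without the prompt, the sketch stops, because nothing was observed for the prompt to fix.
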
## TDD Mapping for Prompt Testing

| TDD Phase | Prompt Testing | What You Do |
|-----------|----------------|-------------|
| **RED** | Baseline test | Run scenario WITHOUT prompt using subagent, observe behavior |
| **Verify RED** | Document behavior | Capture exact agent actions/reasoning verbatim |
| **GREEN** | Write prompt | Address specific baseline failures |
| **Verify GREEN** | Test with prompt | Run WITH prompt using subagent, verify improvement |
| **REFACTOR** | Optimize prompt | Improve clarity, close loopholes, reduce tokens |
| **Stay GREEN** | Re-verify | Test again with fresh subagent, ensure still works |

## Why Use Subagents for Testing?

**Subagents provide:**

1. **Clean slate** - No conversation history affecting behavior
2. **Isolation** - Test only the prompt, not accumulated context
3. **Reproducibility** - Same starting conditions every run
4. **Parallelization** - Test multiple scenarios simultaneously
5. **Objectivity** - No bias from prior interactions

**When to use Task tool with subagents:**

- Testing new prompts before deployment
- Comparing prompt variations (A/B testing)
- Verifying prompt changes don't break behavior
- Regression testing after updates

## RED Phase: Baseline Testing (Watch It Fail)

**Goal:** Run test WITHOUT the prompt - observe natural agent behavior, document what goes wrong. This proves what the prompt needs to fix.

### Process

- [ ] **Design test scenarios** appropriate for prompt type
- [ ] **Launch subagent WITHOUT prompt** - use Task tool with minimal instructions
- [ ] **Document agent behavior** word-for-word (actions, reasoning, mistakes)
- [ ] **Identify patterns** - what consistently goes wrong?
- [ ] **Note severity** - which failures are critical vs. minor?

### Scenario Design by Prompt Type

####