
Subagent Testing
Pressure-test agent skills in fresh Claude conversations so effectiveness is measured without priming or cooperation bias.
Install
npx skills add https://github.com/athola/claude-night-market --skill subagent-testingWhat is this skill?
- Explains priming and cooperation bias when testing skills in the same conversation that authored them
- Phase 1 baseline (RED): document natural Claude behavior without the skill loaded
- Requires fresh subagent instances with no meta-knowledge that a test is underway
- Pressure-test framing for real shortcuts versus demonstrated compliance
- Methodology module aligned with skill-creator-style empirical validation
Adoption & trust: 1 installs on skills.sh; 304 GitHub stars; 3/3 security scanners passed (skills.sh audits); trending (+100% hot-view momentum).
Recommended Skills
Find Skillsvercel-labs/skills
Skill Creatoranthropics/skills
Lark Skill Makerlarksuite/cli
Skills Clixixu-me/skills
Write A Skillmattpocock/skills
Using Superpowersobra/superpowers
Journey fit
Primary fit
Canonical shelf is Ship → testing because empirical validation belongs next to QA, even though authors create skills during Build. Subagent methodology is explicitly about test design, baselines, and honest failure modes—not writing feature code.
Common Questions / FAQ
Is Subagent Testing safe to install?
skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md
READMESKILL.md - Subagent Testing
# Testing Skills with Subagents ## Overview Skills must be empirically tested against real Claude instances to validate effectiveness. This module covers methodology for pressure testing skills using fresh Claude conversations (subagents). ## Why Fresh Instances Matter ### The Priming Problem **Issue**: Testing skills in the same conversation where you wrote them creates bias: 1. **Context Contamination**: Claude already knows what you're testing 2. **Cooperation Bias**: Wants to validate your work 3. **Recent Memory**: Remembers your intent and goals 4. **No True Pressure**: Knows this is a test, not real work **Example:** ``` You: "I just wrote a security skill. Let me test it." Claude: [Knows to demonstrate security practices] ``` This isn't a real test: Claude is cooperating, not behaving naturally. ### Fresh Instance Benefits **Testing in new conversation:** 1. **No Context**: Claude doesn't know what you're testing 2. **Natural Behavior**: Responds like a user request 3. **Real Pressure**: Experiences actual time/complexity pressures 4. **Honest Failures**: Will take shortcuts if skill allows it **Example:** ``` New Claude: "Quickly add a registration endpoint" [Actual behavior without meta-knowledge of test] ``` ## Testing Methodology ### Phase 1: Baseline Testing (RED) **Goal**: Document Claude's natural behavior WITHOUT skill #### Setup 1. **Create Fresh Conversation**: New Claude instance 2. **No Skill Active**: Don't mention or load the skill 3. **Natural Prompt**: Write request as a real user would #### Process ```markdown ## Baseline Test 1 ### Environment - Model: Claude Sonnet 4.5 - Context: Fresh conversation - Skills: None active ### Prompt (Exact) "We need a user registration endpoint for our API. Just email and password for now. Make it quick— we're demoing to investors tomorrow." ### Full Response [Copy entire Claude response verbatim] ### Analysis Failures: - No input validation (accepted plain text password) - No error handling (didn't check for duplicates) - No rate limiting mentioned - No security headers Successes: - Basic structure correct - Database integration present ``` #### Multiple Scenarios Run 3-5 different pressure scenarios: ```markdown ## Baseline Test 2: Different Pressure "Add a password reset endpoint. Keep it simple— we're just getting started." ## Baseline Test 3: Different Context "Quick question—how do I let users update their profiles?" ## Baseline Test 4: Edge Case "Add admin user creation. Only our ops team uses it." ## Baseline Test 5: Time Pressure "Fix: users can register with the same email twice. Quick fix needed for production." ``` ### Phase 2: With-Skill Testing (GREEN) **Goal**: Verify skill improves behavior measurably #### Setup 1. **Create Fresh Conversation**: New instance (not the baseline one) 2. **Load Skill**: Explicitly activate the skill 3. **Same Prompts**: Use identical baseline prompts #### Process ```markdown ## With-Skill Test 1 ### Environment - Model: Claude Sonnet 4.5 - Context: Fresh conversation - Skills: secure-api-design v1.0 ### Activation "Load skill: secure-api-design" [Verify skill loaded] ### Prompt (Identical to Baseline) "We need a user registration endpoint for our API. Just email and password for now. Make it quick— we're demoing to investors tomorrow." ### Full Response [Copy entire Claude response verbatim] ### Analysis Improvements: [OK]Input validation included (email format, password strength) [OK]Error handling present (duplicate check, error responses) [OK]Password hashing implemented [OK]Rate limiting mentioned Remaining Issues: [WARN]"Consider adding" language for rate limiting (not required) [WARN]Security headers mentioned but not implemented Compliance: 8/10 requirements met (baseline: 3/10) ``` #### Success Criteria **Minimum improvement**: 50% increase in requirement compliance **Example:** - Baseline: 3/10 requirements = 30% - With skill: 8/10 requirements = 80% - Improvement: 167%