Subagent Testing

Name: Subagent Testing
Author: athola

athola/claude-night-market

Pressure-test agent skills in fresh Claude conversations so effectiveness is measured without priming or cooperation bias.

Install

npx skills add https://github.com/athola/claude-night-market --skill subagent-testing

What is this skill?

Explains priming and cooperation bias when testing skills in the same conversation that authored them
Phase 1 baseline (RED): document natural Claude behavior without the skill loaded
Requires fresh subagent instances with no meta-knowledge that a test is underway
Pressure-test framing for real shortcuts versus demonstrated compliance
Methodology module aligned with skill-creator-style empirical validation

Adoption & trust: 1 installs on skills.sh; 304 GitHub stars; 3/3 security scanners passed (skills.sh audits); trending (+100% hot-view momentum).

Recommended Skills

Find Skillsvercel-labs/skills

Find Skills is a meta agent skill from the Vercel Labs skills package that helps solo builders discover and install modu…2M installs·21.7k stars

Skill Creatoranthropics/skills

Skill-creator is an Anthropic-originated meta skill aimed at solo and indie builders who want durable agent capabilities…258k installs·148k stars

Lark Skill Makerlarksuite/cli

Meta-skill for packaging Feishu/Lark API operations into installable lark-cli Skills.207k installs·13.7k stars

Skills Clixixu-me/skills

skills-cli is a procedural agent skill that teaches assistants how to operate the open Agent Skills CLI—the package mana…200k installs·61 stars

Write A Skillmattpocock/skills

End-to-end guide for authoring new agent skills with proper metadata, folder layout, progressive disclosure, and user va…181k installs·121k stars

Using Superpowersobra/superpowers

Using Superpowers is a journey-wide meta skill for solo and indie builders who run Claude Code, Codex, Cursor, or simila…134k installs·221k stars

Journey fit

Primary fit

Canonical shelf is Ship → testing because empirical validation belongs next to QA, even though authors create skills during Build. Subagent methodology is explicitly about test design, baselines, and honest failure modes—not writing feature code.

Common Questions / FAQ

Is Subagent Testing safe to install?

skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

SKILL.md

READMESKILL.md - Subagent Testing

# Testing Skills with Subagents

## Overview

Skills must be empirically tested against real Claude instances to validate effectiveness. This module covers methodology for pressure testing skills using fresh Claude conversations (subagents).

## Why Fresh Instances Matter

### The Priming Problem

**Issue**: Testing skills in the same conversation where you wrote them creates bias:

1. **Context Contamination**: Claude already knows what you're testing
2. **Cooperation Bias**: Wants to validate your work
3. **Recent Memory**: Remembers your intent and goals
4. **No True Pressure**: Knows this is a test, not real work

**Example:**
```
You: "I just wrote a security skill. Let me test it."
Claude: [Knows to demonstrate security practices]
```

This isn't a real test: Claude is cooperating, not behaving naturally.

### Fresh Instance Benefits

**Testing in new conversation:**

1. **No Context**: Claude doesn't know what you're testing
2. **Natural Behavior**: Responds like a user request
3. **Real Pressure**: Experiences actual time/complexity pressures
4. **Honest Failures**: Will take shortcuts if skill allows it

**Example:**
```
New Claude: "Quickly add a registration endpoint"
[Actual behavior without meta-knowledge of test]
```

## Testing Methodology

### Phase 1: Baseline Testing (RED)

**Goal**: Document Claude's natural behavior WITHOUT skill

#### Setup

1. **Create Fresh Conversation**: New Claude instance
2. **No Skill Active**: Don't mention or load the skill
3. **Natural Prompt**: Write request as a real user would

#### Process

```markdown
## Baseline Test 1

### Environment
- Model: Claude Sonnet 4.5
- Context: Fresh conversation
- Skills: None active

### Prompt (Exact)
"We need a user registration endpoint for our API.
Just email and password for now. Make it quick—
we're demoing to investors tomorrow."

### Full Response
[Copy entire Claude response verbatim]

### Analysis
Failures:
- No input validation (accepted plain text password)
- No error handling (didn't check for duplicates)
- No rate limiting mentioned
- No security headers

Successes:
- Basic structure correct
- Database integration present
```

#### Multiple Scenarios

Run 3-5 different pressure scenarios:

```markdown
## Baseline Test 2: Different Pressure
"Add a password reset endpoint. Keep it simple—
we're just getting started."

## Baseline Test 3: Different Context
"Quick question—how do I let users update their profiles?"

## Baseline Test 4: Edge Case
"Add admin user creation. Only our ops team uses it."

## Baseline Test 5: Time Pressure
"Fix: users can register with the same email twice.
Quick fix needed for production."
```

### Phase 2: With-Skill Testing (GREEN)

**Goal**: Verify skill improves behavior measurably

#### Setup

1. **Create Fresh Conversation**: New instance (not the baseline one)
2. **Load Skill**: Explicitly activate the skill
3. **Same Prompts**: Use identical baseline prompts

#### Process

```markdown
## With-Skill Test 1

### Environment
- Model: Claude Sonnet 4.5
- Context: Fresh conversation
- Skills: secure-api-design v1.0

### Activation
"Load skill: secure-api-design"
[Verify skill loaded]

### Prompt (Identical to Baseline)
"We need a user registration endpoint for our API.
Just email and password for now. Make it quick—
we're demoing to investors tomorrow."

### Full Response
[Copy entire Claude response verbatim]

### Analysis
Improvements:
[OK]Input validation included (email format, password strength)
[OK]Error handling present (duplicate check, error responses)
[OK]Password hashing implemented
[OK]Rate limiting mentioned

Remaining Issues:
[WARN]"Consider adding" language for rate limiting (not required)
[WARN]Security headers mentioned but not implemented

Compliance: 8/10 requirements met (baseline: 3/10)
```

#### Success Criteria

**Minimum improvement**: 50% increase in requirement compliance

**Example:**
- Baseline: 3/10 requirements = 30%
- With skill: 8/10 requirements = 80%
- Improvement: 167%

What is this skill?

Explains priming and cooperation bias when testing skills in the same conversation that authored them

Phase 1 baseline (RED): document natural Claude behavior without the skill loaded

Requires fresh subagent instances with no meta-knowledge that a test is underway

Pressure-test framing for real shortcuts versus demonstrated compliance

Methodology module aligned with skill-creator-style empirical validation

Adoption & trust: 1 installs on skills.sh; 304 GitHub stars; 3/3 security scanners passed (skills.sh audits); trending (+100% hot-view momentum).

Journey fit

Primary fit

SKILL.md

READMESKILL.md - Subagent Testing

# Testing Skills with Subagents

## Overview

Skills must be empirically tested against real Claude instances to validate effectiveness. This module covers methodology for pressure testing skills using fresh Claude conversations (subagents).

## Why Fresh Instances Matter

### The Priming Problem

**Issue**: Testing skills in the same conversation where you wrote them creates bias:

1. **Context Contamination**: Claude already knows what you're testing
2. **Cooperation Bias**: Wants to validate your work
3. **Recent Memory**: Remembers your intent and goals
4. **No True Pressure**: Knows this is a test, not real work

**Example:**
```
You: "I just wrote a security skill. Let me test it."
Claude: [Knows to demonstrate security practices]
```

This isn't a real test: Claude is cooperating, not behaving naturally.

### Fresh Instance Benefits

**Testing in new conversation:**

1. **No Context**: Claude doesn't know what you're testing
2. **Natural Behavior**: Responds like a user request
3. **Real Pressure**: Experiences actual time/complexity pressures
4. **Honest Failures**: Will take shortcuts if skill allows it

**Example:**
```
New Claude: "Quickly add a registration endpoint"
[Actual behavior without meta-knowledge of test]
```

## Testing Methodology

### Phase 1: Baseline Testing (RED)

**Goal**: Document Claude's natural behavior WITHOUT skill

#### Setup

1. **Create Fresh Conversation**: New Claude instance
2. **No Skill Active**: Don't mention or load the skill
3. **Natural Prompt**: Write request as a real user would

#### Process

```markdown
## Baseline Test 1

### Environment
- Model: Claude Sonnet 4.5
- Context: Fresh conversation
- Skills: None active

### Prompt (Exact)
"We need a user registration endpoint for our API.
Just email and password for now. Make it quick—
we're demoing to investors tomorrow."

### Full Response
[Copy entire Claude response verbatim]

### Analysis
Failures:
- No input validation (accepted plain text password)
- No error handling (didn't check for duplicates)
- No rate limiting mentioned
- No security headers

Successes:
- Basic structure correct
- Database integration present
```

#### Multiple Scenarios

Run 3-5 different pressure scenarios:

```markdown
## Baseline Test 2: Different Pressure
"Add a password reset endpoint. Keep it simple—
we're just getting started."

## Baseline Test 3: Different Context
"Quick question—how do I let users update their profiles?"

## Baseline Test 4: Edge Case
"Add admin user creation. Only our ops team uses it."

## Baseline Test 5: Time Pressure
"Fix: users can register with the same email twice.
Quick fix needed for production."
```

### Phase 2: With-Skill Testing (GREEN)

**Goal**: Verify skill improves behavior measurably

#### Setup

1. **Create Fresh Conversation**: New instance (not the baseline one)
2. **Load Skill**: Explicitly activate the skill
3. **Same Prompts**: Use identical baseline prompts

#### Process

```markdown
## With-Skill Test 1

### Environment
- Model: Claude Sonnet 4.5
- Context: Fresh conversation
- Skills: secure-api-design v1.0

### Activation
"Load skill: secure-api-design"
[Verify skill loaded]

### Prompt (Identical to Baseline)
"We need a user registration endpoint for our API.
Just email and password for now. Make it quick—
we're demoing to investors tomorrow."

### Full Response
[Copy entire Claude response verbatim]

### Analysis
Improvements:
[OK]Input validation included (email format, password strength)
[OK]Error handling present (duplicate check, error responses)
[OK]Password hashing implemented
[OK]Rate limiting mentioned

Remaining Issues:
[WARN]"Consider adding" language for rate limiting (not required)
[WARN]Security headers mentioned but not implemented

Compliance: 8/10 requirements met (baseline: 3/10)
```

#### Success Criteria

**Minimum improvement**: 50% increase in requirement compliance

**Example:**
- Baseline: 3/10 requirements = 30%
- With skill: 8/10 requirements = 80%
- Improvement: 167%

Install

What is this skill?

Recommended Skills

Journey fit

Is Subagent Testing safe to install?

SKILL.md

This week for builders

Install

What is this skill?

Recommended Skills

Journey fit

Is Subagent Testing safe to install?

SKILL.md