Gan Style Harness

Long-running harnesses are worth adopting once you are actively building a product with agents, not while you are still only sketching initial ideas. Agent-tooling is the shelf for orchestration patterns such as separate generator and evaluator roles and tool-backed iteration.

Where it fits

Example uses

Spin up a clickable full-stack prototype where the evaluator rejects shallow UI before you commit to the stack.

Orchestrate Generator and Evaluator agents across repo tools for a multi-feature milestone.

Iterate on marketing and app chrome until the evaluator stops flagging layout, typography, or interaction issues.

Use the evaluator role as a pre-ship gate on completeness and regressions before tagging a release.

How it compares

Use it as a multi-agent quality harness in place of a single self-reviewing Claude session when generating a whole app.

Common Questions / FAQ

Who is gan-style-harness for?

Solo builders using agentic IDEs who need autonomous, long-running app builds with a separate strict reviewer—not casual snippet edits.

When should I use gan-style-harness?

In Build (agent-tooling) for full applications from a prompt; in Build (frontend) for high visual quality; in Ship (review) when you want an evaluator pass to act as a quality gate before you ship.

Is gan-style-harness safe to install?

It declares broad tool use (shell, writes, subtasks); treat it as high-permission orchestration and review the Security Audits panel on this Prism page before enabling in production repos.

SKILL.md

# GAN-Style Harness Skill

> Inspired by [Anthropic's Harness Design for Long-Running Application Development](https://www.anthropic.com/engineering/harness-design-long-running-apps) (March 24, 2026)

A multi-agent harness that separates **generation** from **evaluation**, creating an adversarial feedback loop that drives quality far beyond what a single agent can achieve.

## Core Insight

> When asked to evaluate their own work, agents are pathological optimists — they praise mediocre output and talk themselves out of legitimate issues. But engineering a **separate evaluator** to be ruthlessly strict is far more tractable than teaching a generator to self-critique.

This is the same dynamic as GANs (Generative Adversarial Networks): the Generator produces, the Evaluator critiques, and that feedback drives the next iteration.
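
The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's actual implementation: `run_loop`, `Verdict`, `toy_generate`, and `toy_evaluate` are hypothetical names, and the toy generator and evaluator are plain functions standing in for model calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    """Evaluator output: a pass/fail flag plus feedback for the generator."""
    passed: bool
    feedback: str


def run_loop(
    generate: Callable[[Optional[str]], str],
    evaluate: Callable[[str], Verdict],
    max_iterations: int = 15,
) -> tuple[str, int]:
    """Alternate generation and evaluation until the evaluator passes the
    artifact or the iteration budget runs out. Returns (artifact, rounds)."""
    feedback: Optional[str] = None
    artifact = ""
    for round_number in range(1, max_iterations + 1):
        artifact = generate(feedback)   # the generator never grades itself
        verdict = evaluate(artifact)    # a separate, strict critic does
        if verdict.passed:
            return artifact, round_number
        feedback = verdict.feedback     # the critique drives the next round
    return artifact, max_iterations


# Toy stand-ins (assumptions) so the sketch runs without any model calls.
def toy_generate(feedback: Optional[str]) -> str:
    return "draft" if feedback is None else "draft + fixes"


def toy_evaluate(artifact: str) -> Verdict:
    if "fixes" in artifact:
        return Verdict(True, "")
    return Verdict(False, "empty states are missing")
```

Calling `run_loop(toy_generate, toy_evaluate)` runs two rounds: the first draft is rejected, and the second, which incorporates the feedback, passes. Keeping `generate` and `evaluate` as separate callables mirrors the point of the pattern, which is that the critic can be tuned for strictness independently of the producer.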

## When to Use

- Building complete applications from a one-line prompt
- Frontend design tasks requiring high visual quality
- Full-stack projects that need working features, not just code
- Any task where "AI slop" aesthetics are unacceptable
- Projects where you want to invest $50-200 for production-quality output

## When NOT to Use

- Quick single-file fixes (use standard `claude -p`)
- Tasks with tight budget constraints (<$10)
- Simple refactoring (use the de-sloppify pattern instead)
- Tasks that are already well-specified with tests (use a TDD workflow)
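
The two lists above amount to a small routing decision, which could be sketched as follows. The `Task` fields and the `choose_workflow` function are hypothetical names introduced only for illustration. The lists say nothing about budgets between $10 and $50, so the sketch assumes such tasks still go to the harness.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical task description used only for this illustration."""
    budget_usd: float
    single_file_fix: bool = False
    simple_refactor: bool = False
    has_tests_and_spec: bool = False


def choose_workflow(task: Task) -> str:
    """Map a task to a workflow, following the 'When NOT to Use' list first."""
    if task.single_file_fix:
        return "claude -p"
    if task.simple_refactor:
        return "de-sloppify"
    if task.has_tests_and_spec:
        return "tdd"
    if task.budget_usd < 10:
        return "low-budget alternative"
    # Assumption: anything not excluded above is a candidate for the harness.
    return "gan-style-harness"
```

For example, `choose_workflow(Task(budget_usd=100))` returns `"gan-style-harness"`, while the same task with `single_file_fix=True` is routed to `"claude -p"`.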

## Architecture

```
                    ┌─────────────┐
                    │   PLANNER   │
                    │  (Opus 4.6) │
                    └──────┬──────┘
                           │ Product Spec
                           │ (features, sprints, design direction)
                           ▼
              ┌────────────────────────┐
              │                        │
              │   GENERATOR-EVALUATOR  │
              │      FEEDBACK LOOP     │
              │                        │
              │  ┌──────────┐          │
              │  │GENERATOR │--build-->│──┐
              │  │(Opus 4.6)│          │  │
              │  └────▲─────┘          │  │
              │       │                │  │ live app
              │    feedback            │  │
              │       │                │  │
              │  ┌────┴─────┐          │  │
              │  │EVALUATOR │<-test----│──┘
              │  │(Opus 4.6)│          │
              │  │+Playwright│         │
              │  └──────────┘          │
              │                        │
              │   5-15 iterations      │
              └────────────────────────┘
```
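
The diagram can also be written down as a small configuration object. The sketch below is one possible encoding, not the skill's own format: `AgentRole` and `HarnessConfig` are hypothetical names, the model strings are the labels from the diagram rather than API model identifiers, and treating "5-15 iterations" as hard lower and upper bounds is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRole:
    """One box in the diagram: a role, its model label, and extra tools."""
    name: str
    model: str
    tools: tuple[str, ...] = ()


@dataclass(frozen=True)
class HarnessConfig:
    """Planner feeding a Generator-Evaluator loop with iteration bounds."""
    planner: AgentRole
    generator: AgentRole
    evaluator: AgentRole
    min_iterations: int = 5
    max_iterations: int = 15

    def __post_init__(self) -> None:
        # Reject bounds that would make the loop impossible to run.
        if not 1 <= self.min_iterations <= self.max_iterations:
            raise ValueError("iteration bounds must satisfy 1 <= min <= max")


config = HarnessConfig(
    planner=AgentRole("planner", "Opus 4.6"),
    generator=AgentRole("generator", "Opus 4.6"),
    evaluator=AgentRole("evaluator", "Opus 4.6", tools=("Playwright",)),
)
```

Making the roles explicit in one place keeps the separation visible: the Evaluator is the only role given a browser-testing tool, which matches its job of testing the live app.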

## The Three Agents

### 1. Planner Agent

**Role:** Product manager — expands a brief prompt into a full product specification.

**Key behaviors:**
- Takes a one-line prompt and produces a 16-feature, multi-sprint specification
- Defines user stories, technical requirements, and visual design direction
- Is deliberately **ambitious** — conservative planning leads to underwhelming results
- Produces evaluation criteria that the Evaluator will use later

**Model:** Opus 4.6 (needs deep reasoning for spec expansion)
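
As a rough illustration of what the Planner hands off, the specification could be modeled as structured data. The field names below (`Feature`, `ProductSpec`, `sprints`) and the sample values are assumptions made for this sketch; the skill does not prescribe a schema here.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    name: str
    user_story: str
    sprint: int


@dataclass
class ProductSpec:
    """Planner output: the expanded prompt plus criteria for the Evaluator."""
    prompt: str
    design_direction: str
    features: list[Feature] = field(default_factory=list)
    evaluation_criteria: list[str] = field(default_factory=list)

    def sprints(self) -> dict[int, list[Feature]]:
        """Group features by sprint number, in ascending sprint order."""
        grouped: dict[int, list[Feature]] = {}
        for feature in sorted(self.features, key=lambda f: f.sprint):
            grouped.setdefault(feature.sprint, []).append(feature)
        return grouped


# Illustrative data only; a real spec would carry many more features.
spec = ProductSpec(
    prompt="Build a habit tracker",
    design_direction="calm, high-contrast",
    features=[
        Feature("Streak view", "As a user, I can see my current streak", sprint=2),
        Feature("Sign-up", "As a user, I can create an account", sprint=1),
    ],
    evaluation_criteria=["Every feature is reachable from the live UI"],
)
```

Carrying `evaluation_criteria` inside the spec reflects the last behavior in the list: the criteria are written at planning time and reused by the Evaluator later.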

### 2. Generator Agent

**Role:** Developer — implements features according to the spec.

**Key behaviors:**
- Works in structured sprints (or continuous mode with newer models)
- Negotiates a "sprint contract" with the Evaluator before writing code
- Uses full-stack tooling: React, FastAPI/Express, databases, CSS
- Manages git for version control between iterations
- Reads Evaluator feedback and incorporates it in the next iteration

**Model:** Opus 4.6 (needs strong coding capability)
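
The "sprint contract" step can be pictured as agreeing on acceptance criteria before any code is written. The text above does not say what a contract contains, so the sketch below assumes it is a list of criteria that the Generator proposes and the Evaluator may extend before approving. `SprintContract` and `negotiate` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class SprintContract:
    """Assumed shape: acceptance criteria both agents agree on for a sprint."""
    sprint: int
    criteria: list[str] = field(default_factory=list)
    approved: bool = False


def negotiate(proposed: SprintContract, required: list[str]) -> SprintContract:
    """Evaluator side of the negotiation: append any required criteria the
    Generator left out, then return an approved copy of the contract."""
    missing = [c for c in required if c not in proposed.criteria]
    return SprintContract(
        sprint=proposed.sprint,
        criteria=proposed.criteria + missing,
        approved=True,
    )


proposal = SprintContract(sprint=1, criteria=["User can sign up"])
contract = negotiate(
    proposal,
    required=["User can sign up", "Form shows validation errors"],
)
```

Returning a new object rather than mutating the proposal keeps the Generator's original offer available for comparison, which can help when reviewing how much the Evaluator tightened the scope.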

### 3. Evaluator Agent

**Role:** QA engineer — tests the live running application, not just code.

**Key behaviors:**
- Uses **Playwright MCP** to int

What is this skill?

Separates Generator and ruthlessly strict Evaluator to break agent self-praise loops

Based on Anthropic March 2026 long-running application harness design

Targets full-stack and frontend design tasks where “AI slop” aesthetics fail the bar

Uses Read, Write, Edit, Bash, Grep, Glob, and Task for autonomous multi-step runs

Explicit when-NOT-to-use guardrail for quick single-file fixes

Compatible agents: Claude Code, Cursor, Codex

Adoption & trust: 3.2k installs on skills.sh; 210k GitHub stars; 3/3 security scanners passed (skills.sh audits).
