Autoresearch

Autoresearch is shelved under Build agent-tooling because its primary job is iterative code modification toward a stated goal. The protocol is an agent iteration engine (Goal, Scope, Metric, Verify, Guard)—tooling for how the agent works, not a one-off integration.

Where it fits

Example use

Loop TypeScript refactors in src/**/*.ts until the bundle size reported by Verify drops.

Example use

Iterate patches until the pytest Verify command reports a higher pass count, with Guard enforcing lint.

Example use

Tune config files until the latency benchmark run by Verify improves without breaking the Guard deploy dry-run.

How it compares

Use it instead of unstructured “keep trying until it feels better” chat when you need explicit keep/discard decisions tied to a number.

Common Questions / FAQ

Who is autoresearch for?

Solo developers using Codex or similar agents who want Karpathy-style autoresearch loops with defined metrics and safety guards.

When should I use autoresearch?

In Build when optimizing scoped code against a metric; in Ship when Verify is tests or QA scores; in Operate when iterating on perf or error-rate commands you can shell out.

Is autoresearch safe to install?

It can modify files and run shell Verify/Guard commands—review the Security Audits panel on this page and tighten Scope and Guard before unlimited iterations.

SKILL.md


interface:
  display_name: "Autoresearch"
  short_description: "Autonomous goal-directed iteration engine"
  brand_color: "#7C3AED"
  default_prompt: "Set a goal, define a metric, let Codex loop until done"

policy:
  allow_implicit_invocation: true


---
name: autoresearch
description: "Autonomous iteration loop: modify, verify, keep/discard against any metric"
argument-hint: "[Goal: <text>] [Scope: <glob>] [Metric: <text>] [Verify: <cmd>] [Guard: <cmd>] [Iterations: N] [--evals]"
---

EXECUTE IMMEDIATELY — do not deliberate before reading this protocol.

## Parse Arguments

Extract from $ARGUMENTS:
- `Goal:` — what to improve
- `Scope:` or `--scope` — file globs
- `Metric:` — what to measure
- `Direction:` — higher_is_better (default) or lower_is_better
- `Verify:` — shell command that outputs a number
- `Guard:` — optional safety command (must always pass)
- `Iterations:` or `--iterations` — integer N for bounded mode (default: 25). "unlimited" for unbounded.
- `--evals` — enable mid-loop checkpoints
- `--evals-interval N` — checkpoint frequency override
- `--chain <targets>` — comma-separated downstream commands
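
As an illustration only, the extraction above could be sketched in Python as follows. The function name `parse_arguments`, the dictionary keys it returns, and the regex-based splitting are assumptions made for this sketch; the skill itself leaves parsing to the agent. The defaults (25 iterations, `higher_is_better`) come from the list above.

```python
import re

# Labels taken from the list above; the parsing strategy itself is an assumption.
LABELS = ["Goal", "Scope", "Metric", "Direction", "Verify", "Guard", "Iterations"]
LABEL_RE = re.compile(r"\b(" + "|".join(LABELS) + r"):\s*")


def parse_arguments(raw: str) -> dict:
    """Split a raw argument string into the fields listed above."""
    parsed = {"direction": "higher_is_better", "iterations": 25, "evals": False}

    def take_flag(pattern: str):
        # Remove the first match from raw so it cannot leak into a label value.
        nonlocal raw
        match = re.search(pattern, raw)
        if match:
            raw = raw[:match.start()] + " " + raw[match.end():]
        return match

    # --evals-interval is handled before --evals because the shorter flag is a prefix of it.
    if m := take_flag(r"--evals-interval\s+(\d+)"):
        parsed["evals_interval"] = int(m.group(1))
    if take_flag(r"--evals\b"):
        parsed["evals"] = True
    if m := take_flag(r"--chain\s+(\S+)"):
        parsed["chain"] = m.group(1).split(",")
    if m := take_flag(r"--scope\s+(\S+)"):
        parsed["scope"] = m.group(1)
    if m := take_flag(r"--iterations\s+(\S+)"):
        parsed["iterations"] = m.group(1)

    # Each label's value runs until the next label or the end of the string.
    hits = list(LABEL_RE.finditer(raw))
    for i, hit in enumerate(hits):
        end = hits[i + 1].start() if i + 1 < len(hits) else len(raw)
        parsed[hit.group(1).lower()] = raw[hit.end():end].strip()

    # "unlimited" is represented as None (unbounded mode); anything else must be an integer.
    if isinstance(parsed["iterations"], str):
        text = parsed["iterations"]
        parsed["iterations"] = None if text.lower() == "unlimited" else int(text)
    return parsed
```

A Verify or Guard command that itself contains one of these labels or flags would be misparsed by this simple approach, so treat it as a starting point rather than a robust parser.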

## Setup (if required context missing)

If Goal, Scope, Metric, or Verify missing → use request_user_input (single batched call):
  Q1 (Goal): "What do you want to improve?"
  Q2 (Scope): "Which files?" — suggest globs from project
  Q3 (Metric+Verify): "How to measure? Provide a shell command that outputs a number"
  Q4 (Guard): "Safety command that must always pass?" — options: test cmd, build cmd, skip
If ALL provided inline → skip setup, proceed directly.

## Precondition Checks

1. Verify git repo exists (`git rev-parse --git-dir`)
2. Check clean working tree (`git status --porcelain`) — warn if dirty
3. Check for stale lock files, detached HEAD
4. If Guard set → run Guard to establish guard baseline
5. Fail fast on any critical issue. Warn on non-critical.
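
A minimal sketch of checks 1 to 3, assuming Python and the standard `git` CLI. The helper names `run_git` and `check_preconditions` are hypothetical, and treating a detached HEAD as a warning rather than a critical failure is an assumption, since the protocol does not classify it. The stale lock file check and the Guard baseline are left out.

```python
import subprocess


def run_git(*args: str) -> tuple[int, str]:
    """Run a git subcommand and return (exit code, stripped stdout)."""
    proc = subprocess.run(["git", *args], capture_output=True, text=True)
    return proc.returncode, proc.stdout.strip()


def check_preconditions(git=run_git) -> tuple[list[str], list[str]]:
    """Return (critical errors, warnings) for the repository checks."""
    errors: list[str] = []
    warnings: list[str] = []

    # Check 1: a missing repository is critical, so stop immediately.
    code, _ = git("rev-parse", "--git-dir")
    if code != 0:
        return ["not inside a git repository"], warnings

    # Check 2: any porcelain output means uncommitted changes (warn only).
    _, status = git("status", "--porcelain")
    if status:
        warnings.append("working tree is dirty")

    # Check 3: `git symbolic-ref -q HEAD` exits non-zero when HEAD is detached.
    code, _ = git("symbolic-ref", "-q", "HEAD")
    if code != 0:
        warnings.append("HEAD is detached")

    return errors, warnings
```

The `git` parameter exists so the checks can be exercised with a stand-in function instead of a real repository.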

## Verify Safety Screen

Before first dry-run, screen Verify command for: rm -rf, fork bombs, curl|sh, embedded credentials, outbound writes. Block dangerous commands.
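
The screen could look roughly like the Python sketch below. The patterns are illustrative assumptions, not the skill's actual rule set, and they cover only four of the five categories named above: outbound writes are omitted because they are hard to detect with a simple pattern. The name `screen_verify_command` is hypothetical.

```python
import re

# Illustrative patterns only; a real screen would need a much broader list.
DANGEROUS_PATTERNS = {
    "recursive delete": re.compile(
        r"\brm\s+-[a-zA-Z]*r[a-zA-Z]*f|\brm\s+-[a-zA-Z]*f[a-zA-Z]*r"
    ),
    "fork bomb": re.compile(r":\(\)\s*\{.*\|.*&\s*\}\s*;"),
    "pipe to shell": re.compile(r"\b(curl|wget)\b[^|;]*\|\s*(sudo\s+)?(ba|z)?sh\b"),
    "embedded credential": re.compile(
        r"(?i)\b(api[_-]?key|token|password|secret)\s*=\s*\S+"
    ),
}


def screen_verify_command(command: str) -> list[str]:
    """Return the names of every dangerous pattern found in the command."""
    return [
        name
        for name, pattern in DANGEROUS_PATTERNS.items()
        if pattern.search(command)
    ]
```

An empty list means nothing was flagged by these patterns; it does not prove the command is safe.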

## Establish Baseline (Iteration 0)

1. Create output directory: `autoresearch/loop-{YYMMDD}-{HHMM}/`
2. Write TSV header: `# metric_direction: {direction}\niteration\ttimestamp\tcommit\tmetric\tdelta\tguard\tguard-metric\tstatus\tdescription`
3. Run Verify command → extract numeric metric
4. Record as iteration 0 in TSV: `0\t{timestamp}\t{commit}\t{metric}\t0.0\t{guard}\t-\tbaseline\tinitial state`
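
One way to sketch the baseline bookkeeping in Python is shown below. The file name `results.tsv`, the ISO timestamp format, and the function name `start_results_log` are assumptions; the directory pattern, header line, column order, and baseline row values follow the steps above. The directory and header are written before the baseline row so the row has somewhere to go.

```python
import csv
from datetime import datetime
from pathlib import Path

COLUMNS = ["iteration", "timestamp", "commit", "metric", "delta",
           "guard", "guard-metric", "status", "description"]


def start_results_log(root: Path, direction: str, commit: str,
                      metric: float, guard: str, now: datetime) -> Path:
    """Create the loop directory, write the TSV header, and log iteration 0."""
    out_dir = root / "autoresearch" / f"loop-{now:%y%m%d}-{now:%H%M}"
    out_dir.mkdir(parents=True, exist_ok=True)

    tsv_path = out_dir / "results.tsv"  # file name is an assumption
    with tsv_path.open("w", newline="") as handle:
        # The direction comment comes first, then the column header row.
        handle.write(f"# metric_direction: {direction}\n")
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(COLUMNS)
        # Baseline row: delta is 0.0 and guard-metric is "-" by definition.
        writer.writerow([0, now.isoformat(timespec="seconds"), commit, metric,
                         0.0, guard, "-", "baseline", "initial state"])
    return tsv_path
```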

## Iteration Loop

For each iteration (1 to max_iterations, or unbounded):

### Phase 1: Review (read git history as memory)
- Read last 10-20 lines of results TSV
- Run `git log --oneline -20` — see what worked/failed
- If last iteration was "keep" → run `git diff HEAD~1` to see what improved metric
- Identify: what worked, what failed, what's untried

### Phase 2: Modify
- Based on review, make ONE focused change to improve the metric
- Change must be atomic — one logical unit of work

### Phase 3: Commit
- Stage and commit with `experiment: {description}` prefix
- Record commit SHA

### Phase 4: Verify
- Run Verify command → extract new metric value
- Calculate delta from previous iteration
- Metric improved (correct direction) → candidate for keep
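
The metric extraction and the direction-aware comparison might be sketched as follows, assuming Python. Reading the metric from the last non-empty line of the Verify output is an assumption (the protocol only says the command outputs a number), and the caller decides which earlier value to pass as `reference`. The function names are hypothetical.

```python
import math


def extract_metric(output: str):
    """Parse the last non-empty line of Verify output as a finite float.

    Returns None when no valid number is found, which corresponds to the
    metric-error status.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    if not lines:
        return None
    try:
        value = float(lines[-1])
    except ValueError:
        return None
    # float() accepts "nan" and "inf"; neither is a usable metric.
    return value if math.isfinite(value) else None


def is_improvement(new: float, reference: float,
                   direction: str = "higher_is_better"):
    """Return (delta, improved) where improved respects the metric direction."""
    delta = new - reference
    improved = delta > 0 if direction == "higher_is_better" else delta < 0
    return delta, improved
```

For example, with `lower_is_better`, a bundle that shrinks from 412.5 to 410.0 yields a delta of -2.5 and counts as an improvement. An unchanged metric is not treated as an improvement in this sketch.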

### Phase 5: Guard (if configured)
- Run Guard command. If fails → revert regardless of metric improvement

### Phase 6: Decide
- **keep** — metric improved, guard passed → commit stays
- **discard** — metric worsened → `git revert HEAD --no-edit`
- **crash** — verify/guard command failed → `git revert HEAD --no-edit`
- **no-op** — no change made this iteration
- **hook-blocked** — git hook blocked the commit
- **metric-error** — verify output not a valid number → `git revert HEAD --no-edit`
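
The decision table above can be expressed as a small Python function. This is a sketch under stated assumptions: the function name and its boolean inputs are hypothetical, a failing Guard is mapped to **crash** (one reading of "verify/guard command failed"), and an unchanged metric is treated as **discard** because the protocol does not say otherwise.

```python
from typing import Optional

REVERT = "git revert HEAD --no-edit"


def decide(changed: bool, committed: bool, verify_ok: bool,
           metric_valid: bool, improved: bool,
           guard_ok: Optional[bool]) -> tuple[str, Optional[str]]:
    """Return (status, follow-up git command or None) for one iteration.

    guard_ok is None when no Guard command is configured.
    """
    if not changed:
        return "no-op", None            # nothing was modified, nothing to revert
    if not committed:
        return "hook-blocked", None     # the commit never landed
    if not verify_ok:
        return "crash", REVERT          # Verify command itself failed
    if not metric_valid:
        return "metric-error", REVERT   # output was not a valid number
    if guard_ok is False:
        return "crash", REVERT          # Guard failure overrides any improvement
    if improved:
        return "keep", None             # commit stays
    return "discard", REVERT            # metric worsened or did not move
```

Checking Guard before the metric mirrors Phase 5: a failing Guard reverts the commit regardless of how much the metric improved.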

### Phase 7: Log
Append row to TSV: iteration, timestamp, commit/-, metric, delta, guard status, guard-metric, status, description

### Eval Checkpoint
If --evals: check if current_iteration % interval == 0 → run checkpoint analysis.

### Bounded Check
If b

What is this skill?

EXECUTE IMMEDIATELY protocol: parse Goal, Scope, Metric, Verify, Guard, and Iterations from arguments

Bounded mode with a default of 25 iterations, or unlimited mode

Direction flag for higher_is_better or lower_is_better metric optimization

Optional Guard shell command that must pass before keeping changes

Optional --chain for comma-separated downstream commands after the loop

Optional --evals mid-loop checkpoints with --evals-interval override

Compatible agents: Codex, Claude Code, Cursor, any compatible agent

Adoption & trust: 997 installs on skills.sh; 4.9k GitHub stars; 2/3 security scanners passed (skills.sh audits).

Who is it for?

Indie builders with a reproducible numeric metric (tests, benchmarks, bundle analyzers) and clear file scope who want autonomous experimentation under guardrails.

Skip if: open-ended product discovery without a Verify command, changes that cannot be rolled back per iteration, or goals with no metric that a shell command can quantify.

What do I get? / Deliverables

The agent loops scoped edits against Verify (and Guard), retaining only changes that improve the metric until N iterations complete or you chain follow-up commands.

Retained code changes that improved the metric

Discarded attempts logged through the iteration protocol

Optional chained downstream command results

Journey fit

Spans multiple journey phases - primary shelf plus alternate fits below.
