
Autoresearch
Run a bounded autonomous loop that edits scoped files, measures a numeric metric via shell verify, and keeps or discards each change.
Overview
Autoresearch is an agent skill, most often used in the Build phase (and also for Ship testing and Operate iteration), that runs modify–verify–keep/discard loops against any numeric shell metric.
Install
npx skills add https://github.com/uditgoenka/autoresearch --skill autoresearch
What is this skill?
- EXECUTE IMMEDIATELY protocol: parse Goal, Scope, Metric, Verify, Guard, and Iterations from arguments
- Bounded mode with a default of 25 iterations, or unlimited mode
- Direction setting: higher_is_better (default) or lower_is_better
- Optional Guard shell command that must pass before keeping changes
- Optional --chain for comma-separated downstream commands after the loop
- Optional --evals mid-loop checkpoints with --evals-interval override
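The argument handling listed above can be sketched in Python. This is a minimal illustration, not the skill's own parser: the function name `parse_arguments`, the returned dictionary keys, and the use of `None` for unlimited mode are assumptions, and long-form aliases such as `--scope` and `--iterations` are left out for brevity. It also assumes that no field value contains another field label followed by a colon.

```python
import re

# Labelled fields named in the skill's argument list.
FIELDS = ("Goal", "Scope", "Metric", "Direction", "Verify", "Guard", "Iterations")
_LABEL = re.compile(r"\b(" + "|".join(FIELDS) + r"):\s*")


def parse_arguments(text):
    """Split 'Label: value' pairs and a few flags out of one argument string."""
    # Defaults taken from the protocol: 25 iterations, higher_is_better.
    config = {"direction": "higher_is_better", "iterations": 25, "evals": False}

    # Pull flags out first so they do not end up inside a field value.
    m = re.search(r"--evals-interval\s+(\d+)", text)
    if m:
        config["evals_interval"] = int(m.group(1))
        text = text.replace(m.group(0), " ")
    m = re.search(r"--chain\s+(\S+)", text)
    if m:
        config["chain"] = m.group(1).split(",")
        text = text.replace(m.group(0), " ")
    if re.search(r"--evals\b", text):
        config["evals"] = True
        text = re.sub(r"--evals\b", " ", text)

    # Each field's value runs until the next recognised label.
    matches = list(_LABEL.finditer(text))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        config[m.group(1).lower()] = text[m.end():end].strip()

    if str(config["iterations"]).lower() == "unlimited":
        config["iterations"] = None  # assumption: None stands for unbounded mode
    else:
        config["iterations"] = int(config["iterations"])
    return config
```

A caller would still need to check that Goal, Scope, Metric, and Verify are all present before starting the loop.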
Adoption & trust: 997 installs on skills.sh; 4.9k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You know what to improve and how to measure it, but manually applying and judging dozens of agent edits burns time and loses track of what worked.
Who is it for?
Indie builders with a reproducible numeric metric (tests, benchmarks, bundle analyzers) and clear file scope who want autonomous experimentation under guardrails.
Skip if: Open-ended product discovery without a verify command, changes that cannot be rolled back per iteration, or goals with no quantifiable shell metric.
When should I use this skill?
Use it when you have set a goal and a metric with Verify (and optionally Scope, Guard, Iterations, or --chain) and want immediate autonomous iteration rather than planning-only chat.
What do I get? / Deliverables
The agent loops scoped edits against Verify (and Guard), retaining only changes that improve the metric until N iterations complete, and then runs any chained follow-up commands.
- Retained code changes that improved the metric
- Discarded attempts logged through the iteration protocol
- Optional chained downstream command results
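The iteration log behind these deliverables is a tab-separated file whose header and baseline row are spelled out in the SKILL.md protocol. The sketch below shows one way to write it in Python. The function names `start_log` and `append_row` are hypothetical, and the ISO 8601 UTC timestamp is an assumption, since the protocol does not fix a timestamp format.

```python
import csv
from datetime import datetime, timezone

# Column order copied from the TSV header in the protocol.
COLUMNS = ["iteration", "timestamp", "commit", "metric", "delta",
           "guard", "guard-metric", "status", "description"]


def start_log(path, direction):
    """Create the TSV with the direction comment line and the column header."""
    with open(path, "w", newline="") as f:
        f.write(f"# metric_direction: {direction}\n")
        csv.writer(f, delimiter="\t", lineterminator="\n").writerow(COLUMNS)


def append_row(path, iteration, commit, metric, delta,
               guard, guard_metric, status, description):
    """Append one iteration row and return it as a list."""
    # Assumption: timestamps are ISO 8601 in UTC, to the second.
    timestamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    row = [iteration, timestamp, commit, metric, delta,
           guard, guard_metric, status, description]
    with open(path, "a", newline="") as f:
        csv.writer(f, delimiter="\t", lineterminator="\n").writerow(row)
    return row
```

Under these assumptions the baseline would be recorded with `append_row(path, 0, sha, metric, 0.0, guard, "-", "baseline", "initial state")`, where `sha`, `metric`, and `guard` come from the first Verify and Guard runs.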
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Autoresearch is shelved under Build agent-tooling because its primary job is iterative code modification toward a stated goal. The protocol is an agent iteration engine (Goal, Scope, Metric, Verify, Guard)—tooling for how the agent works, not a one-off integration.
Where it fits
Loop TS refactors in src/**/*.ts until bundle-size Verify drops.
Iterate patches until pytest Verify reports a higher pass count with Guard enforcing lint.
Tune config files until a latency benchmark Verify improves without breaking Guard deploy dry-run.
How it compares
Use instead of unstructured “keep trying until it feels better” chat when you need explicit keep/discard decisions tied to a number.
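A keep/discard decision tied to a number can be as small as the Python sketch below. The function name `decide` and its `(status, revert)` return shape are assumptions. Two choices are also assumed because the protocol does not state them: an unchanged metric counts as no improvement, and a Guard failure is recorded as a discard. The `crash`, `no-op`, and `hook-blocked` statuses from the protocol are not modelled here.

```python
import math


def decide(previous, current, direction="higher_is_better", guard_passed=True):
    """Return (status, revert) for one iteration.

    `current` is None when Verify did not print a valid number.
    `revert` is True when the experiment commit should be undone.
    """
    if current is None or not math.isfinite(current):
        return "metric-error", True

    if direction == "lower_is_better":
        improved = current < previous
    else:
        improved = current > previous

    # Guard failure overrides a metric improvement.
    if improved and guard_passed:
        return "keep", False
    return "discard", True
```

When `revert` is True, the protocol undoes the change with `git revert HEAD --no-edit`, which keeps the failed attempt visible in the git history for later review.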
Common Questions / FAQ
Who is autoresearch for?
Solo developers using Codex or similar agents who want Karpathy-style autoresearch loops with defined metrics and safety guards.
When should I use autoresearch?
In Build when optimizing scoped code against a metric; in Ship when Verify is tests or QA scores; in Operate when iterating on perf or error-rate commands you can shell out.
Is autoresearch safe to install?
It can modify files and run shell Verify/Guard commands—review the Security Audits panel on this page and tighten Scope and Guard before unlimited iterations.
SKILL.md
interface:
  display_name: "Autoresearch"
  short_description: "Autonomous goal-directed iteration engine"
  brand_color: "#7C3AED"
  default_prompt: "Set a goal, define a metric, let Codex loop until done"
policy:
  allow_implicit_invocation: true
---
name: autoresearch
description: "Autonomous iteration loop: modify, verify, keep/discard against any metric"
argument-hint: "[Goal: <text>] [Scope: <glob>] [Metric: <text>] [Verify: <cmd>] [Guard: <cmd>] [Iterations: N] [--evals]"
---

EXECUTE IMMEDIATELY: do not deliberate before reading this protocol.

## Parse Arguments

Extract from $ARGUMENTS:
- `Goal:` - what to improve
- `Scope:` or `--scope` - file globs
- `Metric:` - what to measure
- `Direction:` - higher_is_better (default) or lower_is_better
- `Verify:` - shell command that outputs a number
- `Guard:` - optional safety command (must always pass)
- `Iterations:` or `--iterations` - integer N for bounded mode (default: 25). "unlimited" for unbounded.
- `--evals` - enable mid-loop checkpoints
- `--evals-interval N` - checkpoint frequency override
- `--chain <targets>` - comma-separated downstream commands

## Setup (if required context missing)

If Goal, Scope, Metric, or Verify missing → use request_user_input (single batched call):
- Q1 (Goal): "What do you want to improve?"
- Q2 (Scope): "Which files?" - suggest globs from project
- Q3 (Metric+Verify): "How to measure? Provide a shell command that outputs a number"
- Q4 (Guard): "Safety command that must always pass?" - options: test cmd, build cmd, skip

If ALL provided inline → skip setup, proceed directly.

## Precondition Checks

1. Verify git repo exists (`git rev-parse --git-dir`)
2. Check clean working tree (`git status --porcelain`) - warn if dirty
3. Check for stale lock files, detached HEAD
4. If Guard set → run Guard to establish guard baseline
5. Fail fast on any critical issue. Warn on non-critical.
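The git-related precondition checks above could be sketched in Python as follows. This is an illustration under stated assumptions, not the skill's implementation: `check_preconditions` and the injectable `git` parameter are hypothetical names, `git symbolic-ref -q HEAD` is one common way to detect a detached HEAD, `index.lock` is assumed to be the lock file of interest, and treating everything except a missing repository as a warning is a guess at the critical/non-critical split. The Guard baseline from step 4 is left out.

```python
import os
import subprocess


def _git(args, cwd):
    """Run a git subcommand and return (exit code, trimmed stdout)."""
    proc = subprocess.run(["git", *args], cwd=cwd, capture_output=True, text=True)
    return proc.returncode, proc.stdout.strip()


def check_preconditions(cwd=".", git=_git):
    """Return (errors, warnings); a non-empty errors list means fail fast."""
    errors, warnings = [], []

    code, git_dir = git(["rev-parse", "--git-dir"], cwd)
    if code != 0:
        # Without a repository nothing can be committed or reverted.
        return ["not a git repository"], warnings

    _, status = git(["status", "--porcelain"], cwd)
    if status:
        warnings.append("working tree is dirty")

    # Assumption: symbolic-ref exits non-zero when HEAD is not on a branch.
    code, _ = git(["symbolic-ref", "-q", "HEAD"], cwd)
    if code != 0:
        warnings.append("HEAD is detached")

    # Assumption: a leftover index.lock is the stale lock worth reporting.
    if os.path.exists(os.path.join(cwd, git_dir, "index.lock")):
        warnings.append("index.lock present")

    return errors, warnings
```

Passing the `git` runner as a parameter keeps the checks testable without a real repository; a caller would stop when `errors` is non-empty and print `warnings` otherwise.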
## Verify Safety Screen

Before first dry-run, screen Verify command for: rm -rf, fork bombs, curl|sh, embedded credentials, outbound writes. Block dangerous commands.

## Establish Baseline (Iteration 0)

1. Run Verify command → extract numeric metric
2. Record as iteration 0 in TSV: `0\t{timestamp}\t{commit}\t{metric}\t0.0\t{guard}\t-\tbaseline\tinitial state`
3. Create output directory: `autoresearch/loop-{YYMMDD}-{HHMM}/`
4. Write TSV header: `# metric_direction: {direction}\niteration\ttimestamp\tcommit\tmetric\tdelta\tguard\tguard-metric\tstatus\tdescription`

## Iteration Loop

For each iteration (1 to max_iterations, or unbounded):

### Phase 1: Review (read git history as memory)
- Read last 10-20 lines of results TSV
- Run `git log --oneline -20` to see what worked/failed
- If last iteration was "keep" → run `git diff HEAD~1` to see what improved metric
- Identify: what worked, what failed, what's untried

### Phase 2: Modify
- Based on review, make ONE focused change to improve the metric
- Change must be atomic: one logical unit of work

### Phase 3: Commit
- Stage and commit with `experiment: {description}` prefix
- Record commit SHA

### Phase 4: Verify
- Run Verify command → extract new metric value
- Calculate delta from previous iteration
- Metric improved (correct direction) → candidate for keep

### Phase 5: Guard (if configured)
- Run Guard command.
If fails → revert regardless of metric improvement

### Phase 6: Decide
- **keep** - metric improved, guard passed → commit stays
- **discard** - metric worsened → `git revert HEAD --no-edit`
- **crash** - verify/guard command failed → `git revert HEAD --no-edit`
- **no-op** - no change made this iteration
- **hook-blocked** - git hook blocked the commit
- **metric-error** - verify output not a valid number → `git revert HEAD --no-edit`

### Phase 7: Log

Append row to TSV: iteration, timestamp, commit/-, metric, delta, guard status, guard-metric, status, description

### Eval Checkpoint

If --evals: check if current_iteration % interval == 0 → run checkpoint analysis.

### Bounded Check

If b