Autoresearch Agent

Name: Autoresearch Agent
Author: alirezarezvani

alirezarezvani/claude-skills

572 installs
23.5k repo stars
Updated July 17, 2026
alirezarezvani/claude-skills

autoresearch-agent is a Claude Code skill that runs metric-driven experiment loops to benchmark and iteratively improve code speed, bundle size, test reliability, or Docker build time for developers optimizing production

About

autoresearch-agent is a Claude Code skill from alirezarezvani/claude-skills that automates engineering experiment loops with setup_experiment.py scripts and benchmark evaluators. Domains include code speed optimization with pytest benches targeting p50_ms, bundle size reduction via webpack builds, test reliability, and Docker build time, running roughly 5 minutes per experiment or about 12 per hour. Developers reach for autoresearch-agent when a metric is defined and an evaluator command exists but manual tuning is too slow. The agent iterates on algorithms, caching, query patterns, and I/O until the configured metric moves in the chosen direction.

CLI `setup_experiment.py` scaffolds domains with target file, eval command, metric, and optimize direction (lower/higher
Engineering presets: API speed, bundle size, flaky-test pass rate, and Docker build speed with pluggable evaluators.
Documented throughput: ~5 min per experiment, ~12/hour, ~100 overnight for unattended sweeps.
Agents optimize algorithms, caching, I/O, webpack config, Dockerfile layers, and parser logic against your chosen metric
Free local cost model—runs your existing pytest, npm build, and custom evaluate.py scripts.

Autoresearch Agent by the numbers

572 all-time installs (skills.sh)
Ranked #392 of 2,725 Automation & Workflows skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 31, 2026 (Skillselion catalog sync)

npx skills add https://github.com/alirezarezvani/claude-skills --skill autoresearch-agent

Add your badge

Show developers this skill is listed on Skillselion. Paste this into your README.

[![Listed on Skillselion](https://skillselion.com/badge/skills/alirezarezvani/claude-skills/autoresearch-agent.svg)](https://skillselion.com/skills/alirezarezvani/claude-skills/autoresearch-agent)

Installs	572
repo stars	★ 23.5k
Security audit	1 / 3 scanners passed
Last updated	July 17, 2026
Repository	alirezarezvani/claude-skills ↗

How do you automate performance experiments on code?

Spin up metric-driven experiment loops that benchmark and iteratively improve code speed, bundle size, test reliability, or Docker build time.

Who is it for?

Backend and platform engineers with existing benchmark commands who want automated metric-driven optimization loops overnight.

Skip if: Developers without measurable eval scripts or those solving one-off bugs without quantified performance targets should skip autoresearch-agent.

When should I use this skill?

A developer asks to optimize API speed, reduce bundle size, improve test reliability, or shorten Docker build time with automated experiments.

What you get

Benchmarked code changes with metric deltas for p50 latency, bundle size, test reliability, or Docker build duration.

Experiment metric reports
Iteratively optimized code or config changes

By the numbers

Runs roughly 5 minutes per experiment, about 12 per hour, ~100 overnight
Supports code speed, bundle size, test reliability, and Docker build time domains

Files

SKILL.mdMarkdownGitHub ↗

Autoresearch Agent

You sleep. The agent experiments. You wake up to results.

Autonomous experiment loop inspired by Karpathy's autoresearch. The agent edits one file, runs a fixed evaluation, keeps improvements, discards failures, and loops indefinitely.

Not one guess — fifty measured attempts, compounding.

---

Slash Commands

Command	What it does
`/ar:setup`	Set up a new experiment interactively
`/ar:run`	Run a single experiment iteration
`/ar:loop`	Start autonomous loop with configurable interval (10m, 1h, daily, weekly, monthly)
`/ar:status`	Show dashboard and results
`/ar:resume`	Resume a paused experiment

---

When This Skill Activates

Recognize these patterns from the user:

"Make this faster / smaller / better"
"Optimize [file] for [metric]"
"Improve my [headlines / copy / prompts]"
"Run experiments overnight"
"I want to get [metric] from X to Y"
Any request involving: optimize, benchmark, improve, experiment loop, autoresearch

If the user describes a target file + a way to measure success → this skill applies.

---

Setup

First Time — Create the Experiment

Run the setup script. The user decides where experiments live:

Project-level (inside repo, git-tracked, shareable with team):

python scripts/setup_experiment.py \
  --domain engineering \
  --name api-speed \
  --target src/api/search.py \
  --eval "pytest bench.py --tb=no -q" \
  --metric p50_ms \
  --direction lower \
  --scope project

User-level (personal, in ~/.autoresearch/):

python scripts/setup_experiment.py \
  --domain marketing \
  --name medium-ctr \
  --target content/titles.md \
  --eval "python evaluate.py" \
  --metric ctr_score \
  --direction higher \
  --evaluator llm_judge_content \
  --scope user

The --scope flag determines where .autoresearch/ lives:

project (default) → .autoresearch/ in the repo root. Experiment definitions are git-tracked. Results are gitignored.
user → ~/.autoresearch/ in the home directory. Everything is personal.

What Setup Creates

.autoresearch/
├── config.yaml                        ← Global settings
├── .gitignore                         ← Ignores results.tsv, *.log
└── {domain}/{experiment-name}/
    ├── program.md                     ← Objectives, constraints, strategy
    ├── config.cfg                     ← Target, eval cmd, metric, direction
    ├── results.tsv                    ← Experiment log (gitignored)
    └── evaluate.py                    ← Evaluation script (if --evaluator used)

results.tsv columns: commit | metric | status | description

commit — short git hash
metric — float value or "N/A" for crashes
status — keep | discard | crash
description — what changed or why it crashed

Domains

Domain	Use Cases
`engineering`	Code speed, memory, bundle size, test pass rate, build time
`marketing`	Headlines, social copy, email subjects, ad copy, engagement
`content`	Article structure, SEO descriptions, readability, CTR
`prompts`	System prompts, chatbot tone, agent instructions
`custom`	Anything else with a measurable metric

If `program.md` Already Exists

The user may have written their own program.md. If found in the experiment directory, read it. It overrides the template. Only ask for what's missing.

---

Agent Protocol

You are the loop. The scripts handle setup and evaluation — you handle the creative work.

Before Starting

1. Read .autoresearch/{domain}/{name}/config.cfg to get:

target — the file you edit
evaluate_cmd — the command that measures your changes
metric — the metric name to look for in eval output
metric_direction — "lower" or "higher" is better
time_budget_minutes — max time per evaluation

2. Read program.md for strategy, constraints, and what you can/cannot change 3. Read results.tsv for experiment history (columns: commit, metric, status, description) 4. Checkout the experiment branch: git checkout autoresearch/{domain}/{name}

Each Iteration

1. Review results.tsv — what worked? What failed? What hasn't been tried? 2. Decide ONE change to the target file. One variable per experiment. 3. Edit the target file 4. Commit: git add {target} && git commit -m "experiment: {description}" 5. Evaluate: python scripts/run_experiment.py --experiment {domain}/{name} --single 6. Read the output — it prints KEEP, DISCARD, or CRASH with the metric value 7. Go to step 1

What the Script Handles (you don't)

Running the eval command with timeout
Parsing the metric from eval output
Comparing to previous best
Reverting the commit on failure (git reset --hard HEAD~1)
Logging the result to results.tsv

Starting an Experiment

# Single iteration (the agent calls this repeatedly)
python scripts/run_experiment.py --experiment engineering/api-speed --single

# Dry run (test setup before starting)
python scripts/run_experiment.py --experiment engineering/api-speed --dry-run

Strategy Escalation

Runs 1-5: Low-hanging fruit (obvious improvements, simple optimizations)
Runs 6-15: Systematic exploration (vary one parameter at a time)
Runs 16-30: Structural changes (algorithm swaps, architecture shifts)
Runs 30+: Radical experiments (completely different approaches)
If no improvement in 20+ runs: update program.md Strategy section

Self-Improvement

After every 10 experiments, review results.tsv for patterns. Update the Strategy section of program.md with what you learned (e.g., "caching changes consistently improve by 5-10%", "refactoring attempts never improve the metric"). Future iterations benefit from this accumulated knowledge.

Stopping

Run until interrupted by the user, context limit reached, or goal in program.md is met
Before stopping: ensure results.tsv is up to date
On context limit: the next session can resume — results.tsv and git log persist

Rules

One change per experiment. Don't change 5 things at once. You won't know what worked.
Simplicity criterion. A small improvement that adds ugly complexity is not worth it. Equal performance with simpler code is a win. Removing code that gets same results is the best outcome.
Never modify the evaluator. evaluate.py is the ground truth. Modifying it invalidates all comparisons. Hard stop if you catch yourself doing this.
Timeout. If a run exceeds 2.5× the time budget, kill it and treat as crash.
Crash handling. If it's a typo or missing import, fix and re-run. If the idea is fundamentally broken, revert, log "crash", move on. 5 consecutive crashes → pause and alert.
No new dependencies. Only use what's already available in the project.

---

Evaluators

Ready-to-use evaluation scripts. Copied into the experiment directory during setup with --evaluator.

Free Evaluators (no API cost)

Evaluator	Metric	Use Case
`benchmark_speed`	`p50_ms` (lower)	Function/API execution time
`benchmark_size`	`size_bytes` (lower)	File, bundle, Docker image size
`test_pass_rate`	`pass_rate` (higher)	Test suite pass percentage
`build_speed`	`build_seconds` (lower)	Build/compile/Docker build time
`memory_usage`	`peak_mb` (lower)	Peak memory during execution

LLM Judge Evaluators (uses your subscription)

Evaluator	Metric	Use Case
`llm_judge_content`	`ctr_score` 0-10 (higher)	Headlines, titles, descriptions
`llm_judge_prompt`	`quality_score` 0-100 (higher)	System prompts, agent instructions
`llm_judge_copy`	`engagement_score` 0-10 (higher)	Social posts, ad copy, emails

LLM judges call the CLI tool the user is already running (Claude, Codex, Gemini). The evaluation prompt is locked inside evaluate.py — the agent cannot modify it. This prevents the agent from gaming its own evaluator.

The user's existing subscription covers the cost:

Claude Code Max → unlimited Claude calls for evaluation
Codex CLI (ChatGPT Pro) → unlimited Codex calls
Gemini CLI (free tier) → free evaluation calls

Custom Evaluators

If no built-in evaluator fits, the user writes their own evaluate.py. Only requirement: it must print metric_name: value to stdout.

#!/usr/bin/env python3
# My custom evaluator — DO NOT MODIFY after experiment starts
import subprocess
result = subprocess.run(["my-benchmark", "--json"], capture_output=True, text=True)
# Parse and output
print(f"my_metric: {parse_score(result.stdout)}")

---

Viewing Results

# Single experiment
python scripts/log_results.py --experiment engineering/api-speed

# All experiments in a domain
python scripts/log_results.py --domain engineering

# Cross-experiment dashboard
python scripts/log_results.py --dashboard

# Export formats
python scripts/log_results.py --experiment engineering/api-speed --format csv --output results.csv
python scripts/log_results.py --experiment engineering/api-speed --format markdown --output results.md
python scripts/log_results.py --dashboard --format markdown --output dashboard.md

Dashboard Output

DOMAIN          EXPERIMENT          RUNS  KEPT  BEST         Δ FROM START  STATUS
engineering     api-speed            47    14   185ms        -76.9%        active
engineering     bundle-size          23     8   412KB        -58.3%        paused
marketing       medium-ctr           31    11   8.4/10       +68.0%        active
prompts         support-tone         15     6   82/100       +46.4%        done

Export Formats

TSV — default, tab-separated (compatible with spreadsheets)
CSV — comma-separated, with proper quoting
Markdown — formatted table, readable in GitHub/docs

---

Proactive Triggers

Flag these without being asked:

No evaluation command works → Test it before starting the loop. Run once, verify output.
Target file not in git → git init && git add . && git commit -m 'initial' first.
Metric direction unclear → Ask: is lower or higher better? Must know before starting.
Time budget too short → If eval takes longer than budget, every run crashes.
Agent modifying evaluate.py → Hard stop. This invalidates all comparisons.
5 consecutive crashes → Pause the loop. Alert the user. Don't keep burning cycles.
No improvement in 20+ runs → Suggest changing strategy in program.md or trying a different approach.

---

Installation

One-liner (any tool)

git clone https://github.com/alirezarezvani/claude-skills.git
cp -r claude-skills/engineering/autoresearch-agent ~/.claude/skills/

Multi-tool install

./scripts/convert.sh --skill autoresearch-agent --tool codex|gemini|cursor|windsurf|openclaw

OpenClaw

clawhub install cs-autoresearch-agent

---

Related Skills

self-improving-agent — improves an agent's own memory/rules over time. NOT for structured experiment loops.
senior-ml-engineer — ML architecture decisions. Complementary — use for initial design, then autoresearch for optimization.
tdd-guide — test-driven development. Complementary — tests can be the evaluation function.
skill-security-auditor — audit skills before publishing. NOT for optimization loops.

Experiment Domains Guide

Domain: Engineering

Code Speed Optimization

python scripts/setup_experiment.py \
  --domain engineering \
  --name api-speed \
  --target src/api/search.py \
  --eval "python -m pytest tests/bench_search.py --tb=no -q" \
  --metric p50_ms \
  --direction lower \
  --evaluator benchmark_speed

What the agent optimizes: Algorithm, data structures, caching, query patterns, I/O. Cost: Free — just runs benchmarks. Speed: ~5 min/experiment, ~12/hour, ~100 overnight.

Bundle Size Reduction

python scripts/setup_experiment.py \
  --domain engineering \
  --name bundle-size \
  --target webpack.config.js \
  --eval "npm run build && python .autoresearch/engineering/bundle-size/evaluate.py" \
  --metric size_bytes \
  --direction lower \
  --evaluator benchmark_size

Edit evaluate.py to set TARGET_FILE = "dist/main.js" and add BUILD_CMD = "npm run build".

Test Pass Rate

python scripts/setup_experiment.py \
  --domain engineering \
  --name fix-flaky-tests \
  --target src/utils/parser.py \
  --eval "python .autoresearch/engineering/fix-flaky-tests/evaluate.py" \
  --metric pass_rate \
  --direction higher \
  --evaluator test_pass_rate

Docker Build Speed

python scripts/setup_experiment.py \
  --domain engineering \
  --name docker-build \
  --target Dockerfile \
  --eval "python .autoresearch/engineering/docker-build/evaluate.py" \
  --metric build_seconds \
  --direction lower \
  --evaluator build_speed

Memory Optimization

python scripts/setup_experiment.py \
  --domain engineering \
  --name memory-usage \
  --target src/processor.py \
  --eval "python .autoresearch/engineering/memory-usage/evaluate.py" \
  --metric peak_mb \
  --direction lower \
  --evaluator memory_usage

ML Training (Karpathy-style)

Requires NVIDIA GPU. See autoresearch.

python scripts/setup_experiment.py \
  --domain engineering \
  --name ml-training \
  --target train.py \
  --eval "uv run train.py" \
  --metric val_bpb \
  --direction lower \
  --time-budget 5

---

Domain: Marketing

Medium Article Headlines

python scripts/setup_experiment.py \
  --domain marketing \
  --name medium-ctr \
  --target content/titles.md \
  --eval "python .autoresearch/marketing/medium-ctr/evaluate.py" \
  --metric ctr_score \
  --direction higher \
  --evaluator llm_judge_content

Edit evaluate.py: set TARGET_FILE = "content/titles.md" and CLI_TOOL = "claude".

What the agent optimizes: Title phrasing, curiosity gaps, specificity, emotional triggers. Cost: Uses your CLI subscription (Claude Max = unlimited). Speed: ~2 min/experiment, ~30/hour.

Social Media Copy

python scripts/setup_experiment.py \
  --domain marketing \
  --name twitter-engagement \
  --target social/tweets.md \
  --eval "python .autoresearch/marketing/twitter-engagement/evaluate.py" \
  --metric engagement_score \
  --direction higher \
  --evaluator llm_judge_copy

Edit evaluate.py: set PLATFORM = "twitter" (or linkedin, instagram).

Email Subject Lines

python scripts/setup_experiment.py \
  --domain marketing \
  --name email-open-rate \
  --target emails/subjects.md \
  --eval "python .autoresearch/marketing/email-open-rate/evaluate.py" \
  --metric engagement_score \
  --direction higher \
  --evaluator llm_judge_copy

Edit evaluate.py: set PLATFORM = "email".

Ad Copy

python scripts/setup_experiment.py \
  --domain marketing \
  --name ad-copy-q2 \
  --target ads/google-search.md \
  --eval "python .autoresearch/marketing/ad-copy-q2/evaluate.py" \
  --metric engagement_score \
  --direction higher \
  --evaluator llm_judge_copy

Edit evaluate.py: set PLATFORM = "ad".

---

Domain: Content

Article Structure & Readability

python scripts/setup_experiment.py \
  --domain content \
  --name article-structure \
  --target drafts/my-article.md \
  --eval "python .autoresearch/content/article-structure/evaluate.py" \
  --metric ctr_score \
  --direction higher \
  --evaluator llm_judge_content

SEO Descriptions

python scripts/setup_experiment.py \
  --domain content \
  --name seo-meta \
  --target seo/descriptions.md \
  --eval "python .autoresearch/content/seo-meta/evaluate.py" \
  --metric ctr_score \
  --direction higher \
  --evaluator llm_judge_content

---

Domain: Prompts

System Prompt Optimization

python scripts/setup_experiment.py \
  --domain prompts \
  --name support-bot \
  --target prompts/support-system.md \
  --eval "python .autoresearch/prompts/support-bot/evaluate.py" \
  --metric quality_score \
  --direction higher \
  --evaluator llm_judge_prompt

Requires tests/cases.json with test inputs and expected outputs:

[
  {
    "input": "I can't log in to my account",
    "expected": "Ask for email, check account status, offer password reset"
  },
  {
    "input": "How do I cancel my subscription?",
    "expected": "Empathetic response, explain cancellation steps, offer retention"
  }
]

Agent Skill Optimization

python scripts/setup_experiment.py \
  --domain prompts \
  --name skill-improvement \
  --target SKILL.md \
  --eval "python .autoresearch/prompts/skill-improvement/evaluate.py" \
  --metric quality_score \
  --direction higher \
  --evaluator llm_judge_prompt

---

Choosing Your Domain

I want to...	Domain	Evaluator	Cost
Speed up my code	engineering	benchmark_speed	Free
Shrink my bundle	engineering	benchmark_size	Free
Fix flaky tests	engineering	test_pass_rate	Free
Speed up Docker builds	engineering	build_speed	Free
Reduce memory usage	engineering	memory_usage	Free
Train ML models	engineering	(custom)	Free + GPU
Write better headlines	marketing	llm_judge_content	Subscription
Improve social posts	marketing	llm_judge_copy	Subscription
Optimize email subjects	marketing	llm_judge_copy	Subscription
Improve ad copy	marketing	llm_judge_copy	Subscription
Optimize article structure	content	llm_judge_content	Subscription
Improve SEO descriptions	content	llm_judge_content	Subscription
Optimize system prompts	prompts	llm_judge_prompt	Subscription
Improve agent skills	prompts	llm_judge_prompt	Subscription

First time? Start with an engineering experiment (free, fast, measurable). Once comfortable, try content/marketing with LLM judges.

program.md Templates

Copy the template for your domain and paste into your project root as program.md.

---

ML Training (Karpathy-style)

# autoresearch — ML Training

## Goal
Minimize val_bpb on the validation set. Lower is better.

## What You Can Change (train.py only)
- Model architecture (depth, width, attention heads, FFN ratio)
- Optimizer (learning rate, warmup, scheduler, weight decay)
- Training loop (batch size, gradient accumulation, clipping)
- Regularization (dropout, weight tying, etc.)
- Any self-contained improvement that doesn't require new packages

## What You Cannot Change
- prepare.py (fixed — contains evaluation harness)
- Dependencies (pyproject.toml is locked)
- Time budget (always 5 minutes, wall clock)
- Evaluation metric (val_bpb is the ground truth)

## Strategy
1. First run: establish baseline. Do not change anything.
2. Explore learning rate range (try 2x and 0.5x current)
3. Try depth changes (±2 layers)
4. Try optimizer changes (Muon vs. AdamW variants)
5. If things improve, double down. If stuck, try something radical.

## Simplicity Rule
A small improvement with ugly code is NOT worth it.
Equal performance with simpler code IS worth it.
Removing code that gets same results is the best outcome.

## Stop When
val_bpb < 0.95 OR after 100 experiments, whichever comes first.

---

Prompt Engineering

# autoresearch — Prompt Optimization

## Goal
Maximize eval_score on the test suite. Higher is better (0-100).

## What You Can Change (prompt.md only)
- System prompt instructions
- Examples and few-shot demonstrations
- Output format specifications
- Chain-of-thought instructions
- Persona and tone
- Task decomposition strategies

## What You Cannot Change
- evaluate.py (fixed evaluation harness)
- Test cases in tests/ (ground truth)
- Model being evaluated (specified in evaluate.py)
- Scoring criteria (defined in evaluate.py)

## Strategy
1. First run: baseline with current prompt (or empty)
2. Add clear role/persona definition
3. Add output format specification
4. Add chain-of-thought instruction
5. Add 2-3 diverse examples
6. Refine based on failure modes from run.log

## Evaluation
- evaluate.py runs the prompt against 20 test cases
- Each test case is scored 1-10 by your CLI tool (Claude, Codex, or Gemini)
- quality_score = average * 10 (maps to 10-100)
- Run log shows which test cases failed

## Stop When
eval_score >= 85 OR after 50 experiments.

---

Code Performance

# autoresearch — Performance Optimization

## Goal
Minimize p50_ms (median latency). Lower is better.

## What You Can Change (src/module.py only)
- Algorithm implementation
- Data structures (use faster alternatives)
- Caching and memoization
- Vectorization (NumPy, etc.)
- Loop optimization
- I/O patterns
- Memory allocation patterns

## What You Cannot Change
- benchmark.py (fixed benchmark harness)
- Public API (function signatures must stay the same)
- External dependencies (add nothing new)
- Correctness tests (tests/ must still pass)

## Constraints
- Correctness is non-negotiable. benchmark.py runs tests first.
- If tests fail → immediate crash status, no metric recorded.
- Memory usage: p99 < 2x baseline acceptable, hard limit at 4x.

## Strategy
1. Baseline: profile first, don't guess
2. Check if there's any O(n²) → O(n log n) opportunity
3. Try caching repeated computations
4. Try NumPy vectorization for loops
5. Try algorithm-level changes last (higher risk)

## Stop When
p50_ms < 50ms OR improvement plateaus for 10 consecutive experiments.

---

Agent Skill Optimization

# autoresearch — Skill Optimization

## Goal
Maximize pass_rate on the task evaluation suite. Higher is better (0-1).

## What You Can Change (SKILL.md only)
- Skill description and trigger phrases
- Core workflow steps and ordering
- Decision frameworks and rules
- Output format specifications
- Example inputs/outputs
- Related skills disambiguation
- Proactive trigger conditions

## What You Cannot Change
- your custom evaluate.py (see Custom Evaluators in SKILL.md)
- Test tasks in tests/ (ground truth benchmark)
- Skill name (used for routing)
- License or metadata

## Evaluation
- evaluate.py runs SKILL.md against 15 standardized tasks
- Your CLI tool scores each task: 0 (fail), 0.5 (partial), 1 (pass)
- pass_rate = sum(scores) / 15

## Strategy
1. Baseline: run as-is
2. Improve trigger description (better routing = more passes)
3. Sharpen the core workflow (clearer = better execution)
4. Add missing edge cases to the rules section
5. Improve disambiguation (reduce false-positive routing)

## Simplicity Rule
A shorter SKILL.md that achieves the same score is better.
Aim for 200-400 lines total.

## Stop When
pass_rate >= 0.90 OR after 30 experiments.

#!/usr/bin/env python3
"""
autoresearch-agent: Results Viewer

View experiment results in multiple formats: terminal, CSV, Markdown.
Supports single experiment, domain, or cross-experiment dashboard.

Usage:
    python scripts/log_results.py --experiment engineering/api-speed
    python scripts/log_results.py --domain engineering
    python scripts/log_results.py --dashboard
    python scripts/log_results.py --experiment engineering/api-speed --format csv --output results.csv
    python scripts/log_results.py --experiment engineering/api-speed --format markdown --output results.md
    python scripts/log_results.py --dashboard --format markdown --output dashboard.md
"""

import argparse
import csv
import io
import sys
import time
from pathlib import Path


def find_autoresearch_root():
    """Find .autoresearch/ in project or user home."""
    project_root = Path(".").resolve() / ".autoresearch"
    if project_root.exists():
        return project_root
    user_root = Path.home() / ".autoresearch"
    if user_root.exists():
        return user_root
    return None


def load_config(experiment_dir):
    """Load config.cfg."""
    cfg_file = experiment_dir / "config.cfg"
    config = {}
    if cfg_file.exists():
        for line in cfg_file.read_text().splitlines():
            if ":" in line:
                k, v = line.split(":", 1)
                config[k.strip()] = v.strip()
    return config


def load_results(experiment_dir):
    """Load results.tsv into list of dicts."""
    tsv = experiment_dir / "results.tsv"
    if not tsv.exists():
        return []
    results = []
    for line in tsv.read_text().splitlines()[1:]:
        parts = line.split("\t")
        if len(parts) >= 4:
            try:
                metric = float(parts[1]) if parts[1] != "N/A" else None
            except ValueError:
                metric = None
            results.append({
                "commit": parts[0],
                "metric": metric,
                "status": parts[2],
                "description": parts[3],
            })
    return results


def compute_stats(results, direction):
    """Compute statistics from results."""
    keeps = [r for r in results if r["status"] == "keep"]
    discards = [r for r in results if r["status"] == "discard"]
    crashes = [r for r in results if r["status"] == "crash"]

    valid_keeps = [r for r in keeps if r["metric"] is not None]
    baseline = valid_keeps[0]["metric"] if valid_keeps else None
    if valid_keeps:
        best = min(r["metric"] for r in valid_keeps) if direction == "lower" else max(r["metric"] for r in valid_keeps)
    else:
        best = None

    pct_change = None
    if baseline is not None and best is not None and baseline != 0:
        if direction == "lower":
            pct_change = (baseline - best) / baseline * 100
        else:
            pct_change = (best - baseline) / baseline * 100

    return {
        "total": len(results),
        "keeps": len(keeps),
        "discards": len(discards),
        "crashes": len(crashes),
        "baseline": baseline,
        "best": best,
        "pct_change": pct_change,
    }


# --- Terminal Output ---

def print_experiment(experiment_dir, experiment_path):
    """Print single experiment results to terminal."""
    config = load_config(experiment_dir)
    results = load_results(experiment_dir)
    direction = config.get("metric_direction", "lower")
    metric_name = config.get("metric", "metric")

    if not results:
        print(f"No results for {experiment_path}")
        return

    stats = compute_stats(results, direction)

    print(f"\n{'─' * 65}")
    print(f"  {experiment_path}")
    print(f"  Target: {config.get('target', '?')} | Metric: {metric_name} ({direction})")
    print(f"{'─' * 65}")
    print(f"  Total: {stats['total']} | Keep: {stats['keeps']} | Discard: {stats['discards']} | Crash: {stats['crashes']}")

    if stats["baseline"] is not None and stats["best"] is not None:
        pct = f" ({stats['pct_change']:+.1f}%)" if stats["pct_change"] is not None else ""
        print(f"  Baseline: {stats['baseline']:.6f} -> Best: {stats['best']:.6f}{pct}")

    print(f"\n  {'COMMIT':<10} {'METRIC':>12} {'STATUS':<10} DESCRIPTION")
    print(f"  {'─' * 60}")
    for r in results:
        m = f"{r['metric']:.6f}" if r["metric"] is not None else "N/A     "
        icon = {"keep": "+", "discard": "-", "crash": "!"}.get(r["status"], "?")
        print(f"  {r['commit']:<10} {m:>12} {icon} {r['status']:<7} {r['description'][:35]}")
    print()


def print_dashboard(root):
    """Print cross-experiment dashboard."""
    experiments = []
    for domain_dir in sorted(root.iterdir()):
        if not domain_dir.is_dir() or domain_dir.name.startswith("."):
            continue
        for exp_dir in sorted(domain_dir.iterdir()):
            if not exp_dir.is_dir() or not (exp_dir / "config.cfg").exists():
                continue
            config = load_config(exp_dir)
            results = load_results(exp_dir)
            direction = config.get("metric_direction", "lower")
            stats = compute_stats(results, direction)

            best_str = f"{stats['best']:.4f}" if stats["best"] is not None else "—"
            pct_str = f"{stats['pct_change']:+.1f}%" if stats["pct_change"] is not None else "—"

            # Determine status
            status = "idle"
            if stats["total"] > 0:
                tsv = exp_dir / "results.tsv"
                if tsv.exists():
                    age_hours = (time.time() - tsv.stat().st_mtime) / 3600
                    status = "active" if age_hours < 1 else "paused" if age_hours < 24 else "done"

            experiments.append({
                "domain": domain_dir.name,
                "name": exp_dir.name,
                "runs": stats["total"],
                "kept": stats["keeps"],
                "best": best_str,
                "change": pct_str,
                "status": status,
                "metric": config.get("metric", "?"),
            })

    if not experiments:
        print("No experiments found.")
        return experiments

    print(f"\n{'─' * 90}")
    print(f"  autoresearch — Dashboard")
    print(f"{'─' * 90}")
    print(f"  {'DOMAIN':<15} {'EXPERIMENT':<20} {'RUNS':>5} {'KEPT':>5} {'BEST':>12} {'CHANGE':>10} {'STATUS':<8}")
    print(f"  {'─' * 85}")
    for e in experiments:
        print(f"  {e['domain']:<15} {e['name']:<20} {e['runs']:>5} {e['kept']:>5} {e['best']:>12} {e['change']:>10} {e['status']:<8}")
    print()
    return experiments


# --- CSV Export ---

def export_experiment_csv(experiment_dir, experiment_path):
    """Export single experiment as CSV string."""
    config = load_config(experiment_dir)
    results = load_results(experiment_dir)
    direction = config.get("metric_direction", "lower")
    stats = compute_stats(results, direction)

    buf = io.StringIO()
    writer = csv.writer(buf)

    # Header with metadata
    writer.writerow(["# Experiment", experiment_path])
    writer.writerow(["# Target", config.get("target", "")])
    writer.writerow(["# Metric", f"{config.get('metric', '')} ({direction} is better)"])
    if stats["baseline"] is not None:
        writer.writerow(["# Baseline", f"{stats['baseline']:.6f}"])
    if stats["best"] is not None:
        pct = f" ({stats['pct_change']:+.1f}%)" if stats["pct_change"] is not None else ""
        writer.writerow(["# Best", f"{stats['best']:.6f}{pct}"])
    writer.writerow(["# Total", stats["total"]])
    writer.writerow(["# Keep/Discard/Crash", f"{stats['keeps']}/{stats['discards']}/{stats['crashes']}"])
    writer.writerow([])

    writer.writerow(["Commit", "Metric", "Status", "Description"])
    for r in results:
        m = f"{r['metric']:.6f}" if r["metric"] is not None else "N/A"
        writer.writerow([r["commit"], m, r["status"], r["description"]])

    return buf.getvalue()


def export_dashboard_csv(root, domain_filter=None):
    """Export dashboard as CSV string."""
    experiments = []
    for domain_dir in sorted(root.iterdir()):
        if not domain_dir.is_dir() or domain_dir.name.startswith("."):
            continue
        if domain_filter and domain_dir.name != domain_filter:
            continue
        for exp_dir in sorted(domain_dir.iterdir()):
            if not exp_dir.is_dir() or not (exp_dir / "config.cfg").exists():
                continue
            config = load_config(exp_dir)
            results = load_results(exp_dir)
            direction = config.get("metric_direction", "lower")
            stats = compute_stats(results, direction)
            best_str = f"{stats['best']:.6f}" if stats["best"] is not None else ""
            pct_str = f"{stats['pct_change']:+.1f}%" if stats["pct_change"] is not None else ""
            experiments.append([
                domain_dir.name, exp_dir.name, config.get("metric", ""),
                stats["total"], stats["keeps"], stats["discards"], stats["crashes"],
                best_str, pct_str
            ])

    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["Domain", "Experiment", "Metric", "Runs", "Kept", "Discarded", "Crashed", "Best", "Change"])
    for e in experiments:
        writer.writerow(e)
    return buf.getvalue()


# --- Markdown Export ---

def export_experiment_markdown(experiment_dir, experiment_path):
    """Export single experiment as Markdown string."""
    config = load_config(experiment_dir)
    results = load_results(experiment_dir)
    direction = config.get("metric_direction", "lower")
    metric_name = config.get("metric", "metric")
    stats = compute_stats(results, direction)

    lines = []
    lines.append(f"# Autoresearch: {experiment_path}\n")
    lines.append(f"**Target:** `{config.get('target', '?')}`  ")
    lines.append(f"**Metric:** `{metric_name}` ({direction} is better)  ")
    lines.append(f"**Experiments:** {stats['total']} total — {stats['keeps']} kept, {stats['discards']} discarded, {stats['crashes']} crashed\n")

    if stats["baseline"] is not None and stats["best"] is not None:
        pct = f" ({stats['pct_change']:+.1f}%)" if stats["pct_change"] is not None else ""
        lines.append(f"**Progress:** `{stats['baseline']:.6f}` → `{stats['best']:.6f}`{pct}\n")

    lines.append(f"| Commit | Metric | Status | Description |")
    lines.append(f"|--------|--------|--------|-------------|")
    for r in results:
        m = f"`{r['metric']:.6f}`" if r["metric"] is not None else "N/A"
        lines.append(f"| `{r['commit']}` | {m} | {r['status']} | {r['description']} |")
    lines.append("")

    return "\n".join(lines)


def export_dashboard_markdown(root, domain_filter=None):
    """Export dashboard as Markdown string."""
    lines = []
    lines.append("# Autoresearch Dashboard\n")
    lines.append("| Domain | Experiment | Metric | Runs | Kept | Best | Change | Status |")
    lines.append("|--------|-----------|--------|------|------|------|--------|--------|")

    for domain_dir in sorted(root.iterdir()):
        if not domain_dir.is_dir() or domain_dir.name.startswith("."):
            continue
        if domain_filter and domain_dir.name != domain_filter:
            continue
        for exp_dir in sorted(domain_dir.iterdir()):
            if not exp_dir.is_dir() or not (exp_dir / "config.cfg").exists():
                continue
            config = load_config(exp_dir)
            results = load_results(exp_dir)
            direction = config.get("metric_direction", "lower")
            stats = compute_stats(results, direction)
            best = f"`{stats['best']:.4f}`" if stats["best"] is not None else "—"
            pct = f"{stats['pct_change']:+.1f}%" if stats["pct_change"] is not None else "—"

            tsv = exp_dir / "results.tsv"
            status = "idle"
            if tsv.exists() and stats["total"] > 0:
                age_h = (time.time() - tsv.stat().st_mtime) / 3600
                status = "active" if age_h < 1 else "paused" if age_h < 24 else "done"

            lines.append(f"| {domain_dir.name} | {exp_dir.name} | {config.get('metric', '?')} | {stats['total']} | {stats['keeps']} | {best} | {pct} | {status} |")

    lines.append("")
    return "\n".join(lines)


# --- Main ---

def main():
    parser = argparse.ArgumentParser(description="autoresearch-agent results viewer")
    parser.add_argument("--experiment", help="Show one experiment: domain/name")
    parser.add_argument("--domain", help="Show all experiments in a domain")
    parser.add_argument("--dashboard", action="store_true", help="Cross-experiment dashboard")
    parser.add_argument("--format", choices=["terminal", "csv", "markdown"], default="terminal",
                        help="Output format (default: terminal)")
    parser.add_argument("--output", "-o", help="Write to file instead of stdout")
    parser.add_argument("--all", action="store_true", help="Show all experiments (alias for --dashboard)")
    args = parser.parse_args()

    root = find_autoresearch_root()
    if root is None:
        print("No .autoresearch/ found. Run setup_experiment.py first.")
        sys.exit(1)

    output_text = None

    # Single experiment
    if args.experiment:
        experiment_dir = root / args.experiment
        if not experiment_dir.exists():
            print(f"Experiment not found: {args.experiment}")
            sys.exit(1)

        if args.format == "csv":
            output_text = export_experiment_csv(experiment_dir, args.experiment)
        elif args.format == "markdown":
            output_text = export_experiment_markdown(experiment_dir, args.experiment)
        else:
            print_experiment(experiment_dir, args.experiment)
            return

    # Domain
    elif args.domain:
        domain_dir = root / args.domain
        if not domain_dir.exists():
            print(f"Domain not found: {args.domain}")
            sys.exit(1)
        for exp_dir in sorted(domain_dir.iterdir()):
            if exp_dir.is_dir() and (exp_dir / "config.cfg").exists():
                if args.format == "terminal":
                    print_experiment(exp_dir, f"{args.domain}/{exp_dir.name}")
                # For CSV/MD, fall through to dashboard with domain filter
        if args.format != "terminal":
            # Use dashboard export filtered to domain
            output_text = export_dashboard_csv(root, domain_filter=args.domain) if args.format == "csv" else export_dashboard_markdown(root, domain_filter=args.domain)
        else:
            return

    # Dashboard
    elif args.dashboard or args.all:
        if args.format == "csv":
            output_text = export_dashboard_csv(root)
        elif args.format == "markdown":
            output_text = export_dashboard_markdown(root)
        else:
            print_dashboard(root)
            return

    else:
        # Default: dashboard
        if args.format == "terminal":
            print_dashboard(root)
            return
        output_text = export_dashboard_csv(root) if args.format == "csv" else export_dashboard_markdown(root)

    # Write output
    if output_text:
        if args.output:
            Path(args.output).write_text(output_text)
            print(f"Written to {args.output}")
        else:
            print(output_text)


if __name__ == "__main__":
    main()

#!/usr/bin/env python3
"""
autoresearch-agent: Experiment Runner

Executes a single experiment iteration. The AI agent is the loop —
it calls this script repeatedly. The script handles evaluation,
metric parsing, keep/discard decisions, and git rollback on failure.

Usage:
    python scripts/run_experiment.py --experiment engineering/api-speed --single
    python scripts/run_experiment.py --experiment engineering/api-speed --dry-run
    python scripts/run_experiment.py --experiment engineering/api-speed --single --description "added caching"
"""

import argparse
import subprocess
import sys
import time
from datetime import datetime
from pathlib import Path


def find_autoresearch_root():
    """Find .autoresearch/ in project or user home."""
    project_root = Path(".").resolve() / ".autoresearch"
    if project_root.exists():
        return project_root
    user_root = Path.home() / ".autoresearch"
    if user_root.exists():
        return user_root
    return None


def load_config(experiment_dir):
    """Load config.cfg from experiment directory."""
    cfg_file = experiment_dir / "config.cfg"
    if not cfg_file.exists():
        print(f"  Error: no config.cfg in {experiment_dir}")
        sys.exit(1)
    config = {}
    for line in cfg_file.read_text().splitlines():
        if ":" in line:
            k, v = line.split(":", 1)
            config[k.strip()] = v.strip()
    return config


def run_git(args, cwd=None, timeout=30):
    """Run a git command safely (no shell injection). Returns (returncode, stdout, stderr)."""
    result = subprocess.run(
        ["git"] + args,
        capture_output=True, text=True,
        cwd=cwd, timeout=timeout
    )
    return result.returncode, result.stdout.strip(), result.stderr.strip()


def get_current_commit(path):
    """Get short hash of current HEAD."""
    _, commit, _ = run_git(["rev-parse", "--short", "HEAD"], cwd=path)
    return commit


def get_best_metric(experiment_dir, direction):
    """Read the best metric from results.tsv."""
    tsv = experiment_dir / "results.tsv"
    if not tsv.exists():
        return None
    lines = [l for l in tsv.read_text().splitlines()[1:] if "\tkeep\t" in l]
    if not lines:
        return None
    metrics = []
    for line in lines:
        parts = line.split("\t")
        try:
            if parts[1] != "N/A":
                metrics.append(float(parts[1]))
        except (ValueError, IndexError):
            continue
    if not metrics:
        return None
    return min(metrics) if direction == "lower" else max(metrics)


def run_evaluation(project_root, eval_cmd, time_budget_minutes, log_file):
    """Run evaluation with time limit. Output goes to log_file.

    Note: shell=True is intentional here — eval_cmd is user-provided and
    may contain pipes, redirects, or chained commands.
    """
    hard_limit = time_budget_minutes * 60 * 2.5
    t0 = time.time()
    try:
        with open(log_file, "w") as lf:
            result = subprocess.run(
                eval_cmd, shell=True,
                stdout=lf, stderr=subprocess.STDOUT,
                cwd=str(project_root),
                timeout=hard_limit
            )
        elapsed = time.time() - t0
        return result.returncode, elapsed
    except subprocess.TimeoutExpired:
        elapsed = time.time() - t0
        return -1, elapsed


def extract_metric(log_file, metric_grep):
    """Extract metric value from log file."""
    log_path = Path(log_file)
    if not log_path.exists():
        return None
    for line in reversed(log_path.read_text().splitlines()):
        stripped = line.strip()
        if stripped.startswith(metric_grep.lstrip("^")):
            try:
                return float(stripped.split(":")[-1].strip())
            except ValueError:
                continue
    return None


def is_improvement(new_val, old_val, direction):
    """Check if new result is better than old."""
    if old_val is None:
        return True
    if direction == "lower":
        return new_val < old_val
    return new_val > old_val


def log_result(experiment_dir, commit, metric_val, status, description):
    """Append result to results.tsv."""
    tsv = experiment_dir / "results.tsv"
    metric_str = f"{metric_val:.6f}" if metric_val is not None else "N/A"
    with open(tsv, "a") as f:
        f.write(f"{commit}\t{metric_str}\t{status}\t{description}\n")


def get_experiment_count(experiment_dir):
    """Count experiments run so far."""
    tsv = experiment_dir / "results.tsv"
    if not tsv.exists():
        return 0
    return max(0, len(tsv.read_text().splitlines()) - 1)


def get_description_from_diff(project_root):
    """Auto-generate a description from git diff --stat HEAD~1."""
    code, diff_stat, _ = run_git(["diff", "--stat", "HEAD~1"], cwd=str(project_root))
    if code == 0 and diff_stat:
        return diff_stat.split("\n")[0][:50]
    return "experiment"


def read_last_lines(filepath, n=5):
    """Read last n lines of a file (replaces tail shell command)."""
    path = Path(filepath)
    if not path.exists():
        return ""
    lines = path.read_text().splitlines()
    return "\n".join(lines[-n:])


def run_single(project_root, experiment_dir, config, exp_num, dry_run=False, description=None):
    """Run one experiment iteration."""
    direction = config.get("metric_direction", "lower")
    metric_grep = config.get("metric_grep", "^metric:")
    eval_cmd = config.get("evaluate_cmd", "python evaluate.py")
    time_budget = int(config.get("time_budget_minutes", 5))
    metric_name = config.get("metric", "metric")
    log_file = str(experiment_dir / "run.log")

    best = get_best_metric(experiment_dir, direction)
    ts = datetime.now().strftime("%H:%M:%S")

    print(f"\n[{ts}] Experiment #{exp_num}")
    print(f"  Best {metric_name}: {best}")

    if dry_run:
        print("  [DRY RUN] Would run evaluation and check metric")
        return "dry_run"

    # Auto-generate description if not provided
    if not description:
        description = get_description_from_diff(str(project_root))

    # Run evaluation
    print(f"  Running: {eval_cmd} (budget: {time_budget}m)")
    ret_code, elapsed = run_evaluation(project_root, eval_cmd, time_budget, log_file)

    commit = get_current_commit(str(project_root))

    # Timeout
    if ret_code == -1:
        print(f"  TIMEOUT after {elapsed:.0f}s — discarding")
        run_git(["checkout", "--", "."], cwd=str(project_root))
        run_git(["reset", "--hard", "HEAD~1"], cwd=str(project_root))
        log_result(experiment_dir, commit, None, "crash", f"timeout_{elapsed:.0f}s")
        return "crash"

    # Crash
    if ret_code != 0:
        tail = read_last_lines(log_file, 5)
        print(f"  CRASH (exit {ret_code}) after {elapsed:.0f}s")
        print(f"  Last output: {tail[:200]}")
        run_git(["reset", "--hard", "HEAD~1"], cwd=str(project_root))
        log_result(experiment_dir, commit, None, "crash", f"exit_{ret_code}")
        return "crash"

    # Extract metric
    metric_val = extract_metric(log_file, metric_grep)
    if metric_val is None:
        print(f"  Could not parse {metric_name} from run.log")
        run_git(["reset", "--hard", "HEAD~1"], cwd=str(project_root))
        log_result(experiment_dir, commit, None, "crash", "metric_parse_failed")
        return "crash"

    delta = ""
    if best is not None:
        diff = metric_val - best
        delta = f" (delta {diff:+.4f})"

    print(f"  {metric_name}: {metric_val:.6f}{delta} in {elapsed:.0f}s")

    # Keep or discard
    if is_improvement(metric_val, best, direction):
        print(f"  KEEP — improvement")
        log_result(experiment_dir, commit, metric_val, "keep", description)
        return "keep"
    else:
        print(f"  DISCARD — no improvement")
        run_git(["reset", "--hard", "HEAD~1"], cwd=str(project_root))
        best_str = f"{best:.4f}" if best is not None else "?"
        log_result(experiment_dir, commit, metric_val, "discard",
                   f"no_improvement_{metric_val:.4f}_vs_{best_str}")
        return "discard"


def main():
    parser = argparse.ArgumentParser(description="autoresearch-agent runner")
    parser.add_argument("--experiment", help="Experiment path: domain/name (e.g. engineering/api-speed)")
    parser.add_argument("--single", action="store_true", help="Run one experiment iteration")
    parser.add_argument("--dry-run", action="store_true", help="Show what would happen")
    parser.add_argument("--description", help="Description of the change (auto-generated from git diff if omitted)")
    parser.add_argument("--path", default=".", help="Project root")
    args = parser.parse_args()

    project_root = Path(args.path).resolve()
    root = find_autoresearch_root()

    if root is None:
        print("No .autoresearch/ found. Run setup_experiment.py first.")
        sys.exit(1)

    if not args.experiment:
        print("Specify --experiment domain/name")
        sys.exit(1)

    experiment_dir = root / args.experiment
    if not experiment_dir.exists():
        print(f"Experiment not found: {experiment_dir}")
        print("Run: python scripts/setup_experiment.py --list")
        sys.exit(1)

    config = load_config(experiment_dir)

    print(f"\n  autoresearch-agent")
    print(f"  Experiment: {args.experiment}")
    print(f"  Target: {config.get('target', '?')}")
    print(f"  Metric: {config.get('metric', '?')} ({config.get('metric_direction', '?')} is better)")
    print(f"  Budget: {config.get('time_budget_minutes', '?')} min/experiment")
    print(f"  Mode: {'dry-run' if args.dry_run else 'single'}")

    exp_num = get_experiment_count(experiment_dir) + 1
    run_single(project_root, experiment_dir, config, exp_num, args.dry_run, args.description)


if __name__ == "__main__":
    main()

#!/usr/bin/env python3
"""
autoresearch-agent: Setup Experiment

Initialize a new experiment with domain, target, evaluator, and git branch.
Creates the .autoresearch/{domain}/{name}/ directory structure.

Usage:
    python scripts/setup_experiment.py --domain engineering --name api-speed \
        --target src/api/search.py --eval "pytest bench.py" \
        --metric p50_ms --direction lower

    python scripts/setup_experiment.py --domain marketing --name medium-ctr \
        --target content/titles.md --eval "python evaluate.py" \
        --metric ctr_score --direction higher --evaluator llm_judge_content

    python scripts/setup_experiment.py --list          # List all experiments
    python scripts/setup_experiment.py --list-evaluators  # List available evaluators
"""

import argparse
import shutil
import subprocess
import sys
from datetime import datetime
from pathlib import Path

DOMAINS = ["engineering", "marketing", "content", "prompts", "custom"]

EVALUATOR_DIR = Path(__file__).parent.parent / "evaluators"

DEFAULT_CONFIG = """# autoresearch global config
default_time_budget_minutes: 5
default_scope: project
dashboard_format: markdown
"""

GITIGNORE_CONTENT = """# autoresearch — experiment logs are local state
**/results.tsv
**/run.log
**/run.*.log
config.yaml
"""


def run_cmd(cmd, cwd=None, timeout=None):
    """Run shell command, return (returncode, stdout, stderr)."""
    result = subprocess.run(
        cmd, shell=True, capture_output=True, text=True,
        cwd=cwd, timeout=timeout
    )
    return result.returncode, result.stdout.strip(), result.stderr.strip()


def get_autoresearch_root(scope, project_root=None):
    """Get the .autoresearch root directory based on scope."""
    if scope == "user":
        return Path.home() / ".autoresearch"
    return Path(project_root or ".") / ".autoresearch"


def init_root(root):
    """Initialize .autoresearch root if it doesn't exist."""
    created = False
    if not root.exists():
        root.mkdir(parents=True)
        created = True
        print(f"  Created {root}/")

    config_file = root / "config.yaml"
    if not config_file.exists():
        config_file.write_text(DEFAULT_CONFIG)
        print(f"  Created {config_file}")

    gitignore = root / ".gitignore"
    if not gitignore.exists():
        gitignore.write_text(GITIGNORE_CONTENT)
        print(f"  Created {gitignore}")

    return created


def create_program_md(experiment_dir, domain, name, target, metric, direction, constraints=""):
    """Generate a program.md template for the experiment."""
    direction_word = "Minimize" if direction == "lower" else "Maximize"
    content = f"""# autoresearch — {name}

## Goal
{direction_word} `{metric}` on `{target}`. {"Lower" if direction == "lower" else "Higher"} is better.

## What the Agent Can Change
- Only `{target}` — this is the single file being optimized.
- Everything inside that file is fair game unless constrained below.

## What the Agent Cannot Change
- The evaluation script (`evaluate.py` or the eval command). It is read-only.
- Dependencies — do not add new packages or imports that aren't already available.
- Any other files in the project unless explicitly noted here.
{f"- Additional constraints: {constraints}" if constraints else ""}

## Strategy
1. First run: establish baseline. Do not change anything.
2. Profile/analyze the current state — understand why the metric is what it is.
3. Try the most obvious improvement first (low-hanging fruit).
4. If that works, push further in the same direction.
5. If stuck, try something orthogonal or radical.
6. Read the git log of previous experiments. Don't repeat failed approaches.

## Simplicity Rule
A small improvement that adds ugly complexity is NOT worth it.
Equal performance with simpler code IS worth it.
Removing code that gets same results is the best outcome.

## Stop When
You don't stop. The human will interrupt you when they're satisfied.
If no improvement in 20+ consecutive runs, change strategy drastically.
"""
    (experiment_dir / "program.md").write_text(content)


def create_config(experiment_dir, target, eval_cmd, metric, direction, time_budget):
    """Write experiment config."""
    content = f"""target: {target}
evaluate_cmd: {eval_cmd}
metric: {metric}
metric_direction: {direction}
metric_grep: ^{metric}:
time_budget_minutes: {time_budget}
created: {datetime.now().strftime('%Y-%m-%d %H:%M')}
"""
    (experiment_dir / "config.cfg").write_text(content)


def init_results_tsv(experiment_dir):
    """Create results.tsv with header."""
    tsv = experiment_dir / "results.tsv"
    if tsv.exists():
        print(f"  results.tsv already exists ({tsv.stat().st_size} bytes)")
        return
    tsv.write_text("commit\tmetric\tstatus\tdescription\n")
    print("  Created results.tsv")


def copy_evaluator(experiment_dir, evaluator_name):
    """Copy a built-in evaluator to the experiment directory."""
    source = EVALUATOR_DIR / f"{evaluator_name}.py"
    if not source.exists():
        print(f"  Warning: evaluator '{evaluator_name}' not found in {EVALUATOR_DIR}")
        print(f"  Available: {', '.join(f.stem for f in EVALUATOR_DIR.glob('*.py'))}")
        return False
    dest = experiment_dir / "evaluate.py"
    shutil.copy2(source, dest)
    print(f"  Copied evaluator: {evaluator_name}.py -> evaluate.py")
    return True


def create_branch(path, domain, name):
    """Create and checkout the experiment branch."""
    branch = f"autoresearch/{domain}/{name}"
    result = subprocess.run(
        ["git", "checkout", "-b", branch],
        cwd=path, capture_output=True, text=True
    )
    if result.returncode != 0:
        if "already exists" in result.stderr:
            print(f"  Branch '{branch}' already exists. Checking out...")
            subprocess.run(
                ["git", "checkout", branch],
                cwd=path, capture_output=True, text=True
            )
            return branch
        print(f"  Warning: could not create branch: {result.stderr}")
        return None
    print(f"  Created branch: {branch}")
    return branch


def list_experiments(root):
    """List all experiments across all domains."""
    if not root.exists():
        print("No experiments found. Run setup to create your first experiment.")
        return

    experiments = []
    for domain_dir in sorted(root.iterdir()):
        if not domain_dir.is_dir() or domain_dir.name.startswith("."):
            continue
        for exp_dir in sorted(domain_dir.iterdir()):
            if not exp_dir.is_dir():
                continue
            cfg_file = exp_dir / "config.cfg"
            if not cfg_file.exists():
                continue
            config = {}
            for line in cfg_file.read_text().splitlines():
                if ":" in line:
                    k, v = line.split(":", 1)
                    config[k.strip()] = v.strip()

            # Count results
            tsv = exp_dir / "results.tsv"
            runs = 0
            if tsv.exists():
                runs = max(0, len(tsv.read_text().splitlines()) - 1)

            experiments.append({
                "domain": domain_dir.name,
                "name": exp_dir.name,
                "target": config.get("target", "?"),
                "metric": config.get("metric", "?"),
                "runs": runs,
            })

    if not experiments:
        print("No experiments found.")
        return

    print(f"\n{'DOMAIN':<15} {'EXPERIMENT':<25} {'TARGET':<30} {'METRIC':<15} {'RUNS':>5}")
    print("-" * 95)
    for e in experiments:
        print(f"{e['domain']:<15} {e['name']:<25} {e['target']:<30} {e['metric']:<15} {e['runs']:>5}")
    print(f"\nTotal: {len(experiments)} experiments")


def list_evaluators():
    """List available built-in evaluators."""
    if not EVALUATOR_DIR.exists():
        print("No evaluators directory found.")
        return

    print(f"\nAvailable evaluators ({EVALUATOR_DIR}):\n")
    for f in sorted(EVALUATOR_DIR.glob("*.py")):
        # Read first docstring line
        desc = ""
        for line in f.read_text().splitlines():
            stripped = line.strip()
            if stripped.startswith('"""') or stripped.startswith("'''"):
                quote = stripped[:3]
                # Single-line docstring: """Description."""
                after_quote = stripped[3:]
                if after_quote and after_quote.rstrip(quote[0]).strip():
                    desc = after_quote.rstrip('"').rstrip("'").strip()
                    break
                continue
            if stripped and not line.startswith("#!"):
                desc = stripped.strip('"').strip("'")
                break
        print(f"  {f.stem:<25} {desc}")


def main():
    parser = argparse.ArgumentParser(description="autoresearch-agent setup")
    parser.add_argument("--domain", choices=DOMAINS, help="Experiment domain")
    parser.add_argument("--name", help="Experiment name (e.g. api-speed, medium-ctr)")
    parser.add_argument("--target", help="Target file to optimize")
    parser.add_argument("--eval", dest="eval_cmd", help="Evaluation command")
    parser.add_argument("--metric", help="Metric name (must appear in eval output as 'name: value')")
    parser.add_argument("--direction", choices=["lower", "higher"], default="lower",
                        help="Is lower or higher better?")
    parser.add_argument("--time-budget", type=int, default=5, help="Minutes per experiment (default: 5)")
    parser.add_argument("--evaluator", help="Built-in evaluator to copy (e.g. benchmark_speed)")
    parser.add_argument("--scope", choices=["project", "user"], default="project",
                        help="Where to store experiments: project (./) or user (~/)")
    parser.add_argument("--constraints", default="", help="Additional constraints for program.md")
    parser.add_argument("--path", default=".", help="Project root path")
    parser.add_argument("--skip-branch", action="store_true", help="Don't create git branch")
    parser.add_argument("--list", action="store_true", help="List all experiments")
    parser.add_argument("--list-evaluators", action="store_true", help="List available evaluators")
    args = parser.parse_args()

    project_root = Path(args.path).resolve()

    # List mode
    if args.list:
        root = get_autoresearch_root("project", project_root)
        list_experiments(root)
        user_root = get_autoresearch_root("user")
        if user_root.exists() and user_root != root:
            print(f"\n--- User-level experiments ({user_root}) ---")
            list_experiments(user_root)
        return

    if args.list_evaluators:
        list_evaluators()
        return

    # Validate required args for setup
    if not all([args.domain, args.name, args.target, args.eval_cmd, args.metric]):
        parser.error("Required: --domain, --name, --target, --eval, --metric")

    root = get_autoresearch_root(args.scope, project_root)

    print(f"\n  autoresearch-agent setup")
    print(f"  Project: {project_root}")
    print(f"  Scope: {args.scope}")
    print(f"  Domain: {args.domain}")
    print(f"  Experiment: {args.name}")
    print(f"  Time: {datetime.now().strftime('%Y-%m-%d %H:%M')}\n")

    # Check git
    result = subprocess.run(
        ["git", "rev-parse", "--is-inside-work-tree"],
        cwd=str(project_root), capture_output=True, text=True
    )
    code = result.returncode
    if code != 0:
        print("  Error: not a git repository. Run: git init && git add . && git commit -m 'initial'")
        sys.exit(1)
    print("  Git repository found")

    # Check target file
    target_path = project_root / args.target
    if not target_path.exists():
        print(f"  Error: target file not found: {args.target}")
        sys.exit(1)
    print(f"  Target file found: {args.target}")

    # Init root
    init_root(root)

    # Create experiment directory
    experiment_dir = root / args.domain / args.name
    if experiment_dir.exists():
        print(f"  Warning: experiment '{args.domain}/{args.name}' already exists.")
        print(f"  Use --name with a different name, or delete {experiment_dir}")
        sys.exit(1)
    experiment_dir.mkdir(parents=True)
    print(f"  Created {experiment_dir}/")

    # Create files
    create_program_md(experiment_dir, args.domain, args.name,
                      args.target, args.metric, args.direction, args.constraints)
    print("  Created program.md")

    create_config(experiment_dir, args.target, args.eval_cmd,
                  args.metric, args.direction, args.time_budget)
    print("  Created config.cfg")

    init_results_tsv(experiment_dir)

    # Copy evaluator if specified
    if args.evaluator:
        copy_evaluator(experiment_dir, args.evaluator)

    # Create git branch
    if not args.skip_branch:
        create_branch(str(project_root), args.domain, args.name)

    # Test evaluation command
    print(f"\n  Testing evaluation: {args.eval_cmd}")
    code, out, err = run_cmd(args.eval_cmd, cwd=str(project_root), timeout=60)
    if code != 0:
        print(f"  Warning: eval command failed (exit {code})")
        if err:
            print(f"  stderr: {err[:200]}")
        print("  Fix the eval command before running the experiment loop.")
    else:
        # Check metric is parseable
        full_output = out + "\n" + err
        metric_found = False
        for line in full_output.splitlines():
            if line.strip().startswith(f"{args.metric}:"):
                metric_found = True
                print(f"  Eval works. Baseline: {line.strip()}")
                break
        if not metric_found:
            print(f"  Warning: eval ran but '{args.metric}:' not found in output.")
            print(f"  Make sure your eval command outputs: {args.metric}: <value>")

    # Summary
    print(f"\n  Setup complete!")
    print(f"  Experiment: {args.domain}/{args.name}")
    print(f"  Target: {args.target}")
    print(f"  Metric: {args.metric} ({args.direction} is better)")
    print(f"  Budget: {args.time_budget} min/experiment")
    if not args.skip_branch:
        print(f"  Branch: autoresearch/{args.domain}/{args.name}")
    print(f"\n  To start:")
    print(f"  python scripts/run_experiment.py --experiment {args.domain}/{args.name} --single")


if __name__ == "__main__":
    main()

Related skills

Agent BrowserGive their coding agent reliable, high-fidelity control over any website or Electron desktop app.577k39.1k

Lark ApprovalLet their AI coding agent create, read, update, and approve items in Lark (Feishu) approval workflows without leaving the coding environment.471k

Lark EventHandle Feishu/Lark bot events, webhooks, and subscription callbacks in agent-driven backend code.471k

Lark Workflow Meeting SummaryAutomatically generate structured meeting summaries and action items from Lark/Feishu calls without manual note-taking.470k

Lark Workflow Standup ReportAutomatically generate and post daily standup reports in Lark/Feishu from their workflow and activity data.470k

Lark Vc AgentGive their coding agent the ability to read, create, and update documents, tasks, and wiki pages inside Feishu (Lark).415k

FAQ

What metrics can autoresearch-agent optimize?

autoresearch-agent optimizes code speed (p50_ms), webpack bundle size, test reliability, and Docker build time using setup_experiment.py with domain-specific benchmark evaluators and configurable lower-is-better directions.

How fast does autoresearch-agent run experiments?

autoresearch-agent completes roughly 5 minutes per experiment, about 12 experiments per hour, and around 100 overnight for engineering domains like API speed and bundle size reduction.

Is Autoresearch Agent safe to install?

skills.sh reports 1 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

Automation & Workflowstestingdevopsbackend

About

Autoresearch Agent by the numbers

Add your badge

How do you automate performance experiments on code?

Who is it for?

When should I use this skill?

What you get

By the numbers

Files

Autoresearch Agent

Slash Commands

When This Skill Activates

Setup

First Time — Create the Experiment

What Setup Creates

Domains

If program.md Already Exists

Agent Protocol

Before Starting

Each Iteration

What the Script Handles (you don't)

Starting an Experiment

Strategy Escalation

Self-Improvement

Stopping

Rules

Evaluators

Free Evaluators (no API cost)

LLM Judge Evaluators (uses your subscription)

Custom Evaluators

Viewing Results

Dashboard Output

Export Formats

Proactive Triggers

Installation

One-liner (any tool)

Multi-tool install

OpenClaw

Related Skills

Experiment Domains Guide

Domain: Engineering

Code Speed Optimization

Bundle Size Reduction

Test Pass Rate

Docker Build Speed

Memory Optimization

ML Training (Karpathy-style)

Domain: Marketing

Medium Article Headlines

Social Media Copy

Email Subject Lines

Ad Copy

Domain: Content

Article Structure & Readability

SEO Descriptions

Domain: Prompts

System Prompt Optimization

Agent Skill Optimization

Choosing Your Domain

program.md Templates

ML Training (Karpathy-style)

Prompt Engineering

Code Performance

Agent Skill Optimization

Related skills

FAQ

What metrics can autoresearch-agent optimize?

How fast does autoresearch-agent run experiments?

Is Autoresearch Agent safe to install?

This week in AI coding

If `program.md` Already Exists