
Autoresearch Agent
Spin up metric-driven experiment loops that benchmark and iteratively improve code speed, bundle size, test reliability, or Docker build time.
Overview
Autoresearch Agent is an agent skill most often used in Ship (also Build, Operate) that runs metric-driven engineering experiments to optimize speed, size, tests, and container builds.
Install
npx skills add https://github.com/alirezarezvani/claude-skills --skill autoresearch-agent
What is this skill?
- The `setup_experiment.py` CLI scaffolds an experiment per domain with a target file, eval command, metric, and optimization direction (lower or higher).
- Engineering presets: API speed, bundle size, flaky-test pass rate, and Docker build speed with pluggable evaluators.
- Documented throughput: ~5 min per experiment, ~12/hour, ~100 overnight for unattended sweeps.
- Agents optimize algorithms, caching, I/O, webpack config, Dockerfile layers, and parser logic against your chosen metric.
- Free to run locally: it uses your existing pytest, npm build, and custom evaluate.py scripts.
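The throughput figures above follow from simple arithmetic, sketched below. The function name `experiments_in` and the 8-hour "overnight" window are assumptions made for illustration; the listing itself only quotes the approximate totals.

```python
# Back-of-the-envelope check of the quoted throughput figures.
MINUTES_PER_EXPERIMENT = 5  # "~5 min per experiment" from the listing


def experiments_in(hours: float, minutes_per_experiment: float = MINUTES_PER_EXPERIMENT) -> int:
    """Whole experiments that fit in the given number of hours."""
    return int(hours * 60 // minutes_per_experiment)


per_hour = experiments_in(1)   # 60 / 5 = 12
overnight = experiments_in(8)  # assumed 8-hour night: 480 / 5 = 96, i.e. roughly 100
```

Slower eval commands (a full Docker build, for instance) would lower these numbers proportionally.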
Adoption & trust: 531 installs on skills.sh; 17.5k GitHub stars; 1/3 security scanners passed (skills.sh audits).
What problem does it solve?
You know something is slow, bloated, or flaky but lack a structured loop to measure changes and keep only improvements.
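The "keep only improvements" idea can be sketched as a direction-aware comparison. This is a minimal illustration, not the skill's actual implementation: `is_improvement` and `run_loop` are hypothetical names, and the candidate metric values are made up for the example.

```python
def is_improvement(baseline: float, candidate: float, direction: str) -> bool:
    """Return True when candidate beats baseline for the given direction."""
    if direction == "lower":
        return candidate < baseline
    if direction == "higher":
        return candidate > baseline
    raise ValueError(f"unknown direction: {direction!r}")


def run_loop(baseline: float, candidates, direction: str):
    """Walk (name, metric) pairs and keep only those that beat the best so far."""
    best = baseline
    kept = []
    for name, value in candidates:
        if is_improvement(best, value, direction):
            best = value
            kept.append(name)
        # Otherwise the change would be discarded and the previous best stays.
    return best, kept


# Hypothetical p50_ms results for three attempted changes (lower is better).
best, kept = run_loop(120.0, [("a", 130.0), ("b", 95.0), ("c", 99.0)], "lower")
```

Here only change `b` is kept, because `a` regresses and `c` does not beat the new best of 95.0.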
Who is it for?
Solo builders with existing pytest, npm build, or Docker pipelines who want overnight or hourly optimization sweeps on a single target.
Skip if: Greenfield features with no eval command, purely qualitative UX work, or teams that cannot allow repeated benchmark or build runs on their machine.
When should I use this skill?
You have a measurable bottleneck (latency, bundle, tests, or image build) and want an agent to loop on a target file with a defined eval and metric.
What do I get? / Deliverables
You get a configured autoresearch experiment with clear metrics and eval commands so your agent can iterate until benchmarks or pass rates move in the right direction.
- Configured experiment under .autoresearch with metric and direction
- Iterated code or config changes tied to benchmark results
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Canonical shelf is Ship because the skill’s core loop is evaluate → change target → re-run benchmarks and tests, which maps to performance and quality gates before release. Perf is the primary outcome (p50 latency, bundle bytes, build seconds); flaky-test pass-rate runs still sit on the perf/quality path rather than greenfield implementation.
Where it fits
Target `src/api/search.py` and chase lower p50_ms before adding caching layers to production routes.
Run pass_rate evaluators on flaky parser tests until CI stabilizes pre-release.
Iterate webpack config against size_bytes after `npm run build` to hit a mobile budget.
Re-open a docker-build experiment when image build_seconds regress after dependency upgrades.
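For the latency case, a p50 metric is just the median of repeated timings. The sketch below shows one way to compute it with the standard library; the helper name `p50_ms` mirrors the metric name used above, but the skill's own `benchmark_speed` evaluator may measure it differently.

```python
import statistics
import time


def p50_ms(fn, runs: int = 21) -> float:
    """Call fn repeatedly and return the median wall-clock time in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Stand-in workload; a real run would call the code path under test.
latency = p50_ms(lambda: sum(range(10_000)), runs=5)
```

Using the median rather than the mean keeps a single slow outlier run from dominating the metric.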
How it compares
Use instead of one-off “make it faster” chat prompts when you need reproducible metrics and direction-aware optimization loops.
Common Questions / FAQ
Who is autoresearch-agent for?
Indie and solo developers shipping APIs, web bundles, or containerized apps who want an agent-guided experiment harness tied to real benchmarks and test pass rates.
When should I use autoresearch-agent?
During Ship perf and testing work to cut API p50, shrink dist assets, raise flaky-test pass rate, or shorten Docker builds; during Build when tuning hot paths; during Operate when regression hunting after deploy.
Is autoresearch-agent safe to install?
It runs shell eval commands and builds in your repo, so review scripts and targets before unattended runs, and check the Security Audits panel on this Prism page before trusting third-party skill content.
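One precaution for unattended runs is to execute each eval command with a hard timeout and without a shell. The sketch below illustrates that idea only; `run_eval` is a hypothetical helper, not part of the skill, and the 300-second default is an assumption based on the roughly five-minute experiment time quoted earlier.

```python
import subprocess
import sys


def run_eval(command: list[str], timeout_s: float = 300.0) -> tuple[int, str]:
    """Run an eval command as an argument list (no shell) with a hard timeout."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # Treat a hung eval as a failed experiment instead of blocking the loop.
        return -1, ""
    return result.returncode, result.stdout


# Stand-in eval that prints a metric line; a real eval would be pytest or a build.
code, out = run_eval([sys.executable, "-c", "print('p50_ms: 12.5')"])
```

Passing the command as a list avoids shell interpolation of anything the agent wrote into a target file name or argument.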
SKILL.md - Autoresearch Agent
# Experiment Domains Guide

## Domain: Engineering

### Code Speed Optimization

```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name api-speed \
  --target src/api/search.py \
  --eval "python -m pytest tests/bench_search.py --tb=no -q" \
  --metric p50_ms \
  --direction lower \
  --evaluator benchmark_speed
```

**What the agent optimizes:** Algorithm, data structures, caching, query patterns, I/O.

**Cost:** Free, just runs benchmarks.

**Speed:** ~5 min/experiment, ~12/hour, ~100 overnight.

### Bundle Size Reduction

```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name bundle-size \
  --target webpack.config.js \
  --eval "npm run build && python .autoresearch/engineering/bundle-size/evaluate.py" \
  --metric size_bytes \
  --direction lower \
  --evaluator benchmark_size
```

Edit `evaluate.py` to set `TARGET_FILE = "dist/main.js"` and add `BUILD_CMD = "npm run build"`.

### Test Pass Rate

```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name fix-flaky-tests \
  --target src/utils/parser.py \
  --eval "python .autoresearch/engineering/fix-flaky-tests/evaluate.py" \
  --metric pass_rate \
  --direction higher \
  --evaluator test_pass_rate
```

### Docker Build Speed

```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name docker-build \
  --target Dockerfile \
  --eval "python .autoresearch/engineering/docker-build/evaluate.py" \
  --metric build_seconds \
  --direction lower \
  --evaluator build_speed
```

### Memory Optimization

```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name memory-usage \
  --target src/processor.py \
  --eval "python .autoresearch/engineering/memory-usage/evaluate.py" \
  --metric peak_mb \
  --direction lower \
  --evaluator memory_usage
```

### ML Training (Karpathy-style)

Requires NVIDIA GPU. See [autoresearch](https://github.com/karpathy/autoresearch).
```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name ml-training \
  --target train.py \
  --eval "uv run train.py" \
  --metric val_bpb \
  --direction lower \
  --time-budget 5
```

---

## Domain: Marketing

### Medium Article Headlines

```bash
python scripts/setup_experiment.py \
  --domain marketing \
  --name medium-ctr \
  --target content/titles.md \
  --eval "python .autoresearch/marketing/medium-ctr/evaluate.py" \
  --metric ctr_score \
  --direction higher \
  --evaluator llm_judge_content
```

Edit `evaluate.py`: set `TARGET_FILE = "content/titles.md"` and `CLI_TOOL = "claude"`.

**What the agent optimizes:** Title phrasing, curiosity gaps, specificity, emotional triggers.

**Cost:** Uses your CLI subscription (Claude Max = unlimited).

**Speed:** ~2 min/experiment, ~30/hour.

### Social Media Copy

```bash
python scripts/setup_experiment.py \
  --domain marketing \
  --name twitter-engagement \
  --target social/tweets.md \
  --eval "python .autoresearch/marketing/twitter-engagement/evaluate.py" \
  --metric engagement_score \
  --direction higher \
  --evaluator llm_judge_copy
```

Edit `evaluate.py`: set `PLATFORM = "twitter"` (or linkedin, instagram).

### Email Subject Lines

```bash
python scripts/setup_experiment.py \
  --domain marketing \
  --name email-open-rate \
  --target emails/subjects.md \
  --eval "python .autoresearch/marketing/email-open-rate/evaluate.py" \
  --metric engagement_score \
  --direction higher \
  --evaluator llm_judge_copy
```

Edit `evaluate.py`: set `PLATFORM = "email"`.

### Ad Copy

```bash
python scripts/setup_experiment.py \
  --domain marketing \
  --name ad-copy-q2 \
  --target ads/google-search.md \
  --eval "python .autoresearch/marketing/ad-copy-q2/evaluate.py" \
  --metric engagement_score \
  --direction higher \
  --evaluator llm_judge_copy
```

Edit `evaluate.py`: set `PLATFORM = "ad"`.

---

## Domain: Content

### Article Structure & Readability

```bash
python scripts/setup_experiment.py \
  --domain content \
  --name articl