
Experiment Design
Structure ML and agentic research runs through a four-stage baseline-to-publication experiment ladder with explicit completion criteria.
Overview
experiment-design is an agent skill that stages ML experiments from baseline through tuning to creative evaluation, with explicit completion criteria at each stage. It is most often used in Validate, and also fits Build (integrations) and Grow (analytics).
Install
npx skills add https://github.com/lingzhi227/agent-research-skills --skill experiment-design
What is this skill?
- 4-stage progressive framework: Initial Implementation → Baseline Tuning → Creative Research → Ablation Studies
- Stage 1 completion criteria: decreasing loss, reasonable validation metrics, end-to-end execution without errors
- Stage 2 requires hyperparameter tuning, ≥2 datasets, and 3 seeds for statistical significance
- Stage 3 mandates novel improvements, 3+ datasets, 3+ baselines, ablations, and publication-quality figures
- Prompts extracted from AI-Scientist-v2 agent_manager.py and AI-Researcher exp_analyser.py
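The numeric thresholds listed above can be checked mechanically before a stage is declared done. The following is a minimal sketch, not code from the skill: `ExperimentPlan`, `STAGE_MINIMUMS`, and `unmet_minimums` are hypothetical names, and only the counts (2 datasets and 3 seeds for Stage 2; 3 datasets, 3 baselines, and an ablation for Stage 3) come from the list.

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentPlan:
    """Hypothetical record of what an experiment run covers."""
    datasets: list = field(default_factory=list)
    baselines: list = field(default_factory=list)
    seeds: list = field(default_factory=list)
    has_ablation: bool = False


# Minimum counts per stage. A value of 0 means the list above states no
# explicit number for that item at that stage.
STAGE_MINIMUMS = {
    2: {"datasets": 2, "seeds": 3, "baselines": 0, "ablation": False},
    3: {"datasets": 3, "seeds": 0, "baselines": 3, "ablation": True},
}


def unmet_minimums(plan, stage):
    """Return a list of unmet requirements; an empty list means all are met."""
    req = STAGE_MINIMUMS[stage]
    counts = {
        "datasets": len(set(plan.datasets)),
        "seeds": len(set(plan.seeds)),
        "baselines": len(set(plan.baselines)),
    }
    problems = [
        f"need at least {req[name]} {name}, have {have}"
        for name, have in counts.items()
        if have < req[name]
    ]
    if req["ablation"] and not plan.has_ablation:
        problems.append("ablation of key design choices is missing")
    return problems
```

Returning the list of problems rather than a bare boolean lets an agent report exactly which requirement is blocking the next stage.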
Adoption & trust: 741 installs on skills.sh; 114 GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You have a research idea but no disciplined sequence from first runnable baseline to statistically credible, ablation-backed results.
Who is it for?
Indie researchers and agent-scientist stacks (AI-Scientist-style) who need numbered gates before expensive Stage 3 creativity.
Skip if: One-off script fixes or production feature work with no training loops, datasets, or baseline comparisons.
When should I use this skill?
Starting or advancing ML or agentic research experiments that need staged gates from baseline implementation through tuning to creative evaluation.
What do I get? / Deliverables
After the skill runs, each experiment stage has defined completion checks so the agent does not jump to novel claims before baselines and multi-seed tuning pass.
- Stage-gated experiment plan aligned to Initial Implementation, Baseline Tuning, and Creative Research criteria
- Documented metrics, seeds, ablations, and figure requirements for Stage 3
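As an illustration of what a completion check might look like, here is a minimal sketch of the three Stage 1 criteria (decreasing loss, validation metrics better than random, end-to-end execution). The function names, the windowed comparison used to decide that the loss is decreasing, and the use of accuracy against a chance level are all assumptions; the skill itself states the criteria only in prose.

```python
from statistics import mean


def loss_is_decreasing(losses, window=3):
    """Compare the mean of the last `window` losses with the first `window`.

    The windowed comparison is an assumed heuristic to smooth out noise.
    """
    if len(losses) < 2 * window:
        return False  # too few points to judge a trend
    return mean(losses[-window:]) < mean(losses[:window])


def stage1_complete(losses, val_accuracy, chance_accuracy, ran_without_errors):
    """Return (passed, reasons) for the three Stage 1 completion criteria."""
    reasons = []
    if not loss_is_decreasing(losses):
        reasons.append("training loss is not decreasing")
    if val_accuracy <= chance_accuracy:
        reasons.append("validation metric is no better than chance")
    if not ran_without_errors:
        reasons.append("run did not finish end-to-end")
    return (not reasons, reasons)
```

An agent loop could call `stage1_complete` after each run and move on to Stage 2 only when the first element of the result is `True`.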
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
The Prototype subphase of Validate is the first serious proof point: a working baseline and tuned comparisons must exist before any creative claims. It matches the Stage 1–2 goals: an end-to-end runnable baseline, convergence checks, and multi-seed comparison to baselines.
Where it fits
Lock Stage 1 end-to-end training on the smallest dataset before expanding scope.
Embed stage prompts into an AI-Researcher analyser loop so the agent advances only when completion criteria pass.
Compare three-seed metrics across datasets before claiming improvement in a write-up or dashboard.
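The multi-seed comparison can be sketched with the standard library alone. This is a hedged example, not the skill's code: `summarize_seeds` and `compare_to_baseline` are hypothetical names, and the rule that the gap in means should exceed the sum of the two standard deviations is an assumed, deliberately conservative heuristic rather than a formal significance test.

```python
from statistics import mean, stdev


def summarize_seeds(scores):
    """Return (mean, sample standard deviation) over per-seed scores."""
    return mean(scores), stdev(scores)


def compare_to_baseline(method_runs, baseline_runs):
    """Build a per-dataset comparison table.

    Both arguments map a dataset name to a list of per-seed scores
    (higher is better). Each list needs at least two scores for stdev.
    """
    rows = {}
    for dataset, scores in method_runs.items():
        m_mean, m_std = summarize_seeds(scores)
        b_mean, b_std = summarize_seeds(baseline_runs[dataset])
        rows[dataset] = {
            "method": f"{m_mean:.3f} ± {m_std:.3f}",
            "baseline": f"{b_mean:.3f} ± {b_std:.3f}",
            # Assumed heuristic: the gap must be larger than the combined spread.
            "gap_exceeds_spread": (m_mean - b_mean) > (m_std + b_std),
        }
    return rows
```

Reporting mean ± std per dataset keeps a single lucky seed from being presented as an improvement.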
How it compares
Use it instead of an ad-hoc “run training once” chat when you need staged gates comparable to formal experiment-design docs.
Common Questions / FAQ
Who is experiment-design for?
Solo builders and small teams running agent-assisted ML or scientific coding who want AI-Scientist-v2-style staged prompts rather than unstructured trial and error.
When should I use experiment-design?
In Validate when prototyping models, in Build when wiring research agents, and in Grow when turning experiment logs into comparable metrics. It matters most before Stage 3 creative changes.
Is experiment-design safe to install?
It is prompt-level methodology; review the Security Audits panel on this page for the parent repo, since downstream runs may execute arbitrary training code you supply.
SKILL.md
# Experiment Design Stage Prompts

Extracted from AI-Scientist-v2 (agent_manager.py) and AI-Researcher (exp_analyser.py).

## 4-Stage Progressive Experiment Framework (AI-Scientist-v2)

### Stage 1: Initial Implementation

**Goal**: Get a working baseline on a simple dataset.

```
Stage 1 - Initial Implementation:
- Implement the core method on the simplest dataset
- Ensure training converges (check training curves)
- Establish baseline metrics
- Verify code runs without errors

Completion criteria:
- Training loss decreases
- Validation metrics are reasonable (not random)
- Code executes end-to-end without errors
```

### Stage 2: Baseline Tuning

**Goal**: Optimize hyperparameters and test on multiple datasets.

```
Stage 2 - Baseline Tuning:
- Tune learning rate, batch size, and key hyperparameters
- Test on at least 2 datasets
- Compare against published baselines
- Run 3 seeds for statistical significance

Completion criteria:
- Results competitive with or better than baselines
- Consistent across multiple seeds
- Training curves show stable convergence
```

### Stage 3: Creative Research

**Goal**: Novel improvements and comprehensive evaluation.

```
Stage 3 - Creative Research:
- Implement novel improvements to the method
- Test on 3+ datasets
- Compare against 3+ baselines
- Ablation of key design choices
- Generate publication-quality figures

Completion criteria:
- Clear improvement over baselines on most datasets
- Ablation supports contribution claims
- Figures are informative and well-designed
```

### Stage 4: Ablation Studies

**Goal**: Systematic component analysis.

```
Stage 4 - Ablation Studies:
- Remove/modify each key component one at a time
- Measure impact on performance
- Sensitivity analysis for key hyperparameters
- Report statistical significance (mean ± std over 3+ seeds)

Completion criteria:
- Every claimed contribution verified by ablation
- Hyperparameter sensitivity is reasonable
- Results table is complete with all comparisons
```

## VLM-Based Stage Completion Check (AI-Scientist-v2)

```
Examine the training curves and results:
1. Is the training loss decreasing?
2. Is validation performance improving?
3. Has the model converged or does it need more epochs?
4. Are there signs of overfitting?
5. Is the performance competitive with baselines?

Based on this analysis, determine if the current stage is complete or if more experiments are needed.
```

## Best-Node Selection (AI-Scientist-v2)

```
Given the following experiment results and their training curves, holistically select the best experiment considering:
1. Final test performance (primary metric)
2. Training stability (smooth loss curves)
3. Consistency across seeds
4. Generalization (train-test gap)

Experiment results: {results_json}

Select the best experiment and justify your choice.
```

## Ablation Study Design (AI-Researcher)

```
Given the experimental results: {results}

Design an ablation study to verify each component's contribution:
1. List all key components of the method
2. For each component, propose a variant where it is removed/replaced
3. Predict expected impact of each removal
4. Prioritize: test the most impactful ablations first
```

## Sensitivity Analysis (AI-Researcher)

```
Design a sensitivity analysis for these hyperparameters: {hyperparameters}

For each hyperparameter:
1. Define a reasonable range to test
2. Specify the number of values to try
3. Identify which metrics to track
4. Note any interactions between hyperparameters
```

#!/usr/bin/env python3
"""Generate experiment design from a research plan.

Takes a research plan (JSON or text description) and generates a structured
experiment design with baselines, ablation matrix, hyperparameter grid, and
evaluation metrics.

Self-contained: uses only stdlib.

Usage:
    python design_experiments.py --plan research_plan.json --output experiment_design.json
    python design_experiments.py --method "contrastive learning" --task "image classification" --output des
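
The body of `design_experiments.py` is not reproduced here. As an illustration only, and not the script's actual code, a one-at-a-time ablation matrix and a hyperparameter grid could be built with the standard library roughly as follows. The function names and the row layout are hypothetical.

```python
from itertools import product


def ablation_matrix(components):
    """One row per variant: the full method, then each component removed in turn."""
    rows = [{"variant": "full", "enabled": {c: True for c in components}}]
    for removed in components:
        rows.append({
            "variant": f"no_{removed}",
            # Every component stays on except the one being ablated.
            "enabled": {c: c != removed for c in components},
        })
    return rows


def hyperparameter_grid(space):
    """Cartesian product of the value lists in `space` (name -> list of values)."""
    names = list(space)
    value_lists = [space[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]
```

A full grid grows multiplicatively with each added hyperparameter, so in practice the ranges from a sensitivity analysis would likely be kept small.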