
Experiment Code
Run iterative ML or research experiments with agent-driven fix loops, subprocess execution, and structured result handoff back to the coder.
Overview
experiment-code is an agent skill most often used in Validate (also Build) that encodes subprocess experiment loops and hill-climbing optimization patterns for research coding agents.
Install
npx skills add https://github.com/lingzhi227/agent-research-skills --skill experiment-codeWhat is this skill?
- AI-Scientist-style loop with up to 5 runs and 4 fix iterations per failed run
- Subprocess execution of experiment.py with 7200s timeout and truncated stderr feedback
- Automatic prompt chaining from failures or summarized means from final_info.json
- Hill-climbing code optimization pattern for iterative scientific coding (AgentLaboratory-style)
- Explicit ALL_COMPLETED break condition to stop when the agent marks the study done
- MAX_RUNS = 5
- MAX_ITERS = 4 per run
- MAX_STDERR_OUTPUT = 1500 characters
Adoption & trust: 678 installs on skills.sh; 114 GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You want an agent to run training or evaluation scripts repeatedly but lack a bounded loop that turns stderr and JSON metrics into the next coding prompt.
Who is it for?
Indie ML builders wiring Claude Code or Codex to execute experiment.py under run_* directories with automatic failure recovery.
Skip if: Teams that only need static documentation of algorithms without executing shell subprocesses or managing iterative agent fix cycles.
When should I use this skill?
Implementing or debugging automated research experiment execution with agent fix loops and subprocess result feedback.
What do I get? / Deliverables
You get reusable loop templates with MAX_RUNS, MAX_ITERS, and structured next_prompt text so experiments advance run-by-run or stop cleanly on ALL_COMPLETED.
- Experiment execution loop implementation
- run_experiment subprocess wrapper
- Hill-climbing optimization scaffold
Recommended Skills
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Experiment loops belong on the Validate shelf because they prove hypotheses through repeated runs before you treat results as production-ready. Prototype is where you execute run_N folders, parse final_info.json, and hill-climb code until metrics stabilize.
Where it fits
Spin run_1 through run_5 with capped fix iterations while validating a new model architecture.
Embed the perform_experiments helper so your repo skill drives experiment.py after each codegen pass.
Re-run a failing nightly eval with stderr tail prompts until return_code is zero without unbounded agent spend.
How it compares
Pattern reference for agent-driven lab loops—not a hosted experiment platform or managed notebook runner.
Common Questions / FAQ
Who is experiment-code for?
Solo and indie builders running AI-research or ML prototypes who want their coding agent to execute, parse, and retry experiments safely.
When should I use experiment-code?
During Validate when prototyping hypotheses with run folders; during Build when hardening agent-tooling that launches long-running training jobs; and during Operate when iterating on failed production eval scripts—always with explicit timeouts and iteration caps.
Is experiment-code safe to install?
It describes subprocess execution and arbitrary Python commands—review the Security Audits panel on this page and restrict cwd, timeouts, and network before running on sensitive machines.
SKILL.md
READMESKILL.md - Experiment Code
# Experiment Code Patterns Reference ## Pattern 1: Experiment Execution Loop (AI-Scientist) ```python MAX_ITERS = 4 # Max fix attempts per run MAX_RUNS = 5 # Max experiment runs MAX_STDERR_OUTPUT = 1500 # Truncate stderr def perform_experiments(idea, folder_name, coder, baseline_results): current_iter = 0 run = 1 next_prompt = initial_prompt.format(...) while run < MAX_RUNS + 1: if current_iter >= MAX_ITERS: break coder_out = coder.run(next_prompt) if "ALL_COMPLETED" in coder_out: break return_code, next_prompt = run_experiment(folder_name, run) if return_code == 0: run += 1 current_iter = 0 current_iter += 1 def run_experiment(folder_name, run_num, timeout=7200): command = ["python", "experiment.py", f"--out_dir=run_{run_num}"] result = subprocess.run(command, cwd=cwd, stderr=subprocess.PIPE, text=True, timeout=timeout) if result.returncode != 0: stderr_output = result.stderr[-MAX_STDERR_OUTPUT:] next_prompt = f"Run failed with the following error {stderr_output}" else: results = json.load(open(f"run_{run_num}/final_info.json")) results = {k: v["means"] for k, v in results.items()} next_prompt = f"Run {run_num} completed. Results: {results}" return result.returncode, next_prompt ``` ## Pattern 2: Hill-Climbing Code Optimization (AgentLaboratory) ```python def solve(self): num_attempts = 0 best_pkg = None top_score = None while True: model_resp = query_model( system_prompt=self.system_prompt(), prompt=f"History: {self.history_str()}\nEnter a command: ", temp=1.0 ) cmd_str, code_lines, prev_code_ret, should_execute_code, score = \ self.process_command(model_resp) if score is not None: if top_score is None or score > top_score: best_pkg = copy(code_lines), copy(prev_code_ret), ... top_score = score if num_attempts >= self.min_gen_trials and top_score is not None: break num_attempts += 1 # Keep best code variant if top_score > self.best_codes[-1][1]: self.best_codes.append((copy(self.code_lines), copy(top_score), ...)) self.best_codes.sort(key=lambda x: x[1], reverse=True) if len(self.best_codes) >= self.max_codes: self.best_codes.pop(-1) self.code_reflect = self.reflect_code() ``` ## Pattern 3: Initial Code Generation with Error History (AgentLaboratory) ```python def gen_initial_code(self): num_attempts = 0 error_hist = [] while True: if num_attempts == 0: err_hist = "" else: err = f"Previous command: {model_resp}. Error: {cmd_str}. " \ f"Do not repeat this error." error_hist.append(err) if len(error_hist) == 5: error_hist.pop(0) err_hist = "Error history:\n" + "\n".join(error_hist) + \ "\nDO NOT REPEAT THESE." model_resp = query_model( system_prompt=self.system_prompt(), prompt=f"{err_hist}\nUse ```REPLACE to create initial code: ", temp=1.0 ) cmd_str, code_lines, prev_code_ret, should_execute_code, score = \ self.process_command(model_resp) if score is not None: break num_attempts += 1 return code_lines, prev_code_ret, score ``` ## Pattern 4: Code Reflection for Improvement (AgentLaboratory) ```python def reflect_code(self): code_strs = "\n\n".join([ f"Code variant:\n{code}\nScore: {score}" for code, score, _ in self.best_codes ]) prompt = f"""Please reflect on ideas for how to improve your current code. Examine the provided code and think very specifically (with precise ideas) on how to improve performance, w