Experiment Code

Experiment loops belong on the Validate shelf because they test hypotheses through repeated runs before you treat results as production-ready. Prototype is where you write each run into its own run_N folder, parse final_info.json, and hill-climb code until metrics stabilize.

Where it fits

Example use

Spin run_1 through run_5 with capped fix iterations while validating a new model architecture.

Example use

Embed the perform_experiments helper so your repo skill drives experiment.py after each codegen pass.

Example use

Re-run a failing nightly eval, feeding the tail of stderr back as the next prompt, until return_code is zero or the iteration cap is reached, so agent spend stays bounded.
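The capped retry idea in the last example can be sketched in a few lines. This is a minimal illustration, not the skill's own code: `retry_with_stderr_tail`, `fix_fn`, `MAX_ATTEMPTS`, and `STDERR_TAIL_CHARS` are hypothetical names, and `fix_fn` merely stands in for whatever coding agent receives the error prompt.

```python
import subprocess
import sys

MAX_ATTEMPTS = 4          # hypothetical cap on attempts
STDERR_TAIL_CHARS = 1500  # hypothetical: how much of stderr to feed back


def retry_with_stderr_tail(command, fix_fn, max_attempts=MAX_ATTEMPTS, timeout_s=60):
    """Run `command` until it exits 0 or the attempt cap is reached.

    `fix_fn(prompt)` is a stand-in for the agent; it gets the stderr tail.
    Returns (last_return_code, attempts_used).
    """
    return_code = None
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(command, stderr=subprocess.PIPE,
                                text=True, timeout=timeout_s)
        return_code = result.returncode
        if return_code == 0:
            return return_code, attempt
        # Only ask for a fix if another attempt is still allowed.
        if attempt < max_attempts:
            tail = result.stderr[-STDERR_TAIL_CHARS:]
            fix_fn(f"Run failed with the following error {tail}")
    return return_code, max_attempts


# Usage: a command that always fails, with a list collecting the prompts.
prompts = []
failing = [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(3)"]
code, used = retry_with_stderr_tail(failing, prompts.append, max_attempts=3)
```

Because the loop is a bounded `for`, the number of agent calls can never exceed `max_attempts - 1`, whatever the command does.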

How it compares

A pattern reference for agent-driven lab loops, not a hosted experiment platform or managed notebook runner.

Common Questions / FAQ

Who is experiment-code for?

Solo and indie builders running AI-research or ML prototypes who want their coding agent to execute, parse, and retry experiments safely.

When should I use experiment-code?

During Validate when prototyping hypotheses with run folders; during Build when hardening agent-tooling that launches long-running training jobs; and during Operate when iterating on failed production eval scripts—always with explicit timeouts and iteration caps.

Is experiment-code safe to install?

It describes subprocess execution and arbitrary Python commands—review the Security Audits panel on this page and restrict cwd, timeouts, and network before running on sensitive machines.
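As a rough sketch of the cwd and timeout restrictions mentioned above, the helper below runs a command in a chosen directory, with a hard timeout and a stripped-down environment. `run_sandboxed` and the environment allowlist are assumptions made for illustration. Network isolation is not covered, since it needs OS-level controls such as containers or firewall rules.

```python
import os
import subprocess
import sys
import tempfile


def run_sandboxed(command, workdir, timeout_s=30):
    """Run `command` inside `workdir` with a timeout and a minimal environment.

    Returns (return_code, stderr_text); return_code is None on timeout.
    """
    # Pass through only a small allowlist so parent-process secrets do not leak.
    env = {k: os.environ[k] for k in ("PATH", "SYSTEMROOT") if k in os.environ}
    try:
        result = subprocess.run(command, cwd=workdir, env=env,
                                stderr=subprocess.PIPE, text=True,
                                timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising.
        return None, f"timed out after {timeout_s}s"
    return result.returncode, result.stderr


# Usage: a throwaway directory and a command that sleeps past the timeout.
with tempfile.TemporaryDirectory() as tmp:
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    timed_out_code, message = run_sandboxed(slow, tmp, timeout_s=0.5)
```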

SKILL.md


# Experiment Code Patterns Reference

## Pattern 1: Experiment Execution Loop (AI-Scientist)

```python
import json
import os.path as osp
import subprocess

MAX_ITERS = 4             # Max fix attempts per run
MAX_RUNS = 5              # Max experiment runs
MAX_STDERR_OUTPUT = 1500  # Keep only the last N characters of stderr

def perform_experiments(idea, folder_name, coder, baseline_results):
    current_iter = 0
    run = 1
    next_prompt = initial_prompt.format(...)

    while run < MAX_RUNS + 1:
        if current_iter >= MAX_ITERS:
            break  # Fix budget for this run is exhausted
        coder_out = coder.run(next_prompt)
        if "ALL_COMPLETED" in coder_out:
            break  # The agent marked the study as done
        return_code, next_prompt = run_experiment(folder_name, run)
        if return_code == 0:
            # Success: move on to the next run with a fresh fix budget
            run += 1
            current_iter = 0
        else:
            # Count only failed attempts, so every run gets the full MAX_ITERS
            current_iter += 1

def run_experiment(folder_name, run_num, timeout=7200):
    cwd = osp.abspath(folder_name)  # Run inside the experiment folder
    command = ["python", "experiment.py", f"--out_dir=run_{run_num}"]
    result = subprocess.run(command, cwd=cwd, stderr=subprocess.PIPE,
                            text=True, timeout=timeout)
    if result.returncode != 0:
        # Feed only the tail of stderr back to the agent
        stderr_output = result.stderr[-MAX_STDERR_OUTPUT:]
        next_prompt = f"Run failed with the following error {stderr_output}"
    else:
        # Summarize the run by the "means" entry of each result
        with open(osp.join(cwd, f"run_{run_num}", "final_info.json")) as f:
            results = json.load(f)
        results = {k: v["means"] for k, v in results.items()}
        next_prompt = f"Run {run_num} completed. Results: {results}"
    return result.returncode, next_prompt
```
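To make the success branch of `run_experiment` concrete, here is a small self-contained sketch of the `final_info.json` summarization step. The file layout (a mapping from a name to an object with a `"means"` field) is inferred from the `v["means"]` lookup in Pattern 1. The key names `toy_dataset`, `final_loss`, and `stderrs` are made-up sample data, and `summarize_run` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path


def summarize_run(run_dir):
    """Load <run_dir>/final_info.json and keep only each entry's "means" field."""
    with open(Path(run_dir) / "final_info.json") as f:
        info = json.load(f)
    return {name: entry["means"] for name, entry in info.items()}


# Hypothetical sample data; real keys depend on what experiment.py writes.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "run_1"
    run_dir.mkdir()
    (run_dir / "final_info.json").write_text(json.dumps({
        "toy_dataset": {
            "means": {"final_loss": 0.25},
            "stderrs": {"final_loss": 0.01},
        }
    }))
    summary = summarize_run(run_dir)
    next_prompt = f"Run 1 completed. Results: {summary}"
```

Dropping everything except the means keeps the follow-up prompt short, which matters when the summary is fed back to the agent on every run.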

## Pattern 2: Hill-Climbing Code Optimization (AgentLaboratory)

```python
from copy import copy  # used below to snapshot code and results

def solve(self):
    num_attempts = 0
    best_pkg = None
    top_score = None

    while True:
        model_resp = query_model(
            system_prompt=self.system_prompt(),
            prompt=f"History: {self.history_str()}\nEnter a command: ",
            temp=1.0
        )
        cmd_str, code_lines, prev_code_ret, should_execute_code, score = \
            self.process_command(model_resp)

        if score is not None:
            if top_score is None or score > top_score:
                best_pkg = copy(code_lines), copy(prev_code_ret), ...
                top_score = score

        if num_attempts >= self.min_gen_trials and top_score is not None:
            break
        num_attempts += 1

    # Keep best code variant
    if top_score > self.best_codes[-1][1]:
        self.best_codes.append((copy(self.code_lines), copy(top_score), ...))
        self.best_codes.sort(key=lambda x: x[1], reverse=True)
        if len(self.best_codes) >= self.max_codes:
            self.best_codes.pop(-1)
            self.code_reflect = self.reflect_code()
```
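The "keep best code variant" step can be isolated into a small standalone function. The sketch below is a simplified illustration of a bounded, score-sorted candidate list, not AgentLaboratory's implementation. `update_best` and its `(code, score)` tuples are hypothetical, and the reflection step is left out.

```python
def update_best(best_codes, candidate_code, candidate_score, max_codes=2):
    """Insert a candidate into a score-sorted list of (code, score) pairs.

    Keeps at most `max_codes` entries, highest score first.
    Returns True if the candidate was kept, False if it was rejected.
    """
    # A full list only accepts candidates that beat its current worst entry.
    if len(best_codes) >= max_codes and candidate_score <= best_codes[-1][1]:
        return False
    best_codes.append((candidate_code, candidate_score))
    best_codes.sort(key=lambda pair: pair[1], reverse=True)
    del best_codes[max_codes:]  # drop anything beyond the cap
    return True


# Usage: feed four scored variants through a list capped at two entries.
best = []
kept = [update_best(best, code, score)
        for code, score in [("a", 0.1), ("b", 0.5), ("c", 0.3), ("d", 0.2)]]
```

Sorting after every insert is fine for the handful of variants such a loop keeps; a heap would only pay off for much larger pools.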

## Pattern 3: Initial Code Generation with Error History (AgentLaboratory)

```python
def gen_initial_code(self):
    num_attempts = 0
    error_hist = []

    while True:
        if num_attempts == 0:
            err_hist = ""
        else:
            err = f"Previous command: {model_resp}. Error: {cmd_str}. " \
                  f"Do not repeat this error."
            error_hist.append(err)
            if len(error_hist) == 5:
                error_hist.pop(0)
            err_hist = "Error history:\n" + "\n".join(error_hist) + \
                      "\nDO NOT REPEAT THESE."

        model_resp = query_model(
            system_prompt=self.system_prompt(),
            prompt=f"{err_hist}\nUse ```REPLACE to create initial code: ",
            temp=1.0
        )
        cmd_str, code_lines, prev_code_ret, should_execute_code, score = \
            self.process_command(model_resp)
        if score is not None:
            break
        num_attempts += 1

    return code_lines, prev_code_ret, score
```
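The rolling error history in Pattern 3 drops the oldest entry once the list reaches five items, so at most four errors survive. A `collections.deque` with `maxlen` gives the same bound without manual popping. This is an alternative sketch, not the upstream code; `build_error_prompt` and `MAX_ERRORS` are hypothetical names, and the error strings are placeholders.

```python
from collections import deque

MAX_ERRORS = 4  # Pattern 3 pops at length 5, so at most 4 entries remain


def build_error_prompt(error_hist):
    """Format the retained errors as a prompt prefix; empty if there are none."""
    if not error_hist:
        return ""
    return "Error history:\n" + "\n".join(error_hist) + "\nDO NOT REPEAT THESE."


# Usage: six placeholder errors go in, only the newest four are retained.
error_hist = deque(maxlen=MAX_ERRORS)
for i in range(6):
    error_hist.append(
        f"Previous command: cmd_{i}. Error: err_{i}. Do not repeat this error."
    )
prompt_prefix = build_error_prompt(error_hist)
```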

## Pattern 4: Code Reflection for Improvement (AgentLaboratory)

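```python
def reflect_code(self):
    # Concatenate every retained variant with its score
    code_strs = "\n\n".join([
        f"Code variant:\n{code}\nScore: {score}"
        for code, score, _ in self.best_codes
    ])

    prompt = f"""Please reflect on ideas for how to improve your current code.
Examine the provided code and think very specifically (with precise ideas)
on how to improve performance, w..."""
    # The source excerpt is truncated here; the rest of the prompt and the
    # remainder of the method are not shown.
```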

What is this skill?

AI-Scientist-style loop with up to 5 runs and 4 fix iterations per failed run

Subprocess execution of experiment.py with 7200s timeout and truncated stderr feedback

Automatic prompt chaining from failures or summarized means from final_info.json

Hill-climbing code optimization pattern for iterative scientific coding (AgentLaboratory-style)

Explicit ALL_COMPLETED break condition to stop when the agent marks the study done

MAX_RUNS = 5

MAX_ITERS = 4 per run

MAX_STDERR_OUTPUT = 1500 characters

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Adoption & trust: 678 installs on skills.sh; 114 GitHub stars; 2/3 security scanners passed (skills.sh audits).
