Ai Research Reproduction

Name: Ai Research Reproduction
Author: lllllllama

lllllllama/rigorpilot-skills

176k installs
512 repo stars
Updated July 26, 2026
lllllllama/rigorpilot-skills

ai-research-reproduction is a Claude Code skill for README-first deep learning repository reproduction with end-to-end intake, setup, execution, and evidence recording.

About

A repository reproduction skill for deep learning projects. Use it when you want a minimal trustworthy run that reads the README first, selects the smallest documented target, and records all evidence and deviations.

README-first deep learning repository reproduction with end-to-end flow
Selects smallest trustworthy target (inference, evaluation, or training)
Records all assumptions, deviations, and evidence in standardized outputs

Ai Research Reproduction by the numbers

176,011 all-time installs (skills.sh)
+25,339 installs in the week ending Jul 28, 2026 (Skillselion tracking)
Ranked #2 of 2,066 Data Science & ML skills by installs in the Skillselion catalog
Security screen: LOW risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

At a glance

ai-research-reproduction capabilities & compatibility

Capabilities: repository intake · environment bootstrap · execution orchestration · evidence collection
Use cases: research · debugging · testing

npx skills add https://github.com/lllllllama/rigorpilot-skills --skill ai-research-reproduction

Add your badge

Show developers this skill is listed on Skillselion. Paste this into your README.

[![Listed on Skillselion](https://skillselion.com/badge/skills/lllllllama/rigorpilot-skills/ai-research-reproduction.svg)](https://skillselion.com/skills/lllllllama/rigorpilot-skills/ai-research-reproduction)

Installs	176k
repo stars	★ 512
Security audit	3 / 3 scanners passed
Last updated	July 26, 2026
Repository	lllllllama/rigorpilot-skills ↗

What it does

Reproduce deep learning repository results with minimal trustworthy flow and auditable evidence.

Who is it for?

Minimal trustworthy reproduction,Baseline verification,Training startup or short-run verification

Skip if: Paper summaries,Generic environment setup,Standalone command execution,Broad research assistance

When should I use this skill?

The target is an AI repository with README, scripts, and documented commands, and the request spans multiple trusted phases (intake, setup, execution, analysis).

What you get

Standardized repro_outputs/ bundle with SUMMARY.md, COMMANDS.md, LOG.md, SCIENTIFIC_CHANGELOG.md, COMPARABILITY_REPORT.md, status.json, and optional PATCHES.md.

repro_outputs/SUMMARY.md
repro_outputs/COMMANDS.md
repro_outputs/LOG.md

By the numbers

7 standardized output files documenting reproduction with assumptions and deviations

Files

SKILL.mdMarkdownGitHub ↗

ai-research-reproduction

Purpose

Use this as the Rigor Reproduce compatible skill slug for README-first deep learning repository reproduction. The installed slug remains ai-research-reproduction for compatibility. The skill guides the agent toward a minimal trustworthy run with auditable evidence; it should not micromanage implementation details that the model can infer from the repository. Reproduction is not "make it run by changing anything"; it means faithfully reading the README, environment, weights, datasets, and documented commands, then recording results and deviations.

Start from the shared operating principles in ../../references/agent-operating-principles.md, then load ../../references/research-rigor-principles.md and ../../references/deep-learning-experiment-principles.md when scientific meaning, comparability, or experiment details are at stake.

Fit

Use this skill when all are true:

The target is an AI code repository with a README, scripts, configs, or

documented commands.

The request spans multiple trusted phases such as intake, setup, execution,

training verification, analysis, paper-gap resolution, and reporting.

The desired result is a small reproducible target, not broad experimentation.

Do not use this skill for paper summaries, generic environment setup, isolated repo scanning, standalone command execution, open-ended research design, or explicit candidate-only exploration.

Trusted Target Selection

Choose the smallest target that can honestly demonstrate repository-grounded reproduction:

1. documented inference 2. documented evaluation 3. documented training startup or partial verification 4. full training only after explicit user confirmation

Treat README guidance as the primary reproduction intent. Use repository files to clarify the README, not to silently replace it. When the README and paper conflict, record the conflict and use paper-context-resolver only for the narrow reproduction-critical gap.

Workflow

1. Read the README and nearby repo signals. 2. Use repo-intake-and-plan to extract documented commands and candidate targets. 3. Select and justify the minimum trustworthy target. 4. Use env-and-assets-bootstrap only for target-specific environment, checkpoint, dataset, and cache assumptions. 5. Use analyze-project only when structure, insertion points, or suspicious implementation patterns need read-only clarification. 6. Use minimal-run-and-audit for documented inference, evaluation, smoke, or sanity execution. 7. Use run-train instead when the selected trusted target is training startup, short-run verification, full kickoff, or resume. 8. Pause for human review before fuller training claims or any change that could alter dataset, split, checkpoint, preprocessing, metric, loss, model semantics, or result interpretation. 9. Write the standardized outputs and give a concise final note in the user's language when practical.

Patch Boundary

Prefer no repository edits. If edits are needed, keep them conservative and auditable:

Try command-line arguments, environment variables, path fixes, dependency

version fixes, or dependency-file fixes before code changes.

Reproduction fixes are allowed when needed, but they must not be hidden. State

what changed, why it was necessary, whether it changes scientific meaning, and whether it affects comparability with the paper, README, or baseline.

Avoid changing model architecture, core inference semantics, training logic,

loss functions, or experiment meaning.

If repository files must change, create a branch named

repro/YYYY-MM-DD-short-task, keep verified patch commits sparse, and record README-fidelity impact in PATCHES.md.

See references/patch-policy.md.

Outputs

Always target repro_outputs/:

SUMMARY.md
COMMANDS.md
LOG.md
SCIENTIFIC_CHANGELOG.md
COMPARABILITY_REPORT.md
status.json
PATCHES.md   # only if patches were applied

Use the templates under assets/ and the field rules in references/output-spec.md.

Put the shortest high-value summary in SUMMARY.md.
Put copyable commands in COMMANDS.md.
Put process evidence, assumptions, failures, and decisions in LOG.md.
Put scientific meaning and change effects in SCIENTIFIC_CHANGELOG.md.
Put comparison anchors and protocol deviations in COMPARABILITY_REPORT.md.
Put durable machine-readable state in status.json.
Put branch, commit, validation, and README-fidelity impact in PATCHES.md when needed.
Distinguish verified facts from inferred guesses.

Reference Loading

Load references/language-policy.md when writing human-readable outputs.
Load ../../references/research-rigor-principles.md before making

comparability, contribution, or research-result claims.

Load ../../references/deep-learning-experiment-principles.md when dataset,

split, metric, checkpoint, training, or evaluation details matter.

Load references/research-safety-principles.md before protocol-sensitive

decisions.

Load references/patch-policy.md before modifying repository files.
Keep specialized logic in sub-skills, scripts, templates, or references rather

than expanding this entrypoint.

display_name: Rigor Reproduce
short_description: Rigor Reproduce compatible slug for end-to-end README-first deep learning repo reproduction.
default_prompt: Reproduce this deep learning paper repository with README-first rules, choose the smallest documented inference or evaluation target, keep patches conservative and clearly labeled, and write standardized outputs to repro_outputs/.

Commands

Setup

{{setup_commands}}

Assets

{{asset_commands}}

Main run

{{run_commands}}

Verification

{{verification_commands}}

Notes

Main run label: {{main_run_label}}
Main run source: {{main_run_source}}
Main run section: {{main_run_section}}

{
  "schema_version": "1.0",
  "generated_at": "{{generated_at}}",
  "user_language": "{{user_language}}",
  "target_repo": "{{target_repo}}",
  "readme_first": true,
  "selected_goal": "{{selected_goal}}",
  "goal_priority": "{{goal_priority}}",
  "status": "{{status}}",
  "documented_command_status": "{{documented_command_status}}",
  "documented_command": "{{documented_command}}",
  "documented_command_kind": "{{documented_command_kind}}",
  "documented_command_source": "{{documented_command_source}}",
  "documented_command_section": "{{documented_command_section}}",
  "patches_applied": false,
  "patch_branch": null,
  "readme_fidelity": null,
  "highest_patch_risk": null,
  "evidence_level": "{{evidence_level}}",
  "assumptions": [],
  "unverified_inferences": [],
  "protocol_deviations": [],
  "human_decisions_required": [],
  "next_safe_action": "{{next_safe_action}}",
  "artifact_provenance": [],
  "verified_commit_count": 0,
  "outputs": {
    "summary": "repro_outputs/SUMMARY.md",
    "commands": "repro_outputs/COMMANDS.md",
    "log": "repro_outputs/LOG.md",
    "status": "repro_outputs/status.json",
    "patches": null
  },
  "notes": []
}

Output Spec

All runs should target the same output directory:

repro_outputs/

When the selected trustworthy target is documented training, the orchestrator may also emit a supplemental:

train_outputs/

That training bundle should hold the training-specific checkpoint, metric, and monitoring state, while repro_outputs/ remains the primary reproduction-facing summary.

`SUMMARY.md`

Audience:

first human reader
another model that needs the high-level result fast

Requirements:

keep it within one page when possible
state target repo and selected reproduction goal
state overall outcome clearly
list the main documented command that was attempted or verified
list the biggest blocker if not successful
when patches were applied, surface patch state briefly:
patches_applied
patch branch
README fidelity impact
highest patch risk

`COMMANDS.md`

Requirements:

commands must be copyable
separate setup, assets, run, and verification steps
label each command as documented, adapted, or inferred
avoid dumping noise from the shell history

`LOG.md`

Requirements:

concise chronological record
include assumptions, evidence, failures, retries, and decisions
distinguish between README-backed steps and inferred steps

`status.json`

Requirements:

keys remain in English
enums remain stable
values can summarize both success and partial verification
preserve observability for assumptions, deviations, evidence level, and human review points

Suggested top-level keys:

schema_version
generated_at
user_language
target_repo
readme_first
selected_goal
goal_priority
status
documented_command_status
documented_command
documented_command_kind
documented_command_source
documented_command_section
patches_applied
patch_branch
readme_fidelity
highest_patch_risk
evidence_level
assumptions
unverified_inferences
protocol_deviations
human_decisions_required
next_safe_action
artifact_provenance
verified_commit_count
outputs
notes

Recommended status enums:

success
partial
blocked
not_run

Recommended evidence level enums:

direct
mixed
inferred

Field intent:

assumptions
important assumptions that still shape execution or interpretation
unverified_inferences
bounded inferences that were useful but not directly verified
protocol_deviations
meaningful differences from README, paper, or documented setup
human_decisions_required
decisions that should not be taken implicitly by the agent
next_safe_action
the lowest-risk next step a researcher can review or run
artifact_provenance
where key inputs or outputs came from, such as README, repo path, paper, dataset root, checkpoint, or generated logs

`PATCHES.md`

Only create this file when repository files were modified.

Requirements:

record patch branch name
record highest patch risk
record verified commits in order
explain what changed and why for each verified commit
explain how each change was verified
state whether README fidelity was preserved, clarified, or diverged
record changed files for each verified commit
keep human-readable prose in the user's language when practical, but preserve commit hashes, branch names, and command strings verbatim

Patch Policy

Default stance

Avoid patching repository code unless reproduction is blocked and the change is both conservative and auditable.

Allowed-first changes

Prefer these classes first:

environment variables
filesystem path fixes
dependency pin adjustments
requirements.txt or environment.yml corrections
command-line argument fixes that preserve documented intent

Allowed-with-caution changes

small compatibility fixes for modern Python or library versions
explicit path joins or file existence guards
OS portability fixes when the documented command remains semantically the same

These still need justification and verification.

Default disallowed changes

architecture changes
dataset label changes
loss or metric definition changes
training loop rewrites
inference output logic rewrites
silent behavior changes that make the command "pass" but alter experiment meaning

Branching and commits

Before the first repo file edit, create:

repro/YYYY-MM-DD-short-task

Commit only after a verified group of changes.

Preferred commit message:

repro: <scope> for documented <command>

Keep verified patch commits sparse:

ideal: 0-2
upper practical bound for this workflow: about 3

Reporting

Every patch must be reflected in PATCHES.md with:

branch
highest risk
commit hash
changed files
rationale
verification
README fidelity impact

If no repository files were modified:

do not create PATCHES.md
keep patches_applied false in status.json
keep the explanation in SUMMARY.md or LOG.md focused on the run and blocker, not on hypothetical patches

#!/usr/bin/env python3
"""Minimal orchestration for README-first reproduction scaffolding."""

from __future__ import annotations

import argparse
import json
import re
import shlex
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def locale(user_language: str) -> str:
    return "zh" if user_language.lower().startswith("zh") else "en"


def text(user_language: str, en: str, zh: str) -> str:
    return zh if locale(user_language) == "zh" else en


def run_json(script: Path, args: List[str]) -> Dict[str, Any]:
    command = [sys.executable, str(script), *args]
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    return json.loads(result.stdout)


def write_bundle(script: Path, output_dir: Path, context: Dict[str, Any]) -> None:
    output_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".json", delete=False) as handle:
        context_path = Path(handle.name)
        handle.write(json.dumps(context, indent=2, ensure_ascii=False))

    try:
        subprocess.run(
            [
                sys.executable,
                str(script),
                "--context-json",
                str(context_path),
                "--output-dir",
                str(output_dir),
            ],
            check=True,
        )
    finally:
        if context_path.exists():
            context_path.unlink()


def build_asset_commands(asset_data: Dict[str, Any]) -> List[Dict[str, str]]:
    commands: List[Dict[str, str]] = []
    for item in asset_data.get("manifest", []):
        group = item.get("asset_group", "asset")
        target = item.get("target_path", "")
        if item.get("status") == "present":
            commands.append({"label": "inferred", "command": f"# Found existing {group} asset path at {item.get('source_hint')}."})
        else:
            commands.append({"label": "inferred", "command": f"# Prepare {group} assets under {target} before the documented run."})

    for hint in asset_data.get("text_hints", [])[:3]:
        descriptor = hint.get("paths") or hint.get("urls") or hint.get("line", "")
        source = Path(hint.get("source", "README.md")).name
        commands.append({"label": "documented", "command": f"# Asset hint from {source}: {descriptor}"})
    return commands


def derive_dataset_hint(asset_data: Dict[str, Any]) -> str:
    for hint in asset_data.get("text_hints", []):
        if "dataset" in hint.get("line", "").lower():
            return hint.get("paths") or hint.get("urls") or "README-documented dataset"
    for item in asset_data.get("manifest", []):
        if item.get("asset_group") in {"datasets", "data"} and item.get("status") == "present":
            return item.get("source_hint", "repo-local dataset")
    return "unknown"


def derive_checkpoint_hint(asset_data: Dict[str, Any]) -> str:
    for hint in asset_data.get("text_hints", []):
        line = hint.get("line", "").lower()
        if "checkpoint" in line or "weight" in line or "model" in line:
            return hint.get("paths") or hint.get("urls") or "README-documented checkpoint"
    for item in asset_data.get("manifest", []):
        if item.get("asset_group") in {"checkpoints", "weights"} and item.get("status") == "present":
            return item.get("source_hint", "repo-local checkpoint")
    return "none"


def extract_config_path(command: str) -> str | None:
    tokens = shlex.split(command, posix=False)
    for index, token in enumerate(tokens):
        if token in {"--config", "--cfg"} and index + 1 < len(tokens):
            return tokens[index + 1]
        if token.startswith("--config="):
            return token.split("=", 1)[1]
        if token.startswith("--cfg="):
            return token.split("=", 1)[1]
    return None


def estimate_training_duration(repo_path: Path, command: str, max_train_steps: int) -> str:
    if max_train_steps > 0:
        if max_train_steps <= 200:
            return f"roughly minutes to under 1 hour for about {max_train_steps} steps, depending on dataset size and GPU throughput"
        if max_train_steps <= 5000:
            return f"roughly hours for about {max_train_steps} steps, depending on dataset size and GPU throughput"
        return f"likely many hours to multi-day for about {max_train_steps} steps, depending on dataset size and GPU throughput"

    config_rel = extract_config_path(command)
    if config_rel:
        config_path = (repo_path / config_rel).resolve()
        if config_path.exists() and config_path.suffix.lower() in {".yaml", ".yml", ".json", ".toml", ".py"}:
            text_content = config_path.read_text(encoding="utf-8", errors="replace")
            step_match = None
            for key in ["max_steps", "total_steps", "train_steps", "num_steps"]:
                step_match = re.search(rf"{key}\s*[:=]\s*(\d+)", text_content, flags=re.IGNORECASE)
                if step_match:
                    steps = int(step_match.group(1))
                    if steps <= 200:
                        return f"roughly minutes to under 1 hour from config-bound {steps} steps, depending on GPU throughput"
                    if steps <= 5000:
                        return f"roughly hours from config-bound {steps} steps, depending on GPU throughput"
                    return f"likely many hours to multi-day from config-bound {steps} steps, depending on dataset size and GPU throughput"

            epoch_match = None
            for key in ["epochs", "max_epochs", "num_epochs", "train_epochs"]:
                epoch_match = re.search(rf"{key}\s*[:=]\s*(\d+)", text_content, flags=re.IGNORECASE)
                if epoch_match:
                    epochs = int(epoch_match.group(1))
                    if epochs <= 3:
                        return f"roughly minutes to under 1 hour for about {epochs} epochs, depending on dataset size and GPU throughput"
                    if epochs <= 20:
                        return f"roughly hours for about {epochs} epochs, depending on dataset size and GPU throughput"
                    return f"likely many hours to multi-day for about {epochs} epochs, depending on dataset size and GPU throughput"

    return "unknown; likely hours to multi-day on the full dataset until a bounded schedule is confirmed"


def command_score(command: Dict[str, Any]) -> int:
    text_value = str(command.get("command", "")).lower()
    kind = command.get("kind", "run")
    score = {"run": 40, "smoke": 30, "asset": 10, "setup": 0}.get(kind, 0)

    if any(token in text_value for token in ["python ", "python3 ", "./", "whisper "]):
        score += 8
    if any(token in text_value for token in ["txt2img", "img2img", "amg.py", "transcribe", "infer", "eval"]):
        score += 8
    if "<" in text_value and ">" in text_value:
        score -= 10
    if text_value.startswith(("pip install", "conda install", "conda env create", "conda activate", "git clone", "cd ")):
        score -= 12
    return score


def choose_goal(commands: List[Dict[str, Any]]) -> Dict[str, Any]:
    for category in ["inference", "evaluation", "training", "other"]:
        candidates = [item for item in commands if item.get("category") == category]
        if not candidates:
            continue
        best = max(candidates, key=command_score)
        return {
            "selected_goal": category,
            "goal_priority": category,
            "documented_command": best.get("command", ""),
            "command_source": best.get("source", "readme"),
            "documented_command_kind": best.get("kind", "run"),
            "documented_command_section": best.get("section"),
        }

    return {
        "selected_goal": "repo-intake-only",
        "goal_priority": "other",
        "documented_command": "",
        "command_source": "none",
        "documented_command_kind": "none",
        "documented_command_section": None,
    }


def plan_skill_chain(selected_goal: str, include_analysis_pass: bool, include_paper_gap: bool) -> List[str]:
    chain = [
        "repo-intake-and-plan",
        "env-and-assets-bootstrap",
    ]
    if include_analysis_pass:
        chain.append("analyze-project")
    chain.append("run-train" if selected_goal == "training" else "minimal-run-and-audit")
    if include_paper_gap:
        chain.append("paper-context-resolver")
    return chain


def maybe_run_command(repo_path: Path, command: str, timeout: int, user_language: str) -> Dict[str, Any]:
    if not command:
        return {
            "status": "not_run",
            "documented_command_status": "not_run",
            "execution_log": [],
            "main_blocker": text(
                user_language,
                "No documented command was extracted from README.",
                "README 中未提取到已文档化命令。",
            ),
        }

    try:
        result = subprocess.run(
            shlex.split(command, posix=False),
            cwd=repo_path,
            capture_output=True,
            text=True,
            timeout=timeout,
            check=False,
        )
    except FileNotFoundError as exc:
        return {
            "status": "blocked",
            "documented_command_status": "blocked",
            "execution_log": [f"Command failed before launch: {exc}"],
            "main_blocker": text(
                user_language,
                f"Executable not found for documented command: {exc}",
                f"文档命令缺少可执行程序：{exc}",
            ),
        }
    except subprocess.TimeoutExpired:
        return {
            "status": "partial",
            "documented_command_status": "partial",
            "execution_log": [f"Command timed out after {timeout} seconds."],
            "main_blocker": text(
                user_language,
                f"Selected documented command did not finish within {timeout} seconds.",
                f"选定的文档命令未在 {timeout} 秒内完成。",
            ),
        }

    combined: List[str] = []
    if result.stdout.strip():
        combined.append("STDOUT:\n" + result.stdout.strip())
    if result.stderr.strip():
        combined.append("STDERR:\n" + result.stderr.strip())

    if result.returncode == 0:
        return {
            "status": "success",
            "documented_command_status": "success",
            "execution_log": combined,
            "main_blocker": text(user_language, "None.", "无。"),
        }

    return {
        "status": "partial",
        "documented_command_status": "partial",
        "execution_log": combined,
        "main_blocker": text(
            user_language,
            f"Selected documented command exited with code {result.returncode}.",
            f"选定的文档命令以退出码 {result.returncode} 结束。",
        ),
    }


def maybe_run_training(
    *,
    repo_path: Path,
    command: str,
    train_script: Path,
    lane: str,
    user_language: str,
    full_training_authorized: bool,
    train_timeout: int,
    dataset_hint: str,
    checkpoint_hint: str,
    resume_from: str,
    max_train_steps: int,
) -> Dict[str, Any]:
    if not command:
        return {
            "status": "not_run",
            "documented_command_status": "not_run",
            "execution_log": [],
            "main_blocker": text(
                user_language,
                "No documented training command was extracted from README.",
                "README 中未提取到已文档化训练命令。",
            ),
            "lane": lane,
            "run_mode": "startup_verification" if lane == "trusted" else "full_kickoff",
            "resume_from": resume_from or None,
            "dataset": dataset_hint,
            "checkpoint_source": checkpoint_hint,
            "max_steps": max_train_steps,
            "completed_steps": 0,
            "best_metric": None,
            "best_checkpoint": None,
            "stop_reason": "not_run",
            "last_epoch": None,
            "last_step": None,
            "observed_metrics": {},
            "checkpoint_candidates": [],
            "monitoring_scope": "not_run",
        }

    if resume_from:
        run_mode = "resume"
    elif lane == "trusted" and not full_training_authorized:
        run_mode = "startup_verification"
    else:
        run_mode = "full_kickoff"

    return run_json(
        train_script,
        [
            "--repo",
            str(repo_path),
            "--command",
            command,
            "--timeout",
            str(train_timeout),
            "--lane",
            lane,
            "--run-mode",
            run_mode,
            "--dataset",
            dataset_hint,
            "--checkpoint-source",
            checkpoint_hint,
            "--resume-from",
            resume_from,
            "--max-steps",
            str(max_train_steps),
        ],
    )


def build_context(
    *,
    repo_path: Path,
    scan_data: Dict[str, Any],
    command_data: Dict[str, Any],
    setup_plan: Dict[str, Any],
    asset_data: Dict[str, Any],
    run_data: Dict[str, Any],
    user_language: str,
    run_selected: bool,
    include_analysis_pass: bool,
    include_paper_gap: bool,
    lane: str,
    full_training_authorized: bool,
) -> Dict[str, Any]:
    chosen = choose_goal(command_data.get("commands", []))
    skill_chain = plan_skill_chain(chosen["selected_goal"], include_analysis_pass, include_paper_gap)
    execution_skill = "run-train" if chosen["selected_goal"] == "training" else "minimal-run-and-audit"
    status = run_data["status"] if run_selected else "not_run"
    documented_status = (
        run_data["documented_command_status"]
        if run_selected
        else ("not_run" if not chosen["documented_command"] else "documented")
    )

    structure = scan_data.get("structure", {})
    setup_commands = setup_plan.get("setup_commands", [])
    asset_commands = build_asset_commands(asset_data)
    dataset_hint = run_data.get("dataset") or derive_dataset_hint(asset_data)
    checkpoint_hint = run_data.get("checkpoint_source") or derive_checkpoint_hint(asset_data)
    training_duration_hint = (
        estimate_training_duration(repo_path, chosen["documented_command"], int(run_data.get("max_steps") or 0))
        if chosen["selected_goal"] == "training" and chosen["documented_command"]
        else None
    )

    notes: List[str] = []
    notes.extend(scan_data.get("warnings", []))
    notes.extend(command_data.get("warnings", []))
    notes.extend(setup_plan.get("setup_notes", []))
    notes.extend(run_data.get("execution_log", []))

    assumptions = [
        "README remains the primary source of truth.",
        "Environment creation should prefer isolated setup before any semantic code changes.",
        "Model architecture should remain unchanged unless the researcher explicitly requests otherwise.",
    ]
    if chosen["selected_goal"] == "training" and lane == "trusted" and not full_training_authorized:
        assumptions.append("Only startup verification is allowed before the researcher explicitly authorizes a fuller training reproduction run.")

    unverified_inferences = [
        "Asset and dataset hints remain conservative until the repo or README confirms the exact path layout."
    ]
    protocol_deviations: List[str] = []
    human_decisions_required: List[str] = []

    if not chosen["documented_command"]:
        result_summary = text(
            user_language,
            "No documented runnable command was extracted. Repo intake was completed.",
            "未提取到可运行的文档命令，已完成仓库 intake。",
        )
    elif chosen["selected_goal"] != "training":
        result_summary = text(
            user_language,
            f"Selected goal `{chosen['selected_goal']}` from README evidence.",
            f"已根据 README 证据选择目标 `{chosen['selected_goal']}`。",
        )
    else:
        result_summary = text(
            user_language,
            "Selected the documented training command after no smaller inference or evaluation target was available.",
            "在没有更小的推理或评测目标时，已选择文档中的训练命令。",
        )

    if run_selected:
        if status == "success":
            result_summary = text(user_language, "Selected documented command finished successfully.", "选定的文档命令已成功完成。")
        elif status == "partial":
            result_summary = (
                text(
                    user_language,
                    "Selected training command produced early training evidence within the current monitoring window.",
                    "选定的训练命令已在当前监控窗口内产生早期训练证据。",
                )
                if chosen["selected_goal"] == "training"
                else text(
                    user_language,
                    "Selected documented command started but did not complete cleanly.",
                    "选定的文档命令已启动，但未完整成功结束。",
                )
            )
        elif status == "blocked":
            result_summary = (
                text(user_language, "Selected training command could not be launched.", "选定的训练命令无法启动。")
                if chosen["selected_goal"] == "training"
                else text(user_language, "Selected documented command could not be launched.", "选定的文档命令无法启动。")
            )

    section = chosen.get("documented_command_section")
    command_notes = [
        text(
            user_language,
            f"README path: {scan_data.get('readme_path') or 'not found'}",
            f"README 路径：{scan_data.get('readme_path') or 'not found'}",
        ),
        text(
            user_language,
            f"Detected top-level entries: {', '.join(structure.get('top_level', [])) or 'none'}",
            f"检测到的顶层条目：{', '.join(structure.get('top_level', [])) or 'none'}",
        ),
    ]
    if setup_plan.get("environment_file"):
        command_notes.append(f"Environment plan source: {setup_plan['environment_file']}")
    command_notes.extend(setup_plan.get("setup_notes", []))
    if chosen["documented_command"]:
        source_note = text(
            user_language,
            f"Main run label: documented from README ({chosen.get('command_source', 'readme')})",
            f"主运行标签：来自 README 的 documented（{chosen.get('command_source', 'readme')}）",
        )
        if section:
            source_note += text(user_language, f", section `{section}`", f"，章节 `{section}`")
        command_notes.append(source_note)
    command_notes.append(f"Planned skill chain: {', '.join(skill_chain)}")

    if setup_plan.get("unresolved_setup_risks"):
        human_decisions_required.extend(setup_plan["unresolved_setup_risks"])
    if not chosen["documented_command"]:
        human_decisions_required.append("Select or confirm a documented runnable command before treating this as a reproduction run.")
    if chosen["selected_goal"] == "training" and lane == "trusted" and not full_training_authorized:
        human_decisions_required.append("Review the startup verification evidence and confirm whether to continue with a fuller training reproduction run.")
    if run_selected and status in {"partial", "blocked"}:
        human_decisions_required.append("Review the blocker before adapting commands, dependencies, or protocol-sensitive settings.")

    if chosen["selected_goal"] == "training":
        if lane == "trusted" and not full_training_authorized:
            next_action = text(
                user_language,
                f"Review `train_outputs/status.json`, then decide whether to authorize a fuller training reproduction run. Planned command: `{chosen['documented_command']}`. Estimated duration: {training_duration_hint}.",
                f"先检查 `train_outputs/status.json`，再决定是否授权更完整的训练复现。计划继续执行的命令是：`{chosen['documented_command']}`。保守预估时长：{training_duration_hint}。",
            )
            next_safe_action = "Keep the repo unchanged, review startup evidence, and only continue with fuller training after explicit researcher approval."
        elif lane == "explore":
            next_action = text(
                user_language,
                "Review the recorded training evidence and continue isolated exploratory training if the variant still looks promising.",
                "先检查已记录的训练证据，如该变体仍有希望，再继续隔离的探索训练。",
            )
            next_safe_action = "Keep exploratory changes isolated and compare the recorded early metrics before widening the search."
        else:
            next_action = text(
                user_language,
                "Review the current training record and continue monitoring or resume from the latest checkpoint if needed.",
                "先检查当前训练记录，如有需要，再继续监控或从最新 checkpoint 恢复。",
            )
            next_safe_action = "Preserve the documented training semantics and continue from recorded checkpoints only if the current run remains faithful."
    else:
        next_action = (
            text(user_language, "Prepare environment and assets, then retry the documented command.", "先准备环境与资源，再重试该文档命令。")
            if status in {"partial", "blocked", "not_run"}
            else text(user_language, "Review outputs and continue with the next documented verification step.", "检查输出后继续下一步文档化验证。")
        )
        next_safe_action = (
            "Review setup assumptions and confirm the next documented command before making any semantic changes."
            if status in {"partial", "blocked", "not_run"}
            else "Review generated outputs and confirm that the next documented verification step preserves experiment meaning."
        )

    run_commands = ([{"label": "documented", "command": chosen["documented_command"]}] if chosen["documented_command"] else [])
    verification_commands = (
        [{"label": "inferred", "command": "python - <<'PY'\nimport pathlib\nprint(pathlib.Path('train_outputs/status.json').exists())\nPY"}]
        if chosen["selected_goal"] == "training"
        else [{"label": "inferred", "command": "# Add metric check, artifact check, or smoke verification command here."}]
    )

    evidence = [
        text(
            user_language,
            f"Detected files: {', '.join(scan_data.get('detected_files', [])) or 'none'}",
            f"检测到的文件：{', '.join(scan_data.get('detected_files', [])) or 'none'}",
        ),
        text(
            user_language,
            f"Command categories: {json.dumps(command_data.get('counts', {}), ensure_ascii=False)}",
            f"命令分类：{json.dumps(command_data.get('counts', {}), ensure_ascii=False)}",
        ),
        text(
            user_language,
            f"Selected command kind: {chosen.get('documented_command_kind', 'none')}",
            f"已选命令类型：{chosen.get('documented_command_kind', 'none')}",
        ),
    ]
    if setup_plan.get("environment_file"):
        evidence.append(f"Environment file: {setup_plan['environment_file']}")
    if asset_data.get("text_hints"):
        evidence.append(f"Asset hints detected: {len(asset_data['text_hints'])}")

    timeline = [
        text(user_language, "Scanned repository structure and key metadata files.", "已扫描仓库结构和关键元数据文件。"),
        text(user_language, "Extracted README code blocks and shell-like commands.", "已提取 README 中的代码块和 shell 风格命令。"),
        text(user_language, f"Selected `{chosen['selected_goal']}` as the smallest trustworthy target.", f"已将 `{chosen['selected_goal']}` 选为最小可信目标。"),
        text(user_language, "Prepared conservative setup and asset assumptions.", "已准备保守的环境与资源假设。"),
        text(user_language, "Execution step was skipped." if not run_selected else "Attempted the selected documented command.", "执行步骤已跳过。" if not run_selected else "已尝试选定的文档命令。"),
    ]
    if chosen["selected_goal"] == "training":
        timeline.append(text(user_language, f"Training lane `{lane}` selected with run mode `{run_data.get('run_mode', 'startup_verification')}`.", f"已选择训练 lane `{lane}`，运行模式为 `{run_data.get('run_mode', 'startup_verification')}`。"))
        if training_duration_hint:
            timeline.append(text(user_language, f"Estimated fuller training duration: {training_duration_hint}.", f"保守估计完整训练时长：{training_duration_hint}。"))

    artifact_provenance = [
        {"artifact": "readme", "source": scan_data.get("readme_path") or "not found", "kind": "repo_file"},
        {"artifact": "documented_command", "source": chosen.get("command_source", "none"), "kind": "readme_extraction"},
        {"artifact": "environment_plan", "source": setup_plan.get("environment_file") or "inferred", "kind": "setup_plan"},
        {"artifact": "asset_manifest", "source": "artifacts/assets/asset_manifest.json", "kind": "generated"},
        {"artifact": "output_dir", "source": "repro_outputs/", "kind": "generated"},
    ]
    if chosen["selected_goal"] == "training":
        artifact_provenance.append({"artifact": "train_outputs", "source": "train_outputs/", "kind": "generated"})

    return {
        "schema_version": "1.0",
        "generated_at": scan_data.get("generated_at"),
        "user_language": user_language,
        "target_repo": str(repo_path.resolve()),
        "readme_first": True,
        "lane": lane,
        "selected_goal": chosen["selected_goal"],
        "goal_priority": chosen["goal_priority"],
        "execution_skill": execution_skill,
        "planned_skill_chain": skill_chain,
        "status": status,
        "documented_command_status": documented_status,
        "documented_command": chosen["documented_command"] or "None extracted",
        "documented_command_kind": chosen.get("documented_command_kind", "none"),
        "documented_command_source": chosen.get("command_source", "none"),
        "documented_command_section": chosen.get("documented_command_section"),
        "evidence_level": "direct" if chosen["documented_command"] else "mixed",
        "result_summary": result_summary,
        "main_blocker": run_data.get("main_blocker", text(user_language, "No blocker recorded.", "未记录阻塞项。")),
        "next_action": next_action,
        "next_safe_action": next_safe_action,
        "setup_commands": setup_commands,
        "asset_commands": asset_commands,
        "run_commands": run_commands,
        "verification_commands": verification_commands,
        "command_notes": command_notes,
        "timeline": timeline,
        "assumptions": assumptions,
        "unverified_inferences": unverified_inferences,
        "evidence": evidence,
        "blockers": [run_data.get("main_blocker", text(user_language, "None.", "无。"))],
        "protocol_deviations": protocol_deviations,
        "human_decisions_required": human_decisions_required,
        "artifact_provenance": artifact_provenance,
        "notes": notes,
        "patches_applied": False,
        "patch_branch": "",
        "readme_fidelity": "preserved",
        "highest_patch_risk": "low",
        "verified_commits": [],
        "validation_summary": "",
        "patch_notes": [],
        "full_training_authorized": full_training_authorized,
        "requires_full_training_confirmation": chosen["selected_goal"] == "training" and lane == "trusted" and not full_training_authorized,
        "run_mode": run_data.get("run_mode", "startup_verification" if chosen["selected_goal"] == "training" else None),
        "resume_from": run_data.get("resume_from"),
        "dataset": dataset_hint,
        "checkpoint_source": checkpoint_hint,
        "full_training_command": chosen["documented_command"] if chosen["selected_goal"] == "training" else None,
        "training_duration_hint": training_duration_hint,
        "max_steps": run_data.get("max_steps"),
        "completed_steps": run_data.get("completed_steps"),
        "best_metric": run_data.get("best_metric"),
        "best_checkpoint": run_data.get("best_checkpoint"),
        "stop_reason": run_data.get("stop_reason"),
        "last_epoch": run_data.get("last_epoch"),
        "last_step": run_data.get("last_step"),
        "observed_metrics": run_data.get("observed_metrics", {}),
        "checkpoint_candidates": run_data.get("checkpoint_candidates", []),
        "monitoring_scope": run_data.get("monitoring_scope"),
    }


def main() -> int:
    parser = argparse.ArgumentParser(description="Run a minimal README-first reproduction orchestration.")
    parser.add_argument("--repo", required=True, help="Path to the target repository.")
    parser.add_argument("--output-dir", default="repro_outputs", help="Directory to write standardized outputs into.")
    parser.add_argument("--train-output-dir", default="", help="Optional override for the supplemental training output directory.")
    parser.add_argument("--user-language", default="en", help="Language tag for human-readable reports.")
    parser.add_argument("--run-selected", action="store_true", help="Execute the selected documented command.")
    parser.add_argument("--include-analysis-pass", action="store_true", help="Include analyze-project in the planned skill chain.")
    parser.add_argument("--include-paper-gap", action="store_true", help="Include paper-context-resolver in the planned skill chain.")
    parser.add_argument("--timeout", type=int, default=120, help="Execution timeout in seconds for non-training documented commands.")
    parser.add_argument("--train-timeout", type=int, default=120, help="Monitoring timeout in seconds for training commands.")
    parser.add_argument("--lane", choices=["trusted", "explore"], default="trusted", help="Execution lane policy.")
    parser.add_argument("--full-training-authorized", action="store_true", help="Allow the orchestrator to proceed beyond startup verification for training.")
    parser.add_argument("--resume-from", default="", help="Optional checkpoint path to pass through to run-train.")
    parser.add_argument("--max-train-steps", type=int, default=0, help="Optional expected max train steps for reporting.")
    args = parser.parse_args()

    repo_path = Path(args.repo).resolve()
    base_dir = Path(__file__).resolve().parents[2]
    scan_script = base_dir / "repo-intake-and-plan" / "scripts" / "scan_repo.py"
    extract_script = base_dir / "repo-intake-and-plan" / "scripts" / "extract_commands.py"
    setup_script = base_dir / "env-and-assets-bootstrap" / "scripts" / "plan_setup.py"
    asset_script = base_dir / "env-and-assets-bootstrap" / "scripts" / "prepare_assets.py"
    repro_write_script = base_dir / "minimal-run-and-audit" / "scripts" / "write_outputs.py"
    train_write_script = base_dir / "run-train" / "scripts" / "write_outputs.py"
    train_execute_script = base_dir / "run-train" / "scripts" / "run_training.py"

    scan_data = run_json(scan_script, ["--repo", str(repo_path), "--json"])
    readme_path = scan_data.get("readme_path")
    command_data: Dict[str, Any] = {"commands": [], "counts": {}, "warnings": []}
    if readme_path:
        command_data = run_json(extract_script, ["--readme", readme_path, "--json"])

    output_dir = Path(args.output_dir).resolve()
    train_output_dir = Path(args.train_output_dir).resolve() if args.train_output_dir else output_dir.parent / "train_outputs"
    assets_root = output_dir.parent / "artifacts" / "assets"
    asset_manifest_path = assets_root / "asset_manifest.json"

    setup_plan = run_json(setup_script, ["--repo", str(repo_path), "--json"])
    asset_data = run_json(
        asset_script,
        [
            "--repo",
            str(repo_path),
            "--assets-root",
            str(assets_root),
            "--output-json",
            str(asset_manifest_path),
        ],
    )

    chosen = choose_goal(command_data.get("commands", []))
    dataset_hint = derive_dataset_hint(asset_data)
    checkpoint_hint = derive_checkpoint_hint(asset_data)
    run_data: Dict[str, Any] = {
        "status": "not_run",
        "documented_command_status": "not_run",
        "execution_log": [],
        "main_blocker": text(args.user_language, "Execution was not requested.", "未请求执行。"),
        "lane": args.lane,
        "run_mode": "startup_verification" if chosen["selected_goal"] == "training" and args.lane == "trusted" and not args.full_training_authorized else ("full_kickoff" if chosen["selected_goal"] == "training" else None),
        "resume_from": args.resume_from or None,
        "dataset": dataset_hint,
        "checkpoint_source": checkpoint_hint,
        "max_steps": args.max_train_steps,
        "completed_steps": 0,
        "best_metric": None,
        "best_checkpoint": None,
        "stop_reason": "not_run" if chosen["selected_goal"] == "training" else None,
        "last_epoch": None,
        "last_step": None,
        "observed_metrics": {},
        "checkpoint_candidates": [],
        "monitoring_scope": "not_run",
    }
    if args.run_selected:
        if chosen["selected_goal"] == "training":
            run_data = maybe_run_training(
                repo_path=repo_path,
                command=chosen["documented_command"],
                train_script=train_execute_script,
                lane=args.lane,
                user_language=args.user_language,
                full_training_authorized=args.full_training_authorized,
                train_timeout=args.train_timeout,
                dataset_hint=dataset_hint,
                checkpoint_hint=checkpoint_hint,
                resume_from=args.resume_from,
                max_train_steps=args.max_train_steps,
            )
        else:
            run_data = maybe_run_command(repo_path, chosen["documented_command"], args.timeout, args.user_language)

    context = build_context(
        repo_path=repo_path,
        scan_data=scan_data,
        command_data=command_data,
        setup_plan=setup_plan,
        asset_data=asset_data,
        run_data=run_data,
        user_language=args.user_language,
        run_selected=args.run_selected,
        include_analysis_pass=args.include_analysis_pass,
        include_paper_gap=args.include_paper_gap,
        lane=args.lane,
        full_training_authorized=args.full_training_authorized,
    )

    write_bundle(repro_write_script, output_dir, context)
    if context["selected_goal"] == "training":
        write_bundle(train_write_script, train_output_dir, context)

    print(json.dumps(context, indent=2, ensure_ascii=False))
    return 0


if __name__ == "__main__":
    raise SystemExit(main())

Related skills

Microsoft FoundryDeploy, evaluate, and continuously improve Microsoft Foundry agents from a single agent interface.478k1.3k

Run TrainSafely execute selected deep learning training commands with standardized evidence capture.164k507

Explore RunSafely run isolated exploratory experiments with clear recording and conservative selection before committing changes.164k507

Paper Context ResolverFetch precise reproduction-critical details like dataset splits, preprocessing steps, or evaluation protocols from the original academic paper when the repo README leav141k507

Repo Intake And PlanScan unfamiliar AI research repositories and receive a minimal, trustworthy reproduction target before investing significant time.140k507

Env And Assets BootstrapCreate a reproducible, conservative conda environment plus required checkpoints, datasets and caches before attempting to run any AI research paper reproduction.140k507

Forks & variants (3)

Ai Research Reproduction has 3 known copies in the catalog totaling 617 installs. They canonicalize to this original listing.

lllllllama - 573 installs
lllllllama - 34 installs
lllllllama - 10 installs

How it compares

Use ai-research-reproduction for full README-first orchestration; use repo-intake-and-plan when you only need target selection before committing to a full reproduction run.

FAQ

What is a minimum trustworthy target?

In order: (1) documented inference, (2) documented evaluation, (3) training startup or partial verification, (4) full training only after explicit confirmation.

Are code changes allowed?

Yes, but must be conservative, auditable, and recorded in PATCHES.md with README-fidelity impact.

Is Ai Research Reproduction safe to install?

skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

Data Science & MLpipelinesdatabasesanalytics

About

Ai Research Reproduction by the numbers

ai-research-reproduction capabilities & compatibility

Add your badge

What it does

Who is it for?

When should I use this skill?

What you get

By the numbers

Files

ai-research-reproduction

Purpose

Fit

Trusted Target Selection

Workflow

Patch Boundary

Outputs

Reference Loading

Commands

Setup

Assets

Main run

Verification

Notes

Reproduction Log

Context

Timeline

Assumptions

Unverified inferences

Evidence

Protocol deviations

Command provenance

Human review checkpoints

Failures or blockers

Next safe action

Patch Record

Patch overview

Verified commits

{{commit_hash}} {{commit_summary}}

Validation summary

Notes

Reproduction Summary

Result

Main blocker

Next action

Architecture

Main idea

Why this split

Data flow

Output strategy

Non-goals

Language Policy

Goal

Rules

Mixed-language handling

Conflict handling

Output Spec

SUMMARY.md

COMMANDS.md

LOG.md

status.json

PATCHES.md

Patch Policy

Default stance

Allowed-first changes

Allowed-with-caution changes

Default disallowed changes

Branching and commits

Reporting

Research Safety Principles

Default stance

Non-negotiable rules

Evidence discipline

Human review checkpoints

Observability requirements

Long-running work

Related skills

Forks & variants (3)

How it compares

FAQ

`{{commit_hash}}` {{commit_summary}}

`SUMMARY.md`

`COMMANDS.md`

`LOG.md`

`status.json`

`PATCHES.md`