AI Paper Reproduction

The canonical shelf is Validate, because the first commitment is proving that the published repository path works before you treat its results as product truth. The Prototype subphase fits because the target is the smallest documented inference or evaluation run, taken from the repo's own scripts and configs, rather than a full reimplementation.

Where it fits

Example use

Select the smallest documented eval script from the README and record whether metrics match the paper within stated tolerance.
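
The metric check in this example can be sketched as below. This is a minimal illustration, not part of the skill: the function name `within_tolerance`, the metric values, and the tolerance are hypothetical placeholders, and a real tolerance should come from the paper or the README.

```python
import math


def within_tolerance(reported: float, reproduced: float, abs_tol: float) -> bool:
    """Return True when the reproduced metric is within abs_tol of the reported one."""
    if math.isnan(reported) or math.isnan(reproduced):
        return False  # a NaN metric never counts as a match
    return abs(reported - reproduced) <= abs_tol


# Hypothetical values for illustration only; not taken from any paper.
reported_accuracy = 0.912
reproduced_accuracy = 0.909
print(within_tolerance(reported_accuracy, reproduced_accuracy, abs_tol=0.005))
```

Recording the boolean alongside both raw values keeps the evidence auditable even if the tolerance is revisited later.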

Example use

Wire a verified inference entrypoint into your own agent tooling only after the orchestrator confirms the upstream command path.

Example use

Operate: Iteration & experiments

Re-run the trusted execution stage after dependency bumps to catch silent numeric drift before you ship a demo.
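
One way to catch such drift is to compare the metrics from a re-run against a saved baseline. The sketch below assumes metrics are available as plain name-to-value mappings; the function name `find_drift`, the metric names, and the default tolerance are hypothetical and not defined by the skill.

```python
from __future__ import annotations

import math


def find_drift(
    baseline: dict[str, float],
    current: dict[str, float],
    rel_tol: float = 1e-6,
) -> dict[str, tuple[float, float | None]]:
    """Map each drifted or missing metric name to its (baseline, current) pair."""
    drifted: dict[str, tuple[float, float | None]] = {}
    for name, old in baseline.items():
        new = current.get(name)
        if new is None:
            drifted[name] = (old, None)  # metric disappeared after the bump
        elif not math.isclose(old, new, rel_tol=rel_tol, abs_tol=0.0):
            drifted[name] = (old, new)
    return drifted


# Hypothetical metrics from before and after a dependency bump.
before = {"accuracy": 0.9, "loss": 0.25}
after = {"accuracy": 0.9, "loss": 0.2501}
print(find_drift(before, after))
```

An empty result means every baseline metric was reproduced within the chosen relative tolerance; anything else is a deviation worth recording before a demo ships.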

Example use

Refresh `repro_outputs/` after upstream maintainers change configs so production claims stay tied to reproducible evidence.

How it compares

Use this structured reproduction orchestrator instead of a generic “clone and try random commands” coding session.

Common Questions / FAQ

Who is ai-paper-reproduction for?

It is for solo builders and small teams who must reproduce AI paper repositories with traceable commands, patches, and outputs—not merely summarize the paper.

When should I use ai-paper-reproduction?

Use it in Validate when proving a repo’s smallest inference or eval path; in Build when integrating verified scripts into your stack; and in Ship when you need regression-style re-runs before trusting reported numbers.

Is ai-paper-reproduction safe to install?

Check the Security Audits panel on this Prism page; reproduction flows may execute upstream training or inference code, use network installs, and modify the repo under stated patch rules—review permissions and repo trust first.

SKILL.md

# ai-paper-reproduction

## Use when

- The user wants the agent to reproduce an AI paper repository.
- The target is a code repository with a README, scripts, configs, or documented commands.
- The goal is a minimal trustworthy run, not unlimited experimentation.
- The user needs standardized outputs that another human or model can audit quickly.
- The task spans more than one stage, such as intake plus setup, or setup plus execution plus reporting.

## Do not use when

- The task is a general literature review or paper summary.
- The task is to design a new model, benchmark suite, or training pipeline from scratch.
- The repository is not centered on AI or does not expose a documented reproduction path.
- The user primarily wants a deep code refactor rather than README-first reproduction.
- The user is explicitly asking for only one narrow phase that a sub-skill already covers cleanly.
- The user is explicitly authorizing exploratory branch-only experimentation instead of trusted reproduction.

## Success criteria

- README is treated as the primary source of reproduction intent.
- A minimum trustworthy target is selected and justified.
- Documented inference is preferred over evaluation, and evaluation is preferred over training.
- Any repo edits remain conservative, explicit, and auditable.
- Assumptions, protocol deviations, and human decision points are surfaced rather than hidden.
- `repro_outputs/` is generated with consistent structure and stable machine-readable fields.
- Final user-facing explanation is short and follows the user's language when practical.
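
As an illustration of what "consistent structure and stable machine-readable fields" could look like, the sketch below writes one JSON summary into `repro_outputs/`. The file name `summary.json`, the field names, and the status values are assumptions made for this example; the skill's actual schema is not shown in this document.

```python
from __future__ import annotations

import json
from pathlib import Path

# Hypothetical enum values; kept in stable English regardless of user language.
ALLOWED_STATUS = {"success", "partial", "blocked"}


def write_summary(out_dir: Path, target: str, status: str, deviations: list[str]) -> Path:
    """Write a small machine-readable summary and return its path."""
    if status not in ALLOWED_STATUS:
        raise ValueError(f"unknown status: {status!r}")
    out_dir.mkdir(parents=True, exist_ok=True)
    summary = {
        "target": target,          # e.g. "inference" or "evaluation"
        "status": status,
        "deviations": deviations,  # surfaced rather than hidden
    }
    path = out_dir / "summary.json"
    # sort_keys keeps the key order identical across runs, which simplifies diffs.
    path.write_text(json.dumps(summary, indent=2, sort_keys=True) + "\n", encoding="utf-8")
    return path
```

Calling `write_summary(Path("repro_outputs"), "inference", "success", [])` would produce a file that a human or a downstream agent can parse without reading the prose report.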

## Interaction and usability policy

- Keep the workflow simple enough for a new user to understand quickly.
- Prefer short, concrete plans over exhaustive research.
- Expose commands, assumptions, blockers, and evidence.
- Avoid turning the skill into an opaque automation layer.
- Preserve a low learning cost for both humans and downstream agents.

## Language policy

- Human-readable Markdown outputs should follow the user's language when it is clear.
- If the user's language is unclear, default to concise English.
- Machine-readable fields, filenames, keys, and enum values stay in stable English.
- Paths, package names, CLI commands, config keys, and code identifiers remain unchanged.

See `references/language-policy.md`.

## Reproduction policy

Core priority order:

1. documented inference
2. documented evaluation
3. documented training startup or partial verification
4. full training only when the user explicitly asks later
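
The priority order above can be sketched as a simple lookup. This is an assumption-laden illustration: the keys in `PRIORITY_ORDER`, the function `select_target`, and the example commands are hypothetical, and in practice the documented commands would be read from the repository's README.

```python
from __future__ import annotations

# Hypothetical labels for the four priority levels, highest priority first.
PRIORITY_ORDER = ["inference", "evaluation", "training_startup", "full_training"]


def select_target(
    documented: dict[str, str],
    allow_full_training: bool = False,
) -> tuple[str, str] | None:
    """Return (kind, command) for the highest-priority documented target, if any."""
    for kind in PRIORITY_ORDER:
        if kind == "full_training" and not allow_full_training:
            continue  # full training only when the user explicitly asks
        command = documented.get(kind)
        if command:
            return kind, command
    return None


# Hypothetical commands for illustration only.
readme_commands = {
    "evaluation": "python eval.py --config configs/base.yaml",
    "training_startup": "python train.py --max-steps 10",
}
print(select_target(readme_commands))
```

Returning `None` when nothing is documented makes the "no documented reproduction path" case explicit instead of silently falling back to guesswork.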

Rules:

- README-first: use repository files to clarify, not casually override, the README.
- Aim for minimal trustworthy reproduction rather than maximum task coverage.
- Treat smoke tests, startup verification, and early-step checks as valid training evidence when full training is not appropriate.
- In trusted reproduction, a documented training command should first be checked through startup verification or a short monitoring window, then paused for explicit human confirmation before broader training continues.
- In explicitly authorized explore-lane execution, the training record can continue without the trusted-lane confirmation pause, but it must stay isolated from

What is this skill?

Orchestrates end-to-end reproduction: intake, setup, trusted execution, optional training and gap analysis

README-first target selection for the smallest documented inference or evaluation run

Enforces conservative patch rules with recorded assumptions, deviations, and human decision points

Writes a standardized auditable `repro_outputs/` evidence bundle

Explicitly excludes literature summaries, silent protocol changes, and scratch research outside the repo

Standardized `repro_outputs/` reporting bundle

Multi-stage flow spanning intake, setup, execution, and optional training or gap resolution

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Adoption & trust: 9.1k installs on skills.sh; 412 GitHub stars; 0/3 security scanners passed (skills.sh audits).

Who is it for?

Indie builders, ML engineers, or reviewers validating a paper codebase before citing metrics, demoing results, or building on top of the repo.

Skip if: Pure literature reviews, designing new benchmarks from scratch, repos with no documented reproduction path, or tasks that need broad open-ended research outside repository evidence.

What do I get? / Deliverables

You get a conservative, evidence-backed reproduction run with logged deviations and a standardized `repro_outputs/` package suitable for quick human or model audit.

Populated `repro_outputs/` evidence bundle

Recorded assumptions, deviations, patches, and human decision points

Minimal trustworthy execution log for the chosen inference or eval target

Journey fit

Spans multiple journey phases: a primary shelf plus alternate fits.