Paper To Code

Implementation lands in Build once you commit to reproducing methods; the skill’s output is production-shaped code, not just a reading note. Backend subphase fits training loops, datasets, and model code—the bulk of Paper2Code-style artifacts.

Also useful

Also useful

Where it fits

Example use

Compare two segmentation papers by planning repos before picking which architecture to productize.

Example use

Generate a minimal training stack to see if reported metrics are achievable on a public dataset slice.

Example use

Fill in remaining modules in dependency order after the planning stage locked interfaces in Mermaid diagrams.

How it compares

Structured three-stage codegen from papers—not a single-shot “summarize this PDF” chat prompt.

Common Questions / FAQ

Who is paper-to-code for?

Solo builders and small teams who want agent-assisted reproduction of ML papers into code repos with explicit planning and file-level analysis.

When should I use paper-to-code?

Use it in Idea (research) while comparing methods, in Validate (prototype) to prove a paper runs on your data, and in Build (backend) to generate the training and evaluation codebase.

Is paper-to-code safe to install?

Generated code can execute arbitrary training logic and download datasets; review the Security Audits panel on this page and audit outputs before running on sensitive machines or data.

SKILL.md

READMESKILL.md - Paper To Code

# Paper to Code

Convert a research paper into a complete, runnable code repository.

## Input

- `$0` — Paper PDF path, paper text, or paper URL

## References

- Paper2Code prompts (planning, analysis, coding stages): `~/.claude/skills/paper-to-code/references/paper-to-code-prompts.md`

## Workflow (from Paper2Code)

### Stage 1: Planning
Four-turn conversation to create a comprehensive plan:

1. **Overall Plan**: Extract methodology, experiments, datasets, hyperparameters, evaluation metrics
2. **Architecture Design**: Generate file list, Mermaid classDiagram, sequenceDiagram
3. **Task Breakdown**: Logic analysis per file, dependency-ordered task list, required packages
4. **Configuration**: Extract training details into `config.yaml`

### Stage 2: Analysis
For each file in the task list (dependency order):
1. Conduct detailed logic analysis
2. Map paper methodology to code structure
3. Reference the config.yaml for all settings
4. Follow the UML class diagram interfaces strictly

### Stage 3: Coding
For each file in dependency order:
1. Generate code with access to all previously generated files
2. Follow the design's data structures and interfaces exactly
3. Reference config.yaml — never fabricate configuration values
4. Write complete code — no TODOs or placeholders

### Stage 4: Debugging (if needed)
If execution fails:
1. Collect error messages
2. Identify root cause using SEARCH/REPLACE diff format
3. Apply minimal fixes preserving original intent
4. Re-run until successful

## Output Structure

```
reproduced_code/
├── config.yaml        # Training configuration
├── main.py            # Entry point
├── model.py           # Model architecture
├── dataset_loader.py  # Data loading
├── trainer.py         # Training loop
├── evaluation.py      # Metrics and evaluation
├── reproduce.sh       # Run script
└── requirements.txt   # Dependencies
```

## Key Constraints

- **Dependency order**: Each file is generated with access to all previously generated files
- **Interface contracts**: Mermaid diagrams serve as rigid interface definitions across all stages
- **No fabrication**: Only use configurations explicitly stated in the paper
- **Complete code**: Every function must be fully implemented

## Rules

- Follow the paper's methodology exactly — do not invent improvements
- Generate code in dependency order (data loading → model → training → evaluation → main)
- Use config.yaml for all hyperparameters and settings
- Every class/method in UML diagram must exist in code
- Generate a reproduce.sh script for one-command execution
- If paper details are ambiguous, note them explicitly

## Related Skills
- Upstream: [literature-search](../literature-search/)
- Downstream: [experiment-code](../experiment-code/)
- See also: [code-debugging](../code-debugging/), [algorithm-design](../algorithm-design/)


# Paper-to-Code Prompts

Verbatim prompts extracted from Paper2Code (codes/1_planning.py, 2_analyzing.py, 3_coding.py, 4_debugging.py).

## Stage 1: Planning

### 1.1 Overall Plan Generation

System prompt:
```
You are an expert researcher and strategic planner with a deep understanding of experimental design and reproducibility in scientific research.
You will receive a research paper in {paper_format} format.
Your task is to create a detailed and efficient plan to reproduce the experiments and methodologies described in the paper.
This plan should align precisely with the paper's methodology, experimental setup, and evaluation metrics.

Instructions:
1. Align with the Paper: Your plan must strictly follow the methods, datasets, model configurations, hyperparameters, and experimental setups described in the paper.

What is this skill?

Three-stage Paper2Code pipeline: Planning → Analysis → Coding

Planning uses a four-turn flow: overall plan, architecture, task breakdown, config.yaml

Emits Mermaid classDiagram and sequenceDiagram plus strict interface adherence

Analysis and coding walk files in dependency order with cross-file context

argument-hint accepts paper PDF path, pasted text, or paper URL

3-stage pipeline: Planning, Analysis, Coding

Planning stage documents a 4-turn conversation flow

Configuration stage extracts training details into config.yaml

Compatible agents: Claude Code, Cursor, Codex, Windsurf

Adoption & trust: 666 installs on skills.sh; 114 GitHub stars; 1/3 security scanners passed (skills.sh audits).

Journey fit

Spans multiple journey phases - primary shelf plus alternate fits below.

Primary fit

Also useful

Also useful

Where it fits

Example use

Compare two segmentation papers by planning repos before picking which architecture to productize.

Example use

Generate a minimal training stack to see if reported metrics are achievable on a public dataset slice.

Example use