ML Training Recipes

Name: ML Training Recipes
Author: orchestra-research

orchestra-research/ai-research-skills

326 installs
11.2k repo stars
Updated June 16, 2026

ML Training Recipes is an agent skill that supplies copy-ready transformer architecture patterns for training modern LLM backends.

About

ML Training Recipes is a reference skill that packages modern transformer implementation patterns for solo builders training or customizing language models. It walks through RMSNorm, rotary position embeddings, grouped-query attention, sliding-window flash attention, value embeddings, activation choices, residual scaling, logit soft capping, assembled transformer blocks, and configuration conventions—each with concise Python-oriented guidance meant to be copied into a real training repo. Use it when you are past the idea stage and actively coding a model stack rather than shopping hosted APIs. The content assumes comfort with PyTorch-style modules and transformer training loops. It does not replace experiment design or dataset curation; it accelerates correct, contemporary architecture wiring so you spend fewer cycles debugging norm placement or attention variants.

10-section architecture patterns reference (RMSNorm through model configuration)
Pre-norm transformer blocks with RMSNorm and residual scaling patterns
Modern attention stack: RoPE, sliding-window Flash Attention, and GQA
ResFormer value embedding, activation, logit soft-capping, and full block assembly
Python-oriented snippets intended to plug into a training codebase

ML Training Recipes by the numbers

326 all-time installs (skills.sh)
+37 installs in the week ending Jul 18, 2026 (Skillselion tracking)
Ranked #559 of 2,066 Data Science & ML skills by installs in the Skillselion catalog
Security screen: LOW risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/orchestra-research/ai-research-skills --skill ml-training-recipes


Security audit	3 / 3 scanners passed

What it does

Copy vetted transformer building blocks (RMSNorm, RoPE, GQA, Flash Attention) when you implement or fine-tune an LLM backend as an ML engineer.

Who is it for?

Best when you're fine-tuning or training small-to-mid LLMs and want vetted architecture snippets instead of piecing papers together ad hoc.

Skip if: you only consume hosted models via API with no custom training code, or you are a beginner without PyTorch familiarity.

When should I use this skill?

Implementing or refactoring transformer training code and you need standard patterns for norms, attention, and blocks.

What you get

You assemble blocks from documented recipes (pre-norm RMSNorm, RoPE, GQA, flash attention) consistent with a full model configuration pattern.

Transformer module implementations aligned to reference patterns
Model configuration structure matching documented conventions

By the numbers

10 architecture pattern sections in the reference table of contents

Files

SKILL.md (Markdown)

ML Training Recipes

Battle-tested patterns for PyTorch training across domains. Drawn from production codebases (Karpathy's autoresearch/nanochat, torchvision, HuggingFace) and modern training practice.

Reference files (read when needed)

references/architecture.md — Transformer/LLM architecture code patterns, weight init
references/optimizers.md — Muon, AdamW hybrid, per-group LR, compiled optimizer steps
references/domain-specific.md — Vision, diffusion, contrastive, distributed, checkpointing, data loading
references/scaling-and-selection.md — Scaling laws, compute budget tables, decision trees, DGX Spark
references/biomedical.md — Drug discovery, protein models, medical imaging, genomics, clinical NLP
references/experiment-loop.md — Autonomous experiment loop (autoresearch keep/discard/revert)

---

Architecture Selection

Pick the right model by data type and data scale:

Data Type	< 10K samples	10K-100K	> 100K
Images	Pretrained CNN + fine-tune	Fine-tune ViT or CNN	ViT from scratch
Text (gen)	Few-shot prompting	Fine-tune GPT/LLaMA (LoRA)	Pretrain from scratch
Tabular	XGBoost/LightGBM	Still XGBoost	Neural viable
Audio	Pretrained Whisper	Fine-tune AST	Train from scratch
Molecules	Pretrained GNN	Fine-tune molecular LM	Train GNN from scratch
Proteins	ESM-2 embeddings + head	Fine-tune ESM-2	Train protein LM
Medical img	Pretrained CNN	nnU-Net (auto-config)	Swin-UNETR / MedSAM

Key principle: architecture matters less than training recipe at equal compute. A well-tuned ResNet can beat a poorly tuned ViT (ref: "ResNet Strikes Back", Wightman et al., 2021).

For biomedical domains, see references/biomedical.md. For sequence model selection and compute planning, see references/scaling-and-selection.md.

---

Scaling Laws

Chinchilla rule (Hoffmann et al., 2022)

Compute-optimal training: ~20 tokens per parameter.

Model Size	Compute-Optimal	Inference-Optimal (100×)
125M	2.5B tokens	12.5B tokens
1B	20B tokens	100B tokens
7B	140B tokens	700B tokens

FLOPs ≈ 6 × N × D (N=params, D=tokens). Data repetition limit: ~4 epochs before diminishing returns.
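As a rough worked example of the two rules above (about 20 tokens per parameter, and FLOPs ≈ 6 × N × D), the arithmetic can be sketched as follows. The function names are illustrative and not part of the skill.

```python
def chinchilla_tokens(n_params, tokens_per_param=20):
    """Compute-optimal token budget under the ~20 tokens/param rule of thumb."""
    return n_params * tokens_per_param


def training_flops(n_params, n_tokens):
    """Approximate total training FLOPs: 6 * N * D."""
    return 6 * n_params * n_tokens


n = 125_000_000                      # 125M-parameter model from the table
d = chinchilla_tokens(n)             # compute-optimal token count
print(f"{d:,} tokens, {training_flops(n, d):.3e} FLOPs")
```

For the 125M row this reproduces the 2.5B-token budget in the table and gives roughly 1.9e18 training FLOPs.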

---

Training Loop

import gc, sys, time, torch

torch.manual_seed(42)
torch.set_float32_matmul_precision("high")  # TF32 on Ampere+
autocast_ctx = torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16)

# total_batch_size is counted in tokens; one micro-batch holds batch_size * seq_len tokens
grad_accum_steps = total_batch_size // (batch_size * seq_len)
step = 0
x, y = next(train_loader)  # prefetch the first micro-batch before entering the loop

# `done` and `progress` are assumed to be updated elsewhere (e.g. from a time budget)
while not done:
    t0 = time.time()
    for micro_step in range(grad_accum_steps):
        with autocast_ctx:
            loss = model(x, y)
        (loss / grad_accum_steps).backward()  # scale so accumulated grads average
        x, y = next(train_loader)  # fetch the next micro-batch

    update_lr(optimizer, progress)
    optimizer.step()
    model.zero_grad(set_to_none=True)  # frees memory vs zeroing

    if loss.item() > 100:  # fast-fail on divergence
        print("FAIL: loss exploded"); sys.exit(1)

    torch.cuda.synchronize()
    if step == 0:
        gc.collect(); gc.freeze(); gc.disable()  # avoid ~500ms GC stalls
    step += 1

Key principles

Gradient clipping: clip_grad_norm_(params, 1.0) — near-universal for Transformers.

Exception: Muon optimizer normalizes updates via orthogonalization, so clipping is optional.

Tensor Core alignment: batch size, hidden dims should be multiples of 8 (bf16) or 64 (A100).
Time-based budgets make experiments comparable across hardware.
`cudnn.benchmark = True` for fixed-size vision inputs.

---

Optimizer Configuration

Modern LLM training uses different optimizers per parameter group:

Parameter Type	Optimizer	LR (base)	Weight Decay
2D weight matrices	Muon	0.04	0.2
Token embeddings	AdamW	0.6 × scale	0.0
Unembedding (lm_head)	AdamW	0.004 × scale	0.0
Per-layer scalars	AdamW	0.005 × scale	0.0

LR scaling by dimension: lr * (d_model / 768)^(-0.5) — keeps dynamics stable across sizes.
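A minimal sketch of the dimension-based scaling rule above, assuming the reference width of 768 from the formula. The helper name `scaled_lr` is hypothetical.

```python
def scaled_lr(base_lr, d_model, ref_dim=768):
    """Scale a base learning rate by (d_model / ref_dim) ** -0.5."""
    return base_lr * (d_model / ref_dim) ** -0.5


# The lm_head base LR from the table, at three model widths
for d_model in (768, 1536, 3072):
    print(d_model, round(scaled_lr(0.004, d_model), 6))
```

Quadrupling the width (768 to 3072) halves the learning rate, since the scale factor is the inverse square root of the width ratio.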

Rules of thumb

Embeddings need higher LR (sparse updates). Never weight-decay embeddings.
Weight decay scheduling: linearly decay WD to 0 over training.
Recommended AdamW settings: β1=0.9, β2=0.95, eps=1e-10 (instead of PyTorch's default eps of 1e-8; the smaller value prevents stale updates in bf16).

For Muon details (polar express orthogonalization, NorMuon), see references/optimizers.md.

---

Learning Rate Scheduling

Time-based (autoresearch style)

def get_lr_multiplier(progress):  # progress = elapsed_time / time_budget
    if progress < warmup_ratio:
        return progress / warmup_ratio
    elif progress < 1.0 - warmdown_ratio:
        return 1.0
    else:
        cooldown = (1.0 - progress) / warmdown_ratio
        return cooldown + (1 - cooldown) * final_lr_frac

Cosine decay

import math

def get_lr(step, total_steps, max_lr, min_lr, warmup_steps):
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

WSD (Warmup-Stable-Decay): gaining traction — easier to resume training mid-run.
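A step-based WSD schedule could look like the sketch below. The linear decay shape, the function name and the parameter names are assumptions for illustration; WSD variants differ in how they shape the final decay.

```python
def wsd_lr(step, total_steps, max_lr, warmup_steps, decay_steps, min_lr=0.0):
    """Warmup-Stable-Decay: linear warmup, flat plateau, linear decay to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return max_lr  # stable phase: LR does not depend on total_steps
    frac = (total_steps - step) / decay_steps  # 1.0 at decay start, 0.0 at the end
    return min_lr + (max_lr - min_lr) * frac


for s in (0, 500, 850, 1000):
    print(s, wsd_lr(s, total_steps=1000, max_lr=3e-4, warmup_steps=10, decay_steps=300))
```

Because the plateau value is independent of the total length, a checkpoint taken during the stable phase can be continued with a different `total_steps`, which is the resumability advantage mentioned above.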

Guidance

Warmup: 1-5% of training. Zero warmup valid with Muon (autoresearch uses WARMUP_RATIO=0.0).
Warmdown: 30-50% of training in LR decay. Matters more than warmup for final quality.
Final LR: 0 or ~10% of peak. Zero is simpler.

---

Mixed Precision & Compilation

import os
os.environ["PYTORCH_ALLOC_CONF"] = "expandable_segments:True"  # before torch import

import torch
torch.set_float32_matmul_precision("high")
autocast_ctx = torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16)
model = torch.compile(model, dynamic=False)

bf16 (Ampere+): same exponent as fp32, no loss scaling needed. Preferred over fp16.
fp16: needs GradScaler. Use only on V100 or older.
dynamic=False enables max optimization. Add fullgraph=True if no graph breaks.
First steps are slow (JIT) — exclude from timing.

---

Memory & Performance

Meta device init (large models)

with torch.device("meta"):
    model = GPT(config)          # zero memory
model.to_empty(device="cuda")
model.init_weights()

MFU (Model FLOPs Utilization)

achieved_flops = model_flops_per_token * batch_tokens / step_time
mfu = achieved_flops / gpu_peak_flops
# H100 SXM: 989.5 TFLOPS | A100: 312 | RTX 4090: 165

Good targets: >30% decent, >40% good, >50% excellent (single-GPU).
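To make the MFU formula concrete, here is a small worked example. The run parameters (125M dense parameters, 524,288 tokens per step, 1.0 s per step) are hypothetical, and the per-token cost uses only the 6 × N term.

```python
def mfu(flops_per_token, tokens_per_step, step_time_s, peak_flops):
    """Model FLOPs Utilization: achieved FLOP/s divided by the GPU's peak FLOP/s."""
    achieved = flops_per_token * tokens_per_step / step_time_s
    return achieved / peak_flops


flops_per_token = 6 * 125_000_000   # 6 * N, ignoring the attention term
h100_peak = 989.5e12                # H100 SXM bf16 peak from the comment above
value = mfu(flops_per_token, tokens_per_step=524_288, step_time_s=1.0, peak_flops=h100_peak)
print(f"MFU = {value:.1%}")
```

With these made-up numbers the result lands a little under 40%, which falls in the "decent" band of the targets above.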

OOM solutions (in order)

1. Reduce DEVICE_BATCH_SIZE, increase grad_accum_steps
2. PYTORCH_ALLOC_CONF=expandable_segments:True
3. model.zero_grad(set_to_none=True)
4. Meta device init → to_empty
5. Activation checkpointing: torch.utils.checkpoint.checkpoint()
6. 8-bit optimizer (bitsandbytes): ~30% savings on optimizer states

---

Hyperparameter Search

Priority order (tune first → last)

1. Learning rate: most impactful. Always tune first.
2. Batch size: largest that fits. Speed knob, not quality knob.
3. Weight decay: 0.01-0.1 for AdamW.
4. Warmup steps: 1-5% of training.

The 2025 default recipe

Setting	Value
Optimizer	AdamW (β1=0.9, β2=0.95, eps=1e-10)
Weight decay	0.1
LR schedule	Cosine decay or WSD
Peak LR	3e-4 (scale down for larger models)
Precision	bf16
Grad clipping	max_norm=1.0
Normalization	RMSNorm (pre-norm)
Activation	SwiGLU
Position encoding	RoPE
Attention	Flash Attention, optionally GQA

---

Debugging Checklist

Karpathy's recipe (still canonical)

1. Become one with the data: visualize, check distributions, verify labels
2. Get end-to-end running first: verify on a trivial case
3. Overfit one batch: if you can't, you have a bug
4. Then regularize: add regularization only after overfitting works
5. Tune hyperparameters: start with known defaults

Loss exploding / NaN

1. Reduce LR (3-10× smaller)
2. Add gradient clipping: clip_grad_norm_(params, 1.0)
3. Check for inf/nan in inputs
4. Add logit soft capping: softcap * tanh(logits / softcap)
5. Add QK-norm in attention
6. Verify weight init (zero-init output projections?)
7. Check loss reduction with gradient accumulation (loss / grad_accum_steps)

Slow training / Low MFU

1. Verify torch.compile is active
2. Check torch.set_float32_matmul_precision("high")
3. Pin memory + non_blocking transfers
4. Profile with torch.profiler
5. GC stalls? gc.freeze(); gc.disable()
6. Tensor Core alignment: dims multiples of 8/64

Loss plateau / Slow convergence

1. LR too low: try 2-5× larger
2. Warmup too long
3. Weight decay too high
4. Verify LR schedule is actually applied (print each step)
5. Model too small for task

Silent failures

1. Data leakage between train/val
2. Wrong preprocessing at inference: augmentation mismatch
3. Label errors: use cleanlab to detect
4. Shuffling bugs: correlated batches
5. Tokenizer mismatch with pretrained model

What to monitor

Gradient norms — spike precedes loss spike
Per-layer activation stats — reveals exploding/vanishing
Dead neurons — >50% zero ReLU = dying ReLU problem
Learning rate — verify schedule applied (common silent bug)

---

Experiment Management

Track experiments in TSV for easy comparison:

commit  val_bpb  memory_gb  status   description
a1b2c3d 0.9979   44.0       keep     baseline
b2c3d4e 0.9932   44.2       keep     increase matrix LR to 0.04
c3d4e5f 1.0050   44.0       discard  switch to GeLU (worse)

Simplicity criterion: all else equal, simpler is better. Removing something and getting equal results is a great outcome. For systematic agent-driven experimentation, see references/experiment-loop.md.
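As an illustration of why a TSV log is convenient, the sketch below parses the example rows above with the standard library and picks the best kept run. The inline `LOG` string stands in for a real results file, and `best_kept` is a hypothetical helper.

```python
import csv
import io

LOG = (
    "commit\tval_bpb\tmemory_gb\tstatus\tdescription\n"
    "a1b2c3d\t0.9979\t44.0\tkeep\tbaseline\n"
    "b2c3d4e\t0.9932\t44.2\tkeep\tincrease matrix LR to 0.04\n"
    "c3d4e5f\t1.0050\t44.0\tdiscard\tswitch to GeLU (worse)\n"
)


def best_kept(tsv_text):
    """Return the kept experiment with the lowest validation bits-per-byte."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = [r for r in rows if r["status"] == "keep"]
    return min(kept, key=lambda r: float(r["val_bpb"]))


best = best_kept(LOG)
print(best["commit"], best["val_bpb"], best["description"])
```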

Evaluation metrics by domain

Domain	Primary Metric	Notes
LLM	BPB (bits per byte)	Vocab-size-independent
Classification	Accuracy / F1	Macro-F1 for imbalanced
Segmentation	mIoU / Dice	Per-class IoU reveals weak spots
Generation	FID	Needs >10k samples
Regression	RMSE / MAE	Log-transform skewed targets
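For the LLM row, bits per byte can be derived from the usual cross-entropy loss. The sketch below assumes the loss is summed in nats over the evaluation tokens and that the byte count is the UTF-8 length of the same text; the numbers in the example are made up.

```python
import math


def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed negative log-likelihood (nats) into bits per byte of text."""
    return total_nll_nats / (math.log(2) * total_bytes)


# Hypothetical eval: 1,000 tokens at a mean loss of 2.8 nats covering 4,200 bytes
print(round(bits_per_byte(2.8 * 1000, 4200), 4))
```

Normalizing by bytes rather than tokens is what makes the metric comparable across tokenizers with different vocabulary sizes.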

Architecture Patterns Reference

Detailed code patterns for modern transformer architectures. Referenced from the main SKILL.md.

1. RMSNorm
2. Rotary Position Embeddings (RoPE)
3. Flash Attention with Sliding Window
4. Grouped Query Attention (GQA)
5. Value Embedding (ResFormer)
6. Activation Functions
7. Residual Scaling
8. Logit Soft Capping
9. Full Transformer Block
10. Model Configuration Pattern

---

RMSNorm

Root Mean Square Layer Normalization — drops the mean-centering of LayerNorm, keeping only the variance normalization. ~15% faster with equivalent quality for transformers.

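import torch.nn.functional as F

def norm(x):
    # Parameter-free RMSNorm over the last dimension (F.rms_norm needs PyTorch 2.4+)
    return F.rms_norm(x, (x.size(-1),))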

Apply pre-norm (before attention and MLP), not post-norm:

class Block(nn.Module):
    def forward(self, x):
        x = x + self.attn(norm(x))   # pre-norm
        x = x + self.mlp(norm(x))    # pre-norm
        return x

Also normalize the final output before the unembedding layer:

x = norm(x)
logits = self.lm_head(x)
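To show what the normalization computes, here is a plain-Python sketch of parameter-free RMSNorm on a single vector. It is for illustration only: the `eps` value is an assumption, and real code should keep using `F.rms_norm` on tensors.

```python
import math


def rms_norm(x, eps=1e-6):
    """Divide a vector by its root mean square (no mean subtraction, no learned gain)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


y = rms_norm([3.0, 4.0])
print([round(v, 4) for v in y])
print(round(sum(v * v for v in y) / len(y), 4))  # mean square of the output
```

The output keeps the direction of the input and has a mean square of about 1, which is the only statistic RMSNorm controls.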

---

Rotary Position Embeddings (RoPE)

RoPE encodes position through rotation of query/key pairs. It's relative (only depends on distance between tokens) and naturally handles varying sequence lengths.

Precomputation

Compute cos/sin tables once at model init, not every forward pass:

def precompute_rotary(seq_len, head_dim, base=10000, device=None):
    """Precompute RoPE cos/sin for positions [0, seq_len)."""
    channel_range = torch.arange(0, head_dim, 2, dtype=torch.float32, device=device)
    inv_freq = 1.0 / (base ** (channel_range / head_dim))
    t = torch.arange(seq_len, dtype=torch.float32, device=device)
    freqs = torch.outer(t, inv_freq)
    cos, sin = freqs.cos().bfloat16(), freqs.sin().bfloat16()
    # Shape: [1, seq_len, 1, head_dim//2] for broadcasting
    return cos[None, :, None, :], sin[None, :, None, :]

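# In the model's __init__: cache the tables as non-persistent buffers (kept out of checkpoints).
# head_dim here is assumed to be n_embd // n_head.
cos, sin = precompute_rotary(config.sequence_len * 10, head_dim)
self.register_buffer("cos", cos, persistent=False)
self.register_buffer("sin", sin, persistent=False)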

Application

def apply_rotary_emb(x, cos, sin):
    """Apply RoPE to query or key tensor. x shape: [B, T, H, D]."""
    d = x.shape[3] // 2
    x1, x2 = x[..., :d], x[..., d:]
    y1 = x1 * cos + x2 * sin
    y2 = x1 * (-sin) + x2 * cos
    return torch.cat([y1, y2], dim=3)

Tips

Pre-allocate for seq_len * 10 (or max expected length) to avoid recomputation
Apply RoPE after splitting into heads but before attention
Normalize q and k after RoPE: q, k = norm(q), norm(k) (QK-norm stabilizes training)
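The claim that RoPE is relative can be checked numerically on a single channel pair. The sketch below uses the same rotation convention as `apply_rotary_emb` above, with one fixed frequency; the vectors and positions are arbitrary illustration values.

```python
import math


def rotate(pair, pos, freq=1.0):
    """Rotate one (x1, x2) channel pair by the angle pos * freq."""
    x1, x2 = pair
    c, s = math.cos(pos * freq), math.sin(pos * freq)
    return (x1 * c + x2 * s, -x1 * s + x2 * c)


def score(q, k, q_pos, k_pos):
    """Dot product of a rotated query and a rotated key."""
    qr, kr = rotate(q, q_pos), rotate(k, k_pos)
    return qr[0] * kr[0] + qr[1] * kr[1]


q, k = (0.3, -1.2), (0.7, 0.5)
near = score(q, k, q_pos=5, k_pos=2)         # distance 3
shifted = score(q, k, q_pos=105, k_pos=102)  # same distance, both shifted by 100
print(abs(near - shifted) < 1e-9)
```

Shifting both positions by the same offset leaves the attention score unchanged, so only the distance between tokens matters.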

---

Flash Attention with Sliding Window

Flash Attention computes exact attention in O(N) memory instead of O(N^2), and is significantly faster due to IO-awareness.

Sliding Window Pattern

Use a repeating pattern like SSSL — most layers use short (local) windows, with periodic long (global) windows. The last layer always gets full context.

def compute_window_sizes(config):
    pattern = config.window_pattern.upper()  # e.g., "SSSL"
    long_window = config.sequence_len
    short_window = long_window // 2  # half context

    window_sizes = []
    for layer_idx in range(config.n_layer):
        char = pattern[layer_idx % len(pattern)]
        if char == "L":
            window_sizes.append((long_window, 0))
        else:
            window_sizes.append((short_window, 0))

    # Last layer always gets full context
    window_sizes[-1] = (long_window, 0)
    return window_sizes

This saves ~25% attention compute while maintaining quality — most layers only need local context, and information propagates through the occasional global layer.
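A quick way to see the resulting layout is to run the function on a small config. The function is repeated here in compact form so the snippet runs on its own, and the `SimpleNamespace` config is a stand-in for the real config object.

```python
from types import SimpleNamespace


def compute_window_sizes(config):
    """Per-layer (left, right) windows from a repeating S/L pattern."""
    pattern = config.window_pattern.upper()
    long_window = config.sequence_len
    short_window = long_window // 2
    sizes = []
    for layer_idx in range(config.n_layer):
        char = pattern[layer_idx % len(pattern)]
        sizes.append((long_window if char == "L" else short_window, 0))
    sizes[-1] = (long_window, 0)  # last layer always gets full context
    return sizes


cfg = SimpleNamespace(window_pattern="SSSL", sequence_len=2048, n_layer=6)
for layer_idx, window in enumerate(compute_window_sizes(cfg)):
    print(layer_idx, window)
```

With six layers, layers 0-2 and 4 get the 1024-token window, layer 3 is global by pattern, and layer 5 is promoted to global because it is last.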

Integration

# Using Flash Attention 3
from kernels import get_kernel
fa3 = get_kernel("kernels-community/flash-attn3").flash_attn_interface

y = fa3.flash_attn_func(q, k, v, causal=True, window_size=window_size)

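# Or using PyTorch native SDPA (2.0+). It expects [B, H, T, D], so transpose from
# [B, T, H, D]. is_causal=True gives full causal attention with no sliding window;
# with fewer KV heads than query heads, also pass enable_gqa=True (PyTorch 2.5+).
y = F.scaled_dot_product_attention(
    q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), is_causal=True
).transpose(1, 2)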

---

Grouped Query Attention (GQA)

Use fewer KV heads than query heads. Saves memory/compute with minimal quality loss.

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.n_head = config.n_head        # e.g., 12
        self.n_kv_head = config.n_kv_head  # e.g., 4 (GQA) or 1 (MQA)
        self.head_dim = config.n_embd // config.n_head

        assert config.n_head % config.n_kv_head == 0

        self.c_q = nn.Linear(config.n_embd, self.n_head * self.head_dim, bias=False)
        self.c_k = nn.Linear(config.n_embd, self.n_kv_head * self.head_dim, bias=False)
        self.c_v = nn.Linear(config.n_embd, self.n_kv_head * self.head_dim, bias=False)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=False)

Common ratios:

MHA (multi-head): n_kv_head = n_head — full quality, most memory
GQA: n_kv_head = n_head / 4 — good tradeoff
MQA (multi-query): n_kv_head = 1 — most memory savings, slight quality loss
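One place the memory saving shows up is the key/value cache at inference time. The sketch below estimates its size for the three ratios above, assuming a hypothetical 12-layer model with 12 query heads of dimension 64, a 2048-token context and bf16 storage.

```python
def kv_cache_bytes(n_layer, n_kv_head, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes needed to cache K and V for every layer (factor 2 = one K plus one V)."""
    return 2 * n_layer * batch * seq_len * n_kv_head * head_dim * bytes_per_elem


for name, n_kv_head in (("MHA", 12), ("GQA", 3), ("MQA", 1)):
    size = kv_cache_bytes(n_layer=12, n_kv_head=n_kv_head, head_dim=64, seq_len=2048)
    print(f"{name}: {size / 2**20:.0f} MiB")
```

The cache scales linearly with `n_kv_head`, so the 4:1 GQA ratio cuts it to a quarter of the MHA size.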

---

Value Embedding (ResFormer)

Alternating layers receive value embeddings — learned per-token vectors added to the V projection with an input-dependent gate. This creates a "value residual stream" parallel to the main residual.

def has_ve(layer_idx, n_layer):
    """Alternating layers get value embeddings, last layer always included."""
    return layer_idx % 2 == (n_layer - 1) % 2

# In attention forward:
if ve is not None:
    ve = ve.view(B, T, self.n_kv_head, self.head_dim)
    # Input-dependent gate: sigmoid output scaled by 2 (neutral at init)
    gate = 2 * torch.sigmoid(self.ve_gate(x[..., :gate_channels]))
    v = v + gate.unsqueeze(-1) * ve

Initialize gate weights to zero so sigmoid(0) = 0.5, scaled by 2 = 1.0 (neutral start):

nn.init.zeros_(block.attn.ve_gate.weight)
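To check which layers the parity rule selects, the predicate can be evaluated for an even and an odd layer count. The function is copied from above so the snippet runs standalone.

```python
def has_ve(layer_idx, n_layer):
    """Alternating layers get value embeddings, last layer always included."""
    return layer_idx % 2 == (n_layer - 1) % 2


for n_layer in (12, 7):
    print(n_layer, [i for i in range(n_layer) if has_ve(i, n_layer)])
```

For 12 layers this selects the odd indices 1 through 11, and for 7 layers the even indices 0 through 6, so the last layer is included in both cases.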

---

Activation Functions

ReluSquared (recommended for simplicity)

def forward(self, x):
    x = self.c_fc(x)
    x = F.relu(x).square()  # sparse + smooth
    x = self.c_proj(x)
    return x

Benefits: naturally sparse (ReLU zeros + squaring), simple, good performance.

SwiGLU (recommended for quality)

class SwiGLUMLP(nn.Module):
    def __init__(self, config):
        super().__init__()  # required before registering submodules
        hidden = int(config.n_embd * 8 / 3)  # ~2.67x, compensate for gate
        hidden = ((hidden + 63) // 64) * 64   # round up to a multiple of 64 for efficiency
        self.w1 = nn.Linear(config.n_embd, hidden, bias=False)
        self.w2 = nn.Linear(hidden, config.n_embd, bias=False)
        self.w3 = nn.Linear(config.n_embd, hidden, bias=False)  # gate

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
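The 8/3 factor exists so that three SwiGLU matrices cost about the same as the two matrices of a standard 4x MLP. A small sketch of the sizing arithmetic (the helper name `swiglu_hidden` is hypothetical):

```python
def swiglu_hidden(n_embd, multiple=64):
    """Hidden width: 8/3 of n_embd, rounded up to a multiple of 64."""
    hidden = int(n_embd * 8 / 3)
    return ((hidden + multiple - 1) // multiple) * multiple


for n_embd in (768, 1024):
    hidden = swiglu_hidden(n_embd)
    swiglu_params = 3 * n_embd * hidden      # w1, w2, w3
    plain_params = 2 * n_embd * 4 * n_embd   # c_fc and c_proj with a 4x hidden layer
    print(n_embd, hidden, swiglu_params, plain_params)
```

At width 768 the two parameter counts match exactly (hidden = 2048); at 1024 the rounding to 2752 makes SwiGLU slightly larger.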

GELU (safe default)

x = F.gelu(self.c_fc(x))

---

Residual Scaling

Learnable per-layer residual scaling stabilizes deep networks:

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()  # required before registering parameters
        self.resid_lambdas = nn.Parameter(torch.ones(config.n_layer))
        self.x0_lambdas = nn.Parameter(torch.full((config.n_layer,), 0.1))

    def forward(self, idx):
        x = norm(self.wte(idx))
        x0 = x  # save initial representation

        for i, block in enumerate(self.transformer.h):
            # x0 skip connection: mix in initial representation
            x = self.resid_lambdas[i] * x + self.x0_lambdas[i] * x0
            x = block(x, ...)

        return norm(x)

Initialize: resid_lambdas = 1.0 (normal residual), x0_lambdas = 0.1 (small initial skip).

This helps because:

Deep networks can have vanishing/exploding residual norms
x0 skip connections let gradients flow directly to the embedding layer
Learnable scaling lets the network decide how much skip vs. residual per layer

---

Logit Soft Capping

Prevents extreme logit values that cause training instability:

softcap = 15
logits = self.lm_head(x).float()  # compute in fp32 for stability
logits = softcap * torch.tanh(logits / softcap)

This smoothly clamps logits to [-softcap, +softcap]. Values in the normal range (much smaller than softcap) pass through nearly unchanged; extreme values are compressed.
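The behaviour described above is easy to verify on scalar values. This sketch applies the same formula with `math.tanh`; the sample logits are arbitrary.

```python
import math


def soft_cap(logit, softcap=15.0):
    """Smoothly squash a logit into the range [-softcap, +softcap]."""
    return softcap * math.tanh(logit / softcap)


for raw in (1.0, 10.0, 50.0, 500.0):
    print(raw, round(soft_cap(raw), 3))
```

A logit of 1.0 comes through almost unchanged, while 50 and 500 are both pressed against the cap of 15.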

---

Model Configuration Pattern

Use a dataclass for clean configuration:

from dataclasses import dataclass

@dataclass
class GPTConfig:
    sequence_len: int = 2048
    vocab_size: int = 32768
    n_layer: int = 12
    n_head: int = 6
    n_kv_head: int = 6
    n_embd: int = 768
    window_pattern: str = "SSSL"

def build_config(depth, aspect_ratio=64, head_dim=128):
    """Derive model dimensions from depth using aspect ratio."""
    base_dim = depth * aspect_ratio
    model_dim = ((base_dim + head_dim - 1) // head_dim) * head_dim  # round to head_dim
    num_heads = model_dim // head_dim
    return GPTConfig(n_layer=depth, n_head=num_heads, n_kv_head=num_heads, n_embd=model_dim)

The aspect ratio pattern (d_model = depth * ratio) keeps width proportional to depth, which empirical research shows is more compute-efficient than scaling width alone.
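To see what the derivation produces, the sketch below runs `build_config` for a few depths. The config class is trimmed to the fields the function sets so the snippet is self-contained.

```python
from dataclasses import dataclass


@dataclass
class GPTConfig:
    n_layer: int = 12
    n_head: int = 6
    n_kv_head: int = 6
    n_embd: int = 768


def build_config(depth, aspect_ratio=64, head_dim=128):
    """Derive model dimensions from depth using aspect ratio."""
    base_dim = depth * aspect_ratio
    model_dim = ((base_dim + head_dim - 1) // head_dim) * head_dim  # round up to head_dim
    num_heads = model_dim // head_dim
    return GPTConfig(n_layer=depth, n_head=num_heads, n_kv_head=num_heads, n_embd=model_dim)


for depth in (12, 13, 20):
    cfg = build_config(depth)
    print(depth, cfg.n_embd, cfg.n_head)
```

Depth 12 gives the default 768-wide, 6-head model; depth 13 would be 832 wide but is rounded up to 896 so the width stays a whole number of 128-dim heads.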

---

FLOPs Estimation

For monitoring MFU, estimate FLOPs per token:

def estimate_flops_per_token(model):
    """Forward + backward FLOPs per token (approx 6 * params + attention)."""
    cfg = model.config
    head_dim = cfg.n_embd // cfg.n_head
    # Main rule: 6 * N (2 for fwd matmuls, 4 for bwd matmuls per param)
    # Exclude the token embedding (a sparse lookup, not a matmul)
    nparams_dense = sum(p.numel() for p in model.parameters())
    nparams_dense -= model.wte.weight.numel()
    # lm_head is a dense matmul, so it stays in the count. This assumes untied weights:
    # with tying, parameters() yields the shared tensor once and the line above removes it.

    # Attention FLOPs per layer: 12 * n_head * head_dim * seq_len
    # (Q@K^T and attn@V, 2 FLOPs per multiply-add, times 3 for fwd + bwd)
    attn_flops = 0
    for window in model.window_sizes:
        effective_seq = min(window[0], cfg.sequence_len)
        attn_flops += 12 * cfg.n_head * head_dim * effective_seq

    return 6 * nparams_dense + attn_flops
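For a sense of scale, the same estimate can be done with plain numbers. The model below is hypothetical (85M dense parameters, 12 layers, 6 heads of 128 dims, SSSL windows over a 2048-token context), and the helper takes those numbers directly instead of a model object.

```python
def flops_per_token(n_dense_params, n_head, head_dim, window_lens):
    """Return (dense, attention) forward + backward FLOPs per token."""
    dense = 6 * n_dense_params
    attn = sum(12 * n_head * head_dim * w for w in window_lens)
    return dense, attn


# SSSL repeated over 12 layers; the last layer is already an L (full-context) layer
windows = [1024, 1024, 1024, 2048] * 3
dense, attn = flops_per_token(85_000_000, n_head=6, head_dim=128, window_lens=windows)
print(f"dense={dense:.3e} attn={attn:.3e} attn share={attn / (dense + attn):.1%}")
```

At this size the attention term is a noticeable fraction of the total, which is why the estimate adds it rather than relying on 6 × N alone.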

Biomedical & Pharmaceutical ML Reference

Models, architectures, and training patterns specific to biomedical and pharmaceutical domains. Referenced from SKILL.md.

1. Molecular Property Prediction & Drug Discovery
2. Molecular Generation
3. Protein Structure & Language Models
4. Drug-Target Interaction
5. Medical Imaging
6. Genomic & Sequence Models
7. Single-Cell Omics
8. Clinical NLP
9. EHR & Survival Analysis
10. Biomedical Training Tricks

---

Molecular Property Prediction & Drug Discovery

Graph Neural Networks for molecules

Molecules are naturally graphs (atoms = nodes, bonds = edges). GNNs are the dominant architecture.

Model	Key Idea	Best For
SchNet	Continuous filter convolutions on 3D coordinates	Small molecules, QM properties
DimeNet / DimeNet++	Directional message passing (angles between bonds)	Geometry-sensitive properties
GemNet	Triplet interactions + geometric embeddings	State-of-art on OC20 catalyst dataset
MPNN (Gilmer et al.)	General message passing framework	Baseline for molecular graphs
AttentiveFP	Graph attention for molecular fingerprints	ADMET prediction

Molecular fingerprints + transformers

Model	Approach	Use Case
MolBERT	BERT pretrained on SMILES strings	Molecular property prediction
ChemBERTa	RoBERTa on SMILES	Transfer learning for chemistry
Uni-Mol	3D molecular representation learning	Broad molecular tasks
MoLFormer	Large-scale SMILES transformer	Virtual screening

Practical setup for molecular GNNs

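import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader  # DataLoader lives here in PyG 2.x
from torch_geometric.nn import GCNConv, global_mean_pool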

class MolGNN(nn.Module):
    def __init__(self, in_feats, hidden, out_feats, n_layers=3):
        super().__init__()
        self.convs = nn.ModuleList()
        self.convs.append(GCNConv(in_feats, hidden))
        for _ in range(n_layers - 1):
            self.convs.append(GCNConv(hidden, hidden))
        self.head = nn.Linear(hidden, out_feats)

    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch
        for conv in self.convs:
            x = F.relu(conv(x, edge_index))
        x = global_mean_pool(x, batch)  # graph-level readout
        return self.head(x)

Key libraries: PyTorch Geometric, DGL, RDKit (featurization), DeepChem

ADMET prediction

Absorption, Distribution, Metabolism, Excretion, Toxicity — critical for drug candidates:

Use MoleculeNet benchmarks for evaluation (BBBP, BACE, ClinTox, Tox21, HIV, SIDER)
Multi-task learning across ADMET endpoints often outperforms single-task
Scaffold splitting (not random) for realistic evaluation — prevents data leakage from similar molecules
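The idea behind scaffold splitting can be sketched without any chemistry library, assuming each molecule already has a scaffold string (for example a Bemis-Murcko scaffold computed with RDKit). The greedy assignment below is one common variant, not the only one, and the function name is hypothetical.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8):
    """Split sample indices so that no scaffold appears in both train and test."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    # Largest scaffold groups go to train first; rarer scaffolds end up in test
    ordered = sorted(groups.values(), key=len, reverse=True)
    train, test = [], []
    cutoff = frac_train * len(scaffolds)
    for members in ordered:
        target = train if len(train) + len(members) <= cutoff else test
        target.extend(members)
    return train, test


# Toy data: precomputed scaffold SMILES for ten molecules
scaffolds = ["c1ccccc1"] * 5 + ["c1ccncc1"] * 3 + ["C1CCCCC1"] * 2
train, test = scaffold_split(scaffolds)
print(train, test)
```

Every molecule sharing a scaffold lands on the same side of the split, so the test set measures generalization to unseen chemotypes rather than to near-duplicates.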

---

Molecular Generation

String-based (SMILES)

Model	Approach	Strength
REINVENT	RNN + reinforcement learning	Optimizes for desired properties
SMILES VAE	Variational autoencoder on SMILES	Latent space interpolation
MolGPT	GPT-style autoregressive on SMILES	Conditional generation

Graph-based

Model	Approach	Strength
JT-VAE	Junction tree variational autoencoder	Guarantees valid molecules
GraphAF	Autoregressive flow on graphs	Flexible, sequential generation
MoFlow	Normalizing flows for molecules	Invertible, exact likelihood

3D structure-aware generation

Model	Approach	Use Case
EDM (Hoogeboom et al.)	Equivariant diffusion in 3D	Generate 3D conformers
DiffSBDD	Diffusion for structure-based drug design	Protein pocket → ligand
TargetDiff	SE(3)-equivariant diffusion	Target-aware molecule generation

Retrosynthesis

Predict how to synthesize a target molecule (work backward from product to reactants):

Template-based: classify reaction templates (fast, limited coverage)
Template-free: seq2seq translation from product SMILES to reactant SMILES
Key models: Molecular Transformer, LocalRetro, Graph2SMILES

---

Protein Structure & Language Models

Structure prediction

Model	Input	Output	Notes
AlphaFold2	MSA + sequence	3D structure	Revolutionary accuracy; needs MSA database search
AlphaFold3	Sequence(s) + ligands	Complex structure	Handles protein-ligand, protein-DNA/RNA complexes
ESMFold	Single sequence (no MSA)	3D structure	Much faster; ESM-2 embeddings → structure
RoseTTAFold	MSA + templates	3D structure	Three-track architecture, open-source
OpenFold	Same as AF2	3D structure	Open-source reimplementation of AlphaFold2

Protein language models

Pretrained on millions of protein sequences — learn evolutionary and structural features:

Model	Size	Pretraining	Best For
ESM-2	8M-15B params	Masked language modeling on UniRef	General protein tasks, structure prediction
ProtTrans (ProtBERT, ProtT5)	Up to 3B	MLM/denoising on UniRef/BFD	Sequence classification, function prediction
ProGen2	Up to 6.4B	Autoregressive on protein sequences	Protein design and generation

# Using ESM-2 for protein embeddings
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("facebook/esm2_t33_650M_UR50D")
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")

inputs = tokenizer("MKTAYIAKQRQISFVK", return_tensors="pt")
outputs = model(**inputs)
embeddings = outputs.last_hidden_state  # per-residue embeddings

Fine-tuning protein LMs

Contact prediction: predict which residue pairs are close in 3D
Function annotation: GO term prediction from embeddings
Fitness prediction: mutant → wild-type fitness (DMS data)
Subcellular localization: where in the cell the protein goes

Use per-residue embeddings for residue-level tasks, mean-pooled for protein-level tasks.

---

Drug-Target Interaction

Predict whether a drug molecule binds to a protein target:

Model	Drug Rep	Target Rep	Notes
DeepDTA	SMILES CNN	Protein sequence CNN	Simple baseline
GraphDTA	Molecular graph GNN	Protein sequence CNN	Better than DeepDTA
DrugBAN	Graph + bilinear attention	Protein sequence	State-of-art on benchmark
MolTrans	Molecular substructure	Protein subsequence	Interaction-aware transformer

Virtual screening pipeline

1. Target: protein structure (from AlphaFold or PDB)
2. Library: millions of candidate molecules (ZINC, Enamine REAL)
3. Docking: quick physics-based filter (AutoDock Vina, Glide)
4. ML scoring: GNN/transformer re-ranking of top candidates
5. ADMET filter: predict toxicity, solubility, permeability
6. Synthesis check: retrosynthesis feasibility

---

Medical Imaging

Architectures by task

Task	Architecture	Notes
Classification	ViT or EfficientNet (pretrained)	Fine-tune from ImageNet or medical-specific pretraining
Segmentation	U-Net / nnU-Net	nnU-Net auto-configures for each dataset
3D segmentation	Swin-UNETR / V-Net / 3D U-Net	For CT/MRI volumes
Detection	DETR / Faster R-CNN	Lesion detection, cell counting
Foundation model	MedSAM / BiomedCLIP	Zero/few-shot adaptation

nnU-Net (self-configuring segmentation)

nnU-Net automatically configures architecture, preprocessing, and training for any medical segmentation task:

# nnU-Net auto-configures everything
nnUNetv2_plan_and_preprocess -d DATASET_ID --verify_dataset_integrity
nnUNetv2_train DATASET_ID 3d_fullres FOLD
nnUNetv2_predict -i INPUT_FOLDER -o OUTPUT_FOLDER -d DATASET_ID -c 3d_fullres

Key decisions nnU-Net makes automatically:

2D vs 3D vs cascade architecture
Patch size, batch size based on GPU memory
Preprocessing (resampling, normalization per modality)
Augmentation (rotation, scaling, mirroring, elastic deformation)
Postprocessing (connected components, etc.)

Medical imaging training patterns

# Common medical image preprocessing
import monai.transforms as mt

train_transforms = mt.Compose([
    mt.LoadImaged(keys=["image", "label"]),
    mt.EnsureChannelFirstd(keys=["image", "label"]),
    mt.Spacingd(keys=["image", "label"], pixdim=(1.0, 1.0, 1.0)),  # isotropic
    mt.ScaleIntensityRanged(keys=["image"], a_min=-175, a_max=250,
                            b_min=0.0, b_max=1.0, clip=True),  # CT window
    mt.CropForegroundd(keys=["image", "label"], source_key="image"),
    mt.RandCropByPosNegLabeld(
        keys=["image", "label"], label_key="label",
        spatial_size=(96, 96, 96), pos=1, neg=1, num_samples=4),
    mt.RandFlipd(keys=["image", "label"], prob=0.5, spatial_axis=0),
    mt.RandRotate90d(keys=["image", "label"], prob=0.5),
])

Loss functions for medical segmentation

# Dice + Cross-Entropy (standard for medical segmentation)
from monai.losses import DiceCELoss
loss_fn = DiceCELoss(to_onehot_y=True, softmax=True)

# For highly imbalanced segmentation (tiny lesions)
from monai.losses import FocalLoss, TverskyLoss
loss_fn = TverskyLoss(alpha=0.3, beta=0.7)  # penalize FN more than FP

Key libraries

MONAI — PyTorch framework for medical imaging (transforms, losses, networks, metrics)
TorchIO — data loading and augmentation for 3D medical images
nnU-Net — self-configuring segmentation
MedPy — medical image processing utilities

---

Genomic & Sequence Models

DNA/RNA language models

Model	Architecture	Sequence Length	Best For
DNABERT-2	BERT with BPE tokenization	512-4K	Short regulatory sequences, promoters
HyenaDNA	Hyena (long-range SSM)	Up to 1M bp	Long-range regulatory elements, whole genes
Evo	StripedHyena	Up to 131K bp	DNA/RNA generation, fitness prediction
Enformer	Transformer	200K bp input	Gene expression prediction from sequence
Nucleotide Transformer	BERT-style	~6 kb (1,000 6-mer tokens)	Variant effect prediction
Caduceus	Bidirectional Mamba	Up to 131K bp	Complements Evo; bidirectional

Enformer for gene expression

# Enformer predicts gene expression tracks from 200kb DNA sequence
# Output: 896 spatial bins × 5,313 tracks (CAGE, DNase, histone marks)
# Architecture: convolutional stem → 11 transformer layers → prediction heads
#
# Key insight: long-range enhancer-promoter interactions require >100kb context
# which is why Enformer uses 200kb input windows

Variant effect prediction

Predict whether a DNA/protein variant is pathogenic:

ESM-1v: zero-shot variant effect from protein LM log-likelihood ratios
AlphaMissense: AlphaFold-derived pathogenicity predictions
CADD / SpliceAI: established tools for genomic variant scoring
Fine-tune DNABERT or HyenaDNA on ClinVar for custom variant classifiers

---

Single-Cell Omics

Foundation models for single-cell

Model	Architecture	Training Data	Use Case
scVI	VAE	Per-dataset	Batch correction, normalization, imputation
scGPT	GPT-style autoregressive	33M cells	Cell annotation, perturbation prediction, integration
Geneformer	BERT-style (rank-ordered genes)	30M cells	Transfer learning for gene network analysis
scFoundation	Transformer	50M cells	General single-cell foundation model

scVI setup

import scvi

# Register the AnnData object
scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="batch")

# Train the model
model = scvi.model.SCVI(adata, n_latent=30, n_layers=2)
model.train(max_epochs=200, early_stopping=True)

# Get latent representation (for clustering, visualization)
latent = model.get_latent_representation()
adata.obsm["X_scVI"] = latent

# Get normalized, batch-corrected expression
adata.layers["scvi_normalized"] = model.get_normalized_expression()

Key considerations for single-cell ML

Sparsity: scRNA-seq matrices are ~90-95% zeros — use sparse representations
Batch effects: biggest confounder; always include batch correction (scVI, Harmony, Scanorama)
Gene selection: highly variable genes (HVGs) — typically 2000-5000 genes for downstream analysis
Preprocessing: log1p normalization, or use raw counts with models that handle them (scVI)
Evaluation: silhouette score (bio conservation vs batch mixing), LISI scores, kBET

---

Clinical NLP

Biomedical language models

Model	Base	Pretraining Corpus	Best For
PubMedBERT	BERT	PubMed abstracts (from scratch)	Biomedical NER, relation extraction
BioBERT	BERT	PubMed + PMC (continued pretraining)	General biomedical NLP
BioGPT	GPT-2	PubMed abstracts	Biomedical text generation
GatorTron	BERT (large)	Clinical notes + PubMed (90B words)	Clinical NLP, de-identified EHR
Med-PaLM 2	PaLM 2	Medical QA fine-tuning	Medical question answering
BioMistral	Mistral-7B	PubMed continued pretraining	Open-source biomedical LLM

Clinical NLP tasks

Named Entity Recognition (NER): extract drugs, diseases, genes, procedures from text
Relation Extraction: drug-drug interactions, gene-disease associations
Medical coding: ICD-10, SNOMED-CT, MeSH term assignment
De-identification: remove PHI from clinical notes (HIPAA compliance)
Clinical trial matching: patient → eligible trials

Practical pattern

from transformers import AutoModelForTokenClassification, AutoTokenizer

# PubMedBERT for biomedical NER
model = AutoModelForTokenClassification.from_pretrained(
    "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext",
    num_labels=num_entity_types
)

# Fine-tune on domain-specific NER dataset (BC5CDR, NCBI-disease, etc.)
# Use BIO tagging scheme
# Typical hyperparameters:
#   lr: 2e-5, epochs: 20, batch_size: 16, warmup: 10%

---

EHR & Survival Analysis

EHR modeling

Electronic Health Records are sequential, multimodal, and irregularly sampled:

Approach	Architecture	Key Idea
BEHRT	BERT on medical codes	Treat visits as "sentences", codes as "tokens"
Med-BERT	BERT with structured EHR	Pretrain on diagnosis codes for disease prediction
RETAIN	Reverse-time attention RNN	Interpretable predictions from visit sequences
STraTS	Self-supervised transformer	Handles irregular time intervals

Survival analysis (time-to-event)

# Cox proportional hazards with neural network
# Loss: negative partial log-likelihood
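def cox_ph_loss(risk_scores, events, times):
    """
    risk_scores: model output, shape [N] (higher = higher risk)
    events: 1 if event occurred, 0 if censored
    times: time to event or censoring
    Tied event times are not handled specially (no Breslow/Efron correction).
    """
    # Sort by descending time so the running sum at position i covers
    # the risk set {j : times[j] >= times[i]}
    order = torch.argsort(times, descending=True)
    risk_scores = risk_scores[order]
    events = events[order]

    log_risk = torch.logcumsumexp(risk_scores, dim=0)
    # Average over observed events only; censored rows contribute zero
    loss = -torch.sum((risk_scores - log_risk) * events) / events.sum().clamp(min=1)
    return loss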

# Evaluation metric: concordance index (C-index)
# C-index > 0.7 is decent, > 0.8 is good for clinical prediction
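
The C-index mentioned above can be computed directly from its definition. The following is a minimal pure-Python sketch of Harrell's C-index, assuming higher risk scores should correspond to shorter survival; the function name and the toy values are illustrative, and the O(n²) loop is meant for clarity rather than speed (lifelines ships a faster implementation).

```python
def concordance_index(risk_scores, events, times):
    """Harrell's C-index: fraction of comparable pairs ranked correctly.

    A pair (i, j) is comparable when subject i had an observed event and a
    strictly shorter time than subject j. Ties in risk count as half.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")


# Toy data (made up): risk decreases as survival time increases.
times = [2.0, 4.0, 6.0, 8.0]
events = [1, 1, 0, 1]
risk = [0.9, 0.7, 0.4, 0.1]
print(concordance_index(risk, events, times))  # perfect ranking: 1.0
```

A value of 0.5 corresponds to random ranking, which is why the thresholds above start well above it.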

DeepSurv / DeepHit

DeepSurv: neural network + Cox PH (continuous time, proportional hazards assumption)
DeepHit: directly predicts discrete time survival distribution (no PH assumption)
Key advantage: can model complex nonlinear covariate interactions that Cox can't

---

Biomedical Training Tricks

Small dataset strategies (most biomedical datasets are small)

1. Domain-specific pretraining: always start from a biomedical pretrained model, not generic ImageNet/BERT
2. Transfer learning pipeline: generic pretrained → domain pretrained → task fine-tuned
3. Data augmentation: aggressive but domain-appropriate (see safety notes below)
4. Few-shot learning: prototypical networks, MAML for rare disease classification
5. Self-supervised pretraining on unlabeled biomedical data, then fine-tune on labeled
6. Multi-task learning: train on multiple related endpoints simultaneously
7. Cross-validation: k-fold (k=5-10) is mandatory for small biomedical datasets; a single train/val/test split is unreliable

Class imbalance (very common in biomedical)

# Strategy 1: Weighted loss
class_counts = torch.tensor([1000, 50, 30])  # healthy, disease_A, disease_B
weights = 1.0 / class_counts
weights = weights / weights.sum() * len(weights)
loss_fn = nn.CrossEntropyLoss(weight=weights)

# Strategy 2: Focal loss (for extreme imbalance)
def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    ce = F.cross_entropy(logits, targets, reduction='none')
    pt = torch.exp(-ce)
    return (alpha * (1 - pt) ** gamma * ce).mean()

# Strategy 3: Oversampling with WeightedRandomSampler
from torch.utils.data import WeightedRandomSampler
sample_weights = [weights[label] for label in labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels))

Medical image augmentation safety

Some standard augmentations are unsafe for medical images:

Augmentation	Safe?	Notes
Horizontal flip	Depends	Safe for dermoscopy, unsafe for chest X-ray (heart laterality matters)
Vertical flip	Usually no	Anatomy has orientation
Random crop	Yes	With care for lesion location
Color jitter	Sometimes	Safe for natural images, problematic for stained histology
Elastic deformation	Yes	Mimics tissue deformation, widely used in medical segmentation
Intensity scaling	Yes	Mimics scanner variation
Mixup/CutMix	Caution	Can create anatomically impossible combinations
Rotation	Small angles	±15° usually safe; 90°/180° depends on modality

Regulatory considerations (FDA / EMA)

When building models for clinical deployment:

Locked algorithm: model weights cannot change after regulatory submission
Predetermined change control plan: document how the model can be updated
Dataset documentation: detailed provenance, demographics, inclusion/exclusion criteria
Performance by subgroup: report metrics stratified by age, sex, ethnicity, disease severity
Failure mode analysis: characterize where the model fails and how gracefully
Intended use statement: narrow, specific clinical context
Validation: external validation on data from a different institution is expected

Domain-specific pretraining sources

Domain	Pretraining Data	Scale
Molecular	PubChem, ZINC, ChEMBL	100M+ molecules
Protein	UniRef50/90, UniProt, BFD	250M+ sequences
Genomic	Human reference genome, 1000 Genomes	~3B bp per genome
Medical imaging	MIMIC-CXR, CheXpert, NIH ChestX-ray14	200K-400K images
Clinical text	MIMIC-III/IV clinical notes	2M+ notes
Biomedical text	PubMed, PMC full text	36M+ abstracts
Single-cell	CellxGene, HCA	50M+ cells

Key biomedical ML libraries

Library	Purpose
PyTorch Geometric	GNNs for molecules and graphs
DGL	Alternative GNN framework
RDKit	Molecular featurization, SMILES processing
DeepChem	Molecular ML models and datasets
MONAI	Medical imaging (transforms, losses, architectures)
TorchIO	3D medical image augmentation and loading
scanpy / scverse	Single-cell analysis ecosystem
scvi-tools	Deep learning for single-cell
Biopython	Sequence parsing, alignment, PDB handling
HuggingFace transformers	Biomedical LMs (PubMedBERT, ESM-2)
OpenFold	Protein structure prediction
lifelines	Survival analysis (Cox PH, Kaplan-Meier)
pysurv / auton-survival	Neural survival models

Domain-Specific Training Patterns

Patterns for vision, diffusion, and other non-LLM training scenarios. Referenced from SKILL.md.

1. Computer Vision Training
2. Diffusion Model Training
3. EMA (Exponential Moving Average) Models
4. Contrastive / Self-Supervised Learning
5. Fine-Tuning & Transfer Learning
6. Multi-GPU / Distributed Training
7. Checkpointing
8. Data Loading for Images

---

Computer Vision Training

Data augmentation pipeline

Data augmentation is often more impactful than architecture changes in vision:

import torch
import torchvision.transforms.v2 as T

train_transform = T.Compose([
    T.RandomResizedCrop(224, scale=(0.08, 1.0)),
    T.RandomHorizontalFlip(),
    T.RandAugment(num_ops=2, magnitude=9),  # automated augmentation
    T.ToImage(),
    T.ToDtype(torch.float32, scale=True),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

val_transform = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToImage(),
    T.ToDtype(torch.float32, scale=True),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

MixUp and CutMix

Regularization via input mixing — very effective for classification:

from torchvision.transforms.v2 import MixUp, CutMix

mixup = MixUp(alpha=0.2, num_classes=num_classes)
cutmix = CutMix(alpha=1.0, num_classes=num_classes)
# Apply randomly to each batch
mix_fn = T.RandomChoice([mixup, cutmix])

for images, targets in train_loader:
    images, targets = mix_fn(images, targets)
    # targets are now soft labels (one-hot blended)
    loss = F.cross_entropy(model(images), targets)

Stochastic depth (drop path)

Randomly drop residual blocks during training — better than dropout for vision:

class DropPath(nn.Module):
    def __init__(self, drop_prob=0.0):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        if not self.training or self.drop_prob == 0.0:
            return x
        keep_prob = 1 - self.drop_prob
        shape = (x.shape[0],) + (1,) * (x.ndim - 1)
        mask = torch.bernoulli(torch.full(shape, keep_prob, device=x.device))
        return x * mask / keep_prob

Use linearly increasing drop rates: layer 0 gets 0%, last layer gets max (e.g., 0.2):

drop_rates = [x.item() for x in torch.linspace(0, 0.2, num_layers)]

Label smoothing

loss = F.cross_entropy(logits, targets, label_smoothing=0.1)

Progressive resizing

Train at low resolution first, then increase — saves compute and acts as regularization:

# Phase 1: 160x160, lr=1e-3, epochs 0-60
# Phase 2: 224x224, lr=3e-4, epochs 60-90
# Phase 3: 288x288, lr=1e-4, epochs 90-100
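
One way to drive the phases above from a training loop is a small lookup that maps the current epoch to a resolution and learning rate. This is a sketch only: the `PHASES` table copies the numbers from the schedule above, while the function name and table layout are assumptions for illustration.

```python
# (start_epoch, resolution, lr), copied from the three phases listed above.
PHASES = [(0, 160, 1e-3), (60, 224, 3e-4), (90, 288, 1e-4)]


def phase_for_epoch(epoch, phases=PHASES):
    """Return (resolution, lr) for the latest phase whose start <= epoch."""
    resolution, lr = phases[0][1], phases[0][2]
    for start, res, phase_lr in phases:
        if epoch >= start:
            resolution, lr = res, phase_lr
    return resolution, lr


for epoch in (0, 59, 60, 95):
    print(epoch, phase_for_epoch(epoch))
```

In practice the returned resolution would be used to rebuild the train transform (and usually the DataLoader) at each phase boundary.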

Vision optimizer recipes

# ViT / Vision Transformer
optimizer = torch.optim.AdamW(params, lr=1e-3, weight_decay=0.05, betas=(0.9, 0.999))
# + cosine LR decay, 5-epoch warmup, batch_size=1024

# ConvNeXt / CNN
optimizer = torch.optim.AdamW(params, lr=4e-3, weight_decay=0.05)
# + cosine LR decay, 20-epoch warmup, layer-wise LR decay

# ResNet (classic SGD recipe)
optimizer = torch.optim.SGD(params, lr=0.1, momentum=0.9, weight_decay=1e-4)
# + step LR decay (0.1x at epoch 30, 60, 90)
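
The "cosine LR decay with warmup" mentioned in the ViT and ConvNeXt recipes can be written as a plain function of the step. This is a minimal sketch under the assumption of linear warmup followed by a single cosine half-period; the function and parameter names are illustrative, not taken from any library.

```python
import math


def warmup_cosine_lr(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp from peak_lr / warmup_steps up to peak_lr
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # hold at min_lr past the end
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical run: 100 steps total, 5 warmup steps, peak LR 1e-3.
for s in (0, 4, 5, 52, 100):
    print(s, warmup_cosine_lr(s, total_steps=100, warmup_steps=5, peak_lr=1e-3))
```

The returned value can be written into each entry of `optimizer.param_groups` before the optimizer step; counting in steps rather than epochs keeps the curve smooth within an epoch.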

---

Diffusion Model Training

Training loop for DDPM-style

import torch
import torch.nn.functional as F

def train_step(model, x_0, noise_schedule):
    B = x_0.shape[0]
    # Sample random timesteps
    t = torch.randint(0, noise_schedule.num_timesteps, (B,), device=x_0.device)

    # Sample noise
    noise = torch.randn_like(x_0)

    # Forward diffusion: add noise
    x_t = noise_schedule.q_sample(x_0, t, noise)

    # Predict noise (or v, or x_0)
    pred = model(x_t, t)

    # Simple MSE loss on noise prediction
    loss = F.mse_loss(pred, noise)
    return loss

Noise schedules

# Linear schedule (DDPM original)
betas = torch.linspace(1e-4, 0.02, 1000)

# Cosine schedule (improved DDPM — better for high-res)
def cosine_schedule(timesteps, s=0.008):
    steps = timesteps + 1
    x = torch.linspace(0, timesteps, steps)
    alphas_cumprod = torch.cos((x / timesteps + s) / (1 + s) * torch.pi * 0.5) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1 - alphas_cumprod[1:] / alphas_cumprod[:-1]
    return torch.clamp(betas, 0.0001, 0.9999)

Flow matching (modern, simpler)

def flow_matching_loss(model, x_0):
    """Conditional flow matching — simpler than DDPM, often better."""
    t = torch.rand(x_0.shape[0], 1, 1, 1, device=x_0.device)  # uniform [0, 1]
    noise = torch.randn_like(x_0)

    # Interpolate between noise and data
    x_t = (1 - t) * noise + t * x_0

    # Target velocity: data - noise
    target = x_0 - noise

    # Predict velocity
    pred = model(x_t, t.flatten())  # shape [B], including when B == 1
    return F.mse_loss(pred, target)

v-prediction (better for low SNR regions)

# v = alpha_t * noise - sigma_t * x_0
# Better than epsilon-prediction for high-resolution images
def v_prediction_loss(model, x_0, t, alpha, sigma):
    # t: timesteps, shape [B]; alpha, sigma: schedule values at t,
    # broadcastable to x_0 (e.g. shape [B, 1, 1, 1])
    noise = torch.randn_like(x_0)
    x_t = alpha * x_0 + sigma * noise
    v_target = alpha * noise - sigma * x_0
    v_pred = model(x_t, t)
    return F.mse_loss(v_pred, v_target)

Classifier-free guidance training

def train_step_cfg(model, x_0, condition, noise_schedule, p_uncond=0.1):
    """Train with random condition dropout for classifier-free guidance."""
    B = x_0.shape[0]
    # Randomly drop condition with probability p_uncond
    mask = torch.rand(B, device=x_0.device) < p_uncond
    condition_masked = condition.clone()
    condition_masked[mask] = 0  # or null embedding

    # Same noise_schedule interface as in train_step above
    t = torch.randint(0, noise_schedule.num_timesteps, (B,), device=x_0.device)
    noise = torch.randn_like(x_0)
    x_t = noise_schedule.q_sample(x_0, t, noise)

    pred = model(x_t, t, condition_masked)
    return F.mse_loss(pred, noise)

Diffusion model tips

EMA is essential — use EMA weights for inference (see EMA section below)
Large batch sizes work well (256-2048 for image diffusion)
AdamW with lr=1e-4, no weight decay on biases/norms
LR warmup is often unnecessary for diffusion models (a constant LR commonly works); add a short warmup if early training is unstable
Train for many steps — diffusion models are hungry (1M+ steps for ImageNet quality)
Monitor FID every N steps on a fixed set of samples, not every step (expensive)

---

EMA Models

Exponential Moving Average of weights produces smoother, higher-quality models for inference. Essential for diffusion models, also useful for any generative model or self-supervised learning.

class EMA:
    def __init__(self, model, decay=0.9999):
        self.decay = decay
        self.shadow = {name: param.clone().detach()
                       for name, param in model.named_parameters()}

    @torch.no_grad()
    def update(self, model):
        for name, param in model.named_parameters():
            self.shadow[name].lerp_(param.data, 1 - self.decay)

    def apply(self, model):
        """Swap model weights with EMA weights for inference."""
        self.backup = {name: param.clone()
                       for name, param in model.named_parameters()}
        for name, param in model.named_parameters():
            param.data.copy_(self.shadow[name])

    def restore(self, model):
        """Restore original weights after inference."""
        for name, param in model.named_parameters():
            param.data.copy_(self.backup[name])

Usage in training loop

ema = EMA(model, decay=0.9999)

for step, (x, y) in enumerate(train_loader):
    loss = model(x, y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    ema.update(model)  # update EMA after each step

    # For evaluation: temporarily swap to EMA weights
    if step % eval_interval == 0:
        ema.apply(model)
        val_metric = evaluate(model, val_loader)
        ema.restore(model)

EMA decay warmup

Start with lower decay and ramp up to avoid the EMA lagging during early fast learning:

def ema_decay_schedule(step, base_decay=0.9999):
    # (1 + step) / (10 + step) starts near 0.1 and rises toward 1, so the
    # decay is low early on and is capped at base_decay once the ratio exceeds it
    return min(base_decay, (1 + step) / (10 + step))

---

Contrastive / Self-Supervised Learning

SimCLR-style contrastive loss

def contrastive_loss(z1, z2, temperature=0.5):
    """NT-Xent loss for contrastive learning."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)

    B = z1.shape[0]
    z = torch.cat([z1, z2], dim=0)  # [2B, D]
    sim = z @ z.T / temperature     # [2B, 2B]

    # Mask out self-similarity
    mask = ~torch.eye(2 * B, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(~mask, -1e9)

    # Positive pairs: (i, i+B) and (i+B, i)
    labels = torch.cat([torch.arange(B, 2*B), torch.arange(B)], dim=0).to(z.device)
    return F.cross_entropy(sim, labels)

Key patterns for self-supervised

Two augmented views of same image → attract; different images → repel
Large batch sizes critical (4096+ for SimCLR) — more negatives = better
Projection head (MLP) between backbone and loss — discard after pretraining
LARS/LAMB optimizer for very large batch training
Momentum encoder (MoCo, BYOL) — use EMA of encoder as the target network

---

Fine-Tuning & Transfer Learning

Layer-wise LR decay

Earlier layers (closer to the input) get a smaller LR because their pretrained features are more generic and need less adaptation:

def get_layer_lrs(model, base_lr, decay_factor=0.65, num_layers=12):
    """Assign exponentially decaying LR to each layer."""
    param_groups = []
    for i in range(num_layers):
        lr = base_lr * (decay_factor ** (num_layers - 1 - i))
        layer_params = get_layer_params(model, i)  # implement per architecture
        param_groups.append({"params": layer_params, "lr": lr})

    # Head gets full LR
    param_groups.append({"params": model.head.parameters(), "lr": base_lr})
    return param_groups

Freezing strategies

# Strategy 1: Freeze all, unfreeze head only
for param in model.parameters():
    param.requires_grad = False
for param in model.head.parameters():
    param.requires_grad = True

# Strategy 2: Gradual unfreezing (from top layers down)
def unfreeze_layers(model, num_layers_to_unfreeze):
    layers = list(model.children())
    for layer in layers[-num_layers_to_unfreeze:]:
        for param in layer.parameters():
            param.requires_grad = True

# Strategy 3: LoRA (low-rank adaptation) — efficient for large models
# Only train small low-rank matrices added to existing weights
# Saves memory and prevents catastrophic forgetting

Fine-tuning tips

Lower LR than pretraining (10-100x smaller)
Shorter training (5-20 epochs typically)
Freeze BatchNorm statistics: model.eval() for BN layers but model.train() for dropout
Warmup is important — prevents destroying pretrained features early on

---

Multi-GPU / Distributed Training

DDP (DistributedDataParallel) — most common

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader

# Init process group
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Wrap model
model = model.to(local_rank)
model = DDP(model, device_ids=[local_rank])

# Use DistributedSampler for data
sampler = torch.utils.data.distributed.DistributedSampler(dataset)
loader = DataLoader(dataset, sampler=sampler, batch_size=per_gpu_batch)

# Remember to set epoch for proper shuffling
for epoch in range(num_epochs):
    sampler.set_epoch(epoch)

FSDP (Fully Sharded Data Parallel) — for large models

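from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy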

model = FSDP(
    model,
    auto_wrap_policy=size_based_auto_wrap_policy,  # wrap layers > threshold
    mixed_precision=MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    ),
)

Scaling rules

Linear scaling: When scaling batch_size by k, scale LR by k (up to a point)
Square root scaling: lr_new = lr_base * sqrt(batch_new / batch_base) — more conservative, often works better
Warmup: Scale warmup duration with batch size increase
Gradient accumulation: Equivalent to larger batch size without more GPUs
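
The scaling rules above reduce to a few lines of arithmetic. The sketch below compares linear and square-root LR scaling and derives the number of gradient accumulation steps needed to reach a target global batch size; the function names and example numbers are illustrative assumptions.

```python
import math


def scale_lr(base_lr, base_batch, new_batch, rule="linear"):
    """Scale the learning rate when the global batch size changes."""
    k = new_batch / base_batch
    if rule == "linear":
        return base_lr * k
    if rule == "sqrt":
        return base_lr * math.sqrt(k)
    raise ValueError(f"unknown rule: {rule}")


def accumulation_steps(target_batch, per_gpu_batch, num_gpus):
    """Micro-batches to accumulate so the effective batch equals target_batch."""
    per_step = per_gpu_batch * num_gpus
    if target_batch % per_step != 0:
        raise ValueError("target_batch must be a multiple of per_gpu_batch * num_gpus")
    return target_batch // per_step


# Hypothetical example: going from batch 256 at lr=0.1 to batch 1024.
print(scale_lr(0.1, 256, 1024, "linear"))  # 4x the base LR
print(scale_lr(0.1, 256, 1024, "sqrt"))    # 2x the base LR
print(accumulation_steps(1024, per_gpu_batch=32, num_gpus=8))  # 4
```

When accumulating, the loss of each micro-batch is typically divided by the number of accumulation steps so the summed gradient matches that of one large batch.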

---

Checkpointing

Save/load with proper state

def save_checkpoint(model, optimizer, scheduler, step, path):
    torch.save({
        'step': step,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'scheduler_state_dict': scheduler.state_dict() if scheduler else None,
        'rng_state': torch.cuda.get_rng_state(),
    }, path)

def load_checkpoint(model, optimizer, scheduler, path):
    ckpt = torch.load(path, map_location='cpu', weights_only=False)
    model.load_state_dict(ckpt['model_state_dict'])
    optimizer.load_state_dict(ckpt['optimizer_state_dict'])
    if scheduler and ckpt.get('scheduler_state_dict'):
        scheduler.load_state_dict(ckpt['scheduler_state_dict'])
    torch.cuda.set_rng_state(ckpt['rng_state'])
    return ckpt['step']

Best practices

Save every N steps (not just every epoch) — long epochs can lose hours of work
Keep last K checkpoints + best checkpoint (by val metric)
Save optimizer state for exact resumption — without it, training dynamics change
For DDP: save only on rank 0, load on all ranks. For FSDP: every rank must take part in gathering the state dict, even if only rank 0 writes the file
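
The "keep last K + best" rule can be enforced with a small pruning helper that runs after each save. This is a sketch that assumes checkpoints are named `step_<N>.pt`, which is a hypothetical naming convention rather than something the recipes above prescribe.

```python
import tempfile
from pathlib import Path


def prune_checkpoints(ckpt_dir, keep_last=3, best_step=None):
    """Delete all but the newest `keep_last` checkpoints and the best one.

    Assumes files are named step_<N>.pt. Returns the sorted kept steps.
    """
    ckpts = {int(p.stem.split("_")[1]): p for p in Path(ckpt_dir).glob("step_*.pt")}
    steps = sorted(ckpts)
    keep = set(steps[-keep_last:]) if keep_last > 0 else set()
    if best_step is not None and best_step in ckpts:
        keep.add(best_step)
    for step in steps:
        if step not in keep:
            ckpts[step].unlink()
    return sorted(keep)


# Demo on empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as d:
    for s in (100, 200, 300, 400, 500):
        (Path(d) / f"step_{s}.pt").touch()
    kept = prune_checkpoints(d, keep_last=2, best_step=200)
    remaining = sorted(p.name for p in Path(d).glob("*.pt"))
print(kept)  # [200, 400, 500]
```

In a distributed run this would be called on rank 0 only, right after the new checkpoint has been written.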

---

Data Loading for Images

Efficient ImageFolder with workers

from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

train_dataset = ImageFolder(root="data/train", transform=train_transform)

train_loader = DataLoader(
    train_dataset,
    batch_size=256,
    shuffle=True,
    num_workers=8,             # rule of thumb: 4 * num_GPUs
    pin_memory=True,           # faster CPU→GPU transfer
    persistent_workers=True,   # avoid re-spawning workers each epoch
    prefetch_factor=2,         # prefetch 2 batches per worker
    drop_last=True,            # avoid small last batch (bad for BatchNorm)
)

WebDataset for large-scale (millions of images)

import webdataset as wds

dataset = (
    wds.WebDataset("data/train-{000000..000099}.tar")
    .shuffle(1000)
    .decode("pil")
    .to_tuple("jpg", "cls")
    .map_tuple(train_transform, lambda x: x)
    .batched(256)
)

FFCV for maximum throughput

# FFCV can be 3-7x faster than standard PyTorch DataLoader
# Writes data to a custom binary format, then reads with zero-copy
from ffcv.loader import Loader, OrderOption
from ffcv.fields.decoders import IntDecoder, RandomResizedCropRGBImageDecoder
from ffcv.transforms import ToDevice, ToTensor

loader = Loader(
    "data/train.beton",
    batch_size=256,
    order=OrderOption.QUASI_RANDOM,
    num_workers=8,
    pipelines={
        "image": [RandomResizedCropRGBImageDecoder((224, 224))],
        "label": [IntDecoder(), ToTensor(), ToDevice(device)],
    },
)

---

LLM Data Loading

Pinned buffers for zero-copy transfers

# Pre-allocate pinned CPU + GPU buffers
cpu_buffer = torch.empty(2 * B * T, dtype=torch.long, pin_memory=True)
gpu_buffer = torch.empty(2 * B * T, dtype=torch.long, device="cuda")
gpu_buffer.copy_(cpu_buffer, non_blocking=True)

Best-fit packing (no padding)

Instead of padding sequences to fixed length (wastes compute), pack documents tightly:

1. Maintain a buffer of tokenized documents
2. For each row, greedily fit the largest document that fits remaining space
3. If nothing fits, crop the shortest to fill exactly
4. Every row starts with BOS token
5. Result: 100% utilization, no wasted tokens
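
The packing steps above can be sketched in plain Python. This version makes two assumptions that are not stated in the text: every tokenized document already begins with the BOS token (so a row that starts with a document start also starts with BOS), and the tail of a cropped document is discarded. The name `pack_row` and the token IDs are illustrative.

```python
def pack_row(buffer, row_len):
    """Fill one row of exactly `row_len` tokens from `buffer` (list of token lists).

    Documents that are used are removed from `buffer` in place.
    """
    row = []
    while len(row) < row_len:
        if not buffer:
            raise ValueError("buffer exhausted; refill it before packing")
        remaining = row_len - len(row)
        # Best fit: the largest document that still fits entirely.
        fits = [i for i, doc in enumerate(buffer) if len(doc) <= remaining]
        if fits:
            idx = max(fits, key=lambda i: len(buffer[i]))
            row.extend(buffer.pop(idx))
        else:
            # Nothing fits: crop the shortest document to fill the row exactly.
            idx = min(range(len(buffer)), key=lambda i: len(buffer[i]))
            row.extend(buffer.pop(idx)[:remaining])  # cropped tail is dropped
    return row


BOS = 0  # hypothetical BOS token id
docs = [[BOS, 1, 2, 3, 4], [BOS, 5, 6], [BOS, 7], [BOS, 8, 9, 10, 11, 12, 13]]
row1 = pack_row(docs, row_len=8)
row2 = pack_row(docs, row_len=8)
print(row1)  # [0, 8, 9, 10, 11, 12, 13, 0]
print(row2)  # [0, 1, 2, 3, 4, 0, 5, 6]
```

A production loader would refill the buffer from the tokenizer stream between rows and write rows straight into a pre-allocated tensor instead of building Python lists.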

Infinite iterators

def make_dataloader(split):
    """Yields (x, y, epoch) forever, cycling through data."""
    epoch = 1
    while True:
        for batch in data_source:
            x, y = process(batch)  # process is assumed to return an (x, y) pair
            yield x, y, epoch
        epoch += 1

---

Architecture Pattern Tables

Transformer / LLM components

Component	Recommended	Why
Normalization	RMSNorm	~same quality as LayerNorm, fewer ops
Position encoding	RoPE	Relative, extrapolates well, standard
Attention	Flash Attention 3	Memory-efficient, faster, exact
Activation	ReluSquared or SwiGLU	ReluSquared: sparse. SwiGLU: better quality
Residual	Learnable scaling + x0 skip	Stabilizes deep networks
Logit cap	Soft capping	`softcap * tanh(logits / softcap)`
Init	Zero-init output projections	Residual stream starts clean
KV heads	GQA	Saves memory with minimal quality loss

Vision (CNN / ViT) components

Component	Recommended	Why
Backbone	ConvNeXt v2 or ViT	ConvNeXt: modern CNN. ViT: scalable
Data augmentation	RandAugment + MixUp + CutMix	More impactful than architecture
Regularization	Stochastic depth + label smoothing	Better than dropout for vision
Optimizer	AdamW (ViT) / SGD+momentum (CNN)	ViTs need adaptive methods
Resolution	Progressive resizing	Train small → finetune large

Diffusion model components

Component	Recommended	Why
Architecture	U-Net or DiT	DiT scales better
Noise schedule	Cosine or flow matching	Flow matching: simpler, state-of-art
Loss	MSE on noise or v-prediction	v-prediction better at low SNR
EMA	Keep EMA model for inference	Higher quality samples
Sampling	DDIM / DPM-Solver++	Faster than DDPM

General supervised

Component	Recommended	Why
Optimizer	AdamW	Safe default
Early stopping	Patience 5-10 epochs	Prevents overfitting
Class imbalance	Weighted loss or oversampling	Weighted loss is simpler

---

BPB Evaluation for Language Models

@torch.no_grad()
def evaluate_bpb(model, val_loader, token_bytes):
    total_nats, total_bytes = 0.0, 0
    for x, y in val_loader:
        loss_per_token = F.cross_entropy(..., reduction='none').view(-1)
        nbytes = token_bytes[y.view(-1)]
        mask = nbytes > 0
        total_nats += (loss_per_token * mask).sum().item()
        total_bytes += nbytes.sum().item()
    return total_nats / (math.log(2) * total_bytes)
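
To make the unit conversion in `evaluate_bpb` concrete, here is a small worked example in plain Python. It mirrors the same arithmetic (sum of per-token losses in nats, divided by ln 2 times the total byte count) on made-up numbers; the function name and inputs are illustrative.

```python
import math


def bits_per_byte(token_losses_nats, token_byte_lengths):
    """Convert per-token losses (nats) into bits per byte of decoded text.

    Tokens with a byte length of 0 (e.g. special tokens) are excluded
    from the loss sum and add nothing to the byte count.
    """
    total_nats = sum(
        loss for loss, n in zip(token_losses_nats, token_byte_lengths) if n > 0
    )
    total_bytes = sum(token_byte_lengths)
    return total_nats / (math.log(2) * total_bytes)


# Four tokens, each with a loss of ln(2) nats, i.e. exactly 1 bit.
losses = [math.log(2)] * 4
nbytes = [4, 4, 0, 4]  # the third token is a special token with no bytes
bpb = bits_per_byte(losses, nbytes)
print(bpb)  # 3 bits over 12 bytes, about 0.25
```

Because the denominator is bytes rather than tokens, the metric stays comparable across tokenizers with different vocabulary sizes.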

EMA smoothed loss

ema_beta = 0.9
smooth_loss = 0
for step in range(num_steps):
    smooth_loss = ema_beta * smooth_loss + (1 - ema_beta) * loss.item()
    debiased = smooth_loss / (1 - ema_beta ** (step + 1))

Final summary format

Print structured output for easy parsing:

val_bpb:          0.997900
training_seconds: 300.1
peak_vram_mb:     45060.2
mfu_percent:      39.80
total_tokens_M:   499.6
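
A summary in this `key: value` form can be read back with a few lines of Python. This is a sketch of one possible parser; the function name is hypothetical, and it simply skips any line whose value is not a number.

```python
def parse_summary(stdout_text):
    """Parse 'key: value' lines into a dict of floats; ignore everything else."""
    metrics = {}
    for line in stdout_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # no colon on this line
        try:
            metrics[key.strip()] = float(value.strip())
        except ValueError:
            continue  # value is not numeric, e.g. a log message
    return metrics


sample = """step 100 done
val_bpb:          0.997900
training_seconds: 300.1
peak_vram_mb:     45060.2
mfu_percent:      39.80
total_tokens_M:   499.6
"""
metrics = parse_summary(sample)
print(metrics["val_bpb"], metrics["peak_vram_mb"])
```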

Autonomous Experiment Loop (autoresearch pattern)

A systematic workflow for rapid ML experimentation, drawn from Karpathy's autoresearch project. Use this when iterating on architecture or hyperparameters and you want to run many quick experiments.

Core idea

Run every experiment with a fixed time budget (e.g., 5 minutes) so results are directly comparable. This enables ~12 experiments/hour or ~100 overnight. The key insight: wall-clock time is a better budget unit than steps or epochs because it naturally accounts for throughput differences between configs.

The experiment loop

1. Read current state (results.tsv, train.py)
2. Decide what to try next (one change at a time)
3. Modify train.py
4. git commit -m "description of change"
5. Run training (with timeout)
6. Parse results from stdout
7. Decision:
   - If val_bpb improved → KEEP (advance branch)
   - If val_bpb worsened → DISCARD (git reset --hard HEAD~1)
   - If crashed → FIX trivial bugs and retry, or LOG and move on
8. Append result to results.tsv
9. Repeat

Results tracking

commit    val_bpb   memory_gb  status   description
a1b2c3d   0.9979    44.0       keep     baseline
b2c3d4e   0.9932    44.2       keep     increase matrix LR to 0.04
c3d4e5f   1.0050    44.0       discard  switch to GeLU activation
d4e5f6g   0.0000    0.0        crash    double model width (OOM)

Key principles

Single-file constraint

Confine all changes to one file (e.g., train.py). This makes diffs reviewable and rollbacks clean. Everything — model, optimizer, data loading, evaluation — lives in one file during experimentation. Refactor into modules only after the experiment phase.

Keep/discard discipline

Keep: val metric improved (or equal with less memory/time)
Discard: val metric worsened, regardless of how clever the idea was
The simplicity criterion: all else being equal, simpler is better. Removing something and getting equal results is a great outcome: it means the removed thing was dead weight.
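
The keep/discard rule can be written down as a tiny pure function so the decision is never made by feel. This sketch encodes only the metric and memory criteria; the simplicity criterion still needs human judgment. The function name and argument order are assumptions.

```python
def decide(best_bpb, new_bpb, best_mem_gb, new_mem_gb, crashed=False):
    """Return 'keep', 'discard', or 'crash' for one experiment (lower bpb is better)."""
    if crashed:
        return "crash"
    if new_bpb < best_bpb:
        return "keep"
    if new_bpb == best_bpb and new_mem_gb < best_mem_gb:
        return "keep"  # same quality with less memory
    return "discard"


# Rows from the results table above.
print(decide(0.9979, 0.9932, 44.0, 44.2))          # keep
print(decide(0.9932, 1.0050, 44.2, 44.0))          # discard
print(decide(0.9932, 0.0, 44.2, 0.0, crashed=True))  # crash
```

With noisy short runs, an exact equality check rarely triggers; a small tolerance on `new_bpb` would be a reasonable extension.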

Crash recovery

Trivial crash (typo, shape mismatch): fix and retry the same experiment
Fundamental crash (OOM, numerical instability): log as crash, move on
Timeout (>2x budget): kill the process, log as timeout
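
The timeout rule can be enforced from the driver script with the standard library. This is a minimal sketch that runs a command, kills it once it exceeds twice the time budget, and classifies the outcome; the function name, the status strings, and the demo commands are illustrative.

```python
import subprocess
import sys


def run_with_budget(cmd, time_budget_s, kill_factor=2.0):
    """Run `cmd` and return (status, stdout).

    status is 'ok', 'crash' (non-zero exit), or 'timeout' (killed after
    kill_factor * time_budget_s seconds).
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True,
            timeout=time_budget_s * kill_factor,
        )
    except subprocess.TimeoutExpired:
        return "timeout", ""
    status = "ok" if proc.returncode == 0 else "crash"
    return status, proc.stdout


# Demo with tiny stand-in commands instead of a real training script.
ok = run_with_budget([sys.executable, "-c", "print('val_bpb: 0.99')"], 30)
crash = run_with_budget([sys.executable, "-c", "raise SystemExit(1)"], 30)
slow = run_with_budget([sys.executable, "-c", "import time; time.sleep(5)"], 0.1)
print(ok[0], crash[0], slow[0])
```

Distinguishing a trivial crash from a fundamental one still requires reading stderr, which this sketch captures but does not return.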

Fixed budget comparison

import time

TIME_BUDGET = 300  # 5 minutes
t_start = time.time()

for step in range(max_steps):
    # ... training step ...
    elapsed = time.time() - t_start
    if elapsed >= TIME_BUDGET:
        break

Tokenizer training

When training from scratch, train a BPE tokenizer on your data:

import rustbpe

# GPT-4 split pattern (handles code, numbers, whitespace well)
SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""

# Train tokenizer
tokenizer = rustbpe.Tokenizer()
tokenizer.train(
    text_iterator,        # yields text chunks
    vocab_size=8192,      # small vocab for quick experiments; 32K+ for production
    split_pattern=SPLIT_PATTERN,
    special_tokens=["<|bos|>"]
)

# Build token_bytes lookup for BPB evaluation
token_bytes = torch.zeros(vocab_size, dtype=torch.long)
for i in range(vocab_size):
    token_bytes[i] = len(tokenizer.decode([i]).encode("utf-8"))

Vocab size tradeoffs

Vocab Size	Use Case	Notes
4K-8K	Quick experiments, small models	Faster tokenizer training, more tokens per doc
32K	Standard LLM pretraining	Good balance of compression and vocab coverage
64K-128K	Multilingual, code-heavy	Better compression but larger embedding table

Data preparation

Shard-based train/val split

from glob import glob

# Use last shard as validation (always the same data for consistent eval)
shard_files = sorted(glob("data/shard_*.bin"))
val_shard = shard_files[-1]       # pinned validation
train_shards = shard_files[:-1]   # everything else

Split by shard, not by random sampling. Holding out a fixed shard keeps the val set deterministic across experiments and avoids train/val overlap, provided the same document does not appear in more than one shard.

Environment setup

import os
os.environ["PYTORCH_ALLOC_CONF"] = "expandable_segments:True"  # BEFORE torch import
os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"               # clean logs

import torch

Setting PYTORCH_ALLOC_CONF before importing torch is important — it configures the CUDA allocator at initialization time.

Optimizer Patterns Reference

Deep dive into optimizer configurations for modern LLM training. Referenced from the main SKILL.md.

1. AdamW Best Practices
2. Muon Optimizer
3. Hybrid MuonAdamW
4. Per-Parameter-Group Configuration
5. LR Scaling Rules
6. Weight Decay Strategies
7. Momentum Scheduling
8. Compiled Optimizer Steps

---

AdamW Best Practices

AdamW (decoupled weight decay) is the baseline optimizer for everything that isn't a 2D matrix in modern LLM training.

# Typical hyperparameters for LLM pretraining
optimizer = torch.optim.AdamW(
    params,
    lr=3e-4,
    betas=(0.9, 0.95),    # β1=0.9, β2=0.95 (not the default 0.999)
    eps=1e-8,
    weight_decay=0.1,
)

Key differences from default PyTorch AdamW

β2 = 0.95 (not 0.999): Faster adaptation to changing gradient statistics. The default 0.999 has a ~1000-step memory, too slow for the rapidly changing loss landscape of LLM training.

β1 = 0.8-0.9: Some modern recipes use 0.8 for faster momentum.
eps = 1e-10 (not 1e-8): Smaller epsilon for bf16 training where gradients can be very small. autoresearch uses 1e-10; 1e-8 can cause stale updates when gradient second moments are tiny.
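
The "~1000-step memory" figure follows from the usual rule of thumb that an exponential moving average with coefficient β averages over roughly 1 / (1 − β) steps. The sketch below computes that horizon and the half-life for both β2 values; the function names are illustrative.

```python
import math


def ema_horizon(beta):
    """Approximate number of recent steps an EMA with coefficient beta averages over."""
    return 1.0 / (1.0 - beta)


def ema_half_life(beta):
    """Steps until the weight of an old gradient statistic has halved."""
    return math.log(0.5) / math.log(beta)


for beta2 in (0.999, 0.95):
    print(beta2, round(ema_horizon(beta2)), round(ema_half_life(beta2), 1))
```

With β2 = 0.95 the second-moment estimate tracks roughly the last 20 steps, compared with about 1000 for the default.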

Fused step (for torch.compile)

To avoid recompilation when hyperparameters change, use 0-D CPU tensors:

# Create once at init
self._lr_t = torch.tensor(0.0, dtype=torch.float32, device="cpu")

# Fill before each step (no recompile)
self._lr_t.fill_(group['lr'])

@torch.compile(dynamic=False, fullgraph=True)
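def adamw_step_fused(p, grad, exp_avg, exp_avg_sq, step_t, lr_t, beta1_t, beta2_t, eps_t, wd_t):
    # Decoupled weight decay, applied before the Adam update
    p.mul_(1 - lr_t * wd_t)
    # Exponential moving averages of the gradient and its square
    exp_avg.lerp_(grad, 1 - beta1_t)
    exp_avg_sq.lerp_(grad.square(), 1 - beta2_t)
    # Bias corrections for the zero-initialized averages
    bias1 = 1 - beta1_t ** step_t
    bias2 = 1 - beta2_t ** step_t
    denom = (exp_avg_sq / bias2).sqrt() + eps_t
    # Fold the tensor step size into the update; alpha= takes a scalar, not a tensor
    p.add_(exp_avg / denom * (-lr_t / bias1))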
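As a sanity check on the update rule, the same arithmetic can be mirrored on plain Python floats. This is only an illustrative sketch (the name `adamw_step_scalar` is hypothetical), not the compiled kernel. It shows that on the first step the bias corrections cancel the moving-average factors, so the update is close to lr × sign(grad).

```python
def adamw_step_scalar(p, grad, exp_avg, exp_avg_sq, step, lr, beta1, beta2, eps, wd):
    """One AdamW step on scalars, in the same order as the fused step."""
    p = p * (1 - lr * wd)                                # decoupled weight decay
    exp_avg = exp_avg + (1 - beta1) * (grad - exp_avg)   # lerp toward grad
    exp_avg_sq = exp_avg_sq + (1 - beta2) * (grad * grad - exp_avg_sq)
    bias1 = 1 - beta1 ** step                            # bias corrections
    bias2 = 1 - beta2 ** step
    denom = (exp_avg_sq / bias2) ** 0.5 + eps
    p = p - (lr / bias1) * exp_avg / denom
    return p, exp_avg, exp_avg_sq


if __name__ == "__main__":
    # Hypothetical values; weight decay is zero so the Adam term is isolated.
    p, m, v = adamw_step_scalar(
        p=1.0, grad=0.5, exp_avg=0.0, exp_avg_sq=0.0, step=1,
        lr=3e-4, beta1=0.9, beta2=0.95, eps=1e-10, wd=0.0,
    )
    print(1.0 - p)  # close to lr, regardless of the gradient's magnitude
```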

---

Muon Optimizer

Muon is designed for 2D matrix (weight) parameters. It uses Nesterov momentum followed by "Polar Express" orthogonalization — a fast Newton-Schulz iteration that approximates the matrix polar decomposition (finding the nearest orthogonal matrix to the gradient).

Why orthogonalize gradients?

Raw gradient and momentum matrices tend to be dominated by a few large singular directions, so plain updates are effectively low-rank and rarely used directions learn slowly. Orthogonalizing the update sets all of its singular values to roughly 1, so every direction gets a step of similar size. Think of it as giving every update direction "equal voice."
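To see what orthogonalizing does numerically, the sketch below runs the classical cubic Newton-Schulz iteration X ← 1.5·X − 0.5·X·Xᵀ·X on a 2×2 matrix in pure Python. Note that this is the textbook iteration, swapped in for clarity; it is not the tuned Polar Express polynomial used by Muon, and the helper names are illustrative.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def newton_schulz_orthogonalize(g, steps=12):
    """Approximate the polar factor of g (the nearest orthogonal matrix).

    Uses the classical cubic iteration, which converges when all singular
    values of the normalized input lie in (0, sqrt(3)).
    """
    fro = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / fro for v in row] for row in g]  # singular values are now <= 1
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxtx[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x


if __name__ == "__main__":
    # Singular values 2 and 1: the raw update is twice as strong in one direction.
    g = [[0.0, 2.0], [1.0, 0.0]]
    q = newton_schulz_orthogonalize(g)
    print([[round(v, 6) for v in row] for row in q])
```

For this input both singular values are driven to 1, so the result is approximately [[0, 1], [1, 0]]: the direction of the update is kept while its per-direction scale is equalized.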

Core algorithm

1. Nesterov momentum: Standard momentum with look-ahead
2. Polar Express: Newton-Schulz iterations to orthogonalize the momentum-smoothed gradient matrix
3. NorMuon: Variance reduction that normalizes per-row or per-column
4. Cautious weight decay: Only decay weights where the update and the parameter have the same sign

@torch.compile(dynamic=False, fullgraph=True)
def muon_step_fused(grads, params, momentum_buf, second_momentum_buf,
                    momentum, lr, wd, beta2, ns_steps, red_dim):
    # 1. Nesterov momentum
    momentum_buf.lerp_(grads, 1 - momentum)
    g = grads.lerp_(momentum_buf, momentum)

    # 2. Polar Express (Newton-Schulz orthogonalization)
    X = g.bfloat16()
    X = X / (X.norm(dim=(-2, -1), keepdim=True) * 1.02 + 1e-6)
    coeffs = [  # Pre-computed optimal coefficients
        (8.16, -22.48, 15.88),
        (4.04, -2.81, 0.50),
        (3.89, -2.77, 0.51),
        (3.29, -2.37, 0.46),
        (2.35, -1.71, 0.42),
    ]
    # Choose which dimension to contract based on matrix shape
    if g.size(-2) > g.size(-1):  # tall matrix
        for a, b, c in coeffs[:ns_steps]:
            A = X.mT @ X
            B = b * A + c * (A @ A)
            X = a * X + X @ B
    else:  # wide matrix
        for a, b, c in coeffs[:ns_steps]:
            A = X @ X.mT
            B = b * A + c * (A @ A)
            X = a * X + B @ X
    g = X

    # 3. NorMuon variance reduction
    v_mean = g.float().square().mean(dim=red_dim, keepdim=True)
    second_momentum_buf.lerp_(v_mean, 1 - beta2)
    step_size = second_momentum_buf.clamp_min(1e-10).rsqrt()
    # Normalize so total norm is preserved
    ...

    # 4. Cautious weight decay + update
    mask = (g * params) >= 0  # only decay where gradient agrees
    params.sub_(lr * g + lr * wd * params * mask)

Muon hyperparameters

Parameter	Typical Value	Notes
lr	0.02-0.04	Scaled by `max(1, rows/cols)^0.5` for non-square matrices
momentum	0.95	Warm up from 0.85 over first 300 steps
ns_steps	5	Number of Newton-Schulz iterations (more = better approx, slower)
beta2	0.95	For second moment tracking in NorMuon
weight_decay	0.1-0.2	Cautious (only where gradient agrees with param)

---

Hybrid MuonAdamW

The key insight: different parameter types benefit from different optimization strategies.

Parameter Type	Optimizer	Why
2D weight matrices (attention, MLP)	Muon	Benefits from orthogonalization
Token embeddings	AdamW	Sparse updates, not a matrix transform
Unembedding (lm_head)	AdamW	Needs lower LR for stability
Per-layer scalars	AdamW	Too small for matrix methods
Value embeddings	AdamW	Same as token embeddings

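class MuonAdamW(torch.optim.Optimizer):
    def __init__(self, param_groups):
        # Every group carries its own hyperparameters, so no shared defaults
        super().__init__(param_groups, defaults={})

    @torch.no_grad()
    def step(self):
        # Dispatch each group to the update rule named by its 'kind' key
        for group in self.param_groups:
            if group['kind'] == 'adamw':
                self._step_adamw(group)
            elif group['kind'] == 'muon':
                self._step_muon(group)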

Grouping Muon parameters

Group Muon parameters by shape for efficient stacked updates:

# Group same-shape params together
for shape in sorted({p.shape for p in matrix_params}):
    group_params = [p for p in matrix_params if p.shape == shape]
    param_groups.append({
        'kind': 'muon',
        'params': group_params,
        'lr': matrix_lr,
        'momentum': 0.95,
        'ns_steps': 5,
    })

This enables torch.stack for vectorized Newton-Schulz across all params of the same shape.

---

Per-Parameter-Group Configuration

A complete optimizer setup for modern LLM training:

def setup_optimizer(model, d_model=768):
    lr_scale = (d_model / 768) ** -0.5

    param_groups = [
        # Unembedding: low LR, no weight decay
        {
            'kind': 'adamw',
            'params': list(model.lm_head.parameters()),
            'lr': 0.004 * lr_scale,
            'betas': (0.8, 0.95),
            'eps': 1e-10,
            'weight_decay': 0.0,
        },
        # Token embeddings: higher LR (sparse updates need bigger steps)
        {
            'kind': 'adamw',
            'params': list(model.wte.parameters()),
            'lr': 0.6 * lr_scale,
            'betas': (0.8, 0.95),
            'eps': 1e-10,
            'weight_decay': 0.0,
        },
        # Transformer matrices: Muon
        {
            'kind': 'muon',
            'params': list(model.transformer.h.parameters()),
            'lr': 0.04,
            'momentum': 0.95,
            'ns_steps': 5,
            'beta2': 0.95,
            'weight_decay': 0.2,
        },
        # Per-layer scalars: separate AdamW
        {
            'kind': 'adamw',
            'params': [model.resid_lambdas],
            'lr': 0.005 * lr_scale,
            'betas': (0.8, 0.95),
            'eps': 1e-10,
            'weight_decay': 0.0,
        },
    ]

    optimizer = MuonAdamW(param_groups)
    # Store initial LR for scheduling
    for group in optimizer.param_groups:
        group["initial_lr"] = group["lr"]
    return optimizer

---

LR Scaling Rules

By model dimension

As models get wider, per-parameter learning rates should decrease:

lr_effective = lr_base * (d_model / d_reference) ^ (-0.5)

This comes from the observation that larger matrices amplify gradient norms. Scaling by 1/√d keeps the effective step size constant across model sizes.

By matrix shape (Muon specific)

Non-square matrices need LR adjustment:

effective_lr = lr * max(1.0, rows / cols) ** 0.5

This compensates for the asymmetry in the orthogonalization process.
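Both scaling rules are simple enough to check by hand. The sketch below wraps them in two helper functions; the function names and the example dimensions are illustrative assumptions, while the formulas are the ones stated above.

```python
def width_scaled_lr(lr_base: float, d_model: int, d_reference: int = 768) -> float:
    """Scale a learning rate by (d_model / d_reference) ** -0.5."""
    return lr_base * (d_model / d_reference) ** -0.5


def muon_shape_scaled_lr(lr: float, rows: int, cols: int) -> float:
    """Scale a Muon learning rate by sqrt(max(1, rows / cols))."""
    return lr * max(1.0, rows / cols) ** 0.5


if __name__ == "__main__":
    # A model 4x wider than the reference halves the base learning rate.
    print(width_scaled_lr(0.004, d_model=3072))
    # A tall 3072 x 768 matrix doubles the Muon learning rate...
    print(muon_shape_scaled_lr(0.04, rows=3072, cols=768))
    # ...while a wide 768 x 3072 matrix leaves it unchanged.
    print(muon_shape_scaled_lr(0.04, rows=768, cols=3072))
```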

---

Weight Decay Strategies

Linear decay to zero

def get_weight_decay(progress, base_wd):
    # progress runs from 0.0 (start of training) to 1.0 (end)
    return base_wd * (1 - progress)

Rationale: early in training, regularization prevents overfitting to initial data distribution. Late in training, we want the model to fully commit to learned features.

Cautious weight decay (Muon)

Only apply weight decay where the gradient and parameter have the same sign:

mask = (gradient * parameter) >= 0
parameter -= lr * (gradient + wd * parameter * mask)

This prevents weight decay from fighting the gradient: if the gradient step would grow a weight's magnitude (gradient and parameter have opposite signs) while weight decay would shrink it, cautious WD skips the decay for that element.
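A two-element example makes the masking visible. This is a minimal sketch on plain Python lists with hypothetical values for lr and wd; the function name `cautious_decay_step` is illustrative.

```python
def cautious_decay_step(params, grads, lr, wd):
    """Apply p -= lr * (g + wd * p * mask), with mask = 1 where g * p >= 0."""
    updated = []
    for p, g in zip(params, grads):
        mask = 1.0 if g * p >= 0 else 0.0  # decay only when it agrees with the step
        updated.append(p - lr * (g + wd * p * mask))
    return updated


if __name__ == "__main__":
    params = [1.0, 1.0]
    grads = [0.5, -0.5]  # first shrinks the weight, second grows it
    print(cautious_decay_step(params, grads, lr=0.1, wd=0.1))
```

The first weight moves to about 0.94 (gradient step plus decay), while the second moves to about 1.05 (gradient step only, decay skipped).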

What to weight-decay

Yes: Transformer weight matrices (attention projections, MLP weights)
No: Embeddings, biases, layer norm parameters, per-layer scalars

---

Momentum Scheduling

Warm up momentum over the first few hundred steps:

def get_muon_momentum(step, warmup_steps=300):
    frac = min(step / warmup_steps, 1.0)
    return 0.85 + frac * (0.95 - 0.85)  # 0.85 → 0.95

Lower momentum early in training allows faster adaptation when the loss landscape is changing rapidly. As training stabilizes, higher momentum smooths the updates.

---

Compiled Optimizer Steps

When using torch.compile, avoid recompilation from changing scalar values by using 0-D tensors:

class CompiledOptimizer:
    def __init__(self):
        # 0-D CPU tensors: changing their values doesn't trigger recompile
        self._lr = torch.tensor(0.0, dtype=torch.float32, device="cpu")
        self._wd = torch.tensor(0.0, dtype=torch.float32, device="cpu")

    def step(self, group):
        self._lr.fill_(group['lr'])        # update value in place
        self._wd.fill_(group['weight_decay'])
        # compiled_step: a @torch.compile'd update function defined elsewhere
        for p in group['params']:
            if p.grad is not None:
                compiled_step(p, p.grad, self._lr, self._wd)  # no recompile

This is critical for training loops where LR changes every step — without this pattern, torch.compile would recompile the optimizer step function every time the LR changes, defeating the purpose of compilation.

Scaling Laws & Architecture Selection Reference

Detailed decision frameworks for choosing architectures based on data scale, compute budget, and task type. Referenced from SKILL.md.

1. Scaling Laws
2. Architecture Decision Tree
3. Data Scale Thresholds
4. Compute Budget Planning
5. Optimizer Selection Guide
6. Training Instability at Scale
7. Key References

---

Scaling Laws

Chinchilla (Hoffmann et al., 2022)

The most important scaling law for LLM training:

For compute-optimal training: N (params) and D (tokens) should scale equally with compute. The ratio is approximately 20 tokens per parameter.

FLOPs ≈ 6 × N × D

Where:
  N = number of parameters
  D = number of training tokens
  6 = forward (2) + backward (4) FLOPs per parameter per token
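Combining FLOPs ≈ 6 × N × D with the ~20 tokens-per-parameter ratio gives a closed form for sizing a run from a compute budget. The sketch below solves for N and D; the function names are illustrative, and the ratio is passed as a parameter so other strategies can be tried.

```python
import math


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training cost: 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOPs budget into (params, tokens) using C = 6*N*D and D = r*N."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    budget = training_flops(7e9, 140e9)  # cost of a 7B model on 140B tokens
    n, d = chinchilla_optimal(budget)
    print(f"budget={budget:.2e} FLOPs -> N={n:.2e} params, D={d:.2e} tokens")
```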

Chinchilla vs Inference-Optimal

Strategy	Tokens/Param	When to use	Example
Chinchilla-optimal	~20x	Research, one-time compute	7B model → 140B tokens
Inference-optimal	100-200x	Production deployment	7B model → 700B-1.4T tokens

The LLaMA philosophy: deploy smaller models trained on more data, because inference is the ongoing cost while training is a one-time cost.

Beyond Chinchilla

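Muennighoff et al. (2023): repeating data for up to about 4 epochs changes the loss only negligibly compared with training on the same amount of unique data. Beyond that, returns diminish, and the value of further repetition eventually decays toward zero. The paper models this with an effective data size D' = U + U × R* × (1 - e^{-R/R*}), where U is the number of unique tokens, R is the number of repetitions (epochs - 1), and R* ≈ 15 is a fitted constant.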

Over-training smaller models is now standard practice for production (LLaMA, Mistral, Phi)
Data quality >> data quantity (Llama 3 finding): aggressive dedup + quality filtering > raw scale

---

Architecture Decision Tree

Master flowchart by data type

What is your data type?
│
├─ IMAGES / VIDEO
│   ├─ Data < 10K → Pretrained CNN (ResNet/EfficientNet) + fine-tune head
│   ├─ Data 10K-1M → Pretrained ViT fine-tune OR CNN fine-tune (both viable)
│   ├─ Data > 1M → ViT or hybrid (ConvNeXt, CoAtNet) from scratch
│   └─ Video → Video Swin Transformer or TimeSformer (pretrained)
│
├─ TEXT / NLP
│   ├─ Classification/NER → Fine-tune encoder (BERT/RoBERTa)
│   ├─ Generation → Fine-tune decoder (GPT/LLaMA)
│   ├─ Seq2seq (translation) → Fine-tune T5/BART
│   ├─ Data < 1K examples → Few-shot with large LLM (no training)
│   ├─ Seq length > 8K → Consider Mamba-hybrid or long-context Transformer
│   └─ Tight inference budget → Distilled model, RWKV, or Mamba
│
├─ TABULAR
│   ├─ Rows < 50K → XGBoost / LightGBM (NOT deep learning)
│   ├─ Rows 50K-500K → GBM still strong; try FT-Transformer as comparison
│   └─ Rows > 500K → Neural methods viable; benchmark both
│
├─ TIME SERIES
│   ├─ Univariate, short horizon → ARIMA / Prophet / simple LSTM
│   ├─ Multivariate, medium data → LSTM/GRU or N-BEATS
│   ├─ Long sequences / many series → PatchTST / Informer / Mamba
│   └─ Foundation model exists → TimesFM or Chronos (fine-tune)
│
├─ AUDIO / SPEECH
│   ├─ Speech recognition → Whisper (pretrained) + fine-tune
│   ├─ Audio classification → AST or CNN on spectrograms
│   └─ Long audio → Mamba / SSM variants
│
├─ GRAPH DATA
│   └─ GNN (GCN, GAT, GraphSAGE); Transformer-on-graphs for large graphs
│
└─ MULTIMODAL
    └─ CLIP-style (vision+text), or unified Transformer (Gemini-style)

Compute budget flowchart

How much compute do you have?
│
├─ Single GPU, < 1 day
│   → Models < 500M params
│   → Fine-tune pretrained, don't train from scratch
│   → LoRA/QLoRA for large model fine-tuning
│
├─ Single GPU, 1-7 days
│   → Up to 1B params from scratch
│   → Or fine-tune up to 7B with QLoRA
│
├─ Multi-GPU (4-8), 1-7 days
│   → Up to 3B from scratch
│   → Or fine-tune up to 13B
│   → Use DDP for data parallel
│
├─ Cluster (32+ GPUs), weeks
│   → 7B+ from scratch
│   → Apply Chinchilla scaling: 20 tokens/param minimum
│   → Use FSDP or Pipeline Parallel
│
└─ Massive cluster (100s of GPUs), months
    → 70B+ models
    → Full 5-way parallelism (TP + PP + DP + EP + CP)
    → Chinchilla ratios critical

---

Data Scale Thresholds

Vision: CNN vs ViT crossover points

Dataset Size	Winner	Notes
< 5K images	Pretrained CNN	ViT overfits without pretraining
5K-50K	Fine-tuned ViT ≈ CNN	Both viable, ViT needs pretraining (ImageNet-21k)
50K-500K	ViT with pretraining edges ahead	Hybrid architectures (CoAtNet) excel
> 1M	ViT from scratch viable	ViT-L/H outperform CNNs
> 10M	ViT clearly dominates	Original ViT paper showed this on JFT-300M

Key insight: transfer learning erases the gap. A ViT pretrained on large data and fine-tuned on small data can beat a CNN trained from scratch on that small data.

NLP: model size thresholds

Task Data Size	Approach
< 100 examples	Few-shot prompting (no training)
100-1K	Fine-tune small model (BERT-base) or LoRA on large model
1K-10K	Full fine-tune medium model
10K-100K	Train domain-specific model or continue pretraining
> 100K	Scale up model size with data per Chinchilla

Tabular: the tree boundary

Grinsztajn et al. (2022): "Why do tree-based models still outperform deep learning on typical tabular data?"

Dataset Rows	Recommendation
< 10K	XGBoost/LightGBM (no debate)
10K-50K	Trees almost always win. Neural barely competitive
50K-500K	Neural (FT-Transformer, TabNet) becomes viable
> 500K	Both competitive; neural can win with high-cardinality features

This is one of the most robust findings in ML — neural networks rarely beat gradient boosted trees on typical tabular data under ~50K rows.

Time series thresholds

Data Scale	Architecture
< 1K sequences	Classical (ARIMA, Prophet) or simple LSTM
1K-100K	LSTM/GRU competitive. Transformers become viable
> 100K	Transformer variants or Mamba for long-horizon

---

Compute Budget Planning

FLOPs estimates by model size

Model Size	Tokens (Chinchilla)	Training FLOPs	A100 GPU-hours (est.)
125M	2.5B	1.9e18	~6h
350M	7B	1.5e19	~48h
1B	20B	1.2e20	~385h
7B	140B	5.9e21	~19,000h
13B	260B	2.0e22	~65,000h
70B	1.4T	5.9e23	~1.9M h
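The GPU-hour column can be approximated from the FLOPs column. The sketch below assumes the A100's dense bf16 peak of 312 TFLOP/s and an MFU of about 28%; the MFU value is an assumption back-solved from the table rows, not a number stated in the source, and the function name is illustrative.

```python
A100_BF16_PEAK_FLOPS = 312e12  # dense bf16 peak of one A100


def gpu_hours(n_params, n_tokens, mfu=0.28, peak_flops=A100_BF16_PEAK_FLOPS):
    """Estimate GPU-hours from 6*N*D FLOPs and a sustained-throughput fraction.

    mfu (model FLOPs utilization) is an assumed value; measure it for real runs.
    """
    total_flops = 6.0 * n_params * n_tokens
    sustained_flops_per_second = peak_flops * mfu
    return total_flops / sustained_flops_per_second / 3600.0


if __name__ == "__main__":
    for n_params, n_tokens in [(125e6, 2.5e9), (1e9, 20e9), (7e9, 140e9)]:
        print(f"{n_params:.0e} params: ~{gpu_hours(n_params, n_tokens):,.0f} GPU-hours")
```

Because the estimate is linear in 1/MFU, doubling utilization halves the GPU-hours, which is why MFU is worth measuring before committing to a budget.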

Memory estimation

Rule of thumb for model memory (bf16 training):

Total VRAM ≈ 16-20 × N_params (in bytes)

Breakdown:
  Model weights (bf16):     2 × N bytes
  Gradients (bf16):         2 × N bytes
  Optimizer states (Adam):  8 × N bytes (fp32 first+second moments)
  Activations:              varies (~4-8 × N)

Example: 1B params → ~16-20 GB VRAM minimum
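Summing the listed components (2 + 2 + 8 bytes per parameter, plus roughly 4 to 8 for activations) can be written as a small estimator. This is a rough sketch under the breakdown's own assumptions; the function name is illustrative, and real activation memory depends on batch size and sequence length rather than on parameter count alone.

```python
def training_memory_gb(n_params, activation_bytes_per_param=(4.0, 8.0)):
    """Return a (low, high) VRAM estimate in GB for bf16 training with Adam."""
    weights = 2.0  # bf16 weights
    grads = 2.0    # bf16 gradients
    adam = 8.0     # fp32 first and second moments
    fixed = weights + grads + adam
    low, high = activation_bytes_per_param
    return (n_params * (fixed + low) / 1e9, n_params * (fixed + high) / 1e9)


if __name__ == "__main__":
    low, high = training_memory_gb(1e9)
    print(f"1B params: ~{low:.0f}-{high:.0f} GB")
```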

Techniques to reduce:

Gradient checkpointing: -50-70% activation memory, +30% compute
8-bit optimizer: stores Adam states in 8 bits instead of 32, up to ~75% less optimizer state memory
FSDP: shard across GPUs
QLoRA: 4-bit base + LoRA adapters

---

Optimizer Selection Guide

Optimizer	Best For	Memory	Notes
AdamW	Default for everything	2× params	β1=0.9, β2=0.95 for LLMs
8-bit Adam (bitsandbytes)	Memory-constrained	~1.3× params	Near-identical quality
Adafactor	Very large models	~1× params	Factorizes second moment
SGD+momentum	CNNs on vision	1× params	Needs more LR tuning
Muon	Transformer matrices	~2× params	Orthogonal updates, emerging
LAMB/LARS	Very large batch (>32K)	2× params	Scales LR per-layer for stability
Lion (Google)	Worth trying	1× params	Sign-based, less memory than Adam
Schedule-Free Adam	Simplicity	2× params	No LR schedule needed
SOAP	LLM training	~2× params	Shampoo-like but practical

When to use what

Default: AdamW. Always works, well-understood, vast literature.
Memory pressure: 8-bit Adam or Adafactor.
Very large batches: LAMB/LARS (linear scaling rule breaks down otherwise).
Cutting-edge LLM: Muon for matrix params + AdamW for embeddings (autoresearch pattern).
Simplicity: Schedule-Free Adam — eliminates LR schedule entirely.

---

Training Instability at Scale

Common failure modes observed in large-scale training (OPT-175B, BLOOM, PaLM, Llama):

Failure	Symptom	Fix
Loss spikes	Sudden loss jump, may or may not recover	Reduce LR, skip batch, rollback to earlier checkpoint (PaLM strategy)
Slow divergence	Loss gradually increases	Data quality issue or LR too high
Embedding collapse	All embeddings converge to similar values	Add embedding LayerNorm, reduce embedding LR
Attention entropy collapse	Attention uniform or one-hot	z-loss regularization, QK-norm
NaN in fp16	Training crashes	Switch to bf16, or reorder normalization before matmul

PaLM loss spike strategy

When a loss spike is detected:
1. Roll back to the last checkpoint before the spike
2. Skip the data batch that caused the spike
3. Optionally reduce LR temporarily, then ramp back up
4. Resume training

This is now standard practice at most large-scale training labs.
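The detection step can be sketched as a rolling-average check. The class below is a hypothetical helper, not PaLM's actual criterion: the window size and the 2× threshold are assumptions to tune per run, and the rollback and batch skipping are left to the surrounding training loop.

```python
from collections import deque


class LossSpikeDetector:
    """Flags a step whose loss exceeds `factor` times the recent average."""

    def __init__(self, window=100, factor=2.0):
        self.history = deque(maxlen=window)
        self.factor = factor

    def update(self, loss: float) -> bool:
        is_spike = (
            len(self.history) == self.history.maxlen
            and loss > self.factor * (sum(self.history) / len(self.history))
        )
        if not is_spike:
            self.history.append(loss)  # keep spikes out of the baseline
        return is_spike


if __name__ == "__main__":
    detector = LossSpikeDetector(window=3, factor=2.0)
    for step, loss in enumerate([2.0, 2.0, 2.0, 2.1, 6.0, 2.0]):
        if detector.update(loss):
            print(f"step {step}: spike, roll back and skip this batch")
```

No step is flagged until the window is full, so the first few steps of training never trigger a rollback.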

Stability techniques (now standard)

Pre-norm (normalize before attention/FFN, not after)
QK-norm (normalize Q and K before dot product)
No bias in linear layers (except final output)
Gradient clipping (max_norm=1.0)
Embedding LayerNorm (especially at scale)
bf16 over fp16 (no loss scaling needed)

---

DGX Spark / Bandwidth-Limited GPU Training

GB10 Grace Blackwell specs

Spec	Value	vs H100 SXM
GPU memory	128 GB LPDDR5X (unified CPU+GPU)	80 GB HBM3
Memory bandwidth	~273 GB/s	~3,350 GB/s (12× less)
CPU-GPU interconnect	NVLink C2C (~900 GB/s)	N/A (discrete)
FP4 Tensor Core	Yes (Blackwell)	No
FP8 Tensor Core	Yes	Yes
bf16 peak TFLOPS	~TBD (Blackwell arch)	989.5
Power	~300W total system	700W GPU alone
Form factor	Desktop workstation	Data center

The bandwidth bottleneck

DGX Spark's biggest constraint is memory bandwidth — 12× less than H100. This means:

Compute-bound ops (large matmuls): run fine, similar efficiency per FLOP
Memory-bound ops (element-wise, reductions, attention): severely bottlenecked
Effective MFU will be lower than on HBM GPUs for the same model

Rule of thumb: if your operation has low arithmetic intensity (FLOPs/byte < 50), it will be bandwidth-limited on DGX Spark. Large batch sizes and wide models help increase arithmetic intensity.
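Arithmetic intensity for a matmul is easy to estimate. The sketch below assumes ideal caching, where each of the three matrices crosses the memory bus exactly once in bf16 (2 bytes per element); the function name and the 4096-wide layer are illustrative.

```python
def matmul_arithmetic_intensity(m, k, n, bytes_per_element=2):
    """FLOPs per byte for an (m x k) @ (k x n) matmul, each matrix moved once."""
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


if __name__ == "__main__":
    # m is the number of tokens in the batch; the weight matrix is 4096 x 4096.
    for tokens in (1, 64, 4096):
        ai = matmul_arithmetic_intensity(tokens, 4096, 4096)
        print(f"{tokens:>5} tokens: {ai:8.1f} FLOPs/byte")
```

With a single token the weight load dominates and the intensity is about 1 FLOP/byte; with 64 tokens it is already above the 50 FLOPs/byte rule of thumb, which is why batch size matters so much on bandwidth-limited hardware.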

Optimization strategies for bandwidth-limited training

1. Maximize compute-to-memory ratio

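# Use larger micro-batches to increase arithmetic intensity of matmuls
# Bigger micro-batches → more FLOPs per weight load → better bandwidth utilization

# Gradient accumulation reaches a large effective batch without OOM.
# It does not change per-matmul intensity (that depends on the micro-batch),
# but it amortizes the optimizer step over more tokens.
grad_accum_steps = 16  # effective batch = 16 × micro-batch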

2. Quantized training (FP8 / FP4)

DGX Spark's Blackwell cores natively support FP4 and FP8 — these reduce memory traffic proportionally:

# FP8 training with transformer engine
import transformer_engine.pytorch as te

# Replace nn.Linear with FP8 version
linear = te.Linear(in_features, out_features, bias=False)

# FP8 autocast
with te.fp8_autocast(enabled=True):
    output = model(input)

FP8 cuts memory bandwidth demand by ~2× vs bf16. FP4 (where available) cuts by ~4×. Since bandwidth is the bottleneck, this directly translates to speed.

3. Operator fusion

Fuse element-wise operations to reduce memory round-trips:

# torch.compile is critical on bandwidth-limited hardware
# It fuses element-wise ops (norm, activation, residual add) into single kernels
model = torch.compile(model, dynamic=False, fullgraph=True)

# Manual fusion example: fused RMSNorm + linear
# Instead of: norm(x) → write to memory → linear(normed_x)
# Fused: norm + linear in one kernel, x never written back to memory

4. Gradient checkpointing (actually beneficial here)

On HBM GPUs, gradient checkpointing trades compute for memory. On DGX Spark, it's a different tradeoff — recomputing activations can be faster than loading them from memory:

from torch import nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    # self.attn and self.mlp are the block's sub-modules, built in __init__ (omitted here)
    def forward(self, x):
        # Recompute attention and MLP activations in backward instead of storing them
        x = x + checkpoint(self.attn, x, use_reentrant=False)
        x = x + checkpoint(self.mlp, x, use_reentrant=False)
        return x

5. Unified memory advantage

The NVLink C2C connection (~900 GB/s) between CPU and GPU means:

No explicit CPU↔GPU copies needed — unified address space
Can train models larger than GPU VRAM without offloading overhead
Use torch.cuda.mem_get_info() to check available unified memory
The 128GB pool is shared — monitor total system memory, not just "GPU memory"

6. KV-cache optimization for inference

For LLM inference on DGX Spark, KV-cache is the bandwidth bottleneck:

GQA/MQA: fewer KV heads = smaller cache = less bandwidth
KV-cache quantization: INT8 or FP8 KV cache halves cache bandwidth vs bf16; 4-bit formats reach ~4×
Sliding window attention: bounds cache size regardless of sequence length
PagedAttention (vLLM): efficient memory management for variable-length sequences
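The effect of the first three options on cache size can be estimated directly. The sketch below counts bytes for one sequence; the model shape (32 layers, head dimension 128) is a hypothetical example, and the function name is illustrative.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_element=2, window=None):
    """Bytes of K and V cached for one sequence.

    window: optional sliding-window length that caps the cached tokens.
    """
    cached_tokens = seq_len if window is None else min(seq_len, window)
    return 2 * n_layers * n_kv_heads * head_dim * cached_tokens * bytes_per_element


if __name__ == "__main__":
    gib = 1024 ** 3
    base = kv_cache_bytes(32, 32, 128, 4096)                  # full multi-head, bf16
    gqa = kv_cache_bytes(32, 8, 128, 4096)                    # GQA with 8 KV heads
    gqa_fp8 = kv_cache_bytes(32, 8, 128, 4096, bytes_per_element=1)
    gqa_swa = kv_cache_bytes(32, 8, 128, 4096, window=1024)   # sliding window
    for name, size in [("MHA bf16", base), ("GQA bf16", gqa),
                       ("GQA fp8", gqa_fp8), ("GQA bf16 + window", gqa_swa)]:
        print(f"{name:>18}: {size / gib:.3f} GiB")
```

The savings multiply: in this example, going from 32 to 8 KV heads cuts the cache 4×, and FP8 or a 1024-token window each cut it further.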

7. Model selection for DGX Spark

Model Size	Feasibility	Notes
< 1B	Excellent	Train from scratch, fast iteration
1-7B	Good	Train from scratch; fine-tune comfortably
7-13B	Feasible	Fine-tune with QLoRA; train from scratch slowly
13-30B	Fine-tune only	QLoRA; unified memory helps fit the model
30-70B	Inference only	With quantization (GPTQ/AWQ 4-bit)
> 70B	Not recommended	Even inference may be too slow

DGX Spark checklist

[ ] Enable FP8 training (transformer_engine) — biggest single win
[ ] Use torch.compile with fullgraph=True for operator fusion
[ ] Increase batch size as much as memory allows (improves arithmetic intensity)
[ ] Try gradient checkpointing and measure (recompute can beat reloading activations on bandwidth-limited hardware)
[ ] Use GQA/MQA for attention-heavy models
[ ] Monitor torch.cuda.max_memory_allocated() — unified memory means different limits
[ ] Profile with torch.profiler to find bandwidth-bound kernels
[ ] Consider FP4 for inference if Blackwell kernel support is available

---

Key References

Scaling Laws

Kaplan et al. (2020): "Scaling Laws for Neural Language Models" — arxiv:2001.08361
Hoffmann et al. (2022): "Training Compute-Optimal Large Language Models" (Chinchilla) — arxiv:2203.15556
Muennighoff et al. (2023): "Scaling Data-Constrained Language Models" — arxiv:2305.16264

Architecture Selection

Dosovitskiy et al. (2020): "An Image is Worth 16x16 Words" (ViT) — arxiv:2010.11929
Liu et al. (2022): "A ConvNet for the 2020s" (ConvNeXt) — arxiv:2201.03545
Grinsztajn et al. (2022): "Why do tree-based models still outperform deep learning on tabular data?" — arxiv:2207.08815

Alternative Architectures

Gu & Dao (2023): "Mamba: Linear-Time Sequence Modeling" — arxiv:2312.00752
Peng et al. (2023): "RWKV: Reinventing RNNs for the Transformer Era" — arxiv:2305.13048
Sun et al. (2023): "Retentive Network" (RetNet) — arxiv:2307.08621

Training Recipes & Methodology

Karpathy (2019): "A Recipe for Training Neural Networks" (blog post)
Wightman et al. (2021): "ResNet Strikes Back" — arxiv:2110.00476
Yang et al. (2022): "Tensor Programs V" (µP) — arxiv:2203.03466
Google Research: "Deep Learning Tuning Playbook" — github.com/google-research/tuning_playbook
Stas Bekman: "ML Engineering" — github.com/stas00/ml-engineering
Geiping & Goldstein (2022): "Cramming: Training a Language Model on a Single GPU in One Day" — arxiv:2212.14034

Training at Scale

Zhang et al. (2022): "OPT: Open Pre-trained Transformer Language Models" — arxiv:2205.01068
Chowdhery et al. (2022): "PaLM: Scaling Language Modeling with Pathways" — arxiv:2204.02311
Touvron et al. (2023): "LLaMA" — arxiv:2302.13971


How it compares

Reference recipe pack for model code—not a hyperparameter tuner or dataset pipeline skill.

FAQ

Who is ml-training-recipes for?

ML engineers and agent developers implementing their own transformer training stack who need battle-tested module patterns.

When should I use ml-training-recipes?

Use it during backend build work, while you code attention blocks, normalization, and model config for a training or fine-tuning job.

Is ml-training-recipes safe to install?

It is documentation and code patterns only; review the Security Audits panel on this page and audit any copied training code before running on your GPUs.
