
ML Training Recipes
Copy vetted transformer building blocks—RMSNorm, RoPE, GQA, Flash Attention—when you implement or fine-tune an LLM backend as a solo ML engineer.
Overview
ML Training Recipes is an agent skill for the Build phase that supplies copy-ready transformer architecture patterns for training modern LLM backends.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill ml-training-recipes
What is this skill?
- 10-section architecture patterns reference (RMSNorm through model configuration)
- Pre-norm transformer blocks with RMSNorm and residual scaling patterns
- Modern attention stack: RoPE, sliding-window Flash Attention, and GQA
- ResFormer value embedding, activation, logit soft-capping, and full block assembly
- Python-oriented snippets intended to plug into a training codebase
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You are implementing a custom transformer but unsure which norm, attention, and embedding patterns match current best practice.
Who is it for?
Solo ML builders fine-tuning or training small-to-mid LLMs who want vetted architecture snippets instead of piecing papers together ad hoc.
Skip if: your team only consumes hosted models via API and has no custom training code, or you are a beginner without PyTorch familiarity.
When should I use this skill?
Use it when you are implementing or refactoring transformer training code and need standard patterns for norms, attention, and blocks.
What do I get? / Deliverables
You assemble blocks from documented recipes (pre-norm RMSNorm, RoPE, GQA, flash attention) consistent with a full model configuration pattern.
- Transformer module implementations aligned to reference patterns
- Model configuration structure matching documented conventions
Journey fit
Training-recipe patterns are used while implementing model code during the Build phase, not during distribution or ops unless you are actively changing architecture. The Backend subphase holds custom model training stacks, loss blocks, and inference kernels that power agent or API products.
How it compares
Reference recipe pack for model code—not a hyperparameter tuner or dataset pipeline skill.
Common Questions / FAQ
Who is ml-training-recipes for?
Indie ML engineers and agent builders implementing their own transformer training stack who need battle-tested module patterns.
When should I use ml-training-recipes?
During Build backend work while you code attention blocks, normalization, and model config for a training or fine-tuning job.
Is ml-training-recipes safe to install?
It is documentation and code patterns only; review the Security Audits panel on this page and audit any copied training code before running on your GPUs.
SKILL.md
# Architecture Patterns Reference

Detailed code patterns for modern transformer architectures. Referenced from the main SKILL.md.

## Table of Contents

1. [RMSNorm](#rmsnorm)
2. [Rotary Position Embeddings (RoPE)](#rotary-position-embeddings-rope)
3. [Flash Attention with Sliding Window](#flash-attention-with-sliding-window)
4. [Grouped Query Attention (GQA)](#grouped-query-attention-gqa)
5. [Value Embedding (ResFormer)](#value-embedding-resformer)
6. [Activation Functions](#activation-functions)
7. [Residual Scaling](#residual-scaling)
8. [Logit Soft Capping](#logit-soft-capping)
9. [Full Transformer Block](#full-transformer-block)
10. [Model Configuration Pattern](#model-configuration-pattern)

---

## RMSNorm

Root Mean Square Layer Normalization drops the mean-centering of LayerNorm and keeps only the variance normalization. ~15% faster with equivalent quality for transformers.

```python
import torch.nn.functional as F  # F.rms_norm needs a recent PyTorch release

def norm(x):
    return F.rms_norm(x, (x.size(-1),))
```

Apply pre-norm (before attention and MLP), not post-norm:

```python
class Block(nn.Module):
    def forward(self, x):
        x = x + self.attn(norm(x))  # pre-norm
        x = x + self.mlp(norm(x))   # pre-norm
        return x
```

Also normalize the final output before the unembedding layer:

```python
x = norm(x)
logits = self.lm_head(x)
```

---

## Rotary Position Embeddings (RoPE)

RoPE encodes position by rotating query/key pairs. It is relative (the attention score depends only on the distance between tokens) and naturally handles varying sequence lengths.
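To make the "relative" claim concrete, here is a minimal sketch in plain Python (standard library only, so it runs without PyTorch) for a single 2-D query/key pair. The helper names `rotate_pair` and `rope_score`, the example vectors, and the single frequency `inv_freq=1.0` are assumptions made for illustration and are not part of the skill's code. The rotation uses the same sign convention as the `apply_rotary_emb` snippet in this reference.

```python
import math

def rotate_pair(x1, x2, angle):
    """Rotate the pair (x1, x2) by `angle` radians (RoPE-style rotation)."""
    c, s = math.cos(angle), math.sin(angle)
    return x1 * c + x2 * s, -x1 * s + x2 * c

def rope_score(q, k, q_pos, k_pos, inv_freq=1.0):
    """Dot product of q and k after each is rotated by position * inv_freq."""
    q1, q2 = rotate_pair(q[0], q[1], q_pos * inv_freq)
    k1, k2 = rotate_pair(k[0], k[1], k_pos * inv_freq)
    return q1 * k1 + q2 * k2

# Hypothetical 2-D query and key vectors.
q, k = (0.3, -1.2), (0.7, 0.5)

near = rope_score(q, k, q_pos=5, k_pos=2)        # distance 3
far = rope_score(q, k, q_pos=105, k_pos=102)     # distance 3, shifted by 100
shifted = rope_score(q, k, q_pos=5, k_pos=3)     # distance 2

# Same distance gives the same score; a different distance does not.
print(abs(near - far) < 1e-9, abs(near - shifted) > 1e-3)
```

The two scores at distance 3 agree up to floating-point error even though the absolute positions differ by 100, which is the property that lets the model generalize across positions in a sequence.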
### Precomputation

Compute cos/sin tables once at model init, not every forward pass:

```python
import torch

def precompute_rotary(seq_len, head_dim, base=10000, device=None):
    """Precompute RoPE cos/sin for positions [0, seq_len)."""
    channel_range = torch.arange(0, head_dim, 2, dtype=torch.float32, device=device)
    inv_freq = 1.0 / (base ** (channel_range / head_dim))
    t = torch.arange(seq_len, dtype=torch.float32, device=device)
    freqs = torch.outer(t, inv_freq)
    cos, sin = freqs.cos().bfloat16(), freqs.sin().bfloat16()
    # Shape: [1, seq_len, 1, head_dim//2] for broadcasting
    return cos[None, :, None, :], sin[None, :, None, :]
```

Register as non-persistent buffers (not saved in state_dict, but moved with `.to(device)`):

```python
self.register_buffer("cos", cos, persistent=False)
self.register_buffer("sin", sin, persistent=False)
```

### Application

```python
def apply_rotary_emb(x, cos, sin):
    """Apply RoPE to query or key tensor. x shape: [B, T, H, D]."""
    # cos/sin must already be sliced to the current length T, e.g. cos[:, :T]
    d = x.shape[3] // 2
    x1, x2 = x[..., :d], x[..., d:]
    y1 = x1 * cos + x2 * sin
    y2 = x1 * (-sin) + x2 * cos
    return torch.cat([y1, y2], dim=3)
```

### Tips

- Pre-allocate for `seq_len * 10` (or max expected length) to avoid recomputation
- Apply RoPE **after** splitting into heads but **before** attention
- Normalize q and k **after** RoPE: `q, k = norm(q), norm(k)` (QK-norm stabilizes training)

---

## Flash Attention with Sliding Window

Flash Attention computes exact attention in O(N) memory instead of O(N^2), and is significantly faster due to IO-awareness.

### Sliding Window Pattern

Use a repeating pattern like `SSSL`: most layers use short (local) windows, with periodic long (global) windows. The last layer always gets full context.
```python
def compute_window_sizes(config):
    pattern = config.window_pattern.upper()  # e.g., "SSSL"
    long_window = config.sequence_len
    short_window = long_window // 2  # half context
    window_sizes = []
    for layer_idx in range(config.n_layer):
        char = pattern[layer_idx % len(pattern)]
        if char == "L":
            window_sizes.append((long_window, 0))
        else:
            window_sizes.append((short_window, 0))
    # Last layer always gets full context
    window_sizes[-1] = (long_window, 0)
    return window_sizes
```

This saves ~25% attention com