
Long Context
Compare YaRN, ALiBi, and position interpolation when choosing or tuning long-context limits for agent and LLM-backed products.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill long-context
What is this skill?
- Side-by-side treatment of YaRN, ALiBi, and position interpolation from cited papers
- YaRN NTK-aware interpolation, attention temperature scaling, and NTK-by-parts frequency cutoffs
- Concrete formulas and Python sketches for m-scale and correction dimensions
- Method comparison section for picking an approach under data and length constraints
- Anchored to arXiv 2309.00071 and related long-context literature
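For orientation, the two non-YaRN baselines named above can be sketched in a few lines of plain Python. This is a minimal illustration rather than code from the skill: the function names `rope_angles`, `interpolated_angles`, and `alibi_slopes` are hypothetical, the RoPE base of 10000 is an assumed default, and the ALiBi slope formula shown is the geometric sequence used when the head count is a power of two.

```python
def rope_angles(position, dim, base=10000.0):
    """RoPE rotation angles for one position: position * base^(-2i/dim)."""
    return [position * base ** (-2 * i / dim) for i in range(dim // 2)]


def interpolated_angles(position, dim, scale, base=10000.0):
    """Position interpolation: shrink the position index by the scale factor
    so that extended positions map back into the trained range."""
    return rope_angles(position / scale, dim, base)


def alibi_slopes(num_heads):
    """ALiBi per-head slopes for a power-of-two head count: 2^(-8h/num_heads).
    Each head adds -slope * distance to its attention logits instead of
    using positional embeddings."""
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]
```

With a scale of 2, position 4096 under interpolation receives the same rotation angles that position 2048 had in the original model, which is why every frequency is compressed by the same factor. YaRN's NTK-by-parts scheme changes exactly that uniform treatment.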
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 2/3 security scanners passed (skills.sh audits).
Journey fit
Primary fit
Canonical shelf is idea/research because the skill synthesizes published papers and tradeoffs before you commit to a context-window strategy. It answers how to extend context (RoPE, attention biases, interpolation), not how to ship a specific UI or API integration.
Common Questions / FAQ
Is Long Context safe to install?
skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md
# Context Extension Methods

Comprehensive comparison of YaRN, ALiBi, and Position Interpolation based on published research.

## Table of Contents

- YaRN (Yet another RoPE extensioN)
- ALiBi (Attention with Linear Biases)
- Position Interpolation
- Method Comparison

## YaRN: Yet another RoPE extensioN

**Paper**: arXiv 2309.00071 (2023)

**Authors**: Bowen Peng, Jeffrey Quesnelle, Honglu Fan, Enrico Shippole

### Overview

YaRN extends RoPE-based models to 128k+ context with roughly 10× fewer training tokens than previous methods.

### Key Innovations

1. **NTK-aware interpolation**: Scales different frequency components differently
2. **Attention temperature scaling**: Adjusts attention sharpness
3. **NTK-by-parts**: Hybrid interpolation/extrapolation

### Technical Details

**Problem**: Naive position interpolation compresses all frequencies uniformly, losing high-frequency information.

**Solution**: Different treatment for different frequencies.

```python
import math

import torch

# Frequency decomposition, measured by how many full rotations a RoPE
# dimension completes over the original context window:
#   Low frequencies (fewer than β_slow rotations): Interpolate (compress)
#   High frequencies (more than β_fast rotations): Extrapolate (keep as-is)
#   Middle frequencies: Smooth ramp between the two


def yarn_get_mscale(scale=1.0):
    """Attention temperature scaling."""
    if scale <= 1:
        return 1.0
    return 0.1 * math.log(scale) + 1.0


def yarn_find_correction_dim(num_rotations, dim, base=10000, max_position_embeddings=2048):
    """Find dimension cutoffs for NTK-by-parts."""
    return (dim * math.log(max_position_embeddings / (num_rotations * 2 * math.pi))) / (2 * math.log(base))


def yarn_find_correction_range(low_rot, high_rot, dim, base=10000, max_position_embeddings=2048):
    """Find frequency ranges for interpolation."""
    low = math.floor(yarn_find_correction_dim(low_rot, dim, base, max_position_embeddings))
    high = math.ceil(yarn_find_correction_dim(high_rot, dim, base, max_position_embeddings))
    return max(low, 0), min(high, dim - 1)


def yarn_linear_ramp_mask(min_val, max_val, dim):
    """Create smooth ramp between interpolation and extrapolation."""
    if min_val == max_val:
        max_val += 0.001  # Avoid division by zero
    linear_func = (torch.arange(dim, dtype=torch.float32) - min_val) / (max_val - min_val)
    ramp_func = torch.clamp(linear_func, 0, 1)
    return ramp_func
```

### Complete YaRN Implementation

```python
from torch import nn


class YaRNScaledRoPE(nn.Module):
    """Full YaRN implementation."""

    def __init__(
        self,
        dim,
        max_position_embeddings=2048,
        base=10000,
        scale=1.0,
        original_max_position_embeddings=2048,
        extrapolation_factor=1.0,
        attn_factor=1.0,
        beta_fast=32,
        beta_slow=1,
        device=None
    ):
        super().__init__()
        self.dim = dim
        self.max_position_embeddings = max_position_embeddings
        self.base = base
        self.scale = scale
        self.original_max_position_embeddings = original_max_position_embeddings
        self.extrapolation_factor = extrapolation_factor
        self.attn_factor = attn_factor
        self.beta_fast = beta_fast
        self.beta_slow = beta_slow

        # Compute mscale (attention temperature)
        self.mscale = float(yarn_get_mscale(self.scale) * self.attn_factor)

        # Compute frequency bands (dimension indices where the ramp starts and ends)
        self.low, self.high = yarn_find_correction_range(
            self.beta_fast,
            self.beta_slow,
            self.dim,
            self.base,
            self.original_max_position_embeddings
        )

        # Compute inverse frequencies: unscaled (extrapolation) and
        # divided by the scale factor (interpolation)
        pos_freqs = self.base ** (torch.arange(0, self.dim, 2, dtype=torch.float32) / self.dim)
        inv_freq_extrapolation = 1.0 / pos_freqs
        inv_freq_interpolation = 1.0 / (self.scale * pos_freqs)

        # Create ramp mask: 1 for high-frequency dimensions (keep as-is),
        # 0 for low-frequency dimensions (interpolate), blended in between.
        # The original expression never applied `scale` to the frequencies,
        # so the two sets are blended explicitly here.
        inv_freq_mask = (1.0 - yarn_linear_ramp_mask(self.low, self.high, self.dim // 2)) * self.extrapolation_factor
        inv_freq = inv_freq_interpolation * (1 - inv_freq_mask) + inv_freq_extrapolation * inv_freq_mask

        self.register_buffer("inv_freq", inv_freq)

    def forward(self, seq_len, device):
        t = torch.arange(seq_len, device=device,