
Speculative Decoding
Understand and apply lookahead (Jacobi) decoding to speed up autoregressive LLM inference without a draft model when you are shipping or tuning agent products.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill speculative-decoding
What is this skill?
- Explains Jacobi iteration reformulation of sequential token generation for parallel n-gram candidates
- Documents two-branch lookahead architecture with window size W and n-gram size N parameters
- References ICML 2024 paper and LMSYS lookahead decoding with cited 1.5–2.3× speedup range
- Includes Python-oriented lookahead branch sketch for generating candidate n-grams from past tokens
- Contrasts traditional autoregressive steps with parallel Jacobi update equations
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 2/3 security scanners passed (skills.sh audits).
Journey fit
Canonical shelf is Build because the skill teaches inference mechanics you implement or evaluate while building LLM-backed agents and APIs. Agent-tooling is the best fit: decoding strategy directly affects latency and cost of code-assistant and chat agents you deploy.
Common Questions / FAQ
Is Speculative Decoding safe to install?
skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md
# Lookahead Decoding: Jacobi Iteration

Based on ICML 2024 paper and LMSYS blog post

## Overview

**Source**: https://lmsys.org/blog/2023-11-21-lookahead-decoding/

**Paper**: ICML 2024

**GitHub**: https://github.com/hao-ai-lab/LookaheadDecoding

Lookahead Decoding breaks sequential dependency in autoregressive decoding using Jacobi iteration, achieving 1.5-2.3× speedup without draft models or training.

## Core Concept

### Reformulation as Equation Solving

**Traditional autoregressive**:
```
y_t = f(x, y_1, y_2, ..., y_{t-1})  # Sequential
```

**Jacobi iteration**:
```
y_t^{(k+1)} = f(x, y_1^{(k)}, y_2^{(k)}, ..., y_{t-1}^{(k)})  # Parallel
```

**Key insight**: Although exact parallel decoding is impossible, we can generate multiple disjoint n-grams in parallel that may fit into the final sequence.

## Two-Branch Architecture

### Lookahead Branch

**Purpose**: Generate potential token sequences (n-grams) in parallel.

**Parameters**:
- `W` (window size): How many steps ahead to look
- `N` (n-gram size): How many past tokens to use for generation

```python
# Example: W=5, N=3
# Generate n-grams at positions 1-5 using past 1-3 tokens

def lookahead_branch(model, tokens, W=5, N=3):
    """Generate n-grams using Jacobi iteration."""
    candidates = {}

    for w in range(1, W + 1):       # Position offset
        for n in range(1, N + 1):   # N-gram length
            # Use n past tokens to predict at position w
            past_tokens = tokens[-n:]
            future_position = len(tokens) + w

            # Generate n-gram
            ngram = model.generate_ngram(
                context=past_tokens,
                position=future_position,
                length=n
            )

            candidates[(w, n)] = ngram

    return candidates
```

**Output**: Pool of candidate n-grams that might match future sequence.

### Verification Branch

**Purpose**: Identify and confirm promising n-grams.
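As a rough illustration of what "confirm" means under greedy decoding, the sketch below accepts the longest prefix of a candidate that the model itself would have emitted. Everything in it is an assumption made for illustration: `toy_next_token` is a hypothetical deterministic rule standing in for an LLM, candidates are treated as pure continuations (the overlapping first token already stripped), and the check runs token by token, whereas a real implementation would score all positions in a single batched forward pass.

```python
from typing import Callable, List, Sequence

Token = int


def toy_next_token(context: Sequence[Token]) -> Token:
    """Hypothetical stand-in for greedy LLM decoding: a fixed deterministic rule."""
    return (sum(context[-2:]) + 1) % 7


def verify_candidate(
    next_token: Callable[[Sequence[Token]], Token],
    tokens: List[Token],
    candidate: Sequence[Token],
) -> List[Token]:
    """Return the longest prefix of `candidate` that greedy decoding would also emit."""
    accepted: List[Token] = []
    for guess in candidate:
        # Stop at the first position where the model disagrees with the guess.
        if next_token(tokens + accepted) != guess:
            break
        accepted.append(guess)
    return accepted


def best_verified(
    next_token: Callable[[Sequence[Token]], Token],
    tokens: List[Token],
    candidates: Sequence[Sequence[Token]],
) -> List[Token]:
    """Pick the candidate with the longest verified prefix (empty if none match)."""
    verified = [verify_candidate(next_token, tokens, c) for c in candidates]
    return max(verified, key=len, default=[])


prompt = [1, 2]
candidates = [[4, 0, 5], [4, 1, 1], [3, 3]]
accepted = best_verified(toy_next_token, prompt, candidates)
print(accepted)
```

Because only tokens the model agrees with are kept, the accepted tokens are exactly those that plain step-by-step greedy decoding would have produced; the gain comes from accepting several of them per step when a candidate matches.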
```python
def verification_branch(model, tokens, candidates):
    """Verify which candidates match actual sequence."""
    verified = []

    # `candidates` is the dict returned by lookahead_branch, so iterate its values
    for ngram in candidates.values():
        # Check if ngram's first token matches last generated token
        if ngram[0] == tokens[-1]:
            # Verify full n-gram with model; skip the overlapping first token
            # so it is not duplicated in the sequence being checked
            is_valid = model.verify_sequence(tokens + ngram[1:])

            if is_valid:
                verified.append(ngram)

    # Return longest verified n-gram
    return max(verified, key=len) if verified else None
```

**Acceptance**: An n-gram is accepted if its first token matches the last input token and the model confirms the sequence.

## Algorithm

### Complete Lookahead Decoding

```python
import torch


class LookaheadDecoding:
    def __init__(self, model, W=15, N=5, G=5):
        """
        Args:
            W: Window size (lookahead distance)
            N: N-gram size (context length)
            G: Guess size (parallel candidates)
        """
        self.model = model
        self.W = W
        self.N = N
        self.G = G

    def generate(self, input_ids, max_new_tokens=256):
        # Assumes input_ids is a 1-D tensor of token ids
        tokens = input_ids.clone()
        prompt_len = len(tokens)

        # Count only newly generated tokens against the budget
        while len(tokens) - prompt_len < max_new_tokens:
            # 1. Lookahead: Generate candidates
            candidates = self._lookahead_step(tokens)

            # 2. Verification: Find matching n-grams
            accepted_ngram = self._verification_step(tokens, candidates)

            if accepted_ngram is not None:
                # Accept multiple tokens
                tokens = torch.cat([tokens, accepted_ngram])
            else:
                # Fallback: Generate single token
                next_token = self.model.generate_next(tokens)
                tokens = torch.cat([tokens, next_token])

        return tokens

    def _lookahead_step(self, tokens):
        """Generate candidate n-grams in parallel."""
        candidates = []

        for w in range(1, self.W + 1):
            for n in range(1, self.N + 1):
                # Sample n-gram from model
                ngram = self.model.sample_ngram(