
Serving LLMs with vLLM
Deploy and tune vLLM inference with PagedAttention, continuous batching, and prefix caching for your own model API.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill serving-llms-vllm
What is this skill?
- PagedAttention block sizing and the --gpu-memory-utilization serve flag
- Continuous batching mechanics mixing prefill and decode for higher GPU utilization
- Prefix caching and speculative decoding setup guidance
- Documented throughput comparison: the continuous batching example reaches up to 4x the throughput of traditional batching
- KV cache memory example: a 70B model needing 160GB with traditional allocation vs ~80GB with PagedAttention
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 1/3 security scanners passed (skills.sh audits).
Journey fit
Primary fit
Production LLM serving is operated infrastructure: GPU memory, throughput, and tuning all matter after the model is chosen. The infra subphase covers serve commands, GPU utilization flags, and performance mechanics rather than model research alone.
Common Questions / FAQ
Is Serving LLMs with vLLM safe to install?
skills.sh reports 1 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md
# Performance Optimization

## Contents

- PagedAttention explained
- Continuous batching mechanics
- Prefix caching strategies
- Speculative decoding setup
- Benchmark results and comparisons
- Performance tuning guide

## PagedAttention explained

**Traditional attention problem**:
- KV cache stored in contiguous memory
- Wastes ~50% GPU memory due to fragmentation
- Cannot dynamically reallocate for varying sequence lengths

**PagedAttention solution**:
- Divides KV cache into fixed-size blocks (like OS virtual memory)
- Dynamic allocation from free block queue
- Shares blocks across sequences (for prefix caching)

**Memory savings example**:
```
Traditional:     70B model needs 160GB KV cache → OOM on 8x A100
PagedAttention:  70B model needs 80GB KV cache  → Fits on 4x A100
```

**Configuration**:
```bash
# Block size (default: 16 tokens)
vllm serve MODEL --block-size 16

# Number of GPU blocks (auto-calculated)
# Controlled by --gpu-memory-utilization
vllm serve MODEL --gpu-memory-utilization 0.9
```

## Continuous batching mechanics

**Traditional batching**:
- Wait for all sequences in batch to finish
- GPU idle while waiting for longest sequence
- Low GPU utilization (~40-60%)

**Continuous batching**:
- Add new requests as slots become available
- Mix prefill (new requests) and decode (ongoing) in same batch
- High GPU utilization (>90%)

**Throughput improvement**:
```
Traditional batching: 50 req/sec @ 50% GPU util
Continuous batching:  200 req/sec @ 90% GPU util
= 4x throughput improvement
```

**Tuning parameters**:
```bash
# Max concurrent sequences (higher = more batching)
vllm serve MODEL --max-num-seqs 256

# Prefill/decode schedule (auto-balanced by default)
# No manual tuning needed
```

## Prefix caching strategies

Reuse computed KV cache for common prompt prefixes.

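
To make the link between fixed-size KV blocks and prefix reuse concrete, here is a toy Python model. It is a sketch under stated assumptions, not vLLM's implementation: `ToyPrefixCache` and `block_hashes` are hypothetical names, the 512-token system prompt is made up, and real token IDs would come from a tokenizer. It assumes reuse happens only at whole-block granularity, with each block's hash chained to the blocks before it.

```python
from hashlib import sha256

BLOCK_SIZE = 16  # tokens per KV block; mirrors the --block-size default quoted above


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Return one chained hash per full block of the prompt (hypothetical helper)."""
    hashes = []
    parent = b""
    full = len(token_ids) - len(token_ids) % block_size  # ignore the partial tail block
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        # Chaining with the parent hash means a block only matches
        # when everything before it matched too.
        digest = sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes


class ToyPrefixCache:
    """Toy model of block-level prefix reuse (an assumption, not vLLM's code)."""

    def __init__(self):
        self.cached = set()

    def admit(self, token_ids):
        """Return (reused_tokens, computed_tokens) for one request."""
        hashes = block_hashes(token_ids)
        reused_blocks = 0
        for h in hashes:
            if h not in self.cached:
                break  # first miss ends the reusable prefix
            reused_blocks += 1
        self.cached.update(hashes)
        reused = reused_blocks * BLOCK_SIZE
        return reused, len(token_ids) - reused


system = list(range(512))  # hypothetical 512-token system prompt (32 full blocks)
cache = ToyPrefixCache()
first = cache.admit(system + list(range(1000, 1100)))   # cold cache
second = cache.admit(system + list(range(2000, 2100)))  # shares the system prompt
print(first)   # (0, 612)
print(second)  # (512, 100)
```

In this toy run the second request reuses 32 blocks (512 tokens) and computes only its 100 new tokens. Under the whole-block assumption, a shared prefix that does not fill a complete block would be recomputed.
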

**Use cases**:
- System prompts repeated across requests
- Few-shot examples in every prompt
- RAG contexts with overlapping chunks

**Example savings**:
```
Prompt: [System: 500 tokens] + [User: 100 tokens]

Without caching: Compute 600 tokens every request
With caching:    Compute 500 tokens once, then 100 tokens/request
                 = ~83% fewer prefill tokens per request (lower TTFT)
```

**Enable prefix caching**:
```bash
vllm serve MODEL --enable-prefix-caching
```

**Automatic prefix detection**:
- vLLM detects common prefixes automatically
- No code changes required
- Works with OpenAI-compatible API

**Cache hit rate monitoring**:
```bash
# vLLM serves Prometheus metrics on the API server port (8000 by default)
curl http://localhost:8000/metrics | grep prefix_cache
# Exact metric names vary by vLLM version; a value of 0.75 corresponds to a 75% hit rate
```

## Speculative decoding setup

Use a smaller "draft" model to propose tokens and the larger target model to verify them.

**Speed improvement**:
```
Standard:    Generate 1 token per forward pass
Speculative: Generate 3-5 tokens per forward pass
             = 2-3x faster generation
```

**How it works**:
1. Draft model proposes K tokens (fast)
2. Target model verifies all K tokens in parallel (one pass)
3. Accept verified tokens, restart from first rejection

**Setup with separate draft model**:
```bash
# The draft model must use the same tokenizer/vocabulary as the target.
# Flag names vary across vLLM versions; newer releases take a single --speculative-config JSON argument.
vllm serve meta-llama/Llama-3-70B-Instruct \
  --speculative-model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --num-speculative-tokens 5
```

**Setup with n-gram draft** (no separate model):
```bash
vllm serve MODEL \
  --speculative-method ngram \
  --num-speculative-tokens 3
```

**When to use**:
- Output length > 100 tokens
- Draft model 5-10x smaller than target
- Acceptable 2-3% accuracy trade-off

## Benchmark results

**vLLM vs HuggingFace Transformers** (Llama 3 8B, A100):
```
Metric                  | HF Transformers | vLLM   | Improvement
------------------------|-----------------|--------|------------
Throughput (req/sec)    | 12              | 280    | 23x
TTFT (ms)               | 850             | 120    | 7x
Tokens/sec              | 45              | 2,100  | 47x
GPU Memory (GB)         | 28              | 16     | 1.75x less
```

**vLLM vs TensorRT-LLM** (Llama 2 70B, 4x A100):
```
Metric                  | TensorRT-LLM | vLLM   | Notes
------------------------|--------------|--------|