
Turboquant Pytorch
Implement Google TurboQuant-style KV cache compression in PyTorch to cut LLM inference memory while preserving attention inner products for long-context serving.
Overview
TurboQuant PyTorch is an agent skill for the Build phase that guides PyTorch implementation of TurboQuant KV cache compression using rotation, Lloyd-Max quantization, and QJL residual correction.
Install
npx skills add https://github.com/aradotso/trending-skills --skill turboquant-pytorchWhat is this skill?
- Two-stage pipeline: random orthogonal rotation + Lloyd-Max scalar quantization plus QJL 1-bit residual correction
- Targets 2–4 bits per coordinate with inner-product fidelity, not vector reconstruction fidelity
- Documents sweet-spot 3-bit ~5.0x compression (58 MB vs 289 MB FP16 baseline on cited Qwen2.5-3B 8K example)
- Lists 4-bit 3.8x and 2-bit 7.3x ratios from the skill’s benchmark table
- From-scratch PyTorch oriented to ICLR 2026 TurboQuant paper behavior
- 5x compression at 3-bit with 99.5% attention fidelity (skill claim)
- 289 MB FP16 baseline → 58 MB at 3-bit on cited 8K Qwen2.5-3B example
- 4-bit 76 MB (3.8x) and 2-bit 40 MB (7.3x) in same table
Adoption & trust: 722 installs on skills.sh; 31 GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
Your agent or chat product runs out of VRAM on long contexts because FP16 KV caches dominate memory.
Who is it for?
Indie ML engineers building custom LLM inference in PyTorch who need aggressive KV cache compression without throwing away attention accuracy.
Skip if: Builders who only consume closed APIs with no access to KV tensors, or teams wanting general INT8 weight quantization without cache-specific TurboQuant stages.
When should I use this skill?
compress KV cache with TurboQuant; quantize key value cache for LLM; 3-bit KV cache compression; QJL residual correction; LLM inference memory optimization with TurboQuant.
What do I get? / Deliverables
You get a concrete TurboQuant implementation path with documented 3-bit ~5x compression tradeoffs and attention-fidelity expectations for integrating into your inference stack.
- TurboQuant stage-1 and stage-2 implementation outline
- Bit-width vs memory compression table aligned to your model context
Recommended Skills
Journey fit
Build is where you implement inference optimizations in code; KV cache compression is a backend/model-serving concern, not a launch or grow marketing task. Backend subphase holds model runtime, memory, and quantization logic that agents invoke during implementation.
How it compares
Implementation skill for KV cache vector quantization, not a generic model-quantization checklist or hosted inference product.
Common Questions / FAQ
Who is turboquant-pytorch for?
Solo developers and small teams implementing LLM inference in PyTorch who need to compress key-value caches for longer context or cheaper GPUs.
When should I use turboquant-pytorch?
Use it during build when integrating 2–4 bit KV compression, when triggers mention TurboQuant, QJL residual correction, or reducing KV cache memory during agent product development.
Is turboquant-pytorch safe to install?
The skill describes clone-and-implement workflows; review the Security Audits panel on this Prism page and audit any referenced repos before running install commands.
SKILL.md
READMESKILL.md - Turboquant Pytorch
# TurboQuant PyTorch > Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection. From-scratch PyTorch implementation of Google's TurboQuant (ICLR 2026) for compressing LLM KV caches. Achieves 5x compression at 3-bit with 99.5% attention fidelity via two-stage vector quantization. ## What It Does TurboQuant compresses LLM key-value caches to 2–4 bits per coordinate: - **Stage 1**: Random orthogonal rotation + Lloyd-Max scalar quantization (MSE-optimal) - **Stage 2**: QJL residual correction — 1-bit sign projection that makes inner product estimates unbiased Result: attention scores remain accurate even when individual vectors look quite different from originals. The algorithm preserves **inner products**, not vector fidelity. **Compression ratios at 8K context on Qwen2.5-3B (289 MB FP16 baseline):** - 4-bit → 76 MB (3.8x) - 3-bit → 58 MB (5.0x) ← practical sweet spot - 2-bit → 40 MB (7.3x) ## Installation ```bash git clone https://github.com/tonbistudio/turboquant-pytorch cd turboquant-pytorch pip install -r requirements.txt # For CUDA PyTorch: pip install torch --index-url https://download.pytorch.org/whl/cu128 ``` **requirements.txt includes:** - `torch>=2.0` - `scipy` (Lloyd-Max codebook computation) - `transformers`, `accelerate`, `bitsandbytes` (only for real model validation) ## Project Structure ``` turboquant/ __init__.py # Package exports lloyd_max.py # Lloyd-Max optimal scalar quantizer turboquant.py # Core: TurboQuantMSE, TurboQuantProd, TurboQuantKVCache compressors.py # Production compressors for real model tensors test_turboquant.py # Synthetic validation tests validate.py # Real model (Qwen2.5-3B) validation ``` ## Key Commands ```bash # Run synthetic algorithm validation (no GPU required, but GPU enables speed benchmark) python -m turboquant.test_turboquant # Run real model validation on Qwen2.5-3B-Instruct # Requires CUDA GPU with ≥6GB VRAM; downloads ~2GB model on first run python -m turboquant.validate ``` ## Core API ### Lloyd-Max Codebook ```python from turboquant.lloyd_max import build_lloyd_max_codebook # Build optimal scalar quantizer codebook for d-dimensional rotated unit vectors # Returns (boundaries, centroids) for the given bit-width boundaries, centroids = build_lloyd_max_codebook(dim=128, bits=3) ``` ### Stage 1: MSE Quantization (TurboQuantMSE) ```python from turboquant.turboquant import TurboQuantMSE # Initialize for head_dim=128, 3-bit quantization tq_mse = TurboQuantMSE(dim=128, bits=3) # Compress a batch of vectors: shape (batch, dim) keys = torch.randn(512, 128) # 512 key vectors codes = tq_mse.quantize(keys) # integer codes, (512, 128) reconstructed = tq_mse.dequantize(codes) # approximate keys, (512, 128) ``` ### Stage 2: Unbiased Inner Product Estimation (TurboQuantProd) ```python from turboquant.turboquant import TurboQuantProd # Initialize with QJL correction tq_prod = TurboQuantProd(dim=128, bits=3, proj_dim=64) # Compress key vectors (stores codes + QJL residual signs) compressed = tq_prod.compress(keys) # dict with 'codes', 'signs', 'residual_norms' # Estimate inner products <query, key> for all keys — unbiased estimator query = torch.randn(128) scores = tq_prod.inner_product(query, compressed) # shape (512,) ``` ### KV Cache Wrapper (TurboQuantKVCache) ```python from turboquant.turboquant import TurboQuantKVCache # Wrap a KV cache for a single attention head c