Turboquant Pytorch

Name: Turboquant Pytorch
Author: aradotso

aradotso/trending-skills

782 installs
66 repo stars
Updated July 9, 2026
aradotso/trending-skills

TurboQuant PyTorch is a Claude Code skill that implements Google TurboQuant-style KV cache compression in PyTorch for developers who need lower LLM inference memory on long-context serving without breaking attention inne

About

TurboQuant PyTorch is a Claude Code skill with a from-scratch PyTorch implementation of Google's TurboQuant method from ICLR 2026 for LLM key-value cache compression. It applies two-stage vector quantization—random rotation, Lloyd-Max quantization, and QJL residual correction—to shrink KV cache footprint while preserving attention inner products for long-context inference. Developers reach for TurboQuant PyTorch when optimizing transformer serving memory, experimenting with 3-bit KV cache compression, or reproducing TurboQuant research in production PyTorch stacks. Triggers include compress KV cache, quantize key value cache, reduce KV cache memory, and apply QJL residual correction.

Two-stage pipeline: random orthogonal rotation + Lloyd-Max scalar quantization plus QJL 1-bit residual correction
Targets 2–4 bits per coordinate with inner-product fidelity, not vector reconstruction fidelity
Documents sweet-spot 3-bit ~5.0x compression (58 MB vs 289 MB FP16 baseline on cited Qwen2.5-3B 8K example)
Lists 4-bit 3.8x and 2-bit 7.3x ratios from the skill’s benchmark table
From-scratch PyTorch oriented to ICLR 2026 TurboQuant paper behavior

Turboquant Pytorch by the numbers

782 all-time installs (skills.sh)
+8 installs in the week ending Jul 23, 2026 (Skillselion tracking)
Ranked #1,321 of 16,659 AI & Agent Building skills by installs in the Skillselion catalog
Security screen: HIGH risk (skills.sh audit)
Data as of Jul 23, 2026 (Skillselion catalog sync)

npx skills add https://github.com/aradotso/trending-skills --skill turboquant-pytorch

Add your badge

Show developers this skill is listed on Skillselion. Paste this into your README.

[![Listed on Skillselion](https://skillselion.com/badge/skills/aradotso/trending-skills/turboquant-pytorch.svg)](https://skillselion.com/skills/aradotso/trending-skills/turboquant-pytorch)

Installs	782
repo stars	★ 66
Security audit	2 / 3 scanners passed
Last updated	July 9, 2026
Repository	aradotso/trending-skills ↗

How do you compress LLM KV cache in PyTorch?

Implement Google TurboQuant-style KV cache compression in PyTorch to cut LLM inference memory while preserving attention inner products for long-context serving.

Who is it for?

ML engineers serving long-context transformers in PyTorch who need research-grade KV cache compression beyond naive quantization.

Skip if: Frontend developers or teams on non-PyTorch runtimes who only need standard INT8 weight quantization without KV cache changes.

When should I use this skill?

The user asks to compress KV cache, quantize key-value cache, reduce LLM inference memory, implement TurboQuant, or apply QJL residual correction in PyTorch.

What you get

PyTorch TurboQuant modules with rotated quantization, Lloyd-Max codebooks, QJL residual correction, and reduced KV cache memory for inference.

turboquant pytorch modules
compressed kv cache integration code

Files

SKILL.mdMarkdownGitHub ↗

TurboQuant PyTorch

Skill by ara.so — Daily 2026 Skills collection.

From-scratch PyTorch implementation of Google's TurboQuant (ICLR 2026) for compressing LLM KV caches. Achieves 5x compression at 3-bit with 99.5% attention fidelity via two-stage vector quantization.

What It Does

TurboQuant compresses LLM key-value caches to 2–4 bits per coordinate:

Stage 1: Random orthogonal rotation + Lloyd-Max scalar quantization (MSE-optimal)
Stage 2: QJL residual correction — 1-bit sign projection that makes inner product estimates unbiased

Result: attention scores remain accurate even when individual vectors look quite different from originals. The algorithm preserves inner products, not vector fidelity.

Compression ratios at 8K context on Qwen2.5-3B (289 MB FP16 baseline):

4-bit → 76 MB (3.8x)
3-bit → 58 MB (5.0x) ← practical sweet spot
2-bit → 40 MB (7.3x)

Installation

git clone https://github.com/tonbistudio/turboquant-pytorch
cd turboquant-pytorch
pip install -r requirements.txt

# For CUDA PyTorch:
pip install torch --index-url https://download.pytorch.org/whl/cu128

requirements.txt includes:

torch>=2.0
scipy (Lloyd-Max codebook computation)
transformers, accelerate, bitsandbytes (only for real model validation)

Project Structure

turboquant/
  __init__.py           # Package exports
  lloyd_max.py          # Lloyd-Max optimal scalar quantizer
  turboquant.py         # Core: TurboQuantMSE, TurboQuantProd, TurboQuantKVCache
  compressors.py        # Production compressors for real model tensors
  test_turboquant.py    # Synthetic validation tests
  validate.py           # Real model (Qwen2.5-3B) validation

Key Commands

# Run synthetic algorithm validation (no GPU required, but GPU enables speed benchmark)
python -m turboquant.test_turboquant

# Run real model validation on Qwen2.5-3B-Instruct
# Requires CUDA GPU with ≥6GB VRAM; downloads ~2GB model on first run
python -m turboquant.validate

Core API

Lloyd-Max Codebook

from turboquant.lloyd_max import build_lloyd_max_codebook

# Build optimal scalar quantizer codebook for d-dimensional rotated unit vectors
# Returns (boundaries, centroids) for the given bit-width
boundaries, centroids = build_lloyd_max_codebook(dim=128, bits=3)

Stage 1: MSE Quantization (TurboQuantMSE)

from turboquant.turboquant import TurboQuantMSE

# Initialize for head_dim=128, 3-bit quantization
tq_mse = TurboQuantMSE(dim=128, bits=3)

# Compress a batch of vectors: shape (batch, dim)
keys = torch.randn(512, 128)  # 512 key vectors
codes = tq_mse.quantize(keys)       # integer codes, (512, 128)
reconstructed = tq_mse.dequantize(codes)  # approximate keys, (512, 128)

Stage 2: Unbiased Inner Product Estimation (TurboQuantProd)

from turboquant.turboquant import TurboQuantProd

# Initialize with QJL correction
tq_prod = TurboQuantProd(dim=128, bits=3, proj_dim=64)

# Compress key vectors (stores codes + QJL residual signs)
compressed = tq_prod.compress(keys)  # dict with 'codes', 'signs', 'residual_norms'

# Estimate inner products <query, key> for all keys — unbiased estimator
query = torch.randn(128)
scores = tq_prod.inner_product(query, compressed)  # shape (512,)

KV Cache Wrapper (TurboQuantKVCache)

from turboquant.turboquant import TurboQuantKVCache

# Wrap a KV cache for a single attention head
cache = TurboQuantKVCache(dim=128, bits=3, proj_dim=64)

# Add key/value vectors as tokens are generated
cache.append_key(new_key)    # shape (dim,)
cache.append_value(new_val)  # shape (dim,)

# Compute attention scores for a query against all cached keys
query = torch.randn(128)
scores = cache.attention_scores(query)  # shape (seq_len,), unbiased

# Get values (MSE-reconstructed, used for weighted sum)
values = cache.get_values()  # shape (seq_len, dim)

Production Compressors (for real model tensors)

from turboquant.compressors import TurboQuantCompressorV2, TurboQuantCompressorMSE

# Key compressor — supports asymmetric attention score computation
key_compressor = TurboQuantCompressorV2(dim=128, bits=3, proj_dim=64)

# Compress all keys in a layer: shape (num_heads, seq_len, head_dim)
compressed_keys = key_compressor.compress(layer_keys)

# Compute attention scores directly from compressed keys (no decompress needed)
# query shape: (num_heads, head_dim)
scores = key_compressor.asymmetric_attention_scores(query, compressed_keys)
# scores shape: (num_heads, seq_len)

# Value compressor — MSE reconstruction (Stage 1 only, acceptable for values)
val_compressor = TurboQuantCompressorMSE(dim=128, bits=3)
compressed_vals = val_compressor.compress(layer_values)
reconstructed_vals = val_compressor.decompress(compressed_vals)

Common Patterns

Pattern 1: Compress a Full Model's KV Cache

import torch
from turboquant.compressors import TurboQuantCompressorV2, TurboQuantCompressorMSE

def compress_kv_cache(kv_cache, head_dim=128, bits=3, proj_dim=64):
    """
    kv_cache: list of (keys, values) per layer
              keys/values shape: (num_heads, seq_len, head_dim)
    Returns list of compressed (keys, values) per layer.
    """
    key_comp = TurboQuantCompressorV2(dim=head_dim, bits=bits, proj_dim=proj_dim)
    val_comp = TurboQuantCompressorMSE(dim=head_dim, bits=bits)

    compressed = []
    for layer_keys, layer_vals in kv_cache:
        c_keys = key_comp.compress(layer_keys)
        c_vals = val_comp.compress(layer_vals)
        compressed.append((c_keys, c_vals))

    return compressed, key_comp, val_comp


def run_attention_with_compressed_cache(query, compressed_keys, compressed_vals,
                                        key_comp, val_comp):
    """
    query: (num_heads, head_dim)
    Returns: attention output (num_heads, head_dim)
    """
    # Unbiased attention scores from compressed keys
    scores = key_comp.asymmetric_attention_scores(query, compressed_keys)
    # scores: (num_heads, seq_len)

    attn_weights = torch.softmax(scores, dim=-1)  # (num_heads, seq_len)

    # Decompress values and compute weighted sum
    values = val_comp.decompress(compressed_vals)  # (num_heads, seq_len, head_dim)
    output = torch.einsum('hs,hsd->hd', attn_weights, values)
    return output

Pattern 2: Validate Compression Quality

import torch
import torch.nn.functional as F
from turboquant.turboquant import TurboQuantProd

def measure_attention_fidelity(keys, queries, bits=3, proj_dim=64):
    """
    Measure how well TurboQuant preserves attention distributions.
    keys:    (seq_len, head_dim)
    queries: (num_queries, head_dim)
    """
    dim = keys.shape[-1]
    tq = TurboQuantProd(dim=dim, bits=bits, proj_dim=proj_dim)

    compressed = tq.compress(keys)

    cosine_sims = []
    top1_matches = []

    for q in queries:
        # True attention scores
        true_scores = (keys @ q)  # (seq_len,)
        true_attn = torch.softmax(true_scores, dim=0)

        # TurboQuant estimated scores
        est_scores = tq.inner_product(q, compressed)  # (seq_len,)
        est_attn = torch.softmax(est_scores, dim=0)

        # Cosine similarity of attention distributions
        cos_sim = F.cosine_similarity(true_attn.unsqueeze(0),
                                       est_attn.unsqueeze(0)).item()
        cosine_sims.append(cos_sim)

        # Top-1 match
        top1_matches.append(true_attn.argmax() == est_attn.argmax())

    return {
        'mean_cosine_sim': sum(cosine_sims) / len(cosine_sims),
        'top1_accuracy': sum(top1_matches) / len(top1_matches),
    }

# Example usage
keys = torch.randn(2048, 128)
keys = F.normalize(keys, dim=-1)
queries = torch.randn(100, 128)
queries = F.normalize(queries, dim=-1)

results = measure_attention_fidelity(keys, queries, bits=3)
print(f"Cosine similarity: {results['mean_cosine_sim']:.4f}")
print(f"Top-1 accuracy:    {results['top1_accuracy']:.2%}")

Pattern 3: Needle-in-Haystack Retrieval Test

import torch
import torch.nn.functional as F
from turboquant.turboquant import TurboQuantProd

def needle_in_haystack(seq_len=2048, dim=128, bits=3):
    """Test whether TurboQuant preserves nearest-neighbor ordering."""
    tq = TurboQuantProd(dim=dim, bits=bits, proj_dim=64)

    # Build haystack of random unit vectors
    haystack = F.normalize(torch.randn(seq_len, dim), dim=-1)

    # Insert needle at random position
    needle_idx = torch.randint(0, seq_len, (1,)).item()
    query = F.normalize(torch.randn(dim), dim=0)
    needle = query + 0.1 * torch.randn(dim)  # Similar to query
    needle = F.normalize(needle, dim=0)
    haystack[needle_idx] = needle

    # Compress
    compressed = tq.compress(haystack)

    # True nearest neighbor
    true_scores = haystack @ query
    true_best = true_scores.argmax().item()

    # TurboQuant estimated nearest neighbor
    est_scores = tq.inner_product(query, compressed)
    est_best = est_scores.argmax().item()

    return true_best == est_best, true_best, est_best

# Run multiple trials
successes = sum(needle_in_haystack(seq_len=8192)[0] for _ in range(20))
print(f"Retrieval accuracy: {successes}/20")

Pattern 4: Compute Memory Savings

def estimate_memory_savings(num_layers, num_kv_heads, seq_len, head_dim,
                             bits, proj_dim=64):
    """
    Estimate compressed KV cache size vs FP16 baseline.
    """
    # FP16 baseline: 2 bytes per element
    fp16_bytes = num_layers * 2 * num_kv_heads * seq_len * head_dim * 2

    # Stage 1 codes: bits per element, packed into bytes
    codes_bytes = (num_layers * 2 * num_kv_heads * seq_len * head_dim * bits) // 8

    # Stage 2 signs (keys only): 1 bit per proj_dim element
    signs_bytes = (num_layers * num_kv_heads * seq_len * proj_dim) // 8

    # Residual norms: 1 float16 per vector (keys only)
    norms_bytes = num_layers * num_kv_heads * seq_len * 2

    total_compressed = codes_bytes + signs_bytes + norms_bytes
    ratio = fp16_bytes / total_compressed

    print(f"FP16 baseline:      {fp16_bytes / 1e6:.1f} MB")
    print(f"TurboQuant {bits}-bit:  {total_compressed / 1e6:.1f} MB")
    print(f"Compression ratio:  {ratio:.1f}x")
    return ratio

# Qwen2.5-3B: 36 layers, 2 KV heads, head_dim=128
estimate_memory_savings(
    num_layers=36, num_kv_heads=2,
    seq_len=8192, head_dim=128, bits=3
)
# FP16 baseline:      289.4 MB
# TurboQuant 3-bit:   57.9 MB
# Compression ratio:  5.0x

Algorithm Details

Why Random Rotation?

Rotating by a random orthogonal matrix R maps unit vectors to a space where each coordinate follows N(0, 1/d). This makes coordinates nearly independent with known distribution — enabling optimal per-coordinate scalar quantization (Lloyd-Max).

Why QJL for Keys but Not Values?

Keys: Used in dot products with queries. Bias in inner product estimates directly corrupts attention weights. QJL correction is essential.
Values: Used in weighted sums after softmax. Small per-vector MSE errors average out. Stage 1 MSE quantization is sufficient.

Choosing `proj_dim` (QJL projection dimension)

Higher proj_dim → lower variance in inner product estimates, but more memory:

# Rule of thumb: proj_dim = head_dim // 2 is a good default
# head_dim=128 → proj_dim=64
# head_dim=64  → proj_dim=32
# head_dim=256 → proj_dim=128

Bit-width Selection Guide

Bits	Compression	Cosine Sim	Top-1 Match	Use Case
4	3.8x	0.999	87%	Quality-critical tasks
3	5.0x	0.995	82%	Recommended default
2	7.3x	0.988	66%	Extreme memory pressure

Troubleshooting

`scipy` import error when building codebooks:

pip install scipy

CUDA out of memory during `validate.py`:

Requires ≥6GB VRAM for Qwen2.5-3B in 4-bit
Reduce seq_len in the validation script or use a smaller model

Inner product estimates have high variance:

Increase proj_dim (try head_dim instead of head_dim // 2)
Check that input vectors are normalized before compressing

Codebook build is slow on first run:

Lloyd-Max uses numerical integration (scipy) — this is expected
Codebooks are precomputed once per (dim, bits) combination; cache them:

import pickle

# Save codebook
boundaries, centroids = build_lloyd_max_codebook(dim=128, bits=3)
with open('codebook_128_3bit.pkl', 'wb') as f:
    pickle.dump((boundaries, centroids), f)

# Load cached codebook
with open('codebook_128_3bit.pkl', 'rb') as f:
    boundaries, centroids = pickle.load(f)

Attention fidelity lower than expected:

Ensure vectors are L2-normalized before compressing (F.normalize(x, dim=-1))
The compressors in compressors.py handle normalization internally; TurboQuantProd expects unit vectors

References

TurboQuant paper — ICLR 2026
QJL paper — 1-bit residual correction technique
PolarQuant — Related polar coordinate approach

Related skills

Setup Matt Pocock SkillsScaffold the per-repo configuration that Matt Pocock’s engineering agent skills rely on so they understand the issue tracker, triage labels, and domain documentation la462k185k

Lark Skill MakerQuickly turn any Lark/Feishu OpenAPI call or multi-step workflow into a reusable agent skill with its own SKILL.md.379k15.8k

CavemanSlash token usage by roughly 75% while keeping every technical detail intact when working with Claude Code, Cursor or similar agents.378k92.5k

Lark AppsConnect Claude, Cursor or custom agents directly to Lark (Feishu) for messaging, document automation, approval workflows and enterprise data access.375k

Running Claude Code Via Litellm CopilotRun Claude Code at a fraction of the cost by routing requests through LiteLLM to the GitHub Copilot Chat API.270k72

Codex PetGenerate a complete Codex Pet spritesheet and metadata from one reference image without needing an OpenAI key or Codex Pro.246k8

How it compares

Pick TurboQuant PyTorch over generic LLM optimization tips when the goal is a reproducible TurboQuant KV cache implementation rather than broad prompt or batching tweaks.

FAQ

What quantization pipeline does TurboQuant PyTorch implement?

TurboQuant PyTorch implements Google's TurboQuant pipeline in PyTorch: random rotation, Lloyd-Max vector quantization, and QJL residual correction to compress KV cache while keeping attention inner products accurate for long-context LLM inference.

When should ML engineers use TurboQuant PyTorch?

ML engineers should use TurboQuant PyTorch when serving long-context transformers in PyTorch and needing lower KV cache memory than full-precision storage, especially when experimenting with 3-bit cache compression based on the ICLR 2026 TurboQuant paper.

Is Turboquant Pytorch safe to install?

skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

AI & Agent Buildingllmautomation