
Hqq Quantization
Configure Half-Quadratic Quantization (HQQ) backends and mixed-precision layers so locally run LLMs fit GPU memory without guessing kernel choices.
Overview
hqq-quantization is an agent skill for the Build phase that documents advanced HQQLinear backend selection, per-layer kernels, and TorchAO mixed-precision setup for GPU-efficient LLM inference.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill hqq-quantization
What is this skill?
- Hardware-aware backend picker using CUDA compute capability (Ampere+ → marlin, Volta/Turing → aten, else pytorch_compile)
- Per-layer backend assignment pattern for attention vs MLP modules
- TorchAO int4 integration with inductor tuning flags documented
- Mixed-precision quantization workflows beyond default HQQ setup (advanced usage guide)
- CUDA compute capability threshold ≥80 routes to marlin backend in the documented selector
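The selector described in the bullets above can be sketched as a pure function of the compute capability, which makes the thresholds easy to check without a GPU. The function name `backend_for_capability` is hypothetical; the backend strings are the ones this guide documents and should be verified against the installed HQQ version.

```python
def backend_for_capability(major: int, minor: int) -> str:
    """Map a CUDA compute capability to the backend name this guide documents."""
    # Encode (major, minor) as a single number, e.g. (8, 6) -> 86.
    compute_cap = major * 10 + minor
    if compute_cap >= 80:  # Ampere and newer
        return "marlin"
    if compute_cap >= 70:  # Volta / Turing
        return "aten"
    return "pytorch_compile"  # older architectures


if __name__ == "__main__":
    for cap in [(6, 1), (7, 5), (8, 0), (8, 9)]:
        print(cap, "->", backend_for_capability(*cap))
```

On a real machine the pair would typically come from `torch.cuda.get_device_capability(0)`, which returns `(major, minor)`.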
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
Your quantized model runs slowly or OOMs because default HQQ backends do not match your GPU architecture or layer types.
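To judge whether a configuration is likely to fit in memory before tuning backends, a back-of-the-envelope estimate of the quantized weight size can help. The sketch below is not part of the skill; it assumes each quantization group stores 32 bits of metadata (for example a 16-bit scale and a 16-bit zero-point), and it ignores activations, the KV cache, and any layers kept in full precision.

```python
def estimate_weight_gib(n_params: float, nbits: int, group_size: int,
                        meta_bits: int = 32) -> float:
    """Rough weight-memory estimate for group-wise quantization, in GiB.

    Assumption: every group of `group_size` weights carries `meta_bits`
    bits of metadata (scale + zero-point). Real layouts may differ.
    """
    # Effective storage cost per weight: payload bits plus amortized metadata.
    bits_per_weight = nbits + meta_bits / group_size
    return n_params * bits_per_weight / 8 / 1024**3


if __name__ == "__main__":
    n = 7e9  # hypothetical 7B-parameter model
    print(f"fp16 baseline : {n * 16 / 8 / 1024**3:.2f} GiB")
    print(f"4-bit, g=64   : {estimate_weight_gib(n, 4, 64):.2f} GiB")
    print(f"2-bit, g=32   : {estimate_weight_gib(n, 2, 32):.2f} GiB")
```

Under these assumptions a 7B-parameter model at 4 bits with group size 64 needs roughly 3.7 GiB for weights, against about 13 GiB in fp16.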
Who is it for?
Indie ML builders optimizing inference on NVIDIA GPUs who already use HQQ and want agent-guided backend tuning in code.
Skip if: you are a beginner choosing a first model host, or your team only uses hosted APIs and has no local PyTorch quantization pipeline.
When should I use this skill?
User is configuring HQQ quantization backends, per-layer kernels, or TorchAO integration for a PyTorch LLM.
What do I get? / Deliverables
You apply documented backend and per-layer HQQ configuration so attention and MLP paths use appropriate kernels and TorchAO options where needed.
- Backend configuration snippets for HQQLinear global and per-layer setup
- Documented inductor tuning flags when using TorchAO int4 backend
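The per-layer pattern in these deliverables comes down to routing each module name to a backend by substring match. A minimal sketch of that routing, separated from any HQQ call so it can be tested on plain strings, might look like this. The table `LAYER_BACKENDS` and the function `backend_for_layer` are hypothetical names; the backend strings are the ones this guide mentions.

```python
# Hypothetical routing table: substring of the module name -> backend name.
# Order matters: the first matching pattern wins.
LAYER_BACKENDS = [
    ("attn", "marlin"),   # attention projections
    ("mlp", "bitblas"),   # feed-forward projections
]


def backend_for_layer(name: str, default: str = "aten") -> str:
    """Return the backend name for a module, falling back to `default`."""
    for pattern, backend in LAYER_BACKENDS:
        if pattern in name:
            return backend
    return default


if __name__ == "__main__":
    for layer in ["model.layers.0.self_attn.q_proj",
                  "model.layers.0.mlp.up_proj",
                  "lm_head"]:
        print(layer, "->", backend_for_layer(layer))
```

Keeping the routing as data makes it easy to change assignments after profiling without touching the loop that applies them.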
Journey fit
Model compression and inference-backend selection are build-phase backend work, done while assembling or optimizing an AI product’s inference path. The guide focuses on HQQLinear backend selection (Marlin, ATen, BitBLAS, TorchAO) and per-layer assignments, which are implementation details of a model-serving stack.
How it compares
This is a procedural reference for quantization tuning, not a one-click model downloader. It pairs with research skills for evaluation and does not replace a full training stack.
Common Questions / FAQ
Who is hqq-quantization for?
Solo builders and agent users implementing local LLM inference who need CUDA-aware HQQ backend recipes in PyTorch.
When should I use hqq-quantization?
During backend work in the Build phase: when swapping quantization backends after profiling, using marlin for attention and bitblas for MLP layers, or enabling TorchAO int4.
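Swapping backends after profiling presupposes a repeatable latency measurement. The helper below is one possible sketch, not something the skill ships: `median_latency_ms` and its `sync` parameter are hypothetical, and on a GPU you would pass `sync=torch.cuda.synchronize` so that asynchronous kernels are included in the timing.

```python
import time
from statistics import median
from typing import Callable, Optional


def median_latency_ms(fn: Callable[[], object],
                      warmup: int = 3,
                      iters: int = 10,
                      sync: Optional[Callable[[], None]] = None) -> float:
    """Return the median wall-clock latency of `fn` in milliseconds."""
    # Warm-up runs absorb one-time costs such as compilation or autotuning.
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for queued device work before stopping the clock
        samples.append((time.perf_counter() - start) * 1e3)
    return median(samples)


if __name__ == "__main__":
    print(f"{median_latency_ms(lambda: sum(range(100_000))):.3f} ms")
```

Running the same callable once per candidate backend and comparing the medians gives a simple basis for the per-layer choice.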
Is hqq-quantization safe to install?
It is documentation-style Python guidance; review the Security Audits panel on this page before running agent-generated GPU code in production.
SKILL.md
# HQQ Advanced Usage Guide

## Custom Backend Configuration

### Backend Selection by Hardware

```python
from hqq.core.quantize import HQQLinear
import torch

def select_optimal_backend():
    """Select best backend based on hardware."""
    device = torch.cuda.get_device_properties(0)
    compute_cap = device.major * 10 + device.minor

    if compute_cap >= 80:  # Ampere+
        return "marlin"
    elif compute_cap >= 70:  # Volta/Turing
        return "aten"
    else:
        return "pytorch_compile"

backend = select_optimal_backend()
HQQLinear.set_backend(backend)
print(f"Using backend: {backend}")
```

### Per-Layer Backend Assignment

```python
from hqq.core.quantize import HQQLinear

def set_layer_backends(model):
    """Assign optimal backends per layer type."""
    for name, module in model.named_modules():
        if isinstance(module, HQQLinear):
            if "attn" in name:
                module.set_backend("marlin")  # Fast for attention
            elif "mlp" in name:
                module.set_backend("bitblas")  # Flexible for MLP
            else:
                module.set_backend("aten")

set_layer_backends(model)
```

### TorchAO Integration

```python
import torch
import torchao
from hqq.core.quantize import HQQLinear

# Enable TorchAO int4 backend
HQQLinear.set_backend("torchao_int4")

# Configure TorchAO options
torch._inductor.config.coordinate_descent_tuning = True
torch._inductor.config.triton.unique_kernel_names = True
```

## Mixed Precision Quantization

### Layer-Specific Configuration

```python
import torch
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear
from transformers import AutoModelForCausalLM

# Define configs per layer pattern
quant_configs = {
    # Embeddings: Keep full precision
    "embed_tokens": None,
    "lm_head": None,

    # Attention: 4-bit with larger groups
    "self_attn.q_proj": BaseQuantizeConfig(nbits=4, group_size=128),
    "self_attn.k_proj": BaseQuantizeConfig(nbits=4, group_size=128),
    "self_attn.v_proj": BaseQuantizeConfig(nbits=4, group_size=128),
    "self_attn.o_proj": BaseQuantizeConfig(nbits=4, group_size=128),

    # MLP: More aggressive 2-bit
    "mlp.gate_proj": BaseQuantizeConfig(nbits=2, group_size=32),
    "mlp.up_proj": BaseQuantizeConfig(nbits=2, group_size=32),
    "mlp.down_proj": BaseQuantizeConfig(nbits=3, group_size=64),
}

def quantize_with_mixed_precision(model, configs):
    """Apply mixed precision quantization."""
    # Snapshot the module list first, because layers are replaced inside the loop
    for name, module in list(model.named_modules()):
        if isinstance(module, torch.nn.Linear):
            for pattern, config in configs.items():
                if pattern in name:
                    if config is not None:  # None means keep full precision
                        # Look up the parent module so the child can be swapped in place
                        parent_name, _, child_name = name.rpartition(".")
                        parent = model.get_submodule(parent_name)
                        setattr(parent, child_name, HQQLinear(module, config))
                    break
    return model
```

### Sensitivity-Based Quantization

```python
import torch
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

def measure_layer_sensitivity(model, calibration_data, layer_name):
    """Measure quantization sensitivity of a layer."""
    layer_input = None
    original_output = None
    quantized_output = None

    # Capture the layer's own input and output during a full-model forward pass
    def hook_original(module, input, output):
        nonlocal layer_input, original_output
        layer_input = input[0].clone()
        original_output = output.clone()

    layer = dict(model.named_modules())[layer_name]
    handle = layer.register_forward_hook(hook_original)

    with torch.no_grad():
        model(calibration_data)
    handle.remove()

    # Quantize and measure error
    for nbits in [4, 3, 2]:
        config = BaseQuantizeConfig(nbits=nbits, group_size=64)
        quant_layer = HQQLinear(layer, config)

        with torch.no_grad():
            # Feed the captured layer input, not the model-level input
            quantized_output = quant_layer(layer_input)

        error = torch.mean((original_output - quantized_output) ** 2).item()
        print(f"{layer_name} @ {nbits}-bit: MS