
AWQ Quantization
Quantize open LLM weights with AWQ (4-bit) and pick GEMM vs GEMV kernels for your inference latency and batch shape.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill awq-quantization
What is this skill?
- Explains activation-aware scaling (~1% salient weights) and the core AWQ loss formula versus GPTQ Hessian reconstruction
- Compares AWQ vs GPTQ on calibration size (128–1024 tokens), overfitting risk, and cross-domain generalization
- Documents WQLinear_GEMM for batch throughput and WQLinear_GEMV for batch_size=1 (~20% faster streaming)
- Shows quant_config version switches and practical deployment tradeoffs for chat vs batch inference
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 1/3 security scanners passed (skills.sh audits).
Common Questions / FAQ
Is AWQ Quantization safe to install?
skills.sh reports 1 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md - AWQ Quantization
# AWQ Advanced Usage Guide

## Quantization Algorithm Details

### How AWQ Works

AWQ (Activation-aware Weight Quantization) is based on the key insight that not all weights in an LLM are equally important. The algorithm:

1. **Identifies salient weights** (~1%) by examining activation distributions
2. **Applies mathematical scaling** to protect critical channels
3. **Quantizes remaining weights** to 4-bit with minimal error

**Core formula**: `L(s) = ||Q(W * s)(s^-1 * X) - W * X||`

Where:
- `Q` is the quantization function
- `W` is the weight matrix
- `s` is the scaling factor
- `X` is the input activation

### Why AWQ Outperforms GPTQ

| Aspect | AWQ | GPTQ |
|--------|-----|------|
| Calibration approach | Activation-aware scaling | Hessian-based reconstruction |
| Overfitting risk | Low (no backprop) | Higher (reconstruction-based) |
| Calibration data | 128-1024 tokens | Larger datasets needed |
| Generalization | Better across domains | Can overfit to calibration |

## WQLinear Kernel Variants

AutoAWQ provides multiple kernel implementations for different use cases:

### WQLinear_GEMM

- **Use case**: Batch inference, training
- **Best for**: Batch sizes > 1, throughput optimization
- **Implementation**: General matrix multiplication

```python
quant_config = {"version": "GEMM"}
```

### WQLinear_GEMV

- **Use case**: Single-token generation
- **Best for**: Streaming, chat applications
- **Speedup**: ~20% faster than GEMM for batch_size=1
- **Limitation**: Only works with batch_size=1

```python
quant_config = {"version": "GEMV"}
```

### WQLinear_GEMVFast

- **Use case**: Optimized single-token generation
- **Requirements**: awq_v2_ext kernels installed
- **Best for**: Maximum single-token speed

```python
# Requires autoawq[kernels] installation
quant_config = {"version": "gemv_fast"}
```

### WQLinear_Marlin

- **Use case**: High-throughput inference
- **Requirements**: Ampere+ GPUs (Compute Capability 8.0+)
- **Speedup**: 2x faster on A100/H100

```python
from transformers import AwqConfig

config = AwqConfig(bits=4, version="marlin")
```

### WQLinear_Exllama / ExllamaV2

- **Use case**: AMD GPU compatibility, faster prefill
- **Benefits**: Works with ROCm

```python
config = AwqConfig(bits=4, version="exllama")
```

### WQLinear_IPEX

- **Use case**: Intel CPU/XPU acceleration
- **Requirements**: Intel Extension for PyTorch, torch 2.4+

```bash
pip install autoawq[cpu]
```

## Group Size Configuration

Group size determines how weights are grouped for quantization:

| Group Size | Model Size | Accuracy | Speed | Use Case |
|------------|------------|----------|-------|----------|
| 32 | Larger | Best | Slower | Maximum accuracy |
| **128** | Medium | Good | Fast | **Recommended default** |
| 256 | Smaller | Lower | Faster | Speed-critical |

```python
quant_config = {
    "q_group_size": 128,  # Recommended
    "w_bit": 4,
    "zero_point": True
}
```

## Zero-Point Quantization

Zero-point quantization adds an offset to handle asymmetric weight distributions:

```python
# With zero-point (recommended for most models)
quant_config = {"zero_point": True, "w_bit": 4, "q_group_size": 128}

# Without zero-point (symmetric quantization)
quant_config = {"zero_point": False, "w_bit": 4, "q_group_size": 128}
```

**When to disable zero-point**:
- Models with symmetric weight distributions
- When using specific kernels that don't support it

## Custom Calibration Strategies

### Domain-Specific Calibration

For domain-specific models, use relevant calibration data:

```python
# Medical domain
medical_samples = [
    "Patient presents with acute respiratory symptoms...",
    "Differential diagnosis includes pneumonia, bronchitis...",
    # More domain-specific examples
]

model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data=medical_samples,
    max_calib_samples=256
)
```

### Instruction-Tuned Model Calibration

For chat/instruction models, include conversational data:

```python
chat_samples = [
    "