
Gptq
Pick calibration data and run GPTQ quantization so compressed LLMs stay accurate instead of suffering perplexity blow-ups.
Overview
gptq is an agent skill for the Build phase that guides GPTQ calibration data selection and quantization tradeoffs to preserve model accuracy after weight compression.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill gptq
What is this skill?
- Explains why calibration drives Hessian-aware weight importance and post-quant perplexity
- Recommends 128–256 samples × 512 tokens (~65K–131K tokens); <64 underfits, >512 diminishing returns
- Domain recipes: C4 for general LLMs, The Stack for code models, ShareGPT/Alpaca for chat models
- Quality bands: good calibration keeps the perplexity increase under 1.5%; poor calibration costs 5–10%; no calibration can produce gibberish
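The sample counts above turn into token budgets by simple multiplication. A minimal sketch, assuming a hypothetical helper `calibration_budget` that is not part of the skill or any library; the "unrated" label for counts the guidance does not cover is also an assumption:

```python
def calibration_budget(num_samples: int, tokens_per_sample: int = 512) -> dict:
    """Return the total token count and a rough verdict for a calibration set.

    Thresholds follow the guidance above: fewer than 64 samples risks
    underfitting, 128-256 is recommended, and more than 512 gives
    diminishing returns. Counts outside those bands are labeled "unrated"
    (an assumption; the guidance does not rate them).
    """
    total_tokens = num_samples * tokens_per_sample
    if num_samples < 64:
        verdict = "underfit risk"
    elif 128 <= num_samples <= 256:
        verdict = "recommended"
    elif num_samples > 512:
        verdict = "diminishing returns"
    else:
        verdict = "unrated"
    return {"total_tokens": total_tokens, "verdict": verdict}


if __name__ == "__main__":
    for n in (32, 128, 256, 1024):
        print(n, calibration_budget(n))
```

With 512-token windows, 128 samples come to 65,536 tokens and 256 samples to 131,072, which is where the ~65K–131K range comes from.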
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You want a smaller quantized model but don’t know what calibration text to use, and bad data silently wrecks perplexity or outputs.
Who is it for?
Indie ML builders quantizing Llama-, CodeLlama-, or chat-tuned checkpoints for self-hosted inference with tight VRAM budgets.
Skip if: Teams with no local quantization pipeline, pure API-only products, or builders who only need prompting help without model compression.
When should I use this skill?
User is quantizing with GPTQ, choosing calibration datasets, or debugging perplexity blow-ups after quantization.
What do I get? / Deliverables
You leave with a domain-matched 128–256 sample calibration plan and Hugging Face-oriented snippets sized for ~512-token windows before running GPTQ.
- Calibration data sampling plan by model domain
- Executable dataset streaming snippets
- Expected perplexity impact checklist
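One way to turn the perplexity checklist into a repeatable check is to compute the relative increase between the original and quantized model and compare it with the bands quoted on this page. A minimal sketch: the function names are hypothetical, and the "borderline" (1.5–5%) and "severe" (>10%) labels are assumptions, since the page only rates the "good" and "poor" bands:

```python
def perplexity_increase_pct(baseline_ppl: float, quantized_ppl: float) -> float:
    """Relative perplexity increase of the quantized model, in percent."""
    if baseline_ppl <= 0:
        raise ValueError("baseline perplexity must be positive")
    return (quantized_ppl - baseline_ppl) / baseline_ppl * 100.0


def calibration_band(increase_pct: float) -> str:
    """Map a perplexity increase onto the quality bands described on this page."""
    if increase_pct < 1.5:
        return "good"        # under 1.5%: good calibration
    if increase_pct < 5.0:
        return "borderline"  # assumption: this range is not rated by the guidance
    if increase_pct <= 10.0:
        return "poor"        # 5-10%: poor calibration
    return "severe"          # assumption: beyond the quoted range


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    increase = perplexity_increase_pct(baseline_ppl=10.0, quantized_ppl=10.1)
    print(f"{increase:.2f}% -> {calibration_band(increase)}")
```

Both perplexities should be measured on the same held-out text with the same sequence length, otherwise the percentage is not comparable.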
Journey fit
Quantization is an implementation-time ML engineering step when you are preparing models for cheaper inference on your own stack. Backend subphase covers model compression pipelines that feed APIs, agents, or self-hosted inference—not launch marketing or ops dashboards.
How it compares
Calibration methodology for GPTQ—not a general LLM fine-tuning or LoRA training skill.
Common Questions / FAQ
Who is gptq for?
Developers compressing open-weight LLMs who need concrete calibration dataset choices and token budgets, not high-level “quantize your model” slogans.
When should I use gptq?
In the build/backend phase: when preparing a checkpoint for GPTQ before deployment, when perplexity spikes after a bad calibration run, or when switching a code model from general C4 to Stack-derived samples.
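The domain matching mentioned here can be captured as a small lookup. A minimal sketch, assuming a hypothetical `pick_calibration_source` helper and hypothetical domain keys; the dataset names are the ones this page recommends:

```python
# Domain -> calibration source, following the recipes on this page.
CALIBRATION_SOURCES = {
    "general": "C4",
    "code": "The Stack",
    "chat": "ShareGPT or Alpaca",
}


def pick_calibration_source(model_domain: str) -> str:
    """Return the recommended calibration source for a model domain.

    Unknown domains raise an error rather than silently falling back to
    general web text, because mismatched calibration data is the failure
    mode this skill is meant to prevent.
    """
    key = model_domain.strip().lower()
    if key not in CALIBRATION_SOURCES:
        raise KeyError(
            f"no recipe for domain {model_domain!r}; use your own in-domain text"
        )
    return CALIBRATION_SOURCES[key]


if __name__ == "__main__":
    print(pick_calibration_source("code"))
```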
Is gptq safe to install?
Check this page’s Security Audits panel; the skill implies downloading public datasets and running local quantization—scope network and disk access accordingly.
SKILL.md
# GPTQ Calibration Guide

Complete guide to calibration data selection and the quantization process.

## Calibration Data Selection

### Why calibration matters

Calibration data is used to:

1. **Compute weight importance** (Hessian matrix)
2. **Minimize quantization error** for important weights
3. **Preserve model accuracy** after quantization

**Impact**:
- Good calibration: <1.5% perplexity increase
- Poor calibration: 5-10% perplexity increase
- No calibration: Model may output gibberish

### Dataset size

**Recommended**:
- **128-256 samples** of 512 tokens each
- Total: 65K-131K tokens

**More is not always better**:
- <64 samples: Underfitting (poor quality)
- 128-256 samples: Sweet spot
- >512 samples: Diminishing returns, slower quantization

### Dataset selection by domain

**General purpose models (GPT, Llama)**:
```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Use the tokenizer of the model you are about to quantize
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# C4 dataset (recommended for general models)
dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)
calibration_data = [
    # Keep the full tokenizer output (input_ids + attention_mask), truncated to 512 tokens
    tokenizer(example["text"], truncation=True, max_length=512)
    for example in dataset.take(128)
]
```

**Code models (CodeLlama, StarCoder)**:
```python
# The Stack dataset (gated on the Hugging Face Hub; accept its terms first)
dataset = load_dataset(
    "bigcode/the-stack",
    data_dir="data/python",  # Or your target language
    split="train",
    streaming=True,
)
# Select the language before take(), so all 128 samples are in the target language
calibration_data = [
    tokenizer(example["content"], truncation=True, max_length=512)
    for example in dataset.take(128)
]
```

**Chat models**:
```python
# ShareGPT or Alpaca format (requires a tokenizer that ships a chat template)
dataset = load_dataset("anon8231489123/ShareGPT_Vicuna_unfiltered", split="train")
role_map = {"human": "user", "gpt": "assistant"}
calibration_data = []
for example in dataset.select(range(128)):
    # ShareGPT stores turns as {"from", "value"}; chat templates expect {"role", "content"}
    messages = [
        {"role": role_map[turn["from"]], "content": turn["value"]}
        for turn in example["conversations"]
        if turn["from"] in role_map
    ]
    # Format as conversation; truncation=True is needed for max_length to take effect
    conversation = tokenizer.apply_chat_template(
        messages, tokenize=True, truncation=True, max_length=512, return_dict=True
    )
    calibration_data.append(conversation)
```

**Domain-specific (medical, legal)**:
```python
# Use domain-specific text ("medical_dataset" is a placeholder for your own corpus)
dataset = load_dataset("medical_dataset", split="train")
calibration_data = [
    tokenizer(example["text"], truncation=True, max_length=512)
    for example in dataset.select(range(256))  # More samples for niche domains
]
```

## Quantization Process

### Basic quantization

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
from datasets import load_dataset

# 1. Load model
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoGPTQForCausalLM.from_pretrained(
    model_name,
    quantize_config=BaseQuantizeConfig(
        bits=4,
        group_size=128,
        desc_act=False
    )
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 2. Prepare calibration data
# quantize() expects dicts with "input_ids" and "attention_mask",
# so keep the full tokenizer output rather than a bare list of token IDs
dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)
calibration_data = [
    tokenizer(example["text"], truncation=True, max_length=512)
    for example in dataset.take(128)
]

# 3. Quantize
model.quantize(calibration_data)

# 4. Save
model.save_quantized("llama-2-7b-gptq")
```

**Time**: ~10-30 minutes for 7B model on A100

### Advanced configuration

```python
from auto_gptq import BaseQuantizeConfig

config = BaseQuantizeConfig(
    bits=4,                # 3, 4, or 8 bits
    group_size=128,        # 32, 64, 128, or -1 (per-column)
    desc_act=False,        # Activation order (True = better accuracy, slower)
    damp_percent=0.01,     # Dampening (0.001-0.1, default 0.01)
    static_groups=False,   # Static quantization
    sym=True,              # Symmetric quantization
    true_sequential=True,  # Sequential quantization (more accurate)
)
# Note: model_seqlen is an option of transformers.GPTQConfig,
# not a BaseQuantizeConfig field, so it is not set here.
```

**Parameter tuning**:
- `damp_percent`: Lower values regularize the Hessian less, which can improve accuracy but risks numerical instability. Try 0.005-0.02.
- `desc_act=True`: 0.5-1% better accuracy, 20-30% slower inference
- `group_size=32`: Better accuracy, slightly larger model

### Multi-GPU quantization

```python
# Quantize on multiple GPUs (faster); `config` is the BaseQuantizeConfig defined above
model = AutoGPTQForCausalLM.from_pretrained(
    model_name,
    quantize_config=config,
    device_map="auto",  # Distribute across GPUs
    max_memory={0: "40GB", 1: "40GB"}
)