
GGUF Quantization
Run and optimize quantized GGUF models with llama.cpp—speculative decoding, server batching, and custom conversion—for local or self-hosted agent inference.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill gguf-quantization
What is this skill?
- Documents speculative decoding with `llama-speculative`, draft models, and `--draft` token verification counts
- Covers `llama-server` with `--parallel` concurrent requests and `--cont-batching` continuous batching
- Shows `llama_cpp` Python setup with `n_gpu_layers`, `n_batch`, and multi-prompt inference loops
- Includes self-speculative decoding for `llama-cli` using static and dynamic lookup-cache files (`--lookup-cache-static`, `--lookup-cache-dynamic`)
- Advanced guide sections extend beyond basic Q4_K_M loading into conversion and server-scale inference
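The draft-and-verify idea behind the `--draft` count can be illustrated with a toy sketch. It assumes greedy decoding only, and every name in it (`speculative_generate`, `toy_target`, `toy_draft`) is hypothetical: the two "models" are plain functions standing in for a large model and a smaller draft model, not llama.cpp APIs. A real implementation would typically score all drafted tokens in one batched pass of the large model rather than one call per position.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token context to the next token
# (greedy decoding). This is a stand-in for a real LLM, not a llama.cpp API.
NextToken = Callable[[List[int]], int]


def greedy_generate(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: generate n_new tokens one at a time with a single model."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_generate(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    n_new: int,
    n_draft: int = 8,
) -> Tuple[List[int], int, int]:
    """Return (tokens, accepted_draft_tokens, drafted_tokens)."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    accepted = drafted = 0
    while len(tokens) < goal:
        # 1. The draft model proposes up to n_draft tokens autoregressively.
        k = min(n_draft, goal - len(tokens))
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        drafted += k
        # 2. The target model checks each proposed position in order.
        for i in range(k):
            expected = target(tokens + proposal[:i])
            if expected != proposal[i]:
                # 3. First mismatch: keep the agreed prefix plus the
                #    target's own token, and discard the rest of the draft.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                accepted += i
                break
        else:
            # Every drafted token matched the target's choice.
            tokens.extend(proposal)
            accepted += k
    return tokens, accepted, drafted


def toy_target(ctx: List[int]) -> int:
    """Deterministic toy 'large model' over a 50-token vocabulary."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def toy_draft(ctx: List[int]) -> int:
    """Toy 'draft model': agrees with the target except at some positions."""
    guess = toy_target(ctx)
    return (guess + 1) % 50 if len(ctx) % 5 == 0 else guess


out, n_accepted, n_drafted = speculative_generate(
    toy_target, toy_draft, [1, 2, 3], n_new=20, n_draft=8
)
print(f"accepted {n_accepted} of {n_drafted} drafted tokens")
```

Under greedy decoding the result matches what the target model would produce alone. The draft model only changes how many target checks are wasted on rejected tokens, which is why a well-matched draft model and a sensible draft length matter for speed.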
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 1/3 security scanners passed (skills.sh audits).
Journey fit
Most builders touch this while assembling agent stacks and local model servers in Build; the same patterns matter again under Operate when tuning latency and throughput. The agent-tooling subphase is where llama-server, llama-cli, GPU layer flags, and GGUF quantization choices directly affect what your coding agent can call locally.
Common Questions / FAQ
Is GGUF Quantization safe to install?
skills.sh reports 1 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md - GGUF Quantization
# GGUF Advanced Usage Guide

## Speculative Decoding

### Draft Model Approach

```bash
# Use smaller model as draft for faster generation
./llama-speculative \
  -m large-model-q4_k_m.gguf \
  -md draft-model-q4_k_m.gguf \
  -p "Write a story about AI" \
  -n 500 \
  --draft 8  # Draft tokens before verification
```

### Self-Speculative Decoding

```bash
# Draft tokens from n-gram lookup caches instead of a separate draft model
./llama-cli -m model-q4_k_m.gguf \
  --lookup-cache-static lookup.bin \
  --lookup-cache-dynamic lookup-dynamic.bin \
  -p "Hello world"
```

## Batched Inference

### Process Multiple Prompts

```python
from llama_cpp import Llama

llm = Llama(
    model_path="model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,
    n_batch=512  # Tokens evaluated per batch during prompt processing
)

prompts = [
    "What is Python?",
    "Explain machine learning.",
    "Describe neural networks."
]

# Prompts are processed sequentially, one call per prompt
for prompt in prompts:
    output = llm(prompt, max_tokens=100)
    print(f"Q: {prompt}")
    print(f"A: {output['choices'][0]['text']}\n")
```

### Server Batching

```bash
# Start server with batching
# --parallel 4     Concurrent requests
# --cont-batching  Continuous batching
# (comments sit above the command so they do not break the line continuations)
./llama-server -m model-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 35 \
  -c 4096 \
  --parallel 4 \
  --cont-batching
```

## Custom Model Conversion

### Convert with Vocabulary Modifications

```python
# custom_convert.py
import subprocess
import sys

from transformers import AutoTokenizer

# Custom conversion with modified vocab
def convert_with_custom_vocab(model_path, output_path):
    # Load and modify tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # Add special tokens if needed
    special_tokens = {"additional_special_tokens": ["<|custom|>"]}
    tokenizer.add_special_tokens(special_tokens)
    tokenizer.save_pretrained(model_path)
    # Note: new tokens also need matching rows in the model's embedding
    # matrix (resize and re-save the model), or the vocab sizes will disagree.

    # Then run standard conversion. The script reads its arguments from the
    # command line, so it is invoked as a subprocess rather than via main().
    subprocess.run(
        [sys.executable, "./llama.cpp/convert_hf_to_gguf.py",
         model_path, "--outfile", output_path],
        check=True,
    )
```

### Convert Specific Architecture

```bash
# For Mistral-style models
python convert_hf_to_gguf.py ./mistral-model \
  --outfile mistral-f16.gguf \
  --outtype f16

# For Qwen models
python convert_hf_to_gguf.py ./qwen-model \
  --outfile qwen-f16.gguf \
  --outtype f16

# For Phi models
python convert_hf_to_gguf.py ./phi-model \
  --outfile phi-f16.gguf \
  --outtype f16
```

## Advanced Quantization

### Mixed Quantization

```bash
# Quantize different layer types differently
# (options must come before the input/output file arguments;
#  --allow-requantize only matters when the input is already quantized)
./llama-quantize \
  --allow-requantize \
  --leave-output-tensor \
  model-f16.gguf model-mixed.gguf Q4_K_M
```

### Quantization with Token Embeddings

```bash
# Keep embeddings at higher precision
./llama-quantize \
  --token-embedding-type f16 \
  model-f16.gguf model-q4.gguf Q4_K_M
```

### IQ Quantization (Importance-aware)

```bash
# Ultra-low bit quantization with importance
./llama-quantize --imatrix model.imatrix \
  model-f16.gguf model-iq2_xxs.gguf IQ2_XXS

# Available IQ types: IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_XS, IQ3_S, IQ4_XS
```

## Memory Optimization

### Memory Mapping

```python
from llama_cpp import Llama

# Use memory mapping for large models
llm = Llama(
    model_path="model-q4_k_m.gguf",
    use_mmap=True,   # Memory map the model
    use_mlock=False, # Don't lock in RAM
    n_gpu_layers=35
)
```

### Partial GPU Offload

```python
# Calculate layers to offload based on VRAM
import subprocess

from llama_cpp import Llama

def get_free_vram_gb():
    result = subprocess.run(
        ['nvidia-smi', '--query-gpu=memory.free', '--format=csv,nounits,noheader'],
        capture_output=True, text=True
    )
    # nvidia-smi prints one line per GPU (in MiB); use the first GPU
    return int(result.stdout.strip().splitlines()[0]) / 1024

# Estimate layers based on VRAM (rough: 0.5GB per layer for 7B Q4)
free_vram = get_free_vram_gb()
layers_to_offload = int(free_vram / 0.5)

llm = Llama(
    model_path="model-q4_k_m.gguf",
    n_gpu_layers=min(layers_to_offload, 35) #