
Llama Cpp
Tune local llama.cpp inference—CPU threads, BLAS, GPU layers, batch, and context—for faster token throughput on indie hardware.
Overview
llama-cpp is an agent skill most often used in Ship (also Build backend, Operate infra) that optimizes llama.cpp inference via CPU, BLAS, GPU offload, and batch settings.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill llama-cpp
What is this skill?
- CPU thread tuning (-t) with guidance to favor physical cores over hyperthreading
- OpenBLAS build (LLAMA_OPENBLAS=1) for roughly 2–3× matrix speedup
- GPU layer offload (-ngl) with OOM backoff workflow and nvidia-smi monitoring
- Batch and ubatch flags for throughput; context length (-c) tradeoffs
- Benchmark tables for CPU (M3 Max, 7950X, i9-13900K) and GPU offload scenarios
- OpenBLAS build documented as 2–3× speedup for matrix ops
- Benchmark table covers Apple M3 Max ~50 tok/s and AMD 7950X ~35 tok/s for Llama 2-7B Q4_K_M
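The thread-tuning advice depends on knowing how many physical cores the machine has. A minimal sketch for reading that number, assuming `lscpu` (Linux) or `sysctl` (macOS) is available; the helper name `physical_cores` is hypothetical and not part of the skill:

```shell
#!/usr/bin/env bash
# physical_cores: print the number of physical CPU cores (hypothetical helper).
physical_cores() {
  local n=""
  if command -v lscpu >/dev/null 2>&1; then
    # Count unique (core, socket) pairs; lines starting with '#' are headers.
    n=$(lscpu -p=Core,Socket 2>/dev/null | grep -v '^#' | sort -u | wc -l | tr -d ' ')
  elif [ "$(uname -s)" = "Darwin" ]; then
    n=$(sysctl -n hw.physicalcpu 2>/dev/null | tr -d ' ')
  fi
  # Fall back to the logical CPU count if no positive integer was found.
  case "$n" in
    ''|*[!0-9]*|0) n=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1) ;;
  esac
  echo "$n"
}
```

It could then feed the thread flag directly, for example `./llama-cli -m model.gguf -t "$(physical_cores)"`.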
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 1/3 security scanners passed (skills.sh audits).
What problem does it solve?
Local llama.cpp runs are slow or hit out-of-memory (OOM) errors, and you lack a systematic order for tuning threads, BLAS, -ngl, and context limits.
Who is it for?
Indie developers self-hosting GGUF models on laptop or workstation GPUs who need throughput before relying on the setup in production.
Skip if: Teams that only use hosted APIs and never compile or run llama.cpp binaries locally.
When should I use this skill?
When optimizing llama.cpp inference, tuning -t/-ngl/-c/-b flags, fixing OOM on GPU offload, or benchmarking local GGUF performance.
What do I get? / Deliverables
You leave with tuned llama-cli flags, a known-good GPU layer count, and benchmark-aligned expectations for tok/s on your hardware.
- Documented llama-cli command line with tuned performance flags
- Stable GPU layer count and context size that fits available VRAM
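A stable GPU layer count is typically found by backing off after an out-of-memory failure. The sketch below shows one way to automate that loop. The names `find_ngl` and the probe command are hypothetical, and it assumes the probe exits non-zero when the model does not fit, which should be verified against your own build:

```shell
#!/usr/bin/env bash
# find_ngl START STEP PROBE...
# Runs PROBE with the candidate layer count appended as its last argument,
# lowering the count by STEP until PROBE succeeds. Prints the first count
# that works, or 0 (CPU only) if none does.
find_ngl() {
  local ngl=$1 step=$2
  shift 2
  while [ "$ngl" -gt 0 ]; do
    if "$@" "$ngl" >/dev/null 2>&1; then
      echo "$ngl"
      return 0
    fi
    ngl=$((ngl - step))
  done
  echo 0
}

# Example probe (assumption: llama-cli fails with a non-zero exit on OOM):
# probe() { ./llama-cli -m model.gguf -c 2048 -n 1 -p "hi" -ngl "$1" </dev/null; }
# find_ngl 35 5 probe
```

Starting from the model's actual layer count rather than 999 keeps the number of probe runs small, since every value above the layer count behaves identically.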
Recommended Skills
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Performance tuning matters most when you are shipping or operating a local model stack, even though the flags are first applied during initial build setup. Perf is the canonical shelf because the guide optimizes tok/s, VRAM, and throughput rather than training or prompt design.
Where it fits
Pick -ngl and -c defaults while wiring a local agent API around llama-server.
Benchmark tok/s before release and lock thread and batch flags for your target hardware.
Diagnose post-deploy slowdowns by re-checking BLAS build, GPU layers, and context against nvidia-smi.
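When checking GPU layers and context against nvidia-smi, the useful number is how much VRAM is still free. A minimal sketch, assuming the CSV output of `nvidia-smi --query-gpu`; the helper name `vram_headroom` is hypothetical:

```shell
#!/usr/bin/env bash
# vram_headroom: read "used, total" pairs in MiB (one line per GPU) on stdin
# and print the free MiB for each GPU.
vram_headroom() {
  awk -F', *' 'NF >= 2 { print $2 - $1 }'
}

# Usage on a machine with an NVIDIA driver installed:
# nvidia-smi --query-gpu=memory.used,memory.total \
#   --format=csv,noheader,nounits | vram_headroom
```

If the headroom shrinks after a deploy, lowering `-c` or `-ngl` is the first thing to re-check.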
How it compares
Runtime inference tuning for llama.cpp binaries, not a model download catalog or fine-tuning pipeline.
Common Questions / FAQ
Who is llama-cpp for?
Solo builders shipping local LLM features via llama.cpp who own CPU/GPU tuning and want repeatable speed wins.
When should I use llama-cpp?
In Ship perf when launching local inference; in Build backend when first compiling llama.cpp; in Operate infra when production tok/s or VRAM regress.
Is llama-cpp safe to install?
It is documentation-style optimization guidance; review the Security Audits panel on this Prism page and validate commands against your own binaries and drivers.
SKILL.md
# Performance Optimization Guide

Maximize llama.cpp inference speed and efficiency.

## CPU Optimization

### Thread tuning

```bash
# Set threads (default: physical cores)
./llama-cli -m model.gguf -t 8

# For AMD Ryzen 9 7950X (16 cores, 32 threads)
./llama-cli -m model.gguf -t 16  # Best: one thread per physical core

# Avoid hyperthreading (slower for matrix ops)
```

### BLAS acceleration

```bash
# OpenBLAS (faster matrix ops), Makefile build
make LLAMA_OPENBLAS=1
# CMake-based builds use instead:
# cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS && cmake --build build --config Release
# BLAS gives 2-3× speedup (mainly on prompt processing, not token generation)
```

## GPU Offloading

### Layer offloading

```bash
# Offload 35 layers to GPU (hybrid mode)
./llama-cli -m model.gguf -ngl 35

# Offload all layers
./llama-cli -m model.gguf -ngl 999

# Find optimal value:
# Start with -ngl 999 (everything on the GPU)
# If OOM, retry from the model's layer count and reduce by 5 until it fits
```

### Memory usage

```bash
# Check VRAM usage (-s m selects the framebuffer memory columns)
nvidia-smi dmon -s m

# Reduce context if needed
./llama-cli -m model.gguf -c 2048  # 2K context instead of 4K
```

## Batch Processing

```bash
# Logical batch size for throughput (512 is the default in older builds)
./llama-cli -m model.gguf -b 512

# Physical batch (GPU): process 128 tokens at once
./llama-cli -m model.gguf -b 512 -ub 128
```

## Context Management

```bash
# Small context (512 tokens; the default in older builds)
./llama-cli -m model.gguf -c 512

# Longer context (slower, more memory)
./llama-cli -m model.gguf -c 4096

# Very long context (if model supports)
./llama-cli -m model.gguf -c 32768
```

## Benchmarks

### CPU Performance (Llama 2-7B Q4_K_M)

| Setup | Speed | Notes |
|-------|-------|-------|
| Apple M3 Max | 50 tok/s | Metal acceleration |
| AMD 7950X (16c) | 35 tok/s | OpenBLAS |
| Intel i9-13900K | 30 tok/s | AVX2 |

### GPU Offloading (RTX 4090)

| Layers GPU | Speed | VRAM |
|------------|-------|------|
| 0 (CPU only) | 30 tok/s | 0 GB |
| 20 (hybrid) | 80 tok/s | 8 GB |
| 35 (all) | 120 tok/s | 12 GB |

# GGUF Quantization Guide

Complete guide to GGUF quantization formats and model conversion.

## Quantization Overview

**GGUF** (GPT-Generated Unified Format) - Standard format for llama.cpp models.
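Because every GGUF file starts with a fixed header (the 4-byte magic `GGUF` followed by a 32-bit version number), a quick sanity check is possible before loading or quantizing a download. The sketch below is illustrative only: the helper names `is_gguf` and `gguf_version` are hypothetical, and it assumes the usual little-endian layout:

```shell
#!/usr/bin/env bash
# is_gguf FILE: succeed if FILE begins with the 4-byte magic "GGUF".
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# gguf_version FILE: print the format version stored in bytes 4-7.
gguf_version() {
  # od prints the four bytes as unsigned decimals; combining them by hand
  # (little-endian) keeps the result independent of the host byte order.
  set -- $(od -An -tu1 -j4 -N4 "$1" 2>/dev/null)
  [ $# -eq 4 ] || return 1
  echo $(( $1 + ($2 << 8) + ($3 << 16) + ($4 << 24) ))
}
```

A typical use would be `is_gguf model.gguf && gguf_version model.gguf`, which catches truncated downloads or files that are still in another format.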
### Format Comparison

| Format | Perplexity | Size (7B) | Tokens/sec | Notes |
|--------|------------|-----------|------------|-------|
| FP16 | 5.9565 (baseline) | 13.0 GB | 15 tok/s | Original quality |
| Q8_0 | 5.9584 (+0.03%) | 7.0 GB | 25 tok/s | Nearly lossless |
| **Q6_K** | 5.9642 (+0.13%) | 5.5 GB | 30 tok/s | Best quality/size |
| **Q5_K_M** | 5.9796 (+0.39%) | 4.8 GB | 35 tok/s | Balanced |
| **Q4_K_M** | 6.0565 (+1.68%) | 4.1 GB | 40 tok/s | **Recommended** |
| Q4_K_S | 6.1125 (+2.62%) | 3.9 GB | 42 tok/s | Faster, lower quality |
| Q3_K_M | 6.3184 (+6.07%) | 3.3 GB | 45 tok/s | Small models only |
| Q2_K | 6.8673 (+15.3%) | 2.7 GB | 50 tok/s | Not recommended |

**Recommendation**: Use **Q4_K_M** for best balance of quality and speed.

## Converting Models

### HuggingFace to GGUF

```bash
# 1. Download HuggingFace model
huggingface-cli download meta-llama/Llama-2-7b-chat-hf \
  --local-dir models/llama-2-7b-chat/

# 2. Convert to FP16 GGUF
python convert_hf_to_gguf.py \
  models/llama-2-7b-chat/ \
  --outtype f16 \
  --outfile models/llama-2-7b-chat-f16.gguf

# 3. Quantize to Q4_K_M
./llama-quantize \
  models/llama-2-7b-chat-f16.gguf \
  models/llama-2-7b-chat-Q4_K_M.gguf \
  Q4_K_M
```

### Batch quantization

```bash
# Quantize to multiple formats
for quant in Q4_K_M Q5_K_M Q6_K Q8_0; do
  ./llama-quantize \
    model-f16.gguf \
    "model-${quant}.gguf" \
    "$quant"
done
```

## K-Quantization Methods

**K-quants** use mixed precision for better quality:
- Attention weights: Higher precision
- Feed-forward weights: Lower precision

**Variants**:
- `_S` (Small): Faster, lower quality
- `_M` (Medium): Balanced (recommended)
- `_L` (Large): Better quality, larger size

**Example**: `Q4_K_M`
- `Q4`: 4-bit quantization
- `K`: Mixed precision method
- `M`: Medium quality

## Quality Testing

```bash
# Calculate perplexity (quality metric)
./llama-perplexity \
  -m model.gguf \
  -f wikitext-2-raw/wiki.test.raw \
  -c 512

# Lower perplexity = better quality
# Baseline (FP16): ~5.96
# Q4_K_M: ~6.06 (+1.7%)
# Q2_K: ~6.87 (+15.3% - too much degradation)
```

##