Llama Cpp

Name: Llama Cpp
Author: orchestra-research

orchestra-research/ai-research-skills

434 installs
11.2k repo stars
Updated June 16, 2026

llama-cpp is a performance tuning skill that optimizes local llama.cpp inference by configuring CPU threads, BLAS acceleration, GPU layer offloading, batch size, and context window for faster token throughput on available hardware.

About

llama-cpp is an Orchestra Research agent skill for developers running local GGUF models through llama.cpp who need higher token throughput on CPU or hybrid GPU setups. The skill guides thread tuning with the -t flag using physical core counts, enabling OpenBLAS via LLAMA_OPENBLAS=1 for roughly 2–3× matrix speedups, and GPU layer offloading with -ngl including hybrid mode when VRAM is limited. Developers reach for llama-cpp when inference is CPU-bound, partial GPU offload causes OOM, or batch and context settings waste memory. The workflow systematically benchmarks -t, -ngl, and batch flags rather than guessing defaults on new hardware.

CPU thread tuning (-t) with guidance to favor physical cores over hyperthreading
OpenBLAS build (LLAMA_OPENBLAS=1) for roughly 2–3× matrix speedup
GPU layer offload (-ngl) with OOM backoff workflow and nvidia-smi monitoring
Batch and ubatch flags for throughput; context length (-c) tradeoffs
Benchmark tables for CPU (M3 Max, 7950X, i9-13900K) and GPU offload scenarios

Llama Cpp by the numbers

434 all-time installs (skills.sh)
+33 installs in the week ending Jul 26, 2026 (Skillselion tracking)
Ranked #1,875 of 16,659 AI & Agent Building skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/orchestra-research/ai-research-skills --skill llama-cpp

How do you optimize llama.cpp inference performance?

Tune local llama.cpp inference (CPU threads, BLAS, GPU layers, batch size, and context window) for faster token throughput on the available hardware.

Who is it for?

Developers running local GGUF models via llama-cli who need systematic CPU, BLAS, and GPU offload tuning for better token throughput.

Skip if: Teams serving models exclusively through cloud APIs or developers who already meet latency targets without llama.cpp configuration changes.

When should I use this skill?

Local llama.cpp inference is slow, GPU offload causes OOM, or thread and BLAS settings are untuned on new hardware.

What you get

Tuned llama-cli flags for threads, BLAS, GPU layer count, batch size, and context window matched to hardware.

optimized llama-cli command flags
hardware-matched inference configuration

By the numbers

OpenBLAS delivers 2–3× matrix operation speedup with LLAMA_OPENBLAS=1
GPU layer offloading uses llama-cli -ngl flag with hybrid fallback on OOM

llama.cpp

Pure C/C++ LLM inference with minimal dependencies, optimized for CPUs and non-NVIDIA hardware.

When to use llama.cpp

Use llama.cpp when:

Running on CPU-only machines
Deploying on Apple Silicon (M1/M2/M3/M4)
Using AMD or Intel GPUs (no CUDA)
Edge deployment (Raspberry Pi, embedded systems)
Need simple deployment without Docker/Python

Use TensorRT-LLM instead when:

Have NVIDIA GPUs (A100/H100)
Need maximum throughput (100K+ tok/s)
Running in datacenter with CUDA

Use vLLM instead when:

Have NVIDIA GPUs
Need Python-first API
Want PagedAttention

Quick start

Installation

# macOS/Linux
brew install llama.cpp

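# Or build from source
# (recent releases build with CMake: cmake -B build && cmake --build build --config Release;
#  the make targets below apply to older checkouts)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# With Metal (Apple Silicon)
make LLAMA_METAL=1

# With CUDA (NVIDIA)
make LLAMA_CUDA=1

# With ROCm (AMD)
make LLAMA_HIPBLAS=1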

Download model

# Download from HuggingFace (GGUF format)
huggingface-cli download \
    TheBloke/Llama-2-7B-Chat-GGUF \
    llama-2-7b-chat.Q4_K_M.gguf \
    --local-dir models/

# Or convert from HuggingFace
python convert_hf_to_gguf.py models/llama-2-7b-chat/

Run inference

# Simple chat
./llama-cli \
    -m models/llama-2-7b-chat.Q4_K_M.gguf \
    -p "Explain quantum computing" \
    -n 256  # Max tokens

# Interactive chat
./llama-cli \
    -m models/llama-2-7b-chat.Q4_K_M.gguf \
    --interactive

Server mode

# Start OpenAI-compatible server
./llama-server \
    -m models/llama-2-7b-chat.Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 32  # Offload 32 layers to GPU

# Client request
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b-chat",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'

Quantization formats

GGUF format overview

Format	Bits	Size (7B)	Speed	Quality	Use Case
Q4_K_M	4.5	4.1 GB	Fast	Good	Recommended default
Q4_K_S	4.3	3.9 GB	Faster	Lower	Speed critical
Q5_K_M	5.5	4.8 GB	Medium	Better	Quality critical
Q6_K	6.5	5.5 GB	Slower	Best	Maximum quality
Q8_0	8.0	7.0 GB	Slow	Excellent	Minimal degradation
Q2_K	2.5	2.7 GB	Fastest	Poor	Testing only

Choosing quantization

# General use (balanced)
Q4_K_M  # 4-bit, medium quality

# Maximum speed (more degradation)
Q2_K or Q3_K_M

# Maximum quality (slower)
Q6_K or Q8_0

# Very large models (70B, 405B)
Q3_K_M or Q4_K_S  # Lower bits to fit in memory

Hardware acceleration

Apple Silicon (Metal)

# Build with Metal
make LLAMA_METAL=1

# Run with GPU acceleration (automatic)
./llama-cli -m model.gguf -ngl 999  # Offload all layers

# Performance: M3 Max 40-60 tokens/sec (Llama 2-7B Q4_K_M)

NVIDIA GPUs (CUDA)

# Build with CUDA
make LLAMA_CUDA=1

# Offload layers to GPU
./llama-cli -m model.gguf -ngl 35  # Offload 35/40 layers

# Hybrid CPU+GPU for large models
./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20  # GPU: 20 layers, CPU: rest

AMD GPUs (ROCm)

# Build with ROCm (the Makefile flag is LLAMA_HIPBLAS)
make LLAMA_HIPBLAS=1

# Run with AMD GPU
./llama-cli -m model.gguf -ngl 999

Common patterns

Batch processing

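# Process multiple prompts from a file, one prompt per line
# (--batch-size sets the prompt-processing batch, not the number of prompts)
while IFS= read -r prompt; do
    ./llama-cli -m model.gguf -p "$prompt" --batch-size 512 -n 100 < /dev/null
done < prompts.txt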

Constrained generation

# JSON output with grammar
./llama-cli \
    -m model.gguf \
    -p "Generate a person: " \
    --grammar-file grammars/json.gbnf

# Outputs valid JSON only

Context size

# Increase context (default 512)
./llama-cli \
    -m model.gguf \
    -c 4096  # 4K context window

# Very long context (if model supports)
./llama-cli -m model.gguf -c 32768  # 32K context

Performance benchmarks

CPU performance (Llama 2-7B Q4_K_M)

CPU	Threads	Speed	Cost
Apple M3 Max	16	50 tok/s	$0 (local)
AMD Ryzen 9 7950X	32	35 tok/s	$0.50/hour
Intel i9-13900K	32	30 tok/s	$0.40/hour
AWS c7i.16xlarge	64	40 tok/s	$2.88/hour

GPU acceleration (Llama 2-7B Q4_K_M)

GPU	Speed	vs CPU	Cost
NVIDIA RTX 4090	120 tok/s	3-4×	$0 (local)
NVIDIA A10	80 tok/s	2-3×	$1.00/hour
AMD MI250	70 tok/s	2×	$2.00/hour
Apple M3 Max (Metal)	50 tok/s	~Same	$0 (local)

Supported models

LLaMA family:

Llama 2 (7B, 13B, 70B)
Llama 3 (8B, 70B, 405B)
Code Llama

Mistral family:

Mistral 7B
Mixtral 8x7B, 8x22B

Other:

Falcon, BLOOM, GPT-J
Phi-3, Gemma, Qwen
LLaVA (vision); Whisper (audio) runs under the separate whisper.cpp project, not llama.cpp

Find models: https://huggingface.co/models?library=gguf

References

[Quantization Guide](references/quantization.md) - GGUF formats, conversion, quality comparison
[Server Deployment](references/server.md) - API endpoints, Docker, monitoring
[Optimization](references/optimization.md) - Performance tuning, hybrid CPU+GPU

Resources

GitHub: https://github.com/ggerganov/llama.cpp
Models: https://huggingface.co/models?library=gguf
Discord: https://discord.gg/llama-cpp

Performance Optimization Guide

Maximize llama.cpp inference speed and efficiency.

CPU Optimization

Thread tuning

# Set threads (default: physical cores)
./llama-cli -m model.gguf -t 8

# For AMD Ryzen 9 7950X (16 cores, 32 threads)
-t 16  # Best: physical cores

# Avoid hyperthreading (slower for matrix ops)
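
The physical-core guidance above can be turned into a small helper. This is a minimal sketch, not part of llama.cpp: `physical_cores` is a hypothetical function name, and it assumes `lscpu` (Linux) or `sysctl` (macOS) is available, falling back to the logical CPU count otherwise.

```shell
# physical_cores: print the number of physical CPU cores, falling back to the
# logical CPU count when no physical-core query is available.
physical_cores() {
    n=""
    if command -v lscpu >/dev/null 2>&1; then
        # Linux: count unique (core, socket) pairs; '#' lines are headers
        n=$(lscpu -p=Core,Socket 2>/dev/null | grep -v '^#' | sort -u | wc -l)
    elif [ "$(uname -s)" = "Darwin" ]; then
        # macOS: the kernel reports physical cores directly
        n=$(sysctl -n hw.physicalcpu 2>/dev/null)
    fi
    # Strip any padding that wc may add
    n=$(echo "$n" | tr -d ' ')
    # Fallback (logical CPUs) when the value is missing or zero
    case "$n" in
        ''|0) n=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1) ;;
    esac
    echo "$n"
}

# Example (not run here): ./llama-cli -m model.gguf -t "$(physical_cores)"
```

On a Ryzen 9 7950X this would be expected to report 16, matching the `-t 16` recommendation; treat the result as a starting point and confirm it with a benchmark.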

BLAS acceleration

# OpenBLAS (faster matrix ops)
make LLAMA_OPENBLAS=1

# BLAS gives a 2-3× speedup on matrix operations (mainly prompt processing)

GPU Offloading

Layer offloading

# Offload 35 layers to GPU (hybrid mode)
./llama-cli -m model.gguf -ngl 35

# Offload all layers
./llama-cli -m model.gguf -ngl 999

# Find optimal value:
# Start with -ngl 999
# If OOM, reduce by 5 until fits
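
The back-off procedure above can be sketched as a loop. This is an illustration under assumptions, not a llama.cpp feature: `find_ngl` and `try_ngl` are hypothetical names, the probe assumes `llama-cli` exits non-zero when the layers do not fit in VRAM, and the search starts from the model's layer count (any larger value, such as 999, offloads the same layers).

```shell
# find_ngl START STEP CMD [ARGS...]
# Runs CMD ARGS... NGL for NGL = START, START-STEP, ... and prints the first
# NGL for which CMD exits 0. Prints 0 (CPU only) if no positive value succeeds.
find_ngl() {
    ngl=$1
    step=$2
    shift 2
    while [ "$ngl" -gt 0 ]; do
        if "$@" "$ngl" >/dev/null 2>&1; then
            echo "$ngl"
            return 0
        fi
        ngl=$((ngl - step))
    done
    echo 0
}

# Hypothetical probe: a short generation that fails if the layers do not fit.
try_ngl() {
    ./llama-cli -m model.gguf -p "test" -n 16 -ngl "$1" </dev/null
}

# Example (not run here): find_ngl 40 5 try_ngl
```

Passing the probe as a command keeps the search logic separate from the llama.cpp invocation, so the same loop works with `llama-server` or a different model path.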

Memory usage

# Check VRAM usage
nvidia-smi dmon

# Reduce context if needed
./llama-cli -m model.gguf -c 2048  # 2K context instead of 4K

Batch Processing

# Logical batch size for prompt processing (recent builds default to 2048)
./llama-cli -m model.gguf -b 512

# Physical batch size (-ub / --ubatch-size): tokens processed per compute step
./llama-cli -m model.gguf -b 512 -ub 128

Context Management

# Default context (512 tokens)
-c 512

# Longer context (slower, more memory)
-c 4096

# Very long context (if model supports)
-c 32768
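
As a rough illustration of why longer contexts need more memory, the KV cache grows linearly with `-c`. The sketch below assumes the shape of Llama 2-7B (32 layers, 4096-dimensional embeddings, no grouped-query attention) and an f16 cache. `kv_cache_mib` is a hypothetical helper, and the result excludes model weights and compute buffers.

```shell
# kv_cache_mib LAYERS EMBD CTX [BYTES_PER_VALUE]
# Approximate KV-cache size in MiB: keys and values (factor 2) for every layer,
# every context position, and every embedding dimension.
kv_cache_mib() {
    layers=$1
    embd=$2
    ctx=$3
    bytes=${4:-2}   # assumption: 2 bytes per value (f16 cache)
    echo $(( 2 * layers * embd * ctx * bytes / 1048576 ))
}

kv_cache_mib 32 4096 512     # 256
kv_cache_mib 32 4096 4096    # 2048
kv_cache_mib 32 4096 32768   # 16384
```

For models that use grouped-query attention, pass the key/value dimension instead of the full embedding size, since the cache is correspondingly smaller.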

Benchmarks

CPU Performance (Llama 2-7B Q4_K_M)

Setup	Speed	Notes
Apple M3 Max	50 tok/s	Metal acceleration
AMD 7950X (16c)	35 tok/s	OpenBLAS
Intel i9-13900K	30 tok/s	AVX2

GPU Offloading (RTX 4090)

Layers GPU	Speed	VRAM
0 (CPU only)	30 tok/s	0 GB
20 (hybrid)	80 tok/s	8 GB
35 (all)	120 tok/s	12 GB
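
Figures like those in the tables above depend heavily on hardware, so it can help to measure them locally. A minimal sketch, assuming the `llama-bench` tool that builds alongside `llama-cli` and its comma-separated value lists for `-t` and `-ngl`. `bench_sweep` and the `DRY_RUN` variable are hypothetical names introduced here.

```shell
# bench_sweep MODEL THREADS NGL
# Builds a llama-bench command that sweeps the given comma-separated thread
# counts and GPU layer counts. With DRY_RUN=1 the command is printed, not run.
bench_sweep() {
    set -- ./llama-bench -m "$1" -t "$2" -ngl "$3"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

DRY_RUN=1 bench_sweep model.gguf 8,16 0,20,35
# prints: ./llama-bench -m model.gguf -t 8,16 -ngl 0,20,35
```

Without `DRY_RUN=1` the function runs the sweep, which should report one result row per combination of thread count and layer count.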

GGUF Quantization Guide

Complete guide to GGUF quantization formats and model conversion.

Quantization Overview

GGUF (GPT-Generated Unified Format) - Standard format for llama.cpp models.

Format Comparison

Format	Perplexity	Size (7B)	Tokens/sec	Notes
FP16	5.9565 (baseline)	13.0 GB	15 tok/s	Original quality
Q8_0	5.9584 (+0.03%)	7.0 GB	25 tok/s	Nearly lossless
Q6_K	5.9642 (+0.13%)	5.5 GB	30 tok/s	Best quality/size
Q5_K_M	5.9796 (+0.39%)	4.8 GB	35 tok/s	Balanced
Q4_K_M	6.0565 (+1.68%)	4.1 GB	40 tok/s	Recommended
Q4_K_S	6.1125 (+2.62%)	3.9 GB	42 tok/s	Faster, lower quality
Q3_K_M	6.3184 (+6.07%)	3.3 GB	45 tok/s	Small models only
Q2_K	6.8673 (+15.3%)	2.7 GB	50 tok/s	Not recommended

Recommendation: Use Q4_K_M for best balance of quality and speed.

Converting Models

HuggingFace to GGUF

# 1. Download HuggingFace model
huggingface-cli download meta-llama/Llama-2-7b-chat-hf \
    --local-dir models/llama-2-7b-chat/

# 2. Convert to FP16 GGUF
python convert_hf_to_gguf.py \
    models/llama-2-7b-chat/ \
    --outtype f16 \
    --outfile models/llama-2-7b-chat-f16.gguf

# 3. Quantize to Q4_K_M
./llama-quantize \
    models/llama-2-7b-chat-f16.gguf \
    models/llama-2-7b-chat-Q4_K_M.gguf \
    Q4_K_M

Batch quantization

# Quantize to multiple formats
for quant in Q4_K_M Q5_K_M Q6_K Q8_0; do
    ./llama-quantize \
        model-f16.gguf \
        model-${quant}.gguf \
        $quant
done

K-Quantization Methods

K-quants use mixed precision for better quality:

Attention weights: Higher precision
Feed-forward weights: Lower precision

Variants:

_S (Small): Faster, lower quality
_M (Medium): Balanced (recommended)
_L (Large): Better quality, larger size

Example: Q4_K_M

Q4: 4-bit quantization
K: Mixed precision method
M: Medium quality

Quality Testing

# Calculate perplexity (quality metric)
./llama-perplexity \
    -m model.gguf \
    -f wikitext-2-raw/wiki.test.raw \
    -c 512

# Lower perplexity = better quality
# Baseline (FP16): ~5.96
# Q4_K_M: ~6.06 (+1.7%)
# Q2_K: ~6.87 (+15.3% - too much degradation)
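
The percentages quoted above are relative increases over the FP16 baseline. A small sketch of that arithmetic, using the perplexity values from the format comparison table; `ppl_delta` is a hypothetical helper name.

```shell
# ppl_delta BASELINE QUANTIZED
# Prints the relative perplexity change as a signed percentage.
ppl_delta() {
    awk -v base="$1" -v quant="$2" \
        'BEGIN { printf "%+.2f%%\n", (quant - base) / base * 100 }'
}

ppl_delta 5.9565 6.0565   # FP16 vs Q4_K_M: +1.68%
ppl_delta 5.9565 6.8673   # FP16 vs Q2_K:   +15.29%
```

Feeding in the final estimate printed by `llama-perplexity` for each quantized file gives a like-for-like comparison, provided the same test file and context size are used for every run.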

Use Case Guide

General purpose (chatbots, assistants)

Q4_K_M - Best balance
Q5_K_M - If you have extra RAM

Code generation

Q5_K_M or Q6_K - Higher precision helps with code

Creative writing

Q4_K_M - Sufficient quality
Q3_K_M - Acceptable for draft generation

Technical/medical

Q6_K or Q8_0 - Maximum accuracy

Edge devices (Raspberry Pi)

Q2_K or Q3_K_S - Fit in limited RAM

Model Size Scaling

7B parameter models

Format	Size	RAM needed
Q2_K	2.7 GB	5 GB
Q3_K_M	3.3 GB	6 GB
Q4_K_M	4.1 GB	7 GB
Q5_K_M	4.8 GB	8 GB
Q6_K	5.5 GB	9 GB
Q8_0	7.0 GB	11 GB
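
The table above can be read as a lookup: pick the highest-quality format whose RAM requirement fits the machine. A minimal sketch of that lookup with the 7B figures copied from the table; `pick_quant_7b` is a hypothetical helper, and the RAM figures are the table's estimates rather than measured values.

```shell
# pick_quant_7b RAM_GB
# Prints the largest 7B format from the table whose "RAM needed" value is
# within RAM_GB, or "none" if even Q2_K does not fit.
pick_quant_7b() {
    awk -v ram="$1" '
        ($2 + 0) <= (ram + 0) { best = $1 }   # rows are ordered small to large
        END { print (best == "" ? "none" : best) }
    ' <<'EOF'
Q2_K 5
Q3_K_M 6
Q4_K_M 7
Q5_K_M 8
Q6_K 9
Q8_0 11
EOF
}

pick_quant_7b 8    # Q5_K_M
pick_quant_7b 4    # none
```

The same pattern applies to the 13B and 70B tables by swapping in their rows.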

13B parameter models

Format	Size	RAM needed
Q2_K	5.1 GB	8 GB
Q3_K_M	6.2 GB	10 GB
Q4_K_M	7.9 GB	12 GB
Q5_K_M	9.2 GB	14 GB
Q6_K	10.7 GB	16 GB

70B parameter models

Format	Size	RAM needed
Q2_K	26 GB	32 GB
Q3_K_M	32 GB	40 GB
Q4_K_M	41 GB	48 GB
Q4_K_S	39 GB	46 GB
Q5_K_M	48 GB	56 GB

Recommendation for 70B: Use Q3_K_M or Q4_K_S to fit in consumer hardware.

Finding Pre-Quantized Models

TheBloke on HuggingFace:

https://huggingface.co/TheBloke
Most models available in all GGUF formats
No conversion needed

Example:

# Download pre-quantized Llama 2-7B
huggingface-cli download \
    TheBloke/Llama-2-7B-Chat-GGUF \
    llama-2-7b-chat.Q4_K_M.gguf \
    --local-dir models/

Importance Matrices (imatrix)

What: Calibration data to improve quantization quality.

Benefits:

10-20% perplexity improvement with Q4
Essential for Q3 and below

Usage:

# 1. Generate importance matrix
./llama-imatrix \
    -m model-f16.gguf \
    -f calibration-data.txt \
    -o model.imatrix

# 2. Quantize with imatrix
./llama-quantize \
    --imatrix model.imatrix \
    model-f16.gguf \
    model-Q4_K_M.gguf \
    Q4_K_M

Calibration data:

Use domain-specific text (e.g., code for code models)
~100MB of representative text
Higher quality data = better quantization

Troubleshooting

Model outputs gibberish:

Quantization too aggressive (Q2_K)
Try Q4_K_M or Q5_K_M
Verify model converted correctly

Out of memory:

Use lower quantization (Q4_K_S instead of Q5_K_M)
Offload fewer layers to GPU (-ngl)
Use smaller context (-c 2048)

Slow inference:

Higher-bit quantization formats run slower
Q8_0 is much slower than Q4_K_M (25 vs 40 tok/s in the format comparison table)
Weigh speed against quality when choosing a format

Server Deployment Guide

Production deployment of llama.cpp server with OpenAI-compatible API.

Server Modes

llama-server

# Basic server
./llama-server \
    -m models/llama-2-7b-chat.Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -c 4096  # Context size

# With GPU acceleration
./llama-server \
    -m models/llama-2-70b.Q4_K_M.gguf \
    -ngl 40  # Offload 40 layers to GPU

OpenAI-Compatible API

Chat completions

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2",
    "messages": [
      {"role": "system", "content": "You are helpful"},
      {"role": "user", "content": "Hello"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'

Streaming

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2",
    "messages": [{"role": "user", "content": "Count to 10"}],
    "stream": true
  }'

Docker Deployment

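Dockerfile:

# CUDA builds need the CUDA toolkit, so start from a CUDA development image
# (the tag below is an example; pick one that matches your driver)
FROM nvidia/cuda:12.2.0-devel-ubuntu22.04
RUN apt-get update && apt-get install -y git build-essential
RUN git clone https://github.com/ggerganov/llama.cpp
WORKDIR /llama.cpp
RUN make LLAMA_CUDA=1
COPY models/ /models/
EXPOSE 8080
CMD ["./llama-server", "-m", "/models/model.gguf", "--host", "0.0.0.0", "--port", "8080"]

Build and run:

docker build -t llama-cpp:latest .
docker run --gpus all -p 8080:8080 llama-cpp:latest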

Monitoring

# Server metrics endpoint (Prometheus format; start llama-server with --metrics to enable it)
curl http://localhost:8080/metrics

# Health check
curl http://localhost:8080/health

Metrics:

requests_total
tokens_generated
prompt_tokens
completion_tokens
kv_cache_tokens

Load Balancing

NGINX:

upstream llama_cpp {
    server llama1:8080;
    server llama2:8080;
}

server {
    location / {
        proxy_pass http://llama_cpp;
        proxy_read_timeout 300s;
    }
}

Performance Tuning

Parallel requests:

./llama-server \
    -m model.gguf \
    -np 4  # 4 parallel slots

Continuous batching:

./llama-server \
    -m model.gguf \
    --cont-batching  # Enable continuous batching

Context caching:

./llama-server \
    -m model.gguf \
    --cache-prompt  # Cache processed prompts

How it compares

Use this skill for hands-on llama-cli flag tuning on local hardware rather than cloud-hosted inference autoscaling.

FAQ

How much speedup does OpenBLAS give llama.cpp?

The llama-cpp skill documents OpenBLAS acceleration via LLAMA_OPENBLAS=1 delivering roughly 2–3× faster matrix operations on CPU. Combine BLAS with physical-core thread tuning using the -t flag for best throughput.

How should developers set -ngl for GPU offloading?

The llama-cpp skill recommends starting llama-cli with -ngl 999 for full offload, then reducing by 5 layers on OOM until stable. Hybrid CPU-GPU mode balances VRAM limits with acceptable token speed.

Is Llama Cpp safe to install?

skills.sh reports 1 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
