
Outlines
Configure Outlines backends—Transformers, vLLM, llama.cpp, or OpenAI—for reliable JSON and schema-bound LLM outputs in agent pipelines.
Overview
Outlines is an agent skill that documents how to configure Outlines model backends for constrained LLM generation. It is most often used in the Build phase and also fits Ship (perf) and Operate (infra).
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill outlines
What is this skill?
- Backend guide covers Transformers, llama.cpp, vLLM, and OpenAI API models
- GPU, CPU, CUDA device index, and Apple MPS configuration patterns
- Advanced model_kwargs for float16, 8-bit load, and device_map
- outlines.generate.json workflow with typed Pydantic-style model targets
- Performance comparison and production deployment section framing
- Backend guide sections: local models, API models, performance comparison, configuration examples, production deployment
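The device patterns in the list above (CUDA, a specific CUDA index, CPU, Apple MPS) come down to choosing one device string. A minimal sketch, assuming a hypothetical helper `pick_device` that is not part of Outlines; the availability flags are passed in as plain booleans so the snippet has no dependencies, and the preference order (CUDA, then MPS, then CPU) is an assumption rather than something the skill prescribes:

```python
from typing import Optional


def pick_device(cuda_available: bool, mps_available: bool,
                gpu_index: Optional[int] = None) -> str:
    """Return a device string such as "cuda", "cuda:0", "mps", or "cpu".

    Hypothetical helper, not an Outlines API. The preference order
    (CUDA, then MPS, then CPU) is an assumption.
    """
    if cuda_available:
        # "cuda" leaves GPU choice to the runtime; "cuda:N" pins GPU N.
        return "cuda" if gpu_index is None else f"cuda:{gpu_index}"
    if mps_available:
        return "mps"
    return "cpu"
```

In practice the flags would likely come from the runtime (for example `torch.cuda.is_available()`), and the returned string would be passed as the `device` argument shown in the SKILL.md examples.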
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
Your agent needs predictable JSON or schema-shaped model output but you are unsure which Outlines backend and device settings fit local GPU, API, or production constraints.
Who is it for?
Builders shipping agents or LLM microservices who want one configuration playbook across Transformers, vLLM, llama.cpp, and OpenAI.
Skip if: you only need prompt-only chat with no structured output requirements, or your team does not use the Outlines library.
When should I use this skill?
Use it when you are configuring Outlines backends, structured JSON generation, or production inference for Transformers, vLLM, llama.cpp, or OpenAI.
What do I get? / Deliverables
You select a supported Outlines backend, apply device and quantization settings, and wire generate.json-style flows so agents consume validated structured results.
- Backend-specific Outlines model initialization config
- Working generate.json or equivalent structured generator setup
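To illustrate what "validated structured results" can mean for the consuming agent, here is a minimal sketch that checks a JSON string against a typed target before it is used. It swaps in a standard-library dataclass for the Pydantic-style model the skill describes, so it runs without extra packages; the `Ticket` schema, its fields, and `parse_ticket` are hypothetical names chosen for illustration:

```python
import json
from dataclasses import dataclass, fields


@dataclass
class Ticket:
    """Hypothetical target schema (stand-in for a Pydantic model)."""
    title: str
    priority: int


def parse_ticket(raw: str) -> Ticket:
    """Parse a JSON string and check keys and types against Ticket."""
    data = json.loads(raw)  # raises a ValueError subclass on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    expected = {f.name: f.type for f in fields(Ticket)}
    if set(data) != set(expected):
        # Report keys that are missing or not part of the schema.
        raise ValueError(f"key mismatch: {sorted(set(data) ^ set(expected))}")
    for name, typ in expected.items():
        if not isinstance(data[name], typ):
            raise ValueError(f"{name} should be {typ.__name__}")
    return Ticket(**data)
```

With constrained generation the check should rarely fail, but keeping it at the boundary means a misconfigured backend surfaces as an explicit error instead of a malformed tool call.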
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Outlines backend setup is first needed when you wire structured generation into agents and APIs during product build. Agent-tooling is the canonical shelf because Outlines governs how models emit constrained text and JSON for downstream agent steps.
Where it fits
Pick Transformers on CUDA with float16 before wiring outlines.generate.json into your agent tool loop.
Swap a local llama.cpp backend for OpenAI when moving from prototyping to staging API-based structured extraction.
Apply 8-bit loading and device_map settings to cut VRAM before load testing structured generation endpoints.
Document production deployment choices across vLLM and API models for on-call inference changes.
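The VRAM effect of float16, 8-bit, and 4-bit loading can be approximated with simple arithmetic: parameter count times bytes per parameter. A rough sketch, assuming weights only (activations, KV cache, and quantization overhead are ignored, so real usage will be higher); `estimate_weight_gb` and the dtype labels are hypothetical and not part of Outlines:

```python
# Approximate storage cost per parameter for each precision.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}


def estimate_weight_gb(num_params: float, dtype: str) -> float:
    """Estimate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in ("float16", "int8", "int4"):
        # An 8-billion-parameter model is used purely as an example size.
        print(dtype, estimate_weight_gb(8e9, dtype), "GB")
```

Under these assumptions an 8-billion-parameter model needs roughly 16 GB of weights in float16, 8 GB in 8-bit, and 4 GB in 4-bit, which gives a first check on whether a model fits a given GPU before load testing.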
How it compares
Skill-backed backend matrix for Outlines—not a replacement for raw prompt engineering or unrelated JSON repair post-processing.
Common Questions / FAQ
Who is outlines for?
Solo and indie developers building agents, research tools, or APIs that depend on Outlines for schema-safe generation.
When should I use outlines?
In Build agent-tooling while choosing inference stacks; during Ship perf when tuning GPU dtype and quantization; during Operate infra when stabilizing production model endpoints.
Is outlines safe to install?
Review the Security Audits panel on this Prism page; configuration guides may suggest API keys and GPU runtimes you should scope in your own environment policies.
SKILL.md
# Backend Configuration Guide

Complete guide to configuring Outlines with different model backends.

## Table of Contents

- Local Models (Transformers, llama.cpp, vLLM)
- API Models (OpenAI)
- Performance Comparison
- Configuration Examples
- Production Deployment

## Transformers (Hugging Face)

### Basic Setup

```python
import outlines

# Load model from Hugging Face
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")

# Use with generator
generator = outlines.generate.json(model, YourModel)
result = generator("Your prompt")
```

### GPU Configuration

```python
# Use CUDA GPU
model = outlines.models.transformers(
    "microsoft/Phi-3-mini-4k-instruct",
    device="cuda"
)

# Use specific GPU
model = outlines.models.transformers(
    "microsoft/Phi-3-mini-4k-instruct",
    device="cuda:0"  # GPU 0
)

# Use CPU
model = outlines.models.transformers(
    "microsoft/Phi-3-mini-4k-instruct",
    device="cpu"
)

# Use Apple Silicon MPS
model = outlines.models.transformers(
    "microsoft/Phi-3-mini-4k-instruct",
    device="mps"
)
```

### Advanced Configuration

```python
# FP16 for faster inference
model = outlines.models.transformers(
    "microsoft/Phi-3-mini-4k-instruct",
    device="cuda",
    model_kwargs={
        "torch_dtype": "float16"
    }
)

# 8-bit quantization (less memory)
model = outlines.models.transformers(
    "microsoft/Phi-3-mini-4k-instruct",
    device="cuda",
    model_kwargs={
        "load_in_8bit": True,
        "device_map": "auto"
    }
)

# 4-bit quantization (even less memory)
model = outlines.models.transformers(
    "meta-llama/Llama-3.1-70B-Instruct",
    device="cuda",
    model_kwargs={
        "load_in_4bit": True,
        "device_map": "auto",
        "bnb_4bit_compute_dtype": "float16"
    }
)

# Multi-GPU
model = outlines.models.transformers(
    "meta-llama/Llama-3.1-70B-Instruct",
    device="cuda",
    model_kwargs={
        "device_map": "auto",  # Automatic GPU distribution
        "max_memory": {0: "40GB", 1: "40GB"}  # Per-GPU limits
    }
)
```

### Popular Models

```python
# Phi-4 (Microsoft)
model = outlines.models.transformers("microsoft/Phi-4-mini-instruct")
model = outlines.models.transformers("microsoft/Phi-3-medium-4k-instruct")

# Llama 3.1 (Meta)
model = outlines.models.transformers("meta-llama/Llama-3.1-8B-Instruct")
model = outlines.models.transformers("meta-llama/Llama-3.1-70B-Instruct")
model = outlines.models.transformers("meta-llama/Llama-3.1-405B-Instruct")

# Mistral (Mistral AI)
model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.3")
model = outlines.models.transformers("mistralai/Mixtral-8x7B-Instruct-v0.1")
model = outlines.models.transformers("mistralai/Mixtral-8x22B-Instruct-v0.1")

# Qwen (Alibaba)
model = outlines.models.transformers("Qwen/Qwen2.5-7B-Instruct")
model = outlines.models.transformers("Qwen/Qwen2.5-14B-Instruct")
model = outlines.models.transformers("Qwen/Qwen2.5-72B-Instruct")

# Gemma (Google)
model = outlines.models.transformers("google/gemma-2-9b-it")
model = outlines.models.transformers("google/gemma-2-27b-it")

# Llava (Vision)
model = outlines.models.transformers("llava-hf/llava-v1.6-mistral-7b-hf")
```

### Custom Model Loading

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import outlines

# Load model manually
tokenizer = AutoTokenizer.from_pretrained("your-model")
model_hf = AutoModelForCausalLM.from_pretrained(
    "your-model",
    device_map="auto",
    torch_dtype="float16"
)

# Use with Outlines
model = outlines.models.transformers(
    model=model_hf,
    tokenizer=tokenizer
)
```

## llama.cpp

### Basic Setup

```python
import outlines

# Load GGUF model
model = outlines.models.llamacpp(
    "./models/llama-3.1-8b-instruct.Q4_K_M.gguf",
    n_ctx=4096  # Context window
)

# Use with generator
generator = outlines.generate.json(model, YourModel)
```

### GPU Configuration

```python
# CPU only
model = outlines.models.llamacpp(
    "./models/model.gguf",
    n_c