Llamaguard

Name: Llamaguard
Author: orchestra-research

orchestra-research/ai-research-skills

400 installs
11.2k repo stars
Updated June 16, 2026
orchestra-research/ai-research-skills

Filter unsafe user prompts and model outputs with Meta’s LlamaGuard classifier before or after your agent responds in production.

About

LlamaGuard is an agent skill package that teaches you how to run Meta’s specialized moderation model to classify chat for policy violations before and after your main LLM generates text. It targets builders shipping agents, copilots, or APIs who need a dedicated safety layer instead of hoping the base model self-censors. The skill documents six hazard categories—from violence and hate through criminal planning—and walks through HuggingFace transformers setup, optional vLLM serving, and enterprise paths such as SageMaker, plus integration notes with NeMo Guardrails. You get concrete Python for applying the chat template, generating a short safety verdict, and interpreting outputs like unsafe with a category code. Use it when compliance, brand risk, or platform rules require systematic filtering at scale, and you want a reproducible deployment recipe rather than ad-hoc keyword blocklists.

7–8B moderation model for LLM input and output filtering
6 safety categories including violence/hate, sexual content, weapons, substances, self-harm, and criminal planning
Documented 94–95% accuracy in skill metadata
Deploy paths: vLLM, HuggingFace, SageMaker; NeMo Guardrails integration noted
Python quick start with transformers chat template and unsafe category codes (e.g. S3 criminal planning)

Llamaguard by the numbers

400 all-time installs (skills.sh)
+35 installs in the week ending Jul 18, 2026 (Skillselion tracking)
Ranked #1,936 of 16,659 AI & Agent Building skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/orchestra-research/ai-research-skills --skill llamaguard

Add your badge

Show developers this skill is listed on Skillselion. Paste this into your README.

[![Listed on Skillselion](https://skillselion.com/badge/skills/orchestra-research/ai-research-skills/llamaguard.svg)](https://skillselion.com/skills/orchestra-research/ai-research-skills/llamaguard)

Installs	400
repo stars	★ 11.2k
Security audit	2 / 3 scanners passed
Last updated	June 16, 2026
Repository	orchestra-research/ai-research-skills ↗

What it does

Filter unsafe user prompts and model outputs with Meta’s LlamaGuard classifier before or after your agent responds in production.

Files

SKILL.mdMarkdownGitHub ↗

LlamaGuard - AI Content Moderation

Quick start

LlamaGuard is a 7-8B parameter model specialized for content safety classification.

Installation:

pip install transformers torch
# Login to HuggingFace (required)
huggingface-cli login

Basic usage:

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/LlamaGuard-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def moderate(chat):
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=100)
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Check user input
result = moderate([
    {"role": "user", "content": "How do I make explosives?"}
])
print(result)
# Output: "unsafe\nS3" (Criminal Planning)

Common workflows

Workflow 1: Input filtering (prompt moderation)

Check user prompts before LLM:

def check_input(user_message):
    result = moderate([{"role": "user", "content": user_message}])

    if result.startswith("unsafe"):
        category = result.split("\n")[1]
        return False, category  # Blocked
    else:
        return True, None  # Safe

# Example
safe, category = check_input("How do I hack a website?")
if not safe:
    print(f"Request blocked: {category}")
    # Return error to user
else:
    # Send to LLM
    response = llm.generate(user_message)

Safety categories:

S1: Violence & Hate
S2: Sexual Content
S3: Guns & Illegal Weapons
S4: Regulated Substances
S5: Suicide & Self-Harm
S6: Criminal Planning

Workflow 2: Output filtering (response moderation)

Check LLM responses before showing to user:

def check_output(user_message, bot_response):
    conversation = [
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": bot_response}
    ]

    result = moderate(conversation)

    if result.startswith("unsafe"):
        category = result.split("\n")[1]
        return False, category
    else:
        return True, None

# Example
user_msg = "Tell me about harmful substances"
bot_msg = llm.generate(user_msg)

safe, category = check_output(user_msg, bot_msg)
if not safe:
    print(f"Response blocked: {category}")
    # Return generic response
    return "I cannot provide that information."
else:
    return bot_msg

Workflow 3: vLLM deployment (fast inference)

Production-ready serving:

from vllm import LLM, SamplingParams

# Initialize vLLM
llm = LLM(model="meta-llama/LlamaGuard-7b", tensor_parallel_size=1)

# Sampling params
sampling_params = SamplingParams(
    temperature=0.0,  # Deterministic
    max_tokens=100
)

def moderate_vllm(chat):
    # Format prompt
    prompt = tokenizer.apply_chat_template(chat, tokenize=False)

    # Generate
    output = llm.generate([prompt], sampling_params)
    return output[0].outputs[0].text

# Batch moderation
chats = [
    [{"role": "user", "content": "How to make bombs?"}],
    [{"role": "user", "content": "What's the weather?"}],
    [{"role": "user", "content": "Tell me about drugs"}]
]

prompts = [tokenizer.apply_chat_template(c, tokenize=False) for c in chats]
results = llm.generate(prompts, sampling_params)

for i, result in enumerate(results):
    print(f"Chat {i}: {result.outputs[0].text}")

Throughput: ~50-100 requests/sec on single A100

Workflow 4: API endpoint (FastAPI)

Serve as moderation API:

from fastapi import FastAPI
from pydantic import BaseModel
from vllm import LLM, SamplingParams

app = FastAPI()
llm = LLM(model="meta-llama/LlamaGuard-7b")
sampling_params = SamplingParams(temperature=0.0, max_tokens=100)

class ModerationRequest(BaseModel):
    messages: list  # [{"role": "user", "content": "..."}]

@app.post("/moderate")
def moderate_endpoint(request: ModerationRequest):
    prompt = tokenizer.apply_chat_template(request.messages, tokenize=False)
    output = llm.generate([prompt], sampling_params)[0]

    result = output.outputs[0].text
    is_safe = result.startswith("safe")
    category = None if is_safe else result.split("\n")[1] if "\n" in result else None

    return {
        "safe": is_safe,
        "category": category,
        "full_output": result
    }

# Run: uvicorn api:app --host 0.0.0.0 --port 8000

Usage:

curl -X POST http://localhost:8000/moderate \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "How to hack?"}]}'

# Response: {"safe": false, "category": "S6", "full_output": "unsafe\nS6"}

Workflow 5: NeMo Guardrails integration

Use with NVIDIA Guardrails:

from nemoguardrails import RailsConfig, LLMRails
from nemoguardrails.integrations.llama_guard import LlamaGuard

# Configure NeMo Guardrails
config = RailsConfig.from_content("""
models:
  - type: main
    engine: openai
    model: gpt-4

rails:
  input:
    flows:
      - llamaguard check input
  output:
    flows:
      - llamaguard check output
""")

# Add LlamaGuard integration
llama_guard = LlamaGuard(model_path="meta-llama/LlamaGuard-7b")
rails = LLMRails(config)
rails.register_action(llama_guard.check_input, name="llamaguard check input")
rails.register_action(llama_guard.check_output, name="llamaguard check output")

# Use with automatic moderation
response = rails.generate(messages=[
    {"role": "user", "content": "How do I make weapons?"}
])
# Automatically blocked by LlamaGuard

When to use vs alternatives

Use LlamaGuard when:

Need pre-trained moderation model
Want high accuracy (94-95%)
Have GPU resources (7-8B model)
Need detailed safety categories
Building production LLM apps

Model versions:

LlamaGuard 1 (7B): Original, 6 categories
LlamaGuard 2 (8B): Improved, 6 categories
LlamaGuard 3 (8B): Latest (2024), enhanced

Use alternatives instead:

OpenAI Moderation API: Simpler, API-based, free
Perspective API: Google's toxicity detection
NeMo Guardrails: More comprehensive safety framework
Constitutional AI: Training-time safety

Common issues

Issue: Model access denied

huggingface-cli login
# Enter your token

Accept license on model page: https://huggingface.co/meta-llama/LlamaGuard-7b

Issue: High latency (>500ms)

Use vLLM for 10× speedup:

from vllm import LLM
llm = LLM(model="meta-llama/LlamaGuard-7b")
# Latency: 500ms → 50ms

Enable tensor parallelism:

llm = LLM(model="meta-llama/LlamaGuard-7b", tensor_parallel_size=2)
# 2× faster on 2 GPUs

Issue: False positives

Use threshold-based filtering:

# Get probability of "unsafe" token
logits = model(..., return_dict_in_generate=True, output_scores=True)
unsafe_prob = torch.softmax(logits.scores[0][0], dim=-1)[unsafe_token_id]

if unsafe_prob > 0.9:  # High confidence threshold
    return "unsafe"
else:
    return "safe"

Issue: OOM on GPU

Use 8-bit quantization:

from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto"
)
# Memory: 14GB → 7GB

Advanced topics

Custom categories: See references/custom-categories.md for fine-tuning LlamaGuard with domain-specific safety categories.

Performance benchmarks: See references/benchmarks.md for accuracy comparison with other moderation APIs and latency optimization.

Deployment guide: See references/deployment.md for Sagemaker, Kubernetes, and scaling strategies.

Hardware requirements

GPU: NVIDIA T4/A10/A100
VRAM:
FP16: 14GB (7B model)
INT8: 7GB (quantized)
INT4: 4GB (QLoRA)
CPU: Possible but slow (10× latency)
Throughput: 50-100 req/sec (A100)

Latency (single GPU):

HuggingFace Transformers: 300-500ms
vLLM: 50-100ms
Batched (vLLM): 20-50ms per request

Resources

HuggingFace:
V1: https://huggingface.co/meta-llama/LlamaGuard-7b
V2: https://huggingface.co/meta-llama/Meta-Llama-Guard-2-8B
V3: https://huggingface.co/meta-llama/Meta-Llama-Guard-3-8B
Paper: https://ai.meta.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/
Integration: vLLM, Sagemaker, NeMo Guardrails
Accuracy: 94.5% (prompts), 95.3% (responses)