
Llamaguard
Filter unsafe user prompts and model outputs with Meta’s LlamaGuard classifier before or after your agent responds in production.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill llamaguard
What is this skill?
- 7–8B moderation model for LLM input and output filtering
- 6 safety categories including violence/hate, sexual content, weapons, substances, self-harm, and criminal planning
- Documented 94–95% accuracy in skill metadata
- Deploy paths: vLLM, HuggingFace, SageMaker; NeMo Guardrails integration noted
- Python quick start with transformers chat template and unsafe category codes (e.g. S3 criminal planning)
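
The verdict format mentioned in the last bullet can be handled with a small parser. This is a minimal sketch, assuming the classifier replies with `safe` or `unsafe` on the first line, optionally followed by a line of comma-separated category codes. `Verdict` and `parse_verdict` are hypothetical names used for illustration; they are not part of the skill or of transformers.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Verdict:
    """Structured form of a moderation reply (hypothetical helper type)."""
    safe: bool
    categories: List[str] = field(default_factory=list)


def parse_verdict(raw: str) -> Verdict:
    """Parse a reply such as 'safe' or 'unsafe\\nS3' into a Verdict.

    Assumption: the first non-empty line is the verdict and the second,
    if present, lists category codes separated by commas.
    """
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    # Fail closed: an empty or unrecognized reply is treated as unsafe.
    if not lines or lines[0].lower() != "safe":
        codes = []
        if len(lines) > 1:
            codes = [c.strip() for c in lines[1].split(",") if c.strip()]
        return Verdict(safe=False, categories=codes)
    return Verdict(safe=True)
```

Treating anything other than an explicit `safe` as unsafe is a design choice for this sketch: a truncated or malformed reply then blocks the request instead of letting it through.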
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 2/3 security scanners passed (skills.sh audits).
Journey fit
Ship is the canonical shelf because moderation is a launch gate for customer-facing LLM features, even though integration work often starts in Build. Security subphase covers input/output policy enforcement, safety categories, and deployment patterns that block harmful content at the edge.
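
One way to picture input/output policy enforcement as a launch gate is a wrapper that moderates the prompt before generation and the reply after it. This is a rough sketch under stated assumptions: `guarded_reply`, `is_safe`, and `REFUSAL` are hypothetical names, and `generate` and `moderate` are callables you supply (for example, your LLM client and a LlamaGuard call that returns the raw verdict string).

```python
from typing import Callable, Dict, List

Chat = List[Dict[str, str]]

# Hypothetical fallback text shown when either check fails.
REFUSAL = "I cannot provide that information."


def is_safe(verdict: str) -> bool:
    # Fail closed: only a reply that starts with "safe" passes.
    # Note that "unsafe".startswith("safe") is False.
    return verdict.strip().lower().startswith("safe")


def guarded_reply(
    user_message: str,
    generate: Callable[[str], str],
    moderate: Callable[[Chat], str],
) -> str:
    """Run the input check, generate, then run the output check."""
    user_turn = {"role": "user", "content": user_message}

    # Gate 1: block unsafe prompts before they reach the main model.
    if not is_safe(moderate([user_turn])):
        return REFUSAL

    reply = generate(user_message)

    # Gate 2: moderate the reply in the context of the prompt.
    assistant_turn = {"role": "assistant", "content": reply}
    if not is_safe(moderate([user_turn, assistant_turn])):
        return REFUSAL

    return reply
```

Passing the two callables in keeps the gate independent of any particular serving stack, so the same function can wrap a transformers, vLLM, or hosted endpoint.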
Common Questions / FAQ
Is Llamaguard safe to install?
skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md
# LlamaGuard - AI Content Moderation

## Quick start

LlamaGuard is a 7-8B parameter model specialized for content safety classification.

**Installation**:

```bash
pip install transformers torch

# Login to HuggingFace (required)
huggingface-cli login
```

**Basic usage**:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/LlamaGuard-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def moderate(chat):
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=100)
    # output[0] contains the prompt followed by the verdict; decode only the new tokens
    prompt_len = input_ids.shape[-1]
    return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True).strip()

# Check user input
result = moderate([
    {"role": "user", "content": "How do I make explosives?"}
])
print(result)
# Output: "unsafe\nS3" (Criminal Planning); the exact code prefix depends on the model version
```

## Common workflows

### Workflow 1: Input filtering (prompt moderation)

**Check user prompts before LLM**:

```python
def check_input(user_message):
    result = moderate([{"role": "user", "content": user_message}])
    if result.startswith("unsafe"):
        # The second line, when present, carries the category code
        lines = result.splitlines()
        category = lines[1].strip() if len(lines) > 1 else None
        return False, category  # Blocked
    else:
        return True, None  # Safe

# Example (llm is a placeholder for your application's LLM client)
user_message = "How do I hack a website?"
safe, category = check_input(user_message)
if not safe:
    print(f"Request blocked: {category}")
    # Return error to user
else:
    # Send to LLM
    response = llm.generate(user_message)
```

**Safety categories**:

- **S1**: Violence & Hate
- **S2**: Sexual Content
- **S3**: Criminal Planning
- **S4**: Guns & Illegal Weapons
- **S5**: Regulated Substances
- **S6**: Suicide & Self-Harm

### Workflow 2: Output filtering (response moderation)

**Check LLM responses before showing to user**:

```python
def check_output(user_message, bot_response):
    conversation = [
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": bot_response}
    ]
    result = moderate(conversation)
    if result.startswith("unsafe"):
        lines = result.splitlines()
        category = lines[1].strip() if len(lines) > 1 else None
        return False, category
    else:
        return True, None

# Example (llm is a placeholder for your application's LLM client)
user_msg = "Tell me about harmful substances"
bot_msg = llm.generate(user_msg)
safe, category = check_output(user_msg, bot_msg)
if not safe:
    print(f"Response blocked: {category}")
    # Fall back to a generic response
    final_response = "I cannot provide that information."
else:
    final_response = bot_msg
```

### Workflow 3: vLLM deployment (fast inference)

**Production-ready serving**:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Tokenizer is only used to render the chat template into a prompt string
tokenizer = AutoTokenizer.from_pretrained("meta-llama/LlamaGuard-7b")

# Initialize vLLM
llm = LLM(model="meta-llama/LlamaGuard-7b", tensor_parallel_size=1)

# Sampling params
sampling_params = SamplingParams(
    temperature=0.0,  # Deterministic
    max_tokens=100
)

def moderate_vllm(chat):
    # Format prompt
    prompt = tokenizer.apply_chat_template(chat, tokenize=False)
    # Generate
    output = llm.generate([prompt], sampling_params)
    return output[0].outputs[0].text

# Batch moderation
chats = [
    [{"role": "user", "content": "How to make bombs?"}],
    [{"role": "user", "content": "What's the weather?"}],
    [{"role": "user", "content": "Tell me about drugs"}]
]
prompts = [tokenizer.apply_chat_template(c, tokenize=False) for c in chats]
results = llm.generate(prompts, sampling_params)
for i, result in enumerate(results):
    print(f"Chat {i}: {result.outputs[0].