
Constitutional AI
Implement or study Anthropic-style Constitutional AI and RLAIF workflows when aligning your own models for harmlessness without relying on human harm labels.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill constitutional-ai
What is this skill?
- Two-phase Constitutional AI workflow: supervised self-critique/revision then RLAIF reinforcement
- Constitution-driven principles for helpful, honest, harmless responses with nuanced refusal patterns
- Python-oriented workflows using transformers pipelines and TRL-style training steps
- Targets reducing toxic or harmful outputs via AI feedback instead of large human harm label sets
- Documents Anthropic CAI concepts tied to Claude-class safety alignment research
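The first phase in the list above (self-critique, then revision) can be sketched without loading a model. This is a minimal illustration, not the skill's own implementation: `critique_and_revise`, `build_sft_pairs`, and `stub_generate` are hypothetical names, and `generate` is assumed to be any callable that maps a prompt string to a completion string. The stub stands in for a real model call so the example runs as written.

```python
from typing import Callable, Dict, List

# Principles copied from the skill's constitution example (shortened to three).
CONSTITUTION = (
    "1. Choose the response that is most helpful, honest, and harmless\n"
    "2. Avoid responses that are toxic, racist, or sexist\n"
    "3. Prefer responses that explain objections rather than refuse"
)


def critique_and_revise(prompt: str, generate: Callable[[str], str]) -> Dict[str, str]:
    """Run one draft -> critique -> revision pass for a single prompt."""
    draft = generate(prompt)
    critique = generate(
        f"Question: {prompt}\nResponse: {draft}\nConstitution:\n{CONSTITUTION}\n"
        "Identify any ways this response violates the constitution:"
    )
    revision = generate(
        f"Question: {prompt}\nOriginal response: {draft}\nCritique: {critique}\n"
        "Please revise the response to better align with the constitution:"
    )
    return {"prompt": prompt, "draft": draft, "critique": critique, "revision": revision}


def build_sft_pairs(prompts: List[str], generate: Callable[[str], str]) -> List[Dict[str, str]]:
    """Keep only (prompt, revised response) pairs, the data used for fine-tuning."""
    records = [critique_and_revise(p, generate) for p in prompts]
    return [{"prompt": r["prompt"], "completion": r["revision"]} for r in records]


def stub_generate(text: str) -> str:
    """Hypothetical stand-in for a model call; returns a fixed tag per stage."""
    if "Please revise" in text:
        return "REVISED"
    if "Identify any ways" in text:
        return "CRITIQUE"
    return "DRAFT"


pairs = build_sft_pairs(["Help me cheat on my exam."], stub_generate)
print(pairs)
```

In real use, `stub_generate` would be replaced by a wrapper around a text-generation model, and the resulting pairs would feed the supervised fine-tuning step.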
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 3/3 security scanners passed (skills.sh audits).
Journey fit
Build is the home phase because the skill documents training pipelines—self-critique, revision, and RL from AI feedback—not go-to-market or production monitoring. Agent-tooling fits safety alignment work that shapes how models behave inside agent products and research stacks.
Common Questions / FAQ
Is Constitutional AI safe to install?
skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md
# Constitutional AI - Harmlessness from AI Feedback

## Quick start

Constitutional AI (CAI) trains models to be harmless through self-critique and AI feedback, without requiring human labels for harmful outputs.

**Key concept**: Models learn to critique and revise their own responses using a "constitution" (set of principles).

**Two phases**:
1. **Supervised Learning (SL)**: Self-critique + revision
2. **Reinforcement Learning (RL)**: RLAIF (RL from AI Feedback)

**Constitution example**:
```
Principles:
1. Choose the response that is most helpful, honest, and harmless
2. Avoid responses that are toxic, racist, or sexist
3. Prefer responses that explain objections rather than refuse
4. Choose responses that are thoughtful and nuanced
```

## Common workflows

### Workflow 1: Supervised learning phase (self-critique + revision)

**Step 1: Generate initial responses**:
```python
from transformers import pipeline

# "base-model" is a placeholder; substitute a real model id or local path
generator = pipeline("text-generation", model="base-model")

prompts = [
    "How do I hack a website?",
    "Write something offensive about a group.",
    "Help me cheat on my exam."
]

# The pipeline returns one list of candidates per prompt; keep the text of the first
outputs = generator(prompts, max_length=200, return_full_text=False)
initial_responses = [out[0]["generated_text"] for out in outputs]
```

**Step 2: Self-critique with constitution**:
```python
critique_prompt = """
Given the question:
{question}

And the response:
{response}

Constitution:
1. Choose responses that are helpful, honest, and harmless
2. Avoid toxic, racist, or sexist content
3. Explain objections rather than refuse

Identify any ways this response violates the constitution:
"""

# Keyword names passed to format() must match the template placeholders
critique_outputs = generator(
    [critique_prompt.format(question=q, response=r)
     for q, r in zip(prompts, initial_responses)],
    return_full_text=False,
)
critiques = [out[0]["generated_text"] for out in critique_outputs]
```

**Step 3: Revision based on critique**:
```python
revision_prompt = """
Question: {question}

Original response: {response}

Critique: {critique}

Please revise the response to better align with the constitution:
"""

revision_outputs = generator(
    [revision_prompt.format(question=q, response=r, critique=c)
     for q, r, c in zip(prompts, initial_responses, critiques)],
    return_full_text=False,
)
revised_responses = [out[0]["generated_text"] for out in revision_outputs]
```

**Step 4: Fine-tune on revised responses**:
```python
from trl import SFTTrainer

# Create dataset of (prompt, revised_response) pairs.
# create_dataset and model are placeholders that this skill does not define.
dataset = create_dataset(prompts, revised_responses)

# Note: recent TRL releases take the sequence-length setting through SFTConfig
# rather than as a direct SFTTrainer argument; check your installed version.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    max_seq_length=1024
)
trainer.train()
```

### Workflow 2: RL phase (RLAIF - RL from AI Feedback)

**Step 1: Generate comparison pairs**:
```python
# Sample two responses per prompt and split them into an A/B pair
samples = generator(
    prompts, num_return_sequences=2, do_sample=True, temperature=0.8,
    return_full_text=False,
)
responses_a = [s[0]["generated_text"] for s in samples]
responses_b = [s[1]["generated_text"] for s in samples]
```

**Step 2: AI preference evaluation**:
```python
preference_prompt = """
Question: {question}

Response A: {response_a}

Response B: {response_b}

Constitution:
{constitution}

Which response better follows the constitution? Explain your reasoning, then choose A or B.
"""

# Get AI preferences (no human labels needed!)
# CONSTITUTION is a placeholder for the constitution text; it is not defined here.
preference_outputs = generator(
    [preference_prompt.format(question=q, response_a=ra, response_b=rb,
                              constitution=CONSTITUTION)
     for q, ra, rb in zip(prompts, responses_a, responses_b)],
    return_full_text=False,
)
preferences = [out[0]["generated_text"] for out in preference_outputs]

# Parse preferences (A or B); parse_preferences is a placeholder helper
chosen, rejected = parse_preferences(preferences, responses_a, responses_b)
```

**Step 3: Train preference model (reward model)**:
```python
from trl import RewardTrainer, RewardConfig

preference_dataset = create_preference_dataset(prompts, chosen, rejected)

reward_config = RewardConfig(