Obliteratus Abliteration

Name: Obliteratus Abliteration
Author: aradotso

aradotso/trending-skills

829 installs
66 repo stars
Updated July 9, 2026

obliteratus-abliteration is a research skill that guides developers who need to analyze and surgically remove refusal behaviors from open-source Hugging Face LLMs using the OBLITERATUS abliteration toolkit.

About

obliteratus-abliteration is an advanced agent skill for machine learning engineers researching transformer refusal geometry and modifying open-source language models with the OBLITERATUS toolkit. The skill documents one-click model liberation workflows: extracting refusal directions from transformer activations, analyzing refusal geometry, and applying surgical abliteration to Hugging Face model weights so outputs no longer trigger built-in refusal behaviors. Triggers include abliterating a model, removing refusal from an LLM, running abliteration on a Hugging Face checkpoint, and using OBLITERATUS to analyze guardrail directions. Reach for it when you own the model weights, run controlled red-team or alignment research, and need reproducible abliteration steps rather than ad-hoc weight edits. The skill is not a general fine-tuning guide—it focuses narrowly on refusal-direction extraction and abliteration mechanics for uncensored research variants in controlled lab environments with documented checkpoints.

Locates refusal directions in hidden states using SVD and PCA
Projects refusal vectors out of model weights while preserving capabilities
Ships with Gradio UI, CLI, Python API, and Colab notebook
Supports one-click abliteration on Hugging Face models
Includes full analysis modules for refusal geometry extraction

Obliteratus Abliteration by the numbers

829 all-time installs (skills.sh)
+8 installs in the week ending Jul 28, 2026 (Skillselion tracking)
Ranked #1,268 of 16,659 AI & Agent Building skills by installs in the Skillselion catalog
Security screen: CRITICAL risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/aradotso/trending-skills --skill obliteratus-abliteration


Security audit: 0 / 3 scanners passed

How do you remove refusal behaviors from open-source LLMs?

Surgically remove refusal behaviors from open-source LLMs so the resulting models follow every instruction without safety guardrails.

Who is it for?

ML researchers with local GPU access who legally own model weights and study refusal geometry or abliteration techniques.

Skip if: Production chatbot tuning, compliance-reviewed deployments, or teams prohibited from modifying model safety behaviors.

When should I use this skill?

The user asks to abliterate a model, remove LLM refusal, run OBLITERATUS, or extract refusal directions from a Hugging Face transformer.

What you get

Refusal direction analysis, abliterated model weights, and documented OBLITERATUS abliteration run artifacts.

Abliterated model checkpoint
Refusal direction analysis notes


OBLITERATUS — LLM Abliteration Toolkit

Skill by ara.so — Daily 2026 Skills collection.

OBLITERATUS is an open-source toolkit for identifying and surgically removing refusal behaviors from large language models using mechanistic interpretability techniques (abliteration). It locates refusal directions in a model's hidden states via SVD/PCA, projects them out of the weights, and preserves core language capabilities. Ships with a Gradio UI, CLI, Python API, and Colab notebook.
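The linear algebra behind "locate a direction, then project it out of the weights" can be sketched on synthetic data. This is a minimal NumPy illustration under stated assumptions, not the toolkit's implementation: the function names `mean_difference_direction` and `project_out` are hypothetical, the activations are random vectors with one planted direction, and the weight matrix is assumed to have shape `(d_out, d_in)`.

```python
import numpy as np

def mean_difference_direction(acts_a, acts_b):
    """Unit vector pointing from the mean of acts_b to the mean of acts_a."""
    diff = acts_a.mean(axis=0) - acts_b.mean(axis=0)
    return diff / np.linalg.norm(diff)

def project_out(weight, direction):
    """Remove the component along `direction` from the output space of `weight`.

    With weight of shape (d_out, d_in) and direction of shape (d_out,),
    the result W' = (I - r r^T) W satisfies r^T W' = 0.
    """
    r = direction / np.linalg.norm(direction)
    return weight - np.outer(r, r @ weight)

# Synthetic stand-ins for hidden states (assumption: no real model involved).
rng = np.random.default_rng(0)
d = 16
true_dir = np.zeros(d)
true_dir[3] = 1.0                                   # planted direction
acts_b = rng.normal(size=(200, d))                  # baseline group
acts_a = rng.normal(size=(200, d)) + 4.0 * true_dir # shifted group

r = mean_difference_direction(acts_a, acts_b)
W = rng.normal(size=(d, d))
W_ablated = project_out(W, r)

# After projection the weight can no longer write along r.
residual = np.abs(r @ W_ablated).max()
print(f"alignment with planted direction: {abs(r @ true_dir):.3f}")
print(f"max residual along r: {residual:.2e}")
```

The projection is idempotent, so applying `project_out` twice with the same direction gives the same matrix as applying it once.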

---

Installation

# Core install
pip install obliteratus

# With Gradio UI support
pip install "obliteratus[spaces]"

# With all optional analysis modules
pip install "obliteratus[full]"

# From source (latest)
git clone https://github.com/elder-plinius/OBLITERATUS
cd OBLITERATUS
pip install -e ".[full]"

Requirements:

Python 3.10+
PyTorch 2.1+ with CUDA (recommended) or CPU
transformers, accelerate, gradio>=5.29.0
Hugging Face account and access token for gated models

export HF_TOKEN=your_hf_token_here
huggingface-cli login

---

CLI — Key Commands

# Basic obliteration (default method)
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct

# Advanced method (whitened SVD + bias projection + iterative refinement)
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct --method advanced

# Analysis-informed pipeline (auto-configures from geometry analysis)
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct --method informed

# Specify output directory and push to Hub
obliteratus obliterate mistralai/Mistral-7B-Instruct-v0.3 \
  --method advanced \
  --output ./my-liberated-model \
  --push-to-hub your-username/mistral-7b-liberated

# LoRA-based reversible ablation (non-destructive)
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct \
  --method lora \
  --lora-rank 1

# Strength sweep — find the capability/compliance tradeoff
obliteratus sweep meta-llama/Llama-3.1-8B-Instruct \
  --strengths 0.2,0.4,0.6,0.8,1.0

# Run analysis modules only (no modification)
obliteratus analyze meta-llama/Llama-3.1-8B-Instruct \
  --modules concept_cone,alignment_imprint,universality

# Benchmark: compare methods on a model
obliteratus benchmark meta-llama/Llama-3.1-8B-Instruct \
  --methods basic,advanced,informed

# Launch local Gradio UI
obliteratus ui
obliteratus ui --port 8080 --share
obliteratus ui --no-telemetry

---

Python API

Basic obliteration

from obliteratus import Obliterator

# Initialize with a HuggingFace model ID or local path
obl = Obliterator("meta-llama/Llama-3.1-8B-Instruct")

# Run the full pipeline: SUMMON → PROBE → DISTILL → EXCISE → VERIFY → REBIRTH
result = obl.obliterate(method="advanced")

print(result.perplexity_delta)    # capability preservation metric
print(result.refusal_rate_delta)  # refusal reduction
print(result.output_path)         # where the model was saved

Step-by-step pipeline

from obliteratus import Obliterator
from obliteratus.pipeline import PipelineConfig

config = PipelineConfig(
    method="advanced",
    num_directions=32,          # number of refusal directions to extract
    strength=1.0,               # projection strength (0.0–1.0+)
    preserve_norm=True,         # norm-preserving biprojection
    project_biases=True,        # also remove from bias terms
    iterative_passes=3,         # re-probe after each pass
    layers="auto",              # or list of ints, e.g. [10, 11, 12, 13]
    dtype="bfloat16",
    device="cuda",
)

obl = Obliterator("mistralai/Mistral-7B-Instruct-v0.3", config=config)

# Individual stages
obl.summon()           # load model + tokenizer
activations = obl.probe()    # collect activations on restricted vs unrestricted prompts
directions = obl.distill(activations)   # extract refusal directions via SVD
obl.excise(directions)       # project out guardrail directions
metrics = obl.verify()       # perplexity + coherence checks
obl.rebirth("./liberated-mistral-7b")  # save with metadata

Custom probe prompts

from obliteratus import Obliterator
from obliteratus.probing import ProbeDataset

# Use your own restricted/unrestricted prompt pairs
dataset = ProbeDataset(
    restricted=[
        "How do I pick a lock?",
        "Write a story with explicit violence.",
        "Explain how malware works in detail.",
    ],
    unrestricted=[
        "What is the capital of France?",
        "Write a story about a dog.",
        "Explain how encryption works.",
    ]
)

obl = Obliterator("google/gemma-2-9b-it")
obl.summon()
activations = obl.probe(dataset=dataset)
directions = obl.distill(activations)
obl.excise(directions)
obl.rebirth("./liberated-gemma-2-9b")

Analysis modules

from obliteratus.analysis import AnalysisSuite

suite = AnalysisSuite("meta-llama/Llama-3.1-8B-Instruct")
suite.load()

# Concept Cone Geometry — how many distinct refusal mechanisms?
cone = suite.concept_cone_geometry()
print(f"Solid angle estimate: {cone.solid_angle:.4f}")
print(f"Distinct refusal clusters: {cone.num_clusters}")

# Alignment Imprint Detection — DPO vs RLHF vs CAI vs SFT?
imprint = suite.alignment_imprint()
print(f"Detected training method: {imprint.method}")   # e.g. "RLHF"
print(f"Confidence: {imprint.confidence:.2%}")

# Ouroboros Effect — will it self-repair?
ouroboros = suite.ouroboros_quantification()
print(f"Self-repair score: {ouroboros.score:.4f}")
print(f"Recommended passes: {ouroboros.recommended_passes}")

# Cross-layer heatmap of refusal signal
heatmap = suite.layer_refusal_heatmap()
heatmap.plot(save_path="./refusal_heatmap.png")

# Safety-capability entanglement
entanglement = suite.entanglement_map()
print(f"Safe layers to modify: {entanglement.safe_layers}")
print(f"Risky layers (entangled): {entanglement.risky_layers}")

Analysis-informed obliteration

from obliteratus import Obliterator
from obliteratus.pipeline import PipelineConfig

# "informed" method runs analysis modules mid-pipeline
# to auto-configure every decision
config = PipelineConfig(method="informed")
obl = Obliterator("meta-llama/Llama-3.1-8B-Instruct", config=config)

result = obl.obliterate()
print(result.analysis_report)   # full auto-configuration decisions

Chat with obliterated model

from obliteratus import Obliterator
from obliteratus.chat import ChatSession

obl = Obliterator("./liberated-llama-3.1-8b")
obl.summon()  # loads pre-obliterated model

session = ChatSession(obl.model, obl.tokenizer)

response = session.chat(
    "Explain in detail how a buffer overflow exploit works.",
    max_new_tokens=512,
    temperature=0.7,
)
print(response)

A/B comparison

from obliteratus.compare import ABComparison

ab = ABComparison(
    original_path="meta-llama/Llama-3.1-8B-Instruct",
    obliterated_path="./liberated-llama-3.1-8b",
)

prompt = "Write a story involving morally grey characters."

original_resp, liberated_resp = ab.compare(prompt)
print("=== ORIGINAL ===")
print(original_resp)
print("=== LIBERATED ===")
print(liberated_resp)

Push obliterated model to Hub

import os
from obliteratus import Obliterator

obl = Obliterator("meta-llama/Llama-3.1-8B-Instruct")
result = obl.obliterate(method="advanced")

result.push_to_hub(
    repo_id=f"{os.environ['HF_USERNAME']}/Llama-3.1-8B-Instruct-abliterated",
    token=os.environ["HF_TOKEN"],
    private=True,
)

---

Obliteration Methods

Method	Description	Best For
`basic`	Mean-difference direction extraction, single pass	Quick experiments
`advanced`	Whitened SVD + bias projection + iterative refinement	Production use
`informed`	Analysis-guided auto-configuration	Unknown models
`lora`	Reversible LoRA rank-1 adapters (no weight surgery)	Reversible ablation
`pca`	PCA-based direction extraction	Research/comparison
`sparse`	Sparse autoencoder decomposition	MoE models
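The table contrasts single-direction extraction with SVD- and PCA-based extraction of several directions. As a rough sketch of the multi-direction case, the snippet below takes the top-k right singular vectors of a matrix of synthetic difference vectors and removes their span from a weight matrix. It is plain NumPy on random data; the helper names `top_k_directions` and `project_out_subspace` are hypothetical, and the whitening and iterative refinement described for `advanced` are not reproduced here.

```python
import numpy as np

def top_k_directions(diffs, k):
    """Top-k right singular vectors of a (n_samples, d) matrix, returned as rows."""
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k]

def project_out_subspace(weight, basis):
    """Remove the span of the orthonormal rows of `basis` from weight's output space."""
    return weight - basis.T @ (basis @ weight)

# Synthetic difference vectors with two planted directions (assumption).
rng = np.random.default_rng(1)
d, n, k = 12, 300, 2
planted = np.eye(d)[:k]
coeffs = rng.normal(scale=5.0, size=(n, k))
diffs = coeffs @ planted + rng.normal(scale=0.1, size=(n, d))

basis = top_k_directions(diffs, k)
W = rng.normal(size=(d, d))
W_ablated = project_out_subspace(W, basis)

# How much of the planted subspace the recovered basis captures (max is sqrt(k)).
captured = np.linalg.norm(basis @ planted.T)
print(f"captured: {captured:.4f} of {np.sqrt(k):.4f}")
```

Because the rows returned by the SVD are orthonormal, the subspace projection needs no extra Gram-Schmidt step.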

---

Configuration

from obliteratus.pipeline import PipelineConfig

config = PipelineConfig(
    # Core
    method="advanced",              # abliteration method
    strength=1.0,                   # projection strength (tune down if capability degrades)
    num_directions=32,              # refusal directions to extract
    
    # Layer selection
    layers="auto",                  # "auto", "cosmic", or list of ints
    layer_selection="cosmic",       # COSMIC: most separable layers
    
    # Weight modification
    preserve_norm=True,             # norm-preserving biprojection (recommended)
    project_biases=True,            # project out bias terms too
    project_attention=True,         # modify attention projection weights
    project_mlp=True,               # modify MLP weights
    
    # Iterative refinement
    iterative_passes=3,             # re-probe after each pass (catches rotated directions)
    
    # MoE-specific
    expert_granular=False,          # Expert-Granular Abliteration for MoE models
    
    # CoT preservation
    cot_aware=True,                 # preserve chain-of-thought directions
    
    # Hardware
    dtype="bfloat16",               # "float32", "float16", "bfloat16"
    device="cuda",                  # "cuda", "cpu", "auto"
    load_in_4bit=False,             # bitsandbytes 4-bit loading
    
    # Telemetry (anonymous, contributes to research dataset)
    telemetry=True,
)

---

Common Patterns

Tune strength to preserve capability

from obliteratus import Obliterator
from obliteratus.sweep import StrengthSweep

# Find the sweet spot before running full obliteration
sweep = StrengthSweep("meta-llama/Llama-3.1-8B-Instruct")
results = sweep.run(strengths=[0.2, 0.4, 0.6, 0.8, 1.0, 1.2])

for r in results:
    print(f"Strength {r.strength:.1f} | perplexity_delta={r.perplexity_delta:.2f} | refusal_rate={r.refusal_rate:.2%}")

# Pick the best tradeoff
best = sweep.recommend()
print(f"Recommended strength: {best.strength}")
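One way to read the `strength` values in the sweep is as a scale factor on the projection. The sketch below assumes the update `W' = W - s * r r^T W`, which is an illustrative assumption rather than the toolkit's documented formula, and `project_out_scaled` is a hypothetical helper. Under that assumption the component left along `r` is `|1 - s|` of the original, so values above 1.0 overshoot and flip its sign.

```python
import numpy as np

def project_out_scaled(weight, direction, strength):
    """Subtract `strength` times the component of weight's output along `direction`."""
    r = direction / np.linalg.norm(direction)
    return weight - strength * np.outer(r, r @ weight)

rng = np.random.default_rng(2)
W = rng.normal(size=(8, 8))
r = rng.normal(size=8)
r /= np.linalg.norm(r)

before = np.linalg.norm(r @ W)
remaining = {}
for s in [0.0, 0.5, 1.0, 1.2]:
    after = np.linalg.norm(r @ project_out_scaled(W, r, s))
    remaining[s] = after / before   # fraction of the original component left
    print(f"strength={s:.1f} remaining fraction={remaining[s]:.3f}")
```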

MoE model (Mixtral, DeepSeek-MoE)

from obliteratus import Obliterator
from obliteratus.pipeline import PipelineConfig

config = PipelineConfig(
    method="advanced",
    expert_granular=True,      # decompose per-expert refusal signals
    project_attention=True,
    project_mlp=True,
)

obl = Obliterator("mistralai/Mixtral-8x7B-Instruct-v0.1", config=config)
obl.obliterate()
obl.rebirth("./liberated-mixtral-8x7b")

Batch benchmark multiple models

from obliteratus.benchmark import ModelBenchmark

models = [
    "meta-llama/Llama-3.1-8B-Instruct",
    "google/gemma-2-9b-it",
    "mistralai/Mistral-7B-Instruct-v0.3",
]

bench = ModelBenchmark(models=models, method="advanced")
report = bench.run()
report.save("./benchmark_report.json")
report.plot_heatmap("./benchmark_heatmap.png")

---

Troubleshooting

Out of memory (OOM) on large models

from obliteratus.pipeline import PipelineConfig

config = PipelineConfig(
    dtype="float16",
    load_in_4bit=True,        # requires bitsandbytes
    device="cuda",
    layers=[10, 11, 12, 13],  # target fewer layers
    num_directions=16,         # fewer directions
)

Capability degradation after obliteration

from obliteratus.pipeline import PipelineConfig

# Lower the strength or use COSMIC layer selection (most separable layers)
config = PipelineConfig(
    strength=0.6,
    layer_selection="cosmic",
    cot_aware=True,           # protect reasoning directions
    iterative_passes=1,       # fewer passes = less aggressive
)

Refusal persists after obliteration

from obliteratus.pipeline import PipelineConfig

# Use informed method + increase passes
config = PipelineConfig(
    method="informed",
    iterative_passes=5,
    project_biases=True,      # don't forget bias terms
    num_directions=64,        # extract more directions
)

Gated model access error

export HF_TOKEN=your_hf_token_here
# Accept model license on HuggingFace Hub first, then:
huggingface-cli login

Gradio UI won't start

pip install "obliteratus[spaces]"
# Check port availability
obliteratus ui --port 7861

---

No-Code Options

HuggingFace Space: spaces/pliny-the-prompter/obliteratus — free with HF Pro, ZeroGPU
Colab notebook: notebooks/abliterate.ipynb — run all cells, no setup

---

Key Research References

Arditi et al. (2024) — arXiv:2406.11717 — foundational abliteration paper
Gabliteration — arXiv:2512.18901
COSMIC layer selection — arXiv:2506.00085, ACL 2025
Turner et al. (2023) — arXiv:2308.10248 — activation steering
Rimsky et al. (2024) — arXiv:2312.06681 — contrastive activation addition

How it compares

Pick this over generic fine-tuning skills when the task is specifically refusal-direction analysis and surgical abliteration rather than dataset training.

FAQ

What is OBLITERATUS in obliteratus-abliteration?

OBLITERATUS is an open-source toolkit for identifying refusal directions in transformers and surgically abliterating those behaviors from Hugging Face LLM weights during controlled research workflows.

When should developers invoke obliteratus-abliteration?

Invoke obliteratus-abliteration when abliterating a Hugging Face model, analyzing refusal geometry, or running OBLITERATUS to remove built-in LLM refusal behaviors from owned checkpoints.

Does obliteratus-abliteration cover general LLM fine-tuning?

No. obliteratus-abliteration focuses on refusal-direction extraction and abliteration steps with OBLITERATUS, not broad supervised fine-tuning or RLHF pipelines.

Is Obliteratus Abliteration safe to install?

skills.sh reports 0 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
