
Transformer Lens Interpretability
Load transformer models in TransformerLens, hook activations, and inspect weight matrices for mechanistic interpretability experiments.
Overview
Transformer Lens Interpretability is an agent skill, most often used in Build (also Idea research and Ship review), that documents HookedTransformer loading and weight inspection for mechanistic LLM analysis.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill transformer-lens-interpretability
What is this skill?
- Loads models with HookedTransformer.from_pretrained, including the device, dtype, and multi-device parallelism (n_devices) parameters
- Documents the fold_ln, center_writing_weights, and center_unembed loading options in a from_pretrained parameter table (fold_ln defaults to True)
- Catalogs the core weight matrices W_E, W_U, W_pos, and W_Q with their tensor shapes, including the per-layer, per-head shapes
- Shows a gated model loading pattern with HF_TOKEN for LLaMA and Mistral checkpoints
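The fold_ln option listed above can be understood with plain linear algebra. The sketch below is a NumPy illustration, not TransformerLens code: it assumes a LayerNorm with learned scale `gamma` and bias `beta` followed by a linear layer `x @ W + b`, and checks that moving `gamma` and `beta` into `W` and `b` leaves the output unchanged. The helper names (`layer_norm`, `normalize_only`, `fold_layer_norm`) are hypothetical.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Standard LayerNorm over the last axis, with learned scale and bias."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

def normalize_only(x, eps=1e-5):
    """LayerNorm with the learned parameters removed (center and scale only)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def fold_layer_norm(gamma, beta, W, b):
    """Fold gamma/beta into the following linear layer (hypothetical helper)."""
    W_folded = gamma[:, None] * W  # scale each input row of W by gamma
    b_folded = beta @ W + b        # push beta through W into the bias
    return W_folded, b_folded

rng = np.random.default_rng(0)
d_model, d_out = 8, 5
x = rng.normal(size=(3, d_model))
gamma, beta = rng.normal(size=d_model), rng.normal(size=d_model)
W, b = rng.normal(size=(d_model, d_out)), rng.normal(size=d_out)

original = layer_norm(x, gamma, beta) @ W + b
W_f, b_f = fold_layer_norm(gamma, beta, W, b)
folded = normalize_only(x) @ W_f + b_f

# The two computations agree up to floating-point error.
max_abs_diff = float(np.abs(original - folded).max())
```

After folding, the normalization step has no learned parameters left, which tends to make the weights that read from the residual stream easier to analyze directly.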
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You want to see what happens inside a transformer layer but raw Hugging Face models hide activations behind opaque forward passes.
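To make the idea of exposing intermediate activations concrete, here is a toy, pure-Python sketch of named hook points. It is not the TransformerLens implementation: `TinyHookedModel`, its two steps, and the hook names `step0.out` and `step1.out` are hypothetical, chosen only to mirror the cache-then-intervene workflow.

```python
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]
HookFn = Callable[[Vector, str], Optional[Vector]]

class TinyHookedModel:
    """Hypothetical two-step model with named hook points (illustration only)."""

    def __init__(self) -> None:
        self.hooks: Dict[str, List[HookFn]] = {"step0.out": [], "step1.out": []}

    def _hook_point(self, name: str, value: Vector) -> Vector:
        # A hook may return a replacement value; returning None keeps the value.
        for fn in self.hooks[name]:
            result = fn(value, name)
            if result is not None:
                value = result
        return value

    def forward(self, x: Vector) -> Vector:
        h = self._hook_point("step0.out", [2.0 * v for v in x])     # step 0: double
        return self._hook_point("step1.out", [v + 1.0 for v in h])  # step 1: add one

    def run_with_hooks(self, x: Vector, fwd_hooks: List[Tuple[str, HookFn]]) -> Vector:
        # Attach hooks for this call only, then detach them again.
        for name, fn in fwd_hooks:
            self.hooks[name].append(fn)
        try:
            return self.forward(x)
        finally:
            for name in self.hooks:
                self.hooks[name].clear()

    def run_with_cache(self, x: Vector) -> Tuple[Vector, Dict[str, Vector]]:
        cache: Dict[str, Vector] = {}

        def save(value: Vector, name: str) -> None:
            cache[name] = list(value)  # record a copy, leave the value unchanged

        out = self.run_with_hooks(x, [(name, save) for name in self.hooks])
        return out, cache

model = TinyHookedModel()
out, cache = model.run_with_cache([1.0, 2.0])

def zero_first(value: Vector, name: str) -> Vector:
    # Intervention: overwrite the first element of the intermediate value.
    return [0.0] + value[1:]

patched = model.run_with_hooks([1.0, 2.0], [("step0.out", zero_first)])
```

Here `out` is `[3.0, 5.0]`, `cache["step0.out"]` is `[2.0, 4.0]`, and the patched run returns `[1.0, 5.0]`, showing both reading and editing an intermediate value by name.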
Who is it for?
Builders running Python GPU notebooks who debug LLM behavior with TransformerLens hooks and official model name tables.
Skip if: you are a no-code marketer, a team without PyTorch/GPU access, or a production-only operator who only needs hosted API logs.
When should I use this skill?
You need HookedTransformer loading options, weight matrix reference, or gated HF model setup for TransformerLens interpretability work.
What do I get? / Deliverables
You can load HookedTransformer with the right folding and device settings and navigate core weight tensors to support hook-based interpretability experiments.
- Runnable HookedTransformer load snippet with device/dtype and folding flags
- Hook points and weight property map for your interpretability notebook or script
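One of the folding flags mentioned above, center_unembed, relies on the fact that softmax ignores a constant added to every logit. The NumPy sketch below illustrates that property under the assumption that centering means subtracting, from each row of `W_U` (shape `[d_model, d_vocab]`), its mean over the vocabulary axis. The random tensors stand in for real model weights.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

rng = np.random.default_rng(0)
d_model, d_vocab = 6, 11
resid = rng.normal(size=(4, d_model))      # stand-in residual stream, [pos, d_model]
W_U = rng.normal(size=(d_model, d_vocab))  # stand-in unembedding, [d_model, d_vocab]

# Assumed centering rule: remove the mean over the vocabulary axis.
W_U_centered = W_U - W_U.mean(axis=-1, keepdims=True)

logits = resid @ W_U
logits_centered = resid @ W_U_centered

# Log-probabilities are unchanged, and centered logits average to zero per position.
logprob_diff = float(np.abs(log_softmax(logits) - log_softmax(logits_centered)).max())
mean_logit = float(np.abs(logits_centered.mean(axis=-1)).max())
```

Because the predictions are identical, centering only removes a degree of freedom that would otherwise blur comparisons between raw logit values.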
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Interpretability work supports design and debugging of agent behavior; the canonical shelf is Build because you instrument models you ship or fine-tune. HookedTransformer APIs are agent/LLM tooling—how you probe circuits while building or hardening model-backed features.
Where it fits
Compare gpt2-small versus medium activations before committing to a custom agent model size.
Register forward hooks on attention blocks while prototyping a tool-calling policy.
Trace anomalous completions through W_U and residual stream hooks ahead of a limited beta.
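Tracing a completion through W_U, as in the last scenario, often comes down to asking which component's write into the residual stream pushed a token's logit up. The NumPy sketch below shows that decomposition under simplifying assumptions: it ignores the final LayerNorm and the unembedding bias, and the component names and random vectors are hypothetical stand-ins for cached activations.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_vocab = 6, 9
W_U = rng.normal(size=(d_model, d_vocab))  # stand-in unembedding, [d_model, d_vocab]

# Hypothetical per-component writes into the residual stream at one position.
component_outputs = {
    "embed": rng.normal(size=d_model),
    "attn_out_0": rng.normal(size=d_model),
    "mlp_out_0": rng.normal(size=d_model),
}
resid_final = sum(component_outputs.values())  # residual stream is the sum of writes

token_id = 4  # arbitrary token of interest
total_logit = float(resid_final @ W_U[:, token_id])

# Because the unembedding is linear, each component's share is its own dot product.
contributions = {
    name: float(vec @ W_U[:, token_id]) for name, vec in component_outputs.items()
}
top_component = max(contributions, key=lambda name: abs(contributions[name]))
```

The per-component contributions sum to the total logit, so the largest one points at the layer worth hooking first.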
How it compares
Mechanistic API reference for TransformerLens—not prompt-engineering templates or an MCP observability server.
Common Questions / FAQ
Who is transformer-lens-interpretability for?
Solo AI builders and researchers using Claude Code or Cursor to implement TransformerLens workflows on GPT-2-class and gated HF models.
When should I use transformer-lens-interpretability?
In Idea research when exploring model behavior; in Build agent-tooling when wiring hooks for circuit analysis; in Ship review when investigating odd outputs before release.
Is transformer-lens-interpretability safe to install?
Loading gated weights uses HF tokens and downloads large checkpoints—check the Security Audits panel on this page and only run trusted code in isolated GPU environments.
SKILL.md
SKILL.md - Transformer Lens Interpretability
# TransformerLens API Reference

## HookedTransformer

The core class for mechanistic interpretability, wrapping transformer models with hooks on every activation.

### Loading Models

```python
import os

import torch  # needed for torch.float16 below
from transformer_lens import HookedTransformer

# Basic loading
model = HookedTransformer.from_pretrained("gpt2-small")

# With specific device/dtype
model = HookedTransformer.from_pretrained(
    "gpt2-medium",
    device="cuda",
    dtype=torch.float16
)

# Gated models (LLaMA, Mistral) need a Hugging Face access token
os.environ["HF_TOKEN"] = "your_token"
model = HookedTransformer.from_pretrained("meta-llama/Llama-2-7b-hf")
```

### from_pretrained() Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `model_name` | str | required | Model name from OFFICIAL_MODEL_NAMES |
| `fold_ln` | bool | True | Fold LayerNorm weights into subsequent layers |
| `center_writing_weights` | bool | True | Center residual stream writer means |
| `center_unembed` | bool | True | Center unembedding weights |
| `dtype` | torch.dtype | None | Model precision |
| `device` | str | None | Target device |
| `n_devices` | int | 1 | Number of devices for model parallelism |

### Weight Matrices

| Property | Shape | Description |
|----------|-------|-------------|
| `W_E` | [d_vocab, d_model] | Token embedding matrix |
| `W_U` | [d_model, d_vocab] | Unembedding matrix |
| `W_pos` | [n_ctx, d_model] | Positional embedding |
| `W_Q` | [n_layers, n_heads, d_model, d_head] | Query weights |
| `W_K` | [n_layers, n_heads, d_model, d_head] | Key weights |
| `W_V` | [n_layers, n_heads, d_model, d_head] | Value weights |
| `W_O` | [n_layers, n_heads, d_head, d_model] | Output weights |
| `W_in` | [n_layers, d_model, d_mlp] | MLP input weights |
| `W_out` | [n_layers, d_mlp, d_model] | MLP output weights |

### Core Methods

#### forward()

```python
logits = model(tokens)
logits = model(tokens, return_type="logits")
loss = model(tokens, return_type="loss")
logits, loss = model(tokens, return_type="both")
```

Parameters:
- `input`: Token tensor or string
- `return_type`: "logits", "loss", "both", or None
- `prepend_bos`: Whether to prepend BOS token
- `start_at_layer`: Start execution from specific layer
- `stop_at_layer`: Stop execution at specific layer

#### run_with_cache()

```python
logits, cache = model.run_with_cache(tokens)

# Selective caching (saves memory)
logits, cache = model.run_with_cache(
    tokens,
    names_filter=lambda name: "resid_post" in name
)

# Cache on CPU
logits, cache = model.run_with_cache(tokens, device="cpu")
```

#### run_with_hooks()

```python
def my_hook(activation, hook):
    # Zero out the first d_model dimension at every position
    activation[:, :, 0] = 0
    return activation

logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[("blocks.5.hook_resid_post", my_hook)]
)
```

#### generate()

```python
output = model.generate(
    tokens,
    max_new_tokens=50,
    temperature=0.7,
    top_k=40,
    top_p=0.9,
    freq_penalty=1.0,
    use_past_kv_cache=True
)
```

### Tokenization Methods

```python
# String to tokens
tokens = model.to_tokens("Hello world")  # [1, seq_len]
tokens = model.to_tokens("Hello", prepend_bos=False)

# Tokens to string
text = model.to_string(tokens)

# Get string tokens (for debugging)
str_tokens = model.to_str_tokens("Hello world")
# ['<|endoftext|>', 'Hello', ' world']

# Single token validation
token_id = model.to_single_token(" Paris")  # Returns int or raises error
```

### Hook Management

```python
# Clear all non-permanent hooks
model.reset_hooks()

# Add a persistent hook (stays attached across forward passes until reset_hooks())
model.add_hook("blocks.0.hook_resid_post", my_hook)

# Remove the forward hooks on one specific hook point
model.hook_dict["blocks.0.hook_resid_post"].remove_hooks(dir="fwd")
```

---

## ActivationCache

Stores and provides access to all activations from a forward pass.

### Accessing Activations

```python
logits, cache = model.run_with_cache(tokens)

# By name and layer
residual = cache["resid_post", 5]
attention = cache["pattern", 3]
mlp_out = cache["mlp_out", 7]

# Full name string
residual =