
Sparse Autoencoder Training
Load, train, and inspect sparse autoencoders with SAELens for interpretability and feature analysis on language-model activations.
Overview
Sparse Autoencoder Training is an agent skill for the Build phase that documents SAELens APIs to load, configure, encode, decode, and persist sparse autoencoders on model activations.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill sparse-autoencoder-training
What is this skill?
- SAE.from_pretrained for official releases, HuggingFace repos, and local disk
- Core encode, decode, and forward paths with documented tensor shapes
- SAEConfig parameters for architecture and training context
- save_model and load_from_disk for reproducible artifact handoff
- CUDA-oriented loading patterns in reference snippets
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You are wiring interpretability code but need accurate SAELens class methods, tensor shapes, and checkpoint loading paths without spelunking scattered papers and repos.
Who is it for?
Indie ML researchers and agent builders adding SAE-based feature extraction to LLM tooling or reproducible training notebooks.
Skip if: You only need generic LLM chat integration, with no activation-level interpretability or GPU training setup.
When should I use this skill?
Use it when you are implementing sparse autoencoder training, SAELens loading, or activation feature extraction for LLM interpretability.
What do I get? / Deliverables
Your agent implements correct SAE load/train/inference code with consistent config objects and saved artifacts ready for feature analysis or downstream evals.
- SAE training or inference script scaffolding
- Saved model directory via save_model
- Documented encode/decode pipelines with shape contracts
Journey fit
SAE training and encode/decode workflows are deep ML implementation work, which places this skill in the Build phase when you are extending agent or model research tooling. It documents SAELens APIs for hooks and features: tooling for researchers shipping interpretability pipelines, not generic app frontend or DevOps work.
How it compares
Reference skill for SAELens training APIs—not a hosted feature dashboard or a general fine-tuning cookbook.
Common Questions / FAQ
Who is sparse-autoencoder-training for?
Developers and researchers using SAELens to train or apply sparse autoencoders on transformer activations inside agent-assisted coding workflows.
When should I use sparse-autoencoder-training?
Use it in Build while implementing SAE training scripts, loading gpt2-small-res-jb or HuggingFace checkpoints, or debugging encode/decode tensor shapes before running experiments.
Is sparse-autoencoder-training safe to install?
Check the Security Audits panel on this Prism page; training skills may pull weights from external repos—verify licenses, disk paths, and GPU code before running.
SKILL.md - Sparse Autoencoder Training
# SAELens API Reference

## SAE Class

The core class representing a Sparse Autoencoder.

### Loading Pre-trained SAEs

```python
from sae_lens import SAE

# From official releases
sae, cfg_dict, sparsity = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id="blocks.8.hook_resid_pre",
    device="cuda"
)

# From HuggingFace
sae, cfg_dict, sparsity = SAE.from_pretrained(
    release="username/repo-name",
    sae_id="path/to/sae",
    device="cuda"
)

# From local disk
sae = SAE.load_from_disk("/path/to/sae", device="cuda")
```

### SAE Attributes

| Attribute | Shape | Description |
|-----------|-------|-------------|
| `W_enc` | [d_in, d_sae] | Encoder weights |
| `W_dec` | [d_sae, d_in] | Decoder weights |
| `b_enc` | [d_sae] | Encoder bias |
| `b_dec` | [d_in] | Decoder bias |
| `cfg` | SAEConfig | Configuration object |

### Core Methods

#### encode()

```python
# Encode activations to sparse features
features = sae.encode(activations)

# Input: [batch, pos, d_in]
# Output: [batch, pos, d_sae]
```

#### decode()

```python
# Reconstruct activations from features
reconstructed = sae.decode(features)

# Input: [batch, pos, d_sae]
# Output: [batch, pos, d_in]
```

#### forward()

```python
# Full forward pass (encode + decode)
reconstructed = sae(activations)

# Returns reconstructed activations
```

#### save_model()

```python
sae.save_model("/path/to/save")
```

---

## SAEConfig

Configuration class for SAE architecture and training context.
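The two dimensions `d_in` and `d_sae` fix every shape in the SAE Attributes table and in the encode/decode contracts. The sketch below works through that bookkeeping in plain Python. `SAEShapes` is a hypothetical helper, not part of SAELens, and it assumes a standard SAE that holds only the four tensors listed in that table; other architectures may carry extra parameters. The example values (768 inputs, 32x expansion) are illustrative.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class SAEShapes:
    """Hypothetical helper (not a SAELens class): shape bookkeeping for a standard SAE."""

    d_in: int   # width of the hooked activation (the model's d_model)
    d_sae: int  # number of SAE features (hidden dimension)

    def parameter_shapes(self) -> dict:
        # Mirrors the SAE Attributes table: name -> tensor shape.
        return {
            "W_enc": (self.d_in, self.d_sae),
            "W_dec": (self.d_sae, self.d_in),
            "b_enc": (self.d_sae,),
            "b_dec": (self.d_in,),
        }

    def n_parameters(self) -> int:
        # Total scalar parameters across the four tensors above.
        return sum(prod(shape) for shape in self.parameter_shapes().values())

    def feature_shape(self, batch: int, pos: int) -> tuple:
        # encode(): [batch, pos, d_in] -> [batch, pos, d_sae]
        return (batch, pos, self.d_sae)

    def reconstruction_shape(self, batch: int, pos: int) -> tuple:
        # decode(): [batch, pos, d_sae] -> [batch, pos, d_in]
        return (batch, pos, self.d_in)


# Illustrative values: a 768-wide residual stream with a 32x expansion.
shapes = SAEShapes(d_in=768, d_sae=768 * 32)
print(shapes.parameter_shapes())
print(shapes.n_parameters())
print(shapes.feature_shape(batch=4, pos=128))
```

A check like this is cheap to run before loading weights onto a GPU, since a mismatch between the hooked activation width and `d_in` is a common source of shape errors.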
### Key Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `d_in` | int | Input dimension (model's d_model) |
| `d_sae` | int | SAE hidden dimension |
| `architecture` | str | "standard", "gated", "jumprelu", "topk" |
| `activation_fn_str` | str | Activation function name |
| `model_name` | str | Source model name |
| `hook_name` | str | Hook point in model |
| `normalize_activations` | str | Normalization method |
| `dtype` | str | Data type |
| `device` | str | Device |

### Accessing Config

```python
print(sae.cfg.d_in)       # 768 for GPT-2 small
print(sae.cfg.d_sae)      # e.g., 24576 (32x expansion)
print(sae.cfg.hook_name)  # e.g., "blocks.8.hook_resid_pre"
```

---

## LanguageModelSAERunnerConfig

Comprehensive configuration for training SAEs.

### Example Configuration

```python
from sae_lens import LanguageModelSAERunnerConfig

cfg = LanguageModelSAERunnerConfig(
    # Model and hook
    model_name="gpt2-small",
    hook_name="blocks.8.hook_resid_pre",
    hook_layer=8,
    d_in=768,

    # SAE architecture
    architecture="standard",  # "standard", "gated", "jumprelu", "topk"
    d_sae=768 * 8,  # Expansion factor
    activation_fn="relu",

    # Training hyperparameters
    lr=4e-4,
    l1_coefficient=8e-5,
    lp_norm=1.0,
    lr_scheduler_name="constant",
    lr_warm_up_steps=500,

    # Sparsity control
    l1_warm_up_steps=1000,
    use_ghost_grads=True,
    feature_sampling_window=1000,
    dead_feature_window=5000,
    dead_feature_threshold=1e-8,

    # Data
    dataset_path="monology/pile-uncopyrighted",
    streaming=True,
    context_size=128,

    # Batch sizes
    train_batch_size_tokens=4096,
    store_batch_size_prompts=16,
    n_batches_in_buffer=64,

    # Training duration
    training_tokens=100_000_000,

    # Logging
    log_to_wandb=True,
    wandb_project="sae-training",
    wandb_log_frequency=100,

    # Checkpointing
    checkpoint_path="checkpoints",
    n_checkpoints=5,

    # Hardware
    device="cuda",
    dtype="float32",
)
```

### Key Parameters Explained

#### Architecture Parameters

| Parameter | Description |
|-----------|-------------|
| `architecture` | SAE type: "standard", "gated", "jumprelu", "topk" |
| `d_sae` | Hidden dimension (or use `expansion_factor`) |
| `expansion_factor` | Alternative to d_sae: d_sae = d_in × expansion_factor |
| `activation_fn` | "relu", "topk", etc. |
| `activation_fn_kwargs` | Dict for activation params (e.g., {"k": 50} for topk) |

#### Sparsity Para