
Mamba Architecture
Understand Mamba selective SSM (S6) mechanics and configs before choosing or implementing linear-time sequence models.
Overview
Mamba Architecture is an agent skill, most often used in the Idea phase (and also in Build for agent tooling), that explains the design, behavior, and configuration of selective SSMs (S6) for linear-time sequence models.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill mamba-architecture
What is this skill?
- Contrasts fixed-matrix SSMs vs Mamba input-dependent B, C, and Δ (S6) selection
- Explains O(n) recurrent updates with a constant state dimension vs O(n²) attention
- Documents typical d_model, d_state=16, and d_conv configuration knobs
- Covers content-based remember/forget behavior for long sequences
- Includes Python-oriented Mamba API setup references from mamba_ssm
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You need long-context or efficient sequence modeling but do not understand how Mamba’s selective state updates differ from attention or static SSMs.
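The difference between a static and a selective SSM can be sketched in a few lines of plain Python. This is a minimal single-channel illustration, not the mamba_ssm implementation: the names `ssm_scan`, `static_params`, and `selective_params` are hypothetical, the constants in `selective_params` are hand-picked rather than learned, and only Δ is made input-dependent here so the effect is easy to see (Mamba also makes B and C input-dependent).

```python
import math

def softplus(z):
    # Numerically stable softplus; keeps the step size Δ positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def ssm_scan(x, A, params):
    """Run a single-channel diagonal SSM over the scalar sequence x.

    A: list of N negative decay rates (diagonal state matrix).
    params(x_t) -> (dt, B, C), where B and C are lists of length N.
    """
    h = [0.0] * len(A)              # state size does not depend on len(x)
    ys = []
    for x_t in x:
        dt, B, C = params(x_t)
        # Zero-order hold for A, simplified Euler step for B.
        h = [math.exp(dt * a) * h_n + dt * b * x_t for a, h_n, b in zip(A, h, B)]
        ys.append(sum(c * h_n for c, h_n in zip(C, h)))
    return ys

N = 4
A = [-(n + 1.0) for n in range(N)]

def static_params(x_t):
    # Static SSM: the same Δ, B, C for every token.
    return 0.1, [1.0] * N, [1.0] * N

def selective_params(x_t):
    # Selective SSM (hypothetical rule): near-zero tokens get a tiny Δ,
    # so they are skipped and the existing state persists.
    dt = softplus(4.0 * abs(x_t) - 6.0)
    return dt, [1.0] * N, [1.0] * N

x = [1.0, 0.0, 0.0, 2.0, 0.0]
y_static = ssm_scan(x, A, static_params)
y_selective = ssm_scan(x, A, selective_params)

# Fraction of the first output that survives one "filler" token.
print("static   :", y_static[1] / y_static[0])
print("selective:", y_selective[1] / y_selective[0])
```

In the static case the state decays at the same rate no matter what the token is. In the selective case the filler token produces a tiny Δ, so the state written by the first token is carried forward almost unchanged.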
Who is it for?
Builders researching alternative LLM backbones, studying mamba_ssm, or scoping agent features that need long sequences with bounded memory.
Skip if: your team only ships CRUD SaaS on off-the-shelf GPT APIs with no custom model work, or you are a beginner looking for step-by-step training ops without a math background.
When should I use this skill?
Learning or implementing Mamba/S6 architecture, comparing sequence model families, or setting mamba_ssm hyperparameters.
What do I get? / Deliverables
You can reason about S6 selection, complexity, and core hyperparameters before implementing or swapping Mamba blocks in a research or agent stack.
- Accurate mental model of selective SSM updates and selection behavior
- Documented Mamba configuration choices (d_model, d_state, d_conv) for a planned implementation
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Architecture literacy belongs on the research shelf where builders evaluate model families before committing implementation effort. The skill explains selective state space theory, discretization, and parameter tradeoffs—not day-to-day app UI or deploy runbooks.
Where it fits
Compare Mamba versus transformer cost for a planned on-device agent before picking a backbone.
Configure d_model and d_state in a prototype mamba_ssm block while documenting why selection gates matter for your task.
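For the cost comparison in the first scenario, a back-of-the-envelope memory estimate is often enough to start. The sketch below compares per-sequence inference memory for a Transformer KV cache and for Mamba's recurrent state. The function names and both model configurations are hypothetical, and the estimate ignores weights, activations, and KV-cache savings such as grouped-query attention.

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, bytes_per_elem=2):
    # Keys and values (factor 2) are cached for every past token in every layer.
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_elem

def mamba_state_bytes(n_layers, d_model, d_state=16, d_conv=4, expand=2,
                      bytes_per_elem=2):
    d_inner = expand * d_model
    ssm_state = d_inner * d_state    # recurrent state h
    conv_state = d_inner * d_conv    # rolling window for the local convolution
    return n_layers * (ssm_state + conv_state) * bytes_per_elem

# Hypothetical configurations: a 24-layer Transformer (16 heads of size 64)
# and a 48-layer Mamba stack, both with d_model = 1024 and 2-byte elements.
for seq_len in (1024, 65536):
    kv = kv_cache_bytes(24, 16, 64, seq_len)
    ssm = mamba_state_bytes(48, 1024)
    print(f"seq_len={seq_len}: KV cache {kv / 2**20:.1f} MiB, "
          f"Mamba state {ssm / 2**20:.2f} MiB")
```

The KV cache grows linearly with sequence length, while the Mamba state stays the same size, which is the property that matters most for an on-device budget.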
How it compares
Architecture explainer for Mamba/SSM—not a Hugging Face fine-tuning checklist or generic prompt-engineering skill.
Common Questions / FAQ
Who is mamba-architecture for?
Indie ML builders, agent authors, and researchers evaluating or implementing Mamba-based models who need accurate selective SSM concepts in agent context.
When should I use mamba-architecture?
Use it in Idea/research when comparing attention vs SSM options, and in Build/agent-tooling when configuring mamba_ssm layers or explaining model choices to collaborators.
Is mamba-architecture safe to install?
It is informational research content; treat any bundled code snippets as references you audit yourself, and review the Security Audits panel on this Prism page for the skill package source.
SKILL.md
# Mamba Architecture Details

## Selective State Space Mechanism

Mamba's core innovation is the **Selective SSM (S6)** layer, which makes state space model parameters input-dependent.

### How S6 Works

**Traditional SSMs** (non-selective):

```python
# Fixed A, B, C matrices for all inputs (schematic, per time step t)
h[t] = A * h[t-1] + B * x[t]  # State update
y[t] = C * h[t]               # Output
```

**Mamba's Selective SSM**:

```python
# Input-dependent parameters (schematic, per time step t)
B[t] = Linear_B(x[t])  # Selection mechanism: what is written to the state
C[t] = Linear_C(x[t])  # Output projection: what is read from the state
Δ[t] = Linear_Δ(x[t])  # Discretization step

# Selective state update
# discretize(A, Δ) is exp(Δ * A) under zero-order hold
h[t] = discretize(A, Δ[t]) * h[t-1] + Δ[t] * B[t] * x[t]
y[t] = C[t] * h[t]
```

### Key Advantages

**1. Content-based reasoning**:
- Can selectively remember or forget based on input
- Addresses the weakness of traditional SSMs on discrete modalities such as text
- Example: remembers important tokens, ignores padding

**2. Input-dependent selection**:

```python
# Illustrative only: in practice Δ is a learned, continuous function of x[t]
if is_important(x[t]):
    Δ[t] = large_value  # Write x[t] into the state (older state decays)
else:
    Δ[t] = small_value  # Skip x[t]; the existing state persists
```

**3. No attention required**:
- Replaces O(n²) attention with O(n) state updates
- State dimension is constant (typically 16)

## Model Configuration

### Core Parameters

```python
from mamba_ssm import Mamba

model = Mamba(
    d_model=256,       # Hidden dimension (256, 512, 768, 1024, 2048)
    d_state=16,        # SSM state dimension (16 is the default)
    d_conv=4,          # Local convolution width (4 is standard)
    expand=2,          # Expansion factor (1.5-2.0)
    dt_rank="auto",    # Rank of Δ projection (auto = ceil(d_model / 16))
    dt_min=0.001,      # Min Δ init (controls forgetting rate)
    dt_max=0.1,        # Max Δ init
    dt_init="random",  # Δ initialization (random, constant)
    dt_scale=1.0,      # Δ scaling factor
    conv_bias=True,    # Use bias in convolution
    bias=False,        # Use bias in linear projections
)
```

### Parameter Impact

**d_state** (SSM state dimension):
- Standard: 16 (the library default)
- Smaller (8): Faster but less capacity
- Larger (32, 64): Minimal improvement, 2× slower

**expand** (block expansion):
- Standard: 2.0
- Range: 1.5-2.0
- Controls inner dimension = expand * d_model

**d_conv** (convolution width):
- Standard: 4
- Local context window before SSM
- Helps with positional information

**dt_rank** (Δ projection rank):
- Auto: ceil(d_model / 16) (recommended)
- Controls Δ parameter efficiency
- Lower rank = more efficient but less expressive

## Mamba Block Structure

```python
import torch.nn as nn
from mamba_ssm import Mamba

# Mamba block (replaces Transformer block)
class MambaBlock(nn.Module):
    def __init__(self, d_model):
        super().__init__()  # Required before registering submodules
        self.norm = nn.RMSNorm(d_model)  # nn.RMSNorm requires PyTorch >= 2.4
        self.mamba = Mamba(d_model, d_state=16, d_conv=4, expand=2)

    def forward(self, x):
        return x + self.mamba(self.norm(x))  # Residual

# Full model (stack of Mamba blocks); Embedding and LMHead are placeholders
model = nn.Sequential(
    Embedding(...),
    *[MambaBlock(d_model) for _ in range(n_layers)],
    nn.RMSNorm(d_model),
    LMHead(...)
)
```

**Key differences from Transformers**:
- No multi-head attention (MHA)
- No separate feedforward network (FFN)
- Single Mamba layer per block
- Roughly 2× as many layers as a Transformer of comparable size

## Hardware-Aware Implementation

### Parallel Algorithm

Mamba uses a **scan-based parallel algorithm** for training:

```python
# Parallel mode (training)
# GPU kernel fuses operations; parallel_scan stands in for that kernel
y = parallel_scan(A, B, C, x)  # O(n) work, O(log n) parallel depth

# Sequential mode (inference)
# Constant memory RNN-style
h = 0
for x_t in sequence:
    h = A*h + B*x_t
    y_t = C*h
```

### Memory Efficiency

**Training**:
- Recomputes activations in backward pass
- Similar to FlashAttention strategy
- Memory: O(batch_size * seq_len * d_model)

**Inference**:
- RNN-style sequential processing
- State size: O(d_model * d_state), constant in sequence length
- No KV cache needed (huge advantage!)

### CUDA Kernel Optimizations

```python
# Fused kernel operations
- Discretization (continuous → discrete A, B)
- SSM re