
MoE Training
Compare Mixture-of-Experts architectures (Mixtral, DeepSeek-V3, Switch, GLaM) when choosing or explaining sparse LLM designs.
Overview
MoE Training is an agent skill, most often used in the Idea phase (and also in Build), that explains Mixture-of-Experts model architectures and routing patterns across major open and research models.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill moe-training
What is this skill?
- Covers Mixtral 8x7B SMoE with top-2 routing and GQA details
- Summarizes DeepSeek-V3, Switch Transformers, and GLaM design patterns
- Includes layer/block structure notes and comparison-oriented framing
- Table-of-contents style guide for multiple vendor architectures
- Useful for explaining active vs total parameters to stakeholders
- Mixtral 8x7B: 47B total parameters with ~13B active per token (2 of 8 experts)
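The total-versus-active figures in the last bullet can be reproduced with a rough count. The sketch below assumes the dimensions of the publicly released Mixtral 8x7B configuration (hidden size 4096, FFN size 14336, 32 layers, 32 query heads and 8 key/value heads of dimension 128, vocabulary 32000). None of these values are stated on this page, and the function name `moe_param_counts` is hypothetical.

```python
# Rough parameter count for a Mixtral-style sparse MoE decoder.
# All dimensions are assumptions taken from the public Mixtral 8x7B config,
# not from this page; RMSNorm weights are ignored as negligible.
HIDDEN = 4096
FFN = 14336
LAYERS = 32
Q_HEADS, KV_HEADS, HEAD_DIM = 32, 8, 128
VOCAB = 32000
NUM_EXPERTS, TOP_K = 8, 2


def moe_param_counts():
    # Grouped-query attention: full-size Q and O projections,
    # smaller K and V projections shared across groups of query heads.
    attn = 2 * HIDDEN * Q_HEADS * HEAD_DIM + 2 * HIDDEN * KV_HEADS * HEAD_DIM
    # Each expert is a gated FFN with three weight matrices.
    expert = 3 * HIDDEN * FFN
    # The router is a single bias-free linear layer per MoE block.
    router = HIDDEN * NUM_EXPERTS
    # Input embedding plus an untied output projection.
    embed = 2 * VOCAB * HIDDEN

    total = LAYERS * (attn + router + NUM_EXPERTS * expert) + embed
    active = LAYERS * (attn + router + TOP_K * expert) + embed
    return total, active


total, active = moe_param_counts()
print(f"total  ~ {total / 1e9:.1f}B")
print(f"active ~ {active / 1e9:.1f}B")
```

Under these assumptions the count lands close to the 47B total and ~13B active figures quoted above. It also shows why the total is well below 8 x 7B = 56B: only the FFN experts are replicated, while attention and embeddings are shared by every token.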
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You need to pick or defend an MoE-based model approach but lack a structured comparison of routing, experts, and activation costs.
Who is it for?
Indie ML-curious founders and agent builders researching which sparse LLM designs match budget, latency, and quality goals.
Skip if: You only need API integration with a fixed hosted model and have no architecture decisions to make.
When should I use this skill?
You are researching or explaining Mixture of Experts LLM architectures and routing designs.
What do I get? / Deliverables
You gain architecture-level clarity across named MoE families so you can shortlist models, estimate active-parameter budgets, and plan deeper training or serving work.
- Architecture comparison mental model
- Notes on routing and active-parameter behavior
- Pointers to named MoE families for deeper study
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
The canonical shelf is Idea research because the skill is a reference survey you consult before committing to an MoE-based stack or training approach. The Research subphase fits architecture literacy, routing patterns, and total- versus active-parameter tradeoffs, not day-one frontend work.
Where it fits
Compare MoE families before choosing a base model for a coding agent.
Estimate active-parameter costs to decide if a sparse model fits your prototype budget.
Explain expert routing tradeoffs in an internal architecture doc for collaborators.
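When explaining expert routing to collaborators, a single-token walkthrough can help. The following is a minimal pure-Python sketch of top-2 routing as the skill describes it for Mixtral (softmax over router logits, keep the two largest, renormalize so they sum to 1). The function name `top_k_route` and the example logits are hypothetical, chosen only for illustration.

```python
import math


def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight), ...] for one token."""
    # Softmax over all experts, shifted by the max for numerical stability.
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Keep the k most probable experts.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize so the kept weights sum to 1.
    kept = sum(probs[i] for i in chosen)
    return [(i, probs[i] / kept) for i in chosen]


# Hypothetical router output for one token over 8 experts.
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2]
routes = top_k_route(logits)
for expert, weight in routes:
    print(f"expert {expert}: weight {weight:.3f}")
```

Only the selected experts run for this token, and their outputs are combined using the renormalized weights. That is the source of the gap between total and active parameters.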
How it compares
Use as an architecture research brief—not as a step-by-step fine-tuning or GPU cluster runbook.
Common Questions / FAQ
Who is moe-training for?
Builders and researchers who need MoE architecture context before training, fine-tuning, or selecting inference providers.
When should I use moe-training?
In Idea research when comparing LLM families; in Validate when scoping prototype quality vs cost; in Build agent-tooling when designing routers or explaining expert sparsity to collaborators.
Is moe-training safe to install?
It is reference documentation—review the Security Audits panel on this page; it does not execute training jobs by itself.
SKILL.md
# MoE Model Architectures

Comprehensive guide to different Mixture of Experts architectures and their design patterns.

## Table of Contents

- Mixtral 8x7B (Mistral AI)
- DeepSeek-V3 (DeepSeek AI)
- Switch Transformers (Google)
- GLaM (Google)
- Comparison Table

## Mixtral 8x7B (Mistral AI - 2024)

### Architecture Overview

**Parameters:**
- Total: 47B parameters
- Active per token: 13B (2 experts out of 8)
- Each expert: nominally ~7B (the "8x7B" in the name). Only the feed-forward blocks are replicated per expert and attention is shared, which is why the total is 47B rather than 56B.

**Key Features:**
- **Top-2 routing**: Each token is routed to 2 experts
- **8 experts per layer**: Sparse activation
- **SMoE architecture**: Sparse Mixture of Experts
- **Grouped-Query Attention (GQA)**: Efficient attention mechanism

### Layer Structure

```python
import torch.nn as nn

# Mixtral Transformer Block
# MixtralAttention, MixtralSparseMoeBlock, and MixtralRMSNorm are assumed
# to be defined elsewhere (the sparse MoE block is shown in the next section).
class MixtralDecoderLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.hidden_size = config.hidden_size

        # Self-attention
        self.self_attn = MixtralAttention(config)

        # MoE Feed-Forward
        self.block_sparse_moe = MixtralSparseMoeBlock(config)

        # Layer norms
        self.input_layernorm = MixtralRMSNorm(config.hidden_size)
        self.post_attention_layernorm = MixtralRMSNorm(config.hidden_size)

    def forward(self, hidden_states, attention_mask=None):
        residual = hidden_states

        # Self-attention (pre-norm, then residual add)
        hidden_states = self.input_layernorm(hidden_states)
        hidden_states = self.self_attn(hidden_states, attention_mask)
        hidden_states = residual + hidden_states

        # MoE FFN (pre-norm, then residual add)
        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        hidden_states = self.block_sparse_moe(hidden_states)
        hidden_states = residual + hidden_states

        return hidden_states
```

### Sparse MoE Block

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtralSparseMoeBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.hidden_dim = config.hidden_size
        self.ffn_dim = config.intermediate_size
        self.num_experts = config.num_local_experts  # 8
        self.top_k = config.num_experts_per_tok  # 2

        # Router (gating network)
        self.gate = nn.Linear(self.hidden_dim, self.num_experts, bias=False)

        # 8 expert FFNs
        self.experts = nn.ModuleList([
            MixtralBlockSparseTop2MLP(config)
            for _ in range(self.num_experts)
        ])

    def forward(self, hidden_states):
        batch_size, sequence_length, hidden_dim = hidden_states.shape
        hidden_states = hidden_states.view(-1, hidden_dim)

        # Router logits (batch * seq_len, num_experts)
        router_logits = self.gate(hidden_states)

        # Top-2 routing
        routing_weights = F.softmax(router_logits, dim=-1)
        routing_weights, selected_experts = torch.topk(
            routing_weights, self.top_k, dim=-1
        )

        # Normalize top-2 weights to sum to 1
        routing_weights /= routing_weights.sum(dim=-1, keepdim=True)

        # Output buffer that expert results are accumulated into
        final_hidden_states = torch.zeros(
            (batch_size * sequence_length, hidden_dim),
            dtype=hidden_states.dtype,
            device=hidden_states.device
        )

        # Process each expert
        for expert_idx in range(self.num_experts):
            expert_layer = self.experts[expert_idx]
            # idx: token positions routed to this expert;
            # top_x: which of the top-k slots selected it
            idx, top_x = torch.where(selected_experts == expert_idx)

            if idx.shape[0] == 0:
                continue

            # Tokens routed to this expert
            top_x_list = top_x.tolist()
            idx_list = idx.tolist()

            # Current expert input
            current_state = hidden_states[None, idx_list].reshape(-1, hidden_dim)
            current_hidden_states = expert_layer(current_state)

            # Weight by routing scores
            current_hidden_states *= routing_weights[idx_list, top_x_list, None]

            # Accumulate
            final_hidden_states.index_add_(0, idx, current_hidden_states.to(hidden_states.dtyp