
Training LLMs Megatron
Pick Megatron-Core parallelism settings and interpret MFU/throughput benchmarks when training large LLMs on H100 clusters.
Overview
Training LLMs Megatron is an agent skill for the Build phase that summarizes Megatron-Core MFU and throughput benchmarks across GPT, LLaMA, Mixtral, and large-scale training configurations.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill training-llms-megatron
What is this skill?
- Reports up to 47% Model FLOP Utilization (MFU) on H100 clusters; MFU rises with model size because larger GEMMs have higher arithmetic intensity
- GPT-3 175B reference: TP=4, PP=8, 128–512 GPUs, ~390 TFlops/GPU on H100
- LLaMA-3 sizing table from 8B through 405B with TP, PP, CP, and sequence-length notes
- Mixtral MoE rows with expert parallelism (EP) and active vs total parameter counts
- DeepSeek-V3 scale reference: 37B active / 671B total on 1024 GPUs (per guide excerpt)
- LLaMA-3.1 405B example: 1024 GPUs, TP=8, PP=8, CP=2, ~400 TFlops/GPU average
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You must choose tensor, pipeline, and context parallel settings for a large training run but only have vague “use more GPUs” advice.
Who is it for?
Technical solo builders or tiny teams sizing H100 Megatron training for 8B–405B-class models.
Skip if: you are a beginner training a 7B model on one consumer GPU, or your team does not use Megatron-Core at all.
When should I use this skill?
You are planning or tuning Megatron-Core large language model training and need parallelism and throughput reference numbers.
What do I get? / Deliverables
You can align your planned GPU count and parallelism tuple with published MFU and TFlops/GPU benchmarks before locking a Megatron job spec.
- Parallelism and GPU-count recommendations grounded in benchmark tables
- MFU and TFlops/GPU expectations for proposal validation
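As a rough illustration of checking a GPU count against a parallelism tuple, the sketch below derives the data-parallel size that is left over once TP, PP, and CP are fixed. It assumes the usual dense-model decomposition, where the GPU count equals TP × PP × CP × DP, and it does not model expert parallelism. The function name `data_parallel_size` is hypothetical and is not part of Megatron-Core.

```python
def data_parallel_size(num_gpus: int, tp: int, pp: int, cp: int = 1) -> int:
    """Return the data-parallel size implied by a GPU count and a TP/PP/CP tuple.

    Assumption: num_gpus = TP * PP * CP * DP (dense models, no expert parallelism).
    """
    model_parallel = tp * pp * cp  # GPUs needed to hold one model replica
    if num_gpus % model_parallel != 0:
        raise ValueError(
            f"{num_gpus} GPUs is not divisible by TP*PP*CP = {model_parallel}"
        )
    return num_gpus // model_parallel


# LLaMA-3.1 405B example from this page: 1024 GPUs, TP=8, PP=8, CP=2
print(data_parallel_size(1024, tp=8, pp=8, cp=2))
```

A GPU count that is not a multiple of TP × PP × CP raises an error here, which is a cheap check to run before locking a job spec.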
Journey fit
Large-model training configuration belongs on the Build shelf because you choose TP/PP/CP and GPU counts while designing the training stack, not while doing growth analytics. It sits under Backend because it covers Megatron parallelism tables, FLOP utilization targets, and hardware scaling notes that inform training job design.
How it compares
A benchmark and configuration compendium, not a substitute for Lambda Labs torchrun setup or DSPy application-layer orchestration.
Common Questions / FAQ
Who is training-llms-megatron for?
Builders and ML leads who need quick Megatron parallelism and throughput reference tables when planning large LM training on H100 clusters.
When should I use training-llms-megatron?
During Build backend planning when selecting TP/PP/CP for LLaMA or MoE targets, and during Operate infra reviews when comparing observed MFU against the documented ~47% H100 ceiling.
Is training-llms-megatron safe to install?
Use the Security Audits panel on this Prism page; the skill is documentation-only but any follow-on cluster scripts should be reviewed before production spend.
SKILL.md
# Performance Benchmarks

Performance metrics and benchmarks for Megatron-Core across different model sizes and hardware configurations.

## Model FLOP Utilization (MFU)

**H100 Clusters**: Up to 47% MFU achieved

MFU increases with larger model sizes due to higher arithmetic intensity in larger matrix multiplications (GEMMs).

## Throughput Metrics by Model Size

### GPT-3 175B

- **Hardware**: H100
- **Configuration**: TP=4, PP=8
- **GPUs**: 128-512
- **MFU**: 47% on H100
- **Throughput**: 390 TFlops/GPU on H100

### LLaMA Configurations

| Model | Size | GPUs | TP | PP | CP | Seq Length | Hardware | Notes |
|-------|------|------|----|----|----|------------|----------|-------|
| LLaMA-3 | 8B | 8 | 1 | 1 | 2 | 8K | H100 | CP for long sequences |
| LLaMA-3 | 70B | 64 | 4 | 4 | 2 | 4K | H100 | TP+PP parallelism |
| LLaMA-3.1 | 405B | 1024 | 8 | 8 | 2 | 4K | H100 | 3D parallelism |

**LLaMA-3.1 405B Details** (full training run, larger than the 1024-GPU benchmark row above):

- 16K H100 GPUs (from two 24K-GPU clusters)
- TP=8, PP=8, CP=2
- 400 TFlops/GPU average
- 95%+ uptime
- 3× efficiency improvement vs LLaMA 2

### Mixtral (Mixture of Experts)

| Model | Active Params | Total Params | GPUs | TP | PP | EP | Experts | Hardware |
|-------|---------------|--------------|------|----|----|----|---------|----------|
| Mixtral | ~13B (active) | 8×7B (~47B) | 64 | 1 | 4 | 8 | 8 | H100 |
| Mixtral | ~39B (active) | 8×22B (~141B) | 256 | 4 | 4 | 8 | 8 | H100 |

### DeepSeek-V3

- **Active Parameters**: 37B per token
- **Total Parameters**: 671B
- **GPUs**: 1024 H100
- **Configuration**: TP=2, PP=16, EP=64
- **Parallelism**: 4D with Expert Parallel

### GPT-462B (Largest Benchmark)

- **Parameters**: 462B
- **GPUs**: 6144 H100
- **MFU**: 47-48%
- **Throughput**: ~390 TFlops/GPU

## Hardware Performance Characteristics

### NVIDIA H100 (Hopper)

- **Peak Performance**:
  - FP16: 989 TFlops dense (1979 TFlops with sparsity)
  - BF16: 989 TFlops dense (1979 TFlops with sparsity)
  - FP8: 1979 TFlops dense (3958 TFlops with sparsity)
- **Memory**: 80GB HBM3
- **Memory Bandwidth**: 3.35 TB/s
- **NVLink**: 900 GB/s per GPU

**Achieved MFU**: 40-47% (typical range)

### NVIDIA A100 (Ampere)

- **Peak Performance**:
  - FP16: 312 TFlops dense (624 TFlops with sparsity)
  - BF16: 312 TFlops dense (624 TFlops with sparsity)
- **Memory**: 40GB or 80GB HBM2e
- **Memory Bandwidth**: 2 TB/s
- **NVLink**: 600 GB/s per GPU

**Typical MFU**: 35-42%

## Weak Scaling (Fixed Per-GPU Workload)

As you add more GPUs while keeping per-GPU workload constant:

| GPUs | Model Size | MFU | Efficiency |
|------|------------|-----|------------|
| 8 | 7B | 42% | 100% (baseline) |
| 64 | 70B | 44% | 95% |
| 512 | 175B | 45% | 93% |
| 1024 | 405B | 46% | 90% |
| 6144 | 462B | 47% | 88% |

## Strong Scaling (Fixed Total Workload)

Distributing a fixed model across more GPUs:

| Model | GPUs | Time per Iteration | Speedup | Efficiency |
|-------|------|--------------------|---------|------------|
| 70B | 64 | 1.0× (baseline) | 1.0× | 100% |
| 70B | 128 | 0.52× | 1.92× | 96% |
| 70B | 256 | 0.27× | 3.70× | 93% |

## Throughput Calculations

**Formula**:

```
Throughput (TFlops/GPU) = Total FLOPs / (Time × Number of GPUs × 10^12)
```

**Example (GPT-3 175B)**:

- Forward + backward pass: 3 × forward-pass FLOPs
- Per-token forward FLOPs: ~350 billion for a 175B model (2 × parameters)
- Batch size: 1536 (global)
- Sequence length: 2048
- Time per iteration: ~5 seconds on 512 H100s
- Throughput: ~390 TFlops/GPU

## Memory Usage vs Model Size

| Model Size | Parameters | Memory (FP16) | Memory (BF16) | Memory (FP8) |
|------------|------------|---------------|---------------|--------------|
| 7B | 7 billion | 14 GB | 14 GB | 7 GB |
| 13B | 13 billion | 26 GB | 26 GB | 13 GB |
| 70B | 70 billion | 140 GB | 140 GB | 70 GB |
| 175B | 175 billion | 350 GB | 350 GB | 175 GB |
| 405B | 405 billion | 810 GB | 810 GB | 405 GB |

**Note**: These are model weights only. Training also needs gradients and optimizer states, which add several times the weight memory (roughly 16 bytes per parameter in total for mixed-precision Adam, before activations).

## Communication Overhead

### Tensor Parallelism (TP)

- **Bandwidth Required**: ~20 GB/GPU for LLaMA 70B with TP=4
- **Frequency**: Every layer (80+ layers)
- **Best Practice**: Use NVLink and keep TP ≤ 8 so the TP group stays within a single node
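
The throughput formula in this guide can be sketched in a few lines of Python. This is a minimal illustration, not Megatron-Core code: the function names are hypothetical, and the per-iteration FLOP count uses the common approximation of 6 FLOPs per parameter per token (2 forward, 4 backward), which ignores attention FLOPs and activation recomputation.

```python
def iteration_flops(params: float, global_batch: int, seq_len: int) -> float:
    """Approximate FLOPs for one training iteration.

    Assumption: 6 FLOPs per parameter per token (forward + backward),
    ignoring attention FLOPs and activation recomputation.
    """
    tokens = global_batch * seq_len
    return 6.0 * params * tokens


def throughput_tflops_per_gpu(total_flops: float, seconds: float, num_gpus: int) -> float:
    """Throughput (TFlops/GPU) = Total FLOPs / (Time x Number of GPUs x 10^12)."""
    return total_flops / (seconds * num_gpus * 1e12)


def implied_iteration_seconds(total_flops: float, tflops_per_gpu: float, num_gpus: int) -> float:
    """Invert the formula: the iteration time a given per-GPU throughput implies."""
    return total_flops / (tflops_per_gpu * 1e12 * num_gpus)


# GPT-3 175B example inputs from this guide: global batch 1536, sequence length 2048
flops = iteration_flops(175e9, global_batch=1536, seq_len=2048)
print(f"FLOPs per iteration: {flops:.3e}")
print(f"Seconds at 390 TFlops/GPU on 512 GPUs: {implied_iteration_seconds(flops, 390, 512):.1f}")
print(f"TFlops/GPU at 5 s on 512 GPUs: {throughput_tflops_per_gpu(flops, 5.0, 512):.0f}")
```

Under this approximation, 390 TFlops/GPU on 512 GPUs corresponds to roughly 16.5 seconds per iteration for the listed batch size and sequence length, while a 5-second iteration would correspond to roughly 1290 TFlops/GPU. The iteration time in the example should therefore be read as approximate rather than as a measurement matched to the other inputs.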