
Deepspeed
Look up DeepSpeed MoE and large-model training notes when sizing distributed training stacks for NLG or mixture-of-experts work.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill deepspeed
What is this skill?
- Curated excerpts from DeepSpeed blog posts on MoE training at larger scale
- References NLG quality vs model size and Megatron-Turing–class training timelines
- Discusses MoE inference and claimed training-cost reductions for language models
- Research-pack format (numbered pages) rather than a step-by-step agent ritual
- Useful as context when evaluating DeepSpeed vs hand-rolled distributed training
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 2/3 security scanners passed (skills.sh audits).
Journey fit
Primary fit
DeepSpeed configuration belongs to the build phase when you implement training and inference infrastructure. MoE parallelism, ZeRO-style scaling, and training cost claims are backend ML systems concerns, not launch or growth tasks.
Common Questions / FAQ
Is Deepspeed safe to install?
skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md
# Deepspeed - 08

**Pages:** 1

---

## DeepSpeed powers 8x larger MoE model training with high performance

**URL:** https://www.deepspeed.ai/2021/08/17/deepspeed-moe.html

**Contents:**
- DeepSpeed powers 8x larger MoE model training with high performance
- Contents

Updated: August 17, 2021

---

# Deepspeed - 09

**Pages:** 2

---

## DeepSpeed-MoE for NLG: Reducing the training cost of language models by 5 times

**URL:** https://www.deepspeed.ai/2021/12/09/deepspeed-moe-nlg.html

**Contents:**
- DeepSpeed-MoE for NLG: Reducing the training cost of language models by 5 times
- Contents
- MoE based NLG model architecture
- MoE training infrastructure and dataset
- MoE leads to better quality for NLG models
- Same quality with 5x less training cost
- MoE for Inference
- Conclusion and Release
- Acknowledgement

Autoregressive transformer-based natural language generation (referred to as NLG in the rest of the blog) models can offer convincing solutions to a broad range of language tasks from document summarization, headline generation, question and answering to even generating code in a wide variety of programming languages. Due to the general applicability of these models, improving their quality has been of great interest for both academia and industry alike.

The quality of NLG improves with the increase in model size. However, today we are getting close to the limit of what the current generation of hardware can do. The Megatron-Turing NLG 530B model took 3 months to train on over 2K A100 GPUs on the NVIDIA Selene Supercomputer, consuming over 3 million GPU hours. Another 3 to 5 times of increase in model size would be infeasible within a reasonable timeframe.
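The timeline argument above can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: it takes the quoted lower bounds (2K GPUs, 3 million GPU hours) at face value and assumes, as a simplification not stated in the post, that total compute grows linearly with model size at a fixed GPU count. The function names are hypothetical.

```python
def gpu_hours(num_gpus, days):
    """Total GPU hours when every GPU is busy for the given number of days."""
    return num_gpus * days * 24


def days_needed(total_gpu_hours, num_gpus):
    """Days of fully busy cluster time needed to spend the given GPU hours."""
    return total_gpu_hours / (num_gpus * 24)


# Lower bounds quoted in the text; the real figures are "over" these values.
BASE_GPU_HOURS = 3_000_000
NUM_GPUS = 2_000

base_days = days_needed(BASE_GPU_HOURS, NUM_GPUS)
print(f"baseline: {base_days} busy days")  # 62.5

# Assumption: compute scales linearly with model size (same data, same GPUs).
for factor in (3, 5):
    print(f"{factor}x larger model: {base_days * factor} busy days")
```

Under these assumptions, a 3x to 5x larger dense model needs roughly 190 to 310 days of fully busy cluster time before any overhead, which is the sense in which the post calls further dense scaling infeasible.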
Given the exorbitant compute resources required to train state-of-the-art NLG models, a natural question to ask is: “Is it possible to make a non-trivial improvement to model quality without increasing the compute cost?” Or equivalently, “Is it possible to produce a model with similar quality using 3 to 5 times fewer resources?”

Recent works like GShard and Switch Transformers have shown that the Mixture of Experts (MoE) model structure significantly reduces large model training cost for transformer-based encoder-decoder models. An MoE model contains a set of sparsely gated experts. During training and inference, only a subset of these experts is activated for each input token. Therefore, the model can scale to billions of parameters without a proportional increase in computation. Despite these promising results, the effectiveness of MoE for the much more computation-intensive NLG family of models remains mostly unknown.

Given the tremendous compute and energy requirements for training the NLG family of models, we explore the opportunities that MoE presents to reduce their training cost. We show that MoE can be applied to the NLG family of models to significantly improve model quality at the same training cost. Alternatively, it can achieve a 5x reduction in training cost at the same model quality as a dense NLG model. For example, by applying MoE we achieved the model quality of a 6.7B parameter dense NLG model at the cost of training a 1.3B parameter dense model, thanks to the sparse structure of MoE.

Assuming the scaling holds, the results have the potential to completely transform the large model training landscape in terms of cost. For example, a trillion-parameter dense model could potentially be trained at the cost of a 200B parameter (like GPT-3) sized dense model, translating to millions of dollars in training cost reduction and energy savings (Brown et al., 2020, Language models are few-shot learners).
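To make "only a subset of these experts is activated for each input token" concrete, here is a minimal pure-Python sketch of sparse gating. It assumes top-1 routing (one expert per token, in the style of Switch Transformers), and it uses toy linear experts and random gate weights. All names (`top1_route`, `moe_layer`, `make_expert`) are hypothetical, and none of this is the DeepSpeed API or the routing scheme used in the blog post.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top1_route(token, gate_weights):
    """Score every expert for one token; return (best expert index, its gate probability)."""
    scores = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


def moe_layer(tokens, gate_weights, experts):
    """Run each token through exactly one expert, scaled by its gate probability."""
    outputs, calls = [], [0] * len(experts)
    for token in tokens:
        idx, prob = top1_route(token, gate_weights)
        calls[idx] += 1  # only the selected expert does any work for this token
        outputs.append([prob * y for y in experts[idx](token)])
    return outputs, calls


def make_expert(scale):
    """Toy expert: multiplies the token by a constant (stand-in for a feed-forward block)."""
    return lambda token: [scale * x for x in token]


random.seed(0)
NUM_EXPERTS, DIM, NUM_TOKENS = 4, 3, 8
gate_weights = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
experts = [make_expert(i + 1) for i in range(NUM_EXPERTS)]
tokens = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_TOKENS)]

outputs, calls = moe_layer(tokens, gate_weights, experts)
print("expert calls per expert:", calls)
print("total expert calls:", sum(calls), "for", NUM_TOKENS, "tokens")
```

The layer holds four experts' worth of parameters, yet the number of expert evaluations equals the number of tokens rather than tokens times experts. That gap between parameter count and per-token compute is what lets an MoE model grow without a proportional increase in computation.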
To create an MoE-based NLG model, we studied the GPT-like transformer-based NLG model. To complete training in a reasonable timeframe, the following models were selected: 350M (24 layers, 1024 hidden size, 16 attention heads), 1.3B (24 layers, 2048 hidden size, 16 attention heads), and 6.7B (32 layers, 4096 hidden size, 32 attention heads). We u