Deepspeed

Name: Deepspeed
Author: orchestra-research

orchestra-research/ai-research-skills

435 installs
11.2k repo stars
Updated June 16, 2026
orchestra-research/ai-research-skills

deepspeed is a research-reference skill that surfaces DeepSpeed MoE and large-model distributed training notes so developers who size NLG or mixture-of-experts stacks can compare infrastructure options with cited benchma

About

deepspeed is an agent skill from orchestra-research/ai-research-skills that indexes Microsoft DeepSpeed MoE documentation and blog posts for large language model training. The bundled notes cover MoE NLG architecture, training infrastructure, dataset choices, and published results such as 8x larger MoE model training and roughly 5x lower training cost claims from DeepSpeed-MoE releases. Developers reach for deepspeed when planning multi-GPU or multi-node fine-tuning and pretraining runs where expert parallelism, memory optimization, and cost tradeoffs must be justified before picking frameworks. The skill acts as a curated lookup layer over deepspeed.ai sources rather than a runtime installer.

Curated excerpts from DeepSpeed blog posts on MoE training at larger scale
References NLG quality vs model size and Megatron-Turing–class training timelines
Discusses MoE inference and claimed training-cost reductions for language models
Research-pack format (numbered pages) rather than a step-by-step agent ritual
Useful as context when evaluating Deepspeed vs hand-rolled distributed training

Deepspeed by the numbers

435 all-time installs (skills.sh)
+32 installs in the week ending Jul 26, 2026 (Skillselion tracking)
Ranked #467 of 2,066 Data Science & ML skills by installs in the Skillselion catalog
Security screen: HIGH risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/orchestra-research/ai-research-skills --skill deepspeed

Add your badge

Show developers this skill is listed on Skillselion. Paste this into your README.

[![Listed on Skillselion](https://skillselion.com/badge/skills/orchestra-research/ai-research-skills/deepspeed.svg)](https://skillselion.com/skills/orchestra-research/ai-research-skills/deepspeed)

Installs	435
repo stars	★ 11.2k
Security audit	2 / 3 scanners passed
Last updated	June 16, 2026
Repository	orchestra-research/ai-research-skills ↗

How do you size DeepSpeed MoE training for large NLG models?

Look up DeepSpeed MoE and large-model training notes when sizing distributed training stacks for NLG or mixture-of-experts work.

Who is it for?

ML engineers and backend developers evaluating DeepSpeed MoE against other distributed training options for billion-parameter NLG workloads.

Skip if: Developers who only need local single-GPU inference deployment without any distributed pretraining or fine-tuning planning.

When should I use this skill?

A query mentions DeepSpeed, MoE, mixture-of-experts, distributed NLG training cost, or GPU cluster sizing for large language models.

What you get

Curated DeepSpeed-MoE architecture notes, training infrastructure checklists, and cited benchmark references for distributed NLG planning.

MoE architecture notes
training infrastructure references
benchmark citation list

By the numbers

References DeepSpeed claims of 8x larger MoE model training
References DeepSpeed-MoE NLG training cost reduction of 5 times

Files

SKILL.mdMarkdownGitHub ↗

Deepspeed Skill

Comprehensive assistance with deepspeed development, generated from official documentation.

When to Use This Skill

This skill should be triggered when:

Working with deepspeed
Asking about deepspeed features or APIs
Implementing deepspeed solutions
Debugging deepspeed code
Learning deepspeed best practices

Quick Reference

Common Patterns

Pattern 1: DeepNVMe Contents Requirements Creating DeepNVMe Handles Using DeepNVMe Handles Blocking File Write Non-Blocking File Write Parallel File Write Pinned Tensors Putting it together Acknowledgements Appendix Advanced Handle Creation Performance Tuning DeepNVMe APIs General I/O APIs GDS-specific APIs Handle Settings APIs This tutorial will show how to use DeepNVMe for data transfers between persistent storage and tensors residing in host or device memory. DeepNVMe improves the performance and efficiency of I/O operations in Deep Learning applications through powerful optimizations built on Non-Volatile Memory Express (NVMe) Solid State Drives (SSDs), Linux Asynchronous I/O (libaio), and NVIDIA Magnum IOTM GPUDirect® Storage (GDS). Requirements Ensure your environment is properly configured to use DeepNVMe. First, you need to install DeepSpeed version >= 0.15.0. Next, ensure that the DeepNVMe operators are available in the DeepSpeed installation. The async_io operator is required for any DeepNVMe functionality, while the gds operator is required only for GDS functionality. You can confirm availability of each operator by inspecting the output of ds_report to check that compatible status is [OKAY]. Below is a snippet of ds_report output confirming the availability of both async_io and gds operators. If async_io operator is unavailable, you will need to install the appropriate libaio library binaries for your Linux flavor. For example, Ubuntu users will need to run apt install libaio-dev. In general, you should carefully inspect ds_report output for helpful tips such as the following: [WARNING] async_io requires the dev libaio .so object and headers but these were not found. [WARNING] async_io: please install the libaio-dev package with apt [WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found. To enable gds operator, you will need to install NVIDIA GDS by consulting the appropriate guide for bare-metal systems or Azure VMs (coming soon). Creating DeepNVMe Handles DeepNVMe functionality can be accessed through two abstractions: aio_handle and gds_handle. The aio_handle is usable on both host and device tensors. while gds_handle works only on CUDA tensors, but is more efficient. The first step to use DeepNVMe is to create a desired handle. aio_handle requires async_io operator, while gds_handle requires both async_io and gds operators. The following snippets illustrate aio_handle and gds_handle creation respectively. ### Create aio_handle from deepspeed.ops.op_builder import AsyncIOBuilder aio_handle = AsyncIOBuilder().load().aio_handle() ### Create gds_handle from deepspeed.ops.op_builder import GDSBuilder gds_handle = GDSBuilder().load().gds_handle() For simplicity, the above examples illustrate handle creation using default parameters. We expect that handles created with default parameters to provide good performance in most environments. However, you can see below for advanced handle creation. Using DeepNVMe Handles aio_handle and gds_handle provide identical APIs for storing tensors to files or loading tensors from files. A common feature of these APIs is that they take a tensor and a file path as arguments for the desired I/O operation. For best performance, pinned device or host tensors should be used for I/O operations (see here for details). For brevity, this tutorial will use aio_handle for illustration, but keep in mind that gds_handle works similarly. You can see the available APIs in a Python shell via tab completion on an aio_handle object . This is illustrated using tab completion of h.. >python Python 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> from deepspeed.ops.op_builder import AsyncIOBuilder >>> h = AsyncIOBuilder().load().aio_handle() >>> h. h.async_pread( h.free_cpu_locked_tensor( h.get_overlap_events( h.get_single_submit( h.new_cpu_locked_tensor( h.pwrite( h.sync_pread( h.wait( h.async_pwrite( h.get_block_size( h.get_queue_depth( h.get_intra_op_parallelism( h.pread( h.read( h.sync_pwrite( h.write( The APIs of interest for performing I/O operations are those named with pread and pwrite substrings. For brevity, we will focus on the file write APIs, namely sync_pwrite, async_pwrite, and pwrite. We will discuss only sync_pwrite and async_pwrite below because they are specializations of pwrite. Blocking File Write sync_pwrite provides the standard blocking semantics of Python file write. The example below illustrates using sync_pwrite to store a 1GB CUDA tensor to a local NVMe file. >>> import os >>> os.path.isfile('/local_nvme/test_1GB.pt') False >>> import torch >>> t=torch.empty(10243, dtype=torch.uint8).cuda() >>> from deepspeed.ops.op_builder import AsyncIOBuilder >>> h = AsyncIOBuilder().load().aio_handle() >>> h.sync_pwrite(t,'/local_nvme/test_1GB.pt') >>> os.path.isfile('/local_nvme/test_1GB.pt') True >>> os.path.getsize('/local_nvme/test_1GB.pt') 1073741824 Non-Blocking File Write An important DeepNVMe optimization is the non-blocking I/O semantics which enables Python threads to overlap computations with I/O operations. async_pwrite provides the non-blocking semantics for file writes. The Python thread can later use wait() to synchronize with the I/O operation. async_write can also be used to submit multiple back-to-back non-blocking I/O operations, of which can then be later blocked on using a single wait(). The example below illustrates using async_pwrite to store a 1GB CUDA tensor to a local NVMe file. >>> import os >>> os.path.isfile('/local_nvme/test_1GB.pt') False >>> import torch >>> t=torch.empty(10243, dtype=torch.uint8).cuda() >>> from deepspeed.ops.op_builder import AsyncIOBuilder >>> h = AsyncIOBuilder().load().aio_handle() >>> h.async_pwrite(t,'/local_nvme/test_1GB.pt') >>> h.wait() 1 >>> os.path.isfile('/local_nvme/test_1GB.pt') True >>> os.path.getsize('/local_nvme/test_1GB.pt') 1073741824 Warning for non-blocking I/O operations: To avoid data races and corruptions, .wait() must be carefully used to serialize the writing of source tensors, and the reading of destination tensors. For example, the following update of t during a non-blocking file write is unsafe and could corrupt /local_nvme/test_1GB.pt. >>> t=torch.empty(10243, dtype=torch.uint8).cuda() >>> from deepspeed.ops.op_builder import AsyncIOBuilder >>> h = AsyncIOBuilder().load().aio_handle() >>> h.async_pwrite(t,'/local_nvme/test_1GB.pt') >>> t += 1 # <--- Data race; avoid by preceding with `h.wait()` Similar safety problems apply to reading the destination tensor of a non-blocking file read without .wait() synchronization. Parallel File Write An important DeepNVMe optimization is the ability to parallelize individual I/O operations. This optimization is enabled by specifying the desired parallelism degree when constructing a DeepNVMe handle. Subsequent I/O operations with that handle are automatically parallelized over the requested number of host or device threads, as appropriate. I/O parallelism is composable with either the blocking or non-blocking I/O APIs. The example below illustrates 4-way parallelism of a file write using async_pwrite. Note the use of intra_op_parallelism argument to specify the desired parallelism degree in handle creation. >>> import os >>> os.path.isfile('/local_nvme/test_1GB.pt') False >>> import torch >>> t=torch.empty(10243, dtype=torch.uint8).cuda() >>> from deepspeed.ops.op_builder import AsyncIOBuilder >>> h = AsyncIOBuilder().load().aio_handle(intra_op_parallelism=4) >>> h.async_pwrite(t,'/local_nvme/test_1GB.pt') >>> h.wait() 1 >>> os.path.isfile('/local_nvme/test_1GB.pt') True >>> os.path.getsize('/local_nvme/test_1GB.pt') 1073741824 Pinned Tensors A key part of DeepNVMe optimizations is using direct memory access (DMA) for I/O operations, which requires that the host or device tensor be pinned. To pin host tensors, you can use mechanisms provided by Pytorch or DeepSpeed Accelerators. The following example illustrates writing a pinned CPU tensor to a local NVMe file. >>> import os >>> os.path.isfile('/local_nvme/test_1GB.pt') False >>> import torch >>> t=torch.empty(10243, dtype=torch.uint8).pin_memory() >>> from deepspeed.ops.op_builder import AsyncIOBuilder >>> h = AsyncIOBuilder().load().aio_handle() >>> h.async_pwrite(t,'/local_nvme/test_1GB.pt') >>> h.wait() 1 >>> os.path.isfile('/local_nvme/test_1GB.pt') True >>> os.path.getsize('/local_nvme/test_1GB.pt') 1073741824 On the other hand,gds_handle provides new_pinned_device_tensor() and pin_device_tensor() functions for pinning CUDA tensors. The following example illustrates writing a pinned CUDA tensor to a local NVMe file. >>> import os >>> os.path.isfile('/local_nvme/test_1GB.pt') False >>> import torch >>> t=torch.empty(10243, dtype=torch.uint8).cuda() >>> from deepspeed.ops.op_builder import GDSBuilder >>> h = GDSBuilder().load().gds_handle() >>> h.pin_device_tensor(t) >>> h.async_pwrite(t,'/local_nvme/test_1GB.pt') >>> h.wait() 1 >>> os.path.isfile('/local_nvme/test_1GB.pt') True >>> os.path.getsize('/local_nvme/test_1GB.pt') 1073741824 >>> h.unpin_device_tensor(t) Putting it together We hope that the above material helps you to get started with DeepNVMe. You can also use the following links to see DeepNVMe usage in real-world Deep Learning applications. Parameter swapper in ZeRO-Inference and ZeRO-Infinity. Optimizer swapper in ZeRO-Infinity. Gradient swapper in ZeRO-Infinity. Simple file read and write operations. Acknowledgements This tutorial has been significantly improved by feedback from Guanhua Wang, Masahiro Tanaka, and Stas Bekman. Appendix Advanced Handle Creation Achieving peak I/O performance with DeepNVMe requires careful configuration of handle creation. In particular, the parameters of aio_handle and gds_handle constructors are performance-critical because they determine how efficiently DeepNVMe interacts with the underlying storage subsystem (i.e., libaio, GDS, PCIe, and SSD). For convenience we make it possible to create handles using default parameter values which will provide decent performance in most scenarios. However, squeezing out every available performance in your environment will likely require tuning the constructor parameters, namely block_size, queue_depth, single_submit, overlap_events, and intra_op_parallelism. The aio_handle constructor parameters and default values are illustrated below: >>> from deepspeed.ops.op_builder import AsyncIOBuilder >>> help(AsyncIOBuilder().load().aio_handle()) Help on aio_handle in module async_io object: class aio_handle(pybind11_builtins.pybind11_object) | Method resolution order: | aio_handle | pybind11_builtins.pybind11_object | builtins.object | | Methods defined here: | | __init__(...) | __init__(self: async_io.aio_handle, block_size: int = 1048576, queue_depth: int = 128, single_submit: bool = False, overlap_events: bool = False, intra_op_parallelism: int = 1) -> None | | AIO handle constructor Performance Tuning As discussed earlier, achieving peak DeepNVMe performance for a target workload or environment requires using optimally configured aio_handle or gds_handle handles. For configuration convenience, we provide a utility called ds_nvme_tune to automate the discovery of optimal DeepNVMe configurations. ds_nvme_tune automatically explores a user-specified or default configuration space and recommends the option that provides the best read and write performance. Below is an example usage of ds_nvme_tune to tune aio_handle data transfers between GPU memory and a local NVVMe SSD mounted on /local_nvme. This example used the default configuration space of ds_nvme_tune for tuning. $ ds_nvme_tune --nvme_dir /local_nvme --gpu Running DeepNVMe performance tuning on ['/local_nvme/'] Best performance (GB/sec): read = 3.69, write = 3.18 { "aio": { "single_submit": "false", "overlap_events": "true", "intra_op_parallelism": 8, "queue_depth": 32, "block_size": 1048576 } } The above tuning was executed on a Lambda workstation equipped with two NVIDIA A6000-48GB GPUs, 252GB of DRAM, and a CS3040 NVMe 2TB SDD with peak read and write speeds of 5.6 GB/s and 4.3 GB/s respectively. The tuning required about four and half minutes. Based on the results, one can expect to achieve read and write transfer speeds of 3.69 GB/sec and 3.18 GB/sec respectively by using an aio_handle configured as below. >>> from deepspeed.ops.op_builder import AsyncIOBuilder >>> h = AsyncIOBuilder().load().aio_handle(block_size=1048576, queue_depth=32, single_submit=False, overlap_events=True, intra_op_parallelism=8) The full command line options of ds_nvme_tune can be obtained via the normal -h or --help. usage: ds_nvme_tune [-h] --nvme_dir NVME_DIR [NVME_DIR ...] [--sweep_config SWEEP_CONFIG] [--no_read] [--no_write] [--io_size IO_SIZE] [--gpu] [--gds] [--flush_page_cache] [--log_dir LOG_DIR] [--loops LOOPS] [--verbose] options: -h, --help show this help message and exit --nvme_dir NVME_DIR [NVME_DIR ...] Directory in which to perform I/O tests. A writeable directory on a NVMe device. --sweep_config SWEEP_CONFIG Performance sweep configuration json file. --no_read Disable read performance measurements. --no_write Disable write performance measurements. --io_size IO_SIZE Number of I/O bytes to read/write for performance measurements. --gpu Test tensor transfers between GPU device and NVME device. --gds Run the sweep over NVIDIA GPUDirectStorage operator --flush_page_cache Page cache will not be flushed and reported read speeds may be higher than actual *Requires sudo access*. --log_dir LOG_DIR Output directory for performance log files. Default is ./_aio_bench_logs --loops LOOPS Count of operation repetitions --verbose Print debugging information. DeepNVMe APIs For convenience, we provide listing and brief descriptions of the DeepNVMe APIs. General I/O APIs The following functions are used for I/O operations with both aio_handle and gds_handle. Function Description async_pread Non-blocking file read into tensor sync_pread Blocking file read into tensor pread File read with blocking and non-blocking options async_pwrite Non-blocking file write from tensor sync_pwrite Blocking file write from tensor pwrite File write with blocking and non-blocking options wait Wait for non-blocking I/O operations to complete GDS-specific APIs The following functions are available only for gds_handle Function Description new_pinned_device_tensor Allocate and pin a device tensor free_pinned_device_tensor Unpin and free a device tensor pin_device_tensor Pin a device tensor unpin_device_tensor unpin a device tensor Handle Settings APIs The following APIs can be used to probe handle configuration. Function Description get_queue_depth Return queue depth setting get_single_submit Return whether single_submit is enabled get_intra_op_parallelism Return I/O parallelism degree get_block_size Return I/O block size setting get_overlap_events Return whether overlap_event is enabled Updated: November 5, 2025 Previous Next

libaio

Pattern 2: Mixture of Experts for NLG models Contents 1. Installation 2. Training NLG+MoE models 2.1. Changes to the model 2.2. Pre-training the Standard MoE model 2.3. Pre-training the PR-MoE model 2.4. Training MoS with reduced model size In this tutorial, we introduce how to apply DeepSpeed Mixture of Experts (MoE) to NLG models, which reduces the training cost by 5 times and reduce the MoE model size by 3 times (details in our Blog). We use the GPT-3 like models in Megatron-LM framework as the example. Before reading this tutorial, we recommend to first read the tutorials about Mixture of Experts and Megatron-LM GPT pre-training. 1. Installation You would need to install DeepSpeed v0.6.0 or higher to use the MoE feature. The MoE for NLG model examples are in the Megatron-DeepSpeed repo under the MoE folder. 2. Training NLG+MoE models 2.1. Changes to the model To apply MoE to the GPT-style model, we made several changes in Megatron framework, mostly in megatron/model/ where we add the MoE layers into the model. 2.2. Pre-training the Standard MoE model We provide example training scripts under examples_deepspeed/MoE which we used to perform the experiments in our Blog. There are a few new hyperparameters for standard MoE model: --num-experts: the number of experts per MoE layer. In our experiments we set it to 128. Larger number of experts tend to provide better convergence, but it’s a diminishing return. --moe-expert-parallel-size: degree of the MoE expert parallelism. In other words, there will be num-experts/moe-expert-parallel-size experts on each GPU. Thus --moe-expert-parallel-size should be no more than both number of GPUs, and --num-experts. --moe-loss-coeff: scaling coefficient for adding MoE loss to model loss. In our experiments we find that 0.01 is a good setting. --moe-train-capacity-factor, --moe-eval-capacity-factor, --moe-min-capacity: these configs determine how many tokens can a single expert handle. Larger numbers could lead to better convergence, but would also lead to slower training since the load would be more unbalanced on different experts. --disable-moe-token-dropping: this will completely remove the limitation of how many tokens can a single expert handle. For the same reason as above, we only recommend using this during inference/eval. 2.3. Pre-training the PR-MoE model PR-MoE is a new designed MoE models, standing for Pyramid-Residual-MoE, which improves the parameter efficiency up to 3x as compared to standard MoE. Please see our Blog for more details. We provide example training scripts under examples_deepspeed/MoE. There are a few different hyperparameters for PR-MoE model compared to standard MoE: --num-experts: Instead of providing a single number, to enable Pyramid-MoE, you need to provide a list, whose length is the same as the number of MoE layers. We suggest to use more experts in the latter stage (close to output) of the model. --mlp-type: chosen from [standard, residual]. When it is residual, Residual-MoE is enabled. In addition to the new hyperparameters above for standard MoE and PR-MoE, for NLG+MoE models we found that it’s helpful to lower the learning rate and increase the learning rate decay duration compared to the base dense model. Details of our tuning can be found in the example training scripts. Regarding training data, we are not able to release our internal data but any public data for Megatron-LM pre-training can be directly used to train MoE models (with the caveat that it might not provide the exact same model quality as in our experiments). For example, we evaluated The Pile dataset (pile.eleuther.ai, github.com/EleutherAI/the-pile) for both dense and MoE models. Table 1 below shows that this public data provides similar evaluation results as our internal data. Model size LAMBADA: completion prediction PIQA: commonsense reasoning BoolQ: reading comprehension RACE-h: reading comprehension TriviaQA: question answering WebQs: question answering Dense NLG: 350M, internal data 0.5203 0.6931 0.5364 0.3177 0.0321 0.0157 350M, public Pile 0.5106 0.6589 0.5933 0.3196 0.0257 0.0064 Standard MoE NLG: 350M+MoE-128, internal data 0.6270 0.7459 0.6046 0.3560 0.1658 0.0517 350M+MoE-128, public Pile 0.6128 0.7323 0.6040 0.3349 0.1111 0.0335 PR-MoE NLG: 350M+MoE-128, internal data 0.6365 0.7399 0.5988 0.3569 0.1630 0.0473 PR-MoE + MoS NLG: 350M+MoE-128, internal data 0.6346 0.7334 0.5807 0.3483 0.1369 0.0522 Table 1: Zero-shot evaluation results (last six columns) for different dense and MoE NLG models. All zero-shot evaluation results use the accuracy metric. 2.4. Training MoS with reduced model size MoS, standing for Mixture-of-Students, is a staged distillation-based technique for compressing large MoE models. MoS further reduces the model size by 12.5%, leading to up 3.7x model size reduction when combined with PR-MoE over the standard MoE. The reduced model size helps reduce the latency and cost during inference. To train an MoS model, one needs to specify a few additional parameters. We will use PR-MoE as an example: --mos: This would enable Mixture-of-Students via knowledge distillation. --load-teacher: This specifies the path to the teacher model checkpoint. This is a mandatory argument for using MoS and the teacher model checkpoint can be obtained by either training a standard MoE or the PR-MoE. num-layers-teacher, --hidden-size-teacher, --hidden-size-teacher, --num-experts-teacher: In addition to the teacher model checkpoint path, we also need to specify the model architecture of the teacher model such as its number of layers, hidden dimension size, and the number of experts per MoE layer. In the case of PR-MoE, we need to also provide a list of experts for the teacher model, where we remove a few expert layers from the teacher model. In addition to the new parameters above, we observe that using the teacher PR-MoE during the entire training process may adversely impact the final student model accuracy. In our experiments, we use a staged distillation method by stopping distillation early in the training process (e.g., after 400K steps) and perform optimization only against the standard language modeling loss for the rest of the training. We provide example training scripts under examples_deepspeed/MoE. Details of our parameter settings can be found in the example training scripts. The performance results of MoS can be seen from our blog post and our paper. Updated: November 5, 2025 Previous Next

megatron/model/

Pattern 3: MoS, standing for Mixture-of-Students, is a staged distillation-based technique for compressing large MoE models. MoS further reduces the model size by 12.5%, leading to up 3.7x model size reduction when combined with PR-MoE over the standard MoE. The reduced model size helps reduce the latency and cost during inference. To train an MoS model, one needs to specify a few additional parameters. We will use PR-MoE as an example:

--mos

Pattern 4: Learning Rate Range Test Contents Learning Rate Range Test (LRRT) Prerequisites LRRT Parameters Required Model Configuration Changes PyTorch Example: Tuning for Large Batch Sizes This tutorial shows how to use to perform Learning Rate range tests in PyTorch. Learning Rate Range Test (LRRT) Learning rate range test ( LRRT ) is a method for discovering the largest learning rate values that can be used to train a model without divergence. Data scientists are often interested in this information because large learning rates lead to faster model convergence than a small learning rates. Moreover, large learning rates are crucial in learning rate schedules such as CLR and 1Cycle, which are used to train effectively with large batch sizes. DeepSpeed provides LRRT for model training in PyTorch frameworks. Prerequisites To use DeepSpeed’s LRRT, you must satisfy the following two conditions: Integrate DeepSpeed into your training script using the Getting Started guide. Add the parameters to configure LRRT to the parameters of your model. The LRRT parameters are defined below. LRRT Parameters LRRT works by linearly increasing the learning rate by a predefined amount, at predefined intervals. Thus, LRRT is a form of learning rate schedule because it defines how and when the learning rate should change during model training. To configure LRRT, you will need to set these parameters: lr_range_test_min_lr : The initial learning rate for training (float) lr_range_test_step_size: The interval for scaling up learning rate, defined in training steps (integer) lr_range_test_step_rate: The scaling factor for increasing learning rate (float) lr_range_test_staircase: If true, learning rate is changed every lr_range_test_step_size training steps, otherwise learning rate is changed at every training step (boolean) Required Model Configuration Changes We will illustrate the required model configuration changes an example LRRT schedule that: Starts training with an initial learning rate of 0.0001 Uses a scaling rate of 5 Uses a scaling interval of 200 training steps Scales learning rate at every training step, i.e., does not use staircase PyTorch For PyTorch models, LRRT is implemented as a learning rate scheduler, a feature that is available in PyTorch versions 1.0.1 and newer. Thus, you can add a "scheduler" entry of type "LRRangeTest" into your model configuration as illustrated below: "scheduler": { "type": "LRRangeTest", "params": { "lr_range_test_min_lr": 0.0001, "lr_range_test_step_size": 200, "lr_range_test_step_rate": 5, "lr_range_test_staircase": false } } Example: Tuning for Large Batch Sizes We illustrate how LRRT can benefit data scientists with a snippet of our experience of tuning an internal production model to converge efficiently on larger batch sizes, as we scaled from one GPU (batch size 512) to four GPUs (batch size 2048). Our goal was to train the model with the larger batch size to match the performance of the smaller batch size using the same amount of data samples. The challenge here is the well known problem of slow convergence of large batch size training. Our approach was to use a 1Cycle schedule in DeepSpeed to tackle this problem, and we used LRRT to configure the schedule. In the plots below, we illustrate using LRRT to discover the maximum learning rates for effective training with batch size 2048. The plot on the left shows the impact of large learning rates on validation loss over the first 9000 batches of training. The plot on the right shows the learning rate values during the same period of training. Using grid search we discover that the best fixed learning rate for the batch size 2048 is 0.0002. The blue line (lr=0.0002) represents training with this fixed learning rate. We compare the two LRRT schedules with this fixed learning rate. The orange (lr_range_test_step_rate=5) and gray (lr_range_test_step_rate=50) lines represent training with similar LRRT schedules that differ only in lr_range_test_step_rate values. Although the LRRT schedules start from the same base learning rate, the gray line’s learning rate grows about 10 times faster than the orange line. Also, the learning rates of the LRRT schedules had grown larger than that of the blue line in the presented data points. We subsequently refer to the gray line as “fast growing”, and the orange line as “slow growing” LRRT schedules respectively. We make the following observations from this small example. Larger learning rates clearly benefit model performance, up to some point. The fast growing LRRT schedule achieves validation loss of 0.46 after 3000 batches, which the fixed learning rate does not achieve with 9000 batches. The slow growing LRRT does not match that score until after 6000 batches, however it maintains an increasing performance advantage over the fixed learning rate. There is an upper bound on learning rate values that are useful for training the model. The fast growing LRRT schedule hits this boundary quickly and diverges, while the slow growing LRRT will later diverge for the same reason. LRRT helped us discover these boundaries quickly, using less than 2% of the training data. These boundaries are useful information for constructing learning rate schedules. These observations from LRRT helped us to configure the learning rate boundaries and the cycle span for a 1Cycle schedule that solves the problem, as shown below. "OneCycle": { "cycle_min_lr": 0.002, "cycle_max_lr": 0.005, "cycle_first_step_size": 2000, "cycle_second_step_size": 2000, ... } In our experience these are four most critical parameters of 1Cycle schedules. We chose to use the slower LRRT schedule (lr_range_test_step_rate=5) to set cycle_min_lr because it achieves the best loss and the faster schedule diverges fairly quickly. We set cycle_max_lr to 0.005 even though the plot shows that performance was still improving at slightly higher learning rate. This is because we observed that if we wait till the maximum learning rate, the model could be at the point of divergence and impossible to recover. Since it takes 8000 batches for the learning rate to become 0.005, we set cycle_first_step_size and (cycle_second_step_size) to 2000 which is the number of steps that it takes for four GPUs to process 8000 batches. We hope this brief example sparks your imagination on using LRRT for your own unique tuning challenges. Updated: November 5, 2025 Previous Next

lr_range_test_min_lr

Pattern 5: Training Overview and Features Contents Overview Distributed, Effective, and Efficient Training with Ease Speed Memory efficiency Scalability Communication efficiency Data efficiency Supporting long sequence length Fast convergence for effectiveness Good Usability Features Distributed Training with Mixed Precision Mixed Precision Training Single-GPU, Multi-GPU, and Multi-Node Training Pipeline Parallelism Model Parallelism Support for Custom Model Parallelism Integration with Megatron-LM The Zero Redundancy Optimizer Optimizer State and Gradient Partitioning Activation Partitioning Constant Buffer Optimization (CBO) Contiguous Memory Optimization (CMO) ZeRO-Offload Additional Memory and Bandwidth Optimizations Smart Gradient Accumulation Communication Overlapping Training Features Simplified training API Activation Checkpointing API Gradient Clipping Automatic loss scaling with mixed precision Training Optimizers 1-bit Adam, 0/1 Adam and 1-bit LAMB optimizers with up to 26x less communication Fused Adam optimizer and arbitrary torch.optim.Optimizer CPU-Adam: High-Performance vectorized implementation of Adam Memory bandwidth optimized FP16 Optimizer Large Batch Training with LAMB Optimizer Memory-Efficient Training with ZeRO Optimizer Training Agnostic Checkpointing Advanced parameter search Learning Rate Range Test 1Cycle Learning Rate Schedule Simplified Data Loader Data Efficiency Curriculum Learning Performance Analysis and Debugging Wall Clock Breakdown Timing Activation Checkpoint Functions Flops Profiler Autotuning Monitor Communication Logging Sparse Attention Mixture of Experts (MoE) Overview Training advanced deep learning models is challenging. Beyond model design, model scientists also need to set up the state-of-the-art training techniques such as distributed training, mixed precision, gradient accumulation, and checkpointing. Yet still, scientists may not achieve the desired system performance and convergence rate. Large model sizes are even more challenging: a large model easily runs out of memory with pure data parallelism and it is difficult to use model parallelism. DeepSpeed addresses these challenges to accelerate model development and training. Distributed, Effective, and Efficient Training with Ease The DeepSpeed API is a lightweight wrapper on PyTorch. This means that you can use everything you love in PyTorch and without learning a new platform. In addition, DeepSpeed manages all of the boilerplate state-of-the-art training techniques, such as distributed training, mixed precision, gradient accumulation, and checkpoints so that you can focus on your model development. Most importantly, you can leverage the distinctive efficiency and effectiveness benefit of DeepSpeed to boost speed and scale with just a few lines of code changes to your PyTorch models. Speed DeepSpeed achieves high performance and fast convergence through a combination of efficiency optimizations on compute/communication/memory/IO and effectiveness optimizations on advanced hyperparameter tuning and optimizers. For example: DeepSpeed trains BERT-large to parity in 44 mins using 1024 V100 GPUs (64 DGX-2 boxes) and in 2.4 hours using 256 GPUs (16 DGX-2 boxes). BERT-large Training Times Devices Source Training Time 1024 V100 GPUs DeepSpeed 44 min 256 V100 GPUs DeepSpeed 2.4 hr 64 V100 GPUs DeepSpeed 8.68 hr 16 V100 GPUs DeepSpeed 33.22 hr BERT code and tutorials will be available soon. DeepSpeed trains GPT2 (1.5 billion parameters) 3.75x faster than state-of-art, NVIDIA Megatron on Azure GPUs. Read more: GPT tutorial Memory efficiency DeepSpeed provides memory-efficient data parallelism and enables training models without model parallelism. For example, DeepSpeed can train models with up to 13 billion parameters on a single GPU. In comparison, existing frameworks (e.g., PyTorch’s Distributed Data Parallel) run out of memory with 1.4 billion parameter models. DeepSpeed reduces the training memory footprint through a novel solution called Zero Redundancy Optimizer (ZeRO). Unlike basic data parallelism where memory states are replicated across data-parallel processes, ZeRO partitions model states and gradients to save significant memory. Furthermore, it also reduces activation memory and fragmented memory. The current implementation (ZeRO-2) reduces memory by up to 8x relative to the state-of-art. You can read more about ZeRO in our paper, and in our blog posts related to ZeRO-1 and ZeRO-2. With this impressive memory reduction, early adopters of DeepSpeed have already produced a language model (LM) with over 17B parameters called Turing-NLG, establishing a new SOTA in the LM category. For model scientists with limited GPU resources, ZeRO-Offload leverages both CPU and GPU memory for training large models. Using a machine with a single GPU, our users can run models of up to 13 billion parameters without running out of memory, 10x bigger than the existing approaches, while obtaining competitive throughput. This feature democratizes multi-billion-parameter model training and opens the window for many deep learning practitioners to explore bigger and better models. Scalability DeepSpeed supports efficient data parallelism, model parallelism, pipeline parallelism and their combinations, which we call 3D parallelism. 3D parallelism of DeepSpeed provides system support to run models with trillions of parameters, read more in our press-release and tutorial. DeepSpeed can run large models more efficiently, up to 10x faster for models with various sizes spanning 1.5B to hundred billion. More specifically, the data parallelism powered by ZeRO is complementary and can be combined with different types of model parallelism. It allows DeepSpeed to fit models using lower degree of model parallelism and higher batch size, offering significant performance gains compared to using model parallelism alone. Read more: ZeRO paper, and GPT tutorial. The figure depicts system throughput improvements of DeepSpeed (combining ZeRO-powered data parallelism with model parallelism of NVIDIA Megatron-LM) over using Megatron-LM alone. Communication efficiency Pipeline parallelism of DeepSpeed reduce communication volume during distributed training, which allows users to train multi-billion-parameter models 2–7x faster on clusters with limited network bandwidth. 1-bit Adam, 0/1 Adam and 1-bit LAMB reduce communication volume by up to 26x while achieving similar convergence efficiency to Adam, allowing for scaling to different types of GPU clusters and networks. 1-bit Adam blog post, 1-bit Adam tutorial, 0/1 Adam tutorial, 1-bit LAMB tutorial. Data efficiency DeepSpeed Data Efficiency Library provides efficient data sampling via curriculum learning and efficient data routing via random layerwise token dropping. The composed solution enables up to 2x data and 2x time saving during GPT-3/BERT pretraining and GPT/ViT finetuning, or further improve model quality under the same data/time. See more in the tutorial. Supporting long sequence length DeepSpeed offers sparse attention kernels—an instrumental technology to support long sequences of model inputs, whether for text, image, or sound. Compared with the classic dense Transformers, it powers an order-of-magnitude longer input sequence and obtains up to 6x faster execution with comparable accuracy. It also outperforms state-of-the-art sparse implementations with 1.5–3x faster execution. Furthermore, our sparse kernels support efficient execution of flexible sparse format and empower users to innovate on their custom sparse structures. Read more here. Fast convergence for effectiveness DeepSpeed supports advanced hyperparameter tuning and large batch size optimizers such as LAMB. These improve the effectiveness of model training and reduce the number of samples required to convergence to desired accuracy. Read more: Tuning tutorial. Good Usability Only a few lines of code changes are needed to enable a PyTorch model to use DeepSpeed and ZeRO. Compared to current model parallelism libraries, DeepSpeed does not require a code redesign or model refactoring. It also does not put limitations on model dimensions (such as number of attention heads, hidden sizes, and others), batch size, or any other training parameters. For models of up to 13 billion parameters, you can use ZeRO-powered data parallelism conveniently without requiring model parallelism, while in contrast, standard data parallelism will run out of memory for models with more than 1.4 billion parameters. In addition, DeepSpeed conveniently supports flexible combination of ZeRO-powered data parallelism with custom model parallelisms, such as tensor slicing of NVIDIA’s Megatron-LM. Features Below we provide a brief feature list, see our detailed feature overview for descriptions and usage. Distributed Training with Mixed Precision 16-bit mixed precision Single-GPU/Multi-GPU/Multi-Node Model Parallelism Support for Custom Model Parallelism Integration with Megatron-LM Pipeline Parallelism 3D Parallelism The Zero Redundancy Optimizer Optimizer State and Gradient Partitioning Activation Partitioning Constant Buffer Optimization Contiguous Memory Optimization ZeRO-Offload Leverage both CPU/GPU memory for model training Support 10B model training on a single GPU Ultra-fast dense transformer kernels Sparse attention Memory- and compute-efficient sparse kernels Support 10x long sequences than dense Flexible support to different sparse structures 1-bit Adam, 0/1 Adam and 1-bit LAMB Custom communication collective Up to 26x communication volume saving Additional Memory and Bandwidth Optimizations Smart Gradient Accumulation Communication/Computation Overlap Training Features Simplified training API Gradient Clipping Automatic loss scaling with mixed precision Training Optimizers Fused Adam optimizer and arbitrary torch.optim.Optimizer Memory bandwidth optimized FP16 Optimizer Large Batch Training with LAMB Optimizer Memory efficient Training with ZeRO Optimizer CPU-Adam Training Agnostic Checkpointing Advanced Parameter Search Learning Rate Range Test 1Cycle Learning Rate Schedule Simplified Data Loader Data Efficiency Efficient data sampling via curriculum learning and efficient data routing via random layerwise token dropping Up to 2x data and 2x time saving during GPT-3/BERT pretraining and GPT/ViT finetuning Or further improve model quality under the same data/time Curriculum Learning A curriculum learning-based data pipeline that presents easier or simpler examples earlier during training Stable and 3.3x faster GPT-2 pre-training with 8x/4x larger batch size/learning rate while maintaining token-wise convergence speed Complementary to many other DeepSpeed features Note that the Data Efficiency Library above provides more general curriculum learning support. This legacy curriculum learning feature is still supported but we recommend to use the Data Efficiency Library. Progressive Layer Dropping Efficient and robust compressed training Up to 2.5x convergence speedup for pre-training Performance Analysis and Debugging Mixture of Experts (MoE) title: “Feature Overview” layout: single permalink: /features/ toc: true toc_label: “Contents” — Distributed Training with Mixed Precision Mixed Precision Training Enable 16-bit (FP16) training by in the deepspeed_config JSON. "fp16": { "enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "consecutive_hysteresis": false, "min_loss_scale": 1 } Single-GPU, Multi-GPU, and Multi-Node Training Easily switch between single-GPU, single-node multi-GPU, or multi-node multi-GPU execution by specifying resources with a hostfile. deepspeed --hostfile=<hostfile> \ <client_entry.py> <client args> \ --deepspeed --deepspeed_config ds_config.json The script <client_entry.py> will execute on the resources specified in <hostfile>. Pipeline Parallelism DeepSpeed provides pipeline parallelism for memory- and communication- efficient training. DeepSpeed supports a hybrid combination of data, model, and pipeline parallelism and has scaled to over one trillion parameters using 3D parallelism. Pipeline parallelism can also improve communication efficiency and has accelerated training by up to 7x on low-bandwidth clusters. Model Parallelism Support for Custom Model Parallelism DeepSpeed supports all forms of model parallelism including tensor slicing based approaches such as the Megatron-LM. It does so by only requiring the model parallelism framework to provide a model parallelism unit (mpu) that implements a few bookkeeping functionalities: mpu.get_model_parallel_rank() mpu.get_model_parallel_group() mpu.get_model_parallel_world_size() mpu.get_data_parallel_rank() mpu.get_data_parallel_group() mpu.get_data_parallel_world_size() Integration with Megatron-LM DeepSpeed is fully compatible with Megatron. Please see the Megatron-LM tutorial for details. The Zero Redundancy Optimizer The Zero Redundancy Optimizer (ZeRO) is at the heart of DeepSpeed and enables large model training at a scale that is simply not possible with model parallelism alone. When enabled, ZeRO allows training models with over 13 billion parameters without any model parallelism, and up to 200 billion parameter models with model parallelism on current generation hardware. For more details see the ZeRO paper, GPT tutorial on integration with DeepSpeed. Optimizer State and Gradient Partitioning Optimizer State and Gradient Partitioning in ZeRO reduces the memory consumption of the model states (optimizer states, gradients and parameters) by 8x compared to standard data parallelism by partitioning these states across data parallel process instead of replicating them. Activation Partitioning Activation Partitioning is a memory optimization in ZeRO that can reduce the memory consumed by activations during model parallel training (MP). In MP certain activations maybe required by all MP processes, resulting in a replication of activations across MP GPUs. Activation Partitioning stores these activations in a partitioned state once they are used for computation in the forward propagation. These activations are allgathered right before they are needed again during the backward propagation. By storing activations in a partitioned state, ZeRO in DeepSpeed can reduce the activation memory footprint proportional to the MP degree. Constant Buffer Optimization (CBO) CBO enables high network and memory throughput while restricting memory usage to a constant size. For memory- and network-bound operations such as normalization or allreduce collectives, the performance depends on the size of the operand. Simply fusing all operands into a single large operand can enable great throughput at the expense of unnecessary memory overhead. CBO in DeepSpeed fuses smaller operands into approximately a pre-defined sized buffer large enough to achieve great performance without the unnecessary memory overhead. Contiguous Memory Optimization (CMO) CMO reduces memory fragmentation during training, preventing out of memory errors due to lack of contiguous memory. Memory fragmentation is a result of interleaving between short lived and long lived memory objects. During the forward propagation activation checkpoints are long lived but the activations that recomputed are short lived. Similarly, during the backward computation, the activation gradients are short lived while the parameter gradients are long lived. CMO transfers activation checkpoints and parameter gradients to contiguous buffers preventing memory fragmentation. ZeRO-Offload ZeRO-Offload pushes the boundary of the maximum model size that can be trained efficiently using minimal GPU resources, by exploiting computational and memory resources on both GPUs and their host CPUs. It allows training up to 13-billion-parameter models on a single NVIDIA V100 GPU, 10x larger than the state-of-the-art, while retaining high training throughput of over 30 teraflops per GPU. For more details see the ZeRO-Offload release blog, and tutorial on integration with DeepSpeed. Additional Memory and Bandwidth Optimizations Smart Gradient Accumulation Gradient accumulation allows running larger batch size with limited memory by breaking an effective batch into several sequential micro-batches, and averaging the parameter gradients across these micro-batches. Furthermore, instead of averaging the gradients of each micro-batch across all GPUs, the gradients are averaged locally during each step of the sequence, and a single allreduce is done at the end of the sequence to produce the averaged gradients for the effective batch across all GPUs. This strategy significantly reduces the communication involved over the approach of averaging globally for each micro-batch, specially when the number of micro-batches per effective batch is large. Communication Overlapping During back propagation, DeepSpeed can overlap the communication required for averaging parameter gradients that have already been computed with the ongoing gradient computation. This computation-communication overlap allows DeepSpeed to achieve higher throughput even at modest batch sizes. Training Features Simplified training API The DeepSpeed core API consists of just a handful of methods: initialization: initialize training: backward and step argument parsing: add_config_arguments checkpointing : load_checkpoint and store_checkpoint DeepSpeed supports most of the features described in this document, via the use of these API, along with a deepspeed_config JSON file for enabling and disabling the features. Please see the core API doc for more details. Activation Checkpointing API DeepSpeed’s Activation Checkpointing API supports activation checkpoint partitioning, cpu checkpointing, and contiguous memory optimizations, while also allowing layerwise profiling. Please see the core API doc for more details. Gradient Clipping { "gradient_clipping": 1.0 } DeepSpeed handles gradient clipping under the hood based on the max gradient norm specified by the user. Please see the core API doc for more details. Automatic loss scaling with mixed precision DeepSpeed internally handles loss scaling for mixed precision training. The parameters for loss scaling can be specified in the deepspeed_config JSON file. Please see the core API doc for more details. Training Optimizers 1-bit Adam, 0/1 Adam and 1-bit LAMB optimizers with up to 26x less communication DeepSpeed has three communication-efficient optimizers called 1-bit Adam, 0/1 Adam and 1-bit LAMB. They offer the same convergence as Adam/LAMB, incur up to 26x less communication that enables up to 6.6x higher throughput for BERT-Large pretraining and up to 2.7x higher throughput for SQuAD fine-tuning on bandwidth-limited clusters. For more details on usage and performance, please refer to the 1-bit Adam tutorial, 1-bit Adam blog post, 0/1 Adam tutorial and 1-bit LAMB tutorial. For technical details, please refer to the 1-bit Adam paper, 0/1 Adam paper and 1-bit LAMB paper. Fused Adam optimizer and arbitrary torch.optim.Optimizer With DeepSpeed, the user can choose to use a high performance implementation of ADAM from NVIDIA, or any training optimizer that extends torch’s torch.optim.Optimizer class. CPU-Adam: High-Performance vectorized implementation of Adam We introduce an efficient implementation of Adam optimizer on CPU that improves the parameter-update performance by nearly an order of magnitude. We use the AVX SIMD instructions on Intel-x86 architecture for the CPU-Adam implementation. We support both AVX-512 and AVX-2 instruction sets. DeepSpeed uses AVX-2 by default which can be switched to AVX-512 by setting the build flag, DS_BUILD_AVX512 to 1 when installing DeepSpeed. Using AVX-512, we observe 5.1x to 6.5x speedups considering the model-size between 1 to 10 billion parameters with respect to torch-adam. Memory bandwidth optimized FP16 Optimizer Mixed precision training is handled by the DeepSpeed FP16 Optimizer. This optimizer not only handles FP16 training but is also highly efficient. The performance of weight update is primarily dominated by the memory bandwidth, and the achieved memory bandwidth is dependent on the size of the input operands. The FP16 Optimizer is designed to maximize the achievable memory bandwidth by merging all the parameters of the model into a single large buffer, and applying the weight updates in a single kernel, allowing it to achieve high memory bandwidth. Large Batch Training with LAMB Optimizer DeepSpeed makes it easy to train with large batch sizes by enabling the LAMB Optimizer. For more details on LAMB, see the LAMB paper. Memory-Efficient Training with ZeRO Optimizer DeepSpeed can train models with up to 13 billion parameters without model parallelism, and models with up to 200 billion parameters with 16-way model parallelism. This leap in model size is possible through the memory efficiency achieved via the ZeRO Optimizer. For more details see ZeRO paper . Training Agnostic Checkpointing DeepSpeed can simplify checkpointing for you regardless of whether you are using data parallel training, model parallel training, mixed-precision training, a mix of these three, or using the zero optimizer to enable larger model sizes. Please see the Getting Started guide and the core API doc for more details. Advanced parameter search DeepSpeed supports multiple Learning Rate Schedules to enable faster convergence for large batch scaling. Learning Rate Range Test Please refer to the Learning Rate Range Test tutorial. 1Cycle Learning Rate Schedule Please refer to the 1Cycle Learning Rate Schedule tutorial. Simplified Data Loader DeepSpeed abstracts away data parallelism and model parallelism from the user when it comes to data loading. Users simply provide a PyTorch dataset, and DeepSpeed data loader can automatically handle batch creation appropriately. Data Efficiency Please refer to the Data Efficiency tutorial. Curriculum Learning Please refer to the Curriculum Learning tutorial. Note that the Data Efficiency Library above provides more general curriculum learning support. This legacy curriculum learning feature is still supported but we recommend to use the Data Efficiency Library. Performance Analysis and Debugging DeepSpeed provides a set of tools for performance analysis and debugging. Wall Clock Breakdown DeepSpeed provides a detailed breakdown of the time spent in different parts of the training. This can be enabled by setting the following in the deepspeed_config file. { "wall_clock_breakdown": true, } Timing Activation Checkpoint Functions When activation checkpointing is enabled, profiling the forward and backward time of each checkpoint function can be enabled in the deepspeed_config file. { "activation_checkpointing": { "profile": true } } Flops Profiler The DeepSpeed flops profiler measures the time, flops and parameters of a PyTorch model and shows which modules or layers are the bottleneck. When used with the DeepSpeed runtime, the flops profiler can be configured in the deepspeed_config file as follows: { "flops_profiler": { "enabled": true, "profile_step": 1, "module_depth": -1, "top_modules": 3, "detailed": true, } } The flops profiler can also be used as a standalone package. Please refer to the Flops Profiler tutorial for more details. Autotuning The DeepSpeed Autotuner uses model information, system information, and heuristics to efficiently tune Zero stage, micro batch size, and other Zero configurations. Using the autotuning feature requires no code change from DeepSpeed users. While "autotuning": {"enabled": true} is the minimal required to enable autotuning, there are other parameters users can define to configure the autotuning process. Below shows major parameters and their default values in the autotuning configuration. Please refer to the Autotuning tutorial for more details. { "autotuning": { "enabled": true, "results_dir": null, "exps_dir": null, "overwrite": false, "metric": "throughput", "num_nodes": null, "num_gpus": null, "start_profile_step": 3, "end_profile_step": 5, "fast": true, "num_tuning_micro_batch_sizes": 3, "tuner_type": "model_based", "tuner_early_stopping": 5, "tuner_num_trials": 50, "arg_mappings": null } } The flops profiler can also be used as a standalone package. Please refer to the Flops Profiler tutorial for more details. Monitor The DeepSpeed Monitor logs live training metrics to one or more monitoring backends, including PyTorch’s TensorBoard, WandB, or simply to CSV files. The Monitor can be configured with one or more backends in the deepspeed_config file as follows: { "tensorboard": { "enabled": true, "output_path": "output/ds_logs/", "job_name": "train_bert" } "wandb": { "enabled": true, "team": "my_team", "group": "my_group", "project": "my_project" } "csv_monitor": { "enabled": true, "output_path": "output/ds_logs/", "job_name": "train_bert" } } The Monitor can also be added to log custom metrics and client codes. Please refer to the Monitor tutorial for more details. Communication Logging DeepSpeed provides logging of all communication operations launched within deepspeed.comm. The communication logger can be configured in the deepspeed_config file as follows: { "comms_logger": { "enabled": true, "verbose": false, "prof_all": true, "debug": false } } Client codes can then print a summary with a call to deepspeed.comm.log_summary(). For more details and example usage, see the Communication Logging tutorial. Sparse Attention DeepSpeed offers sparse attention to support long sequences. Please refer to the Sparse Attention tutorial. --deepspeed_sparse_attention "sparse_attention": { "mode": "fixed", "block": 16, "different_layout_per_head": true, "num_local_blocks": 4, "num_global_blocks": 1, "attention": "bidirectional", "horizontal_global_attention": false, "num_different_global_patterns": 4 } Mixture of Experts (MoE) To learn more about training Mixture of Experts (MoE) models with DeepSpeed, see our tutorial for more details.

torch.optim.Optimizer

Pattern 6: Flops Profiler Contents Overview Flops Measurement Multi-GPU, Multi-node, Data Parallelism, and Model Parallelism Usage Usage With the DeepSpeed Runtime Example: Megatron-LM Usage Outside the DeepSpeed Runtime In Model Inference Example: AlexNet Example: Bert In Model Training Workflow Example Training Workflow In this tutorial, we introduce the DeepSpeed Flops Profiler and provide examples of its usage. Overview Flops Measurement Multi-GPU, Multi-node, Data Parallelism, and Model Parallelism Usage Overview Effective use of hardware resources is critical to good performance, but performance inefficiency in existing implementations for large-scale model training and inference are often hard to spot and attribute to specific module components. DeepSpeed Flops Profiler helps users easily measure both the model training/inference speed (latency, throughput) and efficiency (floating-point operations per second, i.e., FLOPS) of a model and its submodules, with an eye towards eliminating inefficiencies in existing implementations. Below is an example output for BERT-Large(NVIDIA) on an A100 GPU with batch size 80: -------------------------- DeepSpeed Flops Profiler -------------------------- Profile Summary at step 10: Notations: data parallel size (dp_size), model parallel size(mp_size), number of parameters (params), number of multiply-accumulate operations(MACs), number of floating-point operations (flops), floating-point operations per second (FLOPS), fwd latency (forward propagation latency), bwd latency (backward propagation latency), step (weights update latency), iter latency (sum of fwd, bwd and step latency) world size: 1 data parallel size: 1 model parallel size: 1 batch size per GPU: 80 params per gpu: 336.23 M params of model = params per GPU mp_size: 336.23 M fwd MACs per GPU: 3139.93 G fwd flops per GPU: 6279.86 G fwd flops of model = fwd flops per GPU mp_size: 6279.86 G fwd latency: 76.67 ms bwd latency: 108.02 ms fwd FLOPS per GPU = fwd flops per GPU / fwd latency: 81.9 TFLOPS bwd FLOPS per GPU = 2 fwd flops per GPU / bwd latency: 116.27 TFLOPS fwd+bwd FLOPS per GPU = 3 fwd flops per GPU / (fwd+bwd latency): 102.0 TFLOPS step latency: 34.09 us iter latency: 184.73 ms samples/second: 433.07 ----------------------------- Aggregated Profile per GPU ----------------------------- Top modules in terms of params, MACs or fwd latency at different model depths: depth 0: params - {'BertForPreTrainingPreLN': '336.23 M'} MACs - {'BertForPreTrainingPreLN': '3139.93 GMACs'} fwd latency - {'BertForPreTrainingPreLN': '76.39 ms'} depth 1: params - {'BertModel': '335.15 M', 'BertPreTrainingHeads': '32.34 M'} MACs - {'BertModel': '3092.96 GMACs', 'BertPreTrainingHeads': '46.97 GMACs'} fwd latency - {'BertModel': '34.29 ms', 'BertPreTrainingHeads': '3.23 ms'} depth 2: params - {'BertEncoder': '302.31 M', 'BertLMPredictionHead': '32.34 M'} MACs - {'BertEncoder': '3092.88 GMACs', 'BertLMPredictionHead': '46.97 GMACs'} fwd latency - {'BertEncoder': '33.45 ms', 'BertLMPredictionHead': '2.61 ms'} depth 3: params - {'ModuleList': '302.31 M', 'Embedding': '31.79 M', 'Linear': '31.26 M'} MACs - {'ModuleList': '3092.88 GMACs', 'Linear': '36.23 GMACs'} fwd latency - {'ModuleList': '33.11 ms', 'BertPredictionHeadTransform': '1.83 ms''} depth 4: params - {'BertLayer': '302.31 M', 'LinearActivation': '1.05 M''} MACs - {'BertLayer': '3092.88 GMACs', 'LinearActivation': '10.74 GMACs'} fwd latency - {'BertLayer': '33.11 ms', 'LinearActivation': '1.43 ms'} depth 5: params - {'BertAttention': '100.76 M', 'BertIntermediate': '100.76 M'} MACs - {'BertAttention': '1031.3 GMACs', 'BertIntermediate': '1030.79 GMACs'} fwd latency - {'BertAttention': '19.83 ms', 'BertOutput': '4.38 ms'} depth 6: params - {'LinearActivation': '100.76 M', 'Linear': '100.69 M'} MACs - {'LinearActivation': '1030.79 GMACs', 'Linear': '1030.79 GMACs'} fwd latency - {'BertSelfAttention': '16.29 ms', 'LinearActivation': '3.48 ms'} ------------------------------ Detailed Profile per GPU ------------------------------ Each module profile is listed after its name in the following order: params, percentage of total params, MACs, percentage of total MACs, fwd latency, percentage of total fwd latency, fwd FLOPS BertForPreTrainingPreLN( 336.23 M, 100.00% Params, 3139.93 GMACs, 100.00% MACs, 76.39 ms, 100.00% latency, 82.21 TFLOPS, (bert): BertModel( 335.15 M, 99.68% Params, 3092.96 GMACs, 98.50% MACs, 34.29 ms, 44.89% latency, 180.4 TFLOPS, (embeddings): BertEmbeddings(...) (encoder): BertEncoder( 302.31 M, 89.91% Params, 3092.88 GMACs, 98.50% MACs, 33.45 ms, 43.79% latency, 184.93 TFLOPS, (FinalLayerNorm): FusedLayerNorm(...) (layer): ModuleList( 302.31 M, 89.91% Params, 3092.88 GMACs, 98.50% MACs, 33.11 ms, 43.35% latency, 186.8 TFLOPS, (0): BertLayer( 12.6 M, 3.75% Params, 128.87 GMACs, 4.10% MACs, 1.29 ms, 1.69% latency, 199.49 TFLOPS, (attention): BertAttention( 4.2 M, 1.25% Params, 42.97 GMACs, 1.37% MACs, 833.75 us, 1.09% latency, 103.08 TFLOPS, (self): BertSelfAttention( 3.15 M, 0.94% Params, 32.23 GMACs, 1.03% MACs, 699.04 us, 0.92% latency, 92.22 TFLOPS, (query): Linear(1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 182.39 us, 0.24% latency, 117.74 TFLOPS,...) (key): Linear(1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 57.22 us, 0.07% latency, 375.3 TFLOPS,...) (value): Linear(1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 53.17 us, 0.07% latency, 403.91 TFLOPS,...) (dropout): Dropout(...) (softmax): Softmax(...) ) (output): BertSelfOutput( 1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 114.68 us, 0.15% latency, 187.26 TFLOPS, (dense): Linear(1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 64.13 us, 0.08% latency, 334.84 TFLOPS, ...) (dropout): Dropout(...) ) ) (PreAttentionLayerNorm): FusedLayerNorm(...) (PostAttentionLayerNorm): FusedLayerNorm(...) (intermediate): BertIntermediate( 4.2 M, 1.25% Params, 42.95 GMACs, 1.37% MACs, 186.68 us, 0.24% latency, 460.14 TFLOPS, (dense_act): LinearActivation(4.2 M, 1.25% Params, 42.95 GMACs, 1.37% MACs, 175.0 us, 0.23% latency, 490.86 TFLOPS,...) ) (output): BertOutput( 4.2 M, 1.25% Params, 42.95 GMACs, 1.37% MACs, 116.83 us, 0.15% latency, 735.28 TFLOPS, (dense): Linear(4.2 M, 1.25% Params, 42.95 GMACs, 1.37% MACs, 65.57 us, 0.09% latency, 1310.14 TFLOPS,...) (dropout): Dropout(...) ) ) ... (23): BertLayer(...) ) ) (pooler): BertPooler(...) ) (cls): BertPreTrainingHeads(...) ) ------------------------------------------------------------------------------ In the summary profile, the DeepSpeed Flops Profiler outputs the number of parameters, floating-point operations (flops), FLOPS, latency, and throughput in samples/second of the model. This profile shows how much performance gap (compared to the peak hardware performance) the current model execution has and helps users tune the training or inference setup (e.g., hyperparameters, data parallelism, model parallelism, system configurations, etc.) for better performance. The DeepSpeed Flops Profiler also measures significant modules at different model depths (aggregated profile) and module-specific profile in the model architecture (detailed profile). Using these profiles, DeepSpeed users can understand how each layer or submodule contributes to the overall model complexity/performance. Then users can adjust or refactor the model design to improve performance. For example, using the profiler, DeepSpeed users can quantitatively tell if stacking smaller layers is lighter or more performant than having bigger ones. The aggregated and detailed profiles also allow users to quickly identify bottleneck modules. In the BERT-Large example above, using the DeepSpeed Flops Profiler, we find that BertLayer is the most significant layer and contains quite a few dropout, softmax, and layer norm along with linear modules. These modules are not heavy in flops and would trigger many GPU kernel invocations and create excessive read/write requests to memory. The pattern shown in the detailed profile suggests this is a perfect match for kernel fusion, and we developed fused transformer-kernels to reduce data movement (see DeepSpeedBert). After applying our optimizations, we see a 25% improvement in FLOPS per GPU and overall training samples/second in the DeepSpeed Flops Profiler output. The DeepSpeed Flops Profiler can be used with the DeepSpeed runtime without any user code change or be used independently from DeepSpeed as a standalone package. When using DeepSpeed for model training, the profiler can be enabled in the DeepSpeed configuration file. As a standalone package, the profiler API can be used in both training and inference code. The DeepSpeed profiler is still under active development and includes just initial features. Stay connected for more exciting features to be added soon. Flops Measurement Similar to existing flops calculation tools or methods, the DeepSpeed Flops Profiler measures the flops of the forward pass of a module and the flops of the backward pass is estimated as 2 times of that of the forward pass. Different from the PyTorch profiler which calculates the flops of PyTorch operators, the DeepSpeed Flops Profiler measures the flops within modules in a model and provides more insights to the users about the model execution. The flops estimation is partly inspired by ptflops with the major difference being that the DeepSpeed Flops Profiler not only supports flops computation directly at module level, but can also capture torch.nn.functional invoked in a module to estimate the flops. Thus the DeepSpeed Flops Profiler allows for customized modules in the model, e.g., ParallelTransformerLayerworks, ParallelSelfAttention, RowParallelLinear, etc. in Megatron-LM. This is in contrast to ptflops which requires users to write customized flops calculation functions for each customized module. Multi-GPU, Multi-node, Data Parallelism, and Model Parallelism The DeepSpeed Flops Profiler outputs the per GPU profile as well as the world size, data parallel size, and model parallel size. For models running on multi-GPU or multi-node, only change of the model parallelism (e.g., --model-parallel-size in Megatron-LM) affects the number of flops and parameters profiled, i.e., model_parallel_size flops = total_flops and model_parallel_size parameters = total_parameters. The data parallel size or world size (related to the number of GPUs or nodes) does not affect the per GPU profile. Usage The DeepSpeed Flops Profiler can be used with the DeepSpeed runtime or as a standalone package. When using DeepSpeed for model training, the profiler can be configured in the deepspeed configuration file without user code changes. To use the flops profiler outside the DeepSpeed runtime, install DeepSpeed and import the flops_profiler package to use the APIs directly. Examples of each usage are given below. Usage With the DeepSpeed Runtime Example: Megatron-LM Usage Outside the DeepSpeed Runtime In Model Inference Example: AlexNet Example: Bert In Model Training Workflow Example Training Workflow Usage With the DeepSpeed Runtime When using DeepSpeed for model training, the profiler can be configured in the deepspeed configuration file. No explicit API calls are needed to use the profiler. The profiler can be enabled by adding the following field to deepspeed’s configuration json file. Refer to flops profiler for details. { "flops_profiler": { "enabled": true, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } } Example: Megatron-LM For information on running Megatron-LM with DeepSpeed, please refer to our tutorial Megatron-LM. An example output of 12-layer Megatron-LM model (hidden_size = 8192, num_attention_heads = 32, batch_size = 1024, seq_length = 1024) is shown below. -------------------------- DeepSpeed Flops Profiler -------------------------- Profile Summary at step 10: Notations: data parallel size (dp_size), model parallel size(mp_size), number of parameters (params), number of multiply-accumulate operations(MACs), number of floating-point operations (flops), floating-point operations per second (FLOPS), fwd latency (forward propagation latency), bwd latency (backward propagation latency), step (weights update latency), iter latency (sum of fwd, bwd and step latency) world size: 1 data parallel size: 1 model parallel size: 1 batch size per GPU: 1024 params per gpu: 1.29 M params of model = params per GPU mp_size: 1.29 M fwd MACs per GPU: 41271.95 G fwd flops per GPU: 82543.9 G fwd flops of model = fwd flops per GPU mp_size: 82543.9 G fwd latency: 1.89 s bwd latency: 5.38 s fwd FLOPS per GPU = fwd flops per GPU / fwd latency: 43.68 TFLOPS bwd FLOPS per GPU = 2 fwd flops per GPU / bwd latency: 30.7 TFLOPS fwd+bwd FLOPS per GPU = 3 fwd flops per GPU / (fwd+bwd latency): 34.07 TFLOPS step latency: 34.12 s iter latency: 41.39 s samples/second: 24.74 ----------------------------- Aggregated Profile per GPU ----------------------------- Top 1 modules in terms of params, MACs or fwd latency at different model depths: depth 0: params - {'GPT2Model': '1.29 M'} MACs - {'GPT2Model': '41271.95 GMACs'} fwd latency - {'GPT2Model': '1.84 s'} depth 1: params - {'TransformerLanguageModel': '1.29 M'} MACs - {'TransformerLanguageModel': '39584.03 GMACs'} fwd latency - {'TransformerLanguageModel': '1.83 s'} depth 2: params - {'ParallelTransformer': '1.29 M'} MACs - {'ParallelTransformer': '39584.03 GMACs'} fwd latency - {'ParallelTransformer': '1.81 s'} depth 3: params - {'ModuleList': '1.28 M'} MACs - {'ModuleList': '39584.03 GMACs'} fwd latency - {'ModuleList': '1.3 s'} depth 4: params - {'ParallelTransformerLayerPart2': '688.15 k'} MACs - {'ParallelTransformerLayerPart2': '26388.28 GMACs'} fwd latency - {'ParallelTransformerLayerPart2': '865.73 ms'} depth 5: params - {'ParallelMLP': '491.54 k'} MACs - {'ParallelMLP': '26388.28 GMACs'} fwd latency - {'ParallelMLP': '849.4 ms'} ------------------------------ Detailed Profile per GPU ------------------------------ Each module profile is listed after its name in the following order: params, percentage of total params, MACs, percentage of total MACs, fwd latency, percentage of total fwd latency, fwd FLOPS Note: 1. A module can have torch.nn.module or torch.nn.functional to compute logits (e.g. CrossEntropyLoss). They are not counted as submodules, thus not to be printed out. However they make up the difference between a parent's MACs(or latency) and the sum of its submodules'. 1. Number of floating-point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput. 2. The fwd latency listed in the top module's profile is directly captured at the module forward function in PyTorch, thus it's less than the fwd latency shown above which is captured in DeepSpeed. GPT2Model( 1.29 M, 100.00% Params, 41271.95 GMACs, 100.00% MACs, 1.84 s, 100.00% latency, 44.78 TFLOPS, (language_model): TransformerLanguageModel( 1.29 M, 100.00% Params, 39584.03 GMACs, 95.91% MACs, 1.83 s, 99.11% latency, 43.34 TFLOPS, (embedding): Embedding( 2, 0.00% Params, 0 MACs, 0.00% MACs, 18.1 ms, 0.98% latency, 0.0 FLOPS, (word_embeddings): VocabParallelEmbedding(1, 0.00% Params, 0 MACs, 0.00% MACs, 164.75 us, 0.01% latency, 0.0 FLOPS, ) (position_embeddings): Embedding(1, 0.00% Params, 0 MACs, 0.00% MACs, 489.23 us, 0.03% latency, 0.0 FLOPS, 1024, 8192) (embedding_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 93.94 us, 0.01% latency, 0.0 FLOPS, p=0.1, inplace=False) ) (transformer): ParallelTransformer( 1.29 M, 100.00% Params, 39584.03 GMACs, 95.91% MACs, 1.81 s, 98.11% latency, 43.78 TFLOPS, (layers): ModuleList( 1.28 M, 98.73% Params, 39584.03 GMACs, 95.91% MACs, 1.3 s, 70.66% latency, 60.79 TFLOPS, (0): ParallelTransformerLayerPart1( 49.15 k, 3.80% Params, 1099.65 GMACs, 2.66% MACs, 23.5 ms, 1.27% latency, 93.6 TFLOPS, (input_layernorm): FusedLayerNorm(16.38 k, 1.27% Params, 0 MACs, 0.00% MACs, 128.75 us, 0.01% latency, 0.0 FLOPS, torch.Size([8192]), eps=1e-05, elementwise_affine=True) (attention): ParallelSelfAttention( 32.77 k, 2.53% Params, 1099.65 GMACs, 2.66% MACs, 22.8 ms, 1.24% latency, 96.46 TFLOPS, (query_key_value): ColumnParallelLinear(24.58 k, 1.90% Params, 824.63 GMACs, 2.00% MACs, 8.93 ms, 0.48% latency, 184.7 TFLOPS, ) (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 134.22 MMACs, 0.00% MACs, 151.16 us, 0.01% latency, 1.78 TFLOPS, ) (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 79.63 us, 0.00% latency, 0.0 FLOPS, p=0.1, inplace=False) (dense): RowParallelLinear(8.19 k, 0.63% Params, 274.88 GMACs, 0.67% MACs, 2.67 ms, 0.14% latency, 205.81 TFLOPS, ) ) ) (1): ParallelTransformerLayerPart2( 57.35 k, 4.43% Params, 2199.02 GMACs, 5.33% MACs, 77.53 ms, 4.21% latency, 56.73 TFLOPS, (post_attention_layernorm): FusedLayerNorm(16.38 k, 1.27% Params, 0 MACs, 0.00% MACs, 116.11 us, 0.01% latency, 0.0 FLOPS, torch.Size([8192]), eps=1e-05, elementwise_affine=True) (mlp): ParallelMLP( 40.96 k, 3.16% Params, 2199.02 GMACs, 5.33% MACs, 76.19 ms, 4.13% latency, 57.72 TFLOPS, (dense_h_to_4h): ColumnParallelLinear(32.77 k, 2.53% Params, 1099.51 GMACs, 2.66% MACs, 10.79 ms, 0.59% latency, 203.81 TFLOPS, ) (dense_4h_to_h): RowParallelLinear(8.19 k, 0.63% Params, 1099.51 GMACs, 2.66% MACs, 14.38 ms, 0.78% latency, 152.95 TFLOPS, ) ) ) ... (23): ParallelTransformerLayerPart2(...) ) (final_layernorm): FusedLayerNorm(16.38 k, 1.27% Params, 0 MACs, 0.00% MACs, 110.86 us, 0.01% latency, 0.0 FLOPS, torch.Size([8192]), eps=1e-05, elementwise_affine=True) ) ) ) ------------------------------------------------------------------------------ Usage Outside the DeepSpeed Runtime The profiler can be used as a standalone package outside of the DeepSpeed runtime. One can simply install DeepSpeed and import the flops_profiler package to use the APIs directly. Refer to installation of DeepSpeed for installing DeepSpeed. In Model Inference To profile a trained model in inference, use the get_model_profile function. Examples are given below. Example: AlexNet The following example shows how to profile AlexNet using the DeepSpeed flops profiler. import torchvision.models as models import torch from deepspeed.profiling.flops_profiler import get_model_profile from deepspeed.accelerator import get_accelerator with get_accelerator().device(0): model = models.alexnet() batch_size = 256 flops, macs, params = get_model_profile(model=model, # model input_shape=(batch_size, 3, 224, 224), # input shape to the model. If specified, the model takes a tensor with this shape as the only positional argument. args=None, # list of positional arguments to the model. kwargs=None, # dictionary of keyword arguments to the model. print_profile=True, # prints the model graph with the measured profile attached to each module detailed=True, # print the detailed profile module_depth=-1, # depth into the nested modules, with -1 being the inner most modules top_modules=1, # the number of top modules to print aggregated profile warm_up=10, # the number of warm-ups before measuring the time of each module as_string=True, # print raw numbers (e.g. 1000) or as human-readable strings (e.g. 1k) output_file=None, # path to the output file. If None, the profiler prints to stdout. ignore_modules=None) # the list of modules to ignore in the profiling Example: Bert from functools import partial import torch from transformers import BertForSequenceClassification, BertTokenizer from deepspeed.profiling.flops_profiler import get_model_profile from deepspeed.accelerator import get_accelerator def bert_input_constructor(batch_size, seq_len, tokenizer): fake_seq = "" for _ in range(seq_len - 2): # ignore the two special tokens [CLS] and [SEP] fake_seq += tokenizer.pad_token inputs = tokenizer([fake_seq] batch_size, padding=True, truncation=True, return_tensors="pt") labels = torch.tensor([1] batch_size) inputs = dict(inputs) inputs.update({"labels": labels}) return inputs with get_accelerator().device(0): tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') model = BertForSequenceClassification.from_pretrained('bert-base-uncased') batch_size = 4 seq_len = 128 enable_profile = True if enable_profile: flops, macs, params = get_model_profile( model, kwargs=bert_input_constructor(batch_size, seq_len, tokenizer), print_profile=True, detailed=True, ) else: inputs = bert_input_constructor((batch_size, seq_len), tokenizer) outputs = model(inputs) In Model Training Workflow To profile model forward in a training workflow, use the FlopsProfilerclass. The FlopsProfilerclass provides the following methods: start_profile() - starts profiling get_total_flops(as_string=False) - returns the total number of floating-point operations in the model get_total_macs(as_string=False) - returns the total number of MACs in the model get_total_params(as_string=False) - returns the total number of parameters in the model print_model_profile(profile_step=1, module_depth=-1, top_modules=3, detailed=True, output_file=None) - prints the model profile stop_profile() - stops profiling. This stops the flops counting in the model. end_profile() - cleans up. This cleans up the profile attributes added to the model during the profiling. This should be invoked at the end of the profiling and AFTER get_total_flops, get_total_params or print_model_profile. Example Training Workflow Below is an example of this usage in a typical training workflow. from deepspeed.profiling.flops_profiler import FlopsProfiler model = Model() prof = FlopsProfiler(model) profile_step = 5 print_profile= True for step, batch in enumerate(data_loader): # start profiling at training step "profile_step" if step == profile_step: prof.start_profile() # forward() method loss = model(batch) # end profiling and print output if step == profile_step: # if using multi nodes, check global_rank == 0 as well prof.stop_profile() flops = prof.get_total_flops() macs = prof.get_total_macs() params = prof.get_total_params() if print_profile: prof.print_model_profile(profile_step=profile_step) prof.end_profile() # runs backpropagation loss.backward() # weight update optimizer.step() Updated: November 5, 2025 Previous Next

Pattern 7: DeepSpeed Configuration JSON Contents Batch Size Related Parameters Optimizer Parameters Scheduler Parameters Communication options FP16 training options BFLOAT16 training options Automatic mixed precision (AMP) training options Gradient Clipping ZeRO Optimizations for FP16 Training Parameter offloading Optimizer offloading Asynchronous I/O Logging Autotuning Flops Profiler Activation Checkpointing Sparse Attention Data Efficiency Curriculum Learning Monitoring Module Elastic Training Config (V0.1 and V0.2) Communication Logging Compression Layer Reduction Weight Quantization Activation Quantization Sparse Pruning Row Pruning Head Pruning Channel Pruning Checkpoint options Data Type options Batch Size Related Parameters Note: train_batch_size must be equal to train_micro_batch_size_per_gpu gradient_accumulation_steps number of GPUs. For simplicity, you can choose to only specify two of the three parameters, the last one will be inferred automatically by DeepSpeed. train_batch_size: [integer] Value Example The effective training batch size. This is the amount of data samples that leads to one step of model update. train_batch_size is aggregated by the batch size that a single GPU processes in one forward/backward pass (a.k.a., train_micro_batch_size_per_gpu), the gradient accumulation steps (a.k.a., gradient_accumulation_steps), and the number of GPUs. Can be omitted if both train_micro_batch_size_per_gpu and gradient_accumulation_steps are provided. 32 train_micro_batch_size_per_gpu: [integer] Description Default Batch size to be processed by one GPU in one step (without gradient accumulation). Can be omitted if both train_batch_size and gradient_accumulation_steps are provided. train_batch_size value gradient_accumulation_steps: [integer] Description Default Number of training steps to accumulate gradients before averaging and applying them. This feature is sometimes useful to improve scalability since it results in less frequent communication of gradients between steps. Another impact of this feature is the ability to train with larger batch sizes per GPU. Can be omitted if both train_batch_size and train_micro_batch_size_per_gpu are provided. 1 Optimizer Parameters optimizer: [dictionary] Fields Value Example type The optimizer name. DeepSpeed natively supports Adam, AdamW, OneBitAdam, Lamb, and OneBitLamb optimizers (See here for details) and will import other optimizers from torch. "Adam" params Dictionary of parameters to instantiate optimizer. The parameter names must match the optimizer constructor signature (e.g., for Adam). {"lr": 0.001, "eps": 1e-8} Example of optimizer with Adam "optimizer": { "type": "Adam", "params": { "lr": 0.001, "betas": [ 0.8, 0.999 ], "eps": 1e-8, "weight_decay": 3e-7 } } The Adam optimizer also supports the following two params keys/values in addition to the standard parameters from torch.optim.Adam: “params” key Description Default torch_adam Use torch’s implementation of adam instead of our fused adam implementation false adam_w_mode Apply L2 regularization (also known as AdamW) true Another example of optimizer with 1-bit Adam specific parameters is as follows. "optimizer": { "type": "OneBitAdam", "params": { "lr": 0.001, "betas": [ 0.8, 0.999 ], "eps": 1e-8, "weight_decay": 3e-7, "freeze_step": 400, "cuda_aware": false, "comm_backend_name": "nccl" } } The 1-bit Adam optimizer supports the following three params keys/values in addition to the standard Adam (learn more in our tutorial): “params” key Description Default freeze_step Number of warm up steps before 1-bit compression gets applied to the communication 100000 cuda_aware To indicate that the underlying MPI library supports CUDA-Aware communication false comm_backend_name To indicate which backend implementation to use “nccl” A variant optimizer for 1-bit Adam is 0/1 Adam, which further optimizes 1-bit Adam via adaptive variance freezing and 1-bit synchronization over optimizer states. "optimizer": { "type": "ZeroOneAdam", "params": { "lr": 1e-3, "weight_decay": 0.01, "bias_correction": false, "var_freeze_step": 1000, "var_update_scaler": 16, "local_step_scaler": 1000, "local_step_clipper": 16, "cuda_aware": false, "comm_backend_name": "nccl" } } 0/1 Adam supports the following params key/values in addition to standard Adam (learn more in our tutorial.) “params” key Description Default var_freeze_step The latest step to update the variance 100000 var_update_scaler The interval to update the variance 16 local_step_scaler The interval to scale the local steps interval according to the learning rate policy 32678 local_step_clipper The largest interval for local steps with learning rate policy 16 cuda_aware To indicate that the underlying MPI library supports CUDA-Aware communication false comm_backend_name To indicate which backend implementation to use “nccl” Another example of optimizer with 1-bit LAMB "optimizer": { "type": "OneBitLamb", "params": { "lr": 11e-3, "weight_decay": 0.01, "bias_correction": false, "max_coeff": 0.3, "min_coeff": 0.01, "freeze_step": 1000, "cuda_aware": false, "comm_backend_name": "nccl", "coeff_beta": 0.9, "factor_max": 4.0, "factor_min": 0.5, "factor_threshold": 0.1 } } The 1-bit LAMB optimizer supports the following params keys/values in addition to the standard LAMB (learn more in our tutorial): “params” key Description Default max_coeff Scaling coefficient upper bound for original LAMB algorithm and 1-bit LAMB’s warmup stage 10.0 min_coeff Scaling coefficient lower bound for original LAMB algorithm and 1-bit LAMB’s warmup stage 0.01 freeze_step Number of warm up steps before 1-bit compression gets applied to the communication 100000 cuda_aware To indicate that the underlying MPI library supports CUDA-Aware communication false comm_backend_name To indicate which backend implementation to use “nccl” coeff_beta Coefficient used for computing running averages of lamb coefficient 0.9 factor_max Maximum value of scaling factor to the frozen lamb coefficient during compression stage 4.0 factor_min Minimum value of scaling factor to the frozen lamb coefficient during compression stage 0.5 factor_threshold Threshold of how much the scaling factor can fluctuate between steps 0.1 Scheduler Parameters DeepSpeed calls the step() method of the scheduler at every training step when model_engine.step() is executed. scheduler: [dictionary] Fields Value Example type The scheduler name. See here for list of support schedulers. "WarmupLR" params Dictionary of parameters to instantiate scheduler. The parameter names should match scheduler constructor signature. {"warmup_min_lr": 0, "warmup_max_lr": 0.001} Example of scheduler "scheduler": { "type": "WarmupLR", "params": { "warmup_min_lr": 0, "warmup_max_lr": 0.001, "warmup_num_steps": 1000 } } Communication options communication_data_type: [string] Description Default During gradient averaging perform communication with selected data type. By default it will be determined by selected regime None prescale_gradients: [boolean] Description Default Scale gradients before doing allreduce false gradient_predivide_factor: [float] Description Default Before gradient averaging predivide gradients by a specified factor, can sometimes help with fp16 stability when scaling to large numbers of GPUs 1.0 sparse_gradients: [boolean] Description Default Enable sparse compression of torch.nn.Embedding gradients. This feature is essentially deprecated as we don’t see use cases for it as much anymore. It should be noted that this feature is not compatible with torch.sparse related features. false FP16 training options Note: this mode cannot be combined with the amp mode described below. fp16: [dictionary] Description Default Configuration for using mixed precision/FP16 training that leverages NVIDIA’s Apex package. An example, including the available dictionary keys is illustrated below. NOTE: this does not use Apex’s AMP mode that allows for more flexibility in mixed precision training modes, this mode is similar to AMP’s O2 mode. Please see AMP support below if you want to use more complex mixed precision modes. If you want to use ZeRO (currently) you must use this mode. None "fp16": { "enabled": true, "auto_cast": false, "loss_scale": 0, "initial_scale_power": 16, "loss_scale_window": 1000, "hysteresis": 2, "consecutive_hysteresis": false, "min_loss_scale": 1 } fp16:enabled: [boolean] Description Default enabled is a fp16 parameter indicating whether or not FP16 training enabled. false fp16:auto_cast: [boolean] Description Default auto_cast automatically casts inputs to fp16 false fp16:loss_scale: [float] Description Default loss_scale is a fp16 parameter representing the loss scaling value for FP16 training. The default value of 0.0 results in dynamic loss scaling, otherwise the value will be used for static fixed loss scaling. 0.0 fp16:initial_scale_power: [integer] Description Default initial_scale_power is a fp16 parameter representing the power of the initial dynamic loss scale value. The actual loss scale is computed as 2initial_scale_power. 16 fp16:loss_scale_window: [integer] Description Default loss_scale_window is a fp16 parameter representing the window over which to raise/lower the dynamic loss scale value. 1000 fp16:hysteresis: [integer] Description Default hysteresis is a fp16 parameter representing the delay shift in dynamic loss scaling. 2 fp16:consecutive_hysteresis: [boolean] Description Default consecutive_hysteresis is a fp16 parameter representing whether to refill the hysteresis if we reach an iteration that doesn’t overflow false fp16:min_loss_scale: [integer] Description Default min_loss_scale is a fp16 parameter representing the minimum dynamic loss scale value. 1 BFLOAT16 training options Note: this mode cannot be combined with the amp mode described below. Note: this mode cannot be combined with the fp16 mode described above. bf16: [dictionary] Description Default Configuration for using bfloat16 floating-point format as an alternative to FP16. BFLOAT16 requires hardware support (e.g., NVIDIA A100). An example, including the available dictionary keys is illustrated below. Training with bfloat16 does not require loss scaling. None "bf16": { "enabled": true } bf16:enabled: [boolean] Description Default enabled indicates whether BFLOAT16 training is enabled. false Automatic mixed precision (AMP) training options Note: this mode cannot be combined with the fp16 mode described above. In addition this mode is not currently compatible with ZeRO. amp: [dictionary] Description Default Configuration for using automatic mixed precision (AMP) training that leverages NVIDIA’s Apex AMP package. An example, including the available dictionary keys is illustrated below. Is not compatible with fp16 mode above or ZeRO. Any parameters outside of “enabled” will be passed to AMP’s initialize call, see the API and descriptions here at the apex.amp.initialize documentation. None "amp": { "enabled": true, ... "opt_level": "O1", ... } amp:enabled: [boolean] Description Default enabled is an amp parameter indicating whether or not AMP training is enabled. false amp params: [various] Description Default Any parameters outside of “enabled” will be passed to AMP’s initialize call, see the API and descriptions here at the apex.amp.initialize documentation. None Gradient Clipping gradient_clipping: [float] Description Default Enable gradient clipping with value 1.0 ZeRO Optimizations for FP16 Training Enabling and configuring ZeRO memory optimizations "zero_optimization": { "stage": [0|1|2|3], "allgather_partitions": [true|false], "allgather_bucket_size": 5e8, "overlap_comm": false, "reduce_scatter": [true|false], "reduce_bucket_size": 5e8, "contiguous_gradients" : [true|false], "offload_param": { ... }, "offload_optimizer": { ... }, "stage3_max_live_parameters" : 1e9, "stage3_max_reuse_distance" : 1e9, "stage3_prefetch_bucket_size" : 5e8, "stage3_param_persistence_threshold" : 1e6, "sub_group_size" : 1e12, "elastic_checkpoint" : [true|false], "stage3_gather_16bit_weights_on_model_save": [true|false], "ignore_unused_parameters": [true|false], "round_robin_gradients": [true|false], "zero_hpz_partition_size": 1, "zero_quantized_weights": [true|false], "zero_quantized_gradients": [true|false], "log_trace_cache_warnings": [true|false], } zero_optimization: [dictionary] Description Default Enable ZeRO memory optimizations, compatible with FP16/BF16/FP32 and the Adam optimizer. false stage: [integer] Description Default Chooses different stages of ZeRO Optimizer. Stage 0, 1, 2, and 3 refer to disabled, optimizer state partitioning, and optimizer+gradient state partitioning, and optimizer+gradient+parameter partitioning, respectively. 0 allgather_partitions: [boolean] Description Default Chooses between allgather collective or a series of broadcast collectives to gather updated parameters from all the GPUs at the end of each step true allgather_bucket_size: [integer] Description Default Number of elements allgathered at a time. Limits the memory required for the allgather for large model sizes 5e8 overlap_comm: [boolean] Description Default Attempts to overlap the reduction of the gradients with backward computation false reduce_scatter: [boolean] Description Default Uses reduce or reduce scatter instead of allreduce to average gradients true reduce_bucket_size: [integer] Description Default Number of elements reduced/allreduced at a time. Limits the memory required for the allgather for large model sizes 5e8 contiguous_gradients: [boolean] Description Default Copies the gradients to a contiguous buffer as they are produced. Avoids memory fragmentation during backward pass. True load_from_fp32_weights: [boolean] Description Default Initialize fp32 master weights from fp32 copies in checkpoint (no precision loss) or from model’s fp16 copies (with precision loss). This can be used to initialize optimizer state even when checkpoint is missing optimizer state. True grad_hooks: [boolean] Description Default For use with ZeRO stage 1, enable backward hooks to reduce gradients during the backward pass or wait until the end of the backward pass. True round_robin_gradients: [boolean] Description Default Stage 1 and 2 optimization for CPU offloading that parallelizes gradient copying to CPU memory among ranks by fine-grained gradient partitioning. Performance benefit grows with gradient accumulation steps (more copying between optimizer steps) or GPU count (increased parallelism). False offload_param: [dictionary] Description Default Enable offloading of model parameters to CPU or NVMe. This frees up GPU memory for larger models or batch sizes. Valid only with stage 3. See here for more details. False offload_optimizer: [dictionary] Description Default Enable offloading of optimizer state to CPU or NVMe, and optimizer computation to CPU. This frees up GPU memory for larger models or batch sizes. Valid for ZeRO stage 1, 2, 3. See here for more details. False stage3_max_live_parameters: [integer] Description Default The maximum number of parameters resident per GPU before releasing. Smaller values use less memory, but perform more communication. 1e9 stage3_max_reuse_distance: [integer] Description Default Do not release a parameter if it will be reused within this threshold of parameters. Smaller values use less memory, but perform more communication. 1e9 stage3_prefetch_bucket_size: [integer] Description Default The size of the fixed buffer for prefetching parameters. Smaller values use less memory, but can increase stalls due to communication. 5e8 stage3_param_persistence_threshold: [integer] Description Default Do not partition parameters smaller than this threshold. Smaller values use less memory, but can greatly increase communication (especially latency-bound messages). 1e5 stage3_gather_16bit_weights_on_model_save: [boolean] Description Default Consolidate the weights before saving the model by save_16bit_model(). Since the weights are partitioned across GPUs, they aren’t part of state_dict, so this function automatically gathers the weights when this option is enabled and then saves the fp16 model weights. False stage3_module_granularity_threshold: [integer] | Description | Default | |——————————————————————————————————————————————————————————————————————————————————————————–| ——- | | The granularity of a module is determined by the ratio of parameter_count / (1 + descendant_count). ZeRO3 classifies modules with a granularity below the threshold as fine-grained, treating them as integral units during parameter fetching. This reduces host and communication overhead from separate hooks. | 0 | zero_hpz_partition_size: [integer] Description Default Number of ranks in hiearchical partitioning ZeRO (hpZ) secondary tensor group of ZeRO++, default is 1 meaning no hpZ, ideal is number of ranks (gpus) per node. 1 zero_quantized_weights: [boolean] Description Default Boolean indicating whether to enable communication efficient quantized weights of ZeRO++. False zero_quantized_gradients: [boolean] Description Default Boolean indicating whether to enable communication efficient quantized gradients of ZeRO++. False log_trace_cache_warnings: [boolean] Description Default Log warnings from trace cache optimization of parameter sharding, such as cache invalidation events. False cpu_offload: [boolean] Deprecated: cpu_offload is deprecated and will be removed in future, please use offload_optimizer instead. Description Default Enable offloading of optimizer memory and computation to CPU. This frees up GPU memory for larger models or batch sizes. Valid with stage 1 and 2. False Parameter offloading Enabling and configuring ZeRO optimization of parameter offloading to CPU/NVMe. Available only with ZeRO stage 3. Note that if the value of “device” is not specified or not supported, an assertion will be triggered. "offload_param": { "device": "[cpu|nvme]", "nvme_path": "/local_nvme", "pin_memory": [true|false], "buffer_count": 5, "buffer_size": 1e8, "max_in_cpu": 1e9 } device: [string] Description Default Device memory to offload model parameters. Supported options are cpu and nvme. cpu nvme_path: [string] Description Default Filesystem path for NVMe device for parameter offloading. /local_nvme pin_memory: [boolean] Description Default Offload to page-locked CPU memory. This could boost throughput at the cost of extra memory overhead. false buffer_count: [integer] Description Default Number of buffers in buffer pool for parameter offloading to NVMe. 5 buffer_size: [integer] Description Default Size of buffers in buffer pool for parameter offloading to NVMe. 1e8 max_in_cpu: [integer] Description Default Number of parameter elements to maintain in CPU memory when offloading to NVMe is enabled. 1e9 Optimizer offloading Enabling and configuring ZeRO optimization of offloading optimizer computation to CPU and state to CPU/NVMe. CPU offloading is available with ZeRO stage 1, 2, 3. NVMe offloading is available only with ZeRO stage 3. Note that if the value of “device” is not specified or not supported, an assertion will be triggered. "offload_optimizer": { "device": "[cpu|nvme]", "nvme_path": "/local_nvme", "pin_memory": [true|false], "ratio": 0.3, "buffer_count": 4, "fast_init": false } device: [string] Description Default Device memory to offload optimizer state. Supported options are cpu and nvme. Optimizer computation is offload to CPU regardless of device option. cpu nvme_path: [string] Description Default Filesystem path for NVMe device for optimizer state offloading. /local_nvme pin_memory: [boolean] Description Default Offload to page-locked CPU memory. This could boost throughput at the cost of extra memory overhead. false ratio: [float] Description Default the ratio of parameters updating (i.e. optimizer step) on CPU side. 1 buffer_count: [integer] Description Default Number of buffers in buffer pool for optimizer state offloading to NVMe. This should be at least the number of states maintained per parameter by the optimizer. For example, Adam optimizer has 4 states (parameter, gradient, momentum, and variance). 4 fast_init: [boolean] Description Default Enable fast optimizer initialization when offloading to NVMe. false Asynchronous I/O Configuring the asynchronous I/O module for offloading parameter and optimizer states to persistent (NVMe) storage. This module uses Linux native asynchronous I/O (libaio). "aio": { "block_size": 1048576, "queue_depth": 8, "thread_count": 1, "single_submit": false, "overlap_events": true } block_size: [integer] Description Default I/O block size in bytes. 1048576 queue_depth: [integer] Description Default I/O queue depth. 8 thread_count: [integer] Description Default Intra-request parallelism for each read/write submitted by a user thread. 1 single_submit: [boolean] Description Default Submit requests to storage device as multiple individual requests as opposed to one block of requests. false overlap_events: [boolean] Description Default Submit requests to storage device in an overlapped fashion without waiting for completion of earlier requests. true ignore_unused_parameters: [boolean] Description Default Unused parameters in modules may be unexpected in static networks, but could be normal in dynamic networks. This controls whether or not training should terminate with an error message when unused parameters are detected. This is set to True by default, which means unused parameters are ignored and training continues. Now is just used in stage 2. True Logging steps_per_print: [integer] Description Default Print progress report every N training steps. The report includes the number of training steps, number of skipped optimizer updates (likely due to overflows in mixed-precision training), current learning rate, and current momentum. 10 wall_clock_breakdown: [boolean] Description Default Enable timing of the latency of forward/backward/update training phases false dump_state: [boolean] Description Default Print out state information of DeepSpeed object after initialization false Autotuning { "autotuning": { "enabled": false, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": false, "metric": "throughput", "start_profile_step": 3, "end_profile_step": 5, "fast": true, "max_train_batch_size": null, "mp_size": 1, "num_tuning_micro_batch_sizes": 3, "tuner_type": "model_based", "tuner_early_stopping": 5, "tuner_num_trials": 50, "arg_mappings": null } } enabled: [boolean] Description Default Enables the autotuner. false results_dir: [string] Description Default Path to the autotuning experiment results directory. The default appears in the working directory from which Deepspeed was launched. “autotuning_results” exps_dir: [string] Description Default Path to the auotuning experiment descriptions directory. The default appears in the working directory from which Deepspeed was launched. “autotuning_exps” overwrite: [boolean] Description Default Whether to run autotuning experiments whose results already exist. Setting it to true would overwrite the existing result. false metric: [string] Description Default The performance metric to use for ranking autotuning experiments. latency, throughput, and FLOPS are currently supported, referring to training step latency, training samples per second, and floating-point operations per second achieved per GPU respectively. throughput start_profile_step: [integer] Description Default The global training step at which to start profiling in an autotuning experiment. Note that warm-up is needed for accurate performance measurement. 3 end_profile_step: [integer] Description Default The global training step at which to end profiling in an autotuning experiment. Must not be less than start_profile_step. 5 fast: [boolean] Description Default Enables fast-model autotuning where only Zero stages and micro-batch sizes per GPU are tuned. true max_train_batch_size: [int] Description Default The maximum train batch size (global effective batch size) for the model training. null mp_size: [int] Description Default Model parallelism degree. 1 num_tuning_micro_batch_sizes: [integer] Description Default The number of micro-batch sizes to explore. 3 tuner_type: [string] Description Default The algorithm defines the order of autotuning space exploration within a ZeRO stage. model_based tuner_early_stopping: [integer] Description Default The number of experiments to run beyond the current best experiment. If no better experiment is found within that number, the Autotuner stops the exploration. 5 tuner_num_trials: [integer] Description Default The maximum number of experiments to explore in the tuning space within a ZeRO stage. 50 Flops Profiler { "flops_profiler": { "enabled": false, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null, } } enabled: [boolean] Description Default Enables the flops profiler. This would also enables wall_clock_breakdown false profile_step: [integer] Description Default The global training step at which to profile. Note that warm up steps are needed for accurate time measurement. 1 module_depth: [integer] Description Default The depth of the model at which to print the aggregated module information. When set to -1, it prints information from the top module to the innermost modules (the maximum depth). -1 top_modules: [integer] Description Default Limits the aggregated profile output to the number of top modules specified. 1 detailed: [boolean] Description Default Whether to print the detailed model profile. true output_file: [string] Description Default Path to the output file. If None, the profiler prints to stdout.. null Activation Checkpointing "activation_checkpointing": { "partition_activations": false, "cpu_checkpointing": false, "contiguous_memory_optimization": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } partition_activations: [boolean] Description Default Enables partition activation when used with model parallelism false cpu_checkpointing: [boolean] Description Default Offloads partitioned activations to CPU if partition_activations is enabled false contiguous_memory_optimization: [boolean] Description Default Copies partitioned activations so that they are contiguous in memory false number_checkpoints: [integer] Description Default Total number of activation checkpoints used to allocate memory buffer for contiguous_memory_optimization None synchronize_checkpoint_boundary: [boolean] Description Default Inserts get_accelerator().synchronize() at each checkpoint boundary. false profile: [boolean] Description Default Logs the forward and backward time for each checkpoint function false Sparse Attention sparse_attention: [dictionary] Fields Value Example mode A string determining sparsity structure type. Deepspeed currently supports "dense", "fixed", "bigbird", "bslongformer", and "variable". "fixed" block An integer determining the block size. Current implementation of sparse self-attention is based on blocked sparse matrices. In which this parameter defines size of such blocks, Block X Block. 16 different_layout_per_head A boolean determining if each head should be assigned a different sparsity layout; this will be satisfied based on availability. false num_local_blocks An integer determining the number of random blocks in each block row; only used in "fixed" mode. 4 num_global_blocks An integer determining how many consecutive blocks in a local window is used as the representative of the window for global attention; used in "fixed" and "bigbird" modes. 1 attention A string determining attention type. Attention can be "unidirectional", such as autoregressive models, in which tokens attend only to tokens appear before them in the context. Considering that, the upper triangular of attention matrix is empty. Or it can be "bidirectional", such as BERT, in which tokens can attend to any other tokens before or after them. Then, the upper triangular part of the attention matrix is mirror of the lower triangular; used in "fixed" and "variable" modes. "bidirectional" horizontal_global_attention A boolean determining if blocks that are global representative of a local window, also attend to all other blocks. This is valid only if attention type is "bidirectional". Looking at the attention matrix, that means global attention not only includes the vertical blocks, but also horizontal blocks; used in "fixed" and "variable" modes. false num_different_global_patterns An integer determining number of different global attentions layouts. While global attention can be fixed by which block/s are representative of any local window, since there are multi-heads, each head can use a different global representative; used only in "fixed" mode. 4 num_random_blocks An integer determining the number of random blocks in each block row; used in "variable" and "bigbird" modes. 0 local_window_blocks A list of integers determining the number of blocks in each local attention window. It assumes first number determines # of blocks in the first local window, second the second window, …, and the last number determines the number of blocks in the remaining local windows; only used in "variable" mode. [4] global_block_indices A list of integers determining which blocks are considered as global attention. Given indices, determine the blocks that all other token blocks attend to and they attend to all other token blocks. Notice that if global_block_end_indices parameter is set, this parameter is used as starting index of each global window; used in "variable" and "bslongformer" modes. [0] global_block_end_indices A list of integers determining end indices of global window blocks. By default this is not used. But if it is set, it must have the same size of global_block_indices parameter, and combining this two parameters, for each index i, blocks from global_block_indices[i] to global_block_end_indices[i], exclusive, are considered as global attention; used in "variable" and "bslongformer" modes. None num_sliding_window_blocks An integer determining the number of blocks in sliding local attention window; used in "bigbird" and "bslongformer" modes. 3 Example of sparse_attention "sparse_attention": { "mode": "fixed", "block": 16, "different_layout_per_head": true, "num_local_blocks": 4, "num_global_blocks": 1, "attention": "bidirectional", "horizontal_global_attention": false, "num_different_global_patterns": 4, "num_random_blocks": 0, "local_window_blocks": [4], "global_block_indices": [0], "global_block_end_indices": None, "num_sliding_window_blocks": 3 } Data Efficiency DeepSpeed Data Efficiency Library includes two techniques: curriculum learning and random layerwise token dropping (random-LTD). Read more about how to use the DeepSpeed Data Efficiency Library in our tutorial. "data_efficiency": { "enabled": true, "seed": 1234, "data_routing": { "enabled": true, "random_ltd":{ "enabled": true, "total_layer_num": 24, "random_ltd_layer_num": 22, "random_ltd_layer_id": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22], "model_mask_name": "attention_mask", "model_type": "decoder", "hidden_state_order": "seq_batch_dim", "random_ltd_schedule": { "min_value": 128, "max_value": 2048, "schedule_type":"fixed_linear", "schedule_config": { "require_steps": 200000, "seq_per_step": 16 } } } }, "data_sampling": { "enabled": true, "num_epochs": 1, "num_workers": 0, "curriculum_learning": { "enabled": true, "data_cluster_path": "/path/to/data_clusters", "curriculum_metrics": { "vocabularyrarity": { "index_to_sample_path": "/path/to/index_to_sample", "index_to_metric_path": "/path/to/index_to_metric", "difficulty_type": "percentile", "clustering_type": "schedule_based", "min_difficulty": 1, "max_difficulty": 100, "schedule_type": "fixed_root", "schedule_config": { "total_curriculum_step": 110000, "difficulty_step": 1, "root_degree": 2 } } } } } } data_efficiency: [dictionary] Fields Value Default enabled: [boolean] Enable data efficiency or not. false seed: [integer] Random seed for data sampling. 1234 data_routing: [dictionary] Configs for data routing techniques. N/A data_sampling: [dictionary] Configs for data sampling techniques. N/A data_routing: [dictionary] Fields Value Default enabled: [boolean] Enable data routing techniques or not. false random_ltd: [dictionary] Configs for random-LTD technique. N/A data_sampling: [dictionary] Fields Value Default enabled: [boolean] Enable data sampling techniques or not. false num_epochs: [integer] At most how many epoches of the original dataset will be iterated. 1000 num_workers: [integer] Data loader number of workers. 0 curriculum_learning: [dictionary] Configs for curriculum learing technique. N/A random_ltd: [dictionary] Fields Value Default enabled: [boolean] Enable random-LTD technique or not. false total_layer_num: [integer] The number of layer (or the depth) for the pretraining/fine-tuning model. N/A random_ltd_layer_num: [integer] The number of layers that will be applied with random-LTD. N/A random_ltd_layer_id: [list] The exact layer_id that will be applied with random-LTD. The length of this list must be the same as random_ltd_layer_num. N/A model_mask_name: [str] The variable name of the attention_mask. Different libraries have different names, such as att_mask. For huggingface model, it’s named “attention_mask”. Users need to check the forward function in the original model files. If the attention mask input in the original model’s forward function is not a keyword/named argument (e.g., attention_mask=None), user would need to change it to a keyword/named argument and provide that keyword as model_mask_name. N/A model_type: [str] Users need to identify whether the model is decoder or encoder. Currently we only support these two. N/A hidden_state_order: [str] Users need to know the input order of the hidden state tensor. Normally, it’s batch, sequence and then the hidden dimension, which is batch_seq_dim. Somethings, the order between batch and sequence will be switch like seq_batch_dim. Currently, we support these two. N/A random_ltd_schedule: [dictionary] The schedule of the effective sequence length after token dropping. It’s a linear function where random-LTD gradually drops less tokens and increases effective sequence length. N/A min_value: [integer] The initial effective sequence length (after token dropping) at step/iteration 0. N/A max_value: [integer] The max effective sequence length (usually the case without any token dropping). Usually this is set as baseline’s seqlen. N/A schedule_type: [str] The sequence length follows a linear increasing function starting from min_value and reaching max_value. We currently only support this type. N/A schedule_config: [dictionary] Configs for the linear increasing function. N/A require_steps: [integer] How many iterations will be needed to reach max_value from min_value. N/A seq_per_step: [integer] At any time, the effective sequence length be multiple of this seq_per_step. Set this to multiple of 8 (for FP16 data) or 16 (for INT8 data) to enable NVIDIA Tensor Core acceleration. N/A curriculum_learning: [dictionary] Fields Value Default enabled: [boolean] Enable curriculum learing technique or not. false data_cluster_path: [str] Path to directory where curriculum learning will store the indexes of data samples within the same difficulty ranges. N/A curriculum_metrics: [dictionary] This dictionary includes all desired curriculum metrics and their configs. Each metric will be a separate sub-dictionary, where the key is the metric name and the values are configs below. N/A index_to_sample_path: [str] Path to the index_to_sample file generated during offline data analysis. Note that data analysis will generate two kinds of index_to_sample files: The metric_name_index_to_sample_percentile_merged file is a concatenated index for perf improvement, but it only works when you set difficulty_type=percentile. If you use difficulty_type=value, you need to change this to use the metric_name_index_to_sample file. N/A index_to_metric_path: [str] Path to the index_to_metric_path file generated during offline data analysis. N/A difficulty_type: [str] During training, how to increase the max accepted difficulty. Currently support value (increase by absolute value) and percentile (increase by difficulty percentile). N/A clustering_type: [str] Currently support schedule_based (cluster data based on the difficulty schedule (pacing function) below) and single_cluster (no clustering required and probably CL is achieved by data postprocessing, such as sequence length truncation). N/A min_difficulty: [integer] Starting difficulty at first step. When difficulty_type=value the min_difficulty is an absolute difficulty value. When difficulty_type=percentile the min_difficulty is a difficulty percentile value. N/A max_difficulty: [integer] Final max difficulty. When difficulty_type=value the max_difficulty is an absolute difficulty value. When difficulty_type=percentile the max_difficulty is a difficulty percentile value. N/A schedule_type: [str] The difficulty schedule (pacing function) that defines how the max accepted difficulty increases from min_difficulty to max_difficulty during training. Currently support fixed_linear, fixed_root, fixed_discrete, and custom. N/A schedule_config: [dictionary] Configs for the pacing function. When schedule_type=custom this dictionary is not necessary. Instead user needs to provide a callback function (via the set_custom_curriculum_learning_schedule API in deepspeed/runtime/engine.py) which will update the max accepted difficulty during training. Configs below are all belongs to schedule_config. N/A total_curriculum_step: [integer] How many steps the curriculum learning takes to go from min difficulty to max difficulty. Used by fixed_linear and fixed_root schedule. N/A difficulty_step: [integer] The max accepted difficulty level determined every step must be a multiple of this difficulty_step. This is used to ensure the use of NVIDIA Tensor Core acceleration (requires multiple of 8 (FP16) or 16 (INT8)). Used by fixed_linear and fixed_root schedule. N/A root_degree: [integer] The degree of the root function. Degree of 2 means square root and degree of 3 means cube root. Degree of 1 is equivalent to linear. Used by fixed_root schedule. N/A difficulty: [list] List of max accepted difficulty levels to be used during schedule. Used by fixed_discrete schedule. N/A max_step: [list] List of which step to change max accepted difficulty level. Used by fixed_discrete schedule. N/A Curriculum Learning Note: On 12/12/2022, we released DeepSpeed Data Efficiency Library which provides a more general curriculum learning support. This legacy curriculum learning feature below is still supported but we recommend to use the Data Efficiency Library. "curriculum_learning": { "enabled": true, "curriculum_type": "seqlen", "min_difficulty": 8, "max_difficulty": 1024, "schedule_type": "fixed_linear", "schedule_config": { "total_curriculum_step": 40000, "difficulty_step": 8 } } enabled: [boolean] Description Default Set to true to enable curriculum learning false curriculum_type: [string] Description Default Type of curriculum difficulty metric. Currently support seqlen. N/A min_difficulty: [integer] Description Default The starting difficulty level N/A max_difficulty: [integer] Description Default The ending difficulty level N/A schedule_type: [string] Description Default Type of curriculum schedule. Currently support fixed_linear, fixed_root, and fixed_discrete. N/A total_curriculum_step: [integer] Description Default Total number of steps for the curriculum learning. One of the schedule_config when the fixed_linear and fixed_root schedule_type are used. N/A difficulty_step: [integer] Description Default At any time, the curriculum learning difficulty must be multiple of this difficulty_step. Set this to multiple of 8 (for FP16 data) or 16 (for INT8 data) to enable NVIDIA Tensor Core acceleration. One of the schedule_config when the fixed_linear and fixed_root schedule_type are used. N/A root_degree: [integer] Description Default Root degree of the curriculum schedule function. One of the schedule_config when the fixed_root schedule_type is used. N/A difficulty: [list of integer] Description Default List of difficulty levels to be used during schedule. One of the schedule_config when the fixed_discrete schedule_type is used. N/A max_step: [list of integer] Description Default List of which step to change difficulty level. One of the schedule_config when the fixed_discrete schedule_type is used. N/A Monitoring Module Note: Deepspeed logs to TensorBoard through PyTorch. Logging to TensorBoard requires that the tensorboard package is installed (read more in the PyTorch documentation). Note: Logging to WandB requires that the wandb package is installed (read more in the WandB documentation). Note: Logging to Comet requires that the comet_ml package is installed (read more in the Comet documentation). Deepspeed’s Monitor module can log training details into a Tensorboard-compatible file, to WandB, to Comet or to simple CSV files. Below is an overview of what DeepSpeed will log automatically. Field Description Conditions Train/Samples/train_loss The training loss. None Train/Samples/lr The learning rate during training. None Train/Samples/loss_scale The loss scale when training using fp16. fp16 must be enabled. Train/Eigenvalues/ModelBlockParam_{i} Eigen values per param block. eigenvalue must be enabled. Train/Samples/elapsed_time_ms_forward The global duration of the forward pass. flops_profiler.enabled or wall_clock_breakdown. Train/Samples/elapsed_time_ms_backward The global duration of the forward pass. flops_profiler.enabled or wall_clock_breakdown. Train/Samples/elapsed_time_ms_backward_inner The backward time that does not include the gradient reduction time. Only in cases where the gradient reduction is not overlapped, if it is overlapped then the inner time should be about the same as the entire backward time. flops_profiler.enabled or wall_clock_breakdown. Train/Samples/elapsed_time_ms_backward_allreduce The global duration of the allreduce operation. flops_profiler.enabled or wall_clock_breakdown. Train/Samples/elapsed_time_ms_step The optimizer step time flops_profiler.enabled or wall_clock_breakdown. tensorboard: [dictionary] Fields Value Default enabled Whether logging to Tensorboard is enabled. false output_path Path to where the Tensorboard logs will be written. If None, the output path is set under the training script’s launching path. null job_name Name for the current job. This will become a new directory inside output_path. "DeepSpeedJobName" Example of tensorboard configuration: "tensorboard": { "enabled": true, "output_path": "output/ds_logs/", "job_name": "train_bert" } wandb: [dictionary] Fields Value Default enabled Whether logging to WandB is enabled. false group Name for the WandB group. This can be used to group together runs. None team Name for the WandB team. None project Name for the WandB project. deepspeed Example of wandb configuration: "wandb": { "enabled": true, "group": "my_group", "team": "my_team", "project": "my_project" } comet: [dictionary] Fields Value Default enabled Whether logging to Comet is enabled. false workspace Comet workspace name. None project Comet project name. None samples_log_interval Metrics will be submitted to Comet after processing every samples_log_intervas samples. 100 experiment_name The name for comet experiment to be used for logging. None api_key Comet API key. It’s not recommended to save the Comet API Key in code. None experiment_key The key for comet experiment to be used for logging. Must be an alphanumeric string whose length is between 32 and 50 characters. None online If True, the data will be logged to Comet server, otherwise it will be stored locally in offline experiment. Default is True. None mode Control how the Comet experiment is started. “get”: Continue logging to an existing experiment identified by the experiment_key value. “create”: Always creates of a new experiment, useful for HPO sweeps. “get_or_create” (default): Starts a fresh experiment if required, or persists logging to an existing one. None Example of comet configuration: "comet": { "enabled": true, "workspace": "my_workspace", "project": "my_project", "samples_log_interval": 50, "experiment_name": "llama-fine-tuning", "experiment_key": "0c4a1c4a90664f2a8084e600b19a9d7", "online": false, "mode": "get", } csv_monitor: [dictionary] Fields Value Default enabled Whether logging to local CSV files is enabled. false output_path Path to where the csv files will be written. If None, the output path is set under the training script’s launching path. null job_name Name for the current job. This will become a new directory inside output_path "DeepSpeedJobName" Example of csv_monitor configuration: "csv_monitor": { "enabled": true, "output_path": "output/ds_logs/", "job_name": "train_bert" } Elastic Training Config (V0.1 and V0.2) "elasticity": { "enabled": true, "max_train_batch_size": "seqlen", "micro_batch_sizes": 8, "min_gpus": 1024, "max_gpus": "fixed_linear", "min_time": "seqlen", "version": 8, "ignore_non_elastic_batch_info": 1024, "num_gpus_per_node": "fixed_linear", "model_parallel_size": MODEL_PARALLEL_SIZE } Field Description Default enabled Enables computation of global batch size in elastic training. false max_train_batch_size Max acceptable batch size can be used in training. 2000 micro_batch_sizes Acceptable micro batch sizes, same as train_micro_batch_size_per_gpu [2,4,6] min_gpus Min number of GPUs to search over when computing highly composite batch size in v0.1 and v0.2. 1 max_gpus Max number of GPUs to search over when computing highly composite batch size in v0.1 and v0.2. 10000 min_time Minimum running time (minutes) before the scheduler will scale again (only used in v0.1). 0 implies it’s unknown 0 prefer_large_batch When finding a suitable batch size, attempt to find one that is closest to the max train batch size given. true version Version of elastic logic to use. 0.2 ignore_non_elastic_batch_info Ignore all batch info provided outside the elastic config. To reduce confusion, we require all batch related info to be given in elastic config only. false num_gpus_per_node Number of GPUs per node. This information is used by v0.2 to support model-parallel training (only used by v0.2) 1 model_parallel_size Tensor or model parallel size (only used by v0.2) 1 Communication Logging DeepSpeed provides a flexible communication logging tool which can automatically detect and record communication operations launched via deepspeed.comm. NOTE: All logging communication calls are synchronized in order to provide accurate timing information. This may hamper performance if your model heavily uses asynchronous communication operations. Once the logs are populated, they can be summarized with deepspeed.comm.log_summary(). For more detail and example usage, see the tutorial comms_logger: [dictionary] Fields Value Default enabled Whether communication logging is enabled. false verbose Whether to immediately print every communication operation false prof_all Whether to profile all operations. true debug Appends the caller function to each communication operation’s log_name. false prof_ops A list of communication operations to log (only the specified ops will be profiled). [] Example of recommended comms_logger configuration: "comms_logger": { "enabled": true, "verbose": false, "prof_all": true, "debug": false } Example of comms_logger configuration for logging specific operations only: "comms_logger": { "enabled": true, "verbose": false, "prof_all": false, "debug": false, "prof_ops": ["all_reduce", "all_gather"] } Compression Note: Compression has seven different components, including layer reduction, weight quantization, activation quantization, sparse pruning, row pruning, head pruning, and channel pruning. We explain them one by one with simple json examples. Read more about how to use the DeepSpeed Compression library in our tutorial. Layer Reduction Note: Layer reduction works much better when using knowledage distillation (learn more in our tutorial): "compression_training": { "layer_reduction": { "enabled": true, "keep_number_layer": 5, "module_name_prefix": "bert.encoder.layer", "teacher_layer": [ 2, 4, 6, 8, 10 ], "other_module_name": [ "bert.pooler", "bert.embeddings", "classifier" ] } } layer_reduction: [dictionary] Fields Value Default enabled: [boolean] Enable layer reduction or not. false keep_number_layer: [list] The number of layer in the model to be kept. N/A module_name_prefix: [str] The (uniform) name prefix of the model’s modules of which the associated weight parameters are to be reinitialized. N/A teacher_layer: [list] The layer of the weight parameters are to be reinitialized. The length of the list equals to ‘keep_number_layer’. N/A other_module_name: [list] The name of modules of which the associated weight parameters are to be reinitialized. It is an complemenatory or alternative of module_name_prefix. For instance, “other_module_name”: [“bert.encoder.layer.2”,”bert.encoder.layer.4”] equals to “module_name_prefix”:”bert.encoder.layer” and “teacher_layer”: [2,4]. N/A Weight Quantization "compression_training": { "weight_quantization": { "shared_parameters":{ "enabled": true, "quantizer_kernel": false, "schedule_offset": 0, "quantize_groups": 1, "quantize_verbose": false, "quantization_type": "symmetric", "rounding": "nearest", "quantize_weight_in_forward": false, "fp16_mixed_quantize":{ "enabled": false, "quantize_change_ratio": 0.001 } }, "different_groups":{ "wq1": { "params": { "start_bits": 8, "target_bits": 8, "quantization_period": 50 }, "modules": [ "attention.self", "intermediate" ] }, "wq2": { "params": { "start_bits": 4, "target_bits": 4, "quantization_period": 50 }, "modules": [ "attention.output" ] } } } } shared_parameters: [dictionary] Shared parameters for all weight quantization groups. Fields Value Default enabled: [boolean] Enable weight quantization or not. false quantizer_kernel: [boolean] Use DeepSpeed quantization kernel for >=4 bit quantization. This can only be enabled when using DeepSpeed FP16 optimizer. false schedule_offset: [integer] Enable weight quantization after scheduled steps (can be treated as warmup steps). 0 quantize_groups: [integer] Split the weight matrix into different number of groups, and each of them has its own scaling factor. 1 quantize_verbose: [boolean] Print the quantization related logs. false quantization_type: [string] Choose the quantization algorithm, symmetric or asymmetric. "symmetric" rounding: [string] Rounding algorithm associated with quantization, nearest or stochastic. "nearest" quantize_weight_in_forward: [boolean] Quantize weight in optimizer or forward step, must set to be true for FP32 optimizer training. false fp16_mixed_quantize: [dictionary] Using the value mixed by FP16 value and the quantized value. N/A enabled: [boolean] Whether fp16 mixed quantization is enabled. false quantize_change_ratio: [float] Initial quantize value ratio, will gradually increase to 1. 0.001 different_groups: [dictionary] Different quantization sets, this is used for different quantization parameters. In this example, we give two different sets. In practice, you can choose the number of sets based on your requirements. Fields Value Default params: [dictionary] start_bits: [integer] Quantization starting bits, will gradaully reduce to target bits. 8 target_bits: [integer] Quantization target bits, need to be <= start_bits. 8 quantization_period: [integer] For every n steps, the quantization bits will be reduce by 1. 1 modules: [list] Scope of weight parameters associated to the params setting. "All Linear and CONV2D layers" Activation Quantization "compression_training": { "activation_quantization": { "shared_parameters":{ "enabled": true, "quantization_type": "asymmetric", "range_calibration": "dynamic", "schedule_offset": 50 }, "different_groups":{ "aq1": { "params": { "bits": 8 }, "modules": [ "attention.output" ] } } } shared_parameters: [dictionary] Shared parameters for all activation quantization groups. Fields Value Default enabled: [boolean] Enable activation quantization or not. false quantization_type: [string] Choose the quantization algorithm, symmetric or asymmetric. "symmetric" range_calibration: [string] Using dynamic (per token or per image) or static (fixed min/max using momentum) for inference. "static" schedule_offset: [integer] Enable activation quantization after scheduled steps (can be treated as warmup steps). 0 different_groups: [dictionary] Different quantization sets, this is used for different quantization parameters. In this example, we give one set. In practice, you can choose the number of sets based on your requirements. Fields Value Default params: [dictionary] bits: [integer] Number of bits used for activation target bits, need to be >= 4. 8 modules: [list] Scope of weight parameters associated to the params setting. "All Linear and CONV2D layers" Sparse Pruning "compression_training": { "sparse_pruning":{ "shared_parameters":{ "enabled": true, "schedule_offset": 30, "method": "l1" }, "different_groups":{ "sp1": { "params": { "dense_ratio": 0.5 }, "modules": [ "attention.self" ] } } } } "compression_training": { "sparse_pruning":{ "shared_parameters":{ "enabled": true, "schedule_offset": 30, "schedule_offset_end": 90, "schedule_offset_stride": 15, "method": "snip_momentum", "block_pattern": "4x1", "dense_ratio": 0.4, "excluded_modules": ['classifier', 'pooler'] }, "different_groups":{ } } } shared_parameters: [dictionary] Shared parameters for all sparse pruning groups. Fields Value Default enabled: [boolean] Enable sparse pruning or not. false schedule_offset: [integer] Enable sparse pruning after scheduled steps (can be treated as warmup steps). 0 schedule_offset_end: [integer] Disable sparse pruning after scheduled steps, mandotory for snip_momentum. 0 schedule_offset_stride: [integer] The stride of pruning on training steps, mandotory for snip_momentum. "1" method: [string] Choose different pruning methods, l1 (static, magnitude based), topk (dynamic, learnable) or snip_momentum (structured pruning). "l1" block_pattern: [string] Choose different structured pruning block patterns, NxM or N:M (N and M are integers). For instance, “4x1” or “2:4” are common block patterns, mandotory for snip_momentum. "4x1" dense_ratio: [float] Used to get the targeted global sparsity ratio, mandotory for snip_momentum. "0.1" excluded_modules: [list] Excluded pruning scope on some special modules like output layer. [] different_groups: [dictionary] Different pruning sets, this is used for different pruning parameters. In this example, we give one set. In practice, you can choose the number of sets based on your requirements. Note for snip_momentum method, you can leave it as empty. Fields Value Default params: [dictionary] dense_ratio: [float] The percentage of weights to keep after pruning. 0.5 modules: [list] Scope of weight parameters associated to the params setting. "All Linear and CONV2D layers" Row Pruning Note: Row Pruning is a feature designed for two back-to-back linear layers (e.g., Feed Forward Network in Transformers). As such, we suggested use row pruning for the first linear layer (i.e., the intermediate.dense layer for BERT). Reducing the row dimension of this matrix can help reducing the column of the follow-up matrix (i.e., layer.\\w+.output.dense layer for BERT). It should also work for other linear layers as well. "compression_training": { "row_pruning":{ "shared_parameters":{ "enabled": true, "schedule_offset": 20, "method": "topk" }, "different_groups":{ "rp1": { "params": { "dense_ratio": 0.5 }, "modules": [ "intermediate.dense" ], "related_modules":[ ["layer.\\w+.output.dense"] ] } } } } shared_parameters: [dictionary] Shared parameters for all row pruning groups. Fields Value Default enabled: [boolean] Enable row pruning or not. false schedule_offset: [integer] Enable row pruning after scheduled steps (can be treated as warmup steps). 0 method: [string] Choose different pruning methods, l1 (static, magnitude based) or topk (dynamic, learnable). "l1" different_groups: [dictionary] Different pruning sets, this is used for different pruning parameters. In this example, we give one set. In practice, you can choose the number of sets based on your requirements. Fields Value Default params: [dictionary] dense_ratio: [float] The percentage of weights to keep after pruning. 0.5 modules: [list] Scope of weight parameters associated to the params setting. "All Linear and CONV2D layers" related_modules: [list[list]] Related module to the row pruned module, which can be performed column pruning. None Head Pruning Note: Head Pruning is a feature designed for two attention layers (e.g., Multi Head Attention in Transformers). For now, it can only be applied to output matrix of the Transformer (i.e., attention.output.dense in BERT). Pruning the output matrix can lead to the pruning of Query/Key/Value matrix as well. "compression_training": { "head_pruning":{ "shared_parameters":{ "enabled": true, "schedule_offset": 10, "method": "topk", "num_heads": 12 }, "different_groups":{ "rp1": { "params": { "dense_ratio": 0.5 }, "modules": [ "attention.output.dense" ], "related_modules":[ ["self.query", "self.key", "self.value"] ] } } } } shared_parameters: [dictionary] Shared parameters for all head pruning groups. Fields Value Default enabled: [boolean] Enable head pruning or not. false schedule_offset: [integer] Enable head pruning after scheduled steps (can be treated as warmup steps). 0 method: [string] Choose different pruning methods. For now, we only support topk (dynamic, learnable). "topk" num_heads: [int] Number of heads (must be provided by user). N/A different_groups: [dictionary] Different pruning sets, this is used for different pruning parameters. In this example, we give one set. In practice, you can choose the number of sets based on your requirements. Fields Value Default params: [dictionary] dense_ratio: [float] The percentage of weights to keep after pruning. 0.5 modules: [list] Scope of weight parameters associated to the params setting. "All Linear and CONV2D layers" related_modules: [list[list]] Related module (Usually Q/K/V) to the head pruned module (i.e., the output matrix). For now, this feature only works for BERT. None Channel Pruning Note: Channel Pruning is a feature designed for two back-to-back CONV2d layers (e.g., residual connection in ResNet). As such, we suggested use channel pruning for the first CONV2d layer. Reducing the number of output channels of this layer can help reducing the number of input channels the follow-up layer. It should also work for other CONV2d layers as well. "compression_training": { "channel_pruning":{ "shared_parameters":{ "enabled": true, "schedule_offset": 0, "method": "topk" }, "different_groups":{ "cp1": { "params": { "dense_ratio": 0.5 }, "modules": [ "layer....conv1" ], "related_modules": [ ["layer....conv2", "layer....bn1"] ] } } } } shared_parameters: [dictionary] Shared parameters for all channel pruning groups. Fields Value Default enabled: [boolean] Enable channel pruning or not. false schedule_offset: [integer] Enable channel pruning after scheduled steps (can be treated as warmup steps). 0 method: [string] Choose different pruning methods, l1 (static, magnitude based) or topk (dynamic, learnable). "l1" different_groups: [dictionary] Different pruning sets, this is used for different pruning parameters. In this example, we give one set. In practice, you can choose the number of sets based on your requirements. Fields Value Default params: [dictionary] dense_ratio: [float] The percentage of weights to keep after pruning. 0.5 modules: [list] Scope of weight parameters associated to the params setting. "All CONV2D layers" related_modules: [list[list]] Related module to the channel pruned module. None Checkpoint options "checkpoint": { "tag_validation"="Warn", "load_universal"=false, "use_node_local_storage"=false, "parallel_write":{ "pipeline_stage": false } } tag_validation: [“Ignore” “Warn” “Fail”] Description Default Enables level of checking to ensure checkpoint tags are consistent across all ranks. Useful when restoring with different world sizes. “Warn” load_universal: [boolean] Description Default Load the latest checkpoint for all. false use_node_local_storage: [boolean] Description Default If true DeepSpeed will store model parameter states and checkpoint states based on local rank allowing checkpoints to be loaded without access to a shared filesystem. false pipeline_stage: [boolean] Description Default Use pipeline stages to parallelize the writing of checkpoints. false Data Type options "data_types": { "grad_accum_dtype"=["fp32"|"fp16"|"bf16"] } } grad_accum_dtype: [“fp32” “fp16” “bf16”] Description Default Specifies the data type in which to do gradient accumulation. If None the default is to match the model type. None

Pattern 8: Monitor Contents Overview Usage Automatic Monitoring Custom Monitoring In this tutorial, we introduce the DeepSpeed Monitor and provide examples of its usage. Overview Usage Overview Monitoring model and system metrics during training is vital to ensure hardware resources are fully utilized. The DeepSpeed Monitor enables live logging of metrics through one or more monitoring backends such as PyTorch’s TensorBoard, WandB, Comet and simple CSV files. Below is a live monitoring view for TensorBoard: Below is a live monitoring view for WandB: Below is a live monitoring view for Comet: Usage The DeepSpeed Monitor is configured within the deepspeed configuration file. DeepSpeed will automatically monitor key training metrics, including those tracked with the wall_clock_breakdown configuration option. In addition, users can log their own custom events and metrics. Automatic Monitoring Custom Monitoring Automatic Monitoring When using DeepSpeed for model training, the Monitor can be configured in the DeepSpeed configuration file. No explicit API calls are needed to use the Monitor. The Monitor can be enabled by adding the following field to DeepSpeed’s configuration json file. Refer to Monitoring for details. { "tensorboard": { "enabled": true, "output_path": "output/ds_logs/", "job_name": "train_bert" } "wandb": { "enabled": true, "team": "my_team", "group": "my_group", "project": "my_project" } "comet": { "enabled": true, "project": "my_project", "experiment_name": "my_experiment" } "csv_monitor": { "enabled": true, "output_path": "output/ds_logs/", "job_name": "train_bert" } } DeepSpeed will automatically log to all available and enabled monitoring backends listed in the config, and will generate live monitoring views such as those listed above. Custom Monitoring In addition to automatic monitoring, users can log their own custom metrics in client scripts. Currently, there are two ways to initialize Monitor objects: (Recommended) - Create a MonitorMaster(ds_config.monitor_config) object, which automatically initializes all monitor backends present in the DeepSpeed configuration Create a specific TensorBoardMonitor(ds_config.monitor_config), WandbMonitor(ds_config.monitor_config), csvMonitor(ds_config.monitor_config) object which will only initialize a specific monitor backend present in the DeepSpeed configuration The steps to create a custom monitor are as follows: Add import to your desired Monitor Initialize monitor with DeepSpeed config’s monitor_config Create a list of one or more 3-tuples in the format [("label", value, ds_engine.global_samples), ...] Call monitor.write_events on the list from step 3 Note - Some Monitor backends don’t support mixed sample values. Be sure to use your DeepSpeed engine object’s global_samples attribute in each 3-tuple For example usage, see the following modified DeepSpeedExamples/cifar example: # Step 1: Import monitor (and DeepSpeed config, if needed) from deepspeed.monitor.monitor import MonitorMaster from deepspeed.runtime.config import DeepSpeedConfig # Step 2: Initialized monitor with DeepSpeed config (get DeepSpeed config object, if needed) ds_config = DeepSpeedConfig("ds_config.json") monitor = MonitorMaster(ds_config.monitor_config) for epoch in range(2): running_loss = 0.0 for i, data in enumerate(trainloader): pre = time.time() inputs, labels = data[0].to(model_engine.local_rank), data[1].to( model_engine.local_rank) if fp16: inputs = inputs.half() outputs = model_engine(inputs) loss = criterion(outputs, labels) model_engine.backward(loss) model_engine.step() post = time.time() # Step 3: Create list of 3-tuple records (single entry in this case) events = [("Time per step", post-pre, model_engine.global_samples)] # Step 4: Call monitor.write_events on the list from step 3 monitor.write_events(events) Updated: November 5, 2025 Previous Next

wall_clock_breakdown

Example Code Patterns

Example 1 (python):

### Create aio_handle
from deepspeed.ops.op_builder import AsyncIOBuilder
aio_handle = AsyncIOBuilder().load().aio_handle()

Reference Files

This skill includes comprehensive documentation in references/:

08.md - 08 documentation
09.md - 09 documentation
2020.md - 2020 documentation
2023.md - 2023 documentation
assets.md - Assets documentation
mii.md - Mii documentation
other.md - Other documentation
tutorials.md - Tutorials documentation

Use view to read specific reference files when detailed information is needed.

Working with This Skill

For Beginners

Start with the getting_started or tutorials reference files for foundational concepts.

For Specific Features

Use the appropriate category reference file (api, guides, etc.) for detailed information.

For Code Examples

The quick reference section above contains common patterns extracted from the official docs.

Resources

references/

Organized documentation extracted from official sources. These files contain:

Detailed explanations
Code examples with language annotations
Links to original documentation
Table of contents for quick navigation

scripts/

Add helper scripts here for common automation tasks.

assets/

Add templates, boilerplate, or example projects here.

Notes

This skill was automatically generated from official documentation
Reference files preserve the structure and examples from source docs
Code examples include language detection for better syntax highlighting
Quick reference patterns are extracted from common usage examples in the docs

Updating

To refresh this skill with updated documentation: 1. Re-run the scraper with the same configuration 2. The skill will be rebuilt with the latest information

Related skills

Microsoft FoundryDeploy, evaluate, and continuously improve Microsoft Foundry agents from a single agent interface.478k1.3k

Ai Research ReproductionOrchestrate trustworthy, auditable reproduction of deep learning repositories directly from their READMEs.164k507

Run TrainSafely execute selected deep learning training commands with standardized evidence capture.164k507

Explore RunSafely run isolated exploratory experiments with clear recording and conservative selection before committing changes.164k507

Paper Context ResolverFetch precise reproduction-critical details like dataset splits, preprocessing steps, or evaluation protocols from the original academic paper when the repo README leav141k507

Repo Intake And PlanScan unfamiliar AI research repositories and receive a minimal, trustworthy reproduction target before investing significant time.140k507

How it compares

Pick deepspeed when the decision is MoE-specific distributed training architecture and published DeepSpeed benchmarks matter more than generic PyTorch DDP setup steps.

FAQ

What does the deepspeed skill cover?

The deepspeed skill curates DeepSpeed MoE and large-model training notes from orchestra-research/ai-research-skills, including MoE NLG architecture, training infrastructure, datasets, and cited DeepSpeed blog benchmarks for distributed stack planning.

Does deepspeed install or run DeepSpeed?

deepspeed does not install DeepSpeed or launch training jobs. The skill is a research lookup that points developers to DeepSpeed-MoE documentation and published results such as 8x larger MoE training and 5x cost reduction claims.

Is Deepspeed safe to install?

skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

Data Science & MLllmautomation