
Huggingface Accelerate
Configure Hugging Face Accelerate plugins and kwargs handlers so multi-GPU or mixed-precision training runs with predictable distributed behavior.
Overview
Hugging Face Accelerate is an agent skill, most often used in the Build phase (and also when iterating in Operate), that configures Accelerate plugins and kwargs handlers for distributed and mixed-precision training.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill huggingface-accelerate
What is this skill?
- Documents custom Accelerate plugin structure and validation hooks
- Covers built-in kwargs patterns such as GradScalerKwargs for FP16 mixed precision
- Explains DDP, FSDP, and DeepSpeed as context for extending beyond defaults
- Shows Accelerator wiring with kwargs_handlers for production training scripts
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
Your training script works on one GPU but breaks or wastes memory when you turn on mixed precision or distributed modes without structured Accelerate configuration.
Who is it for?
Indie ML builders fine-tuning open models who want Accelerate-native configuration instead of raw DistributedDataParallel snippets.
Skip if: you run inference-only deployments with no training loop, or your team is standardized on a framework that does not use Accelerate.
When should I use this skill?
You are implementing or debugging Hugging Face Accelerate training with custom plugins, kwargs handlers, or mixed-precision settings.
What do I get? / Deliverables
You implement validated plugin and kwargs_handler patterns so Accelerator runs FP16 or multi-GPU strategies with explicit, maintainable configuration.
- Accelerate plugin or kwargs_handler configuration
- Documented training launch parameters
- Validated mixed-precision or distributed setup
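A validated configuration object of the kind listed above can be sketched with the standard library alone. This is a minimal sketch, not Accelerate's implementation: `ScalerSettings` and its `to_kwargs` method are hypothetical names, and forwarding only non-default values is an assumption modeled on how Accelerate's kwargs handlers are understood to behave.

```python
from dataclasses import asdict, dataclass


@dataclass
class ScalerSettings:
    """Hypothetical stand-in for a kwargs handler; not part of Accelerate."""

    init_scale: float = 65536.0
    growth_factor: float = 2.0
    backoff_factor: float = 0.5
    growth_interval: int = 2000

    def __post_init__(self):
        # Reject values that would make loss scaling meaningless.
        if self.growth_factor <= 1.0:
            raise ValueError("growth_factor must be > 1.0")
        if not 0.0 < self.backoff_factor < 1.0:
            raise ValueError("backoff_factor must be in (0, 1)")
        if self.growth_interval < 1:
            raise ValueError("growth_interval must be >= 1")

    def to_kwargs(self) -> dict:
        # Return only the fields that differ from the defaults.
        defaults = asdict(type(self)())
        return {k: v for k, v in asdict(self).items() if defaults[k] != v}


settings = ScalerSettings(growth_interval=500)
overrides = settings.to_kwargs()  # only the overridden field is returned
```

Validating in `__post_init__` surfaces a bad value when the configuration is built, before any training process is launched.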
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Training stack setup is core Build backend work for ML products; the same patterns reappear when you iterate models in Operate. Backend covers training orchestration, precision, and distributed strategies—not notebook-only exploration.
Where it fits
Wire GradScalerKwargs before a first multi-GPU fine-tune of an open LLM for your API product.
Tune FP16 scaler growth_interval to reduce overflow-skipped steps and stabilize loss scaling before load-testing inference adapters trained with Accelerate.
Adjust custom plugin validation when adding a new cluster node shape without rewriting the whole training entrypoint.
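The effect of `growth_interval` in the FP16 scenario above can be seen in a small pure-Python simulation. This is a sketch that assumes the update rule documented for PyTorch's gradient scaler: multiply the scale by `backoff_factor` when a step overflows, and multiply it by `growth_factor` after `growth_interval` consecutive clean steps. `LossScaleState` is a hypothetical name, not a PyTorch or Accelerate class.

```python
from dataclasses import dataclass


@dataclass
class LossScaleState:
    """Hypothetical model of dynamic loss scaling; for illustration only."""

    scale: float = 2.0 ** 16
    growth_factor: float = 2.0
    backoff_factor: float = 0.5
    growth_interval: int = 2000
    clean_steps: int = 0  # consecutive steps without inf/NaN gradients

    def update(self, found_overflow: bool) -> float:
        if found_overflow:
            # Overflow: shrink the scale and restart the clean-step count.
            self.scale *= self.backoff_factor
            self.clean_steps = 0
        else:
            self.clean_steps += 1
            if self.clean_steps == self.growth_interval:
                # Enough clean steps in a row: grow the scale.
                self.scale *= self.growth_factor
                self.clean_steps = 0
        return self.scale


state = LossScaleState(growth_interval=3)
for _ in range(3):
    state.update(found_overflow=False)
scale_after_growth = state.scale
scale_after_overflow = state.update(found_overflow=True)
```

A shorter interval lets the scale recover faster after an overflow but also makes repeated overflows (and therefore skipped optimizer steps) more likely.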
How it compares
Skill-backed training configuration guidance—not a hosted training SaaS or a weights download integration.
Common Questions / FAQ
Who is huggingface-accelerate for?
Solo developers and small teams writing PyTorch/Hugging Face training code who need distributed and mixed-precision setup clarity.
When should I use huggingface-accelerate?
While building model training backends, before scaling fine-tunes to multiple GPUs, or when iterating training configs during Operate.
Is huggingface-accelerate safe to install?
The skill is documentation-oriented; review Prism Security Audits and any bundled scripts in the upstream repo before running training on secrets-bearing data.
SKILL.md - Huggingface Accelerate
# Custom Plugins for Accelerate

## Overview

Accelerate allows creating **custom plugins** to extend distributed training strategies beyond built-in options (DDP, FSDP, DeepSpeed).

## Plugin Architecture

### Base Plugin Structure

```python
from dataclasses import dataclass


@dataclass
class CustomPlugin:
    """Custom training plugin."""

    # Plugin configuration
    param1: int = 1
    param2: str = "default"

    def __post_init__(self):
        # Validation logic
        if self.param1 < 1:
            raise ValueError("param1 must be >= 1")
```

### Using Custom Plugin

```python
from accelerate import Accelerator

# Create plugin
custom_plugin = CustomPlugin(param1=4, param2="value")

# Pass to Accelerator
# Illustrative only: `custom_plugin` is not a real Accelerator parameter,
# so this call will not work as written.
accelerator = Accelerator(
    custom_plugin=custom_plugin
)
```

## Built-In Plugin Examples

### 1. GradScalerKwargs (FP16 Configuration)

```python
from accelerate import Accelerator
from accelerate.utils import GradScalerKwargs

# Configure gradient scaler for FP16
scaler_kwargs = GradScalerKwargs(
    init_scale=2.0**16,    # Initial loss scale
    growth_factor=2.0,     # Scale growth rate
    backoff_factor=0.5,    # Scale backoff rate
    growth_interval=2000,  # Steps between scale increases
    enabled=True           # Enable scaler
)

accelerator = Accelerator(
    mixed_precision='fp16',
    kwargs_handlers=[scaler_kwargs]  # Pass as kwargs handler
)
```

**Use case**: Fine-tune FP16 gradient scaling behavior

### 2. DistributedDataParallelKwargs

```python
from accelerate import Accelerator
from accelerate.utils import DistributedDataParallelKwargs

# Configure DDP behavior
ddp_kwargs = DistributedDataParallelKwargs(
    bucket_cap_mb=25,              # Gradient bucketing size
    find_unused_parameters=False,  # Find unused params (slower)
    check_reduction=False,         # Check gradient reduction
    gradient_as_bucket_view=True,  # Memory optimization
    static_graph=False             # Static computation graph
)

accelerator = Accelerator(
    kwargs_handlers=[ddp_kwargs]
)
```

**Use case**: Optimize DDP performance for specific models

### 3. FP8RecipeKwargs (H100 FP8)

```python
from accelerate import Accelerator
from accelerate.utils import FP8RecipeKwargs

# Configure FP8 training (H100)
fp8_recipe = FP8RecipeKwargs(
    backend="te",             # TransformerEngine backend
    margin=0,                 # Scaling margin
    interval=1,               # Scaling interval
    fp8_format="HYBRID",      # E4M3 + E5M2 hybrid
    amax_history_len=1024,    # AMAX history length
    amax_compute_algo="max"   # AMAX computation algorithm
)

accelerator = Accelerator(
    mixed_precision='fp8',
    kwargs_handlers=[fp8_recipe]
)
```

**Use case**: FP8 training on H100-class GPUs

## Custom DeepSpeed Configuration

### ZeRO-3 with CPU Offload

```python
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# Custom DeepSpeed config
ds_plugin = DeepSpeedPlugin(
    zero_stage=3,                     # ZeRO-3
    offload_optimizer_device="cpu",   # CPU offload optimizer
    offload_param_device="cpu",       # CPU offload parameters
    zero3_init_flag=True,             # ZeRO-3 initialization
    zero3_save_16bit_model=True,      # Save FP16 weights
)

accelerator = Accelerator(
    deepspeed_plugin=ds_plugin,
    mixed_precision='bf16'
)
```

### ZeRO-2 with NVMe Offload

```python
from accelerate.utils import DeepSpeedPlugin

# Note: DeepSpeed documents NVMe offload and parameter offload for ZeRO stage 3;
# confirm that your DeepSpeed version accepts this combination with stage 2.
ds_plugin = DeepSpeedPlugin(
    zero_stage=2,
    offload_optimizer_device="nvme",            # NVMe offload
    offload_param_device="nvme",
    offload_optimizer_nvme_path="/local_nvme",  # NVMe mount path (optimizer state)
    offload_param_nvme_path="/local_nvme",      # NVMe mount path (parameters)
)
```

### Custom JSON Config

```python
import json

from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# Load custom DeepSpeed config
with open('deepspeed_config.json', 'r') as f:
    ds_config = json.load(f)

ds_plugin = DeepSpeedPlugin(hf_ds_config=ds_config)

accelerator = Accelerator(deepspeed_plugin=ds_plugin)
```

**Example config** (`deepspeed_config.json`):

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per