
PyTorch FSDP2
Apply PyTorch FSDP2 patterns and Distributed Checkpoint async save so multi-GPU training does not stall on synchronous checkpoint I/O.
Overview
pytorch-fsdp2 is an agent skill most often used in Build (also Operate infra) that explains FSDP2 training with Distributed Checkpoint and asynchronous save recipes from official PyTorch docs.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill pytorch-fsdp2
What is this skill?
- Summarizes official async DCP recipe: torch.distributed.checkpoint.async_save keeps saves off the critical path
- Notes CPU memory overhead from copying model state into internal buffers before async flush
- Documents DCP multi-rank parallel save/load and load-time resharding across cluster topologies
- Warns that DCP has no backwards-compatibility guarantees—version pins matter
- Practical rule: use async save when checkpoint stalls hurt throughput and CPU headroom exists
- References the official PyTorch async DCP recipe (created Jul 2024; Sep 2025 doc update cited in SKILL.md)
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
Your multi-GPU training run spends minutes frozen on each checkpoint, or you cannot resume a sharded job after changing GPU count without manual state surgery.
Who is it for?
Solo builders scaling PyTorch training past one GPU who own the training script and checkpoint policy.
Skip if: Single-GPU fine-tuning with occasional torch.save pickles—full DCP/async setup is unnecessary overhead.
When should I use this skill?
You implement or tune FSDP2 training, Distributed Checkpoint save/load, or async checkpointing to reduce training stalls.
What do I get? / Deliverables
You adopt DCP with optional async_save, understand CPU memory tradeoffs, and checkpoint in a resharding-friendly way aligned with current PyTorch recipes.
- Async or sync DCP checkpoint integration in training code
- Documented memory/stall tradeoffs for your save policy
- Reshard-friendly checkpoint layout for future cluster changes
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Large-model training code lives in Build; this skill is shelved there because FSDP2 and DCP are implemented in the training loop before you operate clusters day to day. Backend captures distributed training logic, state dict sharding, and checkpoint orchestration—not frontend or go-to-market work.
Where it fits
Wire async_save into your FSDP2 loop after profiling synchronous checkpoint gaps.
Resume a sharded checkpoint on a different GPU count using DCP load-time resharding assumptions.
Validate that checkpoint policy does not dominate step time before calling a training run production-ready.
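One way to put a number on "does not dominate step time" is to compute the share of wall-clock time a run spends blocked on synchronous saves. The helper below is a minimal sketch; the function name and its parameters are hypothetical and not part of the skill or of PyTorch, and the inputs are assumed to come from your own profiling.

```python
def checkpoint_stall_fraction(step_seconds, save_every_steps, stall_seconds):
    """Fraction of wall-clock time spent blocked on checkpoint saves.

    step_seconds:     average time of one training step (measured)
    save_every_steps: number of steps between checkpoints
    stall_seconds:    time training is frozen per checkpoint (measured)
    """
    if step_seconds <= 0 or save_every_steps <= 0 or stall_seconds < 0:
        raise ValueError("timings must be positive (stall may be zero)")
    compute_seconds = step_seconds * save_every_steps
    return stall_seconds / (compute_seconds + stall_seconds)


# Example: 1.5 s steps, a save every 200 steps, 100 s frozen per save.
fraction = checkpoint_stall_fraction(1.5, 200, 100.0)
print(f"{fraction:.0%} of wall-clock time is checkpoint stall")
```

With these illustrative inputs a quarter of the run is stall, which is the kind of gap that would justify trying async save if CPU memory allows.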
How it compares
Training-scale checkpoint reference—not a generic CI/CD deploy skill or managed cloud trainer UI.
Common Questions / FAQ
Who is pytorch-fsdp2 for?
Developers writing distributed PyTorch training loops who need FSDP2-compatible checkpointing rather than monolithic .pt files.
When should I use pytorch-fsdp2?
Use it in Build while implementing training and checkpoint code; revisit in Operate when tuning cluster resume, topology changes, or stall times on long jobs.
Is pytorch-fsdp2 safe to install?
Skill content references official docs; check this page’s Security Audits and avoid granting broad cluster credentials to agents that only need doc-guided code changes.
SKILL.md
# Reference: Asynchronous Saving with Distributed Checkpoint (DCP) recipe

**Source (official):** PyTorch Tutorials recipe: “Asynchronous Saving with Distributed Checkpoint (DCP)”
https://docs.pytorch.org/tutorials/recipes/distributed_async_checkpoint_recipe.html
Created: Jul 22, 2024 • Last updated: Sep 29, 2025 • Last verified: Nov 05, 2024

## What async checkpointing changes

- Moves checkpointing off the critical training path via `torch.distributed.checkpoint.async_save`
- Introduces extra memory overhead because async save first copies model state into internal CPU buffers

## Practical agent guidance

- Use async save when checkpoint stalls are significant and you have headroom for CPU memory.
- Consider pinned memory strategies described in the recipe if performance matters.

# Reference: Distributed Checkpoint (DCP) overview (torch.distributed.checkpoint)

**Source (official):** PyTorch docs: `torch.distributed.checkpoint`
https://docs.pytorch.org/docs/stable/distributed.checkpoint.html
Created: Nov 16, 2022 • Last updated: Oct 08, 2025

## What DCP does

- Supports saving/loading from **multiple ranks in parallel**
- Handles **load-time resharding**, enabling saving with one cluster topology and loading into another
- Produces **multiple files per checkpoint** (often at least one per rank)
- Operates “in place”: the model allocates storage first; DCP loads into that storage

## Important caveats

- The docs warn: **no guarantees of backwards compatibility** across PyTorch versions for saved `state_dict`s.
- Process-group usage: if you pass a process group, only those ranks should call save/load, and all tensors must belong to that group.

## Where to learn usage

The doc links to official “Getting Started with DCP” and “Asynchronous Saving with DCP” recipes.
# Reference: Getting Started with Distributed Checkpoint (DCP) recipe

**Source (official):** PyTorch Tutorials recipe: “Getting Started with Distributed Checkpoint (DCP)”
https://docs.pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html
Created: Oct 02, 2023 • Last updated: Jul 10, 2025 • Last verified: Nov 05, 2024

## Key ideas shown in the recipe

- DCP saves/loads in parallel, and supports resharding across topologies at load time.
- It provides helpers under `torch.distributed.checkpoint.state_dict` to manage distributed `state_dict` generation/loading.

## Example structure (high level)

- Wrap application state in a `Stateful` object, so DCP automatically calls `state_dict()` / `load_state_dict()`
- Use `dcp.save(...)` / `dcp.load(...)`
- Use `get_state_dict` / `set_state_dict` helpers to correctly obtain and apply model/optimizer state dicts in distributed settings

## Practical agent guidance

If adding checkpointing to an FSDP2 training script, this recipe’s patterns are the safest default.

# Reference: Distributed Data Parallel (DDP) notes

**Source (official):** PyTorch docs: “Distributed Data Parallel”
https://docs.pytorch.org/docs/stable/notes/ddp.html
Last accessed: Jan 30, 2026

## Key points (paraphrased from the notes)

- DDP is the standard PyTorch wrapper for distributed data parallel training.
- Typical usage includes initializing the process group, wrapping the model with `DistributedDataParallel`, and training normally.

# Reference: Getting Started with DeviceMesh (PyTorch tutorial)

**Source (official):** PyTorch Recipes: “Getting Started with DeviceMesh”
https://docs.pytorch.org/tutorials/recipes/distributed_device_mesh.html
Created: Jan 24, 2024 • Last updated: Jul 18, 2025 • Last verified: Nov 05, 2024

## What DeviceMesh is (as defined by the tutorial)

DeviceMesh is a higher-level abstraction that **manages ProcessGroups**, making it easier to set up the right communication groups for multi-dimensional parallelism.
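To make concrete what "managing ProcessGroups" saves you from, the sketch below computes by hand the rank groups for a 2D layout. It is pure Python and an illustration only. The function name `mesh_rank_groups` and the row-major rank layout are assumptions made for this example, not something DeviceMesh exposes.

```python
def mesh_rank_groups(replicate_size, shard_size):
    """Return (shard_groups, replicate_groups) for a row-major 2D rank grid.

    Ranks are laid out as replicate_size rows of shard_size columns.
    Each row is one shard group; each column is one replicate group.
    """
    grid = [
        [row * shard_size + col for col in range(shard_size)]
        for row in range(replicate_size)
    ]
    shard_groups = [list(row) for row in grid]
    replicate_groups = [list(col) for col in zip(*grid)]
    return shard_groups, replicate_groups


# Example: 8 ranks arranged as 2 replicas of 4-way sharding.
shard_groups, replicate_groups = mesh_rank_groups(2, 4)
print(shard_groups)      # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(replicate_groups)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

Without a mesh abstraction, every rank would then have to create a process group for each of these lists (for example with `torch.distributed.new_group`) and remember which ones it belongs to, which is where the bookkeeping starts to get error-prone.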
The tutorial motivation:

- Without DeviceMesh, users must manually compute rank groupings (replicate/shard groups) and create multiple process groups.
- With DeviceMesh, you describe top