Pytorch Fsdp2

Name: Pytorch Fsdp2
Author: orchestra-research

orchestra-research/ai-research-skills

396 installs
11.2k repo stars
Updated June 16, 2026

pytorch-fsdp2 is a Claude Code skill that helps developers apply PyTorch FSDP2 patterns and Distributed Checkpoint async save so multi-GPU training does not stall on synchronous checkpoint I/O.

About

pytorch-fsdp2 is an AI research skill based on the official PyTorch tutorial for asynchronous saving with Distributed Checkpoint (DCP), last verified November 2024 and updated September 2025. The guide uses torch.distributed.checkpoint.async_save to move checkpoint writes off the critical training path, noting async save copies model state into internal CPU buffers which adds memory overhead. It pairs FSDP2 sharding patterns with DCP recipes so large LLM training jobs avoid pipeline stalls from synchronous torch.save calls. Developers reach for pytorch-fsdp2 when multi-GPU FSDP training pauses noticeably at checkpoint intervals and they need the official async DCP pattern with realistic memory tradeoffs documented.

Summarizes official async DCP recipe: torch.distributed.checkpoint.async_save keeps saves off the critical path
Notes CPU memory overhead from copying model state into internal buffers before async flush
Documents DCP multi-rank parallel save/load and load-time resharding across cluster topologies
Warns that DCP has no backwards-compatibility guarantees—version pins matter
Practical rule: use async save when checkpoint stalls hurt throughput and CPU headroom exists
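A minimal sketch of the async-save pattern summarized above, assuming the caller passes a state dict that DCP can save. `AsyncCheckpointer`, `ckpt_root`, and `save_fn` are hypothetical names introduced for illustration; the only real API used is `torch.distributed.checkpoint.async_save`, which returns a future. The helper waits for the previous save before starting the next one, so at most one staged CPU copy is alive at a time.

```python
import os
from concurrent.futures import Future
from typing import Any, Callable, Dict, Optional


class AsyncCheckpointer:
    """Hypothetical helper: keeps at most one async DCP save in flight."""

    def __init__(self, ckpt_root: str,
                 save_fn: Optional[Callable[..., Future]] = None) -> None:
        if save_fn is None:
            # Imported lazily so the class can be unit-tested with a stub.
            import torch.distributed.checkpoint as dcp
            save_fn = dcp.async_save
        self._save_fn = save_fn
        self._root = ckpt_root
        self._pending: Optional[Future] = None

    def save(self, state_dict: Dict[str, Any], step: int) -> str:
        # Finish the previous save first, so only one staged CPU copy of the
        # state exists at a time.
        self.wait()
        path = os.path.join(self._root, f"step_{step}")
        self._pending = self._save_fn(state_dict, checkpoint_id=path)
        return path

    def wait(self) -> None:
        """Block until the in-flight save (if any) finishes; re-raise its error."""
        if self._pending is not None:
            self._pending.result()
            self._pending = None
```

In a training loop this would be called as `ckpt.save(state, step)` every N steps, with a final `ckpt.wait()` before the process exits so the last checkpoint is not cut short.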

Pytorch Fsdp2 by the numbers

396 all-time installs (skills.sh)
+36 installs in the week ending Jul 18, 2026 (Skillselion tracking)
Ranked #507 of 2,066 Data Science & ML skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/orchestra-research/ai-research-skills --skill pytorch-fsdp2

Installs	396
repo stars	★ 11.2k
Security audit	3 / 3 scanners passed
Last updated	June 16, 2026
Repository	orchestra-research/ai-research-skills ↗

How do you async save FSDP2 checkpoints in PyTorch?

Apply PyTorch FSDP2 patterns and Distributed Checkpoint async save so multi-GPU training does not stall on synchronous checkpoint I/O.

Who is it for?

Distributed training engineers using PyTorch FSDP2 who hit checkpoint I/O stalls and need official async DCP save patterns.

Skip if: you are doing single-GPU fine-tuning, or your team configures checkpointing only through TorchTitan TOML and has no custom FSDP2 training loop.

When should I use this skill?

User asks about FSDP2 async checkpointing, torch.distributed.checkpoint.async_save, or DCP memory overhead during training.

What you get

FSDP2 training loop with async DCP save integration and documented CPU buffer memory tradeoffs.

Async checkpoint save integration
FSDP2 training loop patch
DCP checkpoint directory

By the numbers

Based on official PyTorch async DCP recipe last updated Sep 29, 2025

Files

SKILL.md (Markdown)

Skill: Use PyTorch FSDP2 (`fully_shard`) correctly in a training script

This skill teaches a coding agent how to add PyTorch FSDP2 to a training loop with correct initialization, sharding, mixed precision/offload configuration, and checkpointing.

FSDP2 in PyTorch is exposed primarily via torch.distributed.fsdp.fully_shard and the FSDPModule methods it adds in-place to modules. See: references/pytorch_fully_shard_api.md, references/pytorch_fsdp2_tutorial.md.

---

When to use this skill

Use FSDP2 when:

Your model doesn’t fit on one GPU (parameters + gradients + optimizer state).
You want an eager-mode sharding approach built on DTensor per-parameter sharding, which is more inspectable and gives simpler sharded state dicts than FSDP1.
You may later compose DP with Tensor Parallel using DeviceMesh.

Avoid (or be careful) if:

You need strict backwards-compatible checkpoints across PyTorch versions (DCP warns against this).
You’re forced onto older PyTorch versions without the FSDP2 stack.

Alternatives (when FSDP2 is not the best fit)

DistributedDataParallel (DDP): Use the standard data-parallel wrapper when you want classic distributed data parallel training.
FullyShardedDataParallel (FSDP1): Use the original FSDP wrapper for parameter sharding across data-parallel workers.

Reference: references/pytorch_ddp_notes.md, references/pytorch_fsdp1_api.md.

---

Contract the agent must follow

1. Launch with `torchrun` and set the CUDA device per process (usually via LOCAL_RANK).
2. Apply `fully_shard()` bottom-up, i.e., shard submodules (e.g., Transformer blocks) before the root module.
3. Call `model(input)`, not model.forward(input), so the FSDP2 hooks run (unless you explicitly unshard() or register the forward method).
4. Create the optimizer after sharding and make sure it is built on the DTensor parameters (post-fully_shard).
5. Checkpoint using Distributed Checkpoint (DCP) or the distributed-state-dict helpers, not naïve torch.save(model.state_dict()) unless you deliberately gather to full tensors.

(Each of these rules is directly described in the official API docs/tutorial; see references.)

---

Step-by-step procedure

0) Version & environment sanity

Prefer a recent stable PyTorch release whose documentation covers both FSDP2 (fully_shard) and DCP.
Use torchrun --nproc_per_node <gpus_per_node> ... and ensure RANK, WORLD_SIZE, LOCAL_RANK are visible.

Reference: references/pytorch_fsdp2_tutorial.md (launch commands and setup), references/pytorch_fully_shard_api.md (user contract).

---

1) Initialize distributed and set device

Minimal, correct pattern:

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
Optionally create a DeviceMesh to describe the data-parallel group(s)
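The pattern above could be wrapped as follows. This is a sketch, not the tutorial's code: `TorchrunEnv`, `read_torchrun_env`, and `init_distributed` are hypothetical helper names, and the mesh dimension name `"dp"` is an arbitrary label. It assumes one GPU per process and a launch via torchrun.

```python
import os
from typing import Mapping, NamedTuple


class TorchrunEnv(NamedTuple):
    rank: int
    world_size: int
    local_rank: int


def read_torchrun_env(env: Mapping[str, str] = os.environ) -> TorchrunEnv:
    """Parse the variables torchrun exports; fail loudly if any is missing."""
    missing = [k for k in ("RANK", "WORLD_SIZE", "LOCAL_RANK") if k not in env]
    if missing:
        raise RuntimeError(f"{missing} not set; was the script launched with torchrun?")
    return TorchrunEnv(int(env["RANK"]), int(env["WORLD_SIZE"]), int(env["LOCAL_RANK"]))


def init_distributed(env: Mapping[str, str] = os.environ):
    # Imported here so read_torchrun_env stays usable without a GPU build.
    import torch
    import torch.distributed as dist
    from torch.distributed.device_mesh import init_device_mesh

    info = read_torchrun_env(env)
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(info.local_rank)  # one GPU per process
    # Optional: a 1D mesh over all ranks; "dp" is an arbitrary dimension name.
    mesh = init_device_mesh("cuda", (info.world_size,), mesh_dim_names=("dp",))
    return info, mesh
```

Failing early on a missing variable tends to be easier to debug than a hang inside the first collective.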

Reference: references/pytorch_device_mesh_tutorial.md (why DeviceMesh exists & how it manages process groups).

---

2) Build model on meta device (recommended for very large models)

For big models, initialize on meta, apply sharding, then materialize weights on GPU:

with torch.device("meta"): model = ...
apply fully_shard(...) on submodules, then fully_shard(model)
model.to_empty(device="cuda")
model.reset_parameters() if your model defines it (nn.Module itself does not), or your own init routine
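The four steps above can be sketched as one function. `build_on_meta`, `model_fn`, and `shard_fn` are hypothetical names; `shard_fn` stands for whatever code applies fully_shard to the submodules and then to the root. The re-initialization loop assumes every layer with parameters exposes `reset_parameters()`, as the built-in nn.Linear, nn.Embedding, and nn.LayerNorm do; buffers and custom layers would need their own init routine.

```python
from typing import Any, Callable


def build_on_meta(model_fn: Callable[[], Any],
                  shard_fn: Callable[[Any], None],
                  device: str = "cuda") -> Any:
    # Imported lazily; with fully_shard inside shard_fn this only runs in a
    # launched distributed job.
    import torch

    with torch.device("meta"):
        model = model_fn()          # parameters have shapes but no storage yet
    shard_fn(model)                 # e.g. fully_shard on each block, then on the root
    model.to_empty(device=device)   # allocate uninitialized storage for the local shards
    for module in model.modules():
        reset = getattr(module, "reset_parameters", None)
        if callable(reset):
            reset()                 # re-run each layer's own initializer
    return model
```

Sharding before `to_empty` is what keeps peak memory low: only the local shards are ever allocated on the device.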

Reference: references/pytorch_fsdp2_tutorial.md (migration guide shows this flow explicitly).

---

3) Apply `fully_shard()` bottom-up (wrapping policy = “apply where needed”)

Do not only call fully_shard on the topmost module.

Recommended sharding pattern for transformer-like models:

iterate modules, if isinstance(m, TransformerBlock): fully_shard(m, ...)
then fully_shard(model, ...)

Why:

fully_shard forms “parameter groups” for collective efficiency and excludes params already grouped by earlier calls. Bottom-up gives better overlap and lower peak memory.
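A small sketch of the bottom-up order, written against any object that exposes `modules()` the way nn.Module does. `shard_bottom_up` is a hypothetical helper; `shard_fn` is expected to be `fully_shard` (from `torch.distributed.fsdp`) in real use and is passed in so the ordering logic can be checked without a distributed job.

```python
from typing import Any, Callable, Tuple, Type, Union


def shard_bottom_up(model: Any,
                    block_types: Union[Type, Tuple[Type, ...]],
                    shard_fn: Callable[..., Any],
                    **shard_kwargs: Any) -> int:
    """Apply `shard_fn` to every matching submodule, then to the root.

    Returns the number of submodules sharded before the root.
    """
    blocks = [m for m in model.modules()
              if m is not model and isinstance(m, block_types)]
    # modules() yields parents before children; walking it in reverse
    # therefore reaches nested blocks before the blocks that contain them.
    for block in reversed(blocks):
        shard_fn(block, **shard_kwargs)
    shard_fn(model, **shard_kwargs)
    return len(blocks)
```

Typical use would be `shard_bottom_up(model, TransformerBlock, fully_shard, mesh=mesh)`, where `TransformerBlock` is your model's own block class.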

Reference: references/pytorch_fully_shard_api.md (bottom-up requirement and why).

---

4) Configure `reshard_after_forward` for memory/perf trade-offs

Default behavior:

None means True for non-root modules and False for root modules (good default).

Heuristics:

If you’re memory-bound: keep defaults or force True on many blocks.
If you’re throughput-bound and can afford memory: set reshard_after_forward=False on more modules so backward skips the re-all-gather (the root already defaults to False).
Advanced: pass an int to reshard to a smaller world size after forward (e.g., intra-node); it must be a non-trivial divisor of the shard mesh dimension size.
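For the int form, the API reference asks for a non-trivial divisor of the shard mesh dimension size (neither 1 nor the size itself). A tiny sketch that lists the candidates; `reshard_size_candidates` is a hypothetical helper name:

```python
from typing import List


def reshard_size_candidates(shard_dim_size: int) -> List[int]:
    """Non-trivial divisors of the shard mesh dimension size."""
    return [d for d in range(2, shard_dim_size) if shard_dim_size % d == 0]


print(reshard_size_candidates(16))  # [2, 4, 8]
```

On two 8-GPU nodes sharding over all 16 ranks, `reshard_after_forward=8` would be the intra-node choice: after forward, parameters stay sharded only within each node, so the backward all-gather avoids inter-node traffic.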

Reference: references/pytorch_fully_shard_api.md (full semantics).

---

5) Mixed precision & offload (optional but common)

FSDP2 uses:

mp_policy=MixedPrecisionPolicy(param_dtype=..., reduce_dtype=..., output_dtype=..., cast_forward_inputs=...)
offload_policy=CPUOffloadPolicy() if you want CPU offload

Rules of thumb:

Start with BF16 parameters/reductions on H100/A100-class GPUs (if numerically stable for your model).
Keep reduce_dtype aligned with your gradient reduction expectations.
If you use CPU offload, budget for PCIe/NVLink traffic and runtime overhead.
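One way to keep these choices in a single place is a helper that returns keyword arguments for fully_shard. `fsdp_policy_kwargs` and its flags are hypothetical; `MixedPrecisionPolicy` and `CPUOffloadPolicy` are the classes from `torch.distributed.fsdp`. The `fp32_reduce` switch is an assumption about what you might want to vary, not a recommendation from the docs.

```python
from typing import Any, Dict


def fsdp_policy_kwargs(use_bf16: bool = True,
                       fp32_reduce: bool = False,
                       cpu_offload: bool = False) -> Dict[str, Any]:
    """Build the mp_policy / offload_policy kwargs for fully_shard."""
    import torch  # imported lazily so the module imports without a GPU build
    from torch.distributed.fsdp import CPUOffloadPolicy, MixedPrecisionPolicy

    kwargs: Dict[str, Any] = {}
    if use_bf16:
        kwargs["mp_policy"] = MixedPrecisionPolicy(
            param_dtype=torch.bfloat16,  # dtype of unsharded params in forward/backward
            # Reducing in float32 costs bandwidth but is numerically safer.
            reduce_dtype=torch.float32 if fp32_reduce else torch.bfloat16,
        )
    if cpu_offload:
        kwargs["offload_policy"] = CPUOffloadPolicy()
    return kwargs
```

The same dict should be passed to every fully_shard call, submodules and root alike, for example `fully_shard(block, **fsdp_policy_kwargs())`.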

Reference: references/pytorch_fully_shard_api.md (MixedPrecisionPolicy / OffloadPolicy classes).

---

6) Optimizer, gradient clipping, accumulation

Create the optimizer after sharding so it holds DTensor params.
If you need gradient accumulation / no_sync:
use the FSDP2 mechanism (set_requires_gradient_sync) instead of FSDP1’s no_sync().
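A sketch of gradient accumulation with `set_requires_gradient_sync`. `train_with_accumulation` is a hypothetical function; it assumes `model` is the root module after fully_shard and that the number of micro-batches is a multiple of `accum_steps` (leftover micro-batches would otherwise end with unsynchronized gradients).

```python
from typing import Any, Callable, Iterable, Tuple


def train_with_accumulation(model: Any,
                            optimizer: Any,
                            loss_fn: Callable[[Any, Any], Any],
                            batches: Iterable[Tuple[Any, Any]],
                            accum_steps: int) -> int:
    """Run over `batches`, reducing gradients once per `accum_steps` micro-batches.

    Returns the number of optimizer steps taken.
    """
    optimizer_steps = 0
    for i, (inputs, targets) in enumerate(batches):
        is_last_micro = (i + 1) % accum_steps == 0
        # Skip gradient reduction on intermediate micro-batches; gradients
        # accumulate locally until the last one triggers the sync.
        model.set_requires_gradient_sync(is_last_micro)
        loss = loss_fn(model(inputs), targets) / accum_steps
        loss.backward()
        if is_last_micro:
            optimizer.step()
            optimizer.zero_grad()
            optimizer_steps += 1
    return optimizer_steps
```

Dividing the loss by `accum_steps` keeps the accumulated gradient on the same scale as a single large batch.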

Gradient clipping:

Use the approach shown in the FSDP2 tutorial (“Gradient Clipping and Optimizer with DTensor”), because parameters/gradients are DTensors.

Reference: references/pytorch_fsdp2_tutorial.md.

---

7) Checkpointing: prefer DCP or distributed state dict helpers

Two recommended approaches:

A) Distributed Checkpoint (DCP) — best default

DCP saves/loads from multiple ranks in parallel and supports load-time resharding.
DCP produces multiple files (often at least one per rank) and operates “in place”: the model allocates its storage first, and DCP loads into that storage.

B) Distributed state dict helpers

get_model_state_dict / set_model_state_dict with StateDictOptions(full_state_dict=True, cpu_offload=True, broadcast_from_rank0=True, ...)
For optimizer: get_optimizer_state_dict / set_optimizer_state_dict
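A sketch of option B for the model weights, producing one ordinary file that tools outside DCP can read. `save_full_model` and `load_full_model` are hypothetical names, and `path` is assumed to be on storage every rank can read. Both functions are collective and must be called on all ranks.

```python
def save_full_model(model, path: str) -> None:
    import torch
    import torch.distributed as dist
    from torch.distributed.checkpoint.state_dict import (
        StateDictOptions, get_model_state_dict)

    # Collective call: every rank must enter it. With both flags set, the
    # gathered tensors are offloaded to CPU and only rank 0 receives them.
    options = StateDictOptions(full_state_dict=True, cpu_offload=True)
    full_sd = get_model_state_dict(model, options=options)
    if dist.get_rank() == 0:
        torch.save(full_sd, path)


def load_full_model(model, path: str) -> None:
    import torch
    from torch.distributed.checkpoint.state_dict import (
        StateDictOptions, set_model_state_dict)

    # mmap avoids reading the whole file into RAM on every rank.
    full_sd = torch.load(path, map_location="cpu", mmap=True, weights_only=True)
    # Rank 0's tensors are broadcast and re-sharded onto the DTensor parameters.
    options = StateDictOptions(full_state_dict=True, broadcast_from_rank0=True)
    set_model_state_dict(model, model_state_dict=full_sd, options=options)
```

The optimizer would follow the same shape with get_optimizer_state_dict / set_optimizer_state_dict.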

Avoid:

Saving DTensor state dicts with plain torch.save unless you intentionally convert with DTensor.full_tensor() and manage memory carefully.

References:

references/pytorch_dcp_overview.md (DCP behavior and caveats)
references/pytorch_dcp_recipe.md and references/pytorch_dcp_async_recipe.md (end-to-end usage)
references/pytorch_fsdp2_tutorial.md (DTensor vs DCP state-dict flows)
references/pytorch_examples_fsdp2.md (working checkpoint scripts)

---

Workflow checklists (copy-paste friendly)

Workflow A: Retrofit FSDP2 into an existing training script

[ ] Launch with torchrun and initialize the process group.
[ ] Set the CUDA device from LOCAL_RANK; create a DeviceMesh if you need multi-dim parallelism.
[ ] Build the model (use meta if needed), apply fully_shard bottom-up, then fully_shard(model).
[ ] Create the optimizer after sharding so it captures DTensor parameters.
[ ] Use model(inputs) so hooks run; use set_requires_gradient_sync for accumulation.
[ ] Add DCP save/load via torch.distributed.checkpoint helpers.

Reference: references/pytorch_fsdp2_tutorial.md, references/pytorch_fully_shard_api.md, references/pytorch_device_mesh_tutorial.md, references/pytorch_dcp_recipe.md.

Workflow B: Add DCP save/load (minimal pattern)

[ ] Wrap state in Stateful or assemble state via get_state_dict.
[ ] Call dcp.save(...) from all ranks to a shared path.
[ ] Call dcp.load(...) and restore with set_state_dict.
[ ] Validate any resharding assumptions when loading into a different mesh.
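The checklist can be sketched with the get_state_dict route (the second option in its first item). `save_checkpoint` and `load_checkpoint` are hypothetical names, and the `"model"` / `"optim"` keys are arbitrary; what matters is that save and load use the same keys and that every rank calls both functions with the same `path`.

```python
def save_checkpoint(model, optimizer, path: str) -> None:
    import torch.distributed.checkpoint as dcp
    from torch.distributed.checkpoint.state_dict import get_state_dict

    model_sd, optim_sd = get_state_dict(model, optimizer)
    # Every rank calls save; each writes only its own shards under `path`.
    dcp.save({"model": model_sd, "optim": optim_sd}, checkpoint_id=path)


def load_checkpoint(model, optimizer, path: str) -> None:
    import torch.distributed.checkpoint as dcp
    from torch.distributed.checkpoint.state_dict import (
        get_state_dict, set_state_dict)

    # DCP loads in place, so first build a state dict that carries the
    # current sharding layout, then let DCP fill it.
    model_sd, optim_sd = get_state_dict(model, optimizer)
    state = {"model": model_sd, "optim": optim_sd}
    dcp.load(state, checkpoint_id=path)
    set_state_dict(model, optimizer,
                   model_state_dict=state["model"],
                   optim_state_dict=state["optim"])
```

Because the loaded values are placed into tensors that already have the target layout, the same checkpoint can be read back on a different number of ranks, subject to the resharding check in the last item.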

Reference: references/pytorch_dcp_recipe.md.

Debug checklist (what the agent should check first)

1. All ranks on distinct GPUs? If not, verify torch.cuda.set_device(LOCAL_RANK) and your torchrun flags.
2. Did you accidentally call `forward()` directly? Use model(input) or explicitly unshard() / register forward.
3. Is `fully_shard()` applied bottom-up? If only root is sharded, expect worse memory/perf and possible confusion.
4. Optimizer created at the right time? Must be built on DTensor parameters after sharding.
5. Checkpointing path consistent?

If using DCP, don’t mix with ad-hoc torch.save unless you understand conversions.
Be mindful of PyTorch-version compatibility warnings for DCP.
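For the first check, one possible approach is to gather a (hostname, device index) pair from every rank, for instance with `torch.distributed.all_gather_object`, and look for duplicates. The sketch below covers only the duplicate detection; `find_shared_gpus` and the `Placement` alias are hypothetical names.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Placement = Tuple[str, int]  # (hostname, CUDA device index)


def find_shared_gpus(placements: Iterable[Placement]) -> List[Placement]:
    """Return every (host, device) pair claimed by more than one rank."""
    counts = Counter(placements)
    return sorted(p for p, n in counts.items() if n > 1)


ranks = [("node-a", 0), ("node-a", 1), ("node-b", 0), ("node-a", 0)]
print(find_shared_gpus(ranks))  # [('node-a', 0)]
```

An empty result means each rank has its own GPU; anything else usually points at a missing set_device call or a wrong --nproc_per_node value.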

---

Common issues and fixes

Forward hooks not running → Call model(inputs) (or unshard() explicitly) instead of model.forward(...).
Optimizer sees non-DTensor params → Create optimizer after all fully_shard calls.
Only root module sharded → Apply fully_shard bottom-up on submodules before the root.
Memory spikes after forward → Set reshard_after_forward=True for more modules.
Gradient accumulation desync → Use set_requires_gradient_sync instead of FSDP1’s no_sync().

Reference: references/pytorch_fully_shard_api.md, references/pytorch_fsdp2_tutorial.md.

---

Minimal reference implementation outline (agent-friendly)

The coding agent should implement a script with these labeled blocks:

init_distributed(): init process group, set device
build_model_meta(): model on meta, apply fully_shard, materialize weights
build_optimizer(): optimizer created after sharding
train_step(): forward/backward/step with model(inputs) and DTensor-aware patterns
checkpoint_save/load(): DCP or distributed state dict helpers

Concrete examples live in references/pytorch_examples_fsdp2.md and the official tutorial reference.

---

References

references/pytorch_fsdp2_tutorial.md
references/pytorch_fully_shard_api.md
references/pytorch_ddp_notes.md
references/pytorch_fsdp1_api.md
references/pytorch_device_mesh_tutorial.md
references/pytorch_tp_tutorial.md
references/pytorch_dcp_overview.md
references/pytorch_dcp_recipe.md
references/pytorch_dcp_async_recipe.md
references/pytorch_examples_fsdp2.md
references/torchtitan_fsdp_notes.md (optional, production notes)
references/ray_train_fsdp2_example.md (optional, integration example)

Reference: Getting Started with Fully Sharded Data Parallel (FSDP2) tutorial

Source (official): PyTorch Tutorials, “Getting Started with Fully Sharded Data Parallel (FSDP2)”
https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html
Created: Mar 17, 2022 • Last updated: Sep 02, 2025 • Last verified: Nov 05, 2024

What the tutorial emphasizes

How FSDP2 differs from DDP and FSDP1

FSDP shards parameters, gradients, and optimizer state; parameters are all-gathered for compute and reduce-scattered for grads.
Compared to FSDP1, FSDP2:
uses DTensor per-parameter sharding (more direct manipulation; sharded state dicts)
improves memory management for more deterministic memory behavior
supports extensibility points for custom all-gather (e.g., float8/NF4 use cases)

Model initialization flow (meta-device pattern)

The tutorial’s migration section shows a typical pattern:

initialize model on meta
apply fully_shard to the intended layers (policy expressed by explicit calls)
apply fully_shard to the root module
materialize weights via to_empty(device="cuda"), then run reset_parameters()

State dict workflows

The tutorial describes two main ways:

A) DTensor APIs (manual)

Loading: use distribute_tensor(full_tensor, meta_param.device_mesh, meta_param.placements) then model.load_state_dict(..., assign=True)
Saving: call DTensor.full_tensor() to all-gather; optionally CPU-offload on rank0 to avoid peak GPU memory

B) DCP distributed state-dict helpers (recommended when no custom handling needed)

Loading: set_model_state_dict(..., StateDictOptions(full_state_dict=True, broadcast_from_rank0=True))
Saving: get_model_state_dict(..., StateDictOptions(full_state_dict=True, cpu_offload=True))
Points to pytorch/examples for optimizer state dict save/load with set_optimizer_state_dict / get_optimizer_state_dict

Migration guide mapping

The tutorial explicitly maps FSDP1 concepts to FSDP2:

sharding_strategy ↔ reshard_after_forward (+ 2D mesh for HYBRID)
cpu_offload ↔ offload_policy (CPUOffloadPolicy)
no_sync() ↔ set_requires_gradient_sync
sync_module_states moves to DCP broadcast-from-rank0 flows

Practical takeaways for agents

Express wrapping policy by explicitly applying `fully_shard` to chosen submodules.
Use DCP APIs for flexible checkpointing and resharding unless you must interop with third-party formats.

Reference: `torch.distributed.fsdp.fully_shard` API (FSDP2)

Source (official): PyTorch docs, torch.distributed.fsdp.fully_shard
https://docs.pytorch.org/docs/stable/distributed.fsdp.fully_shard.html
Created: Dec 04, 2024 • Last updated: Oct 13, 2025

Key facts (paraphrased from the API docs)

User contract highlights

fully_shard(model) converts model.parameters() to DTensor at init, then hooks all-gather before forward/backward and free/reshard after.
The optimizer must be initialized with DTensor parameters and step must happen on DTensors.
Call model(input) (not model.forward(input)) so hooks run; otherwise explicitly unshard() or register the forward method for hooking.
Apply fully_shard bottom-up: shard submodules first, then the root module, to form efficient communication groups and enable overlap.
fully_shard “unions” the module type in-place with FSDPModule, enabling methods like unshard() / reshard().

Short excerpt (<= 25 words): “Users generally should not call fully_shard() only on the topmost root module.”

Signature & core args

fully_shard(module, *, mesh=None, reshard_after_forward=None, shard_placement_fn=None, mp_policy=MixedPrecisionPolicy(...), offload_policy=OffloadPolicy(), ignored_params=None)

mesh (DeviceMesh):
1D mesh ⇒ “classic” FSDP sharding, placement (Shard(0),)
2D mesh ⇒ Hybrid sharding (HSDP): sharded across one dim, replicated across the other, placement (Replicate(), Shard(0))
reshard_after_forward:
True: free unsharded params after forward (re-all-gather during backward)
False: keep unsharded params after forward (avoid backward all-gather)
None: defaults to True for non-root, False for root
int: reshard to a smaller world size after forward (must be a non-trivial divisor of the shard mesh dimension size, i.e., neither 1 nor the size itself)
shard_placement_fn: override per-parameter sharding dim (requires even sharding if not dim-0)
ignored_params: parameters not sharded / not moved / not reduced
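To make the 2D case concrete, here is a sketch that derives the HSDP mesh shape from the world size. `hsdp_mesh_shape` and `build_hsdp_mesh` are hypothetical helpers and the dimension names are arbitrary labels; `init_device_mesh` is the real call from `torch.distributed.device_mesh`. Dim 0 replicates and dim 1 shards, matching the (Replicate(), Shard(0)) placement above.

```python
from typing import Tuple


def hsdp_mesh_shape(world_size: int, shard_size: int) -> Tuple[int, int]:
    """Return (replicate, shard) sizes for a 2D HSDP mesh."""
    if shard_size <= 0 or world_size % shard_size != 0:
        raise ValueError(
            f"shard_size={shard_size} must divide world_size={world_size}")
    return (world_size // shard_size, shard_size)


def build_hsdp_mesh(world_size: int, shard_size: int):
    # Lazy import: building the mesh needs a launched distributed job.
    from torch.distributed.device_mesh import init_device_mesh

    # Dim 0 replicates, dim 1 shards; the dimension names are arbitrary labels.
    return init_device_mesh("cuda", hsdp_mesh_shape(world_size, shard_size),
                            mesh_dim_names=("dp_replicate", "dp_shard"))
```

With 16 GPUs on two 8-GPU nodes, `hsdp_mesh_shape(16, 8)` gives `(2, 8)`: shard within a node, replicate across nodes. The resulting mesh is passed as `fully_shard(module, mesh=mesh)`.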

Mixed precision & offload policy classes (same doc page)

`MixedPrecisionPolicy`

Controls:

param_dtype: dtype used for unsharded parameters during forward/backward
reduce_dtype: dtype used for gradient reduction
output_dtype: dtype used for forward output
cast_forward_inputs: whether to cast forward inputs to param_dtype

`OffloadPolicy` and `CPUOffloadPolicy`

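OffloadPolicy is the base class and represents no offloading; it serves only as the default value of offload_policy.

CPUOffloadPolicy(pin_memory=True) offloads sharded parameters, gradients, and optimizer states to CPU; the optimizer step then runs on CPU.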

Practical implications for agents

Bottom-up sharding is not optional: it affects grouping and memory/perf.
Don’t bypass hooks: using model.forward directly breaks all-gather scheduling.
Optimizer construction order matters: construct optimizer after fully_shard.


How it compares

Use pytorch-fsdp2 for FSDP2 async DCP in custom training loops; use distributed-llm-pretraining-torchtitan when checkpointing inside Meta TorchTitan TOML configs.

FAQ

What does PyTorch async DCP save change during training?

pytorch-fsdp2 uses torch.distributed.checkpoint.async_save to move checkpoint writes off the critical training path, though async save first copies model state into internal CPU buffers which increases memory overhead.

When should developers enable async checkpoint saves?

pytorch-fsdp2 recommends async DCP saves when checkpoint stalls significantly impact multi-GPU FSDP2 training throughput and the cluster has enough CPU memory headroom for buffer copies.

Is Pytorch Fsdp2 safe to install?

skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

Data Science & ML, llm, automation
