
PyTorch Lightning
Add Lightning Trainer callbacks for checkpoints, early stopping, and logging without bloating your LightningModule.
Overview
PyTorch Lightning is an agent skill for the Build phase that documents Lightning Trainer callbacks, especially ModelCheckpoint, for training and resuming models.
Install
npx skills add https://github.com/davila7/claude-code-templates --skill pytorch-lightning
What is this skill?
- ModelCheckpoint patterns: monitor val_loss/val_acc, save_top_k, save_last, and custom filename templates
- Trainer integration via L.Trainer(callbacks=[...]) and fit on train/val loaders
- Checkpoint resume and load_from_checkpoint for best_model_path
- Explains callbacks as non-essential logic separated from LightningModule core
- ModelCheckpoint section with save_top_k, save_last, and filename pattern options documented in the skill
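The filename pattern options listed above can be pictured with a small plain-Python sketch. It approximates how a template such as `model-{epoch:02d}-{val_loss:.2f}` might expand, assuming that `auto_insert_metric_name=True` prefixes each placeholder with its metric name. The helper `render_checkpoint_name` is a hypothetical name used only for illustration; it is not part of Lightning and does not cover every case the library handles.

```python
import re


def render_checkpoint_name(template: str, metrics: dict, auto_insert: bool = True) -> str:
    """Approximate the expansion of a checkpoint filename template (illustrative only)."""
    if auto_insert:
        # Turn "{val_loss:.2f}" into "val_loss={val_loss:.2f}" so the metric
        # name appears in the final filename.
        template = re.sub(r"\{(\w+)", r"\1={\1", template)
    return template.format(**metrics) + ".ckpt"


metrics = {"epoch": 3, "val_loss": 0.4567}
print(render_checkpoint_name("model-{epoch:02d}-{val_loss:.2f}", metrics))
print(render_checkpoint_name("model-{epoch:02d}-{val_loss:.2f}", metrics, auto_insert=False))
```

Turning the metric names off gives shorter filenames, while leaving them on makes each checkpoint self-describing when listed in a directory.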
Adoption & trust: 578 installs on skills.sh; 27.8k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
Your training script mixes checkpointing and metric tracking into the model code, making experiments hard to resume and compare.
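To make the separation concrete, here is a minimal plain-Python sketch of the callback idea: the training loop only calls a hook, and the stopping rule lives in its own object. This is a toy illustration under simplifying assumptions, not Lightning's implementation; `PatienceStopper` and `run_training` are hypothetical names, and the "model" is reduced to a list of validation losses.

```python
class PatienceStopper:
    """Hypothetical callback: request a stop after `patience` checks without improvement."""

    def __init__(self, patience: int, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_checks = 0

    def on_validation_end(self, val_loss: float) -> bool:
        """Return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            # Improvement: remember the new best and reset the counter.
            self.best = val_loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience


def run_training(val_losses, callbacks):
    """Toy loop: returns the index of the last epoch that ran."""
    last_epoch = -1
    for epoch, loss in enumerate(val_losses):
        last_epoch = epoch
        # Call every callback's hook; the loop itself knows nothing about patience.
        stop_requests = [cb.on_validation_end(loss) for cb in callbacks]
        if any(stop_requests):
            break
    return last_epoch


print(run_training([1.0, 0.8, 0.81, 0.82, 0.5], [PatienceStopper(patience=2)]))
```

Because the stopping rule is isolated in the callback, swapping it for a different policy does not require touching the loop or the model code.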
Who is it for?
Solo builders training PyTorch Lightning models who need standardized checkpoint and resume behavior during feature development.
Skip if: Pure scikit-learn tabular workflows, inference-only deployment with no training loop, or teams not using Lightning at all.
When should I use this skill?
You are implementing or debugging Lightning Trainer callbacks, especially checkpointing and resume, during model training.
What do I get? / Deliverables
You get callback-based Trainer setups with monitored checkpoints, top-K retention, and load_from_checkpoint flows ready to drop into training scripts.
- ModelCheckpoint configuration snippets
- Trainer.fit callback wiring
- checkpoint load and resume examples
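The top-K retention mentioned above can be sketched in plain Python as "keep the K best scores seen so far and drop the worst one when a new checkpoint arrives". The class below is an illustration of that idea only, not Lightning's actual code; `TopKTracker`, `update`, and `best_path` are hypothetical names, and no files are written or deleted.

```python
from dataclasses import dataclass, field


@dataclass
class TopKTracker:
    """Illustrative stand-in for top-K checkpoint retention (not Lightning's code)."""

    k: int
    mode: str = "min"  # "min" for losses, "max" for accuracies
    kept: dict = field(default_factory=dict)  # checkpoint path -> monitored score

    def update(self, path: str, score: float):
        """Record a checkpoint; return the path that should be discarded, if any."""
        self.kept[path] = score
        if len(self.kept) <= self.k:
            return None
        # The worst entry has the largest score in "min" mode, the smallest in "max" mode.
        pick_worst = max if self.mode == "min" else min
        worst = pick_worst(self.kept, key=self.kept.get)
        del self.kept[worst]
        return worst

    @property
    def best_path(self) -> str:
        """Path of the best checkpoint currently retained."""
        pick_best = min if self.mode == "min" else max
        return pick_best(self.kept, key=self.kept.get)


tracker = TopKTracker(k=2, mode="min")
for path, loss in [("a.ckpt", 0.9), ("b.ckpt", 0.5), ("c.ckpt", 0.7)]:
    dropped = tracker.update(path, loss)
    print(path, "-> dropped:", dropped)
print("best:", tracker.best_path)
```

The `mode` switch mirrors the choice between monitoring a loss (lower is better) and an accuracy (higher is better).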
Journey fit
Model training orchestration is build-time product work for ML features, not launch work or ops monitoring of a live SaaS. The Backend subphase fits training loops, checkpoint I/O, and Trainer configuration that ship as part of the product’s ML stack.
How it compares
Skill reference for Lightning callbacks—not a full MLOps platform skill for Kubeflow, SageMaker, or experiment tracking SaaS.
Common Questions / FAQ
Who is pytorch-lightning for?
PyTorch Lightning is for developers shipping ML features with Lightning who want agent help configuring callbacks, checkpoints, and Trainer.fit without rewriting modules.
When should I use pytorch-lightning?
Use it in Build when you configure L.Trainer, add ModelCheckpoint or related callbacks, tune save_top_k and monitor metrics, or resume training from best_model_path.
Is pytorch-lightning safe to install?
Check the Security Audits panel on this Prism page and inspect the skill files; training skills may suggest code that reads local datasets and writes checkpoint directories.
SKILL.md
# PyTorch Lightning Callbacks

## Overview

Callbacks add functionality to training without modifying the LightningModule. They capture **non-essential logic** like checkpointing, early stopping, and logging.

## Built-In Callbacks

### 1. ModelCheckpoint

**Saves best models during training**:

```python
import lightning as L
from lightning.pytorch.callbacks import ModelCheckpoint

# Save top 3 models based on validation loss
checkpoint = ModelCheckpoint(
    dirpath='checkpoints/',
    filename='model-{epoch:02d}-{val_loss:.2f}',
    monitor='val_loss',
    mode='min',
    save_top_k=3,
    save_last=True,  # Also save last epoch
    verbose=True
)

trainer = L.Trainer(callbacks=[checkpoint])
trainer.fit(model, train_loader, val_loader)
```

**Configuration options**:

```python
checkpoint = ModelCheckpoint(
    monitor='val_acc',  # Metric to monitor
    mode='max',  # 'max' for accuracy, 'min' for loss
    save_top_k=5,  # Keep best 5 models
    save_last=True,  # Save last epoch separately
    every_n_epochs=1,  # Save every N epochs
    save_on_train_epoch_end=False,  # Save on validation end instead
    filename='best-{epoch}-{val_acc:.3f}',  # Naming pattern
    auto_insert_metric_name=False  # Don't auto-add metric to filename
)
```

**Load checkpoint**:

```python
import lightning as L

# `checkpoint` is the ModelCheckpoint configured above;
# `LitModel` is your own LightningModule subclass.

# Load best model
best_model_path = checkpoint.best_model_path
model = LitModel.load_from_checkpoint(best_model_path)

# Resume training
trainer = L.Trainer(callbacks=[checkpoint])
trainer.fit(model, train_loader, val_loader, ckpt_path='checkpoints/last.ckpt')
```

### 2. EarlyStopping

**Stops training when metric stops improving**:

```python
import lightning as L
from lightning.pytorch.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor='val_loss',
    patience=5,  # Wait 5 validation checks (one per epoch by default)
    mode='min',
    min_delta=0.001,  # Minimum change to qualify as improvement
    verbose=True,
    strict=True,  # Crash if monitored metric not found
    check_on_train_epoch_end=False  # Check on validation end
)

trainer = L.Trainer(callbacks=[early_stop])
trainer.fit(model, train_loader, val_loader)
# Stops automatically if no improvement for 5 validation checks
```

**Advanced usage**:

```python
early_stop = EarlyStopping(
    monitor='val_loss',
    patience=10,
    min_delta=0.0,
    verbose=True,
    mode='min',
    stopping_threshold=0.1,  # Stop if val_loss < 0.1
    divergence_threshold=5.0,  # Stop if val_loss > 5.0
    check_finite=True  # Stop on NaN/Inf
)
```

### 3. LearningRateMonitor

**Logs learning rate**:

```python
import lightning as L
from lightning.pytorch.callbacks import LearningRateMonitor

lr_monitor = LearningRateMonitor(
    logging_interval='epoch',  # Or 'step'
    log_momentum=True  # Also log momentum
)

trainer = L.Trainer(callbacks=[lr_monitor])
# Learning rate automatically logged to TensorBoard/WandB
```

### 4. TQDMProgressBar

**Customizes progress bar**:

```python
import lightning as L
from lightning.pytorch.callbacks import TQDMProgressBar

progress_bar = TQDMProgressBar(
    refresh_rate=10,  # Update every 10 batches
    process_position=0
)

trainer = L.Trainer(callbacks=[progress_bar])
```

### 5. GradientAccumulationScheduler

**Dynamic gradient accumulation**:

```python
import lightning as L
from lightning.pytorch.callbacks import GradientAccumulationScheduler

# Accumulate fewer batches as training progresses
accumulator = GradientAccumulationScheduler(
    scheduling={
        0: 8,  # Epochs 0-4: accumulate 8 batches
        5: 4,  # Epochs 5-9: accumulate 4 batches
        10: 2  # Epochs 10+: accumulate 2 batches
    }
)

trainer = L.Trainer(callbacks=[accumulator])
```

### 6. StochasticWeightAveraging (SWA)

**Averages weights for better generalization**:

```python
from lightning.pytorch.callbacks import StochasticWeightAveraging

swa = StochasticWeightAveraging(
    swa_lrs=1e-2,  # SWA learning rate
    swa_epoch_start=0.8,  # Start at 80% of training
    annealing_epochs=10,  # Annealing period
    annealing_strategy=