
Safe Debug
Diagnose deep-learning training or research stack traces conservatively—likely cause, smallest safe fix, and savepoint guidance—without patching code until you explicitly approve.
Overview
safe-debug is an agent skill most often used in Operate (also Ship, Build) that conservatively diagnoses deep-learning errors and blocks code patches until you explicitly approve them.
Install
npx skills add https://github.com/lllllllama/rigorpilot-skills --skill safe-debug
What is this skill?
- Four-step default protocol: read error, diagnose without edits, state cause and smallest fix, require approval before patching
- Required outputs: diagnosis summary, cause category, conservative fix suggestions, savepoint recommendation when change scope is medium or high
- Category heuristics for CUDA OOM, checkpoint mismatch, distributed/NCCL, device mismatch, and shape errors
- Forbidden: editing before approval, broad refactors, silent routing into exploration
- Companion script supports conservative research debugging without automatic patching
- Default 4-step debug protocol before any repository edit
- Built-in category rules for CUDA OOM, checkpoint mismatch, distributed, device, and shape failures
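The four-step protocol amounts to an approval gate in front of any edit. The following is a minimal sketch of that idea, not the skill's actual implementation; `Diagnosis`, `PatchNotApproved`, and `apply_patch` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Diagnosis:
    """Hypothetical container for steps 1-3 of the protocol."""
    symptom: str        # step 1: the error or symptom as read
    likely_cause: str   # step 3: likely cause
    evidence: str       # step 3: evidence supporting that cause
    smallest_fix: str   # step 3: smallest safe fix, described but not applied


class PatchNotApproved(Exception):
    """Raised when a patch is attempted without explicit approval."""


def apply_patch(diagnosis: Diagnosis, approved: bool = False) -> str:
    # Step 4: refuse to touch repository code unless approval is explicit.
    if not approved:
        raise PatchNotApproved(
            f"Approval required before applying: {diagnosis.smallest_fix}"
        )
    # A real tool would edit files here; this sketch only reports the intent.
    return f"applying: {diagnosis.smallest_fix}"
```

Defaulting `approved` to `False` means a forgotten argument fails closed, which matches the skill's rule against editing before approval.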
Adoption & trust: 32.3k installs on skills.sh; 412 GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
Your training job failed with a CUDA, checkpoint, or shape error and you need a grounded diagnosis—not an agent that immediately edits half the repo.
Who is it for?
Solo researchers hitting reproducible ML failures who want audit-friendly debug steps before changing training code.
Skip if: Greenfield feature implementation, hyperparameter sweeps, or situations where you already approved a large refactor without a savepoint.
When should I use this skill?
Use it when you want a deep-learning research error diagnosed conservatively: the traceback or symptom is analyzed first, the likely cause is explained, the smallest safe fix is suggested, and nothing is patched unless you explicitly authorize it.
What do I get? / Deliverables
You get a diagnosis summary, categorized likely cause, smallest safe fix options, and optional savepoint guidance; patching only follows your explicit approval.
- Diagnosis summary with likely cause category
- Conservative fix suggestions and savepoint recommendation when scope is medium or high
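One way to picture these deliverables is as a small structured report. The sketch below is illustrative only: the function name, the dictionary keys, and the existence of a "low" scope level are assumptions, since the skill itself names only medium and high.

```python
from typing import Dict, List, Union

# Scopes for which the skill asks for a savepoint recommendation.
SAVEPOINT_SCOPES = {"medium", "high"}
# "low" is an assumed third level; the source names only medium and high.
KNOWN_SCOPES = {"low"} | SAVEPOINT_SCOPES


def build_report(
    summary: str, category: str, fixes: List[str], change_scope: str
) -> Dict[str, Union[str, bool, List[str]]]:
    """Assemble the required outputs into a plain dict (hypothetical layout)."""
    if change_scope not in KNOWN_SCOPES:
        raise ValueError(f"unknown change scope: {change_scope!r}")
    return {
        "diagnosis_summary": summary,
        "likely_cause_category": category,
        "conservative_fix_suggestions": list(fixes),
        "savepoint_recommended": change_scope in SAVEPOINT_SCOPES,
    }
```

Keeping the savepoint decision as a derived field, rather than a free-text note, makes it easy to check before any approved patch is applied.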
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Failure diagnosis during live experiments maps to Operate errors as the canonical shelf, though the same protocol applies during Ship testing runs. The Errors subphase captures traceback-driven recovery before infra changes or broad refactors.
Where it fits
Interpret a shape mismatch during first integration of a new loss head before touching model code.
Analyze a failed CI or local verification run and document cause category before merge.
Triage NCCL timeout on a multi-GPU job with savepoint recommendation before retrying.
Classify recurring OOM and suggest batch-size or gradient-checkpointing fixes without broad refactors.
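For the recurring-OOM case, a batch-size fix can often stay in configuration. The sketch below assumes gradient accumulation, a technique the skill does not name, as one way to shrink the per-step batch while keeping the effective batch size unchanged; `accumulation_steps` is a hypothetical helper.

```python
def accumulation_steps(effective_batch: int, micro_batch: int) -> int:
    """Return how many micro-batches to accumulate per optimizer step
    so that micro_batch * steps == effective_batch."""
    if effective_batch <= 0 or micro_batch <= 0:
        raise ValueError("batch sizes must be positive")
    if effective_batch % micro_batch != 0:
        raise ValueError("micro_batch must divide effective_batch evenly")
    return effective_batch // micro_batch
```

For example, `accumulation_steps(64, 16)` returns 4, so a run that ran out of memory at batch size 64 could be retried at 16 per step with four accumulation steps, without touching model code.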
How it compares
Use instead of default agent behavior that patches on the first traceback without categorizing OOM versus checkpoint versus distributed failures.
Common Questions / FAQ
Who is safe-debug for?
Indie builders and researchers running PyTorch-style training who want RigorPilot-style conservative failure analysis compatible with Claude Code, Cursor, or Codex.
When should I use safe-debug?
Use it in Operate when jobs fail in production or iteration loops; during Ship testing when verification runs error; in Build when integration tests surface device or shape bugs—always before unauthorized patches.
Is safe-debug safe to install?
Check the Security Audits panel on this page; the skill prioritizes read-only diagnosis though follow-on approved patches may use shell and filesystem access.
Workflow Chain
Then invoke: run train
SKILL.md - Safe Debug
display_name: Rigor Debug / Rigor Audit
short_description: Rigor Debug / Rigor Audit mode for conservative failure diagnosis before patching.
default_prompt: Diagnose this deep learning research error conservatively. Analyze the traceback or symptom first, explain the likely cause, suggest the smallest safe fix, and do not patch code unless explicitly authorized.

# Debug Policy

## Default protocol

1. read the error or symptom carefully
2. diagnose without editing repository code
3. state the likely cause, evidence, and smallest safe fix
4. require explicit approval before patching

## Required outputs

- diagnosis summary
- likely cause category
- conservative fix suggestions
- savepoint recommendation when change scope is medium or high

## Forbidden behavior

- editing code before approval
- drifting into broad refactor work
- silently routing into exploration

#!/usr/bin/env python3
"""Conservative research debugging without automatic patching."""

from __future__ import annotations

import argparse
import json
from pathlib import Path
from typing import Dict, List

CATEGORY_RULES = [
    ("cuda_oom", ["cuda out of memory", "outofmemoryerror", "oom"]),
    ("checkpoint_mismatch", ["size mismatch", "missing key", "unexpected key", "checkpoint"]),
    ("distributed_issue", ["nccl", "distributed", "ddp", "rank"]),
    ("device_mismatch", ["expected all tensors to be on the same device", "same device"]),
    ("shape_mismatch", ["shape", "dimension", "size mismatch"]),
    ("loss_nan", ["loss is nan", "nan", "not converging"]),
    ("file_missing", ["filenotfounderror", "no such file", "cannot find path"]),
]


def classify_error(text: str) -> str:
    lower = text.lower()
    for category, signals in CATEGORY_RULES:
        if any(signal in lower for signal in signals):
            return category
    if "traceback" in lower or "runtimeerror" in lower or "valueerror" in lower:
        return "runtime_failure"
    return "unknown"


def suggested_actions(category: str) -> List[str]:
    mapping = {
        "cuda_oom": [
            "Check effective batch size, input resolution, and mixed-precision settings before patching model code.",
            "Prefer a configuration-only reduction before touching architecture.",
        ],
        "checkpoint_mismatch": [
            "Verify checkpoint source, model variant, and load strictness assumptions.",
            "Confirm whether the mismatch is expected before introducing compatibility code.",
        ],
        "distributed_issue": [
            "Inspect launch command, world size, and environment variables before patching training logic.",
            "Reproduce with a single process when possible to narrow the issue safely.",
        ],
        "device_mismatch": [
            "Trace where tensors and modules move across CPU and GPU boundaries.",
            "Prefer a minimal device-placement fix over a broad refactor.",
        ],
        "shape_mismatch": [
            "Log tensor shapes at the failing boundary without changing unrelated code paths.",
            "Check config, dataset, and head dimensions before editing model internals.",
        ],
        "loss_nan": [
            "Inspect data ranges, loss inputs, mixed precision, and learning rate before changing architecture.",
            "Use a shorter controlled run to confirm whether NaNs appear at startup or later.",
        ],
        "file_missing": [
            "Validate dataset, checkpoint, and config paths before editing code.",
            "Prefer a path fix or documented setup correction over logic changes.",
        ],
        "runtime_failure": [
            "Trace the failing file and symbol before proposing any patch.",
            "Confirm whether the failure is environment-related, config-related, or code-related.",
        ],
        "unknown": [
            "Collect the full command, stack trace, and recent code change before patching anything.",
            "Narrow the failure surface with the