
Aoti Debug
Diagnose AOTInductor compile/load crashes, device mismatches, and runtime failures when shipping PyTorch models through the inductor export path.
Overview
AOTI Debug is an agent skill for the Operate phase that diagnoses AOTInductor segfaults, device mismatches, and load/runtime errors from PyTorch aot_compile and package loaders.
Install
npx skills add https://github.com/pytorch/pytorch --skill aoti-debug
What is this skill?
- Routes Triton index-out-of-bounds assertions to a dedicated sub-guide
- Mandates compile-device vs load-device and input shape checks before deeper debugging
- Covers aot_compile, aot_load, aoti_compile_and_package, and aoti_load_package failure modes
- Pattern-first workflow: read the error string, then follow the matching section
- Targets PyTorch _inductor export and AOTInductor packaging workflows
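The pattern-first routing in the list above can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill: `route_error` and the return value `"main guide"` are hypothetical names, and the regular expression assumes the `tmpN` and `ksM` placeholders in the Triton assertion stand for plain identifiers or numbers.

```python
import re

# Triton assertion quoted in SKILL.md:
#   Assertion `index out of bounds: 0 <= tmpN < ksM` failed
# Assumption: tmpN and ksM are placeholders for identifier-like tokens.
TRITON_OOB = re.compile(r"Assertion `index out of bounds: 0 <= \w+ < \w+` failed")


def route_error(message: str) -> str:
    """Return which guide to follow for an error message (hypothetical helper)."""
    if TRITON_OOB.search(message):
        return "triton-index-out-of-bounds.md"
    # Everything else continues with the main sections of SKILL.md.
    return "main guide"


if __name__ == "__main__":
    log_line = "kernel.cu:42: Assertion `index out of bounds: 0 <= tmp4 < ks0` failed"
    print(route_error(log_line))
    print(route_error("Expected out tensor to have device cuda:0, but got cpu instead"))
```

Matching on the error string first keeps the triage cheap: the sub-guide is chosen before any recompilation or flag changes are attempted.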
Adoption & trust: 1.3k installs on skills.sh; 101k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
Your AOT-compiled PyTorch model segfaults or throws device/shape errors at load time and you do not know whether the bug is inductor, Triton, or a compile-vs-runtime mismatch.
Who is it for?
Indie ML engineers exporting PyTorch models with AOTInductor who see crashes or exceptions during aot_load or packaged inference.
Skip if: Greenfield model training, generic Python debugging unrelated to inductor export, or teams not using torch._inductor AOT paths.
When should I use this skill?
Encountering AOTI segfaults, device mismatch errors, constant loading failures, or runtime errors from aot_compile, aot_load, aoti_compile_and_package, or aoti_load_package.
What do I get? / Deliverables
You follow an ordered triage path (device and shape parity first, then pattern-specific fixes) so that the compiled artifact runs on the intended device with matching inputs.
- Root-cause hypothesis tied to device/shape parity or routed sub-guide
- Concrete fix steps for the matched AOTI error pattern
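As a rough illustration of the "device and shape parity first" step, the sketch below compares recorded compile-time settings against load-time and runtime ones. It is a sketch under stated assumptions: `device_type` and `parity_problems` are hypothetical helpers, devices are passed as plain strings such as `"cuda:0"`, and only static shapes are compared (dynamic shape constraints are not modeled).

```python
from typing import List, Sequence, Tuple

Shape = Tuple[int, ...]


def device_type(device: str) -> str:
    """'cuda:1' -> 'cuda'. Only the device type has to match, not the index."""
    return device.split(":", 1)[0]


def parity_problems(
    compile_device: str,
    load_device: str,
    input_devices: Sequence[str],
    compiled_shapes: Sequence[Shape],
    runtime_shapes: Sequence[Shape],
) -> List[str]:
    """Collect device/shape mismatches (hypothetical helper, static shapes only)."""
    problems = []
    if device_type(compile_device) != device_type(load_device):
        problems.append(f"compiled on {compile_device} but loaded on {load_device}")
    for i, dev in enumerate(input_devices):
        if device_type(dev) != device_type(compile_device):
            problems.append(f"input {i} is on {dev}, expected {device_type(compile_device)}")
    if len(compiled_shapes) != len(runtime_shapes):
        problems.append("number of runtime inputs differs from compilation")
    for i, (want, got) in enumerate(zip(compiled_shapes, runtime_shapes)):
        if tuple(want) != tuple(got):
            problems.append(f"input {i} has shape {tuple(got)}, compiled with {tuple(want)}")
    return problems


if __name__ == "__main__":
    for line in parity_problems("cuda:0", "cpu", ["cpu"], [(2, 10)], [(4, 10)]):
        print(line)
```

An empty result means the basic parity checks pass and the pattern-specific sections are the next place to look.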
Journey fit
AOTI failures surface at inference and packaging time—after build—when compiled artifacts misbehave in production or staging runtimes. Segfaults, shape/device errors, and constant-loading failures are classic production error triage, routed through pattern-specific sub-guides.
How it compares
Use instead of unstructured log reading when failures mention AOTI, aot_compile, or inductor packaging—this is procedural triage, not a generic debugger skill.
Common Questions / FAQ
Who is aoti-debug for?
Solo builders and small teams shipping PyTorch models through AOTInductor who need a structured guide when compile/load or packaged inference fails.
When should I use aoti-debug?
Use it in Operate when production or staging inference crashes; in Ship when validating compiled artifacts before release; whenever errors mention aot_compile, aoti_load_package, or Triton index assertions.
Is aoti-debug safe to install?
Review the Security Audits panel on this Prism page before installing; the skill is documentation-driven and does not prescribe secret exfiltration, but verify the repo and SKILL.md in your environment.
SKILL.md
# AOTI Debugging Guide

This skill helps diagnose and fix common AOTInductor issues.

## Error Pattern Routing

**Check the error message and route to the appropriate sub-guide:**

### Triton Index Out of Bounds

If the error matches this pattern:

```
Assertion `index out of bounds: 0 <= tmpN < ksM` failed
```

**→ Follow the guide in `triton-index-out-of-bounds.md`**

### All Other Errors

Continue with the sections below.

---

## First Step: Always Check Device and Shape Matching

**For ANY AOTI error (segfault, exception, crash, wrong output), ALWAYS check these first:**

1. **Compile device == Load device**: The model must be loaded on the same device type it was compiled on
2. **Input devices match**: Runtime inputs must be on the same device as the compiled model
3. **Input shapes match**: Runtime input shapes must match the shapes used during compilation (or satisfy dynamic shape constraints)

```python
# During compilation - note the device and shapes
model = MyModel().eval()           # What device? CPU or .cuda()?
inp = torch.randn(2, 10)           # What device? What shape?
compiled_so = torch._inductor.aot_compile(model, (inp,))

# During loading - device type MUST match compilation
loaded = torch._export.aot_load(compiled_so, "???")  # Must match model/input device above

# During inference - device and shapes MUST match
out = loaded(inp.to("???"))  # Must match compile device, shape must match
```

**If any of these don't match, you will get errors ranging from segfaults to exceptions to wrong outputs.**

## Key Constraint: Device Type Matching

**AOTI requires compile and load to use the same device type.**

- If you compile on CUDA, you must load on CUDA (device index can differ)
- If you compile on CPU, you must load on CPU
- Cross-device loading (e.g., compile on GPU, load on CPU) is NOT supported

## Common Error Patterns

### 1. Device Mismatch Segfault

**Symptom**: Segfault, exception, or crash during `aot_load()` or model execution.

**Example error messages**:

- `The specified pointer resides on host memory and is not registered with any CUDA device`
- Crash during constant loading in AOTInductorModelBase
- `Expected out tensor to have device cuda:0, but got cpu instead`

**Cause**: Compile and load device types don't match (see "First Step" above).

**Solution**: Ensure compile and load use the same device type. If compiled on CPU, load on CPU. If compiled on CUDA, load on CUDA.

### 2. Input Device Mismatch at Runtime

**Symptom**: RuntimeError during model execution.

**Cause**: Input device doesn't match compile device (see "First Step" above).

**Better Debugging**: Run with `AOTI_RUNTIME_CHECK_INPUTS=1` for clearer errors. This flag validates all input properties including device type, dtype, sizes, and strides:

```bash
AOTI_RUNTIME_CHECK_INPUTS=1 python your_script.py
```

This produces actionable error messages like:

```
Error: input_handles[0]: unmatched device type, expected: 0(cpu), but got: 1(cuda)
```

## Debugging CUDA Illegal Memory Access (IMA) Errors

If you encounter CUDA illegal memory access errors, follow this systematic approach:

### Step 1: Sanity Checks

Before diving deep, try these debugging flags:

```bash
AOTI_RUNTIME_CHECK_INPUTS=1
TORCHINDUCTOR_NAN_ASSERTS=1
```

These flags take effect at compilation time (at codegen time):

- `AOTI_RUNTIME_CHECK_INPUTS=1` checks if inputs satisfy the same guards used during compilation
- `TORCHINDUCTOR_NAN_ASSERTS=1` adds codegen before and after each kernel to check for NaN

### Step 2: Pinpoint the CUDA IMA

CUDA IMA errors can be non-deterministic. Use these flags to trigger the error deterministically:

```bash
PYTORCH_NO_CUDA_MEMORY_CACHING=1
CUDA_LAUNCH_BLOCKING=1
```

T