Aoti Debug

Name: Aoti Debug
Author: pytorch

pytorch/pytorch

Diagnose AOTInductor compile/load crashes, device mismatches, and runtime failures when shipping PyTorch models through the inductor export path.

Overview

AOTI Debug is an agent skill for the Operate phase that diagnoses AOTInductor segfaults, device mismatches, and load/runtime errors from PyTorch aot_compile and package loaders.

Install

npx skills add https://github.com/pytorch/pytorch --skill aoti-debug

What is this skill?

Routes Triton index-out-of-bounds assertions to a dedicated sub-guide
Mandates compile-device vs load-device and input shape checks before deeper debugging
Covers aot_compile, aot_load, aoti_compile_and_package, and aoti_load_package failure modes
Pattern-first workflow: read the error string, then follow the matching section
Targets PyTorch _inductor export and AOTInductor packaging workflows

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Adoption & trust: 1.3k installs on skills.sh; 101k GitHub stars; 3/3 security scanners passed (skills.sh audits).

What problem does it solve?

Your AOT-compiled PyTorch model segfaults or throws device/shape errors at load time and you do not know whether the bug is inductor, Triton, or a compile-vs-runtime mismatch.

Who is it for?

Indie ML engineers exporting PyTorch models with AOTInductor who see crashes or exceptions during aot_load or packaged inference.

Skip if: Greenfield model training, generic Python debugging unrelated to inductor export, or teams not using torch._inductor AOT paths.

When should I use this skill?

Encountering AOTI segfaults, device mismatch errors, constant loading failures, or runtime errors from aot_compile, aot_load, aoti_compile_and_package, or aoti_load_package.

What do I get? / Deliverables

You follow an ordered triage path (device and shape parity first, then pattern-specific fixes) so that the compiled artifact runs on the intended device with matching inputs.

Root-cause hypothesis tied to device/shape parity or routed sub-guide
Concrete fix steps for the matched AOTI error pattern


Journey fit

Primary fit: Operate (Error tracking)

AOTI failures surface at inference and packaging time—after build—when compiled artifacts misbehave in production or staging runtimes. Segfaults, shape/device errors, and constant-loading failures are classic production error triage, routed through pattern-specific sub-guides.

Also useful: Ship (Testing & QA); Build (Backend, data & payments)

How it compares

Use instead of unstructured log reading when failures mention AOTI, aot_compile, or inductor packaging—this is procedural triage, not a generic debugger skill.

Common Questions / FAQ

Who is aoti-debug for?

Solo builders and small teams shipping PyTorch models through AOTInductor who need a structured guide when compile/load or packaged inference fails.

When should I use aoti-debug?

Use it in Operate when production or staging inference crashes; in Ship when validating compiled artifacts before release; whenever errors mention aot_compile, aoti_load_package, or Triton index assertions.

Is aoti-debug safe to install?

Review the Security Audits panel on this Prism page before installing; the skill is documentation-driven and does not prescribe secret exfiltration, but verify the repo and SKILL.md in your environment.

SKILL.md


# AOTI Debugging Guide

This skill helps diagnose and fix common AOTInductor issues.

## Error Pattern Routing

**Check the error message and route to the appropriate sub-guide:**

### Triton Index Out of Bounds
If the error matches this pattern:
```
Assertion `index out of bounds: 0 <= tmpN < ksM` failed
```
**→ Follow the guide in `triton-index-out-of-bounds.md`**

### All Other Errors
Continue with the sections below.
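
The routing rule above can be sketched as a small pattern match. This is a minimal illustration, not part of the skill itself: the function name `route_error` and the `"general"` label are hypothetical, and the regex assumes `tmpN` and `ksM` stand for generated identifiers such as `tmp4` and `ks0`.

```python
import re

# Pattern taken from the assertion text above. tmpN / ksM are assumed to be
# generated identifiers (for example tmp4, ks0), so any word token is accepted.
TRITON_OOB = re.compile(r"index out of bounds: 0 <= \w+ < \w+")


def route_error(message: str) -> str:
    """Return the sub-guide to follow for an error message."""
    if TRITON_OOB.search(message):
        return "triton-index-out-of-bounds.md"
    # Hypothetical label meaning "continue with the sections below".
    return "general"


print(route_error("Assertion `index out of bounds: 0 <= tmp4 < ks0` failed"))
print(route_error("Expected out tensor to have device cuda:0, but got cpu instead"))
```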

---

## First Step: Always Check Device and Shape Matching

**For ANY AOTI error (segfault, exception, crash, wrong output), ALWAYS check these first:**

1. **Compile device == Load device**: The model must be loaded on the same device type it was compiled on
2. **Input devices match**: Runtime inputs must be on the same device as the compiled model
3. **Input shapes match**: Runtime input shapes must match the shapes used during compilation (or satisfy dynamic shape constraints)

```python
import torch

# During compilation - note the device and shapes
model = MyModel().eval()           # What device? CPU or .cuda()?
inp = torch.randn(2, 10)           # What device? What shape?
compiled_so = torch._inductor.aot_compile(model, (inp,))

# During loading - device type MUST match compilation
loaded = torch._export.aot_load(compiled_so, "???")  # Must match model/input device above

# During inference - device and shapes MUST match
out = loaded(inp.to("???"))  # Must match compile device, shape must match
```

**If any of these don't match, you will get errors ranging from segfaults to exceptions to wrong outputs.**
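
One way to make the three checks mechanical is to compare the compile-time example inputs with the runtime inputs before calling the loaded model. The sketch below is an assumption-laden illustration: `parity_problems` and `device_type` are hypothetical helpers, it only reads `.device` and `.shape` (so it should also accept `torch.Tensor` objects), and it treats every shape as static, so it will report dynamic dimensions as mismatches even when they satisfy the compile-time constraints.

```python
from types import SimpleNamespace


def device_type(device) -> str:
    """Reduce 'cuda:1' (or an object that prints like it) to 'cuda'."""
    return str(device).split(":", 1)[0]


def parity_problems(compile_inputs, runtime_inputs, compile_device):
    """List device-type and shape mismatches; an empty list means parity."""
    problems = []
    if len(compile_inputs) != len(runtime_inputs):
        problems.append(
            f"input count: compiled with {len(compile_inputs)}, got {len(runtime_inputs)}"
        )
    want = device_type(compile_device)
    for i, (c, r) in enumerate(zip(compile_inputs, runtime_inputs)):
        got = device_type(r.device)
        if got != want:
            problems.append(f"input {i}: device type {got}, compiled for {want}")
        # Static comparison only; dynamic shape constraints are not modeled here.
        if tuple(r.shape) != tuple(c.shape):
            problems.append(
                f"input {i}: shape {tuple(r.shape)}, compiled with {tuple(c.shape)}"
            )
    return problems


# Stand-in objects so the sketch runs without a GPU or a compiled artifact.
compiled_with = [SimpleNamespace(device="cpu", shape=(2, 10))]
called_with = [SimpleNamespace(device="cuda:0", shape=(4, 10))]
for problem in parity_problems(compiled_with, called_with, "cpu"):
    print(problem)
```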

## Key Constraint: Device Type Matching

**AOTI requires compile and load to use the same device type.**

- If you compile on CUDA, you must load on CUDA (device index can differ)
- If you compile on CPU, you must load on CPU
- Cross-device loading (e.g., compile on GPU, load on CPU) is NOT supported
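
A fail-fast guard in front of the loader can turn a potential segfault into a readable exception. This is a sketch under assumptions: `check_load_device` is a hypothetical helper, and it takes device strings such as `"cuda:0"`, comparing only the part before the colon because the device index is allowed to differ.

```python
def check_load_device(compile_device: str, load_device: str) -> str:
    """Return load_device if its type matches compile_device, else raise."""
    compiled_type = compile_device.split(":", 1)[0]
    load_type = load_device.split(":", 1)[0]
    if compiled_type != load_type:
        raise ValueError(
            f"compiled for {compiled_type!r} but loading on {load_type!r}; "
            "cross-device loading is not supported"
        )
    return load_device


# Same type, different index: accepted.
print(check_load_device("cuda:0", "cuda:1"))

# Different type: rejected before the loader is ever called.
try:
    check_load_device("cuda:0", "cpu")
except ValueError as err:
    print(err)
```

The returned string can be passed straight to the loader as its device argument, so the check sits on the same line as the load call.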

## Common Error Patterns

### 1. Device Mismatch Segfault

**Symptom**: Segfault, exception, or crash during `aot_load()` or model execution.

**Example error messages**:
- `The specified pointer resides on host memory and is not registered with any CUDA device`
- Crash during constant loading in AOTInductorModelBase
- `Expected out tensor to have device cuda:0, but got cpu instead`

**Cause**: Compile and load device types don't match (see "First Step" above).

**Solution**: Ensure compile and load use the same device type. If compiled on CPU, load on CPU. If compiled on CUDA, load on CUDA.

### 2. Input Device Mismatch at Runtime

**Symptom**: RuntimeError during model execution.

**Cause**: Input device doesn't match compile device (see "First Step" above).

**Better Debugging**: Run with `AOTI_RUNTIME_CHECK_INPUTS=1` for clearer errors. This flag validates all input properties including device type, dtype, sizes, and strides:
```bash
AOTI_RUNTIME_CHECK_INPUTS=1 python your_script.py
```

This produces actionable error messages like:
```
Error: input_handles[0]: unmatched device type, expected: 0(cpu), but got: 1(cuda)
```
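
When these messages end up in logs, a small parser can pull out which input is wrong and where it should live. The sketch below assumes the message format matches the example above exactly; other PyTorch versions may word it differently, and `parse_device_mismatch` is a hypothetical helper name.

```python
import re

# Built from the example message above; the format is an assumption and may
# vary between PyTorch versions.
MISMATCH = re.compile(
    r"input_handles\[(\d+)\]: unmatched device type, "
    r"expected: \d+\((\w+)\), but got: \d+\((\w+)\)"
)


def parse_device_mismatch(message: str):
    """Return input index plus expected/actual device type, or None."""
    match = MISMATCH.search(message)
    if match is None:
        return None
    return {
        "input_index": int(match.group(1)),
        "expected": match.group(2),
        "got": match.group(3),
    }


info = parse_device_mismatch(
    "Error: input_handles[0]: unmatched device type, expected: 0(cpu), but got: 1(cuda)"
)
if info is not None:
    print(f"move input {info['input_index']} from {info['got']} to {info['expected']}")
```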


## Debugging CUDA Illegal Memory Access (IMA) Errors

If you encounter CUDA illegal memory access errors, follow this systematic approach:

### Step 1: Sanity Checks

Before diving deep, try these debugging flags:

```bash
AOTI_RUNTIME_CHECK_INPUTS=1
TORCHINDUCTOR_NAN_ASSERTS=1
```

These flags take effect at compilation time (more precisely, at codegen time), so set them in the process that compiles the model:

- `AOTI_RUNTIME_CHECK_INPUTS=1` checks if inputs satisfy the same guards used during compilation
- `TORCHINDUCTOR_NAN_ASSERTS=1` adds codegen before and after each kernel to check for NaN

### Step 2: Pinpoint the CUDA IMA

CUDA IMA errors can be non-deterministic. Use these flags to trigger the error deterministically:

```bash
PYTORCH_NO_CUDA_MEMORY_CACHING=1
CUDA_LAUNCH_BLOCKING=1
```
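
The flags from Step 1 and Step 2 can be applied together by launching the script in a child process with a prepared environment. This is a minimal sketch: `debug_env` and `run_with_debug_flags` are hypothetical helpers, and the script passed in is assumed to both compile and run the model, since the Step 1 flags only matter in the compiling process.

```python
import os
import subprocess
import sys

# Flag names are the ones listed in Step 1 and Step 2 above.
DEBUG_FLAGS = {
    "AOTI_RUNTIME_CHECK_INPUTS": "1",
    "TORCHINDUCTOR_NAN_ASSERTS": "1",
    "PYTORCH_NO_CUDA_MEMORY_CACHING": "1",
    "CUDA_LAUNCH_BLOCKING": "1",
}


def debug_env(base=None):
    """Return a copy of the environment with the debugging flags set."""
    env = dict(os.environ if base is None else base)
    env.update(DEBUG_FLAGS)
    return env


def run_with_debug_flags(script: str) -> int:
    """Run a script under the debugging flags and return its exit code."""
    result = subprocess.run([sys.executable, script], env=debug_env(), check=False)
    return result.returncode


# Show which variables the child process would see.
for name in sorted(DEBUG_FLAGS):
    print(f"{name}={debug_env()[name]}")
```

Setting the variables for a child process, rather than mutating `os.environ` after imports, avoids relying on when each flag is read.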


