
Agent Architecture Audit
Run a structured 12-layer audit before you ship an agent or any LLM feature that uses tools, memory, or multi-step loops.
Overview
Agent Architecture Audit is an agent skill, most often used in Ship (also Operate), that audits the 12-layer agent stack for wrapper regression, memory pollution, tool failures, and hidden repair loops, and reports severity-ranked findings.
Install
npx skills add https://github.com/affaan-m/everything-claude-code --skill agent-architecture-audit
What is this skill?
- 12-layer agent stack diagnostic with severity-ranked findings
- Detects wrapper regression, stale memory, and hidden repair/retry loops
- Code-first fix recommendations for tool calling and multi-step workflows
- Mandatory gate before releasing agent or LLM-powered apps to production
- Complements agent-introspection-debugging: use that skill for general debugging and this one when the root cause is architectural
Adoption & trust: 1.2k installs on skills.sh; 210k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
Your agent works in a demo but degrades in production, and you cannot tell whether wrappers, memory, tools, or silent retries are corrupting behavior.
Who is it for?
Indie builders releasing tool-using or memory-backed agents who need a pre-production architectural pass instead of endless prompt tweaking.
Skip if: you need general application debugging without an agent-stack focus (use agent-introspection-debugging) or routine static code review without LLM workflow concerns.
When should I use this skill?
When releasing agent or LLM apps to production; when shipping tool calling, memory, or multi-step workflows; when an agent degrades after you add wrapper layers; or when you have debugged agent behavior for more than 15 minutes without finding a root cause.
What do I get? / Deliverables
You get a prioritized architectural findings report with concrete code fixes so you can ship or iterate without masking failures behind wrapper layers.
- Severity-ranked architectural findings
- Code-first remediation guidance
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Production release of agent apps is a Ship concern; this skill is shelved under security because it targets wrapper regression, hidden retries, and failure modes that become production incidents. Security subphase fits pre-release hardening and architectural failure modes (tool discipline, memory pollution) rather than generic unit testing.
Where it fits
Run the 12-layer audit the week before turning on a paid tier for your coding agent.
Compare findings across two agent variants that share tools but use different memory schemas.
Investigate user reports that tools are flaky after a deploy that only changed prompt wrappers.
Validate a new tool-definition layer before merging the branch that adds five new MCP tools.
How it compares
Use instead of ad-hoc prompt tweaks when symptoms point to architecture (memory, tools, wrappers), not a single bad function.
Common Questions / FAQ
Who is agent-architecture-audit for?
Solo and indie developers building agent applications, autonomous loops, or LLM features with tools, memory, or multi-step workflows who are close to production.
When should I use agent-architecture-audit?
Before releasing an agent app, when adding wrapper or memory layers, when behavior in the playground diverges from behavior inside your wrapper, or when debugging agent issues stalls without a root cause. Use it in Ship for release gates and in Operate when production behavior regresses.
Is agent-architecture-audit safe to install?
Review the Security Audits panel on this skill’s Prism page and treat Read, Write, Edit, Bash, Grep, and Glob as full-repo access during the audit run.
SKILL.md
# Agent Architecture Audit

A diagnostic workflow for agent systems that hide failures behind wrapper layers, stale memory, retry loops, or transport/rendering mutations.

## When to Activate

**MANDATORY for:**

- Releasing any agent or LLM-powered application to production
- Shipping features with tool calling, memory, or multi-step workflows
- Agent behavior degrades after adding wrapper layers
- User reports "the agent is getting worse" or "tools are flaky"
- Same model works in playground but breaks inside your wrapper
- Debugging agent behavior for more than 15 minutes without finding root cause

**Especially critical when:**

- You've added new prompt layers, tool definitions, or memory systems
- Different agents in your system behave inconsistently
- The model was fine yesterday but is hallucinating today
- You suspect hidden repair/retry loops silently mutating responses

**Do not use for:**

- General code debugging: use `agent-introspection-debugging`
- Code review: use language-specific reviewer agents
- Security scanning: use `security-review` or `security-review/scan`
- Agent performance benchmarking: use `agent-eval`
- Writing new features: use the appropriate workflow skill

## The 12-Layer Stack

Every agent system has these layers.
Any of them can corrupt the answer:

| # | Layer | What Goes Wrong |
|---|-------|----------------|
| 1 | System prompt | Conflicting instructions, instruction bloat |
| 2 | Session history | Stale context injection from previous turns |
| 3 | Long-term memory | Pollution across sessions, old topics in new conversations |
| 4 | Distillation | Compressed artifacts re-entering as pseudo-facts |
| 5 | Active recall | Redundant re-summary layers wasting context |
| 6 | Tool selection | Wrong tool routing, model skips required tools |
| 7 | Tool execution | Hallucinated execution: claims to call but doesn't |
| 8 | Tool interpretation | Misread or ignored tool output |
| 9 | Answer shaping | Format corruption in final response |
| 10 | Platform rendering | Transport-layer mutation (UI, API, CLI mutates valid answers) |
| 11 | Hidden repair loops | Silent fallback/retry agents running second LLM pass |
| 12 | Persistence | Expired state or cached artifacts reused as live evidence |

## Common Failure Patterns

### 1. Wrapper Regression

The base model produces correct answers, but the wrapper layers make it worse.

**Symptoms:**

- Model works fine in playground or direct API call, breaks in your agent
- Added a new prompt layer, existing behavior degraded
- Agent sounds confident but is confidently wrong
- "It was working before the last update"

### 2. Memory Contamination

Old topics leak into new conversations through history, memory retrieval, or distillation.

**Symptoms:**

- Agent brings up unrelated past topics
- User corrections don't stick (old memory overwrites new)
- Same-session artifacts re-enter as pseudo-facts
- Memory grows without bound, degrading response quality over time

### 3. Tool Discipline Failure

Tools are declared in the prompt but not enforced in code. The model skips them or hallucinates execution.
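
To make the "declared in the prompt but not enforced in code" gap concrete, here is a minimal Python sketch of a code-level check. It is an illustration under stated assumptions, not part of the skill itself: `ToolCall`, `ModelReply`, `enforce_required_tool`, and the tool name `search_docs` are all hypothetical, and a real agent SDK will expose execution records in its own shape. The idea is to trust only what the runtime recorded as executed, never what the reply text claims.

```python
from dataclasses import dataclass, field
from typing import List


class ToolDisciplineError(Exception):
    """Raised when a reply violates a code-level tool requirement."""


@dataclass
class ToolCall:
    # Hypothetical record written by the runtime when a tool actually runs.
    name: str
    output: str


@dataclass
class ModelReply:
    # Hypothetical shape of one model turn; real SDKs differ.
    text: str
    executed_calls: List[ToolCall] = field(default_factory=list)


def enforce_required_tool(reply: ModelReply, required_tool: str) -> ToolCall:
    """Return the executed call for required_tool, or raise.

    Only the runtime's execution records count; the reply text is ignored.
    """
    for call in reply.executed_calls:
        if call.name == required_tool:
            return call
    raise ToolDisciplineError(
        f"reply did not execute required tool {required_tool!r}"
    )


# A reply that claims a lookup in prose but has no execution record.
hallucinated = ModelReply(text="I called search_docs and found the answer.")

# A reply backed by a real execution record.
grounded = ModelReply(
    text="Found it.",
    executed_calls=[ToolCall(name="search_docs", output="3 results")],
)

try:
    enforce_required_tool(hallucinated, "search_docs")
    rejected = False
except ToolDisciplineError:
    rejected = True

accepted_call = enforce_required_tool(grounded, "search_docs")
```

In this sketch the hallucinated reply is rejected even though its text sounds confident, because the check never reads `text`. Failing loudly here, rather than retrying silently, also avoids creating a hidden repair loop (layer 11).
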

**Symptoms:**

- "Must use tool X" in prompt, but model answers without calling it
- Tool results look correct but were never actually executed
- Different tools fight over the same responsibility
- Model uses tool when it shouldn't, or skips it when it must

### 4. Rendering/Transport Corruption

The agent's internal answer is correct, but the platform layer mutates it during delivery.

**