
Adversarial Resilience
Harden coding agents that read untrusted data—ads copy, docs, API responses—against prompt injection and capability escalation before production.
Overview
adversarial-resilience is an agent skill most often used in Ship (also Build agent-tooling, Operate monitoring) that hardens AI agents against prompt injection, exfiltration, and capability escalation.
Install
npx skills add https://github.com/itallstartedwithaidea/agent-skills --skill adversarial-resilienceWhat is this skill?
- Layered defense model: input sanitization, instruction boundaries, and capability limits
- Covers prompt injection, data exfiltration, sandbox escape, and unauthorized escalation
- Patterns from production agent work with untrusted user-provided business data
- Applies to any agent processing external documents, APIs, or user content
- Frames every agent as handling untrusted input in practice
Adoption & trust: 1 installs on skills.sh; 18 GitHub stars; 3/3 security scanners passed (skills.sh audits); trending (+100% hot-view momentum).
What problem does it solve?
Your agent can call tools and touch sensitive systems, but untrusted input in ads data, files, or APIs could redirect or abuse those powers.
Who is it for?
Builders shipping tool-using agents that ingest user or third-party content and need a structured adversarial threat model.
Skip if: Static sites with no agents, or teams that only need conventional OWASP web-app hardening without LLM-specific threats.
When should I use this skill?
Building or shipping an agent that processes untrusted user, document, or API content and exposes tools, filesystem, or execution.
What do I get? / Deliverables
You apply layered defensive patterns so agent instructions and capabilities resist deliberate manipulation before and after you ship.
- Layered defense checklist applied to the agent
- Input-handling and capability-boundary recommendations
- Threat-model notes for injection and exfiltration paths
Recommended Skills
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Security hardening is canonically shelved under ship when you are validating defenses before release, even though patterns also apply while building agents. The skill centers on adversarial threats and layered defenses (injection, exfiltration, sandbox escape), matching ship → security rather than generic integrations.
Where it fits
Design tool permissions and input boundaries before wiring Google Ads or CRM data into the agent.
Run an adversarial review pass before production launch when the agent can write files or execute shell commands.
Revisit defenses when new untrusted document types or API partners are added to a live agent.
Decide which agent capabilities are in scope for an MVP so attack surface stays bounded.
How it compares
Use alongside generic secure-code review; this skill targets LLM instruction manipulation and agent capability abuse, not routine XSS-only checklists.
Common Questions / FAQ
Who is adversarial-resilience for?
Solo builders and small teams creating AI agents with MCP tools, APIs, or code execution who must treat all external text as potentially hostile.
When should I use adversarial-resilience?
In ship when pre-launch threat modeling; in build when adding new agent tools or data sources; in operate when investigating suspicious agent behavior or expanding production data feeds.
Is adversarial-resilience safe to install?
The skill is defensive guidance; it does not grant permissions by itself—review the Security Audits panel on this page and your agent’s actual tool policies before deployment.
SKILL.md
READMESKILL.md - Adversarial Resilience
# Adversarial Resilience Part of [Agent Skills™](https://github.com/itallstartedwithaidea/agent-skills) by [googleadsagent.ai™](https://googleadsagent.ai) ## Description Adversarial Resilience is the practice of hardening AI agents against deliberate attacks — prompt injection, data exfiltration, sandbox escape, and unauthorized capability escalation. As agents gain more access to tools, APIs, file systems, and code execution environments, the attack surface grows proportionally. An agent that can write files and execute shell commands is a powerful ally but also a potent attack vector if its instructions can be manipulated by adversarial input. This skill addresses the security challenges encountered in building the Buddy™ agent at [googleadsagent.ai™](https://googleadsagent.ai), where user-provided Google Ads data could theoretically contain injection payloads embedded in campaign names, ad copy, or keyword lists. The defensive patterns here apply to any agent that processes untrusted input — which, in practice, is every agent. Even "internal" tools are vulnerable to indirect injection through documents, code comments, and API responses that contain adversarial content. The defense model operates in layers: input sanitization strips known attack patterns before they reach the model, instruction anchoring makes the system prompt resistant to override, output filtering prevents sensitive data from leaking through responses, permission boundaries restrict what the agent can do regardless of what it is told to do, and audit logging creates a forensic trail of every action for post-incident analysis. ## Use When - Agents process any form of user-provided or external input - The agent has access to sensitive tools (file write, shell execute, API calls) - Deployed agents are accessible to users outside your trusted organization - Compliance requirements mandate security controls for AI systems - You need to protect against both deliberate attacks and accidental injection - The agent operates in a multi-tenant environment with data isolation requirements ## How It Works ```mermaid graph TD A[Untrusted Input] --> B[Layer 1: Input Sanitizer] B --> C[Layer 2: Instruction Anchor] C --> D[Agent Core Processing] D --> E[Layer 3: Output Filter] E --> F[Layer 4: Permission Boundary] F --> G{Action Allowed?} G -->|Yes| H[Execute Action] G -->|No| I[Block + Alert] H --> J[Layer 5: Audit Logger] I --> J J --> K[Response to User] L[Secret Scanner] --> B L --> E M[Pattern Database] --> B ``` The five-layer defense model ensures that adversarial content is intercepted at the earliest possible point. Input sanitization detects and neutralizes known injection patterns before they enter the model's context. Instruction anchoring uses structural techniques (XML delimiters, priority declarations, identity reinforcement) to make the system prompt resistant to override. Output filtering scans generated responses for sensitive data leakage. Permission boundaries enforce hard limits on actions regardless of model intent. Audit logging provides complete forensic visibility into every input, decision, and action. ## Implementation **Input Sanitization Engine:** ```python import re class InputSanitizer: INJECTION_PATTERNS = [ r"ignore\s+(all\s+)?(previous|prior|above)\s+(instructions|prompts|rules)", r"you\s+are\s+now\s+(?:a|an)\s+", r"system\s*:\s*", r"<\s*/?system\s*>", r"BEGIN\s+SYSTEM\s+PROMPT", r"(?:ADMIN|ROOT|SUDO)\s*(?:MODE|ACCESS|OVERRIDE)", r"\[INST\]", r"<\|(?:im_start|im_end|endoftext)\|>", ] def __init__(self): self.compiled = [re.compile(p, re.IGNORECAS