
Enterprise Agent Ops
Run always-on or cloud-hosted coding agents with lifecycle controls, observability, and safe rollouts instead of one-off CLI sessions.
Overview
Enterprise Agent Ops is an agent skill used mainly in the Operate phase (and secondarily at Ship launch). It governs long-lived agent workloads through lifecycle management, observability, safety boundaries, and audited change control.
Install
npx skills add https://github.com/affaan-m/everything-claude-code --skill enterprise-agent-ops
What is this skill?
- 4 operational domains: runtime lifecycle, observability, safety controls, and change management
- 5 baseline controls: immutable deployment artifacts, least-privilege credentials, environment-level secret injection, hard timeout and retry budgets, and an audit log for high-risk actions
- 5 metrics to track: success rate, mean retries per task, time to recovery, cost per successful task, and failure class distribution
- 6-step incident pattern for failure spikes, from freezing the rollout to resuming gradually after regression and security checks
- Pairs with PM2, systemd, container orchestrators, and CI/CD gates
Adoption & trust: 4.2k installs on skills.sh; 210k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You shipped an agent that works in CLI demos but lacks timeouts, observability, rollbacks, or permission boundaries when it runs continuously in production.
Who is it for?
Indie builders running PM2, systemd, or containerized agent workers who need kill switches, audit logs, and gradual rollouts after Ship.
Skip if: you only run one-shot local Claude Code tasks with no hosted runtime, or your team needs only unit tests and has no operational SLOs for agents.
When should I use this skill?
Operate long-lived agent workloads with observability, security boundaries, and lifecycle management.
What do I get? / Deliverables
You get a repeatable ops model—metrics, incident steps, and deployment integrations—so agent tasks recover safely and rollouts do not amplify failures.
- Operational runbook covering lifecycle, observability, safety, and change management
- Metric dashboard or checklist for agent task health and cost
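As a rough illustration of what a metric checklist could compute, the sketch below derives four of the five tracked metrics from a list of finished task records. The `TaskRecord` schema and the `summarize` function are hypothetical names invented for this example, not part of the skill. Time to recovery is left out because it needs incident start and end timestamps rather than per-task data.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskRecord:
    """One finished agent task (hypothetical schema, for illustration only)."""
    succeeded: bool
    retries: int
    cost_usd: float
    failure_class: Optional[str] = None  # e.g. "timeout", "tool_error"


def summarize(records: list[TaskRecord]) -> dict:
    """Compute success rate, mean retries, cost per success, and failure classes."""
    total = len(records)
    if total == 0:
        raise ValueError("no task records to summarize")
    successes = sum(1 for r in records if r.succeeded)
    total_cost = sum(r.cost_usd for r in records)
    failures = Counter(
        r.failure_class or "unknown" for r in records if not r.succeeded
    )
    return {
        "success_rate": successes / total,
        "mean_retries_per_task": sum(r.retries for r in records) / total,
        # Total spend, including failed attempts, divided by successful tasks.
        "cost_per_successful_task": total_cost / successes if successes else None,
        "failure_class_distribution": dict(failures),
    }


records = [
    TaskRecord(True, 0, 0.25),
    TaskRecord(True, 2, 0.25),
    TaskRecord(False, 3, 0.50, "timeout"),
    TaskRecord(False, 1, 0.50, "tool_error"),
]
summary = summarize(records)
```

Charging failed attempts to the cost of successful tasks is a design choice in this sketch: it makes wasted retries visible in the cost metric instead of hiding them.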
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Canonical shelf is Operate because the skill defines how production agent workloads are started, observed, bounded, and recovered—not how you initially ship features. Infra is the best fit for PM2, systemd, containers, immutable artifacts, and rollout/rollback patterns called out as deployment integrations.
Where it fits
Gate agent image promotion in CI/CD with immutable artifacts before pointing traffic at a new agent version.
Define PM2 or systemd restart policies, hard timeouts, and retry budgets for a worker that runs agent tasks overnight.
Track success rate, mean retries, and failure class distribution to spot regressions after a model or prompt change.
Execute the six-step incident pattern: freeze rollout, capture traces, isolate the failing route, patch minimally, rerun checks, resume gradually.
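The hard timeouts and retry budgets mentioned above can be sketched in a few lines. This is a minimal example, assuming the agent task runs as a child process; `run_with_budget` and its parameters are hypothetical names, and a production worker would more likely rely on its supervisor (PM2, systemd, or a container orchestrator) for restarts.

```python
import subprocess
import sys
import time


def run_with_budget(cmd: list[str], timeout_s: float, max_retries: int,
                    backoff_s: float = 0.0) -> dict:
    """Run cmd with a hard per-attempt timeout and a bounded number of retries."""
    attempts = 0
    last_error = "not started"
    while attempts <= max_retries:
        attempts += 1
        try:
            # subprocess.run kills the child and raises TimeoutExpired
            # once timeout_s elapses.
            result = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout_s
            )
            if result.returncode == 0:
                return {"ok": True, "attempts": attempts, "error": None}
            last_error = f"exit code {result.returncode}"
        except subprocess.TimeoutExpired:
            last_error = f"timed out after {timeout_s}s"
        if attempts <= max_retries:
            time.sleep(backoff_s)  # pause before spending the next retry
    return {"ok": False, "attempts": attempts, "error": last_error}


# A fast task succeeds on the first attempt; a slow one exhausts its budget.
fast = run_with_budget([sys.executable, "-c", "print('done')"],
                       timeout_s=30, max_retries=2)
slow = run_with_budget([sys.executable, "-c", "import time; time.sleep(5)"],
                       timeout_s=0.5, max_retries=1)
```

Returning the attempt count alongside the outcome lets the caller feed the "mean retries per task" metric without extra bookkeeping.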
How it compares
Use for production agent SRE patterns—not as a substitute for Ship-phase test suites or single-session coding skills.
Common Questions / FAQ
Who is enterprise-agent-ops for?
Solo and indie builders who operate cloud-hosted or always-on coding agents and want lifecycle, observability, and safety controls beyond a single CLI session.
When should I use enterprise-agent-ops?
During Ship when wiring CI/CD and deployment artifacts for agents, and during Operate when defining metrics, incidents, rollouts, PM2/systemd/container runbooks, and permission scopes.
Is enterprise-agent-ops safe to install?
Treat it as operational guidance that assumes shell, network, and secrets access in real environments; review the Security Audits panel on this Prism page before enabling automated deploy or kill-switch automation.
SKILL.md
# Enterprise Agent Ops

Use this skill for cloud-hosted or continuously running agent systems that need operational controls beyond single CLI sessions.

## Operational Domains

1. runtime lifecycle (start, pause, stop, restart)
2. observability (logs, metrics, traces)
3. safety controls (scopes, permissions, kill switches)
4. change management (rollout, rollback, audit)

## Baseline Controls

- immutable deployment artifacts
- least-privilege credentials
- environment-level secret injection
- hard timeout and retry budgets
- audit log for high-risk actions

## Metrics to Track

- success rate
- mean retries per task
- time to recovery
- cost per successful task
- failure class distribution

## Incident Pattern

When failure spikes:

1. freeze new rollout
2. capture representative traces
3. isolate failing route
4. patch with smallest safe change
5. run regression + security checks
6. resume gradually

## Deployment Integrations

This skill pairs with:

- PM2 workflows
- systemd services
- container orchestrators
- CI/CD gates
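The freeze and gradual-resume steps of the incident pattern can be read as a small decision rule. The sketch below is one possible interpretation, not something the skill prescribes: the traffic ladder `ROLLOUT_STEPS`, the 5% failure threshold, and the function `next_rollout_percent` are all assumptions made for illustration.

```python
ROLLOUT_STEPS = [0, 5, 25, 50, 100]  # percent of traffic; hypothetical ladder


def next_rollout_percent(current: int, failure_rate: float,
                         checks_passed: bool,
                         max_failure_rate: float = 0.05) -> int:
    """Return the traffic share for the new agent version in the next window."""
    if failure_rate > max_failure_rate:
        # Step 1 of the incident pattern: freeze, do not widen exposure.
        return current
    if not checks_passed:
        # Step 5: stay frozen until regression and security checks are green.
        return current
    # Step 6: resume gradually by moving up one rung of the ladder.
    higher = [step for step in ROLLOUT_STEPS if step > current]
    return higher[0] if higher else current


frozen = next_rollout_percent(5, failure_rate=0.20, checks_passed=True)
gated = next_rollout_percent(5, failure_rate=0.01, checks_passed=False)
resumed = next_rollout_percent(5, failure_rate=0.01, checks_passed=True)
```

In this sketch a freeze holds the current exposure and does not roll back. Rolling back to the previous artifact is a separate change-management action that the rule above leaves out.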