Enterprise Agent Ops

The canonical shelf is Operate because the skill defines how production agent workloads are started, observed, bounded, and recovered, rather than how features are first shipped. Infra is the best-fit category, given the PM2, systemd, container, immutable-artifact, and rollout/rollback patterns called out as deployment integrations.

Where it fits

Example use

Gate agent image promotion in CI/CD with immutable artifacts before pointing traffic at a new agent version.
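
One way such a gate could look is sketched below. It assumes that "immutable" means the image reference is pinned by a sha256 content digest rather than a mutable tag, and that the pipeline supplies a map of check results. The names `is_pinned_by_digest` and `can_promote` are hypothetical and do not come from the skill itself.

```python
import re

# Assumption: an "immutable" reference is pinned by content digest
# (name@sha256:<64 hex chars>), not by a mutable tag such as ":latest".
_DIGEST_REF = re.compile(r"[^@\s]+@sha256:[0-9a-f]{64}")


def is_pinned_by_digest(image_ref: str) -> bool:
    """Return True if the image reference is pinned to a sha256 digest."""
    return _DIGEST_REF.fullmatch(image_ref) is not None


def can_promote(image_ref: str, checks: dict[str, bool]) -> bool:
    """Hypothetical gate: promote only digest-pinned images whose checks all passed.

    An empty check set fails closed, so a misconfigured pipeline cannot promote.
    """
    return is_pinned_by_digest(image_ref) and bool(checks) and all(checks.values())
```

A CI job could call `can_promote` just before the step that shifts traffic and exit non-zero when it returns `False`.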

Example use

Define PM2 or systemd restart policies, hard timeouts, and retry budgets for a worker that runs agent tasks overnight.
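
Restart policies themselves live in the supervisor's configuration (PM2 or systemd). The part inside the worker, a hard per-attempt timeout plus a bounded retry budget, might be sketched as follows. The `run_with_budget` helper, `RunResult`, and the linear backoff are illustrative assumptions, not part of the skill.

```python
import subprocess
import time
from dataclasses import dataclass


@dataclass
class RunResult:
    ok: bool
    attempts: int
    last_error: str


def run_with_budget(cmd: list[str], timeout_s: float, max_retries: int,
                    backoff_s: float = 0.0) -> RunResult:
    """Run cmd with a hard per-attempt timeout and a bounded retry budget."""
    last_error = ""
    # One initial attempt plus at most max_retries retries.
    for attempt in range(1, max_retries + 2):
        try:
            # On timeout, subprocess.run kills the child and raises TimeoutExpired.
            proc = subprocess.run(cmd, timeout=timeout_s, capture_output=True)
            if proc.returncode == 0:
                return RunResult(True, attempt, "")
            last_error = f"exit code {proc.returncode}"
        except subprocess.TimeoutExpired:
            last_error = f"timed out after {timeout_s}s"
        if attempt <= max_retries:
            time.sleep(backoff_s * attempt)  # simple linear backoff between attempts
    return RunResult(False, max_retries + 1, last_error)
```

Once the budget is spent the function returns a failed result instead of looping, which leaves the decision to restart the whole worker with the supervisor.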

Example use

Track success rate, mean retries, and failure class distribution to spot regressions after a model or prompt change.

Example use

Operate · Error tracking

Execute the six-step incident pattern: freeze rollout, capture traces, isolate the failing route, patch minimally, rerun checks, resume gradually.

How it compares

Use for production agent SRE patterns—not as a substitute for Ship-phase test suites or single-session coding skills.

Common Questions / FAQ

Who is enterprise-agent-ops for?

Solo and indie builders who operate cloud-hosted or always-on coding agents and want lifecycle, observability, and safety controls beyond a single CLI session.

When should I use enterprise-agent-ops?

During Ship when wiring CI/CD and deployment artifacts for agents, and during Operate when defining metrics, incidents, rollouts, PM2/systemd/container runbooks, and permission scopes.

Is enterprise-agent-ops safe to install?

Treat it as operational guidance that assumes shell, network, and secrets access in real environments; review the Security Audits panel on this Prism page before enabling automated deploy or kill-switch automation.

SKILL.md

# Enterprise Agent Ops

Use this skill for cloud-hosted or continuously running agent systems that need operational controls beyond single CLI sessions.

## Operational Domains

1. runtime lifecycle (start, pause, stop, restart)
2. observability (logs, metrics, traces)
3. safety controls (scopes, permissions, kill switches)
4. change management (rollout, rollback, audit)
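
As an illustration of the runtime lifecycle domain, the commands above can be modeled as a small state machine. This is a minimal sketch under assumptions: the states, the transition table, and the `kill` method are hypothetical choices, and the skill does not prescribe them.

```python
from enum import Enum


class State(Enum):
    STOPPED = "stopped"
    RUNNING = "running"
    PAUSED = "paused"


# Hypothetical transition table: (current state, command) -> next state.
TRANSITIONS = {
    (State.STOPPED, "start"): State.RUNNING,
    (State.PAUSED, "start"): State.RUNNING,
    (State.RUNNING, "pause"): State.PAUSED,
    (State.RUNNING, "stop"): State.STOPPED,
    (State.PAUSED, "stop"): State.STOPPED,
    (State.RUNNING, "restart"): State.RUNNING,
    (State.PAUSED, "restart"): State.RUNNING,
}


class AgentLifecycle:
    def __init__(self) -> None:
        self.state = State.STOPPED
        # (command, from, to) tuples; doubles as a simple change record.
        self.history: list[tuple[str, str, str]] = []

    def apply(self, command: str) -> State:
        """Apply a lifecycle command, rejecting transitions not in the table."""
        key = (self.state, command)
        if key not in TRANSITIONS:
            raise ValueError(f"cannot {command!r} while {self.state.value}")
        new_state = TRANSITIONS[key]
        self.history.append((command, self.state.value, new_state.value))
        self.state = new_state
        return new_state

    def kill(self) -> State:
        """Kill switch: force STOPPED from any state and record it."""
        self.history.append(("kill", self.state.value, State.STOPPED.value))
        self.state = State.STOPPED
        return self.state
```

Rejecting undefined transitions keeps operator mistakes, such as pausing a stopped agent, visible instead of silently ignored.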

## Baseline Controls

- immutable deployment artifacts
- least-privilege credentials
- environment-level secret injection
- hard timeout and retry budgets
- audit log for high-risk actions
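
Two of these controls, environment-level secret injection and an audit log for high-risk actions, could be sketched like this. The helper names, the record fields, and the variable `AGENT_API_TOKEN` in the usage note are assumptions made for illustration.

```python
import json
import os
import time


def require_secret(name: str) -> str:
    """Read a secret injected through the environment; fail closed if absent."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing required secret: {name}")
    return value


def audit_record(actor: str, action: str, target: str) -> str:
    """Build one JSON line for an append-only audit log.

    Only identifiers are recorded; secret values must never be passed in.
    """
    return json.dumps(
        {"ts": time.time(), "actor": actor, "action": action, "target": target},
        sort_keys=True,
    )
```

A worker might call `require_secret("AGENT_API_TOKEN")` at startup so that a missing secret stops the process early, and append an `audit_record(...)` line before each high-risk action such as a deploy or a kill-switch trigger.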

## Metrics to Track

- success rate
- mean retries per task
- time to recovery
- cost per successful task
- failure class distribution
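
Four of these metrics can be derived from per-task records, as in the sketch below. The `TaskRecord` shape and its field names are assumptions. Time to recovery is left out because it depends on incident timestamps rather than task records.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskRecord:
    ok: bool
    retries: int
    cost_usd: float
    failure_class: Optional[str] = None  # e.g. "timeout"; None on success


def summarize(tasks: list[TaskRecord]) -> dict:
    """Compute success rate, mean retries, cost per success, and failure classes."""
    if not tasks:
        raise ValueError("no task records")
    successes = sum(1 for t in tasks if t.ok)
    total_cost = sum(t.cost_usd for t in tasks)
    return {
        "success_rate": successes / len(tasks),
        "mean_retries": sum(t.retries for t in tasks) / len(tasks),
        # Failed tasks still cost money, so total spend is divided by successes.
        "cost_per_success": total_cost / successes if successes else None,
        "failure_classes": dict(Counter(t.failure_class for t in tasks if not t.ok)),
    }
```

Comparing two such summaries, one from before and one from after a model or prompt change, is one simple way to spot a regression.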

## Incident Pattern

When failure spikes:
1. freeze new rollout
2. capture representative traces
3. isolate failing route
4. patch with smallest safe change
5. run regression + security checks
6. resume gradually
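
The freeze and gradual-resume ends of this pattern can be sketched as a small gate. The ramp percentages, the threshold semantics, and the choice to restart the ramp from zero after a freeze are all assumptions made for illustration. Steps 2 to 4 (traces, isolation, patching) are manual work and are not modeled.

```python
# Hypothetical ramp: share of traffic routed to the new or patched version.
RAMP_STEPS = [0.05, 0.25, 0.50, 1.00]


class RolloutGate:
    def __init__(self, failure_threshold: float) -> None:
        self.failure_threshold = failure_threshold
        self.frozen = False
        self.share = 0.0  # current traffic share for the new version

    def observe(self, failure_rate: float) -> None:
        """Step 1: freeze the rollout when failures spike past the threshold."""
        if failure_rate > self.failure_threshold:
            self.frozen = True

    def unfreeze(self, checks_passed: bool) -> None:
        """Step 5: only passing regression + security checks clear the freeze."""
        if checks_passed:
            self.frozen = False
            self.share = 0.0  # assumption: the ramp restarts from zero

    def advance(self) -> float:
        """Step 6: move to the next ramp step unless the rollout is frozen."""
        if not self.frozen:
            for step in RAMP_STEPS:
                if step > self.share:
                    self.share = step
                    break
        return self.share
```

While frozen, `advance` returns the current share unchanged, so a scheduler that calls it periodically cannot widen exposure during an incident.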

## Deployment Integrations

This skill pairs with:
- PM2 workflows
- systemd services
- container orchestrators
- CI/CD gates

What is this skill?

4 operational domains: runtime lifecycle, observability, safety controls, and change management

5 baseline controls including least-privilege credentials, secret injection, timeout/retry budgets, and audit logging for high-risk actions

5 metrics to track: success rate, mean retries per task, time to recovery, cost per successful task, and failure class distribution

6-step incident pattern when failure spikes: freeze rollout through gradual resume after regression and security checks

Pairs with PM2, systemd, container orchestrators, and CI/CD gates

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Adoption & trust: 4.2k installs on skills.sh; 210k GitHub stars; 3/3 security scanners passed (skills.sh audits).

Journey fit

Spans multiple journey phases - primary shelf plus alternate fits below.
