
Observability Designer
Design Prometheus-style alerts, severities, annotations, and runbook links so production signals are actionable.
Overview
Observability Designer is an agent skill used mainly in the Operate phase, and also in Ship for security-adjacent launch prep. It drafts structured Prometheus-style alerts with runbooks and severity routing.
Install
npx skills add https://github.com/alirezarezvani/claude-skills --skill observability-designer
What is this skill?
- Example alert definitions with PromQL expressions and for durations
- Severity labels, team routing, and runbook_url annotations
- Historical alert metadata fields (fires per day, false positive rate)
- Patterns for high latency, service down, and error-rate alerts
- Structured JSON-oriented alert catalog suitable for review in agent sessions
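To make the shape of one catalog entry concrete, here is a minimal Python sketch that loads the JSON-oriented catalog into typed objects. The `AlertEntry` class and `parse_catalog` helper are hypothetical names introduced for illustration and are not part of the skill; the field names mirror the keys in the sample catalog further down this page.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AlertEntry:
    """One catalog entry; attribute names mirror the JSON keys (hypothetical type)."""
    alert: str
    expr: str
    labels: dict
    annotations: dict
    for_duration: Optional[str] = None  # JSON key "for"; some sample entries omit it
    historical_data: dict = field(default_factory=dict)


def parse_catalog(text: str) -> list:
    """Parse a JSON document of the form {"alerts": [...]} into AlertEntry objects."""
    raw = json.loads(text)
    return [
        AlertEntry(
            alert=item["alert"],
            expr=item["expr"],
            labels=item.get("labels", {}),
            annotations=item.get("annotations", {}),
            for_duration=item.get("for"),
            historical_data=item.get("historical_data", {}),
        )
        for item in raw["alerts"]
    ]


# Sample entry copied from the catalog shown in SKILL.md.
SAMPLE = json.dumps({"alerts": [{
    "alert": "ServiceDown",
    "expr": 'up{service="payment-service"} == 0',
    "labels": {"severity": "critical", "service": "payment-service", "team": "payments"},
    "annotations": {
        "summary": "Payment service is down",
        "runbook_url": "https://runbooks.company.com/service-down",
    },
    "historical_data": {"fires_per_day": 0.1, "false_positive_rate": 0.05,
                        "average_duration_minutes": 3},
}]})

entries = parse_catalog(SAMPLE)
print(entries[0].alert, entries[0].labels["severity"], entries[0].for_duration)
```

Keeping `for` optional in the type makes it easy to spot entries, such as ServiceDown in the sample, that define no hold duration.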
Adoption & trust: 540 installs on skills.sh; 17.5k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You deployed a service but only have ad-hoc logs and no alert rules with owners, thresholds, or runbook links.
Who is it for?
Indie SaaS or API operators adopting Prometheus-compatible monitoring who need alert catalogs designed in collaboration with an agent.
Skip if: You want fully managed APM with zero PromQL, or you have a greenfield app with no metrics instrumentation yet.
When should I use this skill?
You need structured alert definitions with expressions, severities, and runbook links for production services.
What do I get? / Deliverables
You leave the session with reviewable alert definitions—expressions, labels, annotations, and tuning metadata—for your metrics stack.
- JSON-oriented alert rule set with labels and annotations
- Runbook-linked summaries for critical failure modes
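The JSON-oriented rule set is a review artifact, so it likely needs a small conversion step before it can be loaded as Prometheus alerting rules. The sketch below is one way to do that in Python: it nests the rules under a `groups` list, as Prometheus rule files do, and drops catalog-only keys such as `historical_data`. The function name `to_rule_group` and the group name `payment-service-alerts` are assumptions made for this example.

```python
import json

# Keys kept for each rule. "historical_data" is catalog metadata, not a
# Prometheus rule field, so it is dropped in the conversion.
RULE_KEYS = ("alert", "expr", "for", "labels", "annotations")


def to_rule_group(catalog: dict, group_name: str) -> dict:
    """Build a {"groups": [{"name": ..., "rules": [...]}]} structure from the catalog."""
    rules = [
        {key: alert[key] for key in RULE_KEYS if key in alert}
        for alert in catalog["alerts"]
    ]
    return {"groups": [{"name": group_name, "rules": rules}]}


# Entry copied from the sample catalog in SKILL.md.
catalog = {"alerts": [{
    "alert": "HighErrorRate",
    "expr": ('sum(rate(http_requests_total{service="payment-service",code=~"5.."}[5m])) '
             '/ sum(rate(http_requests_total{service="payment-service"}[5m])) > 0.01'),
    "for": "2m",
    "labels": {"severity": "warning", "service": "payment-service", "team": "payments"},
    "annotations": {
        "summary": "High error rate detected",
        "description": "Error rate is {{ $value | humanizePercentage }} for payment-service",
        "runbook_url": "https://runbooks.company.com/high-error-rate",
    },
    "historical_data": {"fires_per_day": 1.8, "false_positive_rate": 0.25,
                        "average_duration_minutes": 8},
}]}

group = to_rule_group(catalog, "payment-service-alerts")  # hypothetical group name
print(json.dumps(group, indent=2))
```

The result still needs to be serialized in whatever form your stack expects and checked with its own rule validator before deployment.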
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Observability design is canonically shelved under Operate because it governs how you detect and respond after launch. Alert rules, SLO-ish latency thresholds, and service-up checks are core monitoring artifacts, not distribution or feature coding.
Where it fits
Draft latency and 5xx rate alerts before flipping traffic to a new API deployment.
Refine PromQL rules and annotation runbook URLs after the first week of on-call noise.
Align service labels and team routing when splitting a monolith into named services.
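For the refinement pass after the first week of on-call noise, the historical metadata in the catalog can be used to decide which rules to tune first. The Python sketch below ranks alerts by expected false fires per day, computed as `fires_per_day * false_positive_rate`. That scoring formula is an assumption chosen for this example, not something the skill prescribes; the numbers are copied from the sample catalog.

```python
def noisy_fires_per_day(alert: dict) -> float:
    """Expected false-positive fires per day for one catalog entry."""
    history = alert.get("historical_data", {})
    return history.get("fires_per_day", 0.0) * history.get("false_positive_rate", 0.0)


def rank_by_noise(alerts: list) -> list:
    """Return alerts sorted from noisiest to quietest."""
    return sorted(alerts, key=noisy_fires_per_day, reverse=True)


# Historical metadata copied from the sample catalog in SKILL.md.
ALERTS = [
    {"alert": "HighLatency",
     "historical_data": {"fires_per_day": 2.5, "false_positive_rate": 0.15}},
    {"alert": "ServiceDown",
     "historical_data": {"fires_per_day": 0.1, "false_positive_rate": 0.05}},
    {"alert": "HighCPUUsage",
     "historical_data": {"fires_per_day": 15.2, "false_positive_rate": 0.8}},
    {"alert": "HighMemoryUsage",
     "historical_data": {"fires_per_day": 8.5, "false_positive_rate": 0.6}},
]

for entry in rank_by_noise(ALERTS):
    print(f"{entry['alert']}: {noisy_fires_per_day(entry):.3f} false fires/day")
```

With the sample data, HighCPUUsage and HighMemoryUsage come out on top, which matches the intuition that resource-usage alerts without a `for` duration tend to be the noisiest.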
How it compares
This is a procedural alert-design skill, not an MCP metrics server or a hosted observability vendor integration.
Common Questions / FAQ
Who is observability-designer for?
Solo builders and small teams running APIs or microservices who need Prometheus-style alerts, severities, and runbook links without hiring an SRE first.
When should I use observability-designer?
Use it in the Operate phase when defining monitoring alerts before or after launch, and in Ship launch prep when you want error-rate and latency guardrails documented.
Is observability-designer safe to install?
Check the Security Audits panel on this Prism page. The alert JSON is documentation-oriented, but you should still review the expressions before applying them to production.
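Part of that review can be automated. The following Python sketch flags a few structural gaps in a catalog entry before anyone reads the PromQL itself. The `lint_alert` helper and the allowed severity set are assumptions for illustration; the set simply lists the severities that appear in the sample catalog.

```python
# Severities seen in the sample catalog; adjust to your own routing scheme.
ALLOWED_SEVERITIES = {"info", "warning", "critical"}


def lint_alert(alert: dict) -> list:
    """Return human-readable problems found in one catalog entry."""
    problems = []
    if "for" not in alert:
        # Without a hold duration the rule fires on the first failing evaluation.
        problems.append("no 'for' duration")
    if "runbook_url" not in alert.get("annotations", {}):
        problems.append("missing runbook_url annotation")
    severity = alert.get("labels", {}).get("severity")
    if severity not in ALLOWED_SEVERITIES:
        problems.append(f"unexpected severity: {severity!r}")
    return problems


# Entry copied from the sample catalog in SKILL.md.
high_cpu = {
    "alert": "HighCPUUsage",
    "expr": 'rate(process_cpu_seconds_total{service="payment-service"}[5m]) * 100 > 80',
    "labels": {"severity": "warning", "service": "payment-service", "team": "payments"},
    "annotations": {
        "summary": "High CPU usage",
        "description": "CPU usage is {{ $value }}% for payment-service",
    },
}

for problem in lint_alert(high_cpu):
    print(f"{high_cpu['alert']}: {problem}")
```

A check like this does not replace reading the expressions, since it says nothing about whether a threshold or metric name is right for your service.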
SKILL.md - Observability Designer
{
  "alerts": [
    {
      "alert": "HighLatency",
      "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service=\"payment-service\"}[5m])) > 0.5",
      "for": "5m",
      "labels": { "severity": "warning", "service": "payment-service", "team": "payments" },
      "annotations": {
        "summary": "High request latency detected",
        "description": "95th percentile latency is {{ $value }}s for payment-service",
        "runbook_url": "https://runbooks.company.com/high-latency"
      },
      "historical_data": { "fires_per_day": 2.5, "false_positive_rate": 0.15, "average_duration_minutes": 12 }
    },
    {
      "alert": "ServiceDown",
      "expr": "up{service=\"payment-service\"} == 0",
      "labels": { "severity": "critical", "service": "payment-service", "team": "payments" },
      "annotations": {
        "summary": "Payment service is down",
        "description": "Payment service has been down for more than 1 minute",
        "runbook_url": "https://runbooks.company.com/service-down"
      },
      "historical_data": { "fires_per_day": 0.1, "false_positive_rate": 0.05, "average_duration_minutes": 3 }
    },
    {
      "alert": "HighErrorRate",
      "expr": "sum(rate(http_requests_total{service=\"payment-service\",code=~\"5..\"}[5m])) / sum(rate(http_requests_total{service=\"payment-service\"}[5m])) > 0.01",
      "for": "2m",
      "labels": { "severity": "warning", "service": "payment-service", "team": "payments" },
      "annotations": {
        "summary": "High error rate detected",
        "description": "Error rate is {{ $value | humanizePercentage }} for payment-service",
        "runbook_url": "https://runbooks.company.com/high-error-rate"
      },
      "historical_data": { "fires_per_day": 1.8, "false_positive_rate": 0.25, "average_duration_minutes": 8 }
    },
    {
      "alert": "HighCPUUsage",
      "expr": "rate(process_cpu_seconds_total{service=\"payment-service\"}[5m]) * 100 > 80",
      "labels": { "severity": "warning", "service": "payment-service", "team": "payments" },
      "annotations": {
        "summary": "High CPU usage",
        "description": "CPU usage is {{ $value }}% for payment-service"
      },
      "historical_data": { "fires_per_day": 15.2, "false_positive_rate": 0.8, "average_duration_minutes": 45 }
    },
    {
      "alert": "HighMemoryUsage",
      "expr": "process_resident_memory_bytes{service=\"payment-service\"} / process_virtual_memory_max_bytes{service=\"payment-service\"} * 100 > 85",
      "labels": { "severity": "info", "service": "payment-service", "team": "payments" },
      "annotations": {
        "summary": "High memory usage",
        "description": "Memory usage is {{ $value }}% for payment-service"
      },
      "historical_data": { "fires_per_day": 8.5, "false_positive_rate": 0.6, "average_duration_minutes": 30 }
    },
    {
      "alert": "DatabaseConnectionPoolExhaustion",
      "expr": "db_connections_active{service=\"payment-service\"} / db_connections_max{service=\"payment-service\"} > 0.9",
      "for": "1m",
      "labels": { "severity": "critical", "service": "payment-service", "team": "payments" },
      "annotations": {
        "summary": "Database connection pool near exhaustion",
        "description": "Connection pool utilization is {{ $value | humanizePercentage }}",
        "runbook_url": "https://runbooks.company.com/db-connections"
      },
      "historical_data": { "fires_per_day": 0.3, "false_positive_rate": 0.1, "average_duration_minutes": 5 }
    },
    {
      "alert": "LowTraffic",
      "expr": "sum(rate(http_requests_total{service=\"payment-s