Observability Designer

Name: Observability Designer
Author: alirezarezvani

alirezarezvani/claude-skills

608 installs
23.5k repo stars
Updated July 17, 2026

observability-designer is a Claude Code DevOps skill that designs Prometheus alert rules with severities, annotations, and runbook links for developers who need production signals that trigger actionable incident response.

About

observability-designer is a skill from alirezarezvani/claude-skills for authoring Prometheus-compatible alert definitions with expressions, hold durations, labels, and annotations. Example output includes latency alerts using histogram_quantile on http_request_duration_seconds_bucket, severity labels, team routing, summary and description templates, and runbook_url fields pointing to operational docs. The skill can incorporate historical firing rates and false-positive context to tune thresholds. Developers reach for observability-designer when standing up or refining service monitoring where alerts must be severity-ranked, annotated for on-call engineers, and linked to runbooks rather than generic threshold dumps.

Example alert definitions with PromQL expressions and for durations
Severity labels, team routing, and runbook_url annotations
Historical alert metadata fields (fires per day, false positive rate)
Patterns for high latency, service down, and error-rate alerts
Structured JSON-oriented alert catalog suitable for review in agent sessions

Observability Designer by the numbers

608 all-time installs (skills.sh)
Ranked #240 of 1,440 DevOps & CI/CD skills by installs in the Skillselion catalog
Security screen: HIGH risk (skills.sh audit)
Data as of Jul 31, 2026 (Skillselion catalog sync)

npx skills add https://github.com/alirezarezvani/claude-skills --skill observability-designer


Installs	608
Repo stars	23.5k
Security audit	2 / 3 scanners passed
Last updated	July 17, 2026
Repository	alirezarezvani/claude-skills

How do you design actionable Prometheus alert rules?

Design Prometheus-style alerts, severities, annotations, and runbook links so production signals are actionable.

Who is it for?

Developers defining production Prometheus alerts who need severity labels, templated annotations, and runbook links tied to service SLOs.

Skip if: your team only needs log shipping setup, with no alert rule design or Prometheus expression authoring.

When should I use this skill?

A production service needs new or revised Prometheus alerts with severities, annotation text, and runbook URLs for on-call engineers.

What you get

Prometheus alert rule definitions with expr, labels, annotations, severity tiers, and linked runbook URLs.

Prometheus alert rule YAML
runbook-linked annotations

Files

SKILL.md (Markdown)

Observability Designer (POWERFUL)

Category: Engineering
Tier: POWERFUL
Description: Design comprehensive observability strategies for production systems including SLI/SLO frameworks, alerting optimization, and dashboard generation.

Overview

Observability Designer creates production-ready dashboards, alert configurations, and monitoring strategies across the three pillars (metrics, logs, traces).

When NOT to use: for SLO/SLI design with error-budget math, multi-window burn-rate alerting thresholds, and SLO review gates, route to slo-architect, which is the authoritative skill for that work. This skill's slo_designer.py produces only a quick scaffold. This skill's lane is dashboards (dashboard_generator.py) and alert-noise reduction (alert_optimizer.py).

Quick Start

# Dashboard spec (Grafana JSON + docs) for a service
python3 scripts/dashboard_generator.py --service-type api --name payments --criticality critical --role sre --format grafana -o dashboard.json --doc-output dashboard.md

# Analyze an existing alert config for noise, duplicates, and coverage gaps
python3 scripts/alert_optimizer.py --input alerts.json --analyze-only --report alert_report.json
# ...then emit the optimized config once the report is reviewed:
python3 scripts/alert_optimizer.py --input alerts.json --output alerts_optimized.json

# Quick SLO scaffold (hand off to slo-architect for the real error-budget work)
python3 scripts/slo_designer.py --service-type api --criticality high --user-facing true --service-name payments -o slo_scaffold.json

Verification loop: after deploying optimized alerts, track the report's noise metrics for one on-call rotation — if the actionable-alert ratio didn't improve, re-run --analyze-only against the live config and iterate. Import the generated dashboard into Grafana and confirm every golden-signal panel renders with live data before closing the task.

Core Competencies

SLI/SLO/SLA Framework Design

Service Level Indicators (SLI): Define measurable signals that indicate service health
Service Level Objectives (SLO): Set reliability targets based on user experience
Service Level Agreements (SLA): Establish customer-facing commitments with consequences
Error Budget Management: Calculate and track error budget consumption
Burn Rate Alerting: Multi-window burn rate alerts for proactive SLO protection
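
The error-budget and burn-rate items above come down to a small amount of arithmetic. The following is a minimal sketch, not the logic of slo_designer.py or slo-architect: the function names are hypothetical, and the 14.4 threshold is a commonly quoted value for a fast-burn page on a 30-day SLO, used here only as an assumption.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of events allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    return error_ratio / error_budget(slo_target)


def should_page(short_window_ratio: float, long_window_ratio: float,
                slo_target: float, threshold: float) -> bool:
    """Multi-window check: both windows must exceed the burn-rate threshold."""
    return (burn_rate(short_window_ratio, slo_target) >= threshold
            and burn_rate(long_window_ratio, slo_target) >= threshold)


if __name__ == "__main__":
    # 2% errors in the short window and 1.6% in the long window, 99.9% SLO.
    print(should_page(0.02, 0.016, slo_target=0.999, threshold=14.4))
```

In this example the two windows burn at roughly 20x and 16x the budgeted rate, so both clear the assumed threshold. Requiring both windows is what keeps a brief spike from paging on its own.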

Three Pillars of Observability

Metrics

Golden Signals: Latency, traffic, errors, and saturation monitoring
RED Method: Rate, Errors, and Duration for request-driven services
USE Method: Utilization, Saturation, and Errors for resource monitoring
Business Metrics: Revenue, user engagement, and feature adoption tracking
Infrastructure Metrics: CPU, memory, disk, network, and custom resource metrics

Logs

Structured Logging: JSON-based log formats with consistent fields
Log Aggregation: Centralized log collection and indexing strategies
Log Levels: Appropriate use of DEBUG, INFO, WARN, ERROR, FATAL levels
Correlation IDs: Request tracing through distributed systems
Log Sampling: Volume management for high-throughput systems
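
Structured logging and correlation IDs can be illustrated with the Python standard library alone. This is a sketch under assumptions: the field names (timestamp, level, logger, message, correlation_id) and the JsonFormatter class are illustrative choices, not a schema this skill prescribes.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a consistent set of fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set via logger.info(..., extra={"correlation_id": ...}).
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


if __name__ == "__main__":
    log = build_logger("payment-service")
    correlation_id = str(uuid.uuid4())  # normally taken from the inbound request
    log.info("payment accepted", extra={"correlation_id": correlation_id})
```

Passing the same correlation_id on every log call within a request is what lets an aggregator stitch the lines back together across services.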

Traces

Distributed Tracing: End-to-end request flow visualization
Span Design: Meaningful span boundaries and metadata
Trace Sampling: Intelligent sampling strategies for performance and cost
Service Maps: Automatic dependency discovery through traces
Root Cause Analysis: Trace-driven debugging workflows

Dashboard Design Principles

Information Architecture

Hierarchy: Overview → Service → Component → Instance drill-down paths
Golden Ratio: 80% operational metrics, 20% exploratory metrics
Cognitive Load: Maximum 7±2 panels per dashboard screen
User Journey: Role-based dashboard personas (SRE, Developer, Executive)

Visualization Best Practices

Chart Selection: Time series for trends, heatmaps for distributions, gauges for status
Color Theory: Red for critical, amber for warning, green for healthy states
Reference Lines: SLO targets, capacity thresholds, and historical baselines
Time Ranges: Default to meaningful windows (4h for incidents, 7d for trends)

Panel Design

Metric Queries: Efficient Prometheus/InfluxDB queries with proper aggregation
Alerting Integration: Visual alert state indicators on relevant panels
Interactive Elements: Template variables, drill-down links, and annotation overlays
Performance: Sub-second render times through query optimization

Alert Design and Optimization

Alert Classification

Severity Levels:
  Critical: Service down, SLO burn rate high
  Warning: Approaching thresholds, non-user-facing issues
  Info: Deployment notifications, capacity planning alerts
Actionability: Every alert must have a clear response action
Alert Routing: Escalation policies based on severity and team ownership
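
Severity-based routing can be sketched as a first-match lookup over a route table. The receiver names below mirror the sample configuration later in this document; the pick_receiver helper and its default are hypothetical, and real Alertmanager routing has more features (nested routes, regex matchers) than this shows.

```python
from typing import Dict, List

# One route per severity tier; receiver names are placeholders.
ROUTES: List[dict] = [
    {"match": {"severity": "critical"}, "receiver": "pager-critical"},
    {"match": {"severity": "warning"}, "receiver": "slack-warnings"},
    {"match": {"severity": "info"}, "receiver": "email-info"},
]


def pick_receiver(labels: Dict[str, str],
                  routes: List[dict] = ROUTES,
                  default: str = "email-info") -> str:
    """Return the receiver of the first route whose matchers all equal the labels."""
    for route in routes:
        if all(labels.get(key) == value for key, value in route["match"].items()):
            return route["receiver"]
    return default


if __name__ == "__main__":
    print(pick_receiver({"severity": "critical", "team": "payments"}))
```

Adding a team key to a route's match block is enough to extend this toward ownership-based escalation.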

Alert Fatigue Prevention

Signal vs Noise: High precision (few false positives) over high recall
Hysteresis: Different thresholds for firing and resolving alerts
Suppression: Dependent alert suppression during known outages
Grouping: Related alerts grouped into single notifications
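
Hysteresis is the least obvious item in the list above, so here is a minimal sketch of it. The HysteresisAlert class and its 0.9 / 0.8 thresholds are illustrative assumptions, not output of alert_optimizer.py.

```python
from dataclasses import dataclass


@dataclass
class HysteresisAlert:
    """Fire above one threshold, resolve only below a lower one."""
    fire_above: float
    resolve_below: float
    firing: bool = False

    def __post_init__(self) -> None:
        if self.resolve_below > self.fire_above:
            raise ValueError("resolve_below must not exceed fire_above")

    def update(self, value: float) -> bool:
        """Feed one sample and return the resulting alert state."""
        if not self.firing and value > self.fire_above:
            self.firing = True
        elif self.firing and value < self.resolve_below:
            self.firing = False
        return self.firing


if __name__ == "__main__":
    alert = HysteresisAlert(fire_above=0.9, resolve_below=0.8)
    for sample in (0.85, 0.95, 0.85, 0.92, 0.75):
        print(sample, alert.update(sample))
```

The gap between the two thresholds is what stops a value hovering around 0.9 from flapping: once firing, the alert stays on through 0.85 and 0.92 and clears only at 0.75.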

Alert Rule Design

Threshold Selection: Statistical methods for threshold determination
Window Functions: Appropriate averaging windows and percentile calculations
Alert Lifecycle: Clear firing conditions and automatic resolution criteria
Testing: Alert rule validation against historical data
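
One simple statistical method for threshold selection is to take a high percentile of historical samples and add headroom. The sketch below assumes a nearest-rank percentile and a 10% headroom factor; both are illustrative choices rather than what alert_optimizer.py does.

```python
import math
from typing import Sequence


def nearest_rank_percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of data at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def suggest_threshold(samples: Sequence[float], pct: float = 99.0,
                      headroom: float = 1.1) -> float:
    """Candidate alert threshold: a historical percentile scaled by a headroom factor."""
    return nearest_rank_percentile(samples, pct) * headroom


if __name__ == "__main__":
    latencies_ms = list(range(1, 101))  # stand-in for historical data
    print(suggest_threshold(latencies_ms))
```

A candidate produced this way should still be replayed against historical data, as the Testing item above suggests, before it replaces a live threshold.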

Runbook Generation and Incident Response

Runbook Structure

Alert Context: What the alert means and why it fired
Impact Assessment: User-facing vs internal impact evaluation
Investigation Steps: Ordered troubleshooting procedures with time estimates
Resolution Actions: Common fixes and escalation procedures
Post-Incident: Follow-up tasks and prevention measures

Incident Detection Patterns

Anomaly Detection: Statistical methods for detecting unusual patterns
Composite Alerts: Multi-signal alerts for complex failure modes
Predictive Alerts: Capacity and trend-based forward-looking alerts
Canary Monitoring: Early detection through progressive deployment monitoring
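
A predictive alert can be as simple as a straight-line extrapolation, which is the idea behind PromQL's predict_linear. The following sketch fits a least-squares line to (timestamp, value) points and estimates the time until a limit is crossed; the function names and sample data are hypothetical.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (unix_seconds, value)


def linear_fit(points: List[Point]) -> Tuple[float, float]:
    """Ordinary least-squares slope and intercept for value as a function of time."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    if var_t == 0:
        raise ValueError("timestamps must not all be equal")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in points) / var_t
    return slope, mean_v - slope * mean_t


def seconds_until(points: List[Point], limit: float) -> Optional[float]:
    """Seconds from the last sample until the fitted line reaches limit, or None if not rising."""
    slope, intercept = linear_fit(points)
    if slope <= 0:
        return None
    crossing_time = (limit - intercept) / slope
    return max(0.0, crossing_time - points[-1][0])


if __name__ == "__main__":
    disk_percent = [(0, 50.0), (3600, 55.0), (7200, 60.0)]
    print(seconds_until(disk_percent, limit=85.0))
```

With disk usage growing 5 points per hour from 60%, the estimate is about five hours until 85%, which is the kind of lead time a capacity alert can act on.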

Golden Signals Framework

Latency Monitoring

Request Latency: P50, P95, P99 response time tracking
Queue Latency: Time spent waiting in processing queues
Network Latency: Inter-service communication delays
Database Latency: Query execution and connection pool metrics
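
Percentile latencies such as P95 are usually estimated from histogram buckets; the sample alert catalog in this document does so with histogram_quantile. The sketch below shows the linear interpolation idea for classic cumulative buckets. It is a simplification (for example, it assumes the lowest bucket starts at 0) and is not Prometheus's actual implementation.

```python
import math
from typing import List, Tuple

Bucket = Tuple[float, float]  # (upper_bound, cumulative_count)


def histogram_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate quantile q from cumulative buckets sorted by bound, ending with +Inf."""
    if not 0 <= q <= 1:
        raise ValueError("q must be in [0, 1]")
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last one being +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # Quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, prev_count = buckets[index - 1] if index > 0 else (0.0, 0.0)
    if count == prev_count:
        return upper
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


if __name__ == "__main__":
    seconds = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
    print(histogram_quantile(0.95, seconds))
```

Because the estimate is interpolated, its accuracy depends on bucket boundaries: a 0.5s alert threshold is only meaningful if a bucket bound sits at or near 0.5.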

Traffic Monitoring

Request Rate: Requests per second with burst detection
Bandwidth Usage: Network throughput and capacity utilization
User Sessions: Active user tracking and session duration
Feature Usage: API endpoint and feature adoption metrics

Error Monitoring

Error Rate: 4xx and 5xx HTTP response code tracking
Error Budget: SLO-based error rate targets and consumption
Error Distribution: Error type classification and trending
Silent Failures: Detection of processing failures without HTTP errors

Saturation Monitoring

Resource Utilization: CPU, memory, disk, and network usage
Queue Depth: Processing queue length and wait times
Connection Pools: Database and service connection saturation
Rate Limiting: API throttling and quota exhaustion tracking

Distributed Tracing Strategies

Trace Architecture

Sampling Strategy: Head-based, tail-based, and adaptive sampling
Trace Propagation: Context propagation across service boundaries
Span Correlation: Parent-child relationship modeling
Trace Storage: Retention policies and storage optimization
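
Of the sampling strategies listed, head-based sampling is the easiest to sketch: the keep-or-drop decision is made from the trace ID alone, so every service reaches the same verdict without coordination. The keep_trace function below is a hypothetical illustration, not part of any tracing SDK.

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Deterministic head-based sampling: hash the trace ID into [0, 2**64) and compare."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be in [0, 1]")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    # Integer comparison so that a rate of 1.0 keeps every trace.
    return bucket < int(sample_rate * 2**64)


if __name__ == "__main__":
    kept = sum(keep_trace(f"trace-{i}", 0.1) for i in range(10_000))
    print(f"kept {kept} of 10000 traces")
```

Tail-based sampling cannot be done this way, since it needs the finished trace (its latency or error status) before deciding, which is why it is typically implemented in a collector rather than in each service.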

Service Instrumentation

Auto-Instrumentation: Framework-based automatic trace generation
Manual Instrumentation: Custom span creation for business logic
Baggage Handling: Cross-cutting concern propagation
Performance Impact: Instrumentation overhead measurement and optimization

Log Aggregation Patterns

Collection Architecture

Agent Deployment: Log shipping agent strategies (push vs pull)
Log Routing: Topic-based routing and filtering
Parsing Strategies: Structured vs unstructured log handling
Schema Evolution: Log format versioning and migration

Storage and Indexing

Index Design: Optimized field indexing for common query patterns
Retention Policies: Time and volume-based log retention
Compression: Log data compression and archival strategies
Search Performance: Query optimization and result caching

Cost Optimization for Observability

Data Management

Metric Retention: Tiered retention based on metric importance
Log Sampling: Intelligent sampling to reduce ingestion costs
Trace Sampling: Cost-effective trace collection strategies
Data Archival: Cold storage for historical observability data

Resource Optimization

Query Efficiency: Optimized metric and log queries
Storage Costs: Appropriate storage tiers for different data types
Ingestion Rate Limiting: Controlled data ingestion to manage costs
Cardinality Management: High-cardinality metric detection and mitigation
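
High-cardinality detection amounts to counting distinct label sets per metric name. This is a minimal sketch over an in-memory list of series; the function names and the limit are assumptions, and a real check would read series from the metrics backend.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Series = Tuple[str, Dict[str, str]]  # (metric_name, labels)


def series_per_metric(series: Iterable[Series]) -> Dict[str, int]:
    """Count distinct label combinations for each metric name."""
    distinct = defaultdict(set)
    for name, labels in series:
        distinct[name].add(frozenset(labels.items()))
    return {name: len(label_sets) for name, label_sets in distinct.items()}


def high_cardinality(series: Iterable[Series], limit: int) -> List[str]:
    """Metric names whose distinct series count exceeds limit, sorted by name."""
    counts = series_per_metric(series)
    return sorted(name for name, count in counts.items() if count > limit)


if __name__ == "__main__":
    sample = [
        ("http_requests_total", {"path": "/a"}),
        ("http_requests_total", {"path": "/b"}),
        ("up", {"job": "payment-service"}),
    ]
    print(high_cardinality(sample, limit=1))
```

Labels carrying user IDs, request IDs, or raw URLs are the usual culprits, since each new value creates a new series.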

Scripts Overview

This skill includes three Python scripts:

1. SLO Designer (`slo_designer.py`)

Generates a scaffold SLI/SLO framework based on service characteristics (for full error-budget design, route to slo-architect):

Input: Service description JSON (type, criticality, dependencies)
Output: SLI definitions, SLO targets, error budgets, burn rate alerts, SLA recommendations
Features: Multi-window burn rate calculations, error budget policies, alert rule generation

2. Alert Optimizer (`alert_optimizer.py`)

Analyzes and optimizes existing alert configurations:

Input: Alert configuration JSON with rules, thresholds, and routing
Output: Optimization report and improved alert configuration
Features: Noise detection, coverage gaps, duplicate identification, threshold optimization

3. Dashboard Generator (`dashboard_generator.py`)

Creates comprehensive dashboard specifications:

Input: Service/system description JSON
Output: Grafana-compatible dashboard JSON and documentation
Features: Golden signals coverage, RED/USE methods, drill-down paths, role-based views

Integration Patterns

Monitoring Stack Integration

Prometheus: Metric collection and alerting rule generation
Grafana: Dashboard creation and visualization configuration
Elasticsearch/Kibana: Log analysis and dashboard integration
Jaeger/Zipkin: Distributed tracing configuration and analysis

CI/CD Integration

Pipeline Monitoring: Build, test, and deployment observability
Deployment Correlation: Release impact tracking and rollback triggers
Feature Flag Monitoring: A/B test and feature rollout observability
Performance Regression: Automated performance monitoring in pipelines

Incident Management Integration

PagerDuty/VictorOps: Alert routing and escalation policies
Slack/Teams: Notification and collaboration integration
JIRA/ServiceNow: Incident tracking and resolution workflows
Post-Mortem: Automated incident analysis and improvement tracking

Advanced Patterns

Multi-Cloud Observability

Cross-Cloud Metrics: Unified metrics across AWS, GCP, Azure
Network Observability: Inter-cloud connectivity monitoring
Cost Attribution: Cloud resource cost tracking and optimization
Compliance Monitoring: Security and compliance posture tracking

Microservices Observability

Service Mesh Integration: Istio/Linkerd observability configuration
API Gateway Monitoring: Request routing and rate limiting observability
Container Orchestration: Kubernetes cluster and workload monitoring
Service Discovery: Dynamic service monitoring and health checks

Machine Learning Observability

Model Performance: Accuracy, drift, and bias monitoring
Feature Store Monitoring: Feature quality and freshness tracking
Pipeline Observability: ML pipeline execution and performance monitoring
A/B Test Analysis: Statistical significance and business impact measurement

Best Practices

Organizational Alignment

SLO Setting: Collaborative target setting between product and engineering
Alert Ownership: Clear escalation paths and team responsibilities
Dashboard Governance: Centralized dashboard management and standards
Training Programs: Team education on observability tools and practices

Technical Excellence

Infrastructure as Code: Observability configuration version control
Testing Strategy: Alert rule testing and dashboard validation
Performance Monitoring: Observability system performance tracking
Security Considerations: Access control and data privacy in observability

Continuous Improvement

Metrics Review: Regular SLI/SLO effectiveness assessment
Alert Tuning: Ongoing alert threshold and routing optimization
Dashboard Evolution: User feedback-driven dashboard improvements
Tool Evaluation: Regular assessment of observability tool effectiveness

{
  "alerts": [
    {
      "alert": "HighLatency",
      "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service=\"payment-service\"}[5m])) > 0.5",
      "for": "5m",
      "labels": {
        "severity": "warning",
        "service": "payment-service",
        "team": "payments"
      },
      "annotations": {
        "summary": "High request latency detected",
        "description": "95th percentile latency is {{ $value }}s for payment-service",
        "runbook_url": "https://runbooks.company.com/high-latency"
      },
      "historical_data": {
        "fires_per_day": 2.5,
        "false_positive_rate": 0.15,
        "average_duration_minutes": 12
      }
    },
    {
      "alert": "ServiceDown",
      "expr": "up{service=\"payment-service\"} == 0",
      "labels": {
        "severity": "critical",
        "service": "payment-service", 
        "team": "payments"
      },
      "annotations": {
        "summary": "Payment service is down",
        "description": "Payment service has been down for more than 1 minute",
        "runbook_url": "https://runbooks.company.com/service-down"
      },
      "historical_data": {
        "fires_per_day": 0.1,
        "false_positive_rate": 0.05,
        "average_duration_minutes": 3
      }
    },
    {
      "alert": "HighErrorRate",
      "expr": "sum(rate(http_requests_total{service=\"payment-service\",code=~\"5..\"}[5m])) / sum(rate(http_requests_total{service=\"payment-service\"}[5m])) > 0.01",
      "for": "2m",
      "labels": {
        "severity": "warning",
        "service": "payment-service",
        "team": "payments"
      },
      "annotations": {
        "summary": "High error rate detected",
        "description": "Error rate is {{ $value | humanizePercentage }} for payment-service",
        "runbook_url": "https://runbooks.company.com/high-error-rate"
      },
      "historical_data": {
        "fires_per_day": 1.8,
        "false_positive_rate": 0.25,
        "average_duration_minutes": 8
      }
    },
    {
      "alert": "HighCPUUsage",
      "expr": "rate(process_cpu_seconds_total{service=\"payment-service\"}[5m]) * 100 > 80",
      "labels": {
        "severity": "warning",
        "service": "payment-service",
        "team": "payments"
      },
      "annotations": {
        "summary": "High CPU usage",
        "description": "CPU usage is {{ $value }}% for payment-service"
      },
      "historical_data": {
        "fires_per_day": 15.2,
        "false_positive_rate": 0.8,
        "average_duration_minutes": 45
      }
    },
    {
      "alert": "HighMemoryUsage", 
      "expr": "process_resident_memory_bytes{service=\"payment-service\"} / process_virtual_memory_max_bytes{service=\"payment-service\"} * 100 > 85",
      "labels": {
        "severity": "info",
        "service": "payment-service",
        "team": "payments"
      },
      "annotations": {
        "summary": "High memory usage",
        "description": "Memory usage is {{ $value }}% for payment-service"
      },
      "historical_data": {
        "fires_per_day": 8.5,
        "false_positive_rate": 0.6,
        "average_duration_minutes": 30
      }
    },
    {
      "alert": "DatabaseConnectionPoolExhaustion",
      "expr": "db_connections_active{service=\"payment-service\"} / db_connections_max{service=\"payment-service\"} > 0.9",
      "for": "1m",
      "labels": {
        "severity": "critical",
        "service": "payment-service",
        "team": "payments"
      },
      "annotations": {
        "summary": "Database connection pool near exhaustion",
        "description": "Connection pool utilization is {{ $value | humanizePercentage }}",
        "runbook_url": "https://runbooks.company.com/db-connections"
      },
      "historical_data": {
        "fires_per_day": 0.3,
        "false_positive_rate": 0.1,
        "average_duration_minutes": 5
      }
    },
    {
      "alert": "LowTraffic",
      "expr": "sum(rate(http_requests_total{service=\"payment-service\"}[5m])) < 10",
      "for": "10m",
      "labels": {
        "severity": "warning",
        "service": "payment-service",
        "team": "payments"
      },
      "annotations": {
        "summary": "Unusually low traffic",
        "description": "Request rate is {{ $value }} RPS, which is unusually low"
      },
      "historical_data": {
        "fires_per_day": 12.0,
        "false_positive_rate": 0.9,
        "average_duration_minutes": 120
      }
    },
    {
      "alert": "HighLatencyDuplicate",
      "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service=\"payment-service\"}[5m])) > 0.5",
      "for": "5m", 
      "labels": {
        "severity": "warning",
        "service": "payment-service",
        "team": "payments"
      },
      "annotations": {
        "summary": "High request latency detected (duplicate)",
        "description": "95th percentile latency is {{ $value }}s for payment-service"
      },
      "historical_data": {
        "fires_per_day": 2.5,
        "false_positive_rate": 0.15,
        "average_duration_minutes": 12
      }
    },
    {
      "alert": "VeryLowErrorRate",
      "expr": "sum(rate(http_requests_total{service=\"payment-service\",code=~\"5..\"}[5m])) / sum(rate(http_requests_total{service=\"payment-service\"}[5m])) > 0.001",
      "labels": {
        "severity": "info",
        "service": "payment-service",
        "team": "payments"
      },
      "annotations": {
        "summary": "Error rate above 0.1%",
        "description": "Error rate is {{ $value | humanizePercentage }}"
      },
      "historical_data": {
        "fires_per_day": 25.0,
        "false_positive_rate": 0.95,
        "average_duration_minutes": 5
      }
    },
    {
      "alert": "DiskUsageHigh",
      "expr": "disk_usage_percent{service=\"payment-service\"} > 85",
      "labels": {
        "severity": "warning",
        "service": "payment-service",
        "team": "payments"
      },
      "annotations": {
        "summary": "Disk usage high",
        "description": "Disk usage is {{ $value }}%"
      },
      "historical_data": {
        "fires_per_day": 3.2,
        "false_positive_rate": 0.4,
        "average_duration_minutes": 240
      }
    }
  ],
  "services": [
    {
      "name": "payment-service",
      "type": "api",
      "criticality": "critical",
      "team": "payments"
    },
    {
      "name": "user-service", 
      "type": "api",
      "criticality": "high",
      "team": "identity"
    },
    {
      "name": "notification-service",
      "type": "api", 
      "criticality": "medium",
      "team": "communications"
    }
  ],
  "alert_routing": {
    "routes": [
      {
        "match": {
          "severity": "critical"
        },
        "receiver": "pager-critical",
        "group_wait": "10s",
        "group_interval": "1m",
        "repeat_interval": "5m"
      },
      {
        "match": {
          "severity": "warning"
        },
        "receiver": "slack-warnings",
        "group_wait": "30s",
        "group_interval": "5m", 
        "repeat_interval": "1h"
      },
      {
        "match": {
          "severity": "info"
        },
        "receiver": "email-info",
        "group_wait": "2m",
        "group_interval": "10m",
        "repeat_interval": "24h"
      }
    ]
  },
  "receivers": [
    {
      "name": "pager-critical",
      "pagerduty_configs": [
        {
          "routing_key": "pager-key-critical",
          "description": "Critical alert: {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}"
        }
      ]
    },
    {
      "name": "slack-warnings",
      "slack_configs": [
        {
          "api_url": "https://hooks.slack.com/services/warnings",
          "channel": "#alerts-warnings",
          "title": "Warning Alert",
          "text": "{{ range .Alerts }}{{ .Annotations.description }}{{ end }}"
        }
      ]
    },
    {
      "name": "email-info",
      "email_configs": [
        {
          "to": "team-notifications@company.com",
          "subject": "Info Alert: {{ .GroupLabels.alertname }}",
          "body": "{{ range .Alerts }}{{ .Annotations.description }}{{ end }}"
        }
      ]
    }
  ]
}
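
The alert configuration above carries enough metadata to triage by hand. The sketch below flags noisy alerts, missing for durations, missing runbook links, and repeated expressions, using the field names from that JSON. It is an illustration only: the triage function and the 0.5 false-positive cutoff are assumptions, not the behavior of alert_optimizer.py.

```python
from typing import Dict, List


def triage(alerts: List[dict], max_false_positive_rate: float = 0.5) -> Dict[str, List[str]]:
    """Map alert name to a list of issues; alerts with no issues are omitted."""
    findings: Dict[str, List[str]] = {}
    first_seen: Dict[str, str] = {}  # expr -> name of the first alert using it
    for alert in alerts:
        name = alert["alert"]
        issues: List[str] = []
        history = alert.get("historical_data", {})
        if history.get("false_positive_rate", 0.0) > max_false_positive_rate:
            issues.append("noisy")
        if "for" not in alert:
            issues.append("no_for_duration")
        if "runbook_url" not in alert.get("annotations", {}):
            issues.append("no_runbook")
        expr = alert.get("expr", "")
        if expr in first_seen:
            issues.append(f"duplicate_of:{first_seen[expr]}")
        else:
            first_seen[expr] = name
        if issues:
            findings[name] = issues
    return findings


if __name__ == "__main__":
    import json
    import sys

    # Usage: python triage.py alerts.json
    with open(sys.argv[1], encoding="utf-8") as handle:
        config = json.load(handle)
    for alert_name, problems in triage(config["alerts"]).items():
        print(alert_name, ", ".join(problems))
```

Multiplying fires_per_day by false_positive_rate gives a rough count of wasted notifications per day, which is a reasonable way to decide which flagged alert to fix first.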

{
  "name": "payment-service",
  "type": "api",
  "criticality": "critical",
  "user_facing": true,
  "description": "Handles payment processing and transaction management",
  "team": "payments",
  "environment": "production",
  "dependencies": [
    {
      "name": "user-service",
      "type": "api",
      "criticality": "high"
    },
    {
      "name": "payment-gateway",
      "type": "external",
      "criticality": "critical"
    },
    {
      "name": "fraud-detection",
      "type": "ml",
      "criticality": "high"
    }
  ],
  "endpoints": [
    {
      "path": "/api/v1/payments",
      "method": "POST",
      "sla_latency_ms": 500,
      "expected_tps": 100
    },
    {
      "path": "/api/v1/payments/{id}",
      "method": "GET", 
      "sla_latency_ms": 200,
      "expected_tps": 500
    },
    {
      "path": "/api/v1/payments/{id}/refund",
      "method": "POST",
      "sla_latency_ms": 1000,
      "expected_tps": 10
    }
  ],
  "business_metrics": {
    "revenue_per_hour": {
      "metric": "sum(payment_amount * rate(payments_successful_total[1h]))",
      "target": 50000,
      "unit": "USD"
    },
    "conversion_rate": {
      "metric": "sum(rate(payments_successful_total[5m])) / sum(rate(payment_attempts_total[5m]))",
      "target": 0.95,
      "unit": "percentage"
    }
  },
  "infrastructure": {
    "container_orchestrator": "kubernetes",
    "replicas": 6,
    "cpu_limit": "2000m",
    "memory_limit": "4Gi",
    "database": {
      "type": "postgresql",
      "connection_pool_size": 20
    },
    "cache": {
      "type": "redis",
      "cluster_size": 3
    }
  },
  "compliance_requirements": [
    "PCI-DSS",
    "SOX",
    "GDPR"
  ],
  "tags": [
    "payment",
    "transaction", 
    "critical-path",
    "revenue-generating"
  ]
}

{
  "name": "customer-portal",
  "type": "web",
  "criticality": "high",
  "user_facing": true,
  "description": "Customer-facing web application for account management and billing",
  "team": "frontend",
  "environment": "production",
  "dependencies": [
    {
      "name": "user-service",
      "type": "api",
      "criticality": "high"
    },
    {
      "name": "billing-service",
      "type": "api", 
      "criticality": "high"
    },
    {
      "name": "notification-service",
      "type": "api",
      "criticality": "medium"
    },
    {
      "name": "cdn",
      "type": "external",
      "criticality": "medium"
    }
  ],
  "pages": [
    {
      "path": "/dashboard",
      "sla_load_time_ms": 2000,
      "expected_concurrent_users": 1000
    },
    {
      "path": "/billing",
      "sla_load_time_ms": 3000,
      "expected_concurrent_users": 200
    },
    {
      "path": "/settings",
      "sla_load_time_ms": 1500,
      "expected_concurrent_users": 100
    }
  ],
  "business_metrics": {
    "daily_active_users": {
      "metric": "count(user_sessions_started_total[1d])",
      "target": 10000,
      "unit": "users"
    },
    "session_duration": {
      "metric": "avg(user_session_duration_seconds)",
      "target": 300,
      "unit": "seconds"
    },
    "bounce_rate": {
      "metric": "sum(rate(page_views_bounced_total[1h])) / sum(rate(page_views_total[1h]))",
      "target": 0.3,
      "unit": "percentage"
    }
  },
  "infrastructure": {
    "container_orchestrator": "kubernetes",
    "replicas": 4,
    "cpu_limit": "1000m",
    "memory_limit": "2Gi",
    "storage": {
      "type": "nfs",
      "size": "50Gi"
    },
    "ingress": {
      "type": "nginx",
      "ssl_termination": true,
      "rate_limiting": {
        "requests_per_second": 100,
        "burst": 200
      }
    }
  },
  "monitoring": {
    "synthetic_checks": [
      {
        "name": "login_flow",
        "url": "/auth/login",
        "frequency": "1m",
        "locations": ["us-east", "eu-west", "ap-south"]
      },
      {
        "name": "checkout_flow", 
        "url": "/billing/checkout",
        "frequency": "5m",
        "locations": ["us-east", "eu-west"]
      }
    ],
    "rum": {
      "enabled": true,
      "sampling_rate": 0.1
    }
  },
  "compliance_requirements": [
    "GDPR",
    "CCPA"
  ],
  "tags": [
    "frontend",
    "customer-facing",
    "billing",
    "high-traffic"
  ]
}

{
  "metadata": {
    "title": "customer-portal - SRE Dashboard",
    "service": {
      "name": "customer-portal",
      "type": "web",
      "criticality": "high",
      "user_facing": true,
      "description": "Customer-facing web application for account management and billing",
      "team": "frontend",
      "environment": "production",
      "dependencies": [
        {
          "name": "user-service",
          "type": "api",
          "criticality": "high"
        },
        {
          "name": "billing-service",
          "type": "api",
          "criticality": "high"
        },
        {
          "name": "notification-service",
          "type": "api",
          "criticality": "medium"
        },
        {
          "name": "cdn",
          "type": "external",
          "criticality": "medium"
        }
      ],
      "pages": [
        {
          "path": "/dashboard",
          "sla_load_time_ms": 2000,
          "expected_concurrent_users": 1000
        },
        {
          "path": "/billing",
          "sla_load_time_ms": 3000,
          "expected_concurrent_users": 200
        },
        {
          "path": "/settings",
          "sla_load_time_ms": 1500,
          "expected_concurrent_users": 100
        }
      ],
      "business_metrics": {
        "daily_active_users": {
          "metric": "count(user_sessions_started_total[1d])",
          "target": 10000,
          "unit": "users"
        },
        "session_duration": {
          "metric": "avg(user_session_duration_seconds)",
          "target": 300,
          "unit": "seconds"
        },
        "bounce_rate": {
          "metric": "sum(rate(page_views_bounced_total[1h])) / sum(rate(page_views_total[1h]))",
          "target": 0.3,
          "unit": "percentage"
        }
      },
      "infrastructure": {
        "container_orchestrator": "kubernetes",
        "replicas": 4,
        "cpu_limit": "1000m",
        "memory_limit": "2Gi",
        "storage": {
          "type": "nfs",
          "size": "50Gi"
        },
        "ingress": {
          "type": "nginx",
          "ssl_termination": true,
          "rate_limiting": {
            "requests_per_second": 100,
            "burst": 200
          }
        }
      },
      "monitoring": {
        "synthetic_checks": [
          {
            "name": "login_flow",
            "url": "/auth/login",
            "frequency": "1m",
            "locations": [
              "us-east",
              "eu-west",
              "ap-south"
            ]
          },
          {
            "name": "checkout_flow",
            "url": "/billing/checkout",
            "frequency": "5m",
            "locations": [
              "us-east",
              "eu-west"
            ]
          }
        ],
        "rum": {
          "enabled": true,
          "sampling_rate": 0.1
        }
      },
      "compliance_requirements": [
        "GDPR",
        "CCPA"
      ],
      "tags": [
        "frontend",
        "customer-facing",
        "billing",
        "high-traffic"
      ]
    },
    "target_role": "sre",
    "generated_at": "2026-02-16T14:02:03.421248Z",
    "version": "1.0"
  },
  "configuration": {
    "time_ranges": [
      "1h",
      "6h",
      "1d",
      "7d"
    ],
    "default_time_range": "6h",
    "refresh_interval": "30s",
    "timezone": "UTC",
    "theme": "dark"
  },
  "layout": {
    "grid_settings": {
      "width": 24,
      "height_unit": "px",
      "cell_height": 30
    },
    "sections": [
      {
        "title": "Service Overview",
        "collapsed": false,
        "y_position": 0,
        "panels": [
          "service_status",
          "slo_summary",
          "error_budget"
        ]
      },
      {
        "title": "Golden Signals",
        "collapsed": false,
        "y_position": 8,
        "panels": [
          "latency",
          "traffic",
          "errors",
          "saturation"
        ]
      },
      {
        "title": "Resource Utilization",
        "collapsed": false,
        "y_position": 16,
        "panels": [
          "cpu_usage",
          "memory_usage",
          "network_io",
          "disk_io"
        ]
      },
      {
        "title": "Dependencies & Downstream",
        "collapsed": true,
        "y_position": 24,
        "panels": [
          "dependency_status",
          "downstream_latency",
          "circuit_breakers"
        ]
      }
    ]
  },
  "panels": [
    {
      "id": "service_status",
      "title": "Service Status",
      "type": "stat",
      "grid_pos": {
        "x": 0,
        "y": 0,
        "w": 6,
        "h": 4
      },
      "targets": [
        {
          "expr": "up{service=\"customer-portal\"}",
          "legendFormat": "Status"
        }
      ],
      "field_config": {
        "overrides": [
          {
            "matcher": {
              "id": "byName",
              "options": "Status"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "mode": "thresholds"
                }
              },
              {
                "id": "thresholds",
                "value": {
                  "steps": [
                    {
                      "color": "red",
                      "value": 0
                    },
                    {
                      "color": "green",
                      "value": 1
                    }
                  ]
                }
              },
              {
                "id": "mappings",
                "value": [
                  {
                    "options": {
                      "0": {
                        "text": "DOWN"
                      }
                    },
                    "type": "value"
                  },
                  {
                    "options": {
                      "1": {
                        "text": "UP"
                      }
                    },
                    "type": "value"
                  }
                ]
              }
            ]
          }
        ]
      },
      "options": {
        "orientation": "horizontal",
        "textMode": "value_and_name"
      }
    },
    {
      "id": "slo_summary",
      "title": "SLO Achievement (30d)",
      "type": "stat",
      "grid_pos": {
        "x": 6,
        "y": 0,
        "w": 9,
        "h": 4
      },
      "targets": [
        {
          "expr": "(1 - (sum(increase(http_requests_total{service=\"customer-portal\",code=~\"5..\"}[30d])) / sum(increase(http_requests_total{service=\"customer-portal\"}[30d])))) * 100",
          "legendFormat": "Availability"
        },
        {
          "expr": "histogram_quantile(0.95, sum by (le) (increase(http_request_duration_seconds_bucket{service=\"customer-portal\"}[30d]))) * 1000",
          "legendFormat": "P95 Latency (ms)"
        }
      ],
      "field_config": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "thresholds": {
            "steps": [
              {
                "color": "red",
                "value": 0
              },
              {
                "color": "yellow",
                "value": 99.0
              },
              {
                "color": "green",
                "value": 99.9
              }
            ]
          }
        }
      },
      "options": {
        "orientation": "horizontal",
        "textMode": "value_and_name"
      }
    },
    {
      "id": "error_budget",
      "title": "Error Budget Remaining",
      "type": "gauge",
      "grid_pos": {
        "x": 15,
        "y": 0,
        "w": 9,
        "h": 4
      },
      "targets": [
        {
          "expr": "(1 - (sum(increase(http_requests_total{service=\"customer-portal\",code=~\"5..\"}[30d])) / sum(increase(http_requests_total{service=\"customer-portal\"}[30d]))) - 0.999) / 0.001 * 100",
          "legendFormat": "Error Budget %"
        }
      ],
      "field_config": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "min": 0,
          "max": 100,
          "thresholds": {
            "steps": [
              {
                "color": "red",
                "value": 0
              },
              {
                "color": "yellow",
                "value": 25
              },
              {
                "color": "green",
                "value": 50
              }
            ]
          },
          "unit": "percent"
        }
      },
      "options": {
        "showThresholdLabels": true,
        "showThresholdMarkers": true
      }
    },
    {
      "id": "latency",
      "title": "Request Latency",
      "type": "timeseries",
      "grid_pos": {
        "x": 0,
        "y": 8,
        "w": 12,
        "h": 6
      },
      "targets": [
        {
          "expr": "histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket{service=\"customer-portal\"}[5m]))) * 1000",
          "legendFormat": "P50 Latency"
        },
        {
          "expr": "histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{service=\"customer-portal\"}[5m]))) * 1000",
          "legendFormat": "P95 Latency"
        },
        {
          "expr": "histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{service=\"customer-portal\"}[5m]))) * 1000",
          "legendFormat": "P99 Latency"
        }
      ],
      "field_config": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "unit": "ms",
          "custom": {
            "drawStyle": "line",
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "fillOpacity": 10
          }
        }
      },
      "options": {
        "tooltip": {
          "mode": "multi",
          "sort": "desc"
        },
        "legend": {
          "displayMode": "table",
          "placement": "bottom"
        }
      }
    },
    {
      "id": "traffic",
      "title": "Request Rate",
      "type": "timeseries",
      "grid_pos": {
        "x": 12,
        "y": 8,
        "w": 12,
        "h": 6
      },
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{service=\"customer-portal\"}[5m]))",
          "legendFormat": "Total RPS"
        },
        {
          "expr": "sum(rate(http_requests_total{service=\"customer-portal\",code=~\"2..\"}[5m]))",
          "legendFormat": "2xx RPS"
        },
        {
          "expr": "sum(rate(http_requests_total{service=\"customer-portal\",code=~\"4..\"}[5m]))",
          "legendFormat": "4xx RPS"
        },
        {
          "expr": "sum(rate(http_requests_total{service=\"customer-portal\",code=~\"5..\"}[5m]))",
          "legendFormat": "5xx RPS"
        }
      ],
      "field_config": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "unit": "reqps",
          "custom": {
            "drawStyle": "line",
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "fillOpacity": 0
          }
        }
      },
      "options": {
        "tooltip": {
          "mode": "multi",
          "sort": "desc"
        },
        "legend": {
          "displayMode": "table",
          "placement": "bottom"
        }
      }
    },
    {
      "id": "errors",
      "title": "Error Rate",
      "type": "timeseries",
      "grid_pos": {
        "x": 0,
        "y": 14,
        "w": 12,
        "h": 6
      },
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{service=\"customer-portal\",code=~\"5..\"}[5m])) / sum(rate(http_requests_total{service=\"customer-portal\"}[5m])) * 100",
          "legendFormat": "5xx Error Rate"
        },
        {
          "expr": "sum(rate(http_requests_total{service=\"customer-portal\",code=~\"4..\"}[5m])) / sum(rate(http_requests_total{service=\"customer-portal\"}[5m])) * 100",
          "legendFormat": "4xx Error Rate"
        }
      ],
      "field_config": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "unit": "percent",
          "custom": {
            "drawStyle": "line",
            "lineInterpolation": "linear",
            "lineWidth": 2,
            "fillOpacity": 20
          }
        },
        "overrides": [
          {
            "matcher": {
              "id": "byName",
              "options": "5xx Error Rate"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "red"
                }
              }
            ]
          }
        ]
      },
      "options": {
        "tooltip": {
          "mode": "multi",
          "sort": "desc"
        },
        "legend": {
          "displayMode": "table",
          "placement": "bottom"
        }
      }
    },
    {
      "id": "saturation",
      "title": "Saturation Metrics",
      "type": "timeseries",
      "grid_pos": {
        "x": 12,
        "y": 14,
        "w": 12,
        "h": 6
      },
      "targets": [
        {
          "expr": "rate(process_cpu_seconds_total{service=\"customer-portal\"}[5m]) * 100",
          "legendFormat": "CPU Usage %"
        },
        {
          "expr": "process_resident_memory_bytes{service=\"customer-portal\"} / process_virtual_memory_max_bytes{service=\"customer-portal\"} * 100",
          "legendFormat": "Memory Usage %"
        }
      ],
      "field_config": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "unit": "percent",
          "max": 100,
          "custom": {
            "drawStyle": "line",
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "fillOpacity": 10
          }
        }
      },
      "options": {
        "tooltip": {
          "mode": "multi",
          "sort": "desc"
        },
        "legend": {
          "displayMode": "table",
          "placement": "bottom"
        }
      }
    },
    {
      "id": "cpu_usage",
      "title": "CPU Usage",
      "type": "gauge",
      "grid_pos": {
        "x": 0,
        "y": 20,
        "w": 6,
        "h": 4
      },
      "targets": [
        {
          "expr": "rate(process_cpu_seconds_total{service=\"customer-portal\"}[5m]) * 100",
          "legendFormat": "CPU %"
        }
      ],
      "field_config": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "unit": "percent",
          "min": 0,
          "max": 100,
          "thresholds": {
            "steps": [
              {
                "color": "green",
                "value": 0
              },
              {
                "color": "yellow",
                "value": 70
              },
              {
                "color": "red",
                "value": 90
              }
            ]
          }
        }
      },
      "options": {
        "showThresholdLabels": true,
        "showThresholdMarkers": true
      }
    },
    {
      "id": "memory_usage",
      "title": "Memory Usage",
      "type": "gauge",
      "grid_pos": {
        "x": 6,
        "y": 20,
        "w": 6,
        "h": 4
      },
      "targets": [
        {
          "expr": "process_resident_memory_bytes{service=\"customer-portal\"}",
          "legendFormat": "Memory"
        }
      ],
      "field_config": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "unit": "decbytes",
          "thresholds": {
            "steps": [
              {
                "color": "green",
                "value": 0
              },
              {
                "color": "yellow",
                "value": 512000000
              },
              {
                "color": "red",
                "value": 1024000000
              }
            ]
          }
        }
      }
    },
    {
      "id": "network_io",
      "title": "Network I/O",
      "type": "timeseries",
      "grid_pos": {
        "x": 12,
        "y": 20,
        "w": 6,
        "h": 4
      },
      "targets": [
        {
          "expr": "rate(process_network_receive_bytes_total{service=\"customer-portal\"}[5m])",
          "legendFormat": "RX Bytes/s"
        },
        {
          "expr": "rate(process_network_transmit_bytes_total{service=\"customer-portal\"}[5m])",
          "legendFormat": "TX Bytes/s"
        }
      ],
      "field_config": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "unit": "binBps"
        }
      }
    },
    {
      "id": "disk_io",
      "title": "Disk I/O",
      "type": "timeseries",
      "grid_pos": {
        "x": 18,
        "y": 20,
        "w": 6,
        "h": 4
      },
      "targets": [
        {
          "expr": "rate(process_disk_read_bytes_total{service=\"customer-portal\"}[5m])",
          "legendFormat": "Read Bytes/s"
        },
        {
          "expr": "rate(process_disk_write_bytes_total{service=\"customer-portal\"}[5m])",
          "legendFormat": "Write Bytes/s"
        }
      ],
      "field_config": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "unit": "binBps"
        }
      }
    }
  ],
  "variables": [
    {
      "name": "environment",
      "type": "query",
      "query": "label_values(environment)",
      "current": {
        "text": "production",
        "value": "production"
      },
      "includeAll": false,
      "multi": false,
      "refresh": "on_dashboard_load"
    },
    {
      "name": "instance",
      "type": "query",
      "query": "label_values(up{service=\"customer-portal\"}, instance)",
      "current": {
        "text": "All",
        "value": "$__all"
      },
      "includeAll": true,
      "multi": true,
      "refresh": "on_time_range_change"
    },
    {
      "name": "handler",
      "type": "query",
      "query": "label_values(http_requests_total{service=\"customer-portal\"}, handler)",
      "current": {
        "text": "All",
        "value": "$__all"
      },
      "includeAll": true,
      "multi": true,
      "refresh": "on_time_range_change"
    }
  ],
  "alerts_integration": {
    "alert_annotations": true,
    "alert_rules_query": "ALERTS{service=\"customer-portal\"}",
    "alert_panels": [
      {
        "title": "Active Alerts",
        "type": "table",
        "query": "ALERTS{service=\"customer-portal\",alertstate=\"firing\"}",
        "columns": [
          "alertname",
          "severity",
          "instance",
          "description"
        ]
      }
    ]
  },
  "drill_down_paths": {
    "service_overview": {
      "from": "service_status",
      "to": "detailed_health_dashboard",
      "url": "/d/service-health/customer-portal-health",
      "params": [
        "var-service",
        "var-environment"
      ]
    },
    "error_investigation": {
      "from": "errors",
      "to": "error_details_dashboard",
      "url": "/d/errors/customer-portal-errors",
      "params": [
        "var-service",
        "var-time_range"
      ]
    },
    "latency_analysis": {
      "from": "latency",
      "to": "trace_analysis_dashboard",
      "url": "/d/traces/customer-portal-traces",
      "params": [
        "var-service",
        "var-handler"
      ]
    },
    "capacity_planning": {
      "from": "saturation",
      "to": "capacity_dashboard",
      "url": "/d/capacity/customer-portal-capacity",
      "params": [
        "var-service",
        "var-time_range"
      ]
    }
  }
}

{
  "metadata": {
    "service": {
      "name": "payment-service",
      "type": "api",
      "criticality": "critical",
      "user_facing": true,
      "description": "Handles payment processing and transaction management",
      "team": "payments",
      "environment": "production",
      "dependencies": [
        {
          "name": "user-service",
          "type": "api",
          "criticality": "high"
        },
        {
          "name": "payment-gateway",
          "type": "external",
          "criticality": "critical"
        },
        {
          "name": "fraud-detection",
          "type": "ml",
          "criticality": "high"
        }
      ],
      "endpoints": [
        {
          "path": "/api/v1/payments",
          "method": "POST",
          "sla_latency_ms": 500,
          "expected_tps": 100
        },
        {
          "path": "/api/v1/payments/{id}",
          "method": "GET",
          "sla_latency_ms": 200,
          "expected_tps": 500
        },
        {
          "path": "/api/v1/payments/{id}/refund",
          "method": "POST",
          "sla_latency_ms": 1000,
          "expected_tps": 10
        }
      ],
      "business_metrics": {
        "revenue_per_hour": {
          "metric": "sum(payment_amount * rate(payments_successful_total[1h]))",
          "target": 50000,
          "unit": "USD"
        },
        "conversion_rate": {
          "metric": "sum(rate(payments_successful_total[5m])) / sum(rate(payment_attempts_total[5m]))",
          "target": 0.95,
          "unit": "percentage"
        }
      },
      "infrastructure": {
        "container_orchestrator": "kubernetes",
        "replicas": 6,
        "cpu_limit": "2000m",
        "memory_limit": "4Gi",
        "database": {
          "type": "postgresql",
          "connection_pool_size": 20
        },
        "cache": {
          "type": "redis",
          "cluster_size": 3
        }
      },
      "compliance_requirements": [
        "PCI-DSS",
        "SOX",
        "GDPR"
      ],
      "tags": [
        "payment",
        "transaction",
        "critical-path",
        "revenue-generating"
      ]
    },
    "generated_at": "2026-02-16T14:01:57.572080Z",
    "framework_version": "1.0"
  },
  "slis": [
    {
      "name": "Availability",
      "description": "Percentage of requests that do not return a 5xx response",
      "type": "ratio",
      "good_events": "sum(rate(http_requests_total{service=\"payment-service\",code!~\"5..\"}[5m]))",
      "total_events": "sum(rate(http_requests_total{service=\"payment-service\"}[5m]))",
      "unit": "percentage"
    },
    {
      "name": "Request Latency P95",
      "description": "95th percentile of request latency",
      "type": "threshold",
      "query": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service=\"payment-service\"}[5m]))",
      "unit": "seconds"
    },
    {
      "name": "Error Rate",
      "description": "Percentage of requests that return a 5xx response",
      "type": "ratio",
      "good_events": "sum(rate(http_requests_total{service=\"payment-service\",code=~\"5..\"}[5m]))",
      "total_events": "sum(rate(http_requests_total{service=\"payment-service\"}[5m]))",
      "unit": "percentage"
    },
    {
      "name": "Request Throughput",
      "description": "Requests per second",
      "type": "gauge",
      "query": "sum(rate(http_requests_total{service=\"payment-service\"}[5m]))",
      "unit": "requests/sec"
    },
    {
      "name": "User Journey Success Rate",
      "description": "Percentage of successful complete user journeys",
      "type": "ratio",
      "good_events": "sum(rate(user_journey_total{service=\"payment-service\",status=\"success\"}[5m]))",
      "total_events": "sum(rate(user_journey_total{service=\"payment-service\"}[5m]))",
      "unit": "percentage"
    },
    {
      "name": "Feature Availability",
      "description": "Percentage of time key features are available",
      "type": "ratio",
      "good_events": "sum(rate(feature_checks_total{service=\"payment-service\",status=\"available\"}[5m]))",
      "total_events": "sum(rate(feature_checks_total{service=\"payment-service\"}[5m]))",
      "unit": "percentage"
    }
  ],
  "slos": [
    {
      "name": "Availability SLO",
      "description": "Service level objective for percentage of successful requests",
      "sli_name": "Availability",
      "target_value": 0.9999,
      "target_display": "99.99%",
      "operator": ">=",
      "time_windows": [
        "1h",
        "1d",
        "7d",
        "30d"
      ],
      "measurement_window": "30d",
      "service": "payment-service",
      "criticality": "critical"
    },
    {
      "name": "Request Latency P95 SLO",
      "description": "Service level objective for 95th percentile of request latency",
      "sli_name": "Request Latency P95",
      "target_value": 0.1,
      "target_display": "0.1s",
      "operator": "<=",
      "time_windows": [
        "1h",
        "1d",
        "7d",
        "30d"
      ],
      "measurement_window": "30d",
      "service": "payment-service",
      "criticality": "critical"
    },
    {
      "name": "Error Rate SLO",
      "description": "Service level objective for rate of 5xx errors",
      "sli_name": "Error Rate",
      "target_value": 0.001,
      "target_display": "0.1%",
      "operator": "<=",
      "time_windows": [
        "1h",
        "1d",
        "7d",
        "30d"
      ],
      "measurement_window": "30d",
      "service": "payment-service",
      "criticality": "critical"
    },
    {
      "name": "User Journey Success Rate SLO",
      "description": "Service level objective for percentage of successful complete user journeys",
      "sli_name": "User Journey Success Rate",
      "target_value": 0.9999,
      "target_display": "99.99%",
      "operator": ">=",
      "time_windows": [
        "1h",
        "1d",
        "7d",
        "30d"
      ],
      "measurement_window": "30d",
      "service": "payment-service",
      "criticality": "critical"
    },
    {
      "name": "Feature Availability SLO",
      "description": "Service level objective for percentage of time key features are available",
      "sli_name": "Feature Availability",
      "target_value": 0.9999,
      "target_display": "99.99%",
      "operator": ">=",
      "time_windows": [
        "1h",
        "1d",
        "7d",
        "30d"
      ],
      "measurement_window": "30d",
      "service": "payment-service",
      "criticality": "critical"
    }
  ],
  "error_budgets": [
    {
      "slo_name": "Availability SLO",
      "error_budget_rate": 0.0001,
      "error_budget_percentage": "0.010%",
      "budgets_by_window": {
        "1h": "0.4 seconds",
        "1d": "8.6 seconds",
        "7d": "1.0 minutes",
        "30d": "4.3 minutes"
      },
      "burn_rate_alerts": [
        {
          "name": "Availability Burn Rate 2% Alert",
          "description": "Alert when Availability is consuming its error budget at 14.4x the sustainable rate",
          "severity": "critical",
          "short_window": "5m",
          "long_window": "1h",
          "burn_rate_threshold": 14.4,
          "budget_consumed": "2%",
          "condition": "((1 - (sum(rate(http_requests_total{service='payment-service',code!~'5..'}[5m])) / sum(rate(http_requests_total{service='payment-service'}[5m])))) / 0.0001 > 14.4) and ((1 - (sum(rate(http_requests_total{service='payment-service',code!~'5..'}[1h])) / sum(rate(http_requests_total{service='payment-service'}[1h])))) / 0.0001 > 14.4)",
          "annotations": {
            "summary": "High burn rate detected for Availability",
            "description": "Error budget is burning at 14.4x the sustainable rate; sustained for 1h, this consumes 2% of the 30-day budget"
          }
        },
        {
          "name": "Availability Burn Rate 5% Alert",
          "description": "Alert when Availability is consuming its error budget at 6x the sustainable rate",
          "severity": "warning",
          "short_window": "30m",
          "long_window": "6h",
          "burn_rate_threshold": 6,
          "budget_consumed": "5%",
          "condition": "((1 - (sum(rate(http_requests_total{service='payment-service',code!~'5..'}[30m])) / sum(rate(http_requests_total{service='payment-service'}[30m])))) / 0.0001 > 6) and ((1 - (sum(rate(http_requests_total{service='payment-service',code!~'5..'}[6h])) / sum(rate(http_requests_total{service='payment-service'}[6h])))) / 0.0001 > 6)",
          "annotations": {
            "summary": "High burn rate detected for Availability",
            "description": "Error budget is burning at 6x the sustainable rate; sustained for 6h, this consumes 5% of the 30-day budget"
          }
        },
        {
          "name": "Availability Burn Rate 10% (1d) Alert",
          "description": "Alert when Availability is consuming its error budget at 3x the sustainable rate",
          "severity": "info",
          "short_window": "2h",
          "long_window": "1d",
          "burn_rate_threshold": 3,
          "budget_consumed": "10%",
          "condition": "((1 - (sum(rate(http_requests_total{service='payment-service',code!~'5..'}[2h])) / sum(rate(http_requests_total{service='payment-service'}[2h])))) / 0.0001 > 3) and ((1 - (sum(rate(http_requests_total{service='payment-service',code!~'5..'}[1d])) / sum(rate(http_requests_total{service='payment-service'}[1d])))) / 0.0001 > 3)",
          "annotations": {
            "summary": "High burn rate detected for Availability",
            "description": "Error budget is burning at 3x the sustainable rate; sustained for 1d, this consumes 10% of the 30-day budget"
          }
        },
        {
          "name": "Availability Burn Rate 10% (3d) Alert",
          "description": "Alert when Availability is consuming its error budget at 1x the sustainable rate",
          "severity": "info",
          "short_window": "6h",
          "long_window": "3d",
          "burn_rate_threshold": 1,
          "budget_consumed": "10%",
          "condition": "((1 - (sum(rate(http_requests_total{service='payment-service',code!~'5..'}[6h])) / sum(rate(http_requests_total{service='payment-service'}[6h])))) / 0.0001 > 1) and ((1 - (sum(rate(http_requests_total{service='payment-service',code!~'5..'}[3d])) / sum(rate(http_requests_total{service='payment-service'}[3d])))) / 0.0001 > 1)",
          "annotations": {
            "summary": "High burn rate detected for Availability",
            "description": "Error budget is burning at 1x the sustainable rate; sustained for 3d, this consumes 10% of the 30-day budget"
          }
        }
      ]
    },
    {
      "slo_name": "User Journey Success Rate SLO",
      "error_budget_rate": 0.0001,
      "error_budget_percentage": "0.010%",
      "budgets_by_window": {
        "1h": "0.4 seconds",
        "1d": "8.6 seconds",
        "7d": "1.0 minutes",
        "30d": "4.3 minutes"
      },
      "burn_rate_alerts": [
        {
          "name": "User Journey Success Rate Burn Rate 2% Alert",
          "description": "Alert when User Journey Success Rate is consuming its error budget at 14.4x the sustainable rate",
          "severity": "critical",
          "short_window": "5m",
          "long_window": "1h",
          "burn_rate_threshold": 14.4,
          "budget_consumed": "2%",
          "condition": "((1 - (sum(rate(user_journey_total{service='payment-service',status='success'}[5m])) / sum(rate(user_journey_total{service='payment-service'}[5m])))) / 0.0001 > 14.4) and ((1 - (sum(rate(user_journey_total{service='payment-service',status='success'}[1h])) / sum(rate(user_journey_total{service='payment-service'}[1h])))) / 0.0001 > 14.4)",
          "annotations": {
            "summary": "High burn rate detected for User Journey Success Rate",
            "description": "Error budget is burning at 14.4x the sustainable rate; sustained for 1h, this consumes 2% of the 30-day budget"
          }
        },
        {
          "name": "User Journey Success Rate Burn Rate 5% Alert",
          "description": "Alert when User Journey Success Rate is consuming its error budget at 6x the sustainable rate",
          "severity": "warning",
          "short_window": "30m",
          "long_window": "6h",
          "burn_rate_threshold": 6,
          "budget_consumed": "5%",
          "condition": "((1 - (sum(rate(user_journey_total{service='payment-service',status='success'}[30m])) / sum(rate(user_journey_total{service='payment-service'}[30m])))) / 0.0001 > 6) and ((1 - (sum(rate(user_journey_total{service='payment-service',status='success'}[6h])) / sum(rate(user_journey_total{service='payment-service'}[6h])))) / 0.0001 > 6)",
          "annotations": {
            "summary": "High burn rate detected for User Journey Success Rate",
            "description": "Error budget is burning at 6x the sustainable rate; sustained for 6h, this consumes 5% of the 30-day budget"
          }
        },
        {
          "name": "User Journey Success Rate Burn Rate 10% (1d) Alert",
          "description": "Alert when User Journey Success Rate is consuming its error budget at 3x the sustainable rate",
          "severity": "info",
          "short_window": "2h",
          "long_window": "1d",
          "burn_rate_threshold": 3,
          "budget_consumed": "10%",
          "condition": "((1 - (sum(rate(user_journey_total{service='payment-service',status='success'}[2h])) / sum(rate(user_journey_total{service='payment-service'}[2h])))) / 0.0001 > 3) and ((1 - (sum(rate(user_journey_total{service='payment-service',status='success'}[1d])) / sum(rate(user_journey_total{service='payment-service'}[1d])))) / 0.0001 > 3)",
          "annotations": {
            "summary": "High burn rate detected for User Journey Success Rate",
            "description": "Error budget is burning at 3x the sustainable rate; sustained for 1d, this consumes 10% of the 30-day budget"
          }
        },
        {
          "name": "User Journey Success Rate Burn Rate 10% (3d) Alert",
          "description": "Alert when User Journey Success Rate is consuming its error budget at 1x the sustainable rate",
          "severity": "info",
          "short_window": "6h",
          "long_window": "3d",
          "burn_rate_threshold": 1,
          "budget_consumed": "10%",
          "condition": "((1 - (sum(rate(user_journey_total{service='payment-service',status='success'}[6h])) / sum(rate(user_journey_total{service='payment-service'}[6h])))) / 0.0001 > 1) and ((1 - (sum(rate(user_journey_total{service='payment-service',status='success'}[3d])) / sum(rate(user_journey_total{service='payment-service'}[3d])))) / 0.0001 > 1)",
          "annotations": {
            "summary": "High burn rate detected for User Journey Success Rate",
            "description": "Error budget is burning at 1x the sustainable rate; sustained for 3d, this consumes 10% of the 30-day budget"
          }
        }
      ]
    },
    {
      "slo_name": "Feature Availability SLO",
      "error_budget_rate": 0.0001,
      "error_budget_percentage": "0.010%",
      "budgets_by_window": {
        "1h": "0.4 seconds",
        "1d": "8.6 seconds",
        "7d": "1.0 minutes",
        "30d": "4.3 minutes"
      },
      "burn_rate_alerts": [
        {
          "name": "Feature Availability Burn Rate 2% Alert",
          "description": "Alert when Feature Availability is consuming its error budget at 14.4x the sustainable rate",
          "severity": "critical",
          "short_window": "5m",
          "long_window": "1h",
          "burn_rate_threshold": 14.4,
          "budget_consumed": "2%",
          "condition": "((1 - (sum(rate(feature_checks_total{service='payment-service',status='available'}[5m])) / sum(rate(feature_checks_total{service='payment-service'}[5m])))) / 0.0001 > 14.4) and ((1 - (sum(rate(feature_checks_total{service='payment-service',status='available'}[1h])) / sum(rate(feature_checks_total{service='payment-service'}[1h])))) / 0.0001 > 14.4)",
          "annotations": {
            "summary": "High burn rate detected for Feature Availability",
            "description": "Error budget is burning at 14.4x the sustainable rate; sustained for 1h, this consumes 2% of the 30-day budget"
          }
        },
        {
          "name": "Feature Availability Burn Rate 5% Alert",
          "description": "Alert when Feature Availability is consuming its error budget at 6x the sustainable rate",
          "severity": "warning",
          "short_window": "30m",
          "long_window": "6h",
          "burn_rate_threshold": 6,
          "budget_consumed": "5%",
          "condition": "((1 - (sum(rate(feature_checks_total{service='payment-service',status='available'}[30m])) / sum(rate(feature_checks_total{service='payment-service'}[30m])))) / 0.0001 > 6) and ((1 - (sum(rate(feature_checks_total{service='payment-service',status='available'}[6h])) / sum(rate(feature_checks_total{service='payment-service'}[6h])))) / 0.0001 > 6)",
          "annotations": {
            "summary": "High burn rate detected for Feature Availability",
            "description": "Error budget is burning at 6x the sustainable rate; sustained for 6h, this consumes 5% of the 30-day budget"
          }
        },
        {
          "name": "Feature Availability Burn Rate 10% Alert",
          "description": "Alert when Feature Availability is consuming error budget at 3x rate",
          "severity": "info",
          "short_window": "2h",
          "long_window": "1d",
          "burn_rate_threshold": 3,
          "budget_consumed": "10%",
          "condition": "((1 - (sum(rate(http_requests_total{service='payment-service',code!~'5..'})) / sum(rate(http_requests_total{service='payment-service'}))))_short > 3) and ((1 - (sum(rate(http_requests_total{service='payment-service',code!~'5..'})) / sum(rate(http_requests_total{service='payment-service'}))))_long > 3)",
          "annotations": {
            "summary": "High burn rate detected for Feature Availability",
            "description": "Error budget consumption rate is 3x normal, will exhaust 10% of monthly budget"
          }
        },
        {
          "name": "Feature Availability Burn Rate 10% Alert",
          "description": "Alert when Feature Availability is consuming error budget at 1x rate",
          "severity": "info",
          "short_window": "6h",
          "long_window": "3d",
          "burn_rate_threshold": 1,
          "budget_consumed": "10%",
          "condition": "((1 - (sum(rate(http_requests_total{service='payment-service',code!~'5..'})) / sum(rate(http_requests_total{service='payment-service'}))))_short > 1) and ((1 - (sum(rate(http_requests_total{service='payment-service',code!~'5..'})) / sum(rate(http_requests_total{service='payment-service'}))))_long > 1)",
          "annotations": {
            "summary": "High burn rate detected for Feature Availability",
            "description": "Error budget consumption rate is 1x normal, will exhaust 10% of monthly budget"
          }
        }
      ]
    }
  ],
  "sla_recommendations": {
    "applicable": true,
    "service": "payment-service",
    "commitments": [
      {
        "metric": "Availability",
        "target": 0.9989,
        "target_display": "99.89%",
        "measurement_window": "monthly",
        "measurement_method": "Uptime monitoring with 1-minute granularity"
      },
      {
        "metric": "Feature Availability",
        "target": 0.9989,
        "target_display": "99.89%",
        "measurement_window": "monthly",
        "measurement_method": "Uptime monitoring with 1-minute granularity"
      }
    ],
    "penalties": [
      {
        "breach_threshold": "< 99.99%",
        "credit_percentage": 10
      },
      {
        "breach_threshold": "< 99.9%",
        "credit_percentage": 25
      },
      {
        "breach_threshold": "< 99%",
        "credit_percentage": 50
      }
    ],
    "measurement_methodology": "External synthetic monitoring from multiple geographic locations",
    "exclusions": [
      "Planned maintenance windows (with 72h advance notice)",
      "Customer-side network or infrastructure issues",
      "Force majeure events",
      "Third-party service dependencies beyond our control"
    ]
  },
  "monitoring_recommendations": {
    "metrics": {
      "collection": "Prometheus with service discovery",
      "retention": "90 days for raw metrics, 1 year for aggregated",
      "alerting": "Prometheus Alertmanager with multi-window burn rate alerts"
    },
    "logging": {
      "format": "Structured JSON logs with correlation IDs",
      "aggregation": "ELK stack or equivalent with proper indexing",
      "retention": "30 days for debug logs, 90 days for error logs"
    },
    "tracing": {
      "sampling": "Adaptive sampling with 1% base rate",
      "storage": "Jaeger or Zipkin with 7-day retention",
      "integration": "OpenTelemetry instrumentation"
    }
  },
  "implementation_guide": {
    "prerequisites": [
      "Service instrumented with metrics collection (Prometheus format)",
      "Structured logging with correlation IDs",
      "Monitoring infrastructure (Prometheus, Grafana, Alertmanager)",
      "Incident response processes and escalation policies"
    ],
    "implementation_steps": [
      {
        "step": 1,
        "title": "Instrument Service",
        "description": "Add metrics collection for all defined SLIs",
        "estimated_effort": "1-2 days"
      },
      {
        "step": 2,
        "title": "Configure Recording Rules",
        "description": "Set up Prometheus recording rules for SLI calculations",
        "estimated_effort": "4-8 hours"
      },
      {
        "step": 3,
        "title": "Implement Burn Rate Alerts",
        "description": "Configure multi-window burn rate alerting rules",
        "estimated_effort": "1 day"
      },
      {
        "step": 4,
        "title": "Create SLO Dashboard",
        "description": "Build Grafana dashboard for SLO tracking and error budget monitoring",
        "estimated_effort": "4-6 hours"
      },
      {
        "step": 5,
        "title": "Test and Validate",
        "description": "Test alerting and validate SLI measurements against expectations",
        "estimated_effort": "1-2 days"
      },
      {
        "step": 6,
        "title": "Documentation and Training",
        "description": "Document runbooks and train team on SLO monitoring",
        "estimated_effort": "1 day"
      }
    ],
    "validation_checklist": [
      "All SLIs produce expected metric values",
      "Burn rate alerts fire correctly during simulated outages",
      "Error budget calculations match manual verification",
      "Dashboard displays accurate SLO achievement rates",
      "Alert routing reaches correct escalation paths",
      "Runbooks are complete and tested"
    ]
  }
}

Observability Designer

A comprehensive toolkit for designing production-ready observability strategies including SLI/SLO frameworks, alert optimization, and dashboard generation.

Overview

The Observability Designer skill provides three Python scripts for creating, optimizing, and maintaining observability systems:

SLO Designer: Generate complete SLI/SLO frameworks with error budgets and burn rate alerts
Alert Optimizer: Analyze and optimize existing alert configurations to reduce noise and improve effectiveness
Dashboard Generator: Create comprehensive dashboard specifications with role-based layouts and drill-down paths

Quick Start

Prerequisites

Python 3.7+
No external dependencies required (uses Python standard library only)

Basic Usage

# Generate SLO framework for a service
python3 scripts/slo_designer.py --service-type api --criticality critical --user-facing true --service-name payment-service

# Optimize existing alerts
python3 scripts/alert_optimizer.py --input assets/sample_alerts.json --analyze-only

# Generate a dashboard specification
python3 scripts/dashboard_generator.py --service-type web --name "Customer Portal" --role sre

Scripts Documentation

SLO Designer (`slo_designer.py`)

Generates comprehensive SLO frameworks based on service characteristics.

Features

Automatic SLI Selection: Recommends appropriate SLIs based on service type
Target Setting: Suggests SLO targets based on service criticality
Error Budget Calculation: Computes error budgets and burn rate thresholds
Multi-Window Burn Rate Alerts: Generates 4-window burn rate alerting rules
SLA Recommendations: Provides customer-facing SLA guidance
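
The error budget calculation can be sketched in a few lines. This is a minimal illustration and not the code in `slo_designer.py`; the function name `error_budget_seconds` and the window table are assumptions made for the example. It shows how an availability target translates into allowed "bad" time per window.

```python
# Hypothetical helper for illustration; not taken from slo_designer.py.
WINDOWS_SECONDS = {
    "1h": 3600,
    "1d": 86400,
    "7d": 7 * 86400,
    "30d": 30 * 86400,
}


def error_budget_seconds(slo_target):
    """Return the allowed 'bad' time in seconds per window.

    slo_target is a fraction, e.g. 0.9999 for a 99.99% target.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    budget_fraction = 1 - slo_target
    return {name: secs * budget_fraction for name, secs in WINDOWS_SECONDS.items()}


if __name__ == "__main__":
    for window, seconds in error_budget_seconds(0.9999).items():
        print(f"{window}: {seconds:.1f} seconds")
```

For a 99.99% target this works out to roughly 0.4 seconds per hour and about 4.3 minutes per 30 days.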

Usage Examples

# From service definition file
python3 scripts/slo_designer.py --input assets/sample_service_api.json --output slo_framework.json

# From command line parameters
python3 scripts/slo_designer.py \
    --service-type api \
    --criticality critical \
    --user-facing true \
    --service-name payment-service \
    --output payment_slos.json

# Generate and display summary only
python3 scripts/slo_designer.py --input assets/sample_service_web.json --summary-only

Service Definition Format

{
  "name": "payment-service",
  "type": "api",
  "criticality": "critical",
  "user_facing": true,
  "description": "Handles payment processing",
  "team": "payments",
  "environment": "production",
  "dependencies": [
    {
      "name": "user-service",
      "type": "api",
      "criticality": "high"
    }
  ]
}

Supported Service Types

api: REST APIs, GraphQL services
web: Web applications, SPAs
database: Database services, data stores
queue: Message queues, event streams
batch: Batch processing jobs
ml: Machine learning services

Criticality Levels

critical: 99.99% availability, <100ms P95 latency, <0.1% error rate
high: 99.9% availability, <200ms P95 latency, <0.5% error rate
medium: 99.5% availability, <500ms P95 latency, <1% error rate
low: 99% availability, <1s P95 latency, <2% error rate

Alert Optimizer (`alert_optimizer.py`)

Analyzes existing alert configurations and provides optimization recommendations.

Features

Noise Detection: Identifies alerts with high false positive rates
Coverage Analysis: Finds gaps in monitoring coverage
Duplicate Detection: Locates redundant or overlapping alerts
Threshold Analysis: Reviews alert thresholds for appropriateness
Fatigue Assessment: Evaluates alert volume and routing

Usage Examples

# Analyze existing alerts
python3 scripts/alert_optimizer.py --input assets/sample_alerts.json --analyze-only

# Generate optimized configuration
python3 scripts/alert_optimizer.py \
    --input assets/sample_alerts.json \
    --output optimized_alerts.json

# Generate HTML report
python3 scripts/alert_optimizer.py \
    --input assets/sample_alerts.json \
    --report alert_analysis.html \
    --format html

Alert Configuration Format

{
  "alerts": [
    {
      "alert": "HighLatency",
      "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5",
      "for": "5m",
      "labels": {
        "severity": "warning",
        "service": "payment-service"
      },
      "annotations": {
        "summary": "High request latency detected",
        "runbook_url": "https://runbooks.company.com/high-latency"
      },
      "historical_data": {
        "fires_per_day": 2.5,
        "false_positive_rate": 0.15
      }
    }
  ],
  "services": [
    {
      "name": "payment-service",
      "criticality": "critical"
    }
  ]
}
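
As a rough sketch of how the `historical_data` fields could feed noise detection, the snippet below flags alerts whose false positive rate or firing frequency exceeds a limit. The function name and both default limits are assumptions chosen for illustration; they are not the thresholds used by `alert_optimizer.py`.

```python
def find_noisy_alerts(config, max_false_positive_rate=0.10, max_fires_per_day=5.0):
    """Return (alert name, reasons) pairs for alerts that look noisy.

    Both limits are illustrative defaults, not values from alert_optimizer.py.
    """
    findings = []
    for alert in config.get("alerts", []):
        history = alert.get("historical_data")
        if not history:
            continue  # historical data is optional; nothing to assess
        reasons = []
        fp_rate = history.get("false_positive_rate")
        if fp_rate is not None and fp_rate > max_false_positive_rate:
            reasons.append(f"false positive rate {fp_rate:.0%}")
        fires = history.get("fires_per_day")
        if fires is not None and fires > max_fires_per_day:
            reasons.append(f"fires {fires} times per day")
        if reasons:
            findings.append((alert.get("alert", "<unnamed>"), reasons))
    return findings
```

With the limits above, the `HighLatency` example would be flagged for its 15% false positive rate but not for firing 2.5 times per day.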

Analysis Categories

Golden Signals: Latency, traffic, errors, saturation
Resource Utilization: CPU, memory, disk, network
Business Metrics: Revenue, conversion, user engagement
Security: Auth failures, suspicious activity
Availability: Uptime, health checks

Dashboard Generator (`dashboard_generator.py`)

Creates comprehensive dashboard specifications with role-based optimization.

Features

Role-Based Layouts: Optimized for SRE, Developer, Executive, and Ops personas
Golden Signals Coverage: Automatic inclusion of key monitoring metrics
Service-Type Specific Panels: Tailored panels based on service characteristics
Interactive Elements: Template variables, drill-down paths, time range controls
Grafana Compatibility: Generates Grafana-compatible JSON

Usage Examples

# From service definition
python3 scripts/dashboard_generator.py \
    --input assets/sample_service_web.json \
    --output dashboard.json

# With specific role optimization
python3 scripts/dashboard_generator.py \
    --service-type api \
    --name "Payment Service" \
    --role developer \
    --output payment_dev_dashboard.json

# Generate Grafana-compatible JSON
python3 scripts/dashboard_generator.py \
    --input assets/sample_service_api.json \
    --output dashboard.json \
    --format grafana

# With documentation
python3 scripts/dashboard_generator.py \
    --service-type web \
    --name "Customer Portal" \
    --output portal_dashboard.json \
    --doc-output portal_docs.md

Target Roles

sre: Focus on availability, latency, errors, resource utilization
developer: Emphasize latency, errors, throughput, business metrics
executive: Highlight availability, business metrics, user experience
ops: Priority on resource utilization, capacity, alerts, deployments

Panel Types

Stat: Single value displays with thresholds
Gauge: Resource utilization and capacity metrics
Timeseries: Trend analysis and historical data
Table: Top N lists and detailed breakdowns
Heatmap: Distribution and correlation analysis

Sample Data

The assets/ directory contains sample configurations for testing:

sample_service_api.json: Critical API service definition
sample_service_web.json: High-priority web application definition
sample_alerts.json: Alert configuration with optimization opportunities

The expected_outputs/ directory shows example outputs from each script:

sample_slo_framework.json: Complete SLO framework for API service
optimized_alerts.json: Optimized alert configuration
sample_dashboard.json: SRE dashboard specification

Best Practices

SLO Design

Start with 1-2 SLOs per service and iterate
Choose SLIs that directly impact user experience
Set targets based on user needs, not technical capabilities
Use error budgets to balance reliability and velocity

Alert Optimization

Every alert must be actionable
Alert on symptoms, not causes
Use multi-window burn rate alerts for SLO protection
Implement proper escalation and routing policies

Dashboard Design

Follow the F-pattern for visual hierarchy
Use consistent color semantics across dashboards
Include drill-down paths for effective troubleshooting
Optimize for the target role's specific needs

Integration Patterns

CI/CD Integration

# Generate SLOs during service onboarding
python3 scripts/slo_designer.py --input service-config.json --output slos.json

# Validate alert configurations in pipeline
python3 scripts/alert_optimizer.py --input alerts.json --analyze-only --report validation.html

# Auto-generate dashboards for new services
python3 scripts/dashboard_generator.py --input service-config.json --format grafana --output dashboard.json

Monitoring Stack Integration

Prometheus: Generated alert rules and recording rules
Grafana: Dashboard JSON for direct import
Alertmanager: Routing and escalation policies
PagerDuty: Escalation configuration

GitOps Workflow

1. Store service definitions in version control
2. Generate observability configurations in CI/CD
3. Deploy configurations via GitOps
4. Monitor effectiveness and iterate

Advanced Usage

Custom SLO Targets

Override default targets by including them in service definitions:

{
  "name": "special-service",
  "type": "api",
  "criticality": "high",
  "custom_slos": {
    "availability_target": 0.9995,
    "latency_p95_target_ms": 150,
    "error_rate_target": 0.002
  }
}
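
One plausible way such overrides could be applied is to start from the defaults for the criticality level and layer `custom_slos` on top. The sketch below assumes exactly that; the `DEFAULT_TARGETS` table restates the Criticality Levels listed earlier, and `resolve_targets` is a hypothetical name, not a function from the skill's scripts.

```python
# Defaults restated from the Criticality Levels section (fractions and milliseconds).
DEFAULT_TARGETS = {
    "critical": {"availability_target": 0.9999, "latency_p95_target_ms": 100, "error_rate_target": 0.001},
    "high": {"availability_target": 0.999, "latency_p95_target_ms": 200, "error_rate_target": 0.005},
    "medium": {"availability_target": 0.995, "latency_p95_target_ms": 500, "error_rate_target": 0.01},
    "low": {"availability_target": 0.99, "latency_p95_target_ms": 1000, "error_rate_target": 0.02},
}


def resolve_targets(definition):
    """Merge criticality defaults with any custom_slos overrides (assumed behavior)."""
    targets = dict(DEFAULT_TARGETS[definition["criticality"]])
    targets.update(definition.get("custom_slos", {}))
    return targets


if __name__ == "__main__":
    service = {
        "name": "special-service",
        "type": "api",
        "criticality": "high",
        "custom_slos": {"availability_target": 0.9995},
    }
    print(resolve_targets(service))
```

Only the keys present in `custom_slos` are replaced; the remaining targets fall back to the defaults for the service's criticality.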

Alert Rule Templates

Use template variables for reusable alert rules:

# Generated Prometheus alert rule
- alert: {{ service_name }}_HighLatency
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service="{{ service_name }}"}[5m])) > {{ latency_threshold }}
  for: 5m
  labels:
    severity: warning
    service: "{{ service_name }}"
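
To show how the `{{ service_name }}` and `{{ latency_threshold }}` placeholders might be filled in, here is a small standard-library renderer. It is a sketch under the assumption that placeholders are simple `{{ name }}` tokens; the `render` function is hypothetical and the skill's scripts may substitute values differently.

```python
import re

TEMPLATE = """- alert: {{ service_name }}_HighLatency
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service="{{ service_name }}"}[5m])) > {{ latency_threshold }}
  for: 5m
  labels:
    severity: warning
    service: "{{ service_name }}"
"""

# Matches {{ name }} where name is a plain identifier. Prometheus templates such
# as {{ $labels.instance }} do not match and are therefore left untouched.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, variables):
    """Replace every {{ name }} placeholder; fail loudly on unknown names."""
    def substitute(match):
        key = match.group(1)
        if key not in variables:
            raise KeyError(f"no value provided for placeholder {key!r}")
        return str(variables[key])

    return _PLACEHOLDER.sub(substitute, template)


if __name__ == "__main__":
    print(render(TEMPLATE, {"service_name": "payment-service", "latency_threshold": 0.5}))
```

Raising on a missing variable is a deliberate choice here: a half-rendered rule would otherwise reach Prometheus with a literal placeholder in its expression.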

Dashboard Variants

Generate multiple dashboard variants for different use cases:

# SRE operational dashboard
python3 scripts/dashboard_generator.py --input service.json --role sre --output sre-dashboard.json

# Developer debugging dashboard  
python3 scripts/dashboard_generator.py --input service.json --role developer --output dev-dashboard.json

# Executive business dashboard
python3 scripts/dashboard_generator.py --input service.json --role executive --output exec-dashboard.json

Troubleshooting

Common Issues

Script Execution Errors

Ensure Python 3.7+ is installed
Check file paths and permissions
Validate JSON syntax in input files

Invalid Service Definitions

Required fields: name, type, criticality
Valid service types: api, web, database, queue, batch, ml
Valid criticality levels: critical, high, medium, low
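
These rules can be checked before running the scripts. The following is a minimal pre-flight validator written for illustration; `validate_service_definition` is a hypothetical helper and the messages it produces are not the scripts' own error output.

```python
REQUIRED_FIELDS = ("name", "type", "criticality")
VALID_TYPES = {"api", "web", "database", "queue", "batch", "ml"}
VALID_CRITICALITY = {"critical", "high", "medium", "low"}


def validate_service_definition(definition):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not definition.get(field):
            problems.append(f"missing required field: {field}")
    service_type = definition.get("type")
    if service_type and service_type not in VALID_TYPES:
        problems.append(f"invalid type: {service_type!r}")
    criticality = definition.get("criticality")
    if criticality and criticality not in VALID_CRITICALITY:
        problems.append(f"invalid criticality: {criticality!r}")
    return problems


if __name__ == "__main__":
    print(validate_service_definition({"name": "cache", "type": "cache"}))
```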

Missing Historical Data

Alert historical data is optional but improves analysis
Include fires_per_day and false_positive_rate when available
Use monitoring system APIs to populate historical metrics

Debug Mode

Enable verbose logging by setting the DEBUG environment variable:

export DEBUG=1
python3 scripts/slo_designer.py --input service.json

Contributing

Development Setup

# Clone the repository
git clone <repository-url>
cd engineering/observability-designer

# Run tests
python3 -m pytest tests/

# Lint code
python3 -m flake8 scripts/

Adding New Features

1. Follow existing code patterns and error handling
2. Include comprehensive docstrings and type hints
3. Add test cases for new functionality
4. Update documentation and examples

Support

For questions, issues, or feature requests:

Check existing documentation and examples
Review the reference materials in references/
Open an issue with detailed reproduction steps
Include sample configurations when reporting bugs

---

This skill is part of the Claude Skills marketplace. For more information about observability best practices, see the reference documentation in the `references/` directory.

Alert Design Patterns: A Guide to Effective Alerting

Introduction

Well-designed alerts are the difference between a reliable system and 3 AM pages about non-issues. This guide provides patterns and anti-patterns for creating alerts that provide value without causing fatigue.

Fundamental Principles

The Golden Rules of Alerting

1. Every alert should be actionable - If you can't do something about it, don't alert
2. Every alert should require human intelligence - If a script can handle it, automate the response
3. Every alert should be novel - Don't alert on known, ongoing issues
4. Every alert should represent a user-visible impact - Internal metrics matter only if users are affected

Alert Classification

Critical Alerts

Service is completely down
Data loss is occurring
Security breach detected
SLO burn rate indicates imminent SLO violation

Warning Alerts

Service degradation affecting some users
Approaching resource limits
Dependent service issues
Elevated error rates within SLO

Info Alerts

Deployment notifications
Capacity planning triggers
Configuration changes
Maintenance windows

Alert Design Patterns

Pattern 1: Symptoms, Not Causes

Good: Alert on user-visible symptoms

- alert: HighLatency
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
  for: 5m
  annotations:
    summary: "API latency is high"
    description: "95th percentile latency is {{ $value }}s, above 500ms threshold"

Bad: Alert on internal metrics that may not affect users

- alert: HighCPU
  expr: cpu_usage > 80
  # This might not affect users at all!

Pattern 2: Multi-Window Alerting

Reduce false positives by requiring sustained problems:

- alert: ServiceDown
  expr: |
    avg_over_time(up[2m]) == 0    # Short window: immediate detection
    and
    avg_over_time(up[10m]) < 0.8  # Long window: avoid flapping
  for: 1m

Pattern 3: Burn Rate Alerting

Alert based on error budget consumption rate:

# Fast burn: 2% of monthly budget in 1 hour
- alert: ErrorBudgetFastBurn  
  expr: (
    error_rate_5m > (14.4 * error_budget_slo)
    and
    error_rate_1h > (14.4 * error_budget_slo)
  )
  for: 2m
  labels:
    severity: critical
    
# Slow burn: 10% of monthly budget in 3 days
- alert: ErrorBudgetSlowBurn
  expr: (
    error_rate_6h > (1.0 * error_budget_slo)
    and  
    error_rate_3d > (1.0 * error_budget_slo)
  )
  for: 15m
  labels:
    severity: warning
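
The burn rate multipliers follow from simple arithmetic: a burn rate is the fraction of the budget consumed, scaled by the ratio of the SLO period to the alert window. The sketch below assumes a 30-day SLO period; the function names are illustrative.

```python
def burn_rate(budget_fraction, alert_window_hours, slo_period_hours=30 * 24):
    """Burn rate that consumes budget_fraction of the error budget
    within alert_window_hours, for an SLO measured over slo_period_hours."""
    if alert_window_hours <= 0 or slo_period_hours <= 0:
        raise ValueError("windows must be positive")
    return budget_fraction * slo_period_hours / alert_window_hours


def error_rate_threshold(rate, slo_target):
    """Error ratio above which a burn-rate condition holds for a given SLO."""
    return rate * (1 - slo_target)


if __name__ == "__main__":
    print(round(burn_rate(0.02, 1), 2))       # 2% of the budget in 1 hour
    print(round(burn_rate(0.10, 3 * 24), 2))  # 10% of the budget in 3 days
    print(error_rate_threshold(14.4, 0.999))  # threshold for a 99.9% SLO
```

Spending 2% of a 30-day budget in one hour corresponds to a burn rate of 14.4, and spending 10% in three days corresponds to a burn rate of 1.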

Pattern 4: Hysteresis

Use different thresholds for firing and resolving to prevent flapping:

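- alert: HighErrorRate
  # Prometheus resolves an alert as soon as expr stops matching, so the lower
  # resolve threshold is encoded in the expression to prevent flapping around 5%.
  # on() assumes a single error_rate series; match on shared labels otherwise.
  expr: |
    error_rate > 0.05  # Fire at 5%
    or
    (
      error_rate > 0.03  # While already firing, hold until below 3%
      and on() ALERTS{alertname="HighErrorRate", alertstate="firing"}
    )
  for: 5m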
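
The intended firing and resolving behaviour can be modelled as a small state machine. This is a conceptual sketch in Python rather than anything Prometheus runs; the function name and sample values are made up for illustration.

```python
def hysteresis_states(samples, fire_above=0.05, resolve_below=0.03):
    """Return the firing state after each sample.

    The alert starts firing when a sample exceeds fire_above and only
    resolves once a sample drops below resolve_below.
    """
    firing = False
    states = []
    for value in samples:
        if not firing and value > fire_above:
            firing = True
        elif firing and value < resolve_below:
            firing = False
        states.append(firing)
    return states


if __name__ == "__main__":
    error_rates = [0.02, 0.06, 0.04, 0.051, 0.04, 0.02, 0.04]
    print(hysteresis_states(error_rates))
```

In the sample series the values hovering between 3% and 5% neither start nor stop the alert, which is what keeps it from flapping.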

Pattern 5: Composite Alerts

Alert when multiple conditions indicate a problem:

- alert: ServiceDegraded
  expr: |
    (
      (latency_p95 > latency_threshold)
      or
      (error_rate > error_threshold)
      or
      (availability < availability_threshold)
    ) and (
      request_rate > min_request_rate  # Only alert if we have traffic
    )

Pattern 6: Contextual Alerting

Include relevant context in alerts:

- alert: DatabaseConnections
  expr: db_connections_active / db_connections_max > 0.8
  for: 5m
  annotations:
    summary: "Database connection pool nearly exhausted"
    description: "{{ $labels.database }} has {{ $value | humanizePercentage }} connection utilization"
    runbook_url: "https://runbooks.company.com/database-connections"
    impact: "New requests may be rejected, causing 500 errors"
    suggested_action: "Check for connection leaks or increase pool size"

Alert Routing and Escalation

Routing by Impact and Urgency

Critical Path Services

route:
  group_by: ['service']
  routes:
  - match:
      service: 'payment-api'
      severity: 'critical'
    receiver: 'payment-team-pager'
    continue: true
  - match:
      service: 'payment-api' 
      severity: 'warning'
    receiver: 'payment-team-slack'
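
The way sibling routes are evaluated, including the effect of `continue: true`, can be illustrated with a simplified model. The sketch below only covers exact-match label routing at one level of the tree; `resolve_receivers` and the `default` receiver name are assumptions for the example, not part of Alertmanager.

```python
ROUTES = [
    {"match": {"service": "payment-api", "severity": "critical"},
     "receiver": "payment-team-pager", "continue": True},
    {"match": {"service": "payment-api", "severity": "warning"},
     "receiver": "payment-team-slack"},
]


def resolve_receivers(labels, routes, default_receiver="default"):
    """Return the receivers an alert with these labels would reach.

    Routes are checked in order. Matching stops at the first hit unless
    that route sets continue, in which case later siblings are checked too.
    If nothing matches, the parent (default) receiver handles the alert.
    """
    receivers = []
    for route in routes:
        if all(labels.get(key) == value for key, value in route["match"].items()):
            receivers.append(route["receiver"])
            if not route.get("continue", False):
                break
    return receivers or [default_receiver]


if __name__ == "__main__":
    print(resolve_receivers({"service": "payment-api", "severity": "critical"}, ROUTES))
    print(resolve_receivers({"service": "search-api", "severity": "critical"}, ROUTES))
```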

Time-Based Routing

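time_intervals:
- name: business_hours
  time_intervals:
  - weekdays: ['monday:friday']
    times:
    - start_time: '09:00'  # 9 AM - 5 PM
      end_time: '17:00'

route:
  routes:
  - match:
      severity: 'critical'
    receiver: 'oncall-pager'
  - match:
      severity: 'warning'
    receiver: 'team-slack'
    active_time_intervals: ['business_hours']
    continue: true  # Let the next route evaluate the same alert
  - match:
      severity: 'warning'
    receiver: 'team-email'  # Lower urgency outside business hours
    mute_time_intervals: ['business_hours']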

Escalation Patterns

Linear Escalation

receivers:
- name: 'primary-oncall'
  pagerduty_configs:
  - routing_key: '<pagerduty-integration-key>'
    # Escalation is defined in PagerDuty (policy 'P1-Escalation'), not in Alertmanager:
    # 0 min: Primary on-call
    # 5 min: Secondary on-call
    # 15 min: Engineering manager
    # 30 min: Director of engineering

Severity-Based Escalation

# Critical: Immediate escalation
- match:
    severity: 'critical'
  receiver: 'critical-escalation'
  
# Warning: Team-first escalation
- match:
    severity: 'warning'
  receiver: 'team-escalation'

Alert Fatigue Prevention

Grouping and Suppression

Time-Based Grouping

route:
  group_wait: 30s        # Wait 30s to group similar alerts
  group_interval: 2m     # Send grouped alerts every 2 minutes
  repeat_interval: 1h    # Re-send unresolved alerts every hour

Dependent Service Suppression

# Prometheus alert rules
- alert: ServiceDown
  expr: up == 0

- alert: HighLatency
  expr: latency_p95 > 1

# Alertmanager configuration (alertmanager.yml): suppress HighLatency
# while ServiceDown is firing for the same service
inhibit_rules:
- source_match:
    alertname: 'ServiceDown'
  target_match:
    alertname: 'HighLatency'
  equal: ['service']

Alert Throttling

# Require the condition to hold for 10 minutes before firing, which filters out short spikes
- alert: HighMemoryUsage
  expr: memory_usage_percent > 85
  for: 10m  # Longer 'for' duration reduces noise
  annotations:
    summary: "Memory usage has been high for 10+ minutes"

Smart Defaults

# Use business logic to set intelligent thresholds
- alert: LowTraffic
  expr: |
    request_rate < (
      avg_over_time(request_rate[7d]) * 0.1  # 10% of weekly average
    )
    # Only evaluate during business hours (UTC), when low traffic is unusual
    and on() (hour() >= 9 < 17)
  for: 30m

Runbook Integration

Runbook Structure Template

# Alert: {{ $labels.alertname }}

## Immediate Actions
1. Check service status dashboard
2. Verify if users are affected
3. Look at recent deployments/changes

## Investigation Steps
1. Check logs for errors in the last 30 minutes
2. Verify dependent services are healthy  
3. Check resource utilization (CPU, memory, disk)
4. Review recent alerts for patterns

## Resolution Actions
- If deployment-related: Consider rollback
- If resource-related: Scale up or optimize queries
- If dependency-related: Engage appropriate team

## Escalation
- Primary: @team-oncall
- Secondary: @engineering-manager  
- Emergency: @site-reliability-team

Runbook Integration in Alerts

annotations:
  runbook_url: "https://runbooks.company.com/alerts/{{ $labels.alertname }}"
  quick_debug: |
    1. curl -s https://{{ $labels.instance }}/health
    2. kubectl logs {{ $labels.pod }} --tail=50
    3. Check dashboard: https://grafana.company.com/d/service-{{ $labels.service }}

Testing and Validation

Alert Testing Strategies

Chaos Engineering Integration

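# Test that alerts fire during controlled failures.
# `chaos` and `wait_for_alert` are placeholders for your own chaos-tooling and alert-polling helpers.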
def test_alert_during_cpu_spike():
    with chaos.cpu_spike(target='payment-api', duration='2m'):
        assert wait_for_alert('HighCPU', timeout=180)
        
def test_alert_during_network_partition():
    with chaos.network_partition(target='database'):
        assert wait_for_alert('DatabaseUnreachable', timeout=60)

Historical Alert Analysis

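# Query to find alerts that fired without incidents
# (incident_created is assumed to be exported by your incident tooling)
count by (alertname) (
  count_over_time(ALERTS{alertstate="firing"}[30d])
) unless on (alertname) (
  count by (alertname) (
    count_over_time(incident_created{source="alert"}[30d])
  )
)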

Alert Quality Metrics

Alert Precision

Precision = True Positives / (True Positives + False Positives)

Track alerts that resulted in actual incidents vs false alarms.
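
As a worked example of the formula, the helper below computes precision from counts of true and false positives. The function name is illustrative, and returning `None` for an alert that never fired is a choice made for this sketch.

```python
def alert_precision(true_positives, false_positives):
    """Fraction of alert firings that corresponded to real incidents.

    Returns None when the alert never fired, since precision is undefined.
    """
    if true_positives < 0 or false_positives < 0:
        raise ValueError("counts must be non-negative")
    total = true_positives + false_positives
    if total == 0:
        return None
    return true_positives / total


if __name__ == "__main__":
    # 17 firings led to incidents, 3 were false alarms.
    print(alert_precision(17, 3))
```

An alert that fired 20 times with 3 false alarms has a precision of 0.85, which corresponds to a 15% false positive rate.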

Time to Resolution

# Average time from alert firing to resolution
avg by (alertname) (
  avg_over_time(
    (alert_resolved_timestamp - alert_fired_timestamp)[30d:1h]
  )
)

Alert Fatigue Indicators

# Alerts per day by team
sum by (team) (
  increase(alerts_fired_total[1d])
)

# Percentage of alerts acknowledged within 15 minutes
sum(alerts_acked_within_15m) / sum(alerts_fired) * 100

Advanced Patterns

Machine Learning-Enhanced Alerting

Anomaly Detection

- alert: AnomalousTraffic
  expr: |
    abs(request_rate - predict_linear(request_rate[1h], 300)) / 
    stddev_over_time(request_rate[1h]) > 3
  for: 10m
  annotations:
    summary: "Traffic pattern is anomalous"
    description: "Current traffic deviates from predicted pattern by >3 standard deviations"

Dynamic Thresholds

- alert: DynamicHighLatency
  expr: |
    latency_p95 > (
      quantile_over_time(0.95, latency_p95[7d]) +  # Historical 95th percentile
      2 * stddev_over_time(latency_p95[7d])        # Plus 2 standard deviations
    )
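
The same threshold can be reproduced offline to sanity-check it against exported samples. The sketch below uses linear interpolation for the quantile and the population standard deviation, which is assumed here to mirror `quantile_over_time` and `stddev_over_time`; the function names and sample data are illustrative.

```python
import math
import statistics


def interpolated_quantile(values, q):
    """Quantile with linear interpolation between the closest ranks."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = q * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = min(len(ordered) - 1, lower + 1)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def dynamic_threshold(history, q=0.95, k=2.0):
    """Historical q-quantile plus k population standard deviations."""
    return interpolated_quantile(history, q) + k * statistics.pstdev(history)


if __name__ == "__main__":
    latency_history = [0.12, 0.15, 0.11, 0.18, 0.14, 0.13, 0.16]
    threshold = dynamic_threshold(latency_history)
    print(f"alert when latency_p95 > {threshold:.3f}s")
```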

Business Hours Awareness

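# Different thresholds for business vs off hours
# (hour() and day_of_week() evaluate in UTC; adjust the bounds for your time zone)
- alert: HighLatencyBusinessHours
  expr: |
    latency_p95 > 0.2  # Stricter during business hours
    and on() (hour() >= 9 < 17)         # 9 AM - 5 PM
    and on() (day_of_week() >= 1 <= 5)  # Monday - Friday
  for: 2m

- alert: HighLatencyOffHours
  expr: |
    latency_p95 > 0.5  # More lenient after hours
    unless on() ((hour() >= 9 < 17) and (day_of_week() >= 1 <= 5))  # Nights and weekends
  for: 5m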

Progressive Alerting

# Escalating alert severity based on duration
- alert: ServiceLatencyElevated
  expr: latency_p95 > 0.5
  for: 5m
  labels:
    severity: info
    
- alert: ServiceLatencyHigh
  expr: latency_p95 > 0.5
  for: 15m  # Same condition, longer duration
  labels:
    severity: warning
    
- alert: ServiceLatencyCritical  
  expr: latency_p95 > 0.5
  for: 30m  # Same condition, even longer duration
  labels:
    severity: critical
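
The escalation ladder above amounts to a lookup from "how long has the condition held" to a severity. A minimal sketch, with a hypothetical function name:

```python
# (minimum minutes above threshold, severity), highest severity first.
ESCALATION_STEPS = [(30, "critical"), (15, "warning"), (5, "info")]


def severity_for(minutes_above_threshold):
    """Return the highest severity whose duration requirement is met, or None."""
    for minimum_minutes, severity in ESCALATION_STEPS:
        if minutes_above_threshold >= minimum_minutes:
            return severity
    return None


if __name__ == "__main__":
    for minutes in (3, 5, 20, 45):
        print(minutes, severity_for(minutes))
```

Because all three rules share one expression, a sustained problem fires them in sequence; pairing this with an inhibition rule can keep the lower severities quiet once a higher one is active.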

Anti-Patterns to Avoid

Anti-Pattern 1: Alerting on Everything

Problem: Too many alerts create noise and fatigue
Solution: Be selective; only alert on user-impacting issues

Anti-Pattern 2: Vague Alert Messages

Problem: "Service X is down" - which instance? what's the impact?
Solution: Include specific details and context

Anti-Pattern 3: Alerts Without Runbooks

Problem: Alerts that don't explain what to do
Solution: Every alert must have an associated runbook

Anti-Pattern 4: Static Thresholds

Problem: 80% CPU might be normal during peak hours
Solution: Use contextual, adaptive thresholds

Anti-Pattern 5: Ignoring Alert Quality

Problem: Accepting high false positive rates
Solution: Regularly review and tune alert precision

Implementation Checklist

Pre-Implementation

[ ] Define alert severity levels and escalation policies
[ ] Create runbook templates
[ ] Set up alert routing configuration
[ ] Define SLOs that alerts will protect

Alert Development

[ ] Each alert has clear success criteria
[ ] Alert conditions tested against historical data
[ ] Runbook created and accessible
[ ] Severity and routing configured
[ ] Context and suggested actions included

Post-Implementation

[ ] Monitor alert precision and recall
[ ] Regular review of alert fatigue metrics
[ ] Quarterly alert effectiveness review
[ ] Team training on alert response procedures

Quality Assurance

[ ] Test alerts fire during controlled failures
[ ] Verify alerts resolve when conditions improve
[ ] Confirm runbooks are accurate and helpful
[ ] Validate escalation paths work correctly

Remember: Great alerts are invisible when things work and invaluable when things break. Focus on quality over quantity, and always optimize for the human who will respond to the alert at 3 AM.

Dashboard Best Practices: Design for Insight and Action

Introduction

A well-designed dashboard is like a good story - it guides you through the data with purpose and clarity. This guide provides practical patterns for creating dashboards that inform decisions and enable quick troubleshooting.

Design Principles

The Hierarchy of Information

Primary Information (Top Third)

Service health status
SLO achievement
Critical alerts
Business KPIs

Secondary Information (Middle Third)

Golden signals (latency, traffic, errors, saturation)
Resource utilization
Throughput and performance metrics

Tertiary Information (Bottom Third)

Detailed breakdowns
Historical trends
Dependency status
Debug information

Visual Design Principles

Rule of 7±2

Maximum 7±2 panels per screen
Group related information together
Use sections to organize complexity

Color Psychology

Red: Critical issues, danger, immediate attention needed
Yellow/Orange: Warnings, caution, degraded state
Green: Healthy, normal operation, success
Blue: Information, neutral metrics, capacity
Gray: Disabled, unknown, or baseline states

Chart Selection Guide

Line charts: Time series, trends, comparisons over time
Bar charts: Categorical comparisons, top N lists
Gauges: Single value with defined good/bad ranges
Stat panels: Key metrics, percentages, counts
Heatmaps: Distribution data, correlation analysis
Tables: Detailed breakdowns, multi-dimensional data

Dashboard Archetypes

The Overview Dashboard

Purpose: High-level health check and business metrics
Audience: Executives, managers, cross-team stakeholders
Update Frequency: 5-15 minutes

sections:
  - title: "Business Health"
    panels:
      - service_availability_summary
      - revenue_per_hour  
      - active_users
      - conversion_rate
      
  - title: "System Health"  
    panels:
      - critical_alerts_count
      - slo_achievement_summary
      - error_budget_remaining
      - deployment_status

The SRE Operational Dashboard

Purpose: Real-time monitoring and incident response
Audience: SRE, on-call engineers
Update Frequency: 15-30 seconds

sections:
  - title: "Service Status"
    panels:
      - service_up_status
      - active_incidents
      - recent_deployments
      
  - title: "Golden Signals"
    panels:
      - latency_percentiles
      - request_rate
      - error_rate  
      - resource_saturation
      
  - title: "Infrastructure"
    panels:
      - cpu_memory_utilization
      - network_io
      - disk_space

The Developer Debug Dashboard

Purpose: Deep-dive troubleshooting and performance analysis
Audience: Development teams
Update Frequency: 30 seconds - 2 minutes

sections:
  - title: "Application Performance"
    panels:
      - endpoint_latency_breakdown
      - database_query_performance
      - cache_hit_rates
      - queue_depths
      
  - title: "Errors and Logs"
    panels:
      - error_rate_by_endpoint
      - log_volume_by_level
      - exception_types
      - slow_queries

Layout Patterns

The F-Pattern Layout

Based on eye-tracking studies, users scan in an F-pattern:

[Critical Status] [SLO Summary  ] [Error Budget ]
[Latency       ] [Traffic      ] [Errors       ]
[Saturation    ] [Resource Use ] [Detailed View]
[Historical    ] [Dependencies ] [Debug Info   ]

The Z-Pattern Layout

For executive dashboards, follow the Z-pattern:

[Business KPIs          ] → [System Status]
      ↓                          ↓
[Trend Analysis         ] ← [Key Metrics ]

Responsive Design

Desktop (1920x1080)

24-column grid
Panels can be 6, 8, 12, or 24 units wide
4-6 rows visible without scrolling
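
Placing panels on a 24-column grid is mostly bookkeeping. The sketch below lays panels out left to right and wraps to a new row when a panel no longer fits, producing `x`, `y`, `w`, `h` positions in the style of Grafana's `gridPos`. The function name and the fixed row height of 8 are assumptions for illustration, not the layout logic of `dashboard_generator.py`.

```python
GRID_COLUMNS = 24


def layout_panels(widths, height=8):
    """Assign grid positions to panels, wrapping when a row is full."""
    positions = []
    x = 0
    y = 0
    for width in widths:
        if not 1 <= width <= GRID_COLUMNS:
            raise ValueError(f"panel width must be between 1 and {GRID_COLUMNS}")
        if x + width > GRID_COLUMNS:
            # Panel does not fit on the current row: start a new one.
            x = 0
            y += height
        positions.append({"x": x, "y": y, "w": width, "h": height})
        x += width
    return positions


if __name__ == "__main__":
    for position in layout_panels([8, 8, 8, 12, 12, 24]):
        print(position)
```

Three 8-unit panels fill the first row, two 12-unit panels the second, and a 24-unit panel takes the third row on its own.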

Laptop (1366x768)

Stack wider panels vertically
Reduce panel heights
Prioritize most critical information

Mobile (768px width)

Single column layout
Simplified panels
Touch-friendly controls

Effective Panel Design

Stat Panels

# Good: Clear value with context
- title: "API Availability"
  type: stat
  targets:
    - expr: avg(up{service="api"}) * 100
  field_config:
    unit: percent
    thresholds:
      steps:
        - color: red
          value: 0
        - color: yellow  
          value: 99
        - color: green
          value: 99.9
  options:
    color_mode: background
    text_mode: value_and_name

Time Series Panels

# Good: Multiple related metrics with clear legend
- title: "Request Latency"
  type: timeseries
  targets:
    - expr: histogram_quantile(0.50, rate(http_duration_bucket[5m]))
      legend: "P50"
    - expr: histogram_quantile(0.95, rate(http_duration_bucket[5m]))
      legend: "P95"  
    - expr: histogram_quantile(0.99, rate(http_duration_bucket[5m]))
      legend: "P99"
  field_config:
    unit: ms
    custom:
      draw_style: line
      fill_opacity: 10
  options:
    legend:
      display_mode: table
      placement: bottom
      values: [min, max, mean, last]

Table Panels

# Good: Top N with relevant columns
- title: "Slowest Endpoints"
  type: table
  targets:
    # Keep the le label in the aggregation; histogram_quantile needs it
    - expr: topk(10, histogram_quantile(0.95, sum by (handler, le)(rate(http_duration_bucket[5m]))))
      format: table
      instant: true
  transformations:
    - id: organize
      options:
        exclude_by_name: 
          Time: true
        rename_by_name:
          Value: "P95 Latency (ms)"
          handler: "Endpoint"

Color and Visualization Best Practices

Threshold Configuration

# Traffic light system with meaningful boundaries
thresholds:
  steps:
    - color: green     # Good performance
      value: null      # Default
    - color: yellow    # Degraded performance  
      value: 95        # 95th percentile of historical normal
    - color: orange    # Poor performance
      value: 99        # 99th percentile of historical normal
    - color: red       # Critical performance
      value: 99.9      # Worst case scenario
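
To make the step semantics concrete, here is a minimal sketch of how a reading could be mapped to a color. It assumes the common convention that each step applies from its `value` upward and that `null` marks the base step; `color_for` is a hypothetical helper, not part of any dashboard tool's API.

```python
import math

# Mirrors the steps above; None stands in for YAML null (the base step).
STEPS = [
    ("green", None),
    ("yellow", 95),
    ("orange", 99),
    ("red", 99.9),
]


def color_for(value, steps=STEPS):
    """Return the color of the highest step whose lower bound is <= value."""
    chosen = None
    for color, lower in steps:
        bound = -math.inf if lower is None else lower
        if value >= bound:
            chosen = color
    return chosen


for reading in (50, 95, 99.5, 99.9):
    print(reading, color_for(reading))
```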

Color Blind Friendly Palettes

# Use patterns and shapes in addition to color
field_config:
  overrides:
    - matcher:
        id: byName
        options: "Critical"
      properties:
        - id: color
          value:
            mode: fixed
            fixed_color: "#d73027"  # Red-orange for protanopia
        - id: custom.draw_style
          value: "points"           # Different shape

Consistent Color Semantics

Success/Health: Green (#28a745)
Warning/Degraded: Yellow (#ffc107)
Error/Critical: Red (#dc3545)
Information: Blue (#007bff)
Neutral: Gray (#6c757d)

Time Range Strategy

Default Time Ranges by Dashboard Type

Real-time Operational

Default: Last 15 minutes
Quick options: 5m, 15m, 1h, 4h
Auto-refresh: 15-30 seconds

Troubleshooting

Default: Last 1 hour
Quick options: 15m, 1h, 4h, 12h, 1d
Auto-refresh: 1 minute

Business Review

Default: Last 24 hours
Quick options: 1d, 7d, 30d, 90d
Auto-refresh: 5 minutes

Capacity Planning

Default: Last 7 days
Quick options: 7d, 30d, 90d, 1y
Auto-refresh: 15 minutes

Time Range Annotations

# Add context for time-based events
annotations:
  - name: "Deployments"
    datasource: "Prometheus"
    expr: "deployment_timestamp"
    title_format: "Deploy {{ version }}"
    text_format: "Deployed version {{ version }} to {{ environment }}"
    
  - name: "Incidents"  
    datasource: "Incident API"
    query: "incidents.json?service={{ service }}"
    color: "red"

Interactive Features

Template Variables

# Service selector
- name: service
  type: query
  query: label_values(up, service)
  current:
    text: All
    value: $__all
  include_all: true
  multi: true
  
# Environment selector  
- name: environment
  type: query
  query: label_values(up{service=~"$service"}, environment)  # =~ because service is multi-value
  current:
    text: production
    value: production

Drill-Down Links

# Panel-level drill-downs
- title: "Error Rate"
  type: timeseries
  # ... other config ...
  options:
    data_links:
      - title: "View Error Logs"
        url: "/d/logs-dashboard?var-service=${__field.labels.service}&from=${__from}&to=${__to}"
      - title: "Error Traces"  
        url: "/d/traces-dashboard?var-service=${__field.labels.service}"

Dynamic Panel Titles

- title: "${service} - Request Rate"  # Uses template variable
  type: timeseries
  # Title updates automatically when service variable changes

Performance Optimization

Query Optimization

Use Recording Rules

# Instead of complex queries in dashboards
groups:
  - name: http_requests
    rules:
      - record: http_request_rate_5m
        expr: sum(rate(http_requests_total[5m])) by (service, method, handler)
        
      - record: http_request_latency_p95_5m
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le))

Limit Data Points

# Good: Reasonable resolution for dashboard
- expr: http_request_rate_5m
  interval: 15s  # One point every 15 seconds (240 points over 1h)

# Bad: Too many points for visualization
- expr: http_request_rate_5m
  interval: 1s   # 3600 points over 1h!
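
The point counts follow from dividing the time range by the query step. The sketch below shows that arithmetic and the inverse calculation; both helper names and the 500-point budget in the usage line are assumptions for illustration.

```python
import math


def point_count(range_seconds, step_seconds):
    """Number of samples a panel requests per series for a given step."""
    return range_seconds // step_seconds


def min_step(range_seconds, max_points):
    """Smallest whole-second step that stays within a point budget."""
    return math.ceil(range_seconds / max_points)


one_hour = 3600
print(point_count(one_hour, 15))  # 15-second step
print(point_count(one_hour, 1))   # 1-second step
print(min_step(one_hour, 500))    # step needed for a hypothetical 500-point budget
```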

Dashboard Performance

Panel Limits

Maximum panels per dashboard: 20-30
Maximum queries per panel: 10
Maximum time series per panel: 50

Caching Strategy

# Use appropriate cache headers
cache_timeout: 30  # Cache for 30 seconds on fast-changing panels
cache_timeout: 300 # Cache for 5 minutes on slow-changing panels

Accessibility

Screen Reader Support

# Provide text alternatives for visual elements
- title: "Service Health Status"
  type: stat
  options:
    text_mode: value_and_name  # Includes both value and description
  field_config:
    mappings:
      - options:
          "1": 
            text: "Healthy"
            color: "green"
          "0":
            text: "Unhealthy"  
            color: "red"

Keyboard Navigation

Ensure all interactive elements are keyboard accessible
Provide logical tab order
Include skip links for complex dashboards

High Contrast Mode

# Test dashboards work in high contrast mode
theme: high_contrast
colors:
  - "#000000"  # Pure black
  - "#ffffff"  # Pure white  
  - "#ffff00"  # Pure yellow
  - "#ff0000"  # Pure red

Testing and Validation

Dashboard Testing Checklist

Functional Testing

[ ] All panels load without errors
[ ] Template variables filter correctly
[ ] Time range changes update all panels
[ ] Drill-down links work as expected
[ ] Auto-refresh functions properly

Visual Testing

[ ] Dashboard renders correctly on different screen sizes
[ ] Colors are distinguishable and meaningful
[ ] Text is readable at normal zoom levels
[ ] Legends and labels are clear

Performance Testing

[ ] Dashboard loads in < 5 seconds
[ ] No queries timeout under normal load
[ ] Auto-refresh doesn't cause browser lag
[ ] Memory usage remains reasonable

Usability Testing

[ ] New team members can understand the dashboard
[ ] Action items are clear during incidents
[ ] Key information is quickly discoverable
[ ] Dashboard supports common troubleshooting workflows

Maintenance and Governance

Dashboard Lifecycle

Creation

1. Define dashboard purpose and audience
2. Identify key metrics and success criteria
3. Design layout following established patterns
4. Implement with consistent styling
5. Test with real data and user scenarios

Maintenance

Weekly: Check for broken panels or queries
Monthly: Review dashboard usage analytics
Quarterly: Gather user feedback and iterate
Annually: Major review and potential redesign

Retirement

Archive dashboards that are no longer used
Migrate users to replacement dashboards
Document lessons learned

Dashboard Standards

# Organization dashboard standards
standards:
  naming_convention: "[Team] [Service] - [Purpose]"
  tags: [team, service_type, environment, purpose]
  refresh_intervals: [15s, 30s, 1m, 5m, 15m]
  time_ranges: [5m, 15m, 1h, 4h, 1d, 7d, 30d]
  color_scheme: "company_standard"
  max_panels_per_dashboard: 25
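
Standards like these are easier to keep when they are checked automatically. The sketch below is one possible linter, assuming dashboards are plain dictionaries with `title`, `refresh`, and `panels` keys; that shape, the `check_dashboard` name, and the loose reading of the naming convention (at least two words, then " - ", then a purpose) are all assumptions.

```python
STANDARDS = {
    "refresh_intervals": {"15s", "30s", "1m", "5m", "15m"},
    "max_panels_per_dashboard": 25,
}


def check_dashboard(dashboard):
    """Return a list of human-readable violations (empty when compliant)."""
    problems = []

    # "[Team] [Service] - [Purpose]": two or more words, a separator, a purpose.
    head, separator, purpose = dashboard.get("title", "").partition(" - ")
    if not separator or len(head.split()) < 2 or not purpose.strip():
        problems.append("title does not match '[Team] [Service] - [Purpose]'")

    if dashboard.get("refresh") not in STANDARDS["refresh_intervals"]:
        problems.append("refresh interval is not in the approved list")

    if len(dashboard.get("panels", [])) > STANDARDS["max_panels_per_dashboard"]:
        problems.append("dashboard exceeds the panel limit")

    return problems


example = {"title": "Payments API - Operations", "refresh": "30s", "panels": [{}] * 12}
print(check_dashboard(example))
```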

Advanced Patterns

Composite Dashboards

# Overview panel that links to related dashboards (a dashlist lists dashboards; it does not embed their panels)
- title: "Service Overview"
  type: dashlist
  targets:
    - "service-health"
    - "service-performance" 
    - "service-business-metrics"
  options:
    show_headings: true
    max_items: 10

Dynamic Dashboard Generation

# Generate dashboards from service definitions
def generate_service_dashboard(service_config):
    panels = []
    
    # Always include golden signals
    panels.extend(generate_golden_signals_panels(service_config))
    
    # Add service-specific panels
    if service_config.type == 'database':
        panels.extend(generate_database_panels(service_config))
    elif service_config.type == 'queue':
        panels.extend(generate_queue_panels(service_config))
        
    return {
        'title': f"{service_config.name} - Operational Dashboard",
        'panels': panels,
        'variables': generate_variables(service_config)
    }
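
The function above relies on helpers that are not shown. A self-contained variant might look like the sketch below; `ServiceConfig`, `EXTRA_PANELS`, and the panel titles for database and queue services are hypothetical stand-ins, not the skill's actual generator.

```python
from dataclasses import dataclass

GOLDEN_SIGNALS = ("Latency", "Traffic", "Errors", "Saturation")

# Hypothetical service-specific panels, keyed by service type.
EXTRA_PANELS = {
    "database": ("Query Performance", "Connections"),
    "queue": ("Queue Depth", "Consumer Lag"),
}


@dataclass
class ServiceConfig:
    name: str
    type: str


def build_dashboard(config):
    """Golden-signal panels for every service, plus extras by service type."""
    titles = GOLDEN_SIGNALS + EXTRA_PANELS.get(config.type, ())
    return {
        "title": f"{config.name} - Operational Dashboard",
        "panels": [{"title": f"{config.name} {title}"} for title in titles],
    }


dashboard = build_dashboard(ServiceConfig(name="orders-db", type="database"))
print(dashboard["title"], len(dashboard["panels"]))
```

Keeping the service-specific panels in a lookup table means a new service type needs a table entry rather than another `elif` branch.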

A/B Testing for Dashboards

# Test different dashboard designs with different teams
experiment:
  name: "dashboard_layout_test"
  variants:
    - name: "traditional_layout"
      weight: 50
      config: "dashboard_v1.json"
    - name: "f_pattern_layout"  
      weight: 50
      config: "dashboard_v2.json"
  success_metrics:
    - "time_to_insight"
    - "user_satisfaction"
    - "troubleshooting_efficiency"

Remember: A dashboard should tell a story about your system's health and guide users toward the right actions. Focus on clarity over complexity, and always optimize for the person who will use it during a stressful incident.

SLO Cookbook: A Practical Guide to Service Level Objectives

Introduction

Service Level Objectives (SLOs) are a key tool for managing service reliability. This cookbook provides practical guidance for implementing SLOs that actually improve system reliability rather than just creating meaningless metrics.

Fundamentals

The SLI/SLO/SLA Hierarchy

SLI (Service Level Indicator): A quantifiable measure of service quality
SLO (Service Level Objective): A target range of values for an SLI
SLA (Service Level Agreement): A business agreement with consequences for missing SLO targets

Golden Rule of SLOs

Start simple, iterate based on learning. Your first SLOs won't be perfect, and that's okay.

Choosing Good SLIs

The Four Golden Signals

1. Latency: How long requests take to complete
2. Traffic: How many requests are coming in
3. Errors: How many requests are failing
4. Saturation: How "full" your service is

SLI Selection Criteria

A good SLI should be:

Measurable: You can collect data for it
Meaningful: It reflects user experience
Controllable: You can take action to improve it
Proportional: Changes in the SLI reflect changes in user happiness

Service Type Specific SLIs

HTTP APIs

Request latency: P95 or P99 response time
Availability: Proportion of successful requests (non-5xx)
Throughput: Requests per second capacity

# Availability SLI
sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Latency SLI (aggregate buckets across instances, keeping le)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
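
For intuition, the same two SLIs can be computed from raw request records. This is a minimal sketch assuming a list of status codes and a list of latencies in seconds; it uses a nearest-rank percentile, whereas `histogram_quantile` interpolates within histogram buckets, so the two will not agree exactly.

```python
import math


def availability(status_codes):
    """Proportion of requests that did not return a 5xx status."""
    if not status_codes:
        return None  # no traffic, so the SLI is undefined
    good = sum(1 for code in status_codes if not 500 <= code <= 599)
    return good / len(status_codes)


def percentile(samples, q):
    """Nearest-rank percentile for 0 < q <= 1."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


print(availability([200, 200, 503, 404]))
print(percentile([0.12, 0.30, 0.08, 0.95], 0.95))
```

Note that a 404 counts as a successful request here, matching the non-5xx definition of availability used above.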

Batch Jobs

Freshness: Age of the last successful run
Correctness: Proportion of jobs completing successfully
Throughput: Items processed per unit time

Data Pipelines

Data freshness: Time since last successful update
Data quality: Proportion of records passing validation
Processing latency: Time from ingestion to availability

Anti-Patterns in SLI Selection

❌ Don't use: CPU usage, memory usage, disk space as primary SLIs

These are symptoms, not user-facing impacts

❌ Don't use: Counts instead of rates or proportions

"Number of errors" vs "Error rate"

❌ Don't use: Internal metrics that users don't care about

Queue depth, cache hit rate (unless they directly impact user experience)

Setting SLO Targets

The Art of Target Setting

Setting SLO targets is a balancing act between:

User happiness: Targets should reflect acceptable user experience
Business value: Tighter SLOs cost more to maintain
Current performance: Targets should be achievable but aspirational

Target Setting Strategies

Historical Performance Method

1. Collect 4-6 weeks of historical data
2. Calculate the worst user-visible performance in that period
3. Set your SLO slightly better than the worst acceptable performance

User Journey Mapping

1. Map critical user journeys
2. Identify acceptable performance for each step
3. Work backwards to component SLOs

Error Budget Approach

1. Decide how much unreliability you can afford
2. Set SLO targets based on acceptable error budget consumption
3. Example: 99.9% availability = 43.8 minutes downtime per month

SLO Target Examples by Service Criticality

Critical Services (Revenue Impact)

Availability: 99.95% - 99.99%
Latency (P95): 100-200ms
Error Rate: < 0.1%

High Priority Services

Availability: 99.9% - 99.95%
Latency (P95): 200-500ms
Error Rate: < 0.5%

Standard Services

Availability: 99.5% - 99.9%
Latency (P95): 500ms - 1s
Error Rate: < 1%

Error Budget Management

What is an Error Budget?

Your error budget is the maximum amount of unreliability you can accumulate while still meeting your SLO. It's calculated as:

Error Budget = (1 - SLO) × Time Window

For a 99.9% availability SLO over 30 days:

Error Budget = (1 - 0.999) × 30 days = 0.001 × 43,200 minutes = 43.2 minutes
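
The formula is simple enough to check in code. The sketch below is illustrative only; `error_budget_minutes` is a hypothetical helper name. For a 30-day window the budget works out to 43.2 minutes, while the 43.8-minute figure often quoted for "a month" corresponds to an average calendar month of about 30.44 days.

```python
def error_budget_minutes(slo, window_days):
    """Allowed minutes of unreliability: (1 - SLO) x time window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1")
    return (1 - slo) * window_days * 24 * 60


print(f"{error_budget_minutes(0.999, 30):.1f} minutes over 30 days")
print(f"{error_budget_minutes(0.999, 30.4375):.1f} minutes over an average month")
print(f"{error_budget_minutes(0.9999, 30):.2f} minutes at 99.99% over 30 days")
```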

Error Budget Policies

Define what happens when you consume your error budget:

Conservative Policy (High-Risk Services)

> 50% consumed: Freeze non-critical feature releases
> 75% consumed: Focus entirely on reliability improvements
> 90% consumed: Consider emergency measures (traffic shaping, etc.)

Balanced Policy (Standard Services)

> 75% consumed: Increase focus on reliability work
> 90% consumed: Pause feature work, focus on reliability

Aggressive Policy (Early Stage Services)

> 90% consumed: Review but continue normal operations
100% consumed: Evaluate SLO appropriateness
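
Policies like these can be encoded as data so that tooling applies them consistently. The sketch below assumes budget consumption is expressed as a fraction (0.6 means 60% consumed); the `POLICIES` table and `policy_action` function are hypothetical names, and the thresholds are copied from the lists above.

```python
# (threshold, inclusive, action), ordered from the highest threshold down.
POLICIES = {
    "conservative": [
        (0.90, False, "Consider emergency measures (traffic shaping, etc.)"),
        (0.75, False, "Focus entirely on reliability improvements"),
        (0.50, False, "Freeze non-critical feature releases"),
    ],
    "balanced": [
        (0.90, False, "Pause feature work, focus on reliability"),
        (0.75, False, "Increase focus on reliability work"),
    ],
    "aggressive": [
        (1.00, True, "Evaluate SLO appropriateness"),
        (0.90, False, "Review but continue normal operations"),
    ],
}


def policy_action(policy, consumed):
    """Return the action for the highest threshold crossed, or None."""
    for threshold, inclusive, action in POLICIES[policy]:
        if consumed > threshold or (inclusive and consumed == threshold):
            return action
    return None


print(policy_action("conservative", 0.6))
print(policy_action("balanced", 0.5))
```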

Burn Rate Alerting

Multi-window burn rate alerts help you catch SLO violations before they become critical:

# 0.001 is the error budget of a 99.9% SLO; the multiplier is the burn rate.
# Fast burn: 2% budget consumed in 1 hour
- alert: FastBurnSLOViolation
  expr: |
    (
      (1 - (sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m])))) > (14.4 * 0.001)
      and
      (1 - (sum(rate(http_requests_total{code!~"5.."}[1h])) / sum(rate(http_requests_total[1h])))) > (14.4 * 0.001)
    )
  for: 2m

# Slow burn: 10% budget consumed in 3 days
- alert: SlowBurnSLOViolation
  expr: |
    (
      (1 - (sum(rate(http_requests_total{code!~"5.."}[6h])) / sum(rate(http_requests_total[6h])))) > (1.0 * 0.001)
      and
      (1 - (sum(rate(http_requests_total{code!~"5.."}[3d])) / sum(rate(http_requests_total[3d])))) > (1.0 * 0.001)
    )
  for: 15m
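
The multipliers 14.4 and 1.0 come from a short calculation: the burn rate is the fraction of budget you are willing to spend, scaled by how much shorter the alert window is than the SLO window. The sketch below assumes a 30-day SLO window and a 99.9% target, as in the rules above; the function names are illustrative.

```python
def burn_rate(budget_fraction, alert_window_hours, slo_window_hours=30 * 24):
    """Burn rate that consumes budget_fraction of the budget in the alert window."""
    return budget_fraction * slo_window_hours / alert_window_hours


def error_rate_threshold(rate, slo=0.999):
    """Error ratio at which the given burn rate is reached."""
    return rate * (1 - slo)


fast = burn_rate(0.02, alert_window_hours=1)    # 2% of budget in 1 hour
slow = burn_rate(0.10, alert_window_hours=72)   # 10% of budget in 3 days
print(round(fast, 2), round(slow, 2))
print(round(error_rate_threshold(fast), 4))
```

A burn rate of 1.0 means the budget is being spent exactly as fast as the SLO allows, so the slow-burn rule fires when errors sit at or above the sustainable rate for days.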

Implementation Patterns

The SLO Implementation Ladder

Level 1: Basic SLOs

Choose 1-2 SLIs that matter most to users
Set aspirational but achievable targets
Implement basic alerting when SLOs are missed

Level 2: Operational SLOs

Add burn rate alerting
Create error budget dashboards
Establish error budget policies
Regular SLO review meetings

Level 3: Advanced SLOs

Multi-window burn rate alerts
Automated error budget policy enforcement
SLO-driven incident prioritization
Integration with CI/CD for deployment decisions

SLO Measurement Architecture

Push vs Pull Metrics

Pull (Prometheus): Good for infrastructure metrics, real-time alerting
Push (StatsD): Good for application metrics, business events

Measurement Points

Server-side: More reliable, easier to implement
Client-side: Better reflects user experience
Synthetic: Consistent, predictable, may not reflect real user experience

SLO Dashboard Design

Essential elements for SLO dashboards:

1. Current SLO Achievement: Large, prominent display
2. Error Budget Remaining: Visual indicator (gauge, progress bar)
3. Burn Rate: Time series showing error budget consumption rate
4. Historical Trends: 4-week view of SLO achievement
5. Alerts: Current and recent SLO-related alerts

Advanced Topics

Dependency SLOs

For services with dependencies:

SLO_service ≤ min(SLO_inherent, ∏SLO_dependencies)

If your service depends on 3 other services each with 99.9% SLO:

Maximum_SLO = 0.999³ ≈ 0.997 = 99.7%
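
The bound is easy to evaluate for any set of dependencies. This sketch implements the inequality exactly as written above; `max_achievable_slo` is a hypothetical name, and the calculation assumes the dependencies fail independently and that every one of them is required to serve a request.

```python
import math


def max_achievable_slo(inherent, dependency_slos):
    """Upper bound on a service's SLO given its hard dependencies."""
    return min(inherent, math.prod(dependency_slos))


bound = max_achievable_slo(inherent=0.9999, dependency_slos=[0.999, 0.999, 0.999])
print(f"{bound:.4%}")
```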

User Journey SLOs

Track end-to-end user experiences:

# Registration success rate
sum(rate(user_registration_success_total[5m])) / sum(rate(user_registration_attempts_total[5m]))

# Purchase completion latency (aggregate buckets, keeping le)
histogram_quantile(0.95, sum by (le) (rate(purchase_completion_duration_seconds_bucket[5m])))

SLOs for Batch Systems

Special considerations for non-request/response systems:

Freshness SLO

# Data should be no more than 4 hours old
(time() - last_successful_update_timestamp) < (4 * 3600)

Throughput SLO

# Should process at least 1000 items per hour
# (increase() counts items over the window; rate() would return items per second)
increase(items_processed_total[1h]) >= 1000

Quality SLO

# At least 99.5% of records should pass validation
sum(rate(records_valid_total[5m])) / sum(rate(records_processed_total[5m])) >= 0.995
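
The three batch SLO checks can also be expressed as plain functions, which is handy for unit-testing a pipeline's own health report. The sketch below is illustrative: the function names are hypothetical, timestamps are Unix seconds, and the defaults mirror the targets used above.

```python
def freshness_ok(last_success_ts, now_ts, max_age_hours=4):
    """Data is fresh if the last successful update is under max_age_hours old."""
    return (now_ts - last_success_ts) < max_age_hours * 3600


def throughput_ok(items_last_hour, minimum=1000):
    """At least `minimum` items were processed in the past hour."""
    return items_last_hour >= minimum


def quality_ok(valid_records, processed_records, target=0.995):
    """The share of records passing validation meets the target."""
    if processed_records == 0:
        return False  # nothing processed, so quality cannot be demonstrated
    return valid_records / processed_records >= target


now = 1_700_000_000
print(freshness_ok(now - 3 * 3600, now))
print(throughput_ok(1250))
print(quality_ok(9_960, 10_000))
```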

Common Mistakes and How to Avoid Them

Mistake 1: Too Many SLOs

Problem: Drowning in metrics, losing focus
Solution: Start with 1-2 SLOs per service, add more only when needed

Mistake 2: Internal Metrics as SLIs

Problem: Optimizing for metrics that don't impact users
Solution: Always ask "If this metric changes, do users notice?"

Mistake 3: Perfectionist SLOs

Problem: 99.99% SLO when 99.9% would be fine
Solution: Higher SLOs cost exponentially more; pick the minimum acceptable level

Mistake 4: Ignoring Error Budgets

Problem: Treating any SLO miss as an emergency
Solution: Error budgets exist to be spent; use them to balance feature velocity and reliability

Mistake 5: Static SLOs

Problem: Setting SLOs once and never updating them
Solution: Review SLOs quarterly; adjust based on user feedback and business changes

SLO Review Process

Monthly SLO Review Agenda

1. SLO Achievement Review: Did we meet our SLOs?
2. Error Budget Analysis: How did we spend our error budget?
3. Incident Correlation: Which incidents impacted our SLOs?
4. SLI Quality Assessment: Are our SLIs still meaningful?
5. Target Adjustment: Should we change any targets?

Quarterly SLO Health Check

1. User Impact Validation: Survey users about acceptable performance
2. Business Alignment: Do SLOs still reflect business priorities?
3. Measurement Quality: Are we measuring the right things?
4. Cost/Benefit Analysis: Are tighter SLOs worth the investment?

Tooling and Automation

Essential Tools

1. Metrics Collection: Prometheus, InfluxDB, CloudWatch
2. Alerting: Alertmanager, PagerDuty, OpsGenie
3. Dashboards: Grafana, DataDog, New Relic
4. SLO Platforms: Sloth, Pyrra, Service Level Blue

Automation Opportunities

Burn rate alert generation from SLO definitions
Dashboard creation from SLO specifications
Error budget calculation and tracking
Release blocking based on error budget consumption

Getting Started Checklist

[ ] Identify your service's critical user journeys
[ ] Choose 1-2 SLIs that best reflect user experience
[ ] Collect 4-6 weeks of baseline data
[ ] Set initial SLO targets based on historical performance
[ ] Implement basic SLO monitoring and alerting
[ ] Create an SLO dashboard
[ ] Define error budget policies
[ ] Schedule monthly SLO reviews
[ ] Plan for quarterly SLO health checks

Remember: SLOs are a journey, not a destination. Start simple, learn from experience, and iterate toward better reliability management.

How it compares

Choose observability-designer to draft alert YAML and runbook metadata with an agent; use managed observability suites when dashboards and paging are already templated end to end.

FAQ

What alert format does observability-designer produce?

observability-designer produces Prometheus-style alert rules with expr, for duration, labels such as severity and service, and annotations including summary, description templates, and runbook_url links for operator guidance.

Which metrics do observability-designer example alerts use?

observability-designer examples use Prometheus histogram_quantile on http_request_duration_seconds_bucket with rate windows, plus labels for severity, service name, and team ownership to route incidents.

Is Observability Designer safe to install?

skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
