
SLO Implementation
Define SLIs, SLOs, burn-rate alerts, and review cadences so a solo builder can run production services with measurable reliability instead of guessing uptime.
Install
npx skills add https://github.com/wshobson/agents --skill slo-implementation
What is this skill?
- Multi-window burn-rate alert patterns to cut false positives on availability SLOs
- Weekly, monthly, and quarterly SLO review checklists (compliance, error budget, adjustments)
- 10 documented SLO best practices including user-facing SLIs and error-budget prioritization
- Pairs with prometheus-configuration and grafana-dashboards for metrics and dashboards
Adoption & trust: 6.7k installs on skills.sh; 36.5k GitHub stars; 3/3 security scanners passed (skills.sh audits).
Journey fit
SLOs and error budgets belong to the operate phase: alerts, compliance tracking, and post-incident prioritization all live in production monitoring. Monitoring is where SLI metrics, multi-window burn-rate rules, and SLO reporting attach to observability stacks like Prometheus and Grafana.
Common Questions / FAQ
Is SLO Implementation safe to install?
skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md
# slo-implementation: additional patterns and templates

## Multi-Window Burn Rate Alerts

```yaml
# Combination of short and long windows reduces false positives
rules:
  - alert: SLOBurnRateHigh
    expr: |
      (
        slo:http_availability:burn_rate_1h > 14.4
        and
        slo:http_availability:burn_rate_5m > 14.4
      )
      or
      (
        slo:http_availability:burn_rate_6h > 6
        and
        slo:http_availability:burn_rate_30m > 6
      )
    labels:
      severity: critical
```

## SLO Review Process

### Weekly Review

- Current SLO compliance
- Error budget status
- Trend analysis
- Incident impact

### Monthly Review

- SLO achievement
- Error budget usage
- Incident postmortems
- SLO adjustments

### Quarterly Review

- SLO relevance
- Target adjustments
- Process improvements
- Tooling enhancements

## Best Practices

1. **Start with user-facing services**
2. **Use multiple SLIs** (availability, latency, etc.)
3. **Set achievable SLOs** (don't aim for 100%)
4. **Implement multi-window alerts** to reduce noise
5. **Track error budget** consistently
6. **Review SLOs regularly**
7. **Document SLO decisions**
8. **Align with business goals**
9. **Automate SLO reporting**
10. **Use SLOs for prioritization**

## Related Skills

- `prometheus-configuration` - For metric collection
- `grafana-dashboards` - For SLO visualization

---
name: slo-implementation
description: Define and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs) with error budgets and alerting. Use when establishing reliability targets, implementing SRE practices, or measuring service performance.
---

# SLO Implementation

Framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.

## Purpose

Implement measurable reliability targets using SLIs, SLOs, and error budgets to balance reliability with innovation velocity.

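To make the error-budget idea in the Purpose concrete, here is a minimal Python sketch of the arithmetic. It is an illustration under stated assumptions, not part of the skill: the function names are hypothetical, and the 30-day window is an assumption (the SLI queries in this document use 28d). Under that assumption, the 14.4 and 6 thresholds in the burn-rate alert above are consistent with spending roughly 2% of the budget in one hour and 5% in six hours.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full outage the SLO tolerates within the window.

    Assumes a 30-day window by default (hypothetical choice for illustration).
    """
    budget_fraction = 1 - slo_percent / 100
    return budget_fraction * window_days * MINUTES_PER_DAY


def budget_consumed(burn_rate: float, alert_window_hours: float,
                    slo_window_days: int = 30) -> float:
    """Fraction of the total error budget spent if `burn_rate` is
    sustained for `alert_window_hours`.

    A burn rate of 1 spends the budget exactly over the SLO window.
    """
    return burn_rate * alert_window_hours / (slo_window_days * 24)


if __name__ == "__main__":
    print(f"99.9% over 30d: {error_budget_minutes(99.9):.1f} minutes")
    print(f"burn 14.4 for 1h: {budget_consumed(14.4, 1):.1%} of budget")
    print(f"burn 6 for 6h: {budget_consumed(6, 6):.1%} of budget")
```

With these assumptions, a 99.9% target over 30 days allows about 43.2 minutes of downtime. Pass `window_days=28` or `slo_window_days=28` to match a 28-day SLO window instead.
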
## When to Use

- Define service reliability targets
- Measure user-perceived reliability
- Implement error budgets
- Create SLO-based alerts
- Track reliability goals

## SLI/SLO/SLA Hierarchy

```
SLA (Service Level Agreement)
  ↓ Contract with customers
SLO (Service Level Objective)
  ↓ Internal reliability target
SLI (Service Level Indicator)
  ↓ Actual measurement
```

## Defining SLIs

### Common SLI Types

#### 1. Availability SLI

```promql
# Successful requests / Total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))
```

#### 2. Latency SLI

```promql
# Requests below latency threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
```

#### 3. Durability SLI

```
# Successful writes / Total writes
sum(storage_writes_successful_total)
/
sum(storage_writes_total)
```

**Reference:** See `references/slo-definitions.md`

## Setting SLO Targets

### Availability SLO Examples

| SLO %  | Downtime/Month | Downtime/Year |
| ------ | -------------- | ------------- |
| 99%    | 7.2 hours      | 3.65 days     |
| 99.9%  | 43.2 minutes   | 8.76 hours    |
| 99.95% | 21.6 minutes   | 4.38 hours    |
| 99.99% | 4.32 minutes   | 52.56 minutes |

### Choose Appropriate SLOs

**Consider:**

- User expectations
- Business requirements
- Current performance
- Cost of reliability
- Competitor benchmarks

**Example SLOs:**

```yaml
slos:
  - name: api_availability
    target: 99.9
    window: 28d
    sli: |
      sum(rate(http_requests_total{status!~"5.."}[28d]))
      /
      sum(rate(http_requests_total[28d]))

  - name: api_latency_p95
    target: 99
    window: 28d
    sli: |
      sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
      /
      sum(rate(http_request_duration_seconds_count[28d]))
```

## Error Budget Calculation

### Error Budget Formula

```
Error Budget = 1 - SLO Target
```

**Example:**

- SLO: 99.9% availability
- Error Budget: 0.1% = 43.2 minutes/month
- Current Er