
SRE Engineer
Inventory operational toil, score automation ROI, and prioritize SRE-style automation when you run services without a dedicated ops team.
Overview
SRE Engineer is an agent skill most often used in Operate (also Build, Ship) that models operational toil and prioritizes automation using categorized workloads and ROI scoring.
Install
npx skills add https://github.com/jeffallan/claude-skills --skill sre-engineer
What is this skill?
- Defines toil with a ToilCategory enum of five members, including manual, repetitive, and interrupt-driven work
- Provides ToilItem dataclass with weekly and annual hour calculations from frequency and duration
- ROI scoring weights annual hours by automation difficulty (easy, medium, hard multipliers)
- Frames automation as the primary lever to stop linear scaling of manual ops with user growth
- Includes Python-ready patterns for building a prioritized toil backlog
- 5 ToilCategory classification values
- 3 automation difficulty tiers in ROI scoring
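To see how the difficulty weighting can change a ranking, here is a minimal sketch of the arithmetic summarized above. The multipliers (1.0, 0.5, 0.25) and the 0.1 fallback for an unrecognized difficulty label come from the skill's SKILL.md; the function names and the two sample tasks are hypothetical and exist only for illustration.

```python
# Multipliers as listed in the skill's SKILL.md; unknown labels fall back to 0.1.
DIFFICULTY_MULTIPLIER = {"easy": 1.0, "medium": 0.5, "hard": 0.25}


def annual_hours(frequency_per_week: int, minutes_per_occurrence: int) -> float:
    """Convert weekly minutes to hours, then scale to 52 weeks."""
    return frequency_per_week * minutes_per_occurrence / 60 * 52


def roi_score(frequency_per_week: int, minutes_per_occurrence: int, difficulty: str) -> float:
    """Weight annual hours by how hard the automation is expected to be."""
    hours = annual_hours(frequency_per_week, minutes_per_occurrence)
    return hours * DIFFICULTY_MULTIPLIER.get(difficulty, 0.1)


# Hypothetical tasks: a large but hard-to-automate chore vs. a smaller easy one.
hard_task = roi_score(4, 30, "hard")  # 104 h/year * 0.25
easy_task = roi_score(3, 20, "easy")  # 52 h/year * 1.0
print(f"hard: {hard_task:.1f}, easy: {easy_task:.1f}")
```

In this example the easy task scores higher even though it consumes half the annual hours, which is the intended effect of the weighting: cheap wins are pulled to the top of the backlog.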
Adoption & trust: 2.8k installs on skills.sh; 9.7k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
Production chores eat your week and you cannot tell which repetitive tasks are worth automating first.
Who is it for?
Indie SaaS founders handling deploys, DB touch-ups, and on-call interruptions themselves.
Skip if: Greenfield ideas with no users yet and zero production toil to measure.
When should I use this skill?
Operational work feels repetitive, scales with users, or you are planning an automation sprint.
What do I get? / Deliverables
You produce a quantified toil inventory ranked by annual hours saved versus automation difficulty so agents can implement the highest-ROI fixes next.
- Prioritized toil inventory
- ROI-ranked automation candidates
- Time-saved estimates (weekly and annual hours)
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Operate is where toil and reliability debt show up daily; this skill catalogs and prioritizes automation from that lens. Infra subphase covers runbooks, repetitive ops work, and scaling pain that SRE practices target first.
Where it fits
List weekly manual database interventions and rank them by annual_hours × the automation difficulty multiplier (1.0 for easy work).
Pick the top ROI toil item and have your agent draft an idempotent automation script.
Design admin APIs so future interrupt-driven toil categories shrink before you ship.
Separate customer-driven reactive toil from engineering automation candidates.
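As a rough illustration of what an idempotent automation script can look like, the sketch below automates a log cleanup task similar to the "Log file cleanup" item in the skill's sample inventory. The function name `remove_old_logs`, the `*.log` pattern, and the `dry_run` flag are assumptions made for this example and are not part of the skill itself.

```python
import time
from pathlib import Path


def remove_old_logs(log_dir: Path, max_age_days: int, dry_run: bool = True) -> list[Path]:
    """Delete *.log files older than max_age_days and return the files affected.

    Idempotent: a second run with the same arguments finds nothing left to
    delete and changes nothing.
    """
    cutoff = time.time() - max_age_days * 86400
    stale = sorted(
        p for p in log_dir.glob("*.log")
        if p.is_file() and p.stat().st_mtime < cutoff
    )
    if not dry_run:
        for path in stale:
            # missing_ok tolerates a file that vanished between listing and unlinking.
            path.unlink(missing_ok=True)
    return stale
```

Defaulting to `dry_run=True` lets you inspect the list of files the script would delete before allowing it to modify anything, which is a sensible habit for any script that will eventually run unattended.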
How it compares
Focuses on toil elimination and automation ROI, not generic uptime slogans or a single monitoring vendor integration.
Common Questions / FAQ
Who is sre-engineer for?
Solo builders and small teams using AI coding agents who operate their own APIs or SaaS and need SRE-style prioritization without hiring SREs.
When should I use sre-engineer?
In Operate when cataloging manual runbook steps; in Grow when support load scales; in Build when designing self-healing deploy paths before launch.
Is sre-engineer safe to install?
Content is guidance and sample Python structures. Check the Security Audits panel on this page and review any automation scripts before they touch production credentials.
SKILL.md
# Automation and Toil Reduction

## Toil Definition

Toil is manual, repetitive, automatable work that scales linearly with service growth.

```python
from dataclasses import dataclass
from enum import Enum


class ToilCategory(Enum):
    """Categories of operational toil."""
    MANUAL_INTERVENTION = "manual"
    REPETITIVE_TASKS = "repetitive"
    NO_ENDURING_VALUE = "no_value"
    SCALES_WITH_SERVICE = "scales"
    INTERRUPT_DRIVEN = "reactive"


@dataclass
class ToilItem:
    """Track a specific toil activity."""
    name: str
    frequency_per_week: int
    minutes_per_occurrence: int
    category: ToilCategory
    automation_difficulty: str  # 'easy', 'medium', 'hard'

    @property
    def weekly_hours(self) -> float:
        """Calculate weekly hours spent on this toil."""
        return (self.frequency_per_week * self.minutes_per_occurrence) / 60

    @property
    def annual_hours(self) -> float:
        """Calculate annual hours spent on this toil."""
        return self.weekly_hours * 52

    def roi_score(self) -> float:
        """Calculate ROI score for automation (higher = better).

        Score considers time saved vs. difficulty.
        """
        difficulty_multiplier = {
            'easy': 1.0,
            'medium': 0.5,
            'hard': 0.25,
        }
        return self.annual_hours * difficulty_multiplier.get(
            self.automation_difficulty, 0.1
        )


# Example toil inventory
toil_items = [
    ToilItem(
        name="Manual database failover",
        frequency_per_week=2,
        minutes_per_occurrence=30,
        category=ToilCategory.MANUAL_INTERVENTION,
        automation_difficulty='medium',
    ),
    ToilItem(
        name="Restarting hung processes",
        frequency_per_week=5,
        minutes_per_occurrence=15,
        category=ToilCategory.REPETITIVE_TASKS,
        automation_difficulty='easy',
    ),
    ToilItem(
        name="Log file cleanup",
        frequency_per_week=7,
        minutes_per_occurrence=10,
        category=ToilCategory.SCALES_WITH_SERVICE,
        automation_difficulty='easy',
    ),
]

# Calculate total toil and prioritize automation
total_weekly_hours = sum(item.weekly_hours for item in toil_items)
print(f"Total weekly toil: {total_weekly_hours:.1f} hours")

# Sort by ROI score to prioritize automation
sorted_items = sorted(toil_items, key=lambda x: x.roi_score(), reverse=True)
for item in sorted_items:
    print(f"{item.name}: {item.roi_score():.1f} ROI score")
```

## Self-Healing Systems

Automate common failure remediation.

```python
# auto_healing.py - Self-healing automation examples
import subprocess
import logging
from typing import Callable, Dict
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class HealthCheck:
    """Define a health check and remediation."""
    name: str
    check: Callable[[], bool]
    remediate: Callable[[], bool]
    max_retries: int = 3


class SelfHealer:
    """Automatically remediate common failures."""

    def __init__(self):
        self.health_checks: Dict[str, HealthCheck] = {}

    def register(self, check: HealthCheck):
        """Register a health check with remediation."""
        self.health_checks[check.name] = check

    def run(self):
        """Run all health checks and remediate failures."""
        for name, check in self.health_checks.items():
            if not check.check():
                logger.warning(f"Health check failed: {name}")
                self._remediate(check)

    def _remediate(self, check: HealthCheck):
        """Attempt remediation with retries."""
        for attempt in range(check.max_retries):
            logger.info(f"Remediation attempt {attempt + 1}/{check.max_retries}")
            if check.remediate():
                logger.info(f"Remediation successful: {check.name}")
                return
            if check.check():
                logger.info(f"Health check passed after remediation: {check.name}")
                return