
Dora Metrics
Compare DORA-style change failure rate and lead time for AI-authored versus human work so agent velocity does not hide quality regressions.
Overview
dora-metrics is an agent skill most often used in Grow (also Operate monitoring, Ship launch) that maps DORA signals to AI-assisted workflow failure modes and remediation levers.
Install
npx skills add https://github.com/athola/claude-night-market --skill dora-metrics
What is this skill?
- Run minister.dora_metrics with --window 30 and distinct --failure-label values (e.g. bug vs ai-bug) and compare JSON exp
- Treat an AI CFR more than five percentage points above human CFR as a signal that review is too lenient, not as grounds for banning AI assistance
- Compare lead time for 30 days before versus after agent adoption to spot velocity-for-stability tradeoffs
- Watch time-to-restore when agents ship hotfixes—incomplete RCA can inflate TRS once truth surfaces
- Pair rising deployment frequency with CFR so arbitrarily high DF from agents does not look healthy alone
- Example commands use a 30-day --window with JSON output via python3 -m minister.dora_metrics
- AI versus human CFR gap of more than five percentage points is called out as a review-leniency signal
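The five-percentage-point rule above can be sketched in a few lines. This is a minimal illustration that works on raw deploy and failure counts. It does not read the JSON written by `minister.dora_metrics`, whose schema is not shown here, and the function names and sample counts are hypothetical.

```python
def change_failure_rate(failed_deploys: int, total_deploys: int) -> float:
    """Return CFR as a percentage; 0.0 when there were no deployments."""
    if total_deploys == 0:
        return 0.0
    return 100.0 * failed_deploys / total_deploys


def review_leniency_signal(ai_cfr: float, human_cfr: float,
                           threshold_points: float = 5.0) -> bool:
    """True when AI CFR exceeds human CFR by more than the threshold.

    The comparison is in percentage points (a plain difference),
    not a relative percentage change.
    """
    return (ai_cfr - human_cfr) > threshold_points


# Hypothetical counts for one 30-day window, for illustration only.
ai_cfr = change_failure_rate(failed_deploys=6, total_deploys=30)
human_cfr = change_failure_rate(failed_deploys=5, total_deploys=40)
flag = review_leniency_signal(ai_cfr, human_cfr)

print(f"AI CFR {ai_cfr:.1f}%, human CFR {human_cfr:.1f}%, tighten review: {flag}")
```

A gap of exactly five points does not trip the flag, which matches the "more than five" wording.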
Adoption & trust: 1 install on skills.sh; 304 GitHub stars; 1/2 security scanners passed (skills.sh audits); trending (+100% hot-view momentum).
What problem does it solve?
Agent adoption sped up merges and deploys but you cannot tell if failures, restore time, or review quality got worse in the same window.
Who is it for?
Indie teams shipping with agents who already log deploys and failure labels and want minister-style JSON metrics over rolling windows.
Skip if: Solo builders with no deployment or incident labeling history who only need a single pre-launch checklist.
When should I use this skill?
Use it to measure delivery health after enabling or scaling agentic coding workflows, provided you have labeled failures or deploy history to query.
What do I get? / Deliverables
You run labeled DORA windows, interpret AI versus human CFR and lead-time tradeoffs, and add gates at friction points instead of guessing.
- JSON metric snapshots for all versus AI-labeled failure windows
- Interpretation notes on CFR, lead time, TRS, and DF tradeoffs with suggested friction gates
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Grow analytics is the canonical shelf because the skill turns deployment and incident labels into comparative metrics you act on over time. Analytics fits best: you slice windows, label failures, and read CFR and lead-time deltas—not a one-off security scan.
Where it fits
You export 30-day CFR with bug and ai-bug labels and tighten review when AI CFR exceeds human by more than five points.
After an agent-shipped hotfix, you inspect TRS and whether restore claims matched real recovery time.
Before a release week, you pair rising DF from agents with CFR so launch velocity does not outrun stability.
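The three scenarios above reduce to comparing two metric snapshots. The sketch below is one way to encode that comparison. The `Snapshot` fields and flag names are hypothetical and are not part of the minister package. The check also simplifies the skill's guidance: it flags DF and CFR rising together without testing whether they rose proportionally.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical container for one window's metrics (not the minister schema)."""
    deploys_per_day: float
    lead_time_hours: float
    cfr_percent: float
    restore_hours: float


def tradeoff_flags(before: Snapshot, after: Snapshot) -> List[str]:
    """Name the stability tradeoffs visible between two windows."""
    flags = []
    lead_time_improved = after.lead_time_hours < before.lead_time_hours
    cfr_regressed = after.cfr_percent > before.cfr_percent
    restore_regressed = after.restore_hours > before.restore_hours

    # Faster lead time paid for with more failures or slower recovery.
    if lead_time_improved and (cfr_regressed or restore_regressed):
        flags.append("velocity-for-stability")
    # More deploys alongside more failures suggests churn rather than progress.
    if after.deploys_per_day > before.deploys_per_day and cfr_regressed:
        flags.append("df-up-cfr-up")
    return flags


# Hypothetical numbers for the 30 days before and after agent adoption.
before = Snapshot(deploys_per_day=1.0, lead_time_hours=48.0,
                  cfr_percent=10.0, restore_hours=2.0)
after = Snapshot(deploys_per_day=3.0, lead_time_hours=12.0,
                 cfr_percent=18.0, restore_hours=5.0)
flags = tradeoff_flags(before, after)
print(flags)
```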
How it compares
Interpretation layer for DORA-style metrics in agent pipelines—not a dashboard product or generic analytics MCP by itself.
Common Questions / FAQ
Who is dora-metrics for?
Solo builders and small teams measuring delivery health while Claude Code or similar agents author a growing share of changes and hotfixes.
When should I use dora-metrics?
Use it in Grow analytics when comparing AI-labeled bugs to human CFR, in Operate monitoring after restore incidents, or in Ship launch prep when deployment frequency spikes with agents.
Is dora-metrics safe to install?
The skill describes running local minister metrics commands; review the Security Audits panel on this Prism page before installing skills from the Night Market repo.
SKILL.md - Dora Metrics
# Agentic Workflow Signals from DORA

DORA metrics were designed for human-driven engineering teams, but the same four numbers expose specific failure modes in AI-assisted pipelines.

## What to Watch

### Change Failure Rate, AI vs human

Run the metric twice with different `--failure-label` values:

```bash
python3 -m minister.dora_metrics --window 30 --failure-label bug --json > all.json
python3 -m minister.dora_metrics --window 30 --failure-label ai-bug --json > ai.json
```

If AI-authored CFR exceeds human-authored CFR by more than five percentage points, treat it as a signal that review is too lenient on AI output, not that AI is unsafe in general. The right response is usually adding a hookify rule or imbue gate at the friction point, not banning AI assistance.

### Lead Time, before vs after agent adoption

Compute lead time for the 30 days before and after enabling an agentic workflow. If LT improved but CFR or TRS regressed, the team is trading stability for velocity. The bottleneck dimension surfaced by the skill points at which trade was made.

### Time to Restore, agent-driven hotfixes

If TRS got worse after agents started shipping hotfixes, suspect incomplete root-cause analysis. The Replit incident is a case study: fast restore claims that turn out to be fabricated extend TRS once the truth surfaces.

### Deployment Frequency, ceiling check

Agents can push DF arbitrarily high. Pair DF with CFR; if DF rose and CFR rose proportionally, the agent is generating noise rather than signal. A high-DF, high-CFR team produces churn.

## Producing a Comparison Report

Combine two windows side-by-side:

```python
from minister.dora_metrics import compute_metrics

# ... collect events for each cohort ...
human = compute_metrics(human_deploys, human_failures, window_days=30)
agent = compute_metrics(agent_deploys, agent_failures, window_days=30)

print("Human:", human.tier())
print("Agent:", agent.tier())
print("Human bottleneck:", human.bottleneck())
print("Agent bottleneck:", agent.bottleneck())
```

If the bottleneck differs across cohorts, that is the most useful single output: it tells the engineering manager which guardrail is missing for which population.

## Anti-Patterns

- Reporting only DF as proof of agent ROI without CFR.
- Excluding agent-authored failures from the failure label.
- Comparing against last quarter when agent adoption mid-window invalidates the comparison.

# DORA Tier Thresholds

Source: DORA's State of DevOps research. The thresholds below match the published bands; minor adjustments per release year are common but the band shape is stable.

## Deployment Frequency (DF)

How often code is deployed to production. Higher is better.

| Tier | Threshold |
|------|-----------|
| Elite | At least once per day |
| High | Between once per week and once per day |
| Medium | Between once per month and once per week |
| Low | Less often than once per month |

## Lead Time for Changes (LT)

Median time from commit to production. Lower is better.

| Tier | Threshold |
|------|-----------|
| Elite | Less than one day |
| High | One day to one week |
| Medium | One week to one month |
| Low | More than one month |

## Change Failure Rate (CFR)

Percentage of deployments that cause a production failure. Lower is better.

| Tier | Threshold |
|------|-----------|
| Elite | At most 15% |
| High | 16-30% |
| Medium | 31-45% |
| Low | More than 45% |

## Time to Restore Service (TRS)

Median time to recover from a production failure. Lower is better.

| Tier | Threshold |
|------|-----------|
| Elite | Less than one hour |
| High | Less than one day |
| Medium | Less than one week |
| Low | One week or more |

## Boundary Behavior

The implementation places the boundary value in the better tier:

- DF exactly 1.0/day classifies as Elite, not High.
- LT exactly 24 hours classifies as Elite, not High.
- CFR exactly 15% classifies as Elite, not High.
- TRS exactly 1 hour classifies as High, not Elite (TRS uses strict