Dora Metrics

Grow analytics is the primary shelf for this skill: it turns deployment and incident labels into comparative metrics that you act on over time. You slice time windows, label failures, and read CFR and lead-time deltas, rather than running a one-off security scan.

Where it fits

Example use

You export 30-day CFR with the bug and ai-bug labels and tighten review when AI CFR exceeds human CFR by more than five percentage points.

Example use

After an agent-shipped hotfix, you inspect TRS and whether restore claims matched real recovery time.

Example use

Before a release week, you pair rising DF from agents with CFR so launch velocity does not outrun stability.

How it compares

This skill is an interpretation layer for DORA-style metrics in agent pipelines. On its own it is not a dashboard product or a generic analytics MCP.

Common Questions / FAQ

Who is dora-metrics for?

Solo builders and small teams measuring delivery health while Claude Code or similar agents author a growing share of changes and hotfixes.

When should I use dora-metrics?

Use it in Grow analytics when comparing AI-labeled bugs to human CFR, in Operate monitoring after restore incidents, or in Ship launch prep when deployment frequency spikes with agents.

Is dora-metrics safe to install?

The skill describes running local minister metrics commands; review the Security Audits panel on this Prism page before installing skills from the Night Market repo.

SKILL.md - Dora Metrics

# Agentic Workflow Signals from DORA

DORA metrics were designed for human-driven engineering teams, but
the same four numbers expose specific failure modes in
AI-assisted pipelines.

## What to Watch

### Change Failure Rate, AI vs human

Run the metric twice with different `--failure-label` values:

```bash
python3 -m minister.dora_metrics --window 30 --failure-label bug --json > all.json
python3 -m minister.dora_metrics --window 30 --failure-label ai-bug --json > ai.json
```

If AI-authored CFR exceeds human-authored CFR by more than five
percentage points, treat it as a signal that review is too lenient
on AI output, not that AI is unsafe in general. The right response
is usually adding a hookify rule or imbue gate at the friction
point, not banning AI assistance.
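
As a minimal sketch of the five-point check, assume you have already
extracted deployment and failure counts for each cohort. The JSON
schema written by `minister.dora_metrics` is not shown here, so the
function names and inputs below are hypothetical and not part of the
skill:

```python
def cfr_percent(failed_deploys: int, total_deploys: int) -> float:
    """Change failure rate as a percentage of deployments."""
    if total_deploys <= 0:
        raise ValueError("total_deploys must be positive")
    return 100.0 * failed_deploys / total_deploys


def review_too_lenient(ai_cfr: float, human_cfr: float,
                       threshold_points: float = 5.0) -> bool:
    """True when AI CFR exceeds human CFR by more than threshold_points."""
    return (ai_cfr - human_cfr) > threshold_points


# Illustrative counts only, not real measurements.
ai_cfr = cfr_percent(failed_deploys=6, total_deploys=30)
human_cfr = cfr_percent(failed_deploys=5, total_deploys=50)
print(review_too_lenient(ai_cfr, human_cfr))
```

The comparison is strict, so a gap of exactly five percentage points
does not trigger the signal.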

### Lead Time, before vs after agent adoption

Compute lead time for the 30 days before and after enabling an
agentic workflow. If LT improved but CFR or TRS regressed, the team
is trading stability for velocity. The bottleneck dimension surfaced
by the skill points at which trade was made.
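
The before/after comparison can be sketched as follows. The
`Snapshot` class and `stability_regressions` helper are hypothetical
names introduced for illustration, and the numbers are made up:

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One 30-day window; lower is better for all three fields."""
    lead_time_hours: float
    cfr_percent: float
    trs_hours: float


def stability_regressions(before: Snapshot, after: Snapshot) -> list[str]:
    """Return the stability metrics that regressed while lead time improved."""
    if after.lead_time_hours >= before.lead_time_hours:
        return []  # lead time did not improve, so no velocity-for-stability trade
    regressed = []
    if after.cfr_percent > before.cfr_percent:
        regressed.append("CFR")
    if after.trs_hours > before.trs_hours:
        regressed.append("TRS")
    return regressed


# Illustrative values only.
before = Snapshot(lead_time_hours=48.0, cfr_percent=12.0, trs_hours=3.0)
after = Snapshot(lead_time_hours=20.0, cfr_percent=19.0, trs_hours=2.5)
print(stability_regressions(before, after))
```

A non-empty result means lead time improved at the cost of the
listed metrics.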

### Time to Restore, agent-driven hotfixes

If TRS got worse after agents started shipping hotfixes, suspect
incomplete root-cause analysis. The Replit incident is a case study:
fast restore claims that turn out to be fabricated extend TRS once
the truth surfaces.
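
One way to see the effect is to compute median TRS twice, once from
the restore times that were claimed and once from the times that
were later verified. This is a sketch with invented timestamps; the
helper name and the claimed/verified split are assumptions, not
something the skill provides:

```python
from datetime import datetime
from statistics import median


def median_trs_hours(incidents: list[tuple[datetime, datetime]]) -> float:
    """Median hours between failure start and restore across incidents."""
    durations = [(restored - failed).total_seconds() / 3600
                 for failed, restored in incidents]
    return median(durations)


# (failure start, restore time) pairs; illustrative data only.
claimed = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 30)),
    (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 15, 0)),
]
# Same incidents, but the second one was not actually fixed until 15:00.
verified = [
    claimed[0],
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 15, 0)),
    claimed[2],
]

print("claimed TRS (h):", median_trs_hours(claimed))
print("verified TRS (h):", median_trs_hours(verified))
```

With this data the median doubles once the second incident is timed
from its verified restore.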

### Deployment Frequency, ceiling check

Agents can push DF arbitrarily high. Pair DF with CFR; if DF rose
and CFR rose proportionally, the agent is generating noise rather
than signal. A high-DF, high-CFR team produces churn.
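
A rough sketch of the ceiling check is below. Reading
"proportionally" as "CFR grew by at least the same fraction as DF"
is an assumption made for this example, and both function names are
hypothetical:

```python
def relative_change(before: float, after: float) -> float:
    """Fractional change from before to after (0.5 means +50%)."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (after - before) / before


def looks_like_churn(df_before: float, df_after: float,
                     cfr_before: float, cfr_after: float) -> bool:
    """True when DF rose and CFR rose at least as fast (assumed rule)."""
    df_growth = relative_change(df_before, df_after)
    cfr_growth = relative_change(cfr_before, cfr_after)
    return df_growth > 0 and cfr_growth >= df_growth


# Illustrative values: DF in deploys/day, CFR in percent.
print(looks_like_churn(2.0, 4.0, 10.0, 22.0))  # DF +100%, CFR +120%
print(looks_like_churn(2.0, 4.0, 10.0, 11.0))  # DF +100%, CFR +10%
```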

## Producing a Comparison Report

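Compute metrics for the two cohorts over the same window and print
them side by side:

```python
from minister.dora_metrics import compute_metrics

# ... collect events for each cohort ...
# human_deploys, human_failures, agent_deploys, and agent_failures are
# the per-cohort event collections gathered in the step above.
human = compute_metrics(human_deploys, human_failures, window_days=30)
agent = compute_metrics(agent_deploys, agent_failures, window_days=30)

# Overall tier per cohort.
print("Human:", human.tier())
print("Agent:", agent.tier())

# The limiting metric per cohort.
print("Human bottleneck:", human.bottleneck())
print("Agent bottleneck:", agent.bottleneck())
```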

If the bottleneck differs across cohorts, that is the most
useful single output: it tells the engineering manager which
guardrail is missing for which population.

## Anti-Patterns

- Reporting only DF as proof of agent ROI without CFR.
- Excluding agent-authored failures from the failure label.
- Comparing against last quarter when agent adoption began
  mid-window, which invalidates the comparison.
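
The last anti-pattern can be guarded against with a small date
check. This is a sketch only; `window_is_clean` is a hypothetical
helper and the dates are invented:

```python
from datetime import date, timedelta


def window_is_clean(window_end: date, window_days: int, adoption: date) -> bool:
    """True when the adoption date does not fall strictly inside the window."""
    window_start = window_end - timedelta(days=window_days)
    return not (window_start < adoption < window_end)


# Illustrative: a 90-day baseline window ending 2024-03-31.
print(window_is_clean(date(2024, 3, 31), 90, adoption=date(2024, 2, 15)))
print(window_is_clean(date(2024, 3, 31), 90, adoption=date(2024, 4, 10)))
```

A window that fails the check mixes pre-adoption and post-adoption
deployments and should not be used as a baseline.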


# DORA Tier Thresholds

Source: DORA's State of DevOps research. The thresholds below match
the published bands; minor adjustments per release year are common
but the band shape is stable.

## Deployment Frequency (DF)

How often code is deployed to production. Higher is better.

| Tier | Threshold |
|------|-----------|
| Elite | At least once per day |
| High | Between once per week and once per day |
| Medium | Between once per month and once per week |
| Low | Less often than once per month |

## Lead Time for Changes (LT)

Median time from commit to production. Lower is better.

| Tier | Threshold |
|------|-----------|
| Elite | Less than one day |
| High | One day to one week |
| Medium | One week to one month |
| Low | More than one month |

## Change Failure Rate (CFR)

Percentage of deployments that cause a production failure. Lower is
better.

| Tier | Threshold |
|------|-----------|
| Elite | At most 15% |
| High | 16-30% |
| Medium | 31-45% |
| Low | More than 45% |

## Time to Restore Service (TRS)

Median time to recover from a production failure. Lower is better.

| Tier | Threshold |
|------|-----------|
| Elite | Less than one hour |
| High | Less than one day |
| Medium | Less than one week |
| Low | One week or more |

## Boundary Behavior

The implementation places the boundary value in the better tier for
DF, LT, and CFR; TRS is the exception:

- DF exactly 1.0/day classifies as Elite, not High.
- LT exactly 24 hours classifies as Elite, not High.
- CFR exactly 15% classifies as Elite, not High.
- TRS exactly 1 hour classifies as High, not Elite (TRS uses a strict
  less-than comparison, matching the "Less than one hour" band).
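
The threshold tables and boundary rules above can be sketched as
four small classifiers. This is not the skill's own code: the
function names are hypothetical, a month is assumed to be 30 days,
and the inclusive handling of the lower DF, LT, and CFR boundaries
is an assumption extended from the Elite cases listed above.

```python
def df_tier(deploys_per_day: float) -> str:
    """Deployment frequency; higher is better, boundaries inclusive."""
    if deploys_per_day >= 1.0:
        return "Elite"
    if deploys_per_day >= 1 / 7:    # at least once per week
        return "High"
    if deploys_per_day >= 1 / 30:   # at least once per month (assumed 30 days)
        return "Medium"
    return "Low"


def lt_tier(hours: float) -> str:
    """Lead time; lower is better, boundaries inclusive."""
    if hours <= 24:
        return "Elite"
    if hours <= 24 * 7:
        return "High"
    if hours <= 24 * 30:
        return "Medium"
    return "Low"


def cfr_tier(percent: float) -> str:
    """Change failure rate; lower is better, boundaries inclusive."""
    if percent <= 15:
        return "Elite"
    if percent <= 30:
        return "High"
    if percent <= 45:
        return "Medium"
    return "Low"


def trs_tier(hours: float) -> str:
    """Time to restore; lower is better, strict less-than comparisons."""
    if hours < 1:
        return "Elite"
    if hours < 24:
        return "High"
    if hours < 24 * 7:
        return "Medium"
    return "Low"


# Boundary values from the list above.
print(df_tier(1.0), lt_tier(24), cfr_tier(15), trs_tier(1))
```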

What is this skill?

Run minister.dora_metrics with --window 30 and distinct --failure-label values (e.g. bug vs ai-bug) and compare the JSON output

Treat AI CFR more than five percentage points above human CFR as lenient review signal, not a ban on AI assistance

Compare lead time for 30 days before versus after agent adoption to spot velocity-for-stability tradeoffs

Watch time-to-restore when agents ship hotfixes—incomplete RCA can inflate TRS once truth surfaces

Pair rising deployment frequency with CFR so arbitrarily high DF from agents does not look healthy alone

Example commands use a 30-day --window with JSON output via python3 -m minister.dora_metrics

AI versus human CFR gap of more than five percentage points is called out as a review-leniency signal

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Adoption & trust: 1 install on skills.sh; 304 GitHub stars; 1/2 security scanners passed (skills.sh audits); trending (+100% hot-view momentum).

Journey fit

Spans multiple journey phases: Grow analytics is the primary shelf, with Operate monitoring and Ship launch prep as alternate fits.
