
Quality Playbook
Run an autonomous Quality Playbook calibration cycle with spawned benchmark subprocesses, observable run state, and lever updates across sessions.
Overview
Quality Playbook is an agent skill most often used in Ship (also Operate iterate) that orchestrates autonomous QPB calibration cycles with resumable run-state logging and per-benchmark playbook spawns.
Install
npx skills add https://github.com/github/awesome-copilot --skill quality-playbook
What is this skill?
- Orchestrator executes Steps 1–12 from ai_context/CALIBRATION_PROTOCOL.md (Mode 1 autonomous)
- Spawn-and-resume model across multiple AI sessions using run_state.jsonl (v1.5.6)
- Background playbook subprocess per benchmark with benchmark_start instrumentation
- Writes cycle audit and Lever Calibration Log entry on completion
- Designed for bash plus file tools; works beyond Claude Code with resumable cycle directories
- 12-step calibration protocol (Steps 1–12)
- v1.5.6 spawn-and-resume cluster instrumentation
- Cycle-level events schema in references/run_state_schema.md
Adoption & trust: 1.2k installs on skills.sh; 34.6k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You have playbook quality rules but no resumable, instrumented process to calibrate them across many benchmarks and AI sessions.
Who is it for?
Maintainers running multi-benchmark QPB calibration with nohup subprocesses and council/lever steps after evidence gathering.
Skip if: you only need a single-pass code review or unit tests and do not maintain a calibration protocol repo and cycle directory.
When should I use this skill?
You are driving an end-to-end QPB calibration cycle per CALIBRATION_PROTOCOL Mode 1 with bash and file tools available.
What do I get? / Deliverables
You complete a documented calibration cycle with audit artifacts and lever log updates, resuming from run_state.jsonl after each session ends.
- Cycle audit document
- Lever Calibration Log entry
- Append-only run_state.jsonl event trail
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
End-to-end QPB calibration is a ship-time quality gate before you trust agent playbooks in production-like benchmarks. The template drives CALIBRATION_PROTOCOL Mode 1 autonomous testing cycles with benchmarks—not feature coding—so testing is the canonical shelf.
Where it fits
Kick off Step 2 with nohup playbook runs for each benchmark and log benchmark_start before the orchestrator session ends.
Finalize a benchmark completion, aggregate results into the cycle audit, and prepare Council review inputs.
Re-attach to the same cycle directory after a timeout and continue from the last run_state.jsonl event through lever application.
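The kick-off and re-attach steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the skill's actual implementation: the event field names (`event`, `benchmark`, `pid`, `ts`), the per-benchmark log file name, and the helper names are hypothetical, since the real schema lives in references/run_state_schema.md and is not reproduced on this page. `start_new_session=True` is used here as a rough POSIX stand-in for `nohup`.

```python
import json
import subprocess
import time
from pathlib import Path
from typing import List, Optional


def append_event(cycle_dir: Path, event: dict) -> None:
    """Append one JSON object per line; earlier lines are never rewritten."""
    with open(cycle_dir / "run_state.jsonl", "a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")


def spawn_benchmark(cycle_dir: Path, benchmark: str, command: List[str]) -> int:
    """Start `command` in the background and log a benchmark_start event."""
    log_path = cycle_dir / f"{benchmark}.log"  # hypothetical log location
    with open(log_path, "ab") as log:
        proc = subprocess.Popen(
            command,
            stdout=log,
            stderr=subprocess.STDOUT,
            stdin=subprocess.DEVNULL,
            start_new_session=True,  # POSIX only: detach from this session
        )
    # Field names below are assumptions, not the documented schema.
    append_event(cycle_dir, {
        "event": "benchmark_start",
        "benchmark": benchmark,
        "pid": proc.pid,
        "ts": time.time(),
    })
    return proc.pid  # control returns immediately; the child keeps running


def last_event(cycle_dir: Path) -> Optional[dict]:
    """Return the most recent event, or None if the cycle has not started."""
    path = cycle_dir / "run_state.jsonl"
    if not path.exists():
        return None
    lines = [ln for ln in path.read_text(encoding="utf-8").splitlines() if ln.strip()]
    return json.loads(lines[-1]) if lines else None
```

A later session would call `last_event(cycle_dir)` first and decide what to do from the returned event. The command passed to `spawn_benchmark` would be the playbook invocation (the page mentions `python3 -m bin.run_playbook`); its arguments are not shown here, so none are assumed.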
How it compares
Autonomous calibration orchestrator tied to CALIBRATION_PROTOCOL—not a generic test runner skill or linter integration.
Common Questions / FAQ
Who is quality-playbook for?
Solo operators and agent-quality maintainers who own a QPB repo with CALIBRATION_PROTOCOL.md and need session-resumable calibration cycles.
When should I use quality-playbook?
In Ship testing when validating playbook changes against benchmarks, and in Operate iterate when re-running calibration after lever adjustments or failed cycles.
Is quality-playbook safe to install?
It expects bash, background jobs, and filesystem writes in your calibration repo—review the Security Audits panel on this page before granting agent shell access.
SKILL.md
# Calibration Orchestrator: autonomous cycle prompt template (v1.5.6)

*Prompt template for the AI session driving an end-to-end QPB calibration cycle. The orchestrator AI executes Steps 1-12 from `ai_context/CALIBRATION_PROTOCOL.md`, spawns playbook subprocesses per benchmark, and writes the cycle audit + Lever Calibration Log entry. Designed for Claude Code sessions but will work in any tool with bash + file tools.*

*This prompt builds on `ai_context/CALIBRATION_PROTOCOL.md` Mode 1 (autonomous). The protocol is the canonical operational guide; this template wires it into v1.5.6's run-state instrumentation so the cycle is fully observable, resumable, and recoverable.*

*Schema for cycle-level events: `references/run_state_schema.md`.*

*Session model: **spawn-and-resume across multiple orchestrator sessions** (v1.5.6 cluster F.1 finding from the 2026-05-02 Pattern 7 cycle). The orchestrator role spans many discrete AI sessions that re-attach to the same cycle directory and resume from `run_state.jsonl`; each session typically drives one cycle step (kick off a benchmark, finalize a benchmark on completion, apply the lever, run Council, etc.) and exits. A long-lived single-session orchestrator was attempted in early prototyping and did not survive realistic AI session lifetimes (timeouts, network drops, operator-ended sessions across the ~4 hours an 8-benchmark cycle takes). The Step 2 spawn pattern below (`nohup` the playbook in the background, append a `benchmark_start` event with the PID, return control) IS the load-bearing recovery mechanism, not an exception case.*

*Compare with `ai_context/AI_ORCHESTRATION_PATTERNS.md`. That document describes a **multi-session orchestrator/worker** pattern where a chat-driving AI controls a separate coding AI via files in a shared directory. This template applies the same multi-session discipline at a different layer: the orchestrator AI sessions (any number across the cycle's lifetime) coordinate the playbook subprocess lifecycle, while the playbook itself is the worker. Use this template when the work to coordinate is a calibration cycle (a fixed Steps 1-12 workflow); use the broader orchestrator/worker pattern when chat-side planning and coding-side execution need to be coordinated outside a calibration cycle.*

---

## Role

You are the **calibration orchestrator** for a Quality Playbook calibration cycle. Your job is to run a complete cycle from `cycle_start` to `cycle_end` without operator intervention beyond the initial kickoff.

You are NOT the playbook AI. You spawn playbook AI sessions (via `python3 -m bin.run_playbook` subprocesses or via sub-agent invocations) to run individual benchmarks. You drive the cycle-level workflow above the playbook.

---

## Inputs (operator provides at kickoff)

The operator launches you with these inputs filled in:

- **`<cycle_name>`**: short kebab-case identifier. Format: `<YYYY-MM-DD>-<lever-or-test-shorthand>`. Example: `2026-05-15-pattern7-displacement-recovery`.
- **`<lever_id>`**: the lever from `ai_context/IMPROVEMENT_LOOP.md` you're calibrating. Example: `lever-1-exploration-breadth-depth`.
- **`<lever_change_description>`**: what you'll actually edit. Example: `"Pattern 7 budget cap 3-5 → 2-3 highest-impact composition seams per pass."`
- **`<benchmarks>`**: comma-separated benchmark list. Example: `chi-1.3.45,chi-1.5.1,virtio-1.5.1,express-1.3.50`.
- **`<hypothesis>`**: the testable claim. Example: `"Lowering Pattern 7's budget cap recovers PathRewrite + AllowContentEncoding without sacrificing mount-context wins."`
- **`<iteration>`**: iteration ordinal (1 for first attempt, 2 if re-running with a different sub-lever after a previous attempt's `iterate` verdict). Default: 1.
- **`<iterate_cap>`**: maximum iterations before halt. Default: 3.
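
As a rough illustration of how the inputs listed above might be checked at kickoff, here is a minimal Python sketch. It is not part of the template: `CycleInputs`, `parse_inputs`, and the regular expression are hypothetical names and choices made for this example, and the regex is only one reasonable reading of the `<YYYY-MM-DD>-<lever-or-test-shorthand>` kebab-case format. A missing required input raises instead of letting the cycle proceed.

```python
import re
from dataclasses import dataclass

# Assumed reading of the format: an ISO-style date, then lowercase kebab-case words.
CYCLE_NAME_RE = re.compile(r"^\d{4}-\d{2}-\d{2}-[a-z0-9]+(?:-[a-z0-9]+)*$")

REQUIRED = ("cycle_name", "lever_id", "lever_change_description",
            "benchmarks", "hypothesis")


@dataclass(frozen=True)
class CycleInputs:
    cycle_name: str
    lever_id: str
    lever_change_description: str
    benchmarks: list
    hypothesis: str
    iteration: int = 1    # documented default
    iterate_cap: int = 3  # documented default


def parse_inputs(raw: dict) -> CycleInputs:
    """Validate operator-supplied inputs; raise ValueError naming any gaps."""
    missing = [k for k in REQUIRED if not str(raw.get(k) or "").strip()]
    if missing:
        raise ValueError("missing inputs: " + ", ".join(missing))
    if not CYCLE_NAME_RE.match(raw["cycle_name"]):
        raise ValueError("cycle_name does not match <YYYY-MM-DD>-<shorthand>")
    # The benchmark list arrives as one comma-separated string.
    benchmarks = [b.strip() for b in raw["benchmarks"].split(",") if b.strip()]
    return CycleInputs(
        cycle_name=raw["cycle_name"],
        lever_id=raw["lever_id"],
        lever_change_description=raw["lever_change_description"],
        benchmarks=benchmarks,
        hypothesis=raw["hypothesis"],
        iteration=int(raw.get("iteration", 1)),
        iterate_cap=int(raw.get("iterate_cap", 3)),
    )
```

Collecting every missing name before raising lets the operator fix all gaps in one pass rather than one per relaunch.
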
If any input is missing, halt immediately and report the missing input to the operator.

---

## Cycle directory layout

Working directory: `~/Documents/AI-Driven Development/Quality