
Self Eval
Calibrate honest quality scores after any agent task using a fixed two-axis matrix, devil’s advocate reasoning, and JSONL history to catch score inflation across sessions.
Overview
Self-eval is a journey-wide agent skill that produces calibrated two-axis work scores with devil’s advocate reasoning—usable whenever a solo builder needs an honest quality check after completing a task before committing
Install
npx skills add https://github.com/alirezarezvani/claude-skills --skill self-evalWhat is this skill?
- Two-axis scoring: task ambition (Low/Medium/High) and execution (Poor/Adequate/Strong) via fixed lookup matrix
- Mandatory devil’s advocate arguments for both higher and lower scores before finalizing
- Appends scores to `.self-eval-scores.jsonl` for cross-session history
- Anti-inflation detection by reading past scores in the working directory
- Prompt-only skill with no external tool dependencies
- Two-axis scoring matrix combining ambition and execution
- Persists scores to `.self-eval-scores.jsonl`
Adoption & trust: 522 installs on skills.sh; 17.5k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
AI self-assessment collapses to ‘everything is a 4’ because one number mixes task difficulty with execution quality, so you cannot tell if work is actually shippable.
Who is it for?
Solo builders closing agent sessions who want structured, honest calibration without external eval infrastructure.
Skip if: Formal compliance sign-off, production SLO proof, or situations where only automated test and security gate results should decide ship/no-ship.
When should I use this skill?
Use after completing a task, code review, or work session when the user wants an unbiased quality assessment; detects score inflation using persisted history.
What do I get? / Deliverables
You get a matrix-derived score, documented higher/lower arguments, and an appended JSONL log entry you can compare across sessions to detect grade inflation.
- Final two-axis score with matrix-derived rating
- Devil’s advocate higher/lower arguments and resolution
- Appended JSONL score record for session history
Recommended Skills
Journey fit
Useful at every journey phase - explore requirements and options before committing to a direction.
Where it fits
Score ambition vs execution after implementing a UI slice before moving to the next feature.
Run self-eval after a code review pass to avoid default 4/5 inflation on mediocre fixes.
Honestly rate a throwaway prototype’s scope and polish before committing to full build.
Compare today’s incident fix scores against `.self-eval-scores.jsonl` history to spot drifting self-grades.
Evaluate a marketing asset draft session for effort level vs outcome before publishing.
How it compares
Use as a structured self-grade ritual instead of asking the model ‘how did we do?’ on a single 1–5 scale.
Common Questions / FAQ
Who is self-eval for?
Indie developers and agent users who want unbiased post-task assessments with persisted history, especially when working alone without a human reviewer.
When should I use self-eval?
After completing any task, code review, or work session—during build feature wrap-up, ship review, validate prototypes, operate iterations, or grow content drafts—whenever you need honest closure before moving on.
Is self-eval safe to install?
Review the Security Audits panel on this Prism page; the skill writes `.self-eval-scores.jsonl` in your working directory and is prompt-only, so confirm you are comfortable with that local log file in your repo.
SKILL.md
READMESKILL.md - Self Eval
# Self-Eval: Honest Work Evaluation ultrathink **Tier:** STANDARD **Category:** Engineering / Quality **Dependencies:** None (prompt-only, no external tools required) ## Description Self-eval is a Claude Code skill that produces honest, calibrated work evaluations. It replaces the default AI tendency to rate everything 4/5 with a structured two-axis scoring system, mandatory devil's advocate reasoning, and cross-session anti-inflation detection. The core insight: AI self-assessment converges to "everything is a 4" because a single-axis score conflates task difficulty with execution quality. Self-eval separates these axes, then combines them via a fixed matrix that the model cannot override. ## Features - **Two-axis scoring** — Independently rates task ambition (Low/Medium/High) and execution quality (Poor/Adequate/Strong), then combines via a lookup matrix - **Mandatory devil's advocate** — Before finalizing, must argue for both higher AND lower scores, then resolve the tension - **Score persistence** — Appends scores to `.self-eval-scores.jsonl` in the working directory, building history across sessions - **Anti-inflation detection** — Reads past scores and flags clustering (4+ of last 5 identical) - **Matrix-locked scoring** — The composite score comes from the matrix, not from direct selection. Low ambition caps at 2/5 regardless of execution quality ## Usage After completing work in a Claude Code session: ``` /self-eval ``` With context about what to evaluate: ``` /self-eval the authentication refactor we just completed ``` The skill reviews the conversation history, identifies what was accomplished, scores it on both axes, runs the devil's advocate check, and appends the result to the score history file. ## Examples ### Example 1: Feature Implementation ``` /self-eval added pagination to the user list API ``` Output: ``` ## Self-Evaluation **Task:** Added cursor-based pagination to the user list API endpoint **Ambition:** Medium — New feature with established patterns, but required careful backwards compatibility **Execution:** Strong — All tests pass, handles edge cases, existing clients unaffected **Devil's Advocate:** - Lower: Pagination is well-documented. The pattern was copied from another endpoint with minor changes. Could argue Low ambition. - Higher: The backwards-compat requirement made this harder than a greenfield implementation. Had to support both offset and cursor modes. - Resolution: Medium ambition holds — the backwards-compat constraint added real complexity beyond copy-paste. **Score: 4** — Solid execution on meaningful work with real constraints. ``` ### Example 2: Bug Fix ``` /self-eval ``` Output: ``` ## Self-Evaluation **Task:** Fixed race condition in the checkout flow causing duplicate orders **Ambition:** Medium — Required debugging across 3 services to find the root cause **Execution:** Adequate — Fix works but the test only covers the exact scenario reported, not the general class of race conditions **Devil's Advocate:** - Lower: The fix was a single line (adding a mutex). Once found, it was trivial. - Higher: Finding the root cause across 3 services took real investigation. The mutex placement required understanding the full flow. - Resolution: Medium ambition is correct for the investigation, but execution drops to Adequate — a more thorough fix would address the pattern, not just the instance. **Score: 3** — Good debugging work but the fix is narrow. ``` --- ## What to Evaluate $ARGUMENTS If no arguments provided, review the full conversation history to identify what was accomplished this session. Summarize the work in one sentence before scoring. ## How to