Self Eval

Name: Self Eval
Author: alirezarezvani

alirezarezvani/claude-skills

556 installs
23.5k repo stars
Updated July 17, 2026

self-eval is a Claude skill that calibrates honest quality scores after any agent task using a fixed two-axis matrix, devil's advocate reasoning, and JSONL history to catch score inflation across sessions.

About

self-eval is a Claude skill from alirezarezvani/claude-skills that post-processes agent work with structured quality calibration instead of accepting optimistic self-ratings. It applies a fixed two-axis evaluation matrix, devil's advocate reasoning, and persistent JSONL history so scores stay honest across multiple sessions. Developers running Claude Code, Cursor, or Codex agents reach for self-eval after codegen, refactoring, or test-writing tasks when they need an objective quality gate that tracks whether the agent is systematically overrating its own output. The skill is especially useful on long-running agent workflows where score inflation would otherwise hide regressions between tasks.

Two-axis scoring: task ambition (Low/Medium/High) and execution (Poor/Adequate/Strong) via fixed lookup matrix
Mandatory devil’s advocate arguments for both higher and lower scores before finalizing
Appends scores to `.self-eval-scores.jsonl` for cross-session history
Anti-inflation detection by reading past scores in the working directory
Prompt-only skill with no external tool dependencies
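The history mechanism in the list above can be sketched in a few lines of Python. This is only an illustration: the skill itself is prompt-only, so the agent writes the file directly. The file name comes from the feature list; the record fields (`timestamp`, `task`, `ambition`, `execution`, `score`) and both function names are assumptions made for this example, not the skill's documented schema.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# File name taken from the feature list; the record fields below are hypothetical.
HISTORY_FILE = Path(".self-eval-scores.jsonl")


def append_score(task: str, ambition: str, execution: str, score: int,
                 path: Path = HISTORY_FILE) -> dict:
    """Append one evaluation as a single JSON line and return the record."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "task": task,
        "ambition": ambition,
        "execution": execution,
        "score": score,
    }
    # Mode "a" never rewrites earlier lines, so the log stays append-only.
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


def read_scores(path: Path = HISTORY_FILE) -> list[dict]:
    """Return all past records, oldest first; a missing file means no history."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

One JSON object per line keeps each evaluation independent, so a new session can read the earlier entries without parsing the whole file as a single document.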

Self Eval by the numbers

556 all-time installs (skills.sh)
Ranked #229 of 1,356 Code Review & Quality skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 31, 2026 (Skillselion catalog sync)

npx skills add https://github.com/alirezarezvani/claude-skills --skill self-eval


Installs	556
Repo stars	★ 23.5k
Security audit	2 / 3 scanners passed
Last updated	July 17, 2026
Repository	alirezarezvani/claude-skills ↗

How do you calibrate honest quality scores after agent tasks?

Calibrate honest quality scores after any agent task using a fixed two-axis matrix, devil’s advocate reasoning, and JSONL history to catch score inflation across sessions.

Who is it for?

Developers running multi-step AI coding agents who need consistent post-task quality scoring with historical calibration across sessions.

Skip if: your workflow is manual, human-only code review with no agent involvement, or the task is a one-off where persistent JSONL history adds no value.

When should I use this skill?

Use it when an agent has just completed a coding task and you need calibrated quality scores, compared against prior JSONL history, to catch inflation.

What you get

Calibrated two-axis quality scores, devil's advocate critique notes, and append-only JSONL evaluation history per agent session.

Calibrated quality scores
JSONL evaluation log
Devil's advocate critique notes

By the numbers

Uses a fixed two-axis evaluation matrix
Persists evaluation history in JSONL format

Files

SKILL.md (Markdown, on GitHub)

../../../engineering/skills/self-eval/SKILL.md

Related skills

Improve Codebase Architecture: Safely deepen clusters of shallow modules into cohesive, testable units while respecting their external dependencies. (531k installs, 185k repo stars)

Caveman Review: Get ultra-compressed, one-line code review comments that cut noise while keeping every actionable fix. (260k installs, 92.5k repo stars)

Codebase Design: Shared vocabulary for designing deep modules: improve a module's interface, find deepening opportunities, decide where a seam goes, make code more testable. (233k installs, 185k repo stars)

Cavecrew: Delegate coding tasks to specialized subagents that return compressed output, keeping the main context window usable for much longer sessions. (210k installs, 92.5k repo stars)

Requesting Code Review: Dispatch a consistent, high-signal code reviewer subagent that catches plan deviations and quality issues before merging or continuing development. (178k installs, 260k repo stars)

Code Review: Reviews a branch or PR diff on two axes at once: conformance to coding standards plus a code-smell baseline, and whether it actually implements the original spec. (167k installs, 185k repo stars)

How it compares

Use self-eval instead of generic code-review skills when the input is agent-generated work and you need cross-session score calibration, not a one-time human lint pass.

FAQ

What scoring method does self-eval use?

self-eval applies a fixed two-axis quality matrix combined with devil's advocate reasoning after each agent task. The matrix produces calibrated scores rather than accepting the agent's optimistic self-assessment at face value.
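A fixed lookup matrix of this kind might look like the Python sketch below. The axis labels (Low/Medium/High and Poor/Adequate/Strong) come from this page; the 1 to 5 cell values, the `lookup_score` name, and the rule that both devil's advocate arguments must be non-empty are assumptions for illustration, and the skill's actual matrix may differ.

```python
AMBITION_LEVELS = ("Low", "Medium", "High")
EXECUTION_LEVELS = ("Poor", "Adequate", "Strong")

# Hypothetical cell values on a 1-5 scale; the skill's real matrix may differ.
SCORE_MATRIX = {
    ("Low", "Poor"): 1, ("Low", "Adequate"): 2, ("Low", "Strong"): 3,
    ("Medium", "Poor"): 1, ("Medium", "Adequate"): 3, ("Medium", "Strong"): 4,
    ("High", "Poor"): 2, ("High", "Adequate"): 4, ("High", "Strong"): 5,
}


def lookup_score(ambition: str, execution: str,
                 case_for_higher: str, case_for_lower: str) -> int:
    """Return the fixed matrix score once both counter-arguments are written."""
    if ambition not in AMBITION_LEVELS:
        raise ValueError(f"unknown ambition level: {ambition!r}")
    if execution not in EXECUTION_LEVELS:
        raise ValueError(f"unknown execution level: {execution!r}")
    # Assumed rule: refuse to score until both arguments have been stated.
    if not case_for_higher.strip() or not case_for_lower.strip():
        raise ValueError("both devil's advocate arguments are required")
    return SCORE_MATRIX[(ambition, execution)]
```

Because the score is read from a table rather than chosen freely, the only judgment calls left are the two axis ratings, which is what makes drift easier to spot.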

How does self-eval track score inflation over time?

self-eval appends calibrated evaluation results to JSONL history files across sessions. Developers can compare current scores against prior entries to detect when an agent systematically overrates its own output.
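One way to compare current scores against prior entries is a simple rolling-average check, sketched below in Python. The page does not say how the skill detects inflation, so the heuristic here (flag when the mean of the last `window` scores reaches `threshold`), the default values, the function names, and the `score` field are all assumptions.

```python
import json
from pathlib import Path
from statistics import mean


def load_scores(path: Path) -> list[int]:
    """Read the hypothetical "score" field from each JSONL line, oldest first."""
    if not path.exists():
        return []
    scores = []
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                scores.append(json.loads(line)["score"])
    return scores


def looks_inflated(scores: list[int], window: int = 5,
                   threshold: float = 4.0) -> bool:
    """Flag a run of uniformly high recent scores (assumed heuristic)."""
    recent = scores[-window:]
    # With too little history there is nothing to calibrate against.
    if len(recent) < window:
        return False
    return mean(recent) >= threshold
```

A flag from a check like this is a prompt to re-read the devil's advocate arguments for recent entries, not proof that any single score was wrong.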

When should developers invoke self-eval?

Developers should invoke self-eval immediately after an agent completes a coding, refactoring, or testing task. It fits review at the ship stage, when an honest quality gate is needed before merging agent-generated changes.

Is Self Eval safe to install?

skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

Code Review & Quality, testing, docs