
Hooks Eval
Score Claude Code hook scripts against a security-first 100-point rubric before you enable them in your agent loop.
Overview
hooks-eval is an agent skill most often used in Ship—security (also Build—agent-tooling, Ship—review) that scores Claude Code hooks on a 100-point security-first rubric with explicit quality gates.
Install
npx skills add https://github.com/athola/claude-night-market --skill hooks-eval
What is this skill?
- 100-point MCDA scoring rubric with documented vector normalization and stakeholder-weight methodology
- Security analysis block (30 points) with Critical −15, High −8, Medium −4, and Low −1 per finding
- Performance analysis block (25 points) as a dedicated weighted criterion alongside security
- Security checklist rows for dynamic eval with user input, command injection, unvalidated paths, and embedded secrets
- Aligns with the night-market skills-eval multi-metric evaluation methodology and sensitivity analysis guidance
- 100-point total scoring system
- 30-point security analysis weight with per-severity deductions
- 25-point performance analysis weight
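The per-severity deductions listed above can be sketched as a small scoring function. This is a minimal illustration, not the skill's own implementation: the function name `security_score` is hypothetical, and flooring the result at 0 is an assumption, since the rubric does not say whether the security block can go negative.

```python
from collections import Counter

# Per-finding deductions taken from the rubric's 30-point security block.
DEDUCTIONS = {"critical": 15, "high": 8, "medium": 4, "low": 1}
SECURITY_MAX = 30


def security_score(findings):
    """Return the security score for a list of severity labels.

    Assumption: the score is floored at 0 instead of going negative.
    """
    counts = Counter(label.lower() for label in findings)
    unknown = set(counts) - set(DEDUCTIONS)
    if unknown:
        raise ValueError(f"unknown severity labels: {sorted(unknown)}")
    lost = sum(DEDUCTIONS[label] * n for label, n in counts.items())
    return max(0, SECURITY_MAX - lost)


print(security_score(["high", "medium", "low"]))        # 30 - 8 - 4 - 1 = 17
print(security_score(["critical", "critical", "low"]))  # 31 points lost, floored to 0
```

Note that two Critical findings alone consume the whole 30-point block, which is consistent with the rubric's security-first weighting.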
Adoption & trust: 1 install on skills.sh; 304 GitHub stars; 3/3 security scanners passed (skills.sh audits); trending (+100% hot-view momentum).
What problem does it solve?
You wrote a Claude Code hook but have no consistent way to know if it is safe and performant enough to turn on in real sessions.
Who is it for?
Solo and indie builders who add Claude Code hooks and want MCDA-weighted security and performance review before enabling automation.
Skip if: Teams with no Claude Code hook lifecycle, or anyone who only needs generic app pen-testing unrelated to agent hook scripts.
When should I use this skill?
Use when evaluating Claude Code hook scripts for security vulnerabilities, performance risk, and overall quality gates before enabling or merging them.
What do I get? / Deliverables
You get a normalized score, categorized security deductions, and a clear pass/fail against quality gates so you can fix hooks or ship them with justified confidence.
- Weighted hook score out of 100 with security and performance breakdown
- Checklist-mapped findings with severity and point deductions
- Pass/fail recommendation against documented quality gates
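As a rough sketch of how the deliverables fit together, the snippet below sums per-category scores and maps the total onto the quality levels given in the SKILL.md excerpt further down this page. The category maxima and the four bands come from that excerpt; the function names are hypothetical, and the band below 26 is not shown in the excerpt, so the sketch returns `None` for it instead of guessing a label.

```python
# Category maxima from the rubric; they sum to 100.
CATEGORY_MAX = {
    "security": 30,
    "performance": 25,
    "compliance": 20,
    "reliability": 15,
    "maintainability": 10,
}

# Lower bounds of the quality bands listed in the rubric excerpt.
QUALITY_BANDS = [(91, "Excellent"), (76, "Good"), (51, "Acceptable"), (26, "Poor")]


def total_score(scores):
    """Validate per-category scores against their maxima and sum them."""
    if set(scores) != set(CATEGORY_MAX):
        raise ValueError("scores must cover exactly the five rubric categories")
    for name, value in scores.items():
        if not 0 <= value <= CATEGORY_MAX[name]:
            raise ValueError(f"{name} score {value} is outside 0..{CATEGORY_MAX[name]}")
    return sum(scores.values())


def quality_level(total):
    """Map a 0-100 total onto the rubric's named levels."""
    for floor, label in QUALITY_BANDS:
        if total >= floor:
            return label
    return None  # below 26: the excerpt is truncated, so no label is assumed


example = {
    "security": 22,
    "performance": 20,
    "compliance": 16,
    "reliability": 12,
    "maintainability": 8,
}
total = total_score(example)
print(total, quality_level(total))  # 78 Good
```

The actual pass/fail gate thresholds are defined by the skill itself and are not reproduced here.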
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Hook evaluation is a pre-ship quality gate: the rubric penalizes injection, secrets, and unsafe paths—the same risks you must clear before automation runs in production sessions. The canonical shelf is security because 30 of 100 points come from a vulnerability checklist with explicit Critical/High/Medium/Low deductions, not from feature completeness.
Where it fits
After drafting a PostToolUse formatter hook, run the rubric to catch unvalidated paths before you commit settings.json.
Block enabling a hook that shells out with user-controlled strings until Critical command-injection deductions are cleared.
Attach a scored hook evaluation summary to a PR so reviewers see the 30-point security section outcomes, not just the diff.
Re-run evaluation when you upgrade Claude Code or change hook timeouts after latency regressions in the 25-point performance block.
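To make the latency check in the last scenario concrete, here is a minimal sketch that classifies a measured hook duration using the thresholds from the rubric's performance block. The function name `latency_tier` is hypothetical. The rubric leaves a duration exactly on the upper boundary (for example 200 ms for PreToolUse) unspecified; this sketch treats it as "poor", which is an assumption.

```python
# Exclusive upper bounds in milliseconds, taken from the rubric's thresholds.
THRESHOLDS_MS = {
    "pre_tool_use": [(50, "excellent"), (100, "good"), (200, "acceptable")],
    "post_tool_use": [(100, "excellent"), (200, "good"), (500, "acceptable")],
}


def latency_tier(event, elapsed_ms):
    """Return the rubric tier for a hook event and measured duration."""
    if event not in THRESHOLDS_MS:
        raise ValueError(f"no thresholds defined for event: {event}")
    for limit, tier in THRESHOLDS_MS[event]:
        if elapsed_ms < limit:
            return tier
    # Assumption: values at or above the last bound count as "poor".
    return "poor"


print(latency_tier("pre_tool_use", 120))   # acceptable
print(latency_tier("post_tool_use", 120))  # good
```

The same 120 ms measurement lands in different tiers depending on the event, which is why a timeout or Claude Code upgrade that shifts latency is worth re-scoring.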
How it compares
Use a structured hook rubric instead of one-off chat reviews that skip weighted security penalties and repeatable scoring.
Common Questions / FAQ
Who is hooks-eval for?
It is for solo builders and small teams shipping Claude Code agent hooks who need a documented scoring rubric—not a popularity list entry—before hooks touch real tool-use traffic.
When should I use hooks-eval?
Use it in Build—agent-tooling while authoring hooks, in Ship—security before merging hook changes, in Ship—review on PRs, and in Operate—iterate after you change paths, dependencies, or hook events; run it whenever a hook reads user input or runs shell commands.
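For hooks that read user input or run shell commands, two of the rubric's checklist rows (command injection and unvalidated file path access) can be illustrated with a short sketch. Everything named here is hypothetical: `resolve_inside`, `formatter_argv`, and the `my-formatter` command are placeholders for illustration, not part of the skill or of Claude Code.

```python
from pathlib import Path


def resolve_inside(root, candidate):
    """Return the resolved path if it stays under root, else raise ValueError."""
    root_path = Path(root).resolve()
    target = (root_path / candidate).resolve()
    # Path.is_relative_to requires Python 3.9 or newer.
    if not target.is_relative_to(root_path):
        raise ValueError(f"path escapes project root: {candidate}")
    return target


def formatter_argv(root, candidate):
    """Build an argument list so no shell string is ever constructed.

    'my-formatter' is a placeholder command name; '--' is the common
    convention for ending option parsing, which this sketch assumes the
    command honors.
    """
    return ["my-formatter", "--", str(resolve_inside(root, candidate))]


print(formatter_argv("/tmp/project", "src/app.py"))
```

Passing a list like this to `subprocess.run` without `shell=True` means the file name is handed over as a single argument and is not interpreted by a shell, which addresses the command-injection row. The containment check addresses the unvalidated-path row.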
Is hooks-eval safe to install?
Treat it as evaluation criteria your agent follows locally; review the Security Audits panel on this Prism page for the ingested package risk signals before enabling it in automated workflows.
SKILL.md
# Hook Evaluation Criteria

Detailed scoring rubric and quality gates for hook evaluation.

## Mathematical Foundation

This evaluation framework follows Multi-Criteria Decision Analysis (MCDA) best practices:

- **Normalization**: Vector normalization for scale invariance ([full methodology](../../skills-eval/modules/multi-metric-evaluation-methodology.md))
- **Weighting**: Security-first weights with stakeholder validation
- **Aggregation**: Weighted sum with penalty-based security scoring
- **Validation**: Sensitivity analysis on non-security weights

**Documentation**: See [Multi-Metric Evaluation Methodology](../../skills-eval/modules/multi-metric-evaluation-methodology.md) for complete mathematical foundation.

## Scoring System (100 points total)

### Security Analysis (30 points)

**Vulnerability Detection:**

- Critical vulnerabilities: -15 points each
- High-risk issues: -8 points each
- Medium-risk issues: -4 points each
- Low-risk issues: -1 point each

**Security Checklist:**

| Check | Severity | Points Lost |
|-------|----------|-------------|
| Dynamic code evaluation with user input | Critical | -15 |
| Command injection vulnerability | Critical | -15 |
| Unvalidated file path access | High | -8 |
| Secrets/credentials in code | High | -8 |
| Missing input validation | Medium | -4 |
| Overly permissive patterns | Medium | -4 |
| No rate limiting | Low | -1 |
| Verbose error messages exposing internals | Low | -1 |

### Performance Analysis (25 points)

| Metric | Max Points | Criteria |
|--------|------------|----------|
| Execution time efficiency | 10 | PreToolUse <100ms, PostToolUse <200ms |
| Memory usage optimization | 8 | <50MB for simple hooks, <100MB for complex |
| I/O operation efficiency | 4 | Minimal file/network operations |
| Resource cleanup | 3 | Proper cleanup of handles, connections |

**Performance Thresholds:**

```yaml
pre_tool_use:
  excellent: <50ms
  good: <100ms
  acceptable: <200ms
  poor: >200ms
post_tool_use:
  excellent: <100ms
  good: <200ms
  acceptable: <500ms
  poor: >500ms
memory:
  excellent: <25MB
  good: <50MB
  acceptable: <100MB
  poor: >100MB
```

### Compliance Analysis (20 points)

| Aspect | Max Points | Requirements |
|--------|------------|--------------|
| Structure compliance | 8 | Valid JSON/Python, correct schema |
| Documentation completeness | 6 | Purpose, parameters, return values documented |
| Error handling | 4 | All exceptions caught, meaningful messages |
| Best practices | 2 | Follows hook authoring guidelines |

**Structure Requirements:**

- JSON hooks: Valid JSON schema with required fields
- Python hooks: Type hints, async/await patterns
- Matcher patterns: Valid regex, appropriate scope

### Reliability Analysis (15 points)

| Aspect | Max Points | Requirements |
|--------|------------|--------------|
| Error handling robustness | 6 | Graceful handling of all error conditions |
| Timeout management | 4 | Appropriate timeouts configured |
| Idempotency | 3 | Safe to retry without side effects |
| Graceful degradation | 2 | Falls back safely on failure |

**Reliability Checklist:**

- [ ] Hook returns valid response on all code paths
- [ ] Exceptions are caught and handled
- [ ] Timeout is configured appropriately
- [ ] Hook can be called multiple times safely
- [ ] Failure doesn't break agent operation

### Maintainability (10 points)

| Aspect | Max Points | Requirements |
|--------|------------|--------------|
| Code structure | 4 | Clear, modular, single responsibility |
| Documentation clarity | 3 | Purpose and behavior well explained |
| Modularity | 2 | Reusable components, no duplication |
| Test coverage | 1 | Tests exist for key functionality |

## Quality Levels

| Score | Level | Description |
|-------|-------|-------------|
| 91-100 | Excellent | Production-ready, follows all best practices |
| 76-90 | Good | Minor improvements suggested |
| 51-75 | Acceptable | Some issues requiring attention |
| 26-50 | Poor | Significant issues need