Karpathy Coder

Name: Karpathy Coder
Author: alirezarezvani

alirezarezvani/claude-skills

972 installs
23.5k repo stars
Updated July 17, 2026

About

karpathy-coder is an MIT-licensed agent skill (version 2.3.0) that enforces Andrej Karpathy's four core coding principles during every LLM-assisted coding session: surface assumptions before coding, keep it simple, make surgical changes, and define verifiable goals. It triggers on phrases like "review my diff", "check complexity", "am I overcomplicating this", "karpathy check", and "before I commit". The skill runs in a forked context (`context: fork`) and is compatible with claude-code, codex-cli, cursor, antigravity, opencode, and gemini-cli. Developers reach for karpathy-coder when an LLM might be overcoding, adding unnecessary abstraction, or shipping changes without explicit success criteria. It acts as an ongoing coding-discipline guardrail rather than a one-time linter pass.

Enforces 4 Karpathy principles: surface assumptions, keep it simple, make surgical changes, define verifiable goals
Ships Python detection tools, review agent, slash command, and pre-commit hook
Triggers on "review my diff", "check complexity", "am I overcomplicating this", "karpathy check", or "before I commit"
Prevents overcomplicated abstractions, dead code, and unstated assumptions
Hard-gate: run before any commit or major code change when using LLM agents

Karpathy Coder by the numbers

972 all-time installs (skills.sh)
+11 installs in the week ending Jul 29, 2026 (Skillselion tracking)
Ranked #125 of 1,356 Code Review & Quality skills by installs in the Skillselion catalog
Security screen: CRITICAL risk (skills.sh audit)
Data as of Jul 31, 2026 (Skillselion catalog sync)

npx skills add https://github.com/alirezarezvani/claude-skills --skill karpathy-coder

Installs	972
repo stars	★ 23.5k
Security audit	2 / 3 scanners passed
Last updated	July 17, 2026
Repository	alirezarezvani/claude-skills ↗

How do you stop LLMs from overcoding complex diffs?

Automatically enforce Andrej Karpathy’s four core coding principles during every LLM-assisted coding session.

Who is it for?

Developers using LLM coding assistants who want a pre-commit discipline check against overengineering and bloated diffs.

Skip if: Greenfield prototyping sessions where exploratory throwaway code is intentional and no commit is planned.

When should I use this skill?

The user says "review my diff", "check complexity", "karpathy check", or "before I commit", or expresses concern about overcomplicating LLM-generated code.

What you get

A reviewed diff with surfaced assumptions, simplified changes, surgical scope, and explicitly defined verifiable success criteria.

reviewed diff notes
surfaced assumptions list
verifiable goal definitions

By the numbers

Enforces 4 Karpathy coding principles in version 2.3.0
Compatible with 6 coding agent CLIs listed in manifest metadata

Files

SKILL.md (Markdown)

Karpathy Coder — Active Coding Discipline

Derived from Andrej Karpathy's observations on LLM coding pitfalls. This is more than a set of guidelines: it ships Python tools that detect violations, a review agent, a slash command, and a pre-commit hook.

"The models make wrong assumptions on your behalf and just run along with them without checking. They don't manage their confusion, don't seek clarifications, don't surface inconsistencies, don't present tradeoffs, don't push back when they should."

"They really like to overcomplicate code and APIs, bloat abstractions, don't clean up dead code... implement a bloated construction over 1000 lines when 100 would do."

"LLMs are exceptionally good at looping until they meet specific goals... Don't tell it what to do, give it success criteria and watch it go."

— Andrej Karpathy

The four principles

1. Think Before Coding

Don't assume. Don't hide confusion. Surface tradeoffs.

State assumptions explicitly. If uncertain, ask.
If multiple interpretations exist, present them — don't pick silently.
If a simpler approach exists, say so. Push back when warranted.
If something is unclear, stop. Name what's confusing. Ask.

2. Simplicity First

Minimum code that solves the problem. Nothing speculative.

No features beyond what was asked.
No abstractions for single-use code.
No "flexibility" or "configurability" that wasn't requested.
No error handling for impossible scenarios.
If you write 200 lines and it could be 50, rewrite it.

The test: Would a senior engineer say this is overcomplicated? If yes, simplify.

3. Surgical Changes

Touch only what you must. Clean up only your own mess.

Don't "improve" adjacent code, comments, or formatting.
Don't refactor things that aren't broken.
Match existing style, even if you'd do it differently.
If you notice unrelated dead code, mention it — don't delete it.
Remove imports/variables/functions that YOUR changes made unused.
Don't remove pre-existing dead code unless asked.

The test: Every changed line should trace directly to the user's request.

4. Goal-Driven Execution

Define success criteria. Loop until verified.

Instead of...	Transform to...
"Add validation"	"Write tests for invalid inputs, then make them pass"
"Fix the bug"	"Write a test that reproduces it, then make it pass"
"Refactor X"	"Ensure tests pass before and after"

For multi-step tasks, state a brief plan:

1. [Step] → verify: [check]
2. [Step] → verify: [check]
3. [Step] → verify: [check]
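A plan in this format can be checked mechanically. The sketch below is a hypothetical helper (not the shipped `goal_verifier.py`); it assumes each step is a numbered line and that the check follows the literal text `verify:`.

```python
import re

# Hypothetical pattern: a numbered step such as "1. do something → verify: check".
STEP_RE = re.compile(r"^\s*(\d+)\.\s+(.*)$")


def steps_missing_verify(plan: str) -> list[int]:
    """Return the step numbers that have no 'verify:' check."""
    missing = []
    for line in plan.splitlines():
        m = STEP_RE.match(line)
        if not m:
            continue  # not a numbered step
        number, body = int(m.group(1)), m.group(2)
        # A step counts as verified only if some text follows "verify:".
        _, sep, check = body.partition("verify:")
        if not sep or not check.strip():
            missing.append(number)
    return missing


plan = """\
1. Add validation to the form
2. Write test for empty input → verify: test fails first
3. Make the test pass → verify: pytest exits 0
"""
print(steps_missing_verify(plan))
```

Here only step 1 is reported, because it names an action without saying how success will be confirmed.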

Slash command

/karpathy-check — Run the full 4-principle review on your staged changes.

Python tools (`scripts/`)

All tools are stdlib-only. Run with --help.

Script	What it detects
`complexity_checker.py`	Over-engineering: too many classes, deep nesting, high cyclomatic complexity, unused params, premature abstractions
`diff_surgeon.py`	Diff noise: lines that don't trace to the stated goal — comment changes, style drift, drive-by refactors
`assumption_linter.py`	Hidden assumptions in a plan: unasked features, missing clarifications, silent interpretation choices
`goal_verifier.py`	Weak success criteria: vague plans without verifiable checks, missing test assertions

Sub-agent

karpathy-reviewer — Runs all 4 principles against a diff. Dispatched by /karpathy-check or manually before committing.

Pre-commit hook

hooks/karpathy-gate.sh — runs complexity_checker.py and diff_surgeon.py on staged files. Warns (non-blocking) when violations are found. Wire it via .claude/settings.json or Husky.

References

references/karpathy-principles.md — the source quotes, deeper context, when to relax each principle
references/anti-patterns.md — 10+ before/after examples across Python, TypeScript, and shell
references/enforcement-patterns.md — how to wire hooks, CI integration, team adoption

When to relax

These principles bias toward caution over speed. For trivial tasks (typo fixes, obvious one-liners), use judgment. The principles matter most on:

Non-trivial implementations (>20 lines changed)
Code you don't fully understand
Multi-step tasks with unclear requirements
Anything that will be reviewed by humans

Cross-tool compatibility

Installs via plugin for Claude Code. For other tools, copy the principles into your schema file:

Tool	Schema file
Claude Code	`CLAUDE.md` (auto-loaded by plugin)
Codex CLI	`AGENTS.md`
Cursor	`AGENTS.md` or `.cursorrules`
Antigravity / OpenCode / Gemini CLI	`AGENTS.md`

Related skills (chains via `context: fork`)

`self-eval` — honest quality scoring after completing work
`code-reviewer` — broader code review; karpathy-coder focuses on the 4 LLM-specific pitfalls
`llm-wiki` — compound knowledge; karpathy-coder ensures you don't overcomplicate while building it

{
  "status": "ok",
  "source": "stdin",
  "total_findings": 3,
  "by_category": {
    "assumption-just": 1,
    "assumption-obvious": 1,
    "missing-format": 1
  },
  "verdict": "REVIEW",
  "findings": [
    {
      "line": 1,
      "category": "assumption-just",
      "matched": "just",
      "message": "'just' often hides complexity. What's being skipped?",
      "context": "I'll just add a function to export all user data"
    },
    {
      "line": 1,
      "category": "assumption-obvious",
      "matched": "Obviously",
      "message": "Signals an unstated assumption. Is it really obvious?",
      "context": "Obviously we need caching too"
    },
    {
      "line": 1,
      "category": "missing-format",
      "matched": "export all user data",
      "message": "Export/save/fetch mentioned but format not specified (JSON? CSV? API?)",
      "context": "I'll just add a function to export all user data"
    }
  ]
}
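The `verdict` and `by_category` fields can be recomputed from the findings list. The sketch below restates the rule used in `assumption_linter.py` (0 findings is CLEAN, fewer than 5 is REVIEW, otherwise CLARIFY); the function names are hypothetical and the findings are trimmed to the fields used here.

```python
from collections import Counter


def verdict_for(total_findings: int) -> str:
    """Same thresholds as the linter's main(): zero, then fewer than five."""
    if total_findings == 0:
        return "CLEAN"
    return "REVIEW" if total_findings < 5 else "CLARIFY"


def by_category(findings: list[dict]) -> dict[str, int]:
    """Count findings per category."""
    return dict(Counter(f["category"] for f in findings))


# Trimmed copy of the three findings in the sample report.
findings = [
    {"line": 1, "category": "assumption-just"},
    {"line": 1, "category": "assumption-obvious"},
    {"line": 1, "category": "missing-format"},
]

print(verdict_for(len(findings)), by_category(findings))
```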

{
  "status": "ok",
  "threshold": "medium",
  "files_analyzed": 1,
  "total_findings": 1,
  "average_score": 85.0,
  "verdict": "WARN",
  "results": [
    {
      "file": "scripts/complexity_checker.py",
      "language": "python",
      "lines": 210,
      "functions": 7,
      "classes": 0,
      "imports": 6,
      "max_nesting": 5,
      "avg_cyclomatic": 3.4,
      "score": 85,
      "findings": [
        {
          "rule": "nesting-depth",
          "severity": "warn",
          "message": "Max nesting depth 5 (max 4). Extract or flatten."
        }
      ]
    }
  ]
}
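The `nesting-depth` finding comes from an indentation heuristic rather than a parse of the code. A minimal sketch of that idea, mirroring the `max_nesting` function in `complexity_checker.py` and assuming 4-space indentation for Python; `SAMPLE` is made up for illustration.

```python
import re

# Leading spaces of every non-blank line (tabs are not counted).
INDENT_RE = re.compile(r"^( *)\S", re.MULTILINE)


def max_nesting(text: str, unit: int = 4) -> int:
    """Deepest indentation level, counted in multiples of `unit` spaces."""
    depths = [len(m.group(1)) // unit for m in INDENT_RE.finditer(text)]
    return max(depths, default=0)


SAMPLE = """\
def handler(items):
    for item in items:
        if item:
            try:
                with open(item) as fh:
                    print(fh.read())
            except OSError:
                pass
"""

print(max_nesting(SAMPLE))
```

Because the count is purely indentation-based, continuation lines of a wrapped call also add depth, so a warning may sometimes reflect formatting rather than control flow.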

{
  "status": "ok",
  "source": "plan.md",
  "steps_found": 4,
  "score": 4,
  "max_score": 12,
  "percentage": 33.3,
  "has_final_verification": false,
  "verdict": "MISSING",
  "step_results": [
    {"title": "1. Add validation to the form", "score": 0, "level": "none", "has_verify_label": false},
    {"title": "2. Fix the bug with empty inputs", "score": 0, "level": "none", "has_verify_label": false},
    {"title": "3. Write test: submit empty form → expect error → verify: test passes", "score": 2, "level": "reasonable", "has_verify_label": true},
    {"title": "4. Run full test suite → verify: all green", "score": 2, "level": "reasonable", "has_verify_label": true}
  ],
  "recommendations": [
    "2 step(s) have no verification. Add 'verify: [check]' to each.",
    "No final/end-to-end verification step. Add one at the end."
  ]
}
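The totals in this report follow from the per-step scores. A worked sketch, assuming that each step can earn up to 3 points; that per-step maximum is inferred from `max_score` 12 over 4 steps and is not confirmed by the source.

```python
# Per-step scores copied from the sample report above.
step_scores = [0, 0, 2, 2]
MAX_PER_STEP = 3  # assumption: inferred from max_score / steps_found = 12 / 4

score = sum(step_scores)
max_score = MAX_PER_STEP * len(step_scores)
percentage = round(100 * score / max_score, 1)
unverified = sum(1 for s in step_scores if s == 0)

print(score, max_score, percentage, unverified)
```

The two zero-score steps are the ones the first recommendation refers to.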

Anti-Patterns Gallery

Before/after examples for each of the 4 Karpathy principles. Use these to calibrate your reviews.

Principle #1 — Think Before Coding

Silent assumption: scope

Request: "Add a feature to export user data"

Bad: Immediately writes a function that exports ALL users as JSON to disk, picks column names, and adds CSV support.

Good:

Before implementing, I need to clarify:
1. Which users — all? current tenant? filtered?
2. Which fields — some may be sensitive (PII)
3. Format — JSON, CSV, or API endpoint?
4. Destination — file download, S3, or API response?

Simplest: add an API endpoint returning paginated JSON for authenticated admins.
Want me to go with that, or do you need something different?

Silent assumption: interpretation

Request: "Make the search faster"

Bad: Adds Redis caching, async queries, AND database indexes in one commit.

Good:

"Faster" could mean:
- Lower latency per query (add indexes, ~2h)
- Higher throughput under load (add async, ~4h)
- Faster perceived speed (progressive loading, ~3h)

Current: ~500ms p50. Which dimension matters most?

Principle #2 — Simplicity First

Over-abstraction

Request: "Add a function to calculate discount"

Bad (150 lines):

from abc import ABC, abstractmethod
from dataclasses import dataclass

class DiscountStrategy(ABC):
    @abstractmethod
    def calculate(self, amount: float) -> float: ...

class PercentageDiscount(DiscountStrategy): ...
class FixedDiscount(DiscountStrategy): ...

@dataclass
class DiscountConfig:
    strategy: DiscountStrategy
    min_purchase: float = 0.0
    max_discount: float = float('inf')

class DiscountCalculator:
    def __init__(self, config: DiscountConfig): ...
    def apply_discount(self, amount: float) -> float: ...

Good (2 lines):

def calculate_discount(amount: float, percent: float) -> float:
    return amount * (percent / 100)

Add the Strategy pattern when you actually have multiple discount types. Not before.

Speculative features

Request: "Save user preferences to database"

Bad: Adds caching, validation, merge mode, notification system, and a PreferenceManager class.

Good:

import json

def save_preferences(db, user_id: int, preferences: dict):
    db.execute("UPDATE users SET preferences = ? WHERE id = ?",
               (json.dumps(preferences), user_id))
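To confirm the short version does the job, it can be exercised against an in-memory database. A minimal sketch, assuming `db` is a `sqlite3` connection (the `?` placeholders match sqlite3's parameter style) and a hypothetical `users` table with `id` and `preferences` columns.

```python
import json
import sqlite3


def save_preferences(db, user_id: int, preferences: dict):
    db.execute("UPDATE users SET preferences = ? WHERE id = ?",
               (json.dumps(preferences), user_id))


# Hypothetical schema, for illustration only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, preferences TEXT)")
db.execute("INSERT INTO users (id, preferences) VALUES (1, '{}')")

save_preferences(db, 1, {"theme": "dark"})

# Read the row back and decode it to check what was stored.
row = db.execute("SELECT preferences FROM users WHERE id = 1").fetchone()
stored = json.loads(row[0])
print(stored)
```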

Principle #3 — Surgical Changes

Drive-by refactoring

Request: "Fix the bug where empty emails crash the validator"

Bad diff (touches 15 lines, only 2 fix the bug):

  def validate_user(user_data):
-     # Check email format
+     """Validate user data."""           # ← docstring added (not asked)
+     email = user_data.get('email', '').strip()
      ...
+     if len(username) < 3:               # ← username validation (not asked)
+         raise ValueError("Username too short")

Good diff (touches 3 lines, all fix the bug):

  def validate_user(user_data):
-     if not user_data.get('email'):
+     email = user_data.get('email', '')
+     if not email or not email.strip():
          raise ValueError("Email required")
-     if '@' not in user_data['email']:
+     if '@' not in email:
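Following principle #4, the fix can be pinned down with checks that reproduce the bug. The sketch below shows the function as it would look after the good diff; the second error message, the return value, and the sample inputs are assumptions added for illustration.

```python
def validate_user(user_data):
    email = user_data.get('email', '')
    if not email or not email.strip():
        raise ValueError("Email required")
    if '@' not in email:
        raise ValueError("Invalid email")  # assumed message, not in the source
    return True  # assumed return value, for the demonstration only


def rejects(user_data) -> bool:
    """True if validate_user raises ValueError instead of crashing."""
    try:
        validate_user(user_data)
    except ValueError:
        return True
    return False


# Empty, missing, None, and whitespace-only emails are all rejected cleanly.
for bad in ({"email": ""}, {}, {"email": None}, {"email": "   "}):
    assert rejects(bad)

assert not rejects({"email": "a@example.com"})
print("ok")
```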

Style drift

Request: "Add logging to the upload function"

Bad: Changes quote style, adds type hints, adds docstring, reformats boolean logic.

Good: Adds import logging, logger = logging.getLogger(__name__), and 3 logger.info/error calls. Matches existing single-quote style. Doesn't touch anything else.

Principle #4 — Goal-Driven Execution

Vague vs concrete

Request: "Fix the authentication system"

Bad plan:

1. Review the code
2. Identify issues
3. Make improvements
4. Test the changes

Good plan:

Specific issue: users stay logged in after password change.

1. Write test: change password → old session should be invalid
   verify: test fails (reproduces bug)
2. Invalidate all sessions on password change
   verify: test passes
3. Check edge: multiple active sessions, concurrent changes
   verify: additional tests pass
4. Run full auth test suite
   verify: all green, no regressions

Missing final verification

Bad: "I've added rate limiting. It should work."

Good: "Rate limiting added. Verified: sent 11 requests → first 10 got 200, 11th got 429. Existing tests still pass."
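A claim like this is easy to back with a script. The sketch below uses a hypothetical in-memory fixed-window limiter and replays the 11-request check; the limit of 10 per window is an assumption read off the numbers in the example, and the class is not part of the skill.

```python
import time


class FixedWindowLimiter:
    """Hypothetical limiter: allow `limit` requests per `window` seconds."""

    def __init__(self, limit: int, window: float, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.window_start = clock()
        self.count = 0

    def handle(self) -> int:
        """Return the HTTP status a request would receive."""
        now = self.clock()
        if now - self.window_start >= self.window:
            self.window_start = now  # start a fresh window
            self.count = 0
        if self.count >= self.limit:
            return 429
        self.count += 1
        return 200


limiter = FixedWindowLimiter(limit=10, window=60.0)
statuses = [limiter.handle() for _ in range(11)]
print(statuses.count(200), statuses[-1])
```

Passing the clock in as a parameter keeps the window-reset behavior testable without sleeping.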

Quick-reference decision table

Situation	Principle	Action
Ambiguous requirement	#1 Think	List interpretations, ask
"I need a class for this"	#2 Simplicity	Can it be a function?
"While I'm here, I'll fix this too"	#3 Surgical	Mention it, don't fix it
"This should work"	#4 Goals	What test proves it?
User explicitly asked for abstraction	#2 relaxed	Build the abstraction
User said "refactor this file"	#3 relaxed	Broader changes are OK
One-liner fix, obvious correctness	all relaxed	Use judgment

Enforcement Patterns

How to wire the Karpathy principles into your workflow so they're enforced, not just documented.

Level 1 — Passive (read-only)

Install the plugin. The SKILL.md loads into every Claude Code session as context. The LLM reads it and (usually) follows it.

/plugin install karpathy-coder@claude-code-skills

Effectiveness: ~60%. The LLM sometimes forgets under pressure or during long tasks.

Level 2 — Active review (on demand)

Run /karpathy-check before committing. The review agent catches what the LLM missed.

# In Claude Code
/karpathy-check

# Or directly from shell
python scripts/complexity_checker.py src/ --threshold strict
python scripts/diff_surgeon.py

Effectiveness: ~85%. Catches most violations. Requires the user to remember to run it.

Level 3 — Automated gate (hook)

Wire hooks/karpathy-gate.sh as a pre-commit hook. Non-blocking (warns, doesn't reject) but visible.

Via Husky (Node.js projects)

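# Husky v8 syntax. `husky add` is deprecated in Husky v9+; there, put the command in .husky/pre-commit directly.
npx husky add .husky/pre-commit "bash path/to/karpathy-gate.sh"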

Via Claude Code settings

// .claude/settings.json
{
  "hooks": {
    "PostToolUse": [{
      "matcher": "Bash",
      "hooks": [{
        "type": "command",
        "command": "${CLAUDE_PLUGIN_ROOT}/hooks/karpathy-gate.sh"
      }]
    }]
  }
}

Via pre-commit framework

# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: karpathy-complexity
        name: Karpathy complexity check
        entry: python engineering/karpathy-coder/scripts/complexity_checker.py
        language: python
        types: [python]
        args: [--threshold, medium]
      - id: karpathy-diff
        name: Karpathy diff surgeon
        entry: python engineering/karpathy-coder/scripts/diff_surgeon.py
        language: python
        always_run: true

Effectiveness: ~95%. Violations get flagged before they enter the codebase.

Level 4 — CI integration

Add the tools to your CI pipeline so PRs get Karpathy-reviewed automatically.

GitHub Actions

# .github/workflows/karpathy-review.yml
name: Karpathy Review
on: [pull_request]

jobs:
  karpathy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Complexity check
        run: |
          python engineering/karpathy-coder/scripts/complexity_checker.py \
            $(git diff --name-only origin/main...HEAD | grep -E '\.(py|ts|tsx)$' | tr '\n' ' ') \
            --threshold medium --json > complexity.json
      - name: Diff noise check
        run: |
          python engineering/karpathy-coder/scripts/diff_surgeon.py \
            --diff origin/main...HEAD --json > noise.json
      - name: Report
        run: |
          echo "## Karpathy Review" >> $GITHUB_STEP_SUMMARY
          python -c "
          import json
          c = json.load(open('complexity.json'))
          n = json.load(open('noise.json'))
          print(f'Complexity: {c[\"average_score\"]}/100 ({c[\"total_findings\"]} findings)')
          print(f'Diff noise: {n[\"noise_ratio\"]*100:.0f}% ({n[\"verdict\"]})')
          " >> $GITHUB_STEP_SUMMARY

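The inline `python -c` report can also live in a small script, which is easier to test. A sketch under the same assumptions as the workflow: `complexity.json` carries `average_score` and `total_findings`, and `noise.json` carries `noise_ratio` and `verdict`. The function name and the sample values are hypothetical.

```python
import json


def format_report(complexity: dict, noise: dict) -> str:
    """Render the two summary lines written to the step summary."""
    return "\n".join([
        f"Complexity: {complexity['average_score']}/100 "
        f"({complexity['total_findings']} findings)",
        f"Diff noise: {noise['noise_ratio'] * 100:.0f}% ({noise['verdict']})",
    ])


# Sample values invented for illustration.
complexity = json.loads('{"average_score": 85.0, "total_findings": 1}')
noise = json.loads('{"noise_ratio": 0.25, "verdict": "WARN"}')
print(format_report(complexity, noise))
```
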
Team adoption

1. Start with Level 1 for a week. Let the team see the principles in action.
2. Add Level 2 when reviewing PRs. Run /karpathy-check on every PR.
3. Add Level 3 when the team agrees the principles are useful. Gate commits.
4. Add Level 4 for repos with multiple contributors or LLM-heavy workflows.

Anti-pattern: Going straight to Level 4 without team buy-in. The principles are opinionated — teams should experience them before enforcing them.

Karpathy Principles — Full Context

Source: Andrej Karpathy on X, January 2026.

The original observations

Karpathy identified four categories of LLM coding failure:

1. Assumption management

"The models make wrong assumptions on your behalf and just run along with them without checking. They don't manage their confusion, don't seek clarifications, don't surface inconsistencies, don't present tradeoffs, don't push back when they should."

What this means in practice:

User says "export user data" → LLM picks JSON, writes to disk, includes all fields, doesn't ask which users
User says "make it faster" → LLM adds caching, async, and connection pooling without asking what "faster" means
User says "fix the bug" → LLM guesses which bug based on context, never confirms

The fix: Before writing ANY code, list assumptions explicitly. If there are 2+ valid interpretations, present them and ask. If something is unclear, stop and name the confusion.

2. Overcomplexity

"They really like to overcomplicate code and APIs, bloat abstractions, don't clean up dead code... implement a bloated construction over 1000 lines when 100 would do."

Why LLMs do this:

Training data contains enterprise patterns (Strategy, Factory, Observer) applied at inappropriate scale
"More thorough" feels safe — the LLM can't be wrong for handling edge cases, even if they're impossible
No cost pressure — generating 1000 lines takes the same effort as generating 100

The fix: Ask "would a senior engineer say this is overcomplicated?" after writing. If a function has one caller, it shouldn't be a class. If an abstraction serves one use case, inline it.

3. Orthogonal edits

"They still sometimes change/remove comments and code they don't sufficiently understand as side effects, even if orthogonal to the task."

Common manifestations:

Reformats quote style while fixing a bug
Adds type annotations to unchanged functions
"Improves" a comment near the bug fix
Renames variables in untouched code
Adds docstrings to functions that weren't changed

The fix: Every changed line must trace to the user's request. If you notice something unrelated that could be improved, mention it — don't change it.
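One way to approximate this check is to flag changed lines that only touch comments or whitespace. The sketch below is a hypothetical heuristic, not the logic of `diff_surgeon.py`; it assumes a unified diff and Python-style `#` comments.

```python
def is_changed(line: str) -> bool:
    """True for added/removed lines, excluding the '+++'/'---' file headers."""
    return line.startswith(("+", "-")) and not line.startswith(("+++", "---"))


def noise_lines(diff: str) -> list[str]:
    """Return changed lines that are blank or comment-only."""
    noisy = []
    for line in diff.splitlines():
        if not is_changed(line):
            continue  # header, hunk marker, or context line
        content = line[1:].strip()
        if not content or content.startswith("#"):
            noisy.append(line)
    return noisy


diff = """\
--- a/validator.py
+++ b/validator.py
@@ -1,4 +1,5 @@
 def validate_user(user_data):
-    # Check email format
+    # Validate the email address
+    email = user_data.get('email', '')
     return True
"""

flagged = noise_lines(diff)
changed = [ln for ln in diff.splitlines() if is_changed(ln)]
print(f"{len(flagged)} of {len(changed)} changed lines look like noise")
```

A comment edit is not always noise (the request may be about the comment), so a result like this is a prompt for review rather than a verdict.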

4. Weak verification loops

"LLMs are exceptionally good at looping until they meet specific goals... Don't tell it what to do, give it success criteria and watch it go."

The insight: LLMs perform dramatically better with declarative goals ("all tests pass") than imperative instructions ("add a try/except block"). The best workflow:

1. Define success criteria as concrete, verifiable checks
2. Let the LLM loop until all checks pass
3. Each step has its own "verify:" annotation
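This workflow amounts to a loop over named checks. A minimal sketch, assuming each check is a zero-argument callable that returns True on success and that `attempt_fix` stands in for whatever the LLM does between rounds; all names here are hypothetical.

```python
from typing import Callable

Check = Callable[[], bool]


def loop_until_verified(checks: dict[str, Check],
                        attempt_fix: Callable[[list[str]], None],
                        max_rounds: int = 5) -> list[str]:
    """Run checks, hand failures to attempt_fix, repeat; return what still fails."""
    failing: list[str] = []
    for _ in range(max_rounds):
        failing = [name for name, check in checks.items() if not check()]
        if not failing:
            break
        attempt_fix(failing)
    else:
        # Rounds exhausted: re-run once so the result reflects the last fix.
        failing = [name for name, check in checks.items() if not check()]
    return failing


# Toy state standing in for a codebase: each "fix" flips one flag per round.
state = {"test reproduces bug": True, "test passes": False, "suite green": False}


def attempt_fix(failing: list[str]) -> None:
    state[failing[0]] = True


checks = {name: (lambda n=name: state[n]) for name in state}
print(loop_until_verified(checks, attempt_fix))
```

The `max_rounds` cap is a design choice in this sketch: it keeps the loop from running forever when a check can never pass.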

When to relax each principle

Principle	Relax when...
Think Before Coding	The request is unambiguous and self-contained (e.g., "add a return statement on line 42")
Simplicity First	The user explicitly asked for an abstraction, configuration, or extensibility
Surgical Changes	The user said "refactor this file" or "clean up this module"
Goal-Driven Execution	The task is a one-liner with obvious correctness (e.g., rename a variable)

The 80/20 of enforcement

If you adopt only ONE principle, adopt Surgical Changes (#3). It's the most measurable (diff analysis), the most commonly violated (LLMs love to "improve" things), and the easiest to check (does the diff contain lines unrelated to the task?).

If you adopt TWO, add Simplicity First (#2). Overcomplexity is the second-most-common failure and the most expensive to fix (you ship abstraction debt, then maintain it forever).

#!/usr/bin/env python3
"""
assumption_linter.py — Detect hidden assumptions in a plan or proposal.

Karpathy Principle #1 (Think Before Coding): "State your assumptions
explicitly. If uncertain, ask. If multiple interpretations exist, present
them — don't pick silently."

Reads a markdown plan (or stdin) and flags:
  - Phrases that indicate silent choices ("I'll just...", "Obviously...", "Simply...")
  - Missing scope boundaries ("export" without specifying what/who/how)
  - Format/location assumptions without explicit mention
  - Single-interpretation language for ambiguous requirements
  - Missing error/edge-case consideration

Usage:
    python assumption_linter.py plan.md
    echo "I'll add a function to export user data" | python assumption_linter.py -
    python assumption_linter.py plan.md --json

This is a heuristic tool, not a proof engine. False positives are expected;
the point is to trigger a conversation about assumptions.
"""
from __future__ import annotations
import argparse
import json
import re
import sys
from pathlib import Path

# --- Pattern library ---

ASSUMPTION_SIGNALS = [
    (re.compile(r"\b(?:I'll just|let me just|we can just|just)\b", re.I),
     "assumption-just", "'just' often hides complexity. What's being skipped?"),
    (re.compile(r"\b(?:obviously|clearly|of course|naturally)\b", re.I),
     "assumption-obvious", "Signals an unstated assumption. Is it really obvious?"),
    (re.compile(r"\b(?:simply|straightforward|trivial|easy)\b", re.I),
     "assumption-simple", "Minimizing language. Could be hiding real complexity."),
    (re.compile(r"\b(?:should be fine|should work|shouldn't be a problem)\b", re.I),
     "assumption-hopeful", "Hopeful rather than verified. How will you confirm?"),
    (re.compile(r"\b(?:I assume|assuming|I'm guessing|probably)\b", re.I),
     "assumption-explicit", "At least it's explicit — but have you verified?"),
    (re.compile(r"\b(?:all users|every|everything|always|never)\b", re.I),
     "scope-absolute", "Absolute scope. Is that really the case?"),
]

MISSING_CLARIFICATION = [
    (re.compile(r"\b(?:export|import|save|load|fetch|send)\b.*\b(?:data|file|users)\b", re.I),
     "missing-format", "Export/save/fetch mentioned but format not specified (JSON? CSV? API?)"),
    (re.compile(r"\b(?:fix|improve|optimize|refactor|update)\b", re.I),
     "vague-action", "Vague action verb. What specifically changes? What's the measurable improvement?"),
    (re.compile(r"\b(?:handle|deal with|take care of)\b.*\b(?:error|edge|case)\b", re.I),
     "vague-error-handling", "Error handling mentioned vaguely. Which errors? What behavior?"),
    (re.compile(r"\b(?:the user|users)\b(?!.*\b(?:who|which|specific|certain|admin|role)\b)", re.I),
     "unscoped-user", "Which user(s)? All? Specific role? Authenticated only?"),
]

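# NOTE: not referenced by lint_text() below; block-level verification is
# checked there with a separate regex.
NO_VERIFICATION = [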
    (re.compile(r"^(?:(?!(?:test|verify|check|assert|confirm|ensure|validate)).)*$", re.I),
     "no-verification", "No verification step found in this block. How will you know it works?"),
]


def lint_text(text, source_name="stdin"):
    """Lint a plan text. Return list of findings."""
    findings = []
    lines = text.splitlines()

    for i, line in enumerate(lines, 1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue

        for pattern, category, message in ASSUMPTION_SIGNALS:
            for m in pattern.finditer(stripped):
                findings.append({
                    "line": i,
                    "category": category,
                    "matched": m.group(0),
                    "message": message,
                    "context": stripped[:120],
                })

        for pattern, category, message in MISSING_CLARIFICATION:
            # Search once and reuse the match object.
            m = pattern.search(stripped)
            if m:
                findings.append({
                    "line": i,
                    "category": category,
                    "matched": m.group(0),
                    "message": message,
                    "context": stripped[:120],
                })

    # Check if any "plan" or numbered-list block lacks verification
    plan_blocks = re.findall(r"(?:^|\n)((?:\d+\.\s+.+\n?)+)", text)
    for block in plan_blocks:
        has_verify = bool(re.search(r"\b(?:test|verify|check|assert|confirm|ensure|validate)\b", block, re.I))
        if not has_verify:
            findings.append({
                "line": 0,
                "category": "missing-verification",
                "matched": block[:80].replace("\n", " "),
                "message": "Plan block has no verification step. Add 'verify:' checks.",
                "context": block[:120].replace("\n", " "),
            })

    return findings


def main():
    p = argparse.ArgumentParser(
        description="Detect hidden assumptions in a plan or proposal (Karpathy Principle #1).",
        epilog="Reads a markdown file or stdin. Flags silent choices, vague actions, and missing verification.",
    )
    p.add_argument("input", nargs="?", default="-", help="Markdown file to lint, or - for stdin")
    p.add_argument("--json", action="store_true", help="JSON output")
    args = p.parse_args()

    if args.input == "-":
        text = sys.stdin.read()
        source = "stdin"
    else:
        path = Path(args.input)
        if not path.exists():
            print(f"[error] {path} not found", file=sys.stderr)
            sys.exit(1)
        text = path.read_text(encoding="utf-8", errors="replace")
        source = str(path)

    findings = lint_text(text, source)

    categories = {}
    for f in findings:
        categories.setdefault(f["category"], []).append(f)

    result = {
        "status": "ok",
        "source": source,
        "total_findings": len(findings),
        "by_category": {k: len(v) for k, v in categories.items()},
        "verdict": "CLEAN" if len(findings) == 0 else ("REVIEW" if len(findings) < 5 else "CLARIFY"),
        "findings": findings,
    }

    if args.json:
        print(json.dumps(result, indent=2))
        return

    print(f"Assumption Linter — {source}")
    print(f"Findings: {len(findings)}  Verdict: {result['verdict']}")
    if findings:
        print()
        for cat, items in categories.items():
            print(f"  [{cat}] ({len(items)})")
            for item in items[:5]:
                line_ref = f"L{item['line']}: " if item["line"] else ""
                print(f"    {line_ref}{item['message']}")
                print(f"    → \"{item['matched']}\" in: {item['context'][:80]}")
            if len(items) > 5:
                print(f"    ... and {len(items) - 5} more")
            print()
    else:
        print("\n  Plan looks explicit. Assumptions are surfaced.")

    print(f"\nVerdict: {result['verdict']}")


if __name__ == "__main__":
    main()
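
To see how the pattern library behaves on a single line, the two regexes below are copied from `ASSUMPTION_SIGNALS`; the input sentence is made up. Because the regex engine reports the leftmost match, the phrase `I'll just` is what gets recorded when a line opens with it, rather than the bare word `just`.

```python
import re

# Copied from ASSUMPTION_SIGNALS in assumption_linter.py.
JUST = re.compile(r"\b(?:I'll just|let me just|we can just|just)\b", re.I)
OBVIOUS = re.compile(r"\b(?:obviously|clearly|of course|naturally)\b", re.I)

line = "I'll just add caching, obviously."  # made-up input

# Collect every match from both patterns, in pattern order.
matches = [m.group(0) for pat in (JUST, OBVIOUS) for m in pat.finditer(line)]
print(matches)
```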

#!/usr/bin/env python3
"""
complexity_checker.py — Detect over-engineering in Python/TypeScript files.

Karpathy Principle #2 (Simplicity First): "No abstractions for single-use code.
If you write 200 lines and it could be 50, rewrite it."

Checks:
  - Cyclomatic complexity (branches per function)
  - Class count relative to file size (too many classes = premature abstraction)
  - Nesting depth (deep nesting = hard to read)
  - Function length (long functions = doing too much)
  - Import count (many imports = over-coupled)
  - Abstract base classes / protocols for small files (premature patterns)

Usage:
    python complexity_checker.py path/to/file.py
    python complexity_checker.py src/ --threshold medium
    python complexity_checker.py . --ext py,ts --json

Thresholds:
    strict  — flags aggressively (good for new code)
    medium  — balanced (default)
    relaxed — flags only egregious cases (good for legacy code)
"""
from __future__ import annotations
import argparse
import json
import os
import re
import sys
from pathlib import Path

# --- Thresholds ---

THRESHOLDS = {
    "strict": {
        "max_cyclomatic": 5,
        "max_nesting": 3,
        "max_function_lines": 30,
        "max_imports": 10,
        "max_classes_per_100_lines": 2,
        "max_file_lines": 300,
    },
    "medium": {
        "max_cyclomatic": 8,
        "max_nesting": 4,
        "max_function_lines": 50,
        "max_imports": 15,
        "max_classes_per_100_lines": 3,
        "max_file_lines": 500,
    },
    "relaxed": {
        "max_cyclomatic": 12,
        "max_nesting": 5,
        "max_function_lines": 80,
        "max_imports": 25,
        "max_classes_per_100_lines": 5,
        "max_file_lines": 1000,
    },
}

# --- Analysis functions ---

BRANCH_KEYWORDS_PY = re.compile(
    r"^\s*(if |elif |for |while |except |with |and |or |case )", re.MULTILINE
)
BRANCH_KEYWORDS_TS = re.compile(
    r"^\s*(if\s*\(|else if|for\s*\(|while\s*\(|catch\s*\(|case |switch\s*\(|\?\?|&&|\|\|)",
    re.MULTILINE,
)
FUNC_DEF_PY = re.compile(r"^\s*(?:async\s+)?def\s+(\w+)", re.MULTILINE)
FUNC_DEF_TS = re.compile(
    r"^\s*(?:export\s+)?(?:async\s+)?(?:function\s+(\w+)|(?:const|let)\s+(\w+)\s*=\s*(?:async\s+)?\()",
    re.MULTILINE,
)
CLASS_DEF_PY = re.compile(r"^\s*class\s+\w+", re.MULTILINE)
CLASS_DEF_TS = re.compile(r"^\s*(?:export\s+)?(?:abstract\s+)?class\s+\w+", re.MULTILINE)
IMPORT_PY = re.compile(r"^(?:import |from \S+ import )", re.MULTILINE)
IMPORT_TS = re.compile(r"^import\s+", re.MULTILINE)
ABC_PATTERN = re.compile(r"ABC|abstractmethod|Protocol|@abstract|Abstract\w+Base", re.MULTILINE)
INDENT_RE = re.compile(r"^( *)\S", re.MULTILINE)


def detect_lang(path):
    ext = path.suffix.lower()
    if ext in {".py"}:
        return "python"
    if ext in {".ts", ".tsx", ".js", ".jsx"}:
        return "typescript"
    return None


def count_branches(text, lang):
    pat = BRANCH_KEYWORDS_PY if lang == "python" else BRANCH_KEYWORDS_TS
    return len(pat.findall(text))


def extract_functions(text, lang):
    """Return list of (name, start_line, line_count)."""
    pat = FUNC_DEF_PY if lang == "python" else FUNC_DEF_TS
    lines = text.splitlines()
    funcs = []
    for m in pat.finditer(text):
        name = m.group(1) or (m.group(2) if m.lastindex and m.lastindex >= 2 else "anonymous")
        start = text[:m.start()].count("\n")
        # Estimate function length: count indented lines until next same-level def or end
        indent = len(m.group(0)) - len(m.group(0).lstrip())
        end = start + 1
        for i in range(start + 1, len(lines)):
            stripped = lines[i].rstrip()
            if not stripped:
                continue
            line_indent = len(stripped) - len(stripped.lstrip())
            if line_indent <= indent and stripped.lstrip() and not stripped.lstrip().startswith(("#", "//", "/*", "*")):
                if lang == "python" and (stripped.lstrip().startswith("def ") or stripped.lstrip().startswith("class ") or stripped.lstrip().startswith("async def ")):
                    break
                if lang == "typescript" and pat.match(stripped):
                    break
            end = i + 1
        funcs.append({"name": name, "start_line": start + 1, "lines": end - start})
    return funcs


def max_nesting(text, lang):
    """Return the maximum indentation depth in the file."""
    if lang == "python":
        unit = 4
    else:
        unit = 2
    depths = []
    for m in INDENT_RE.finditer(text):
        spaces = len(m.group(1))
        depths.append(spaces // unit if unit else 0)
    return max(depths) if depths else 0


def analyze_file(path, thresholds):
    """Analyze a single file. Return dict with findings."""
    text = path.read_text(encoding="utf-8", errors="replace")
    lang = detect_lang(path)
    if not lang:
        return None

    lines = text.splitlines()
    line_count = len(lines)
    findings = []

    # File length
    if line_count > thresholds["max_file_lines"]:
        findings.append({
            "rule": "file-length",
            "severity": "warn",
            "message": f"File is {line_count} lines (max {thresholds['max_file_lines']}). Consider splitting.",
        })

    # Import count
    imp_pat = IMPORT_PY if lang == "python" else IMPORT_TS
    import_count = len(imp_pat.findall(text))
    if import_count > thresholds["max_imports"]:
        findings.append({
            "rule": "import-count",
            "severity": "warn",
            "message": f"{import_count} imports (max {thresholds['max_imports']}). High coupling?",
        })

    # Class density
    cls_pat = CLASS_DEF_PY if lang == "python" else CLASS_DEF_TS
    class_count = len(cls_pat.findall(text))
    if line_count > 0:
        density = class_count / (line_count / 100)
        if density > thresholds["max_classes_per_100_lines"]:
            findings.append({
                "rule": "class-density",
                "severity": "warn",
                "message": f"{class_count} classes in {line_count} lines ({density:.1f} per 100). Premature abstraction?",
            })

    # Premature ABC/Protocol in small files
    if class_count > 0 and line_count < 200 and ABC_PATTERN.search(text):
        findings.append({
            "rule": "premature-abstraction",
            "severity": "warn",
            "message": "Abstract base class / Protocol in a file under 200 lines. Is this needed yet?",
        })

    # Nesting depth
    depth = max_nesting(text, lang)
    if depth > thresholds["max_nesting"]:
        findings.append({
            "rule": "nesting-depth",
            "severity": "warn",
            "message": f"Max nesting depth {depth} (max {thresholds['max_nesting']}). Extract or flatten.",
        })

    # Cyclomatic complexity (file-level)
    branches = count_branches(text, lang)
    funcs = extract_functions(text, lang)
    func_count = max(len(funcs), 1)
    avg_cyclomatic = branches / func_count
    if avg_cyclomatic > thresholds["max_cyclomatic"]:
        findings.append({
            "rule": "cyclomatic-complexity",
            "severity": "warn",
            "message": f"Average cyclomatic complexity {avg_cyclomatic:.1f} (max {thresholds['max_cyclomatic']}). Simplify branching.",
        })

    # Function length
    for f in funcs:
        if f["lines"] > thresholds["max_function_lines"]:
            findings.append({
                "rule": "function-length",
                "severity": "warn",
                "message": f"Function '{f['name']}' is {f['lines']} lines (max {thresholds['max_function_lines']}). Split it.",
                "line": f["start_line"],
            })

    score = max(0, 100 - len(findings) * 15)
    return {
        "file": str(path),
        "language": lang,
        "lines": line_count,
        "functions": len(funcs),
        "classes": class_count,
        "imports": import_count,
        "max_nesting": depth,
        "avg_cyclomatic": round(avg_cyclomatic, 1),
        "score": score,
        "findings": findings,
    }


def collect_files(target, extensions):
    target = Path(target)
    if target.is_file():
        return [target]
    files = []
    for ext in extensions:
        files.extend(target.rglob(f"*.{ext}"))
    # Exclude common non-source dirs
    skip = {"node_modules", ".git", "__pycache__", ".venv", "venv", "dist", "build"}
    return [f for f in files if not any(p in skip for p in f.parts)]


def main():
    p = argparse.ArgumentParser(
        description="Detect over-engineering in Python/TypeScript files (Karpathy Principle #2).",
        epilog="Thresholds: strict (new code), medium (default), relaxed (legacy).",
    )
    p.add_argument("target", help="File or directory to analyze")
    p.add_argument(
        "--threshold",
        choices=sorted(THRESHOLDS.keys()),
        default="medium",
        help="Strictness level (default: medium)",
    )
    p.add_argument(
        "--ext",
        default="py,ts,tsx,js,jsx",
        help="Comma-separated file extensions to scan (default: py,ts,tsx,js,jsx)",
    )
    p.add_argument("--json", action="store_true", help="JSON output")
    args = p.parse_args()

    thresholds = THRESHOLDS[args.threshold]
    extensions = [e.strip().lstrip(".") for e in args.ext.split(",")]
    files = collect_files(args.target, extensions)

    if not files:
        msg = f"No files found matching extensions: {extensions}"
        if args.json:
            print(json.dumps({"status": "error", "message": msg}))
        else:
            print(f"[error] {msg}", file=sys.stderr)
        sys.exit(1)

    results = []
    for f in sorted(files):
        r = analyze_file(f, thresholds)
        if r:
            results.append(r)

    total_findings = sum(len(r["findings"]) for r in results)
    avg_score = sum(r["score"] for r in results) / len(results) if results else 100

    summary = {
        "status": "ok",
        "threshold": args.threshold,
        "files_analyzed": len(results),
        "total_findings": total_findings,
        "average_score": round(avg_score, 1),
        "verdict": "PASS" if total_findings == 0 else ("WARN" if avg_score >= 50 else "FAIL"),
        "results": results,
    }

    if args.json:
        print(json.dumps(summary, indent=2))
        return

    print(f"Karpathy Simplicity Check — {len(results)} files, threshold: {args.threshold}")
    print(f"Average score: {avg_score:.0f}/100  Findings: {total_findings}")
    print()
    for r in results:
        if not r["findings"]:
            continue
        print(f"  {r['file']}  (score {r['score']}/100)")
        for f in r["findings"]:
            line = f"  line {f['line']}" if "line" in f else ""
            print(f"    [{f['severity'].upper()}] {f['rule']}{line}: {f['message']}")
        print()
    if total_findings == 0:
        print("  No findings. Code looks appropriately simple.")
    print(f"\nVerdict: {summary['verdict']}")


if __name__ == "__main__":
    main()
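
The scoring arithmetic in complexity_checker.py above is easy to lose among the regexes, so here is a minimal sketch of just that part. The helper names `file_score` and `overall_verdict` are hypothetical and do not exist in the script; the constants (15 points per finding, a WARN floor at an average score of 50) are taken from `analyze_file()` and `main()`.

```python
from __future__ import annotations


def file_score(finding_count: int) -> int:
    """Each finding costs 15 points; the score never drops below 0."""
    return max(0, 100 - finding_count * 15)


def overall_verdict(finding_counts: list[int]) -> str:
    """Mirror the PASS / WARN / FAIL rule applied across analyzed files.

    finding_counts holds one entry per analyzed file (hypothetical input shape).
    """
    if sum(finding_counts) == 0:
        return "PASS"
    scores = [file_score(n) for n in finding_counts]
    average = sum(scores) / len(scores)
    return "WARN" if average >= 50 else "FAIL"


if __name__ == "__main__":
    print(file_score(2))            # two findings in one file
    print(overall_verdict([0, 0]))  # no findings anywhere
    print(overall_verdict([1, 3]))  # scores 85 and 55
    print(overall_verdict([4, 5]))  # scores 40 and 25
```

Because every finding costs the same 15 points regardless of rule, seven findings in one file are enough to floor its score at 0.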

#!/usr/bin/env python3
"""
diff_surgeon.py — Detect diff noise: changes that don't trace to the stated goal.

Karpathy Principle #3 (Surgical Changes): "Every changed line should trace
directly to the user's request."

Analyzes a git diff and flags:
  - Comment-only changes (unrelated to the task)
  - Whitespace / formatting changes
  - Import additions not used by the new code
  - Style changes (quote style, trailing commas, semicolons)
  - Docstring additions to unchanged functions
  - Variable renames in untouched code
  - Type annotation additions to unchanged signatures

Usage:
    python diff_surgeon.py                          # analyze staged diff
    python diff_surgeon.py --diff HEAD~1..HEAD      # analyze last commit
    python diff_surgeon.py --file changes.diff      # analyze a diff file
    python diff_surgeon.py --json

Exit codes:
    0  clean — all changes look intentional
    1  noise detected — review before committing
"""
from __future__ import annotations
import argparse
import json
import re
import subprocess
import sys
from pathlib import Path

# --- Noise detectors ---

COMMENT_ONLY = re.compile(r"^[+-]\s*(?:#|//|/\*|\*|<!--)")
WHITESPACE_ONLY = re.compile(r"^[+-]\s*$")
QUOTE_CHANGE = re.compile(r'^[+-]\s*.*["\'].*["\']')
DOCSTRING_ADD = re.compile(r'^[+]\s*"""')
IMPORT_LINE = re.compile(r"^[+]\s*(?:import |from \S+ import |const .* = require)")
TYPE_ANNOTATION = re.compile(r"^[+-].*:\s*(?:str|int|float|bool|list|dict|Optional|Union|Any|string|number|boolean)\b")
SEMICOLON_CHANGE = re.compile(r"^[+-].*;\s*$")
TRAILING_COMMA = re.compile(r"^[+-].*,\s*$")


def get_diff(args):
    """Get diff text from args."""
    if args.file:
        return Path(args.file).read_text(encoding="utf-8", errors="replace")
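    # Default to the staged diff when no range is given.
    cmd = ["git", "diff", args.diff or "--staged"]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (subprocess.TimeoutExpired, FileNotFoundError) as e:
        print(f"[error] git diff failed: {e}", file=sys.stderr)
        sys.exit(1)
    # A non-zero exit (not a git repository, bad range) would otherwise yield
    # empty stdout and be reported as "No diff to analyze".
    if result.returncode != 0:
        print(f"[error] git diff failed: {result.stderr.strip()}", file=sys.stderr)
        sys.exit(1)
    return result.stdout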


def parse_hunks(diff_text):
    """Parse a unified diff into per-file hunks."""
    files = []
    current_file = None
    current_lines = []

    for line in diff_text.splitlines():
        if line.startswith("diff --git"):
            if current_file:
                files.append({"file": current_file, "lines": current_lines})
            # Extract filename: diff --git a/path b/path
            parts = line.split(" b/")
            current_file = parts[-1] if len(parts) > 1 else "unknown"
            current_lines = []
        elif line.startswith("+++ ") or line.startswith("--- "):
            continue
        elif line.startswith("@@"):
            current_lines.append({"type": "hunk_header", "text": line})
        elif line.startswith("+") or line.startswith("-"):
            current_lines.append({"type": "change", "text": line})

    if current_file:
        files.append({"file": current_file, "lines": current_lines})
    return files


def classify_line(line_text):
    """Classify a changed line. Returns a noise category or None if intentional."""
    if WHITESPACE_ONLY.match(line_text):
        return "whitespace"
    if COMMENT_ONLY.match(line_text):
        return "comment-only"
    if DOCSTRING_ADD.match(line_text):
        return "docstring-addition"
    # Semicolon-only edits are detected in analyze_file_diff(), where the added
    # line can be compared with the deleted one. A single line ending in ";"
    # says nothing about whether the semicolon is the only change.
    return None


def _style_key(line_text):
    """Normalize a changed line so quote- and semicolon-only edits compare equal."""
    return line_text[1:].strip().replace('"', "'").rstrip(";").rstrip()


def analyze_file_diff(file_data):
    """Analyze a single file's diff for noise."""
    findings = []
    change_lines = [l for l in file_data["lines"] if l["type"] == "change"]
    if not change_lines:
        return findings

    additions = [l["text"] for l in change_lines if l["text"].startswith("+")]
    deletions = [l["text"] for l in change_lines if l["text"].startswith("-")]

    for line_data in change_lines:
        category = classify_line(line_data["text"])
        if category:
            findings.append({
                "category": category,
                "line": line_data["text"][:120],
            })

    # Detect paired +/- lines that differ only in quote style or a trailing
    # semicolon. Deletions are indexed by their normalized form so pairing does
    # not depend on line order.
    deleted_by_key = {}
    for d in deletions:
        deleted_by_key.setdefault(_style_key(d), []).append(d)
    for a in additions:
        key = _style_key(a)
        candidates = deleted_by_key.get(key)
        if not key or not candidates:
            continue
        d = candidates.pop(0)
        a_body, d_body = a[1:].strip(), d[1:].strip()
        if a_body == d_body:
            continue  # identical text (e.g. a moved line), not a style swap
        quotes_differ = a_body.rstrip(";").rstrip() != d_body.rstrip(";").rstrip()
        findings.append({
            "category": "quote-style-swap" if quotes_differ else "semicolon-style",
            "line": f"{d[:60]} → {a[:60]}",
        })

    return findings


def main():
    p = argparse.ArgumentParser(
        description="Detect diff noise — changes that don't trace to the stated goal (Karpathy Principle #3).",
        epilog="Run before committing to catch drive-by refactors and style drift.",
    )
    p.add_argument("--diff", default=None, help="Git diff range (e.g. HEAD~1..HEAD). Default: staged changes.")
    p.add_argument("--file", default=None, help="Read diff from a file instead of git")
    p.add_argument("--json", action="store_true", help="JSON output")
    args = p.parse_args()

    diff_text = get_diff(args)
    if not diff_text.strip():
        result = {"status": "ok", "message": "No diff to analyze", "files": 0, "noise_lines": 0, "verdict": "CLEAN"}
        if args.json:
            print(json.dumps(result, indent=2))
        else:
            print("No diff to analyze. Stage changes first (git add) or specify --diff range.")
        return

    file_diffs = parse_hunks(diff_text)
    all_findings = []
    file_results = []

    for fd in file_diffs:
        findings = analyze_file_diff(fd)
        if findings:
            file_results.append({"file": fd["file"], "findings": findings})
            all_findings.extend(findings)

    total_noise = len(all_findings)
    total_changes = sum(
        len([l for l in fd["lines"] if l["type"] == "change"]) for fd in file_diffs
    )
    noise_ratio = total_noise / total_changes if total_changes > 0 else 0

    verdict = "CLEAN" if noise_ratio < 0.1 else ("NOISY" if noise_ratio < 0.3 else "VERY_NOISY")

    result = {
        "status": "ok",
        "files_in_diff": len(file_diffs),
        "total_change_lines": total_changes,
        "noise_lines": total_noise,
        "noise_ratio": round(noise_ratio, 2),
        "verdict": verdict,
        "file_results": file_results,
    }

    if args.json:
        print(json.dumps(result, indent=2))
        return

    print(f"Diff Surgeon — {len(file_diffs)} files, {total_changes} changed lines")
    print(f"Noise ratio: {noise_ratio:.0%} ({total_noise} noise lines)")
    print(f"Verdict: {verdict}")
    if file_results:
        print()
        for fr in file_results:
            print(f"  {fr['file']}:")
            categories = {}
            for f in fr["findings"]:
                categories.setdefault(f["category"], []).append(f["line"])
            for cat, lines in categories.items():
                print(f"    [{cat}] {len(lines)} instance(s)")
                for l in lines[:3]:
                    print(f"      {l}")
                if len(lines) > 3:
                    print(f"      ... and {len(lines) - 3} more")
        print()
        print("Recommendation: review flagged lines. Remove changes that don't trace to your task.")
    else:
        print("\n  All changes look intentional. Clean diff.")

    sys.exit(1 if verdict != "CLEAN" else 0)


if __name__ == "__main__":
    main()
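
As a rough illustration of how diff_surgeon.py above turns findings into a verdict, the sketch below isolates two pieces: the noise-ratio bands from `main()` and the quote-normalization comparison used for quote-style swaps. The function names `noise_verdict` and `is_quote_only_change` are hypothetical; only the 10% and 30% cut-offs and the idea of replacing double quotes with single quotes before comparing come from the script.

```python
from __future__ import annotations


def noise_verdict(noise_lines: int, total_changes: int) -> str:
    """Apply the ratio bands: below 10% CLEAN, below 30% NOISY, else VERY_NOISY."""
    ratio = noise_lines / total_changes if total_changes > 0 else 0
    if ratio < 0.1:
        return "CLEAN"
    return "NOISY" if ratio < 0.3 else "VERY_NOISY"


def is_quote_only_change(deleted: str, added: str) -> bool:
    """True when a -/+ pair differs only in single versus double quotes.

    Both arguments are raw diff lines, including the leading "-" or "+".
    """
    old, new = deleted[1:].strip(), added[1:].strip()
    return old != new and old.replace('"', "'") == new.replace('"', "'")


if __name__ == "__main__":
    print(noise_verdict(1, 20))   # 5% noise
    print(noise_verdict(4, 20))   # 20% noise
    print(noise_verdict(10, 20))  # 50% noise
    print(is_quote_only_change('-name = "x"', "+name = 'x'"))
    print(is_quote_only_change("-name = 'x'", "+name = 'y'"))
```

The script exits with status 1 for any verdict other than CLEAN, so a pre-commit hook built on it would block at a noise ratio of 10% or more.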

#!/usr/bin/env python3
"""
goal_verifier.py — Check if a plan has verifiable success criteria.

Karpathy Principle #4 (Goal-Driven Execution): "Define success criteria.
Loop until verified. Don't tell it what to do — give it success criteria
and watch it go."

Reads a markdown plan and scores:
  - Does each step have a verification check?
  - Are success criteria concrete (test, assertion, measurement)?
  - Are there vague criteria ("make it work", "looks good")?
  - Is there a final verification step?

Usage:
    python goal_verifier.py plan.md
    python goal_verifier.py plan.md --json

Scoring:
    Each plan step gets 0-3 points:
      3 = concrete verification (test assertion, metric, command)
      2 = reasonable verification (manual check, visual)
      1 = vague verification ("should work", "looks right")
      0 = no verification mentioned
"""
from __future__ import annotations
import argparse
import json
import re
import sys
from pathlib import Path

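# Word-like alternatives sit inside \b...\b. Alternatives that begin or end
# with punctuation or whitespace are listed separately: a trailing \b after
# "(", "<" or a space only matches when a word character follows immediately,
# so "curl -s" or "latency < 200ms" would otherwise be missed.
CONCRETE_VERIFY = re.compile(
    r"\b(?:test\s+pass|assert|assertEqual|exit\s+code\s*[=:]\s*0|"
    r"status\s*[=:]\s*200|python.*test|npm\s+test|pytest|jest|"
    r"measure|benchmark|metric)\b"
    r"|\bexpect\(|\.toBe\b|\.toEqual\b|\b(?:curl|grep|diff)\s|"
    r"\blatency\s*<|\bthroughput\s*>",
    re.I,
)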
REASONABLE_VERIFY = re.compile(
    r"\b(?:verify|check|confirm|inspect|review|compare|validate|"
    r"run\s+and\s+see|manually|open\s+in\s+browser|visual|screenshot)\b",
    re.I,
)
VAGUE_VERIFY = re.compile(
    r"\b(?:should\s+work|looks?\s+(?:good|right|fine|ok)|"
    r"seems?\s+(?:correct|fine)|hopefully|probably\s+works?)\b",
    re.I,
)
STEP_PATTERN = re.compile(r"^(?:\d+[\.\)]\s+|[-*]\s+\[.\]\s+|[-*]\s+(?:Step\s+\d+))", re.M)
VERIFY_LABEL = re.compile(r"(?:verify|check|success\s+criteria|done\s+when|acceptance)\s*:", re.I)


def extract_steps(text):
    """Extract plan steps from markdown."""
    lines = text.splitlines()
    steps = []
    current_step = None
    current_body = []

    for line in lines:
        if STEP_PATTERN.match(line.strip()):
            if current_step:
                steps.append({"title": current_step, "body": "\n".join(current_body)})
            current_step = line.strip()
            current_body = []
        elif current_step:
            current_body.append(line)

    if current_step:
        steps.append({"title": current_step, "body": "\n".join(current_body)})

    return steps


def score_step(step):
    """Score a step's verification quality (0-3)."""
    full_text = step["title"] + "\n" + step["body"]

    if CONCRETE_VERIFY.search(full_text):
        return 3, "concrete"
    if REASONABLE_VERIFY.search(full_text):
        return 2, "reasonable"
    if VAGUE_VERIFY.search(full_text):
        return 1, "vague"
    return 0, "none"


def analyze_plan(text, source):
    """Analyze a plan for verification quality."""
    steps = extract_steps(text)

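    if not steps:
        # Include every key main() reads so the text report does not raise
        # KeyError when no steps are found.
        message = "No numbered/bulleted plan steps found. Is this a plan?"
        return {
            "status": "ok",
            "source": source,
            "steps_found": 0,
            "message": message,
            "verdict": "NO_PLAN",
            "score": 0,
            "max_score": 0,
            "percentage": 0.0,
            "has_final_verification": False,
            "step_results": [],
            "recommendations": [message],
        }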

    step_results = []
    total_score = 0
    max_score = len(steps) * 3

    for step in steps:
        pts, level = score_step(step)
        total_score += pts
        step_results.append({
            "title": step["title"][:120],
            "score": pts,
            "level": level,
            "has_verify_label": bool(VERIFY_LABEL.search(step["body"])),
        })

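    # Check for final verification (steps is non-empty here). Join title and
    # body with a newline, as score_step() does, so the last word of the title
    # cannot fuse with the first word of the body and defeat the \b anchors.
    last_full = steps[-1]["title"] + "\n" + steps[-1]["body"]
    has_final = bool(
        re.search(r"\b(?:final|end-to-end|full.*test|regression|all.*pass)\b", last_full, re.I)
    )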

    pct = (total_score / max_score * 100) if max_score > 0 else 0
    if pct >= 70:
        verdict = "STRONG"
    elif pct >= 40:
        verdict = "WEAK"
    else:
        verdict = "MISSING"

    return {
        "status": "ok",
        "source": source,
        "steps_found": len(steps),
        "score": total_score,
        "max_score": max_score,
        "percentage": round(pct, 1),
        "has_final_verification": has_final,
        "verdict": verdict,
        "step_results": step_results,
        "recommendations": _recommendations(step_results, has_final),
    }


def _recommendations(step_results, has_final):
    recs = []
    none_steps = [s for s in step_results if s["level"] == "none"]
    vague_steps = [s for s in step_results if s["level"] == "vague"]

    if none_steps:
        recs.append(f"{len(none_steps)} step(s) have no verification. Add 'verify: [check]' to each.")
    if vague_steps:
        recs.append(f"{len(vague_steps)} step(s) have vague criteria. Replace 'should work' with a concrete check.")
    if not has_final:
        recs.append("No final/end-to-end verification step. Add one at the end.")
    if not recs:
        recs.append("Plan has strong verification coverage. Good to go.")
    return recs


def main():
    p = argparse.ArgumentParser(
        description="Check if a plan has verifiable success criteria (Karpathy Principle #4).",
        epilog="Scores each step 0-3 based on verification quality.",
    )
    p.add_argument("input", nargs="?", default="-", help="Markdown plan file, or - for stdin")
    p.add_argument("--json", action="store_true", help="JSON output")
    args = p.parse_args()

    if args.input == "-":
        text = sys.stdin.read()
        source = "stdin"
    else:
        path = Path(args.input)
        if not path.exists():
            print(f"[error] {path} not found", file=sys.stderr)
            sys.exit(1)
        text = path.read_text(encoding="utf-8", errors="replace")
        source = str(path)

    result = analyze_plan(text, source)

    if args.json:
        print(json.dumps(result, indent=2))
        return

    print(f"Goal Verifier — {source}")
    print(f"Steps: {result['steps_found']}  Score: {result['score']}/{result['max_score']} ({result['percentage']}%)")
    print(f"Verdict: {result['verdict']}")
    print()

    for sr in result["step_results"]:
        icon = {"concrete": "+", "reasonable": "~", "vague": "?", "none": "!"}[sr["level"]]
        print(f"  [{icon}] {sr['title'][:100]}  ({sr['level']}, {sr['score']}/3)")

    print()
    for rec in result["recommendations"]:
        print(f"  -> {rec}")


if __name__ == "__main__":
    main()
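
To make the step detection in goal_verifier.py above more concrete, here is a small sketch that runs the script's own `STEP_PATTERN` over a made-up plan and then applies the percentage bands from `analyze_plan()`. The sample plan text and the helper names `step_titles` and `plan_verdict` are assumptions for illustration and are not part of the script.

```python
from __future__ import annotations

import re

# Copied from goal_verifier.py above.
STEP_PATTERN = re.compile(r"^(?:\d+[\.\)]\s+|[-*]\s+\[.\]\s+|[-*]\s+(?:Step\s+\d+))", re.M)

# Made-up plan text, for illustration only.
SAMPLE_PLAN = """\
1. Add retry logic to the fetch helper
   verify: pytest tests/test_fetch.py
- [ ] Update the README
- plain bullet, not treated as a step
2) Ship it
"""


def step_titles(text: str) -> list[str]:
    """Return the lines that STEP_PATTERN treats as the start of a step."""
    return [ln.strip() for ln in text.splitlines() if STEP_PATTERN.match(ln.strip())]


def plan_verdict(step_points: list[int]) -> str:
    """Apply the percentage bands; each step is worth up to 3 points.

    The NO_PLAN case (no steps at all) is handled separately in the script
    and is not modelled here.
    """
    max_score = len(step_points) * 3
    pct = sum(step_points) / max_score * 100 if max_score > 0 else 0
    if pct >= 70:
        return "STRONG"
    return "WEAK" if pct >= 40 else "MISSING"


if __name__ == "__main__":
    for title in step_titles(SAMPLE_PLAN):
        print(title)
    print(plan_verdict([3, 0, 0]))
```

Only the first step in the sample mentions a concrete check (`pytest`), so it would earn 3 of the 9 available points. Plain bullets are folded into the body of the preceding step rather than counted as steps of their own.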

Related skills

Improve Codebase Architecture: Safely deepen clusters of shallow modules into cohesive, testable units while respecting their external dependencies. (531k · 185k)

Caveman Review: Get ultra-compressed, one-line code review comments that cut noise while keeping every actionable fix. (260k · 92.5k)

Codebase Design: Shared vocabulary for designing deep modules: improve a module's interface, find deepening opportunities, decide where a seam goes, make code more testable. (233k · 185k)

Cavecrew: Delegate coding tasks to specialized subagents that return compressed output, keeping the main context window usable for much longer sessions. (210k · 92.5k)

Requesting Code Review: Dispatch a consistent, high-signal code reviewer subagent that catches plan deviations and quality issues before merging or continuing development. (178k · 260k)

Code Review: Reviews a branch or PR diff on two axes at once: conformance to coding standards plus a code-smell baseline, and whether it actually implements the original spec. (167k · 185k)

How it compares

Use karpathy-coder over generic lint skills when the goal is session-level coding discipline and anti-overcoding guardrails, not syntax rule enforcement.

FAQ

What are Karpathy's four principles in karpathy-coder?

karpathy-coder enforces four principles: surface assumptions before coding, keep it simple, make surgical changes, and define verifiable goals. Version 2.3.0 of the skill triggers while you write, review, or commit code, at the points where an LLM might overengineer the solution.

Which agents support karpathy-coder?

karpathy-coder lists claude-code, codex-cli, cursor, antigravity, opencode, and gemini-cli as compatible tools. The MIT-licensed skill runs in a fork context and activates on phrases such as "review my diff", "karpathy check", and "before I commit".

Is Karpathy Coder safe to install?

skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
