
Darwin Skill
Autonomously score, hill-climb, and validation-test changes to a SKILL.md, then pause for human sign-off before keeping or rolling back.
Overview
Darwin-skill is a journey-wide agent skill that evaluates and hill-climbs SKILL.md files with a 9-dimension rubric, blind judge agents, and human checkpoints—usable whenever a solo builder needs to improve agent skill qu
Install
npx skills add https://github.com/alchaincyf/darwin-skill --skill darwin-skill
What is this skill?
- 9-dimension rubric (structure + effectiveness + meta-skill blacklists), 100-point scale aligned with SkillLens (arXiv 26
- Validation-gated hill-climbing with git version control and ratchet-only-improvements rollback
- Independent judge sub-agents for blind scoring to avoid self-evaluation bias
- Test-prompt verification with auto-break on diminishing returns (SkillOpt-style gate)
- Human-in-the-loop checkpoints after each skill optimization cycle before continuing
- 9-dimension evaluation rubric with 100-point total score
- SkillLens paper reports 46.4% LLM-as-judge accuracy for skill quality without structured rubrics, rising to 73.8% once three meta-skill dimensions are added
- v2.0 integrates SkillLens (arXiv 2605.23899) and SkillOpt (arXiv 2605.23904) patterns
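The rubric and ratchet bullets above can be made concrete with a minimal Python sketch. It assumes the per-dimension weights as they are listed in the SKILL.md tables further down this page, together with the rule stated there: total = Σ(dimension score × weight) / 10, and a change is kept only if the total is strictly higher. The names `WEIGHTS`, `total_score`, and `should_keep` are illustrative and not part of darwin-skill. As listed, the weights sum to 99, so the sketch does not hard-code a 100-point maximum.

```python
# Illustrative sketch only: WEIGHTS, total_score and should_keep are
# hypothetical names, not darwin-skill's actual implementation.

# Weight per rubric dimension, copied as listed in the SKILL.md tables.
WEIGHTS = {1: 7, 2: 12, 3: 12, 4: 6, 5: 17, 6: 4, 7: 12, 8: 23, 9: 6}


def total_score(scores: dict[int, int]) -> float:
    """Return sum(score * weight) / 10 for judge scores on a 1-10 scale."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("expected exactly one score per rubric dimension")
    for dim, score in scores.items():
        if not 1 <= score <= 10:
            raise ValueError(f"dimension {dim}: score {score} is outside 1-10")
    return sum(scores[dim] * WEIGHTS[dim] for dim in WEIGHTS) / 10


def should_keep(before: float, after: float) -> bool:
    """Ratchet rule: keep a revision only if it scores strictly higher."""
    return after > before


# Made-up judge scores for a baseline and a revised SKILL.md.
baseline = total_score({1: 6, 2: 7, 3: 4, 4: 5, 5: 6, 6: 8, 7: 7, 8: 6, 9: 3})
revised = total_score({1: 6, 2: 7, 3: 7, 4: 5, 5: 6, 6: 8, 7: 7, 8: 6, 9: 6})
print(baseline, revised, should_keep(baseline, revised))
```

A tie is rejected by design: `should_keep(x, x)` is false, matching the "strictly higher" wording, so a revision that merely rephrases the skill without moving the score is rolled back.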
Adoption & trust: 5.6k installs on skills.sh; 3.6k GitHub stars; 1/3 security scanners passed (skills.sh audits).
What problem does it solve?
You cannot tell if a SKILL.md change actually helps agents, and LLM self-grading alone misleads you into keeping weaker instructions.
Who is it for?
Maintainers of Claude Code or Cursor skills who want research-backed scoring, independent judges, and mandatory human confirmation after each optimization pass.
Skip if: you are making one-off prompt tweaks without a SKILL.md file, optimizing application source code, or working in a team that refuses git-based rollback on skill experiments.
When should I use this skill?
Use when the user mentions 优化skill, skill评分, auto optimize, darwin, 达尔文, skill review, skill打分, or asks how to improve SKILL.md quality.
What do I get? / Deliverables
You get rubric-scored iterations, test-prompt validation, git-backed rollback, human-approved keeps, and a result card documenting what improved.
- Rubric-scored SKILL.md revision with git history
- Validation run log with pass/fail on test prompts
- Visual optimization result card for the iteration
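The validation log lends itself to a mechanical check. The SKILL.md further down this page requires fallback evaluations to be marked `dry_run` in results.tsv and raises an evaluation-invalid warning when the dry_run ratio exceeds 30%. The sketch below assumes a hypothetical layout for results.tsv (a header row with a `mode` column holding `dry_run` or `full_test`); the real column names are not given in the source, and the function names are illustrative.

```python
import csv
import io

DRY_RUN_LIMIT = 0.30  # threshold quoted in the SKILL.md


def dry_run_ratio(tsv_text: str, mode_column: str = "mode") -> float:
    """Fraction of logged rows marked `dry_run` (0.0 for an empty log)."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return 0.0
    dry = sum(1 for row in rows if row[mode_column] == "dry_run")
    return dry / len(rows)


def evaluation_is_trustworthy(tsv_text: str) -> bool:
    """False when more than 30% of logged evaluations were dry runs."""
    return dry_run_ratio(tsv_text) <= DRY_RUN_LIMIT


# Hypothetical log contents, for illustration only.
sample = (
    "skill\tmode\tscore\n"
    "a\tfull_test\t71.5\n"
    "b\tdry_run\t64.0\n"
    "c\tfull_test\t80.2\n"
)
print(dry_run_ratio(sample), evaluation_is_trustworthy(sample))
```

With one dry run out of three rows the ratio is above the limit, so this sample log would trigger the warning.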
Journey fit
Useful at every journey phase - explore requirements and options before committing to a direction.
Where it fits
Score a draft SKILL.md against the nine-dimension rubric before investing in a full agent workflow.
Run hill-climbing on triggers and checklists until test prompts pass with higher rubric scores.
Human-in-the-loop checkpoint before merging skill changes that agents will load in production.
Re-optimize a skill after user reports misfires, using ratchet rollback when validation regresses.
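The four uses above share one control loop: score the current SKILL.md, propose a change, keep it only if the score strictly improves, and stop for human approval before the result is accepted. The following is a minimal sketch of that loop under stated assumptions: `propose`, `score`, and `confirm` are hypothetical callables standing in for the editing agent, the judge sub-agents, and the human checkpoint, and `max_rounds` is an illustrative parameter. It keeps versions in memory, whereas darwin-skill is described as using git for rollback.

```python
from typing import Callable


def optimize_skill(
    text: str,
    propose: Callable[[str], str],
    score: Callable[[str], float],
    confirm: Callable[[str, str, float], bool],
    max_rounds: int = 3,
) -> str:
    """Hill-climb one SKILL.md text with a ratchet and a final human checkpoint."""
    best_text, best_score = text, score(text)
    for _ in range(max_rounds):
        candidate = propose(best_text)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            # Ratchet forward: only strict improvements replace the best version.
            best_text, best_score = candidate, candidate_score
        # Otherwise the candidate is discarded, which plays the role of a rollback.
    # Human checkpoint: nothing is accepted without explicit confirmation.
    if best_text != text and confirm(text, best_text, best_score):
        return best_text
    return text
```

Passing the judge as a callable keeps the scorer separate from the component that proposes edits, which mirrors the independent-judge idea; a rejected confirmation returns the original text unchanged.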
How it compares
Use as a structured skill evolution loop—not a quick linter pass or a single chat rewrite of your prompt.
Common Questions / FAQ
Who is darwin-skill for?
Indie builders and skill authors who maintain SKILL.md packages and want autonomous optimization with validation gates and human approval, including users searching 优化skill, skill review, or darwin.
When should I use darwin-skill?
In Idea when drafting first skills; in Build when polishing agent-tooling; in Ship before publishing a skill repo; in Operate when regression-testing skill quality—whenever you mention skill scoring, auto optimize, or 达尔文.
Is darwin-skill safe to install?
It drives git edits and sub-agent judges; review the Security Audits panel on this page and confirm repository backups before letting it hill-climb production skills.
SKILL.md
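# Darwin Skill 2.0

> **v2.0 · 2026-05-28**: absorbs the 9-dimension scoring recipe from Microsoft Research's SkillLens (arXiv 2605.23899), the validation-gated verification mechanism from SkillOpt (arXiv 2605.23904), and human in the loop, forming a three-layer gate.
>
> Borrows the autonomous experiment loop from Karpathy's autoresearch to optimize skills continuously.
> Core idea: **evaluate → improve → validate with real runs → human confirmation → keep or roll back → generate a result card**
> GitHub: https://github.com/alchaincyf/darwin-skill

---

## Design philosophy

The essence of autoresearch:

1. **Single editable asset**: change only one SKILL.md at a time
2. **Dual evaluation**: structural scoring (static analysis) plus effectiveness validation (run tests and inspect the output)
3. **Ratchet mechanism**: keep only improvements and automatically roll back regressions
4. **Independent scoring**: scoring is done by sub-agents, avoiding the bias of "grading your own edits"
5. **Human in the loop**: pause after each skill is optimized and continue only after the user confirms

How this differs from a purely structural review: it checks not only whether the SKILL.md is well-formed, but also whether **the results actually produced after the change are better**.

---

## Evaluation rubric (9 dimensions, 100 points total)

> **Design basis**: the empirical finding of the SkillLens paper (arXiv 2605.23899) that LLM-as-judge accuracy on skill quality is only 46.4% (close to random) and rises to 73.8% once three meta-skill dimensions are added. This rubric tightens the scoring criteria for dim3 and dim5, adds dim9 "Counterexamples and blacklist", and rebalances the weights to 100. **Goal: make scores more sensitive to real quality and reduce the optimism bias of LLM judges.**

### Structure dimensions (59 points): static analysis

| # | Dimension | Weight | Scoring criteria |
|---|-----------|--------|------------------|
| 1 | **Frontmatter quality** | 7 | `name` follows conventions; `description` covers what it does, when to use it, and trigger words; ≤1024 characters; **no empty closing filler such as "apply flexibly" (灵活应用) or "judge according to the situation" (根据情况判断)** |
| 2 | **Workflow clarity** | 12 | Steps are explicit and executable, numbered, and each step has clear inputs and outputs |
| 3 | **Failure-mode encoding** | 12 | **Failure modes must be encoded explicitly** (an explicit branch of the form "if X fails → Y"); has fallback paths and error recovery; **describing only the happy path with no failure branches costs ≥3 points** (SkillLens meta-skill dimension) |
| 4 | **Checkpoint design** | 6 | User confirmation before key decisions, preventing runaway autonomy; **checkpoints must be marked explicitly (🔴/STOP/CHECKPOINT); phrasing like "if ..., consider ..." alone does not count** |
| 5 | **Actionable specificity** | 17 | Unambiguous, with concrete parameters/formats/examples, directly executable; **softening phrases such as "suggest" (建议), "could consider" (可以考虑), "depending on the situation" (根据情况), "use flexible judgment" (灵活把握), or "as the case may be" (视情况而定) are forbidden**; ≥3 occurrences cost ≥3 points (SkillLens actionable specificity dimension) |
| 6 | **Resource integration** | 4 | references/scripts/assets are cited correctly and their paths are reachable |

### Effectiveness dimensions (35 points): require real testing

| # | Dimension | Weight | Scoring criteria |
|---|-----------|--------|------------------|
| 7 | **Overall architecture** | 12 | Clear structural hierarchy, nothing redundant or missing, consistent with the Huashu (花叔) ecosystem; **each redundant or AI-sounding filler passage (Huashu's banned phrases such as 说白了 "to put it bluntly", 换句话说 "in other words", 首先/其次/综上 "first/second/in summary") costs 1 point** |
| 8 | **Measured performance** | 23 | Run the test prompts once and check whether the output quality matches the capabilities the skill claims |

### Meta-skill dimension (6 points): counterexamples and blacklist

| # | Dimension | Weight | Scoring criteria |
|---|-----------|--------|------------------|
| 9 | **Counterexamples and blacklist** | 6 | **The skill must have a counterexample list of "what not to do"**; writing only "do X" with no "don't do Y" costs ≥3 points; red lights, dangerous actions, and anti-patterns should be listed in their own section (SkillLens risk-action blacklist dimension) |

### Scoring rules

- Dimensions 1-7 and 9: score each dimension 1-10, then multiply by its weight to get that dimension's score
- Dimension 8 (measured performance): run 2-3 test prompts and score output quality 1-10
- **Total = Σ(dimension score × weight) / 10**, maximum 100
- A change is kept only if the total after the change is **strictly higher** than before

### Empirical basis of the rubric

The rubric design rests on the **SkillLens paper (arXiv 2605.23899)** plus a **local controlled study**:

- SkillLens found that LLM-as-judge accuracy is only 46.4% (close to random), rising to 73.8% once three meta-skill dimensions are added
- Locally, 4 kinds of degradation were applied to huashu-research → 5 independent judges, blind-tested, unanimously rated V1>V2, mean Δ +46.5 (5/5 high confidence)

**Conclusion**: the rubric can detect gross degradation, but fine-grained quality differences remain unreliable; **important decisions must be reviewed by a human**.

→ For the detailed paper evidence, the complete data from the 5 judges, and the figures from the HL real-world case, see [references/skilllens-evidence.md](references/skilllens-evidence.md)

### About the "measured performance" dimension

This is the biggest difference from purely structural scoring. How it is scored:

1. Design 2-3 **typical user prompts** for each skill (not edge cases, but the most common usage scenarios)
2. Execute with sub-agents: one runs with the skill, one runs without it (baseline)
3. Compare output quality and score from these angles:
   - Did the output fulfil the user's intent?
   - Is the quality clearly better than the no-skill baseline?
   - Did the skill introduce negative effects (excess verbosity, drifting off-topic, odd formatting)?

If sub-agents are unavailable (timeout/resource limits), fall back to "dry-run validation": after reading the skill, simulate the execution approach for one typical prompt and judge whether the flow is reasonable; this must be marked `dry_run` in results.tsv. **dry_run ratio > 30% → evaluation-invalid warning** (from the local controlled study: dim8, the measured-performance dimension, carries 23% of the weight, so scores are not trustworthy without full_test validation).

---

## Runtime compatibility review (a gate item, independent of the 9-dimension score)

A skill should work across 50+ skills-compatible runtimes such as Claude Code / Codex / Cursor / OpenClaw / Hermes / Gemini CLI / OpenCode; otherwise, when other agents parse it, phrases like "in Claude Code" (在 Claude Code 里) or "Claude Code skill" lead them to misjudge it as "not meant for me" and refuse to install it outright (example: because of this, nuwa-skill was, by Ma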