Ansh Dawda contributor

Proofrag

Install when you ship a RAG or LLM app and need golden sets from your docs, LLM-as-judge scoring, retrieval metrics, shareable scorecards, and CI gates.

Overview

proofrag is a plugin marketplace for the Ship phase that evaluates RAG/LLM apps with golden sets, LLM-as-judge, retrieval metrics, scorecards, and CI gates.

What is this marketplace?

  • Golden test sets generated from your own documentation corpus
  • LLM-as-judge evaluation plus retrieval metrics in one skill bundle
  • Shareable scorecard output for stakeholders and iteration reviews
  • CI gate support so regressions fail builds
  • Keywords: rag, llm-as-judge, evaluation, retrieval metrics
  • 1 plugin: proofrag, version 0.5.2, MIT license
  • Marketplace metadata describes golden sets, LLM-as-judge, and scorecards
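
For readers new to the term, "retrieval metrics" usually means scores such as recall@k and mean reciprocal rank computed against a golden set. The listing does not say which metrics proofrag reports, so the sketch below is only an illustration: the function names, the chunk IDs, and the golden-set shape are assumptions, not the plugin's API.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical golden-set entries: what the retriever returned (in rank order)
# versus the chunks a human marked as relevant for that question.
golden = [
    {"retrieved": ["c7", "c2", "c9"], "relevant": {"c2"}},
    {"retrieved": ["c1", "c4", "c5"], "relevant": {"c4", "c8"}},
]

mean_recall = sum(recall_at_k(g["retrieved"], g["relevant"], 3) for g in golden) / len(golden)
mrr = sum(reciprocal_rank(g["retrieved"], g["relevant"]) for g in golden) / len(golden)
print(f"recall@3={mean_recall:.2f} MRR={mrr:.2f}")  # recall@3=0.75 MRR=0.50
```

Recall@k rewards finding every relevant chunk, while reciprocal rank rewards putting the first relevant chunk near the top, so the two can move independently after a chunking or embedder change.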

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Community signal: 1 GitHub star.

What problem does it solve?

RAG demos look fine in chat but nobody measures grounded accuracy, retrieval drift, or regressions before merge.

Who is it for?

Indie builders shipping doc-QA, support bots, or internal copilots who need repeatable evals, not one-off manual spot checks.

Skip if: Static sites with no LLM or retrieval layer, or teams unwilling to maintain golden questions tied to their docs.

What do I get? / Deliverables

After install, you get doc-derived golden sets, judged scores, retrieval metrics, a shareable scorecard, and optional CI failure on quality drops.

  • Golden evaluation set derived from your docs
  • LLM-as-judge and retrieval metric results
  • Shareable scorecard and optional CI gate signal

Plugins in this marketplace

1 plugin — install individually after you add the marketplace.

Journey fit

Primary fit

RAG quality proof belongs on the ship/testing shelf because you run it in CI and before trusting production answers, not while brainstorming positioning. Golden-set evaluation, judge models, and retrieval metrics are test-harness concerns for LLM apps, which matches the ship → testing placement.

How it compares

Proofrag is a RAG/LLM evaluation skill marketplace, not a vector DB or embedding-provider integration.

Common Questions / FAQ

Who is Proofrag for?

It is for builders running RAG or LLM apps who want golden-set evaluation, LLM-as-judge, and CI gates inside Claude Code or another compatible agent.

When should I use Proofrag?

Use it before release and after chunking, embedder, or model changes to catch retrieval and answer-quality regressions.
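
One way to picture such a regression check is a gate that compares the current run's scores against a stored baseline and reports any metric that dropped by more than a tolerance. This is a sketch under assumptions, not proofrag's gate: the metric names, the numbers, and the 0.02 tolerance are made up for illustration.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,
) -> List[str]:
    """List every baseline metric that is missing or dropped by more than `tolerance`."""
    failures = []
    for metric, base in baseline.items():
        now = current.get(metric)
        if now is None:
            failures.append(f"{metric}: missing from current run")
        elif now < base - tolerance:
            failures.append(f"{metric}: {base:.2f} -> {now:.2f}")
    return failures


# Hypothetical scores: baseline from the last release, current from this branch
# after, say, a chunking change.
baseline = {"recall_at_5": 0.82, "judge_score": 4.10}
current = {"recall_at_5": 0.74, "judge_score": 4.12}

failures = find_regressions(baseline, current)
exit_code = 1 if failures else 0  # a CI step would pass this to sys.exit()
for line in failures:
    print("REGRESSION", line)  # REGRESSION recall_at_5: 0.82 -> 0.74
```

A small tolerance avoids failing builds on judge-score noise, while a nonzero exit code is what lets a CI runner block the merge.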

How do I add Proofrag to my agent?

Add the unshDee/Proofrag Claude marketplace, enable the Proofrag plugin (MIT v0.5.2), and point it at your app and documentation for golden-set generation.
