
ForgeJudge
Hook autonomous coding agents into an open eval leaderboard and CI gate for solving tasks, scoring runs, and tracing execution.
Overview
ForgeJudge is an MCP server for the ship phase that connects agents to an open eval leaderboard and CI gate for solving, scoring, and tracing autonomous coding runs.
What is this MCP server?
- Open eval leaderboard for comparing autonomous coding agent runs
- CI gate pattern: block merges or releases when agent evals fail thresholds
- Runs as a stdio MCP server, launched with uvx from the PyPI package forgejudge[mcp]
- Traces agent solve paths for post-hoc debugging of failures
- Public site forgejudge.ahmedhobeishy.tech for leaderboard context
- Registry server version 0.1.1; PyPI MCP package identifier version 0.1.0
- Transport: stdio (registry runtimeHint: uvx)
What problem does it solve?
You cannot ship an agent workflow with confidence when you lack standardized solve-score-trace benchmarks and an automated gate that catches regressions.
Who is it for?
Solo builders iterating on coding agents who want a PyPI-installed stdio MCP server plus public leaderboard results to point to before tagging releases.
Skip if: you are a pure application team with no autonomous agent component, or you only need linting without agent-task benchmarks.
What do I get? / Deliverables
After installation, your agent can drive ForgeJudge evals over MCP, and you can enforce CI thresholds using comparable leaderboard scores and traces.
- MCP-accessible solve, score, and trace workflows for coding agents
- Comparable runs suitable for an open eval leaderboard
- CI-oriented gating signals for agent regression control
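The CI-gate idea above can be sketched in a few lines of Python. This is an illustration of the pattern only, not ForgeJudge's API: the EvalRun type, the gate function, the assumption that scores are normalised to the range 0 to 1, and both threshold values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EvalRun:
    """One scored agent run (hypothetical shape, not a ForgeJudge type)."""
    task_id: str
    score: float  # assumed to be normalised to [0, 1]


def gate(runs: List[EvalRun],
         min_mean: float = 0.8,
         min_task: float = 0.5) -> Tuple[bool, List[str]]:
    """Return (passed, reasons); thresholds are illustrative defaults."""
    if not runs:
        # An empty suite should fail closed rather than pass silently.
        return False, ["no eval runs supplied"]

    reasons = []
    mean = sum(r.score for r in runs) / len(runs)
    if mean < min_mean:
        reasons.append(f"mean score {mean:.2f} is below {min_mean:.2f}")
    for r in runs:
        if r.score < min_task:
            reasons.append(
                f"task {r.task_id} scored {r.score:.2f}, below {min_task:.2f}"
            )
    return not reasons, reasons


def exit_code(runs: List[EvalRun]) -> int:
    """Map the gate result to a process exit code for a CI step."""
    passed, reasons = gate(runs)
    for reason in reasons:
        print(f"GATE FAIL: {reason}")
    return 0 if passed else 1
```

In a CI job, a script would load the scored runs, call exit_code, and pass the result to sys.exit so that a non-zero status blocks the merge or release.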
Journey fit
Benchmarking and gating agent behavior belongs in the ship phase, where you prove that the product, or the agent stack, meets quality bars before release. Solve-score-trace evals and CI gates are testing and verification mechanics, not initial build or distribution work.
How it compares
ForgeJudge is an agent-eval and CI-gate MCP integration, not a generic unit-test runner skill or a hosting marketplace.
Common Questions / FAQ
Who is ForgeJudge for?
Builders of autonomous coding agents who need scored benchmarks, traces, and CI-friendly gates exposed through MCP.
When should I use ForgeJudge?
Use it during the ship and testing phases, when you compare agent versions or block deploys until eval suites pass.
How do I add ForgeJudge to my agent?
In Claude Code, Cursor, or another MCP client, configure a stdio MCP server launched with uvx, following the registry runtimeArguments (package forgejudge[mcp], module forgejudge.mcp.server).
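As a rough sketch of what that configuration might look like, the Python snippet below prints a client config in the mcpServers JSON shape used by several MCP clients. The exact argument list is an assumption pieced together from the package and module names above, and the server name "forgejudge" is arbitrary. Check the registry entry's runtimeArguments for the authoritative values before relying on it.

```python
import json


def forgejudge_mcp_config(server_name: str = "forgejudge") -> dict:
    """Build a stdio MCP client entry that launches the server with uvx.

    ASSUMPTION: the argument list is inferred from the package name
    forgejudge[mcp] and the module forgejudge.mcp.server; the registry's
    runtimeArguments may differ.
    """
    return {
        "mcpServers": {
            server_name: {
                "command": "uvx",
                "args": [
                    "--from", "forgejudge[mcp]",  # PyPI package with the mcp extra
                    "python", "-m", "forgejudge.mcp.server",  # assumed entry point
                ],
            }
        }
    }


# Print JSON that can be pasted into the client's MCP config file.
print(json.dumps(forgejudge_mcp_config(), indent=2))
```

The file the JSON belongs in depends on the client, so consult that client's MCP documentation for the location.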