
EvalView MCP
Catch agent regressions before deploy by comparing live runs to golden baselines in CI.
Overview
EvalView MCP is an MCP server for the Ship phase that runs golden-baseline regression tests on AI agents inside CI/CD.
What is this MCP server?
- Golden baseline regression runs for multi-framework agent stacks
- CI/CD-friendly evaluation alongside LangGraph, CrewAI, OpenAI, and Claude agents
- Deterministic tool and sequence checks without requiring an LLM judge
- Optional LLM-as-judge scoring of output quality when OPENAI_API_KEY is set; deterministic-only eval paths run without it
- Distributed on PyPI as evalview, server version 0.6.0
- stdio transport for local MCP wiring
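To make the idea of a deterministic tool and sequence check concrete, here is a minimal sketch in Python. It is not EvalView's implementation or API: the `ToolCall` record and the `compare_tool_sequence` helper are hypothetical names chosen for illustration, and a real baseline would likely carry more detail than a tool name and arguments. The point is only that this kind of comparison needs no LLM judge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One tool invocation recorded during an agent run (hypothetical shape)."""
    name: str
    args: tuple = ()  # simplified: arguments stored as a tuple


def compare_tool_sequence(golden, live):
    """Return a list of differences; an empty list means the live run matches."""
    diffs = []
    # Compare the calls position by position, in order.
    for i, (expected, actual) in enumerate(zip(golden, live)):
        if expected.name != actual.name:
            diffs.append(
                f"step {i}: expected tool {expected.name!r}, got {actual.name!r}"
            )
        elif expected.args != actual.args:
            diffs.append(f"step {i}: {expected.name!r} called with different arguments")
    # zip() stops at the shorter sequence, so report extra or missing calls here.
    if len(golden) != len(live):
        diffs.append(
            f"length mismatch: expected {len(golden)} calls, got {len(live)}"
        )
    return diffs


# Illustrative data only: a stored golden run and a live run that drifted.
golden = [ToolCall("search", ("flights",)), ToolCall("book", ("AA100",))]
live = [ToolCall("search", ("flights",)), ToolCall("cancel", ("AA100",))]
print(compare_tool_sequence(golden, live))
```

Because the check is a plain equality comparison, the same inputs always produce the same verdict, which is what makes it usable as a CI gate.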
Community signal: 114 GitHub stars.
What problem does it solve?
Agent prompts and tool graphs change constantly, and without baselines you only notice broken behavior after users do.
Who is it for?
Indie builders shipping LangGraph, CrewAI, or Claude Code agents who want lightweight eval gates without building a custom harness.
Skip if: your team only needs one-off manual chat tests or does not run agents in automated pipelines.
What do I get? / Deliverables
You get repeatable pass-or-fail regression signals in CI so agent updates ship only when behavior matches known-good runs.
- Repeatable regression runs against stored golden baselines
- CI-friendly pass or fail signals on agent tool and sequence behavior
- Optional judged scores on model output quality
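A pass-or-fail signal in CI usually comes down to a process exit code: zero lets the pipeline continue, non-zero blocks the deploy. The sketch below shows that mapping under assumed inputs. The `gate` function and the test names are hypothetical, and EvalView's own output format may differ.

```python
import sys


def gate(results):
    """Map per-test results to an exit code: 0 if every test passed, else 1."""
    failed = [name for name, passed in results.items() if not passed]
    for name in failed:
        # Report each regression on stderr so it shows up in the CI log.
        print(f"REGRESSION: {name}", file=sys.stderr)
    return 1 if failed else 0


# Illustrative results only: test name -> whether it matched its baseline.
results = {"refund_flow": True, "booking_flow": False}
exit_code = gate(results)
print(exit_code)

# In a real CI step, the script would end with: sys.exit(exit_code)
```

Most CI systems treat any non-zero exit code from a step as a failure, so no extra integration is needed beyond running the script.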
How it compares
EvalView MCP is an MCP integration for agent regression testing, not a general unit-test runner or a prompt-writing skill.
Common Questions / FAQ
Who is EvalView MCP for?
Solo developers and small teams building AI agents who need golden-baseline regression checks tied to CI/CD.
When should I use EvalView MCP?
Use it in the Ship and testing phase whenever you change prompts, tools, or orchestration and need to prove behavior still matches baselines.
How do I add EvalView MCP to my agent?
Install the evalview PyPI package, register the stdio MCP server in Claude Code or your host, and optionally set OPENAI_API_KEY for LLM-as-judge scoring.
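As a rough sketch of the registration step, the Python snippet below builds a server entry in the `mcpServers` layout that several MCP hosts, including Claude Code's `.mcp.json`, accept for stdio servers. The launch command and arguments are assumptions, not taken from the EvalView documentation: `LAUNCH_COMMAND` assumes the package installs a console script named `evalview`, and `LAUNCH_ARGS` is left empty as a placeholder to fill in from the project's own instructions. The `mcp_server_entry` helper is hypothetical.

```python
import json

# Assumptions: check the EvalView docs for the real launch command and arguments.
LAUNCH_COMMAND = "evalview"  # assumed console-script name
LAUNCH_ARGS = []             # placeholder: server subcommand or flags, if any


def mcp_server_entry(command, args, openai_api_key=None):
    """Build one stdio server entry; the API key is only added when provided."""
    entry = {"command": command, "args": list(args)}
    if openai_api_key:
        # Optional: only needed for LLM-as-judge scoring.
        entry["env"] = {"OPENAI_API_KEY": openai_api_key}
    return entry


config = {"mcpServers": {"evalview": mcp_server_entry(LAUNCH_COMMAND, LAUNCH_ARGS)}}
print(json.dumps(config, indent=2))
```

Leaving out the `env` entry keeps the setup on the deterministic-only path. Passing a key to `mcp_server_entry` adds it for judged scoring, though in practice a secret is better injected from the environment than written into a config file.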