Senior Prompt Engineer

Name: Senior Prompt Engineer
Author: alirezarezvani

alirezarezvani/claude-skills

903 installs
23.5k repo stars
Updated July 17, 2026
alirezarezvani/claude-skills

senior-prompt-engineer is an agent skill that systematically optimizes prompts, designs few-shot examples, evaluates LLM outputs, implements RAG pipelines, and architects reliable agentic workflows for developers buildin

About

senior-prompt-engineer is a Claude skill for prompt engineering patterns, LLM evaluation frameworks, and agentic system design. It activates when developers ask to optimize prompts, design templates, evaluate LLM outputs, build agentic systems, implement RAG, create few-shot examples, analyze token usage, or design AI workflows. The skill bundles dedicated tooling areas including a Prompt Optimizer, RAG Evaluator, and Agent Orchestration guidance for structured output design. Developers reach for senior-prompt-engineer when moving from ad-hoc prompts to measurable, repeatable LLM pipelines with evaluation gates. It spans few-shot example design, retrieval evaluation, token budgeting, and multi-step agent architecture decisions in one workflow-oriented skill.

Prompt Optimization Workflow with analysis and iterative refinement
RAG Evaluator that scores context relevance and answer quality
Agent Orchestrator for visualizing and debugging multi-step agent flows
Structured Output Design templates for consistent JSON results
Few-Shot Example Design workflow with 7 common patterns

Senior Prompt Engineer by the numbers

903 all-time installs (skills.sh)
+3 installs in the week ending Jul 29, 2026 (Skillselion tracking)
Ranked #1,173 of 16,565 AI & Agent Building skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 31, 2026 (Skillselion catalog sync)

npx skills add https://github.com/alirezarezvani/claude-skills --skill senior-prompt-engineer

Add your badge

Show developers this skill is listed on Skillselion. Paste this into your README.

[![Listed on Skillselion](https://skillselion.com/badge/skills/alirezarezvani/claude-skills/senior-prompt-engineer.svg)](https://skillselion.com/skills/alirezarezvani/claude-skills/senior-prompt-engineer)

Installs	903
repo stars	★ 23.5k
Security audit	2 / 3 scanners passed
Last updated	July 17, 2026
Repository	alirezarezvani/claude-skills ↗

How do you evaluate and optimize RAG prompt pipelines?

Systematically optimize prompts, design few-shot examples, evaluate LLM outputs, implement RAG pipelines, and architect reliable agentic workflows.

Who is it for?

Developers shipping LLM-powered features who need structured prompt optimization, RAG evaluation, and agent architecture patterns beyond trial-and-error prompting.

Skip if: Teams with no LLM integration who only need conventional backend APIs without retrieval or agent orchestration.

When should I use this skill?

User asks to optimize prompts, design few-shot examples, evaluate LLM outputs, implement RAG, or architect AI agent workflows.

What you get

Optimized prompt templates, few-shot example sets, RAG evaluation reports, token usage analyses, and agentic workflow architectures.

Optimized prompt templates
RAG evaluation report
Agent workflow architecture

By the numbers

Bundles 3 dedicated tooling areas: Prompt Optimizer, RAG Evaluator, and Agent Orchestration guidance

Files

SKILL.mdMarkdownGitHub ↗

Senior Prompt Engineer

Eval-driven prompt engineering, RAG quality measurement, and agent workflow validation. Everything here is model-agnostic by design: techniques are framed by what they do, not by which model generation they were observed on, and the tools never hardcode model IDs or pricing — you supply your provider's current rates when you want dollar figures.

Operating Rules

1. Never change a prompt without a baseline. Capture metrics first (--analyze --output baseline.json), then compare every iteration against it. 2. Eval set before optimization. 10–20 representative cases with expected outputs minimum. If the user has no eval set, build one with them before touching the prompt — optimizing against vibes is the #1 failure mode. 3. Prefer platform features over prompt hacks. If the provider offers native structured outputs / JSON schema enforcement, tool-use APIs, or prompt caching, use those instead of "respond ONLY with JSON" incantations. Prompt-level format enforcement is the fallback, not the default. 4. Current-generation models need less scaffolding. Don't add chain-of-thought boilerplate, role framing, or few-shot examples reflexively — frontier models often do worse with redundant scaffolding. Add each element only when the eval set shows it helps. 5. Cost numbers are always user-supplied. Look up the provider's current per-Mtok pricing and pass it via --price-per-mtok (never trust a cached price table — including any you remember).

Tools (exact CLIs, all stdlib)

1. Prompt Optimizer — `scripts/prompt_optimizer.py`

Static analysis: token estimate, clarity/structure scores (0–100), ambiguity + redundancy detection, few-shot example extraction.

# Full analysis (human-readable report)
python3 scripts/prompt_optimizer.py prompt.txt --analyze

# Save machine-readable baseline for later comparison
python3 scripts/prompt_optimizer.py prompt.txt --analyze --json --output baseline.json

# Token estimate; cost only if you supply your provider's current rate
python3 scripts/prompt_optimizer.py prompt.txt --tokens --model claude --price-per-mtok 3.00

# Whitespace/redundancy-trimmed version
python3 scripts/prompt_optimizer.py prompt.txt --optimize --output optimized.txt

# Extract Input/Output few-shot pairs to JSON
python3 scripts/prompt_optimizer.py prompt.txt --extract-examples --output examples.json

# Compare a revision against the saved baseline
python3 scripts/prompt_optimizer.py optimized.txt --analyze --compare baseline.json

--model accepts any string; only the tokenizer family is inferred (names containing "claude" → 3.5 chars/token, otherwise 4.0). Exit 0 on success, 1 on missing file.

2. RAG Evaluator — `scripts/rag_evaluator.py`

Measures retrieval and grounding quality from two JSON files (formats printed in --help).

python3 scripts/rag_evaluator.py --contexts retrieved.json --questions eval_set.json
python3 scripts/rag_evaluator.py --contexts ctx.json --questions q.json --k 10 --json
python3 scripts/rag_evaluator.py --contexts ctx.json --questions q.json --output report.json --verbose
python3 scripts/rag_evaluator.py --contexts ctx.json --questions q.json --compare baseline_report.json

Reports context relevance, precision@k, coverage, answer faithfulness, groundedness. Treat relevance < 0.80 as a retrieval problem (chunking/embedding/filtering), not a prompt problem — fix retrieval before rewriting the generation prompt.

3. Agent Orchestrator — `scripts/agent_orchestrator.py`

Validates agent configs (YAML/JSON): tool wiring, missing required config, loop risk, token estimates.

python3 scripts/agent_orchestrator.py agent.yaml --validate
python3 scripts/agent_orchestrator.py agent.yaml --visualize --format mermaid
python3 scripts/agent_orchestrator.py agent.yaml --estimate-cost --runs 100 \
    --input-price-per-mtok 3.00 --output-price-per-mtok 15.00

Without the two price flags, --estimate-cost reports token estimates only. The model: field in the config is informational — any model name is accepted.

Workflows

Prompt Optimization (eval-gated)

1. Baseline: python3 scripts/prompt_optimizer.py current_prompt.txt --analyze --json --output baseline.json 2. Diagnose from the report: ambiguous verbs ("analyze", "handle"), redundant blocks, missing output contract, token waste. 3. Apply one change at a time, in this order of leverage:

Symptom	Fix
Malformed/unparseable output	Native structured outputs / JSON schema if the API supports it; explicit schema-in-prompt otherwise
Inconsistent answers across runs	Tighten instructions + add 2–3 contrastive examples (one near-miss showing what NOT to do)
Misses edge cases	Enumerate the edge cases explicitly; add a "when uncertain, do X" rule
Token bloat on repeated calls	Move stable prefix (system rules, examples) first so prompt caching applies; trim redundancy
Wrong reasoning on hard cases	Ask for stepwise reasoning in a scratch field the consumer ignores, or use the provider's extended-thinking mode

4. Re-analyze and compare: python3 scripts/prompt_optimizer.py revised.txt --analyze --compare baseline.json 5. Eval gate (must pass before shipping): run the revised prompt over the eval set, write per-case pass/fail to eval_results.json, then assert:

   python3 scripts/prompt_optimizer.py revised.txt --analyze --json --output revised.json \
     && python3 -c "
   import json, sys
   r = json.load(open('revised.json')); b = json.load(open('baseline.json'))
   ok = r['clarity_score'] >= b['clarity_score'] and r['token_count'] <= b['token_count'] * 1.10
   sys.exit(0 if ok else 1)"
   echo "gate exit=$?"   # 0 = ship; 1 = regression, iterate again

Pair this structural gate with your task-level eval: the revision must not lose any previously-passing eval case (no-regression rule).

Few-Shot Example Design

1. Define the task contract first (input shape, output shape, edge-case policy). 2. Start with zero examples and measure — current models often need none. Add examples only for failure clusters the eval reveals. 3. When adding: 3–5 max, ordered simple → edge → negative (what NOT to extract), formatted identically to the real output contract. 4. Validate consistency: python3 scripts/prompt_optimizer.py prompt_with_examples.txt --extract-examples --output examples.json and inspect that every extracted pair parses against your schema. 5. Re-run the eval set; if a case passes only because it resembles an example, add a held-out variant to the eval set.

Structured Output Design

1. Write the JSON Schema first (types, enums, required, maxLength). 2. Prefer API-native enforcement: structured-outputs / response-schema / tool-call parameters guarantee shape; prompt text cannot. 3. Fallback (API without schema support): include the schema rendered as field-by-field rules + one valid example, and instruct "output only the JSON object". 4. Gate: pipe 10 eval outputs through a schema validator (python3 -c "import json,sys; [json.loads(l) for l in sys.stdin]" at minimum); 10/10 must parse, else return to step 2.

RAG Tuning Loop

1. Build questions.json (id, question, reference answer) and capture current retrievals to contexts.json. 2. python3 scripts/rag_evaluator.py --contexts contexts.json --questions questions.json --output rag_baseline.json 3. Fix the lowest metric first: relevance → chunking/embeddings/metadata filters; faithfulness → grounding instructions + "answer only from context" + citation requirement; coverage → retrieval k / query expansion. 4. Gate: python3 scripts/rag_evaluator.py --contexts new_contexts.json --questions questions.json --compare rag_baseline.json — every metric must be ≥ baseline; any regression blocks the change.

Agent Config Review

1. python3 scripts/agent_orchestrator.py agent.yaml --validate — must exit with VALIDATION PASSED; fix every error and warning (missing tool config, unbounded iterations, loop risk). 2. Check context discipline: each tool description ≤ 1–2 sentences, tool count minimal for the job, stable system prompt placed first (cache-friendly), iteration cap + early-exit condition present. 3. Budget: --estimate-cost --runs N with your current prices; if cost/run exceeds budget, cut tools or context before downgrading the model.

References

File	Contains	Load when user asks about
`references/prompt_engineering_patterns.md`	10 prompt patterns with input/output examples	"which pattern?", few-shot design, decomposition, meta-prompting
`references/llm_evaluation_frameworks.md`	Eval metrics, scoring methods, A/B testing	"how to evaluate?", "measure quality", "compare prompts"
`references/agentic_system_design.md`	Agent architectures (ReAct, Plan-Execute, Tool Use)	"build agent", "tool calling", "multi-agent"

Related Skills

engineering-team/skills/senior-ml-engineer — model deployment and serving (this skill stops at the prompt/eval layer)
engineering/rag-architect — RAG system architecture (this skill measures RAG quality; that one designs the pipeline)
engineering/agent-designer — full agent system design (this skill validates configs; that one designs the architecture)

Agentic System Design

Agent architectures, tool use patterns, and multi-agent orchestration with pseudocode.

Architectures Index

1. ReAct Pattern 2. Plan-and-Execute 3. Tool Use / Function Calling 4. Multi-Agent Collaboration 5. Memory and State Management 6. Agent Design Patterns

---

1. ReAct Pattern

Reasoning + Acting: The agent alternates between thinking about what to do and taking actions.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                        ReAct Loop                           │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐ │
│   │ Thought │───▶│ Action  │───▶│  Tool   │───▶│Observat.│ │
│   └─────────┘    └─────────┘    └─────────┘    └────┬────┘ │
│        ▲                                            │       │
│        └────────────────────────────────────────────┘       │
│                         (loop until done)                   │
└─────────────────────────────────────────────────────────────┘

Pseudocode

def react_agent(query, tools, max_iterations=10):
    """
    ReAct agent implementation.

    Args:
        query: User question
        tools: Dict of available tools {name: function}
        max_iterations: Safety limit
    """
    context = f"Question: {query}\n"

    for i in range(max_iterations):
        # Generate thought and action
        response = llm.generate(
            REACT_PROMPT.format(
                tools=format_tools(tools),
                context=context
            )
        )

        # Parse response
        thought = extract_thought(response)
        action = extract_action(response)

        context += f"Thought: {thought}\n"

        # Check for final answer
        if action.name == "finish":
            return action.argument

        # Execute tool
        if action.name in tools:
            observation = tools[action.name](action.argument)
            context += f"Action: {action.name}({action.argument})\n"
            context += f"Observation: {observation}\n"
        else:
            context += f"Error: Unknown tool {action.name}\n"

    return "Max iterations reached"

Prompt Template

You are a helpful assistant that can use tools to answer questions.

Available tools:
{tools}

Answer format:
Thought: [your reasoning about what to do next]
Action: [tool_name(argument)] OR finish(final_answer)

{context}

Continue:

When to Use

Scenario	ReAct Fit
Simple Q&A with lookup	Good
Multi-step research	Good
Math calculations	Good
Creative writing	Poor
Real-time conversation	Poor

---

2. Plan-and-Execute

Two-phase approach: First create a plan, then execute each step.

Architecture

┌──────────────────────────────────────────────────────────────┐
│                     Plan-and-Execute                         │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  Phase 1: Planning                                           │
│  ┌──────────┐    ┌──────────────────────────────────────┐   │
│  │  Query   │───▶│  Generate step-by-step plan          │   │
│  └──────────┘    └──────────────────────────────────────┘   │
│                              │                               │
│                              ▼                               │
│                  ┌──────────────────────┐                    │
│                  │ Plan: [S1, S2, S3]   │                    │
│                  └──────────┬───────────┘                    │
│                             │                                │
│  Phase 2: Execution         │                                │
│                  ┌──────────▼───────────┐                    │
│                  │   Execute Step 1     │                    │
│                  └──────────┬───────────┘                    │
│                             │                                │
│                  ┌──────────▼───────────┐                    │
│                  │   Execute Step 2     │──▶ Replan?         │
│                  └──────────┬───────────┘                    │
│                             │                                │
│                  ┌──────────▼───────────┐                    │
│                  │   Execute Step 3     │                    │
│                  └──────────┬───────────┘                    │
│                             │                                │
│                  ┌──────────▼───────────┐                    │
│                  │    Final Answer      │                    │
│                  └──────────────────────┘                    │
└──────────────────────────────────────────────────────────────┘

Pseudocode

def plan_and_execute(query, tools):
    """
    Plan-and-Execute agent.

    Separates planning from execution for complex tasks.
    """
    # Phase 1: Generate plan
    plan = generate_plan(query)

    results = []

    # Phase 2: Execute each step
    for i, step in enumerate(plan.steps):
        # Execute step
        result = execute_step(step, tools, results)
        results.append(result)

        # Optional: Check if replanning needed
        if should_replan(step, result, plan):
            remaining_steps = plan.steps[i+1:]
            new_plan = replan(query, results, remaining_steps)
            plan.steps = plan.steps[:i+1] + new_plan.steps

    # Synthesize final answer
    return synthesize_answer(query, results)


def generate_plan(query):
    """Generate execution plan from query."""
    prompt = f"""
    Create a step-by-step plan to answer this question:
    {query}

    Format each step as:
    Step N: [action description]

    Keep the plan concise (3-7 steps).
    """
    response = llm.generate(prompt)
    return parse_plan(response)


def execute_step(step, tools, previous_results):
    """Execute a single step using available tools."""
    prompt = f"""
    Execute this step: {step.description}

    Previous results:
    {format_results(previous_results)}

    Available tools: {format_tools(tools)}

    Provide the result of this step.
    """
    return llm.generate(prompt)

When to Use

Task Complexity	Recommendation
Simple (1-2 steps)	Use ReAct
Medium (3-5 steps)	Plan-and-Execute
Complex (6+ steps)	Plan-and-Execute with replanning
Highly dynamic	ReAct with adaptive planning

---

3. Tool Use / Function Calling

Structured tool invocation: LLM generates structured calls that are executed externally.

Tool Definition Schema

{
  "name": "search_web",
  "description": "Search the web for current information",
  "parameters": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "Search query"
      },
      "num_results": {
        "type": "integer",
        "default": 5,
        "description": "Number of results to return"
      }
    },
    "required": ["query"]
  }
}

Implementation Pattern

class ToolRegistry:
    """Registry for agent tools."""

    def __init__(self):
        self.tools = {}

    def register(self, name, func, schema):
        """Register a tool with its schema."""
        self.tools[name] = {
            "function": func,
            "schema": schema
        }

    def get_schemas(self):
        """Get all tool schemas for LLM."""
        return [t["schema"] for t in self.tools.values()]

    def execute(self, name, arguments):
        """Execute a tool by name."""
        if name not in self.tools:
            raise ValueError(f"Unknown tool: {name}")

        func = self.tools[name]["function"]
        return func(**arguments)


def tool_use_agent(query, registry):
    """Agent with function calling."""
    messages = [{"role": "user", "content": query}]

    while True:
        # Call LLM with tools
        response = llm.chat(
            messages=messages,
            tools=registry.get_schemas(),
            tool_choice="auto"
        )

        # Check if done
        if response.finish_reason == "stop":
            return response.content

        # Execute tool calls
        if response.tool_calls:
            for call in response.tool_calls:
                result = registry.execute(
                    call.function.name,
                    json.loads(call.function.arguments)
                )
                messages.append({
                    "role": "tool",
                    "tool_call_id": call.id,
                    "content": str(result)
                })

Tool Design Best Practices

Practice	Example
Clear descriptions	"Search web for query" not "search"
Type hints	Use JSON Schema types
Default values	Provide sensible defaults
Error handling	Return error messages, not exceptions
Idempotency	Same input = same output

---

4. Multi-Agent Collaboration

Orchestration Patterns

Pattern 1: Sequential Pipeline

Agent A → Agent B → Agent C → Output

Use case: Research → Analysis → Writing

Pattern 2: Hierarchical

        ┌─────────────┐
        │ Coordinator │
        └──────┬──────┘
    ┌──────────┼──────────┐
    ▼          ▼          ▼
┌───────┐ ┌───────┐ ┌───────┐
│Agent A│ │Agent B│ │Agent C│
└───────┘ └───────┘ └───────┘

Use case: Complex task decomposition

Pattern 3: Debate/Consensus

┌───────┐     ┌───────┐
│Agent A│◄───▶│Agent B│
└───┬───┘     └───┬───┘
    │             │
    └──────┬──────┘
           ▼
    ┌─────────────┐
    │   Arbiter   │
    └─────────────┘

Use case: Critical decisions, fact-checking

Pseudocode: Hierarchical Multi-Agent

class CoordinatorAgent:
    """Coordinates multiple specialized agents."""

    def __init__(self, agents):
        self.agents = agents  # Dict[str, Agent]

    def process(self, query):
        # Decompose task
        subtasks = self.decompose(query)

        # Assign to agents
        results = {}
        for subtask in subtasks:
            agent_name = self.select_agent(subtask)
            result = self.agents[agent_name].execute(subtask)
            results[subtask.id] = result

        # Synthesize
        return self.synthesize(query, results)

    def decompose(self, query):
        """Break query into subtasks."""
        prompt = f"""
        Break this task into subtasks for specialized agents:

        Task: {query}

        Available agents:
        - researcher: Gathers information
        - analyst: Analyzes data
        - writer: Produces content

        Format:
        1. [agent]: [subtask description]
        """
        response = llm.generate(prompt)
        return parse_subtasks(response)

    def select_agent(self, subtask):
        """Select best agent for subtask."""
        return subtask.assigned_agent

    def synthesize(self, query, results):
        """Combine agent results into final answer."""
        prompt = f"""
        Combine these results to answer: {query}

        Results:
        {format_results(results)}

        Provide a coherent final answer.
        """
        return llm.generate(prompt)

Communication Protocols

Protocol	Description	Use When
Direct	Agent calls agent	Simple pipelines
Message queue	Async message passing	High throughput
Shared state	Shared memory/database	Collaborative editing
Broadcast	One-to-many	Status updates

---

5. Memory and State Management

Memory Types

┌─────────────────────────────────────────────────────────────┐
│                    Agent Memory System                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────────┐  ┌─────────────────┐                  │
│  │ Working Memory  │  │  Episodic Memory │                  │
│  │ (Current task)  │  │ (Past sessions)  │                  │
│  └────────┬────────┘  └────────┬─────────┘                  │
│           │                    │                            │
│           └────────┬───────────┘                            │
│                    ▼                                        │
│  ┌─────────────────────────────────────────┐               │
│  │           Semantic Memory               │               │
│  │    (Long-term knowledge, embeddings)    │               │
│  └─────────────────────────────────────────┘               │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Implementation

class AgentMemory:
    """Memory system for conversational agents."""

    def __init__(self, embedding_model, vector_store):
        self.embedding_model = embedding_model
        self.vector_store = vector_store
        self.working_memory = []  # Current conversation
        self.buffer_size = 10     # Recent messages to keep

    def add_message(self, role, content):
        """Add message to working memory."""
        self.working_memory.append({
            "role": role,
            "content": content,
            "timestamp": datetime.now()
        })

        # Trim if too long
        if len(self.working_memory) > self.buffer_size:
            # Summarize old messages before removing
            old_messages = self.working_memory[:5]
            summary = self.summarize(old_messages)
            self.store_long_term(summary)
            self.working_memory = self.working_memory[5:]

    def store_long_term(self, content):
        """Store in semantic memory (vector store)."""
        embedding = self.embedding_model.embed(content)
        self.vector_store.add(
            embedding=embedding,
            metadata={"content": content, "type": "summary"}
        )

    def retrieve_relevant(self, query, k=5):
        """Retrieve relevant memories for context."""
        query_embedding = self.embedding_model.embed(query)
        results = self.vector_store.search(query_embedding, k=k)
        return [r.metadata["content"] for r in results]

    def get_context(self, query):
        """Build context for LLM from memories."""
        relevant = self.retrieve_relevant(query)
        recent = self.working_memory[-self.buffer_size:]

        return {
            "relevant_memories": relevant,
            "recent_conversation": recent
        }

    def summarize(self, messages):
        """Summarize messages for long-term storage."""
        content = "\n".join([
            f"{m['role']}: {m['content']}"
            for m in messages
        ])
        prompt = f"Summarize this conversation:\n{content}"
        return llm.generate(prompt)

State Persistence Patterns

Pattern	Storage	Use Case
In-memory	Dict/List	Single session
Redis	Key-value	Multi-session, fast
PostgreSQL	Relational	Complex queries
Vector DB	Embeddings	Semantic search

---

6. Agent Design Patterns

Pattern: Reflection

Agent reviews and critiques its own output.

def reflective_agent(query, tools):
    """Agent that reflects on its answers."""
    # Initial response
    response = react_agent(query, tools)

    # Reflection
    critique = llm.generate(f"""
    Review this answer for:
    1. Accuracy - Is the information correct?
    2. Completeness - Does it fully answer the question?
    3. Clarity - Is it easy to understand?

    Question: {query}
    Answer: {response}

    Critique:
    """)

    # Check if revision needed
    if needs_revision(critique):
        revised = llm.generate(f"""
        Improve this answer based on the critique:

        Original: {response}
        Critique: {critique}

        Improved answer:
        """)
        return revised

    return response

Pattern: Self-Ask

Break complex questions into simpler sub-questions.

def self_ask_agent(query, tools):
    """Agent that asks itself follow-up questions."""
    context = []

    while True:
        prompt = f"""
        Question: {query}

        Previous Q&A:
        {format_qa(context)}

        Do you need to ask a follow-up question to answer this?
        If yes: "Follow-up: [question]"
        If no: "Final Answer: [answer]"
        """

        response = llm.generate(prompt)

        if response.startswith("Final Answer:"):
            return response.replace("Final Answer:", "").strip()

        # Answer follow-up question
        follow_up = response.replace("Follow-up:", "").strip()
        answer = simple_qa(follow_up, tools)
        context.append({"q": follow_up, "a": answer})

Pattern: Expert Routing

Route queries to specialized sub-agents.

class ExpertRouter:
    """Routes queries to expert agents."""

    def __init__(self):
        self.experts = {
            "code": CodeAgent(),
            "math": MathAgent(),
            "research": ResearchAgent(),
            "general": GeneralAgent()
        }

    def route(self, query):
        """Determine best expert for query."""
        prompt = f"""
        Classify this query into one category:
        - code: Programming questions
        - math: Mathematical calculations
        - research: Fact-finding, current events
        - general: Everything else

        Query: {query}
        Category:
        """
        category = llm.generate(prompt).strip().lower()
        return self.experts.get(category, self.experts["general"])

    def process(self, query):
        expert = self.route(query)
        return expert.execute(query)

---

Quick Reference: Pattern Selection

Need	Pattern
Simple tool use	ReAct
Complex multi-step	Plan-and-Execute
API integration	Function Calling
Multiple perspectives	Multi-Agent Debate
Quality assurance	Reflection
Complex reasoning	Self-Ask
Domain expertise	Expert Routing
Conversation continuity	Memory System

LLM Evaluation Frameworks

Concrete metrics, scoring methods, comparison tables, and A/B testing frameworks.

Frameworks Index

1. Evaluation Metrics Overview 2. Text Generation Metrics 3. RAG-Specific Metrics 4. Human Evaluation Frameworks 5. A/B Testing for Prompts 6. Benchmark Datasets 7. Evaluation Pipeline Design

---

1. Evaluation Metrics Overview

Metric Categories

Category	Metrics	When to Use
Lexical	BLEU, ROUGE, Exact Match	Reference-based comparison
Semantic	BERTScore, Embedding similarity	Meaning preservation
Task-specific	F1, Accuracy, Precision/Recall	Classification, extraction
Quality	Coherence, Fluency, Relevance	Open-ended generation
Safety	Toxicity, Bias scores	Content moderation

Choosing the Right Metric

Is there a single correct answer?
├── Yes → Exact Match or F1
└── No
    └── Is there a reference output?
        ├── Yes → BLEU, ROUGE, or BERTScore
        └── No
            └── Can you define quality criteria?
                ├── Yes → Human evaluation + LLM-as-judge
                └── No → A/B testing with user metrics

---

2. Text Generation Metrics

BLEU (Bilingual Evaluation Understudy)

What it measures: N-gram overlap between generated and reference text.

Score range: 0 to 1 (higher is better)

Calculation:

BLEU = BP × exp(Σ wn × log(pn))

Where:
- BP = brevity penalty (penalizes short outputs)
- pn = precision of n-grams
- wn = weight (typically 0.25 for BLEU-4)

Interpretation:

BLEU Score	Quality
> 0.6	Excellent
0.4 - 0.6	Good
0.2 - 0.4	Acceptable
< 0.2	Poor

Example:

Reference: "The quick brown fox jumps over the lazy dog"
Generated: "A fast brown fox leaps over the lazy dog"

1-gram precision: 7/9 = 0.78 (matched: brown, fox, over, the, lazy, dog)
2-gram precision: 4/8 = 0.50 (matched: brown fox, the lazy, lazy dog)
BLEU-4: ~0.35

Limitations:

Doesn't capture meaning (synonyms penalized)
Position-independent
Requires reference text

---

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

What it measures: Overlap focused on recall (coverage of reference).

Variants:

Variant	Measures
ROUGE-1	Unigram overlap
ROUGE-2	Bigram overlap
ROUGE-L	Longest common subsequence
ROUGE-Lsum	LCS with sentence-level computation

Calculation:

ROUGE-N Recall = (matching n-grams) / (n-grams in reference)
ROUGE-N Precision = (matching n-grams) / (n-grams in generated)
ROUGE-N F1 = 2 × (Precision × Recall) / (Precision + Recall)

Example:

Reference: "The cat sat on the mat"
Generated: "The cat was sitting on the mat"

ROUGE-1:
  Recall: 5/6 = 0.83 (matched: the, cat, on, the, mat)
  Precision: 5/7 = 0.71
  F1: 0.77

ROUGE-2:
  Recall: 2/5 = 0.40 (matched: "the cat", "the mat")
  Precision: 2/6 = 0.33
  F1: 0.36

Best for: Summarization, text compression

---

BERTScore

What it measures: Semantic similarity using contextual embeddings.

How it works: 1. Generate BERT embeddings for each token 2. Compute cosine similarity between token pairs 3. Apply greedy matching to find best alignment 4. Aggregate into Precision, Recall, F1

Advantages over lexical metrics:

Captures synonyms and paraphrases
Context-aware matching
Better correlation with human judgment

Example:

Reference: "The movie was excellent"
Generated: "The film was outstanding"

Lexical (BLEU): Low score (only "The" and "was" match)
BERTScore: High score (semantic meaning preserved)

Interpretation:

BERTScore F1	Quality
> 0.9	Excellent
0.8 - 0.9	Good
0.7 - 0.8	Acceptable
< 0.7	Review needed

---

3. RAG-Specific Metrics

Context Relevance

What it measures: How relevant retrieved documents are to the query.

Calculation methods:

Method 1: Embedding similarity

relevance = cosine_similarity(
    embed(query),
    embed(context)
)

Method 2: LLM-as-judge

Prompt: "Rate the relevance of this context to the question.
Question: {question}
Context: {context}
Rate from 1-5 where 5 is highly relevant."

Target: > 0.8 for top-k contexts

---

Answer Faithfulness

What it measures: Whether the answer is supported by the context (no hallucination).

Evaluation prompt:

Given the context and answer, determine if every claim in the
answer is supported by the context.

Context: {context}
Answer: {answer}

For each claim in the answer:
1. Identify the claim
2. Find supporting evidence in context (or mark as unsupported)
3. Rate: Supported / Partially Supported / Not Supported

Overall faithfulness score: [0-1]

Scoring:

Faithfulness = (supported claims) / (total claims)

Target: > 0.95 for production systems

---

Retrieval Metrics

Metric	Formula	What it measures
Precision@k	(relevant in top-k) / k	Quality of top results
Recall@k	(relevant in top-k) / (total relevant)	Coverage
MRR	1 / (rank of first relevant)	Position of first hit
NDCG@k	DCG@k / IDCG@k	Ranking quality

Example:

Query: "What is photosynthesis?"
Retrieved docs (k=5): [R, N, R, N, R]  (R=relevant, N=not relevant)
Total relevant in corpus: 10

Precision@5 = 3/5 = 0.6
Recall@5 = 3/10 = 0.3
MRR = 1/1 = 1.0 (first doc is relevant)

---

4. Human Evaluation Frameworks

Likert Scale Evaluation

Setup:

Rate the following response on a scale of 1-5:

Response: {generated_response}

Criteria:
- Relevance (1-5): Does it address the question?
- Accuracy (1-5): Is the information correct?
- Fluency (1-5): Is it well-written?
- Helpfulness (1-5): Would this be useful to the user?

Sample size guidance:

Confidence Level	Margin of Error	Required Samples
95%	±5%	385
95%	±10%	97
90%	±10%	68

---

Comparative Evaluation (Side-by-Side)

Setup:

Compare these two responses to the question:

Question: {question}

Response A: {response_a}
Response B: {response_b}

Which response is better?
[ ] A is much better
[ ] A is slightly better
[ ] About the same
[ ] B is slightly better
[ ] B is much better

Why? _______________

Advantages:

Easier for humans than absolute scoring
Reduces calibration issues
Clear winner for A/B decisions

Analysis:

Win rate = (A wins + 0.5 × ties) / total
Bradley-Terry model for ranking multiple variants

---

LLM-as-Judge

Setup:

You are an expert evaluator. Rate the quality of this response.

Question: {question}
Response: {response}
Reference (if available): {reference}

Evaluate on:
1. Correctness (0-10): Is the information accurate?
2. Completeness (0-10): Does it fully address the question?
3. Clarity (0-10): Is it easy to understand?
4. Conciseness (0-10): Is it appropriately brief?

Provide scores and brief justification for each.
Overall score (0-10):

Calibration techniques:

Include reference responses with known scores
Use chain-of-thought for reasoning
Compare against human baseline periodically

Known biases:

Bias	Mitigation
Position bias	Randomize order
Length bias	Normalize or specify length
Self-preference	Use different model as judge
Verbosity preference	Penalize unnecessary length

---

5. A/B Testing for Prompts

Experiment Design

Hypothesis template:

H0: Prompt A and Prompt B have equal performance on [metric]
H1: Prompt B improves [metric] by at least [minimum detectable effect]

Sample size calculation:

n = 2 × ((z_α + z_β)² × σ²) / δ²

Where:
- z_α = 1.96 for 95% confidence
- z_β = 0.84 for 80% power
- σ = standard deviation of metric
- δ = minimum detectable effect

Quick reference:

MDE	Baseline Rate	Required n/variant
5% relative	50%	3,200
10% relative	50%	800
20% relative	50%	200

---

Metrics to Track

Primary metrics:

Metric	Measurement
Task success rate	% of queries with correct/helpful response
User satisfaction	Thumbs up/down or 1-5 rating
Engagement	Follow-up questions, session length

Guardrail metrics:

Metric	Threshold
Error rate	< 1%
Latency P95	< 2s
Toxicity rate	< 0.1%
Cost per query	Within budget

---

Analysis Framework

Statistical test selection:

Is the metric binary (success/failure)?
├── Yes → Chi-squared test or Z-test for proportions
└── No
    └── Is the data normally distributed?
        ├── Yes → Two-sample t-test
        └── No → Mann-Whitney U test

Interpreting results:

p-value < 0.05: Statistically significant
Effect size (Cohen's d):
  - Small: 0.2
  - Medium: 0.5
  - Large: 0.8

Decision: Ship if p < 0.05 AND effect size meets threshold AND guardrails pass

---

6. Benchmark Datasets

General NLP Benchmarks

Benchmark	Task	Size	Metric
MMLU	Knowledge QA	14K	Accuracy
HellaSwag	Commonsense	10K	Accuracy
TruthfulQA	Factuality	817	% Truthful
HumanEval	Code generation	164	pass@k
GSM8K	Math reasoning	8.5K	Accuracy

RAG Benchmarks

Benchmark	Focus	Metrics
Natural Questions	Wikipedia QA	EM, F1
HotpotQA	Multi-hop reasoning	EM, F1
MS MARCO	Web search	MRR, Recall
BEIR	Zero-shot retrieval	NDCG@10

Creating Custom Benchmarks

Template:

{
  "id": "custom-001",
  "input": "What are the symptoms of diabetes?",
  "expected_output": "Common symptoms include...",
  "metadata": {
    "category": "medical",
    "difficulty": "easy",
    "source": "internal docs"
  },
  "evaluation": {
    "type": "semantic_similarity",
    "threshold": 0.85
  }
}

Best practices:

Minimum 100 examples per category
Include edge cases (10-20%)
Balance difficulty levels
Version control your benchmark
Update quarterly

---

7. Evaluation Pipeline Design

Automated Evaluation Pipeline

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Prompt    │────▶│   LLM API   │────▶│   Output    │
│   Version   │     │             │     │   Storage   │
└─────────────┘     └─────────────┘     └──────┬──────┘
                                               │
                    ┌──────────────────────────┘
                    ▼
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Metrics   │◀────│  Evaluator  │◀────│  Benchmark  │
│   Dashboard │     │   Service   │     │   Dataset   │
└─────────────┘     └─────────────┘     └─────────────┘

Implementation Checklist

□ Define success metrics
  □ Primary metric (what you're optimizing)
  □ Guardrail metrics (what must not regress)
  □ Monitoring metrics (operational health)

□ Create benchmark dataset
  □ Representative samples from production
  □ Edge cases and failure modes
  □ Golden answers or human labels

□ Set up evaluation infrastructure
  □ Automated scoring pipeline
  □ Version control for prompts
  □ Results tracking and comparison

□ Establish baseline
  □ Run current prompt against benchmark
  □ Document scores for all metrics
  □ Set improvement targets

□ Run experiments
  □ Test one change at a time
  □ Use statistical significance testing
  □ Check all guardrail metrics

□ Deploy and monitor
  □ Gradual rollout (canary)
  □ Real-time metric monitoring
  □ Rollback plan if regression

---

Quick Reference: Metric Selection

Use Case	Primary Metric	Secondary Metrics
Summarization	ROUGE-L	BERTScore, Compression ratio
Translation	BLEU	chrF, Human pref
QA (extractive)	Exact Match, F1
QA (generative)	BERTScore	Faithfulness, Relevance
Code generation	pass@k	Syntax errors
Classification	Accuracy, F1	Precision, Recall
RAG	Faithfulness	Context relevance, MRR
Open-ended chat	Human eval	Helpfulness, Safety

Prompt Engineering Patterns

Specific prompt techniques with example inputs and expected outputs.

Patterns Index

1. Zero-Shot Prompting 2. Few-Shot Prompting 3. Chain-of-Thought (CoT) 4. Role Prompting 5. Structured Output 6. Self-Consistency 7. ReAct (Reasoning + Acting) 8. Tree of Thoughts 9. Retrieval-Augmented Generation 10. Meta-Prompting

---

1. Zero-Shot Prompting

When to use: Simple, well-defined tasks where the model has sufficient training knowledge.

Pattern:

[Task instruction]
[Input]

Example:

Input:

Classify the following customer review as positive, negative, or neutral.

Review: "The shipping was fast but the product quality was disappointing."

Expected Output:

negative

Best practices:

Be explicit about output format
Use clear, unambiguous verbs (classify, extract, summarize)
Specify constraints (word limits, format requirements)

When to avoid:

Tasks requiring specific formatting the model hasn't seen
Domain-specific tasks requiring specialized knowledge
Tasks where consistency is critical

---

2. Few-Shot Prompting

When to use: Tasks requiring consistent formatting or domain-specific patterns.

Pattern:

[Task description]

Example 1:
Input: [example input]
Output: [example output]

Example 2:
Input: [example input]
Output: [example output]

Now process:
Input: [actual input]
Output:

Example:

Input:

Extract the company name and founding year from the text.

Example 1:
Input: "Apple Inc. was founded in 1976 by Steve Jobs."
Output: {"company": "Apple Inc.", "year": 1976}

Example 2:
Input: "Microsoft Corporation started in 1975."
Output: {"company": "Microsoft Corporation", "year": 1975}

Example 3:
Input: "Founded in 1994, Amazon has grown into a tech giant."
Output: {"company": "Amazon", "year": 1994}

Now process:
Input: "Tesla, Inc. was established in 2003 by Martin Eberhard."
Output:

Expected Output:

{"company": "Tesla, Inc.", "year": 2003}

Example selection guidelines:

Example Type	Purpose	Count
Simple/typical	Establish basic pattern	1-2
Edge case	Handle ambiguity	1
Different format	Show variations	1
Total		3-5

Common mistakes:

Too many examples (wastes tokens, may confuse)
Inconsistent formatting between examples
Examples too similar (model doesn't learn variation)

---

3. Chain-of-Thought (CoT)

When to use: Math problems, logical reasoning, multi-step analysis.

Pattern:

[Problem]

Think through this step by step:

Example:

Input:

A store sells apples for $2 each. If you buy 5 or more, you get 20% off.
How much do 7 apples cost?

Think through this step by step:

Expected Output:

Step 1: Calculate the regular price for 7 apples
7 apples × $2 = $14

Step 2: Check if discount applies
7 apples ≥ 5 apples, so 20% discount applies

Step 3: Calculate the discount amount
20% of $14 = 0.20 × $14 = $2.80

Step 4: Calculate final price
$14 - $2.80 = $11.20

The 7 apples cost $11.20

Variations:

Variation	Trigger Phrase	Use Case
Basic CoT	"Think step by step"	General reasoning
Explicit steps	"Break this into numbered steps"	Complex procedures
Show work	"Show your calculations"	Math problems
Reasoning chain	"Explain your reasoning"	Justification needed

Zero-shot CoT: Simply append "Let's think step by step." to any prompt.

---

4. Role Prompting

When to use: Tasks requiring expertise, specific perspective, or consistent persona.

Pattern:

You are a [role/expert description].

[Context about the task]

[Specific instruction]

Example:

Input:

You are a senior security engineer with 15 years of experience in web application security.

Review the following code snippet for security vulnerabilities:

def login(username, password): query = f"SELECT * FROM users WHERE username='{username}' AND password='{password}'" result = db.execute(query) return result


Identify vulnerabilities and provide specific fixes.

Expected Output:

Security Vulnerabilities Identified:

1. SQL Injection (Critical)
   - The code directly interpolates user input into SQL query
   - Attacker can input: username=' OR '1'='1' --
   - This bypasses authentication entirely

   Fix:

def login(username, password): query = "SELECT * FROM users WHERE username=? AND password=?" result = db.execute(query, (username, password)) return result


2. Plain-text Password Storage (Critical)
   - Passwords appear to be stored and compared in plain text
   - Fix: Use bcrypt or argon2 for password hashing

3. Missing Rate Limiting (Medium)
   - No protection against brute force attacks
   - Fix: Implement login attempt throttling

Effective role descriptions:

Weak	Strong
"You are an expert"	"You are a senior data scientist with 10 years of experience in NLP"
"You are helpful"	"You are a patient teacher explaining concepts to beginners"
"You know about X"	"You are a certified AWS solutions architect specializing in serverless"

---

5. Structured Output

When to use: When you need parseable responses (JSON, XML, CSV).

Pattern:

[Task instruction]

Respond in JSON format with exactly these fields:
- field1 (type): description
- field2 (type): description

[Input]

Return ONLY valid JSON, no markdown or explanation.

Example:

Input:

Extract meeting details from this email.

Respond in JSON format with exactly these fields:
- date (string, ISO format): Meeting date
- time (string, 24h format): Meeting time
- attendees (array of strings): List of attendees
- topic (string): Meeting topic
- location (string or null): Meeting location if mentioned

Email: "Hi team, let's meet tomorrow at 2pm to discuss Q4 planning.
Sarah, Mike, and Lisa should attend. We'll use Conference Room B."

Today's date is 2024-01-15.

Return ONLY valid JSON, no markdown or explanation.

Expected Output:

{
  "date": "2024-01-16",
  "time": "14:00",
  "attendees": ["Sarah", "Mike", "Lisa"],
  "topic": "Q4 planning",
  "location": "Conference Room B"
}

Format enforcement techniques:

# Strong enforcement
"Return ONLY valid JSON. Start with { and end with }"

# Schema validation hint
"The output must be valid JSON matching this TypeScript type:
type Output = { name: string; age: number; active: boolean }"

# Negative instruction
"Do NOT include markdown code blocks. Do NOT add explanations."

---

6. Self-Consistency

When to use: Complex reasoning where multiple valid paths exist.

Pattern: 1. Generate multiple reasoning paths (temperature > 0) 2. Extract final answers from each path 3. Select most common answer (majority vote)

Example approach:

# Run this prompt 5 times with temperature=0.7

Solve this logic puzzle. Think through it step by step.

Three friends (Alice, Bob, Carol) each have a different pet (cat, dog, bird).
- Alice doesn't have the dog
- The person with the bird is not Carol
- Bob's pet has four legs

What pet does each person have?

Implementation:

# Pseudo-code for self-consistency
answers = []
for i in range(5):
    response = llm.generate(prompt, temperature=0.7)
    answer = extract_final_answer(response)
    answers.append(answer)

final_answer = majority_vote(answers)

When to use:

Math word problems
Logic puzzles
Any task where you can verify consistency

---

7. ReAct (Reasoning + Acting)

When to use: Tasks requiring external tool use or information gathering.

Pattern:

Answer the question using the available tools.

Tools available:
- search(query): Search the web
- calculate(expression): Evaluate math
- lookup(term): Look up in knowledge base

Format your response as:
Thought: [reasoning about what to do]
Action: [tool_name(arguments)]
Observation: [result from tool]
... (repeat as needed)
Thought: I now have enough information
Final Answer: [answer]

Question: [user question]

Example:

Input:

Question: What is the population of France divided by the population of Belgium?

Expected Output:

Thought: I need to find the population of France and Belgium, then calculate the ratio.

Action: search("population of France 2024")
Observation: France has a population of approximately 67.75 million.

Thought: Now I need Belgium's population.

Action: search("population of Belgium 2024")
Observation: Belgium has a population of approximately 11.7 million.

Thought: Now I can calculate the ratio.

Action: calculate(67.75 / 11.7)
Observation: 5.79

Thought: I now have enough information to answer.

Final Answer: The population of France is approximately 5.79 times the population of Belgium.

---

8. Tree of Thoughts

When to use: Complex problems requiring exploration of multiple solution paths.

Pattern:

Problem: [complex problem]

Generate 3 different approaches to solve this:

Approach 1: [description]
Evaluation: [pros/cons]

Approach 2: [description]
Evaluation: [pros/cons]

Approach 3: [description]
Evaluation: [pros/cons]

Best approach: [selection with reasoning]
Solution using best approach: [detailed solution]

Example:

Input:

Problem: Design a system to handle 10,000 concurrent user sessions with sub-100ms response times.

Generate 3 different architectural approaches:

Expected Output:

Approach 1: Horizontal scaling with load balancer
- Multiple stateless app servers behind load balancer
- Session state in Redis cluster
Evaluation: Simple, proven pattern. May have Redis as bottleneck.

Approach 2: Serverless with edge computing
- AWS Lambda@Edge or Cloudflare Workers
- Session state in DynamoDB Global Tables
Evaluation: Auto-scales, but cold starts may impact latency.

Approach 3: Actor model with sticky sessions
- Akka/Orleans cluster with virtual actors
- Session state co-located with actor
Evaluation: Lowest latency, but complex to operate.

Best approach: Approach 1 (Horizontal scaling)
Reasoning: Best balance of simplicity, reliability, and performance for the given requirements.

Solution: [detailed architecture follows]

---

9. Retrieval-Augmented Generation

When to use: Tasks requiring factual accuracy or current information.

Pattern:

Use the following context to answer the question. If the answer is not in the context, say "I don't have information about that."

Context:
[retrieved documents]

Question: [user question]

Answer based only on the context provided:

Example:

Input:

Use the following context to answer the question.

Context:
Document 1: "The Acme Widget 3000 was released in March 2024. It features
a 12-hour battery life and weighs 250 grams."

Document 2: "Pricing for the Acme Widget 3000 starts at $299 for the base
model. The Pro version with extended battery is $399."

Question: How much does the Acme Widget 3000 cost and what is its battery life?

Answer based only on the context provided:

Expected Output:

The Acme Widget 3000 starts at $299 for the base model, with a Pro version
available at $399. The standard model has a 12-hour battery life.

Chunking strategies:

Strategy	Chunk Size	Overlap	Best For
Fixed	500 tokens	50 tokens	General purpose
Semantic	Paragraph	None	Well-structured docs
Sentence	3-5 sentences	1 sentence	Q&A systems

---

10. Meta-Prompting

When to use: Generating or improving prompts programmatically.

Pattern:

You are a prompt engineering expert.

Task: [description of what the prompt should do]
Target model: [the model family you are deploying on]
Constraints: [length limits, format requirements]

Generate an optimized prompt for this task.

Example:

Input:

You are a prompt engineering expert.

Task: Create a prompt that extracts action items from meeting notes
Target model: a current frontier model
Constraints:
- Output must be valid JSON
- Each action item needs: task, owner, due_date
- Handle cases where information is missing

Generate an optimized prompt for this task.

Expected Output:

Extract action items from the following meeting notes.

For each action item, identify:
- task: The specific action to be taken
- owner: Person responsible (use "Unassigned" if not mentioned)
- due_date: Deadline in YYYY-MM-DD format (use null if not specified)

Meeting Notes:
{meeting_notes}

Respond with a JSON array. Example format:
[
  {"task": "Review proposal", "owner": "Sarah", "due_date": "2024-01-20"},
  {"task": "Send update", "owner": "Unassigned", "due_date": null}
]

Return ONLY the JSON array, no additional text.

---

Pattern Selection Guide

Task Type	Recommended Pattern
Simple classification	Zero-shot
Consistent formatting needed	Few-shot
Math/logic problems	Chain-of-Thought
Need expertise/perspective	Role Prompting
API integration	Structured Output
High-stakes decisions	Self-Consistency
Tool use required	ReAct
Complex problem solving	Tree of Thoughts
Factual Q&A	RAG
Prompt generation	Meta-Prompting

#!/usr/bin/env python3
"""
Agent Orchestrator - Tool for designing and validating agent workflows

Features:
- Parse agent configurations (YAML/JSON)
- Validate tool registrations
- Visualize execution flows (ASCII/Mermaid)
- Estimate token usage per run
- Detect potential issues (loops, missing tools)

Usage:
    python agent_orchestrator.py agent.yaml --validate
    python agent_orchestrator.py agent.yaml --visualize
    python agent_orchestrator.py agent.yaml --visualize --format mermaid
    python agent_orchestrator.py agent.yaml --estimate-cost
"""

import argparse
import json
import re
import sys
from pathlib import Path
from typing import Dict, List, Optional, Set, Tuple, Any
from dataclasses import dataclass, asdict, field
from enum import Enum


class AgentPattern(Enum):
    """Supported agent patterns"""
    REACT = "react"
    PLAN_EXECUTE = "plan-execute"
    TOOL_USE = "tool-use"
    MULTI_AGENT = "multi-agent"
    CUSTOM = "custom"


@dataclass
class ToolDefinition:
    """Definition of an agent tool"""
    name: str
    description: str
    parameters: Dict[str, Any] = field(default_factory=dict)
    required_config: List[str] = field(default_factory=list)
    estimated_tokens: int = 100


@dataclass
class AgentConfig:
    """Agent configuration"""
    name: str
    pattern: AgentPattern
    description: str
    tools: List[ToolDefinition]
    max_iterations: int = 10
    system_prompt: str = ""
    temperature: float = 0.7
    model: str = "unspecified"  # any model name; informational only — never priced from a hardcoded table


@dataclass
class ValidationResult:
    """Result of agent validation"""
    is_valid: bool
    errors: List[str]
    warnings: List[str]
    tool_status: Dict[str, str]
    estimated_tokens_per_run: Tuple[int, int]  # (min, max)
    potential_infinite_loop: bool
    max_depth: int


def parse_yaml_simple(content: str) -> Dict[str, Any]:
    """Simple YAML parser for agent configs (no external dependencies)"""
    result = {}
    current_key = None
    current_list = None
    indent_stack = [(0, result)]

    lines = content.split('\n')

    for line in lines:
        # Skip empty lines and comments
        stripped = line.strip()
        if not stripped or stripped.startswith('#'):
            continue

        # Calculate indent
        indent = len(line) - len(line.lstrip())

        # Check for list item
        if stripped.startswith('- '):
            item = stripped[2:].strip()
            if current_list is not None:
                # Check if it's a key-value pair
                if ':' in item and not item.startswith('{'):
                    key, _, value = item.partition(':')
                    current_list.append({key.strip(): value.strip().strip('"\'')})
                else:
                    current_list.append(item.strip('"\''))
            continue

        # Check for key-value pair
        if ':' in stripped:
            key, _, value = stripped.partition(':')
            key = key.strip()
            value = value.strip().strip('"\'')

            # Pop indent stack as needed
            while indent_stack and indent <= indent_stack[-1][0] and len(indent_stack) > 1:
                indent_stack.pop()

            current_dict = indent_stack[-1][1]

            if value:
                # Simple key-value
                current_dict[key] = value
                current_list = None
            else:
                # Start of nested structure or list
                # Peek ahead to see if it's a list
                next_line_idx = lines.index(line) + 1
                if next_line_idx < len(lines):
                    next_stripped = lines[next_line_idx].strip()
                    if next_stripped.startswith('- '):
                        current_dict[key] = []
                        current_list = current_dict[key]
                    else:
                        current_dict[key] = {}
                        indent_stack.append((indent + 2, current_dict[key]))
                        current_list = None

    return result


def load_config(path: Path) -> AgentConfig:
    """Load agent configuration from file"""
    content = path.read_text(encoding='utf-8')

    # Try JSON first
    if path.suffix == '.json':
        data = json.loads(content)
    else:
        # Try YAML
        try:
            data = parse_yaml_simple(content)
        except Exception:
            # Fallback to JSON if YAML parsing fails
            data = json.loads(content)

    # Parse pattern
    pattern_str = data.get('pattern', 'react').lower()
    try:
        pattern = AgentPattern(pattern_str)
    except ValueError:
        pattern = AgentPattern.CUSTOM

    # Parse tools
    tools = []
    for tool_data in data.get('tools', []):
        if isinstance(tool_data, dict):
            tools.append(ToolDefinition(
                name=tool_data.get('name', 'unknown'),
                description=tool_data.get('description', ''),
                parameters=tool_data.get('parameters', {}),
                required_config=tool_data.get('required_config', []),
                estimated_tokens=tool_data.get('estimated_tokens', 100)
            ))
        elif isinstance(tool_data, str):
            tools.append(ToolDefinition(name=tool_data, description=''))

    return AgentConfig(
        name=data.get('name', 'agent'),
        pattern=pattern,
        description=data.get('description', ''),
        tools=tools,
        max_iterations=int(data.get('max_iterations', 10)),
        system_prompt=data.get('system_prompt', ''),
        temperature=float(data.get('temperature', 0.7)),
        model=data.get('model', 'unspecified')
    )


def validate_agent(config: AgentConfig) -> ValidationResult:
    """Validate agent configuration"""
    errors = []
    warnings = []
    tool_status = {}

    # Validate name
    if not config.name:
        errors.append("Agent name is required")

    # Validate tools
    if not config.tools:
        warnings.append("No tools defined - agent will have limited capabilities")

    tool_names = set()
    for tool in config.tools:
        # Check for duplicates
        if tool.name in tool_names:
            errors.append(f"Duplicate tool name: {tool.name}")
        tool_names.add(tool.name)

        # Check required config
        if tool.required_config:
            missing = [c for c in tool.required_config if not c.startswith('$')]
            if missing:
                tool_status[tool.name] = f"WARN: Missing config: {missing}"
            else:
                tool_status[tool.name] = "OK"
        else:
            tool_status[tool.name] = "OK - No config needed"

        # Check description
        if not tool.description:
            warnings.append(f"Tool '{tool.name}' has no description")

    # Validate pattern-specific requirements
    if config.pattern == AgentPattern.MULTI_AGENT:
        if len(config.tools) < 2:
            warnings.append("Multi-agent pattern typically requires 2+ specialized tools")

    # Check for potential infinite loops
    potential_loop = config.max_iterations > 50

    # Estimate tokens
    base_tokens = len(config.system_prompt.split()) * 1.3 if config.system_prompt else 200
    tool_tokens = sum(t.estimated_tokens for t in config.tools)

    min_tokens = int(base_tokens + tool_tokens)
    max_tokens = int((base_tokens + tool_tokens * 2) * config.max_iterations)

    return ValidationResult(
        is_valid=len(errors) == 0,
        errors=errors,
        warnings=warnings,
        tool_status=tool_status,
        estimated_tokens_per_run=(min_tokens, max_tokens),
        potential_infinite_loop=potential_loop,
        max_depth=config.max_iterations
    )


def generate_ascii_diagram(config: AgentConfig) -> str:
    """Generate ASCII workflow diagram"""
    lines = []

    # Header
    width = max(40, len(config.name) + 10)
    lines.append("┌" + "─" * width + "┐")
    lines.append("│" + config.name.center(width) + "│")
    lines.append("│" + f"({config.pattern.value} Pattern)".center(width) + "│")
    lines.append("└" + "─" * (width // 2 - 1) + "┬" + "─" * (width // 2) + "┘")
    lines.append(" " * (width // 2) + "│")

    # User Query
    lines.append(" " * (width // 2 - 8) + "┌───────────────┐")
    lines.append(" " * (width // 2 - 8) + "│  User Query   │")
    lines.append(" " * (width // 2 - 8) + "└───────┬───────┘")
    lines.append(" " * (width // 2) + "│")

    if config.pattern == AgentPattern.REACT:
        # ReAct loop
        lines.append(" " * (width // 2 - 8) + "┌───────────────┐")
        lines.append(" " * (width // 2 - 8) + "│    Think      │◄──────┐")
        lines.append(" " * (width // 2 - 8) + "└───────┬───────┘       │")
        lines.append(" " * (width // 2) + "│               │")
        lines.append(" " * (width // 2 - 8) + "┌───────────────┐       │")
        lines.append(" " * (width // 2 - 8) + "│  Select Tool  │       │")
        lines.append(" " * (width // 2 - 8) + "└───────┬───────┘       │")
        lines.append(" " * (width // 2) + "│               │")

        # Tools
        if config.tools:
            tool_line = "   ".join([f"[{t.name}]" for t in config.tools[:4]])
            if len(config.tools) > 4:
                tool_line += " ..."
            lines.append(" " * 4 + tool_line)
            lines.append(" " * (width // 2) + "│               │")

        lines.append(" " * (width // 2 - 8) + "┌───────────────┐       │")
        lines.append(" " * (width // 2 - 8) + "│   Observe     │───────┘")
        lines.append(" " * (width // 2 - 8) + "└───────┬───────┘")

    elif config.pattern == AgentPattern.PLAN_EXECUTE:
        # Plan phase
        lines.append(" " * (width // 2 - 8) + "┌───────────────┐")
        lines.append(" " * (width // 2 - 8) + "│  Create Plan  │")
        lines.append(" " * (width // 2 - 8) + "└───────┬───────┘")
        lines.append(" " * (width // 2) + "│")

        # Execute loop
        lines.append(" " * (width // 2 - 8) + "┌───────────────┐")
        lines.append(" " * (width // 2 - 8) + "│ Execute Step  │◄──────┐")
        lines.append(" " * (width // 2 - 8) + "└───────┬───────┘       │")
        lines.append(" " * (width // 2) + "│               │")

        if config.tools:
            tool_line = "   ".join([f"[{t.name}]" for t in config.tools[:4]])
            lines.append(" " * 4 + tool_line)
            lines.append(" " * (width // 2) + "│               │")

        lines.append(" " * (width // 2 - 8) + "┌───────────────┐       │")
        lines.append(" " * (width // 2 - 8) + "│  Check Done?  │───────┘")
        lines.append(" " * (width // 2 - 8) + "└───────┬───────┘")

    else:
        # Generic tool use
        lines.append(" " * (width // 2 - 8) + "┌───────────────┐")
        lines.append(" " * (width // 2 - 8) + "│ Process Query │")
        lines.append(" " * (width // 2 - 8) + "└───────┬───────┘")
        lines.append(" " * (width // 2) + "│")

        if config.tools:
            for tool in config.tools[:6]:
                lines.append(" " * (width // 2 - 8) + f"├──▶ [{tool.name}]")
            if len(config.tools) > 6:
                lines.append(" " * (width // 2 - 8) + "├──▶ [...]")

    # Final answer
    lines.append(" " * (width // 2) + "│")
    lines.append(" " * (width // 2 - 8) + "┌───────────────┐")
    lines.append(" " * (width // 2 - 8) + "│ Final Answer  │")
    lines.append(" " * (width // 2 - 8) + "└───────────────┘")

    return '\n'.join(lines)


def generate_mermaid_diagram(config: AgentConfig) -> str:
    """Generate Mermaid flowchart"""
    lines = ["```mermaid", "flowchart TD"]

    # Start and query
    lines.append(f"    subgraph {config.name}[{config.name}]")
    lines.append("    direction TB")
    lines.append("    A[User Query] --> B{Process}")

    if config.pattern == AgentPattern.REACT:
        lines.append("    B --> C[Think]")
        lines.append("    C --> D{Select Tool}")

        for i, tool in enumerate(config.tools[:6]):
            lines.append(f"    D -->|{tool.name}| T{i}[{tool.name}]")
            lines.append(f"    T{i} --> E[Observe]")

        lines.append("    E -->|Continue| C")
        lines.append("    E -->|Done| F[Final Answer]")

    elif config.pattern == AgentPattern.PLAN_EXECUTE:
        lines.append("    B --> P[Create Plan]")
        lines.append("    P --> X{Execute Step}")

        for i, tool in enumerate(config.tools[:6]):
            lines.append(f"    X -->|{tool.name}| T{i}[{tool.name}]")
            lines.append(f"    T{i} --> R[Review]")

        lines.append("    R -->|More Steps| X")
        lines.append("    R -->|Complete| F[Final Answer]")

    else:
        for i, tool in enumerate(config.tools[:6]):
            lines.append(f"    B -->|use| T{i}[{tool.name}]")
            lines.append(f"    T{i} --> F[Final Answer]")

    lines.append("    end")
    lines.append("```")

    return '\n'.join(lines)


def estimate_cost(config: AgentConfig, runs: int = 100,
                  input_price_per_mtok: Optional[float] = None,
                  output_price_per_mtok: Optional[float] = None) -> Dict[str, Any]:
    """Estimate token usage (always) and dollar costs (only with user-supplied prices).

    Model pricing changes too often to hardcode. Look up your provider's current
    rates and pass --input-price-per-mtok / --output-price-per-mtok; without them
    the tool reports token estimates only.
    """
    validation = validate_agent(config)
    min_tokens, max_tokens = validation.estimated_tokens_per_run

    result: Dict[str, Any] = {
        'model': config.model,
        'tokens_per_run': {'min': min_tokens, 'max': max_tokens},
        'cost_per_run': None,
        'estimated_monthly': {'runs': runs * 30, 'cost_min': None, 'cost_max': None},
    }

    if input_price_per_mtok is None or output_price_per_mtok is None:
        return result

    # Assume 60% input, 40% output
    def run_cost(tokens: int) -> float:
        return (tokens * 0.6 / 1_000_000 * input_price_per_mtok +
                tokens * 0.4 / 1_000_000 * output_price_per_mtok)

    cost_per_run_min = run_cost(min_tokens)
    cost_per_run_max = run_cost(max_tokens)

    result['cost_per_run'] = {'min': round(cost_per_run_min, 4), 'max': round(cost_per_run_max, 4)}
    result['estimated_monthly'] = {
        'runs': runs * 30,
        'cost_min': round(cost_per_run_min * runs * 30, 2),
        'cost_max': round(cost_per_run_max * runs * 30, 2),
    }
    return result


def format_validation_report(config: AgentConfig, result: ValidationResult) -> str:
    """Format validation result as human-readable report"""
    lines = []
    lines.append("=" * 50)
    lines.append("AGENT VALIDATION REPORT")
    lines.append("=" * 50)
    lines.append("")

    lines.append(f"📋 AGENT INFO")
    lines.append(f"  Name:    {config.name}")
    lines.append(f"  Pattern: {config.pattern.value}")
    lines.append(f"  Model:   {config.model}")
    lines.append("")

    lines.append(f"🔧 TOOLS ({len(config.tools)} registered)")
    for tool in config.tools:
        status = result.tool_status.get(tool.name, "Unknown")
        emoji = "✅" if status.startswith("OK") else "⚠️"
        lines.append(f"  {emoji} {tool.name} - {status}")
    lines.append("")

    lines.append("📊 FLOW ANALYSIS")
    lines.append(f"  Max iterations:      {result.max_depth}")
    lines.append(f"  Estimated tokens:    {result.estimated_tokens_per_run[0]:,} - {result.estimated_tokens_per_run[1]:,}")
    lines.append(f"  Potential loop:      {'⚠️ Yes' if result.potential_infinite_loop else '✅ No'}")
    lines.append("")

    if result.errors:
        lines.append(f"❌ ERRORS ({len(result.errors)})")
        for error in result.errors:
            lines.append(f"  • {error}")
        lines.append("")

    if result.warnings:
        lines.append(f"⚠️ WARNINGS ({len(result.warnings)})")
        for warning in result.warnings:
            lines.append(f"  • {warning}")
        lines.append("")

    # Overall status
    if result.is_valid:
        lines.append("✅ VALIDATION PASSED")
    else:
        lines.append("❌ VALIDATION FAILED")

    lines.append("")
    lines.append("=" * 50)

    return '\n'.join(lines)


def main():
    parser = argparse.ArgumentParser(
        description="Agent Orchestrator - Design and validate agent workflows",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  %(prog)s agent.yaml --validate
  %(prog)s agent.yaml --visualize
  %(prog)s agent.yaml --visualize --format mermaid
  %(prog)s agent.yaml --estimate-cost --runs 100

Agent config format (YAML):

name: research_assistant
pattern: react
model: any-model-id   # optional, informational only
max_iterations: 10
tools:
  - name: web_search
    description: Search the web
    required_config: [api_key]
  - name: calculator
    description: Evaluate math expressions
        """
    )

    parser.add_argument('config', help='Agent configuration file (YAML or JSON)')
    parser.add_argument('--validate', '-V', action='store_true', help='Validate agent configuration')
    parser.add_argument('--visualize', '-v', action='store_true', help='Visualize agent workflow')
    parser.add_argument('--format', '-f', choices=['ascii', 'mermaid'], default='ascii',
                       help='Visualization format (default: ascii)')
    parser.add_argument('--estimate-cost', '-e', action='store_true',
                       help='Estimate token usage (and dollar cost if prices supplied)')
    parser.add_argument('--runs', '-r', type=int, default=100, help='Daily runs for cost estimation')
    parser.add_argument('--input-price-per-mtok', type=float, default=None,
                       help='Input price in USD per million tokens (your provider\'s current rate)')
    parser.add_argument('--output-price-per-mtok', type=float, default=None,
                       help='Output price in USD per million tokens (your provider\'s current rate)')
    parser.add_argument('--output', '-o', help='Output file path')
    parser.add_argument('--json', '-j', action='store_true', help='Output as JSON')

    args = parser.parse_args()

    # Load config
    config_path = Path(args.config)
    if not config_path.exists():
        print(f"Error: Config file not found: {args.config}", file=sys.stderr)
        sys.exit(1)

    try:
        config = load_config(config_path)
    except Exception as e:
        print(f"Error parsing config: {e}", file=sys.stderr)
        sys.exit(1)

    # Default to validate if no action specified
    if not any([args.validate, args.visualize, args.estimate_cost]):
        args.validate = True

    output_parts = []

    # Validate
    if args.validate:
        result = validate_agent(config)
        if args.json:
            output_parts.append(json.dumps(asdict(result), indent=2))
        else:
            output_parts.append(format_validation_report(config, result))

    # Visualize
    if args.visualize:
        if args.format == 'mermaid':
            diagram = generate_mermaid_diagram(config)
        else:
            diagram = generate_ascii_diagram(config)
        output_parts.append(diagram)

    # Cost estimation
    if args.estimate_cost:
        costs = estimate_cost(config, args.runs,
                              args.input_price_per_mtok, args.output_price_per_mtok)
        if args.json:
            output_parts.append(json.dumps(costs, indent=2))
        else:
            output_parts.append("")
            output_parts.append("💰 COST ESTIMATION")
            output_parts.append(f"  Model: {costs['model']}")
            output_parts.append(f"  Tokens per run: {costs['tokens_per_run']['min']:,} - {costs['tokens_per_run']['max']:,}")
            if costs['cost_per_run'] is not None:
                output_parts.append(f"  Cost per run: ${costs['cost_per_run']['min']:.4f} - ${costs['cost_per_run']['max']:.4f}")
                output_parts.append(f"  Monthly ({costs['estimated_monthly']['runs']:,} runs):")
                output_parts.append(f"    Min: ${costs['estimated_monthly']['cost_min']:.2f}")
                output_parts.append(f"    Max: ${costs['estimated_monthly']['cost_max']:.2f}")
            else:
                output_parts.append("  Dollar cost: n/a — pass --input-price-per-mtok and "
                                    "--output-price-per-mtok with your provider's current rates")

    # Output
    output = '\n'.join(output_parts)
    print(output)

    if args.output:
        Path(args.output).write_text(output)
        print(f"\nOutput saved to {args.output}")


if __name__ == '__main__':
    main()

#!/usr/bin/env python3
"""
Prompt Optimizer - Static analysis tool for prompt engineering

Features:
- Token estimation (model-agnostic chars-per-token approximation)
- Prompt structure analysis
- Clarity scoring
- Few-shot example extraction and management
- Optimization suggestions

Model-agnostic by design: pass any model name with --model (only the tokenizer
family is inferred from it), and supply current pricing yourself with
--price-per-mtok if you want cost estimates. No model IDs or prices are
hardcoded, so this tool does not rot as the model landscape changes.

Usage:
    python prompt_optimizer.py prompt.txt --analyze
    python prompt_optimizer.py prompt.txt --tokens
    python prompt_optimizer.py prompt.txt --tokens --model claude --price-per-mtok 3.00
    python prompt_optimizer.py prompt.txt --optimize --output optimized.txt
    python prompt_optimizer.py prompt.txt --extract-examples --output examples.json
"""

import argparse
import json
import re
import sys
from pathlib import Path
from typing import Dict, List, Optional, Tuple
from dataclasses import dataclass, asdict


# Token estimation ratios (chars per token approximation), inferred from the
# tokenizer FAMILY in the --model string. Any model name is accepted.
TOKEN_RATIOS = {
    'claude': 3.5,   # Anthropic tokenizer tends to be denser
    'default': 4.0,  # common BPE-family approximation
}


@dataclass
class PromptAnalysis:
    """Results of prompt analysis"""
    token_count: int
    estimated_cost: Optional[float]
    model: str
    clarity_score: int
    structure_score: int
    issues: List[Dict[str, str]]
    suggestions: List[str]
    sections: List[Dict[str, any]]
    has_examples: bool
    example_count: int
    has_output_format: bool
    word_count: int
    line_count: int


@dataclass
class FewShotExample:
    """A single few-shot example"""
    input_text: str
    output_text: str
    index: int


def estimate_tokens(text: str, model: str = 'default') -> int:
    """Estimate token count based on character ratio (family inferred from model name)"""
    family = 'claude' if 'claude' in model.lower() else 'default'
    ratio = TOKEN_RATIOS[family]
    return int(len(text) / ratio)


def estimate_cost(token_count: int, price_per_mtok: Optional[float] = None) -> Optional[float]:
    """Estimate input cost from a user-supplied price (USD per million tokens).

    Returns None when no price is supplied — pricing changes too often to hardcode;
    look up your provider's current rate and pass --price-per-mtok.
    """
    if price_per_mtok is None:
        return None
    return round((token_count / 1_000_000) * price_per_mtok, 6)


def find_ambiguous_instructions(text: str) -> List[Dict[str, str]]:
    """Find vague or ambiguous instructions"""
    issues = []

    # Vague verbs that need specificity
    vague_patterns = [
        (r'\b(analyze|process|handle|deal with)\b', 'Vague verb - specify the exact action'),
        (r'\b(good|nice|appropriate|suitable)\b', 'Subjective term - define specific criteria'),
        (r'\b(etc\.|and so on|and more)\b', 'Open-ended list - enumerate all items explicitly'),
        (r'\b(if needed|as necessary|when appropriate)\b', 'Conditional without criteria - specify when'),
        (r'\b(some|several|many|few|various)\b', 'Vague quantity - use specific numbers'),
    ]

    lines = text.split('\n')
    for i, line in enumerate(lines, 1):
        for pattern, message in vague_patterns:
            matches = re.finditer(pattern, line, re.IGNORECASE)
            for match in matches:
                issues.append({
                    'type': 'ambiguity',
                    'line': i,
                    'text': match.group(),
                    'message': message,
                    'context': line.strip()[:80]
                })

    return issues


def find_redundant_content(text: str) -> List[Dict[str, str]]:
    """Find potentially redundant content"""
    issues = []
    lines = text.split('\n')

    # Check for repeated phrases (3+ words)
    seen_phrases = {}
    for i, line in enumerate(lines, 1):
        words = line.split()
        for j in range(len(words) - 2):
            phrase = ' '.join(words[j:j+3]).lower()
            phrase = re.sub(r'[^\w\s]', '', phrase)
            if phrase and len(phrase) > 10:
                if phrase in seen_phrases:
                    issues.append({
                        'type': 'redundancy',
                        'line': i,
                        'text': phrase,
                        'message': f'Phrase repeated from line {seen_phrases[phrase]}',
                        'context': line.strip()[:80]
                    })
                else:
                    seen_phrases[phrase] = i

    return issues


def check_output_format(text: str) -> Tuple[bool, List[str]]:
    """Check if prompt specifies output format"""
    suggestions = []

    format_indicators = [
        r'respond\s+(in|with)\s+(json|xml|csv|markdown)',
        r'output\s+format',
        r'return\s+(only|just)',
        r'format:\s*\n',
        r'\{["\']?\w+["\']?\s*:',  # JSON-like structure
        r'```\w*\n',  # Code block
    ]

    has_format = any(re.search(p, text, re.IGNORECASE) for p in format_indicators)

    if not has_format:
        suggestions.append('Add explicit output format specification (e.g., "Respond in JSON with keys: ...")')

    return has_format, suggestions


def extract_sections(text: str) -> List[Dict[str, any]]:
    """Extract logical sections from prompt"""
    sections = []

    # Common section patterns
    section_patterns = [
        r'^#+\s+(.+)$',  # Markdown headers
        r'^([A-Z][A-Za-z\s]+):\s*$',  # Title Case Label:
        r'^(Instructions|Context|Examples?|Input|Output|Task|Role|Format)[:.]',
    ]

    lines = text.split('\n')
    current_section = {'name': 'Introduction', 'start': 1, 'content': []}

    for i, line in enumerate(lines, 1):
        is_header = False
        for pattern in section_patterns:
            match = re.match(pattern, line.strip(), re.IGNORECASE)
            if match:
                if current_section['content']:
                    current_section['end'] = i - 1
                    current_section['line_count'] = len(current_section['content'])
                    sections.append(current_section)
                current_section = {
                    'name': match.group(1).strip() if match.groups() else line.strip(),
                    'start': i,
                    'content': []
                }
                is_header = True
                break

        if not is_header:
            current_section['content'].append(line)

    # Add last section
    if current_section['content']:
        current_section['end'] = len(lines)
        current_section['line_count'] = len(current_section['content'])
        sections.append(current_section)

    return sections


def extract_few_shot_examples(text: str) -> List[FewShotExample]:
    """Extract few-shot examples from prompt"""
    examples = []

    # Pattern 1: "Example N:" or "Example:" blocks
    example_pattern = r'Example\s*\d*:\s*\n(Input:\s*(.+?)\n(?:Output:\s*(.+?)(?=\n\nExample|\n\n[A-Z]|\Z)))'

    matches = re.finditer(example_pattern, text, re.DOTALL | re.IGNORECASE)
    for i, match in enumerate(matches, 1):
        examples.append(FewShotExample(
            input_text=match.group(2).strip() if match.group(2) else '',
            output_text=match.group(3).strip() if match.group(3) else '',
            index=i
        ))

    # Pattern 2: Input/Output pairs without "Example" label
    if not examples:
        io_pattern = r'Input:\s*["\']?(.+?)["\']?\s*\nOutput:\s*(.+?)(?=\nInput:|\Z)'
        matches = re.finditer(io_pattern, text, re.DOTALL)
        for i, match in enumerate(matches, 1):
            examples.append(FewShotExample(
                input_text=match.group(1).strip(),
                output_text=match.group(2).strip(),
                index=i
            ))

    return examples


def calculate_clarity_score(text: str, issues: List[Dict]) -> int:
    """Calculate clarity score (0-100)"""
    score = 100

    # Deduct for issues
    score -= len([i for i in issues if i['type'] == 'ambiguity']) * 5
    score -= len([i for i in issues if i['type'] == 'redundancy']) * 3

    # Check for structure
    if not re.search(r'^#+\s|^[A-Z][a-z]+:', text, re.MULTILINE):
        score -= 10  # No clear sections

    # Check for instruction clarity
    if not re.search(r'(you (should|must|will)|please|your task)', text, re.IGNORECASE):
        score -= 5  # No clear directives

    return max(0, min(100, score))


def calculate_structure_score(sections: List[Dict], has_format: bool, has_examples: bool) -> int:
    """Calculate structure score (0-100)"""
    score = 50  # Base score

    # Bonus for clear sections
    if len(sections) >= 2:
        score += 15
    if len(sections) >= 4:
        score += 10

    # Bonus for output format
    if has_format:
        score += 15

    # Bonus for examples
    if has_examples:
        score += 10

    return min(100, score)


def generate_suggestions(analysis: PromptAnalysis) -> List[str]:
    """Generate optimization suggestions"""
    suggestions = []

    if not analysis.has_output_format:
        suggestions.append('Add explicit output format: "Respond in JSON with keys: ..."')

    if analysis.example_count == 0:
        suggestions.append('Consider adding 2-3 few-shot examples for consistent outputs')
    elif analysis.example_count == 1:
        suggestions.append('Add 1-2 more examples to improve consistency')
    elif analysis.example_count > 5:
        suggestions.append(f'Consider reducing examples from {analysis.example_count} to 3-5 to save tokens')

    if analysis.clarity_score < 70:
        suggestions.append('Improve clarity: replace vague terms with specific instructions')

    if analysis.token_count > 2000:
        suggestions.append(f'Prompt is {analysis.token_count} tokens - consider condensing for cost efficiency')

    # Check for role prompting
    if not re.search(r'you are|act as|as a\s+\w+', analysis.sections[0].get('content', [''])[0] if analysis.sections else '', re.IGNORECASE):
        suggestions.append('Consider adding role context: "You are an expert..."')

    return suggestions


def analyze_prompt(text: str, model: str = 'default',
                   price_per_mtok: Optional[float] = None) -> PromptAnalysis:
    """Perform comprehensive prompt analysis"""

    # Basic metrics
    token_count = estimate_tokens(text, model)
    cost = estimate_cost(token_count, price_per_mtok)
    word_count = len(text.split())
    line_count = len(text.split('\n'))

    # Find issues
    ambiguity_issues = find_ambiguous_instructions(text)
    redundancy_issues = find_redundant_content(text)
    all_issues = ambiguity_issues + redundancy_issues

    # Extract structure
    sections = extract_sections(text)
    examples = extract_few_shot_examples(text)
    has_format, format_suggestions = check_output_format(text)

    # Calculate scores
    clarity_score = calculate_clarity_score(text, all_issues)
    structure_score = calculate_structure_score(sections, has_format, len(examples) > 0)

    analysis = PromptAnalysis(
        token_count=token_count,
        estimated_cost=cost,
        model=model,
        clarity_score=clarity_score,
        structure_score=structure_score,
        issues=all_issues,
        suggestions=[],
        sections=[{'name': s['name'], 'lines': f"{s['start']}-{s.get('end', s['start'])}"} for s in sections],
        has_examples=len(examples) > 0,
        example_count=len(examples),
        has_output_format=has_format,
        word_count=word_count,
        line_count=line_count
    )

    analysis.suggestions = generate_suggestions(analysis) + format_suggestions

    return analysis


def optimize_prompt(text: str) -> str:
    """Generate optimized version of prompt"""
    optimized = text

    # Remove redundant whitespace
    optimized = re.sub(r'\n{3,}', '\n\n', optimized)
    optimized = re.sub(r' {2,}', ' ', optimized)

    # Trim lines
    lines = [line.rstrip() for line in optimized.split('\n')]
    optimized = '\n'.join(lines)

    return optimized.strip()


def format_report(analysis: PromptAnalysis) -> str:
    """Format analysis as human-readable report"""
    report = []
    report.append("=" * 50)
    report.append("PROMPT ANALYSIS REPORT")
    report.append("=" * 50)
    report.append("")

    report.append("📊 METRICS")
    report.append(f"  Token count:     {analysis.token_count:,}")
    if analysis.estimated_cost is not None:
        report.append(f"  Estimated cost:  ${analysis.estimated_cost:.4f} ({analysis.model}, user-supplied price)")
    else:
        report.append("  Estimated cost:  n/a (pass --price-per-mtok with your provider's current rate)")
    report.append(f"  Word count:      {analysis.word_count:,}")
    report.append(f"  Line count:      {analysis.line_count}")
    report.append("")

    report.append("📈 SCORES")
    report.append(f"  Clarity:    {analysis.clarity_score}/100 {'✅' if analysis.clarity_score >= 70 else '⚠️'}")
    report.append(f"  Structure:  {analysis.structure_score}/100 {'✅' if analysis.structure_score >= 70 else '⚠️'}")
    report.append("")

    report.append("📋 STRUCTURE")
    report.append(f"  Sections:      {len(analysis.sections)}")
    report.append(f"  Examples:      {analysis.example_count} {'✅' if analysis.has_examples else '❌'}")
    report.append(f"  Output format: {'✅ Specified' if analysis.has_output_format else '❌ Missing'}")
    report.append("")

    if analysis.sections:
        report.append("  Detected sections:")
        for section in analysis.sections:
            report.append(f"    - {section['name']} (lines {section['lines']})")
        report.append("")

    if analysis.issues:
        report.append(f"⚠️ ISSUES FOUND ({len(analysis.issues)})")
        for issue in analysis.issues[:10]:  # Limit to first 10
            report.append(f"  Line {issue['line']}: {issue['message']}")
            report.append(f"    Found: \"{issue['text']}\"")
        if len(analysis.issues) > 10:
            report.append(f"  ... and {len(analysis.issues) - 10} more issues")
        report.append("")

    if analysis.suggestions:
        report.append("💡 SUGGESTIONS")
        for i, suggestion in enumerate(analysis.suggestions, 1):
            report.append(f"  {i}. {suggestion}")
        report.append("")

    report.append("=" * 50)

    return '\n'.join(report)


def main():
    parser = argparse.ArgumentParser(
        description="Prompt Optimizer - Analyze and optimize prompts",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  %(prog)s prompt.txt --analyze
  %(prog)s prompt.txt --tokens --model claude --price-per-mtok 3.00
  %(prog)s prompt.txt --optimize --output optimized.txt
  %(prog)s prompt.txt --extract-examples --output examples.json
        """
    )

    parser.add_argument('prompt', help='Prompt file to analyze')
    parser.add_argument('--analyze', '-a', action='store_true', help='Run full analysis')
    parser.add_argument('--tokens', '-t', action='store_true', help='Count tokens only')
    parser.add_argument('--optimize', '-O', action='store_true', help='Generate optimized version')
    parser.add_argument('--extract-examples', '-e', action='store_true', help='Extract few-shot examples')
    parser.add_argument('--model', '-m', default='default',
                       help='Model name (any string; only the tokenizer family is inferred: '
                            'names containing "claude" use 3.5 chars/token, otherwise 4.0)')
    parser.add_argument('--price-per-mtok', type=float, default=None,
                       help='Input price in USD per million tokens (look up your provider\'s '
                            'current rate); omit to skip cost estimation')
    parser.add_argument('--output', '-o', help='Output file path')
    parser.add_argument('--json', '-j', action='store_true', help='Output as JSON')
    parser.add_argument('--compare', '-c', help='Compare with baseline analysis JSON')

    args = parser.parse_args()

    # Read prompt file
    prompt_path = Path(args.prompt)
    if not prompt_path.exists():
        print(f"Error: File not found: {args.prompt}", file=sys.stderr)
        sys.exit(1)

    text = prompt_path.read_text(encoding='utf-8')

    # Tokens only
    if args.tokens:
        token_count = estimate_tokens(text, args.model)
        cost = estimate_cost(token_count, args.price_per_mtok)
        if args.json:
            print(json.dumps({
                'tokens': token_count,
                'cost': cost,
                'model': args.model
            }, indent=2))
        else:
            print(f"Tokens: {token_count:,}")
            if cost is not None:
                print(f"Estimated cost: ${cost:.4f} ({args.model}, user-supplied price)")
        sys.exit(0)

    # Extract examples
    if args.extract_examples:
        examples = extract_few_shot_examples(text)
        output = [asdict(ex) for ex in examples]

        if args.output:
            Path(args.output).write_text(json.dumps(output, indent=2))
            print(f"Extracted {len(examples)} examples to {args.output}")
        else:
            print(json.dumps(output, indent=2))
        sys.exit(0)

    # Optimize
    if args.optimize:
        optimized = optimize_prompt(text)

        if args.output:
            Path(args.output).write_text(optimized)
            print(f"Optimized prompt written to {args.output}")

            # Show comparison
            orig_tokens = estimate_tokens(text, args.model)
            new_tokens = estimate_tokens(optimized, args.model)
            saved = orig_tokens - new_tokens
            print(f"Tokens: {orig_tokens:,} -> {new_tokens:,} (saved {saved:,})")
        else:
            print(optimized)
        sys.exit(0)

    # Default: full analysis
    analysis = analyze_prompt(text, args.model, args.price_per_mtok)

    # Compare with baseline
    if args.compare:
        baseline_path = Path(args.compare)
        if baseline_path.exists():
            baseline = json.loads(baseline_path.read_text())
            print("\n📊 COMPARISON WITH BASELINE")
            print(f"  Tokens: {baseline.get('token_count', 0):,} -> {analysis.token_count:,}")
            print(f"  Clarity: {baseline.get('clarity_score', 0)} -> {analysis.clarity_score}")
            print(f"  Issues: {len(baseline.get('issues', []))} -> {len(analysis.issues)}")
            print()

    if args.json:
        print(json.dumps(asdict(analysis), indent=2))
    else:
        print(format_report(analysis))

    # Write to output file
    if args.output:
        output_data = asdict(analysis)
        Path(args.output).write_text(json.dumps(output_data, indent=2))
        print(f"\nAnalysis saved to {args.output}")


if __name__ == '__main__':
    main()

#!/usr/bin/env python3
"""
RAG Evaluator - Evaluation tool for Retrieval-Augmented Generation systems

Features:
- Context relevance scoring (lexical overlap)
- Answer faithfulness checking
- Retrieval metrics (Precision@K, Recall@K, MRR)
- Coverage analysis
- Quality report generation

Usage:
    python rag_evaluator.py --contexts contexts.json --questions questions.json
    python rag_evaluator.py --contexts ctx.json --questions q.json --metrics relevance,faithfulness
    python rag_evaluator.py --contexts ctx.json --questions q.json --output report.json --verbose
"""

import argparse
import json
import re
import sys
from pathlib import Path
from typing import Dict, List, Optional, Set, Tuple
from dataclasses import dataclass, asdict, field
from collections import Counter
import math


@dataclass
class RetrievalMetrics:
    """Retrieval quality metrics"""
    precision_at_k: float
    recall_at_k: float
    mrr: float  # Mean Reciprocal Rank
    ndcg_at_k: float
    k: int


@dataclass
class ContextEvaluation:
    """Evaluation of a single context"""
    context_id: str
    relevance_score: float
    token_overlap: float
    key_terms_covered: List[str]
    missing_terms: List[str]


@dataclass
class AnswerEvaluation:
    """Evaluation of an answer against context"""
    question_id: str
    faithfulness_score: float
    groundedness_score: float
    claims: List[Dict[str, any]]
    unsupported_claims: List[str]
    context_used: List[str]


@dataclass
class RAGEvaluationReport:
    """Complete RAG evaluation report"""
    total_questions: int
    avg_context_relevance: float
    avg_faithfulness: float
    avg_groundedness: float
    retrieval_metrics: Dict[str, float]
    coverage: float
    issues: List[Dict[str, str]]
    recommendations: List[str]
    question_details: List[Dict[str, any]] = field(default_factory=list)


def tokenize(text: str) -> List[str]:
    """Simple tokenization for text comparison"""
    # Lowercase and split on non-alphanumeric
    text = text.lower()
    tokens = re.findall(r'\b\w+\b', text)
    # Remove common stopwords
    stopwords = {'the', 'a', 'an', 'is', 'are', 'was', 'were', 'be', 'been',
                 'being', 'have', 'has', 'had', 'do', 'does', 'did', 'will',
                 'would', 'could', 'should', 'may', 'might', 'must', 'shall',
                 'can', 'to', 'of', 'in', 'for', 'on', 'with', 'at', 'by',
                 'from', 'as', 'into', 'through', 'during', 'before', 'after',
                 'above', 'below', 'up', 'down', 'out', 'off', 'over', 'under',
                 'again', 'further', 'then', 'once', 'here', 'there', 'when',
                 'where', 'why', 'how', 'all', 'each', 'few', 'more', 'most',
                 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own',
                 'same', 'so', 'than', 'too', 'very', 'just', 'and', 'but',
                 'if', 'or', 'because', 'until', 'while', 'it', 'this', 'that',
                 'these', 'those', 'i', 'you', 'he', 'she', 'we', 'they'}
    return [t for t in tokens if t not in stopwords and len(t) > 2]


def extract_key_terms(text: str, top_n: int = 10) -> List[str]:
    """Extract key terms from text based on frequency"""
    tokens = tokenize(text)
    freq = Counter(tokens)
    return [term for term, _ in freq.most_common(top_n)]


def calculate_token_overlap(text1: str, text2: str) -> float:
    """Calculate Jaccard similarity between two texts"""
    tokens1 = set(tokenize(text1))
    tokens2 = set(tokenize(text2))

    if not tokens1 or not tokens2:
        return 0.0

    intersection = tokens1 & tokens2
    union = tokens1 | tokens2

    return len(intersection) / len(union) if union else 0.0


def calculate_rouge_l(reference: str, candidate: str) -> float:
    """Calculate ROUGE-L score (Longest Common Subsequence)"""
    ref_tokens = tokenize(reference)
    cand_tokens = tokenize(candidate)

    if not ref_tokens or not cand_tokens:
        return 0.0

    # LCS using dynamic programming
    m, n = len(ref_tokens), len(cand_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref_tokens[i-1] == cand_tokens[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])

    lcs_length = dp[m][n]

    # F1-like score
    precision = lcs_length / n if n > 0 else 0
    recall = lcs_length / m if m > 0 else 0

    if precision + recall == 0:
        return 0.0

    return 2 * precision * recall / (precision + recall)


def evaluate_context_relevance(question: str, context: str, context_id: str = "") -> ContextEvaluation:
    """Evaluate how relevant a context is to a question"""
    question_terms = set(extract_key_terms(question, 15))
    context_terms = set(extract_key_terms(context, 30))

    covered = question_terms & context_terms
    missing = question_terms - context_terms

    # Calculate relevance based on term coverage and overlap
    term_coverage = len(covered) / len(question_terms) if question_terms else 0
    token_overlap = calculate_token_overlap(question, context)

    # Combined relevance score
    relevance = 0.6 * term_coverage + 0.4 * token_overlap

    return ContextEvaluation(
        context_id=context_id,
        relevance_score=round(relevance, 3),
        token_overlap=round(token_overlap, 3),
        key_terms_covered=list(covered),
        missing_terms=list(missing)
    )


def extract_claims(answer: str) -> List[str]:
    """Extract individual claims from an answer"""
    # Split on sentence boundaries
    sentences = re.split(r'[.!?]+', answer)
    claims = []

    for sentence in sentences:
        sentence = sentence.strip()
        if len(sentence) > 10:  # Filter out very short fragments
            claims.append(sentence)

    return claims


def check_claim_support(claim: str, context: str) -> Tuple[bool, float]:
    """Check if a claim is supported by the context"""
    claim_terms = set(tokenize(claim))
    context_terms = set(tokenize(context))

    if not claim_terms:
        return True, 1.0  # Empty claim is "supported"

    # Check term overlap
    overlap = claim_terms & context_terms
    support_ratio = len(overlap) / len(claim_terms)

    # Also check for ROUGE-L style matching
    rouge_score = calculate_rouge_l(context, claim)

    # Combined support score
    support_score = 0.5 * support_ratio + 0.5 * rouge_score

    return support_score > 0.3, support_score


def evaluate_answer_faithfulness(
    question: str,
    answer: str,
    contexts: List[str],
    question_id: str = ""
) -> AnswerEvaluation:
    """Evaluate if answer is faithful to the provided contexts"""
    claims = extract_claims(answer)
    combined_context = ' '.join(contexts)

    claim_evaluations = []
    supported_claims = 0
    unsupported = []
    context_used = []

    for claim in claims:
        is_supported, score = check_claim_support(claim, combined_context)

        claim_eval = {
            'claim': claim[:100] + '...' if len(claim) > 100 else claim,
            'supported': is_supported,
            'score': round(score, 3)
        }

        # Track which contexts support this claim
        for i, ctx in enumerate(contexts):
            _, ctx_score = check_claim_support(claim, ctx)
            if ctx_score > 0.3:
                claim_eval[f'context_{i}'] = round(ctx_score, 3)
                if f'context_{i}' not in context_used:
                    context_used.append(f'context_{i}')

        claim_evaluations.append(claim_eval)

        if is_supported:
            supported_claims += 1
        else:
            unsupported.append(claim[:100])

    # Faithfulness = % of claims supported
    faithfulness = supported_claims / len(claims) if claims else 1.0

    # Groundedness = average support score
    avg_score = sum(c['score'] for c in claim_evaluations) / len(claim_evaluations) if claim_evaluations else 1.0

    return AnswerEvaluation(
        question_id=question_id,
        faithfulness_score=round(faithfulness, 3),
        groundedness_score=round(avg_score, 3),
        claims=claim_evaluations,
        unsupported_claims=unsupported,
        context_used=context_used
    )


def calculate_retrieval_metrics(
    retrieved: List[str],
    relevant: Set[str],
    k: int = 5
) -> RetrievalMetrics:
    """Calculate standard retrieval metrics"""
    retrieved_k = retrieved[:k]

    # Precision@K
    relevant_in_k = sum(1 for doc in retrieved_k if doc in relevant)
    precision = relevant_in_k / k if k > 0 else 0

    # Recall@K
    recall = relevant_in_k / len(relevant) if relevant else 0

    # MRR (Mean Reciprocal Rank)
    mrr = 0.0
    for i, doc in enumerate(retrieved):
        if doc in relevant:
            mrr = 1.0 / (i + 1)
            break

    # NDCG@K
    dcg = 0.0
    for i, doc in enumerate(retrieved_k):
        rel = 1 if doc in relevant else 0
        dcg += rel / math.log2(i + 2)

    # Ideal DCG (all relevant at top)
    idcg = sum(1 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    ndcg = dcg / idcg if idcg > 0 else 0

    return RetrievalMetrics(
        precision_at_k=round(precision, 3),
        recall_at_k=round(recall, 3),
        mrr=round(mrr, 3),
        ndcg_at_k=round(ndcg, 3),
        k=k
    )


def generate_recommendations(report: RAGEvaluationReport) -> List[str]:
    """Generate actionable recommendations based on evaluation"""
    recommendations = []

    if report.avg_context_relevance < 0.8:
        recommendations.append(
            f"Context relevance ({report.avg_context_relevance:.2f}) is below target (0.80). "
            "Consider: improving chunking strategy, adding metadata filtering, or using hybrid search."
        )

    if report.avg_faithfulness < 0.95:
        recommendations.append(
            f"Faithfulness ({report.avg_faithfulness:.2f}) is below target (0.95). "
            "Consider: adding source citations, implementing fact-checking, or adjusting temperature."
        )

    if report.avg_groundedness < 0.85:
        recommendations.append(
            f"Groundedness ({report.avg_groundedness:.2f}) is below target (0.85). "
            "Consider: using more restrictive prompts, adding 'only use provided context' instructions."
        )

    if report.coverage < 0.9:
        recommendations.append(
            f"Coverage ({report.coverage:.2f}) indicates some questions lack relevant context. "
            "Consider: expanding document corpus, improving embedding model, or adding fallback responses."
        )

    retrieval = report.retrieval_metrics
    if retrieval.get('precision_at_k', 0) < 0.7:
        recommendations.append(
            "Retrieval precision is low. Consider: re-ranking retrieved documents, "
            "using cross-encoder for reranking, or adjusting similarity threshold."
        )

    if not recommendations:
        recommendations.append("All metrics meet targets. Consider A/B testing new improvements.")

    return recommendations


def evaluate_rag_system(
    questions: List[Dict],
    contexts: List[Dict],
    k: int = 5,
    verbose: bool = False
) -> RAGEvaluationReport:
    """Comprehensive RAG system evaluation"""

    all_context_scores = []
    all_faithfulness_scores = []
    all_groundedness_scores = []
    issues = []
    question_details = []

    questions_with_context = 0

    for q_data in questions:
        question = q_data.get('question', q_data.get('query', ''))
        question_id = q_data.get('id', str(questions.index(q_data)))
        answer = q_data.get('answer', q_data.get('response', ''))
        expected = q_data.get('expected', q_data.get('ground_truth', ''))

        # Find contexts for this question
        q_contexts = []
        for ctx in contexts:
            if ctx.get('question_id') == question_id or ctx.get('query_id') == question_id:
                q_contexts.append(ctx.get('content', ctx.get('text', '')))

        # If no specific contexts, use all contexts (for simple datasets)
        if not q_contexts:
            q_contexts = [ctx.get('content', ctx.get('text', ''))
                        for ctx in contexts[:k]]

        if q_contexts:
            questions_with_context += 1

        # Evaluate context relevance
        context_evals = []
        for i, ctx in enumerate(q_contexts[:k]):
            eval_result = evaluate_context_relevance(question, ctx, f"ctx_{i}")
            context_evals.append(eval_result)
            all_context_scores.append(eval_result.relevance_score)

        # Evaluate answer faithfulness
        if answer and q_contexts:
            answer_eval = evaluate_answer_faithfulness(question, answer, q_contexts, question_id)
            all_faithfulness_scores.append(answer_eval.faithfulness_score)
            all_groundedness_scores.append(answer_eval.groundedness_score)

            # Track issues
            if answer_eval.unsupported_claims:
                issues.append({
                    'type': 'unsupported_claim',
                    'question_id': question_id,
                    'claims': answer_eval.unsupported_claims[:3]
                })

        # Check for low relevance contexts
        low_relevance = [e for e in context_evals if e.relevance_score < 0.5]
        if low_relevance:
            issues.append({
                'type': 'low_relevance',
                'question_id': question_id,
                'contexts': [e.context_id for e in low_relevance]
            })

        if verbose:
            question_details.append({
                'question_id': question_id,
                'question': question[:100],
                'context_scores': [asdict(e) for e in context_evals],
                'answer_faithfulness': all_faithfulness_scores[-1] if all_faithfulness_scores else None
            })

    # Calculate aggregates
    avg_context_relevance = sum(all_context_scores) / len(all_context_scores) if all_context_scores else 0
    avg_faithfulness = sum(all_faithfulness_scores) / len(all_faithfulness_scores) if all_faithfulness_scores else 0
    avg_groundedness = sum(all_groundedness_scores) / len(all_groundedness_scores) if all_groundedness_scores else 0
    coverage = questions_with_context / len(questions) if questions else 0

    # Simulated retrieval metrics (based on relevance scores)
    high_relevance = sum(1 for s in all_context_scores if s > 0.5)
    retrieval_metrics = {
        'precision_at_k': round(high_relevance / len(all_context_scores) if all_context_scores else 0, 3),
        'estimated_recall': round(coverage, 3),
        'k': k
    }

    report = RAGEvaluationReport(
        total_questions=len(questions),
        avg_context_relevance=round(avg_context_relevance, 3),
        avg_faithfulness=round(avg_faithfulness, 3),
        avg_groundedness=round(avg_groundedness, 3),
        retrieval_metrics=retrieval_metrics,
        coverage=round(coverage, 3),
        issues=issues[:20],  # Limit to 20 issues
        recommendations=[],
        question_details=question_details if verbose else []
    )

    report.recommendations = generate_recommendations(report)

    return report


def format_report(report: RAGEvaluationReport) -> str:
    """Format report as human-readable text"""
    lines = []
    lines.append("=" * 60)
    lines.append("RAG EVALUATION REPORT")
    lines.append("=" * 60)
    lines.append("")

    lines.append(f"📊 SUMMARY")
    lines.append(f"  Questions evaluated: {report.total_questions}")
    lines.append(f"  Coverage: {report.coverage:.1%}")
    lines.append("")

    lines.append("📈 RETRIEVAL METRICS")
    lines.append(f"  Context Relevance:    {report.avg_context_relevance:.2f} {'✅' if report.avg_context_relevance >= 0.8 else '⚠️'} (target: >0.80)")
    lines.append(f"  Precision@{report.retrieval_metrics.get('k', 5)}:         {report.retrieval_metrics.get('precision_at_k', 0):.2f}")
    lines.append("")

    lines.append("📝 GENERATION METRICS")
    lines.append(f"  Answer Faithfulness:  {report.avg_faithfulness:.2f} {'✅' if report.avg_faithfulness >= 0.95 else '⚠️'} (target: >0.95)")
    lines.append(f"  Groundedness:         {report.avg_groundedness:.2f} {'✅' if report.avg_groundedness >= 0.85 else '⚠️'} (target: >0.85)")
    lines.append("")

    if report.issues:
        lines.append(f"⚠️ ISSUES FOUND ({len(report.issues)})")
        for issue in report.issues[:10]:
            if issue['type'] == 'unsupported_claim':
                lines.append(f"  Q{issue['question_id']}: {len(issue.get('claims', []))} unsupported claim(s)")
            elif issue['type'] == 'low_relevance':
                lines.append(f"  Q{issue['question_id']}: Low relevance contexts: {issue.get('contexts', [])}")
        if len(report.issues) > 10:
            lines.append(f"  ... and {len(report.issues) - 10} more issues")
        lines.append("")

    lines.append("💡 RECOMMENDATIONS")
    for i, rec in enumerate(report.recommendations, 1):
        lines.append(f"  {i}. {rec}")
    lines.append("")

    lines.append("=" * 60)

    return '\n'.join(lines)


def main():
    parser = argparse.ArgumentParser(
        description="RAG Evaluator - Evaluate Retrieval-Augmented Generation systems",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  %(prog)s --contexts contexts.json --questions questions.json
  %(prog)s --contexts ctx.json --questions q.json --k 10
  %(prog)s --contexts ctx.json --questions q.json --output report.json --verbose

Input file formats:

questions.json:
[
  {"id": "q1", "question": "What is X?", "answer": "X is..."},
  {"id": "q2", "question": "How does Y work?", "answer": "Y works by..."}
]

contexts.json:
[
  {"question_id": "q1", "content": "Retrieved context text..."},
  {"question_id": "q2", "content": "Another context..."}
]
        """
    )

    parser.add_argument('--contexts', '-c', required=True, help='JSON file with retrieved contexts')
    parser.add_argument('--questions', '-q', required=True, help='JSON file with questions and answers')
    parser.add_argument('--k', type=int, default=5, help='Number of top contexts to evaluate (default: 5)')
    parser.add_argument('--output', '-o', help='Output file for detailed report (JSON)')
    parser.add_argument('--json', '-j', action='store_true', help='Output as JSON instead of text')
    parser.add_argument('--verbose', '-v', action='store_true', help='Include per-question details')
    parser.add_argument('--compare', help='Compare with baseline report JSON')

    args = parser.parse_args()

    # Load input files
    contexts_path = Path(args.contexts)
    questions_path = Path(args.questions)

    if not contexts_path.exists():
        print(f"Error: Contexts file not found: {args.contexts}", file=sys.stderr)
        sys.exit(1)

    if not questions_path.exists():
        print(f"Error: Questions file not found: {args.questions}", file=sys.stderr)
        sys.exit(1)

    try:
        contexts = json.loads(contexts_path.read_text(encoding='utf-8'))
        questions = json.loads(questions_path.read_text(encoding='utf-8'))
    except json.JSONDecodeError as e:
        print(f"Error: Invalid JSON format: {e}", file=sys.stderr)
        sys.exit(1)

    # Run evaluation
    report = evaluate_rag_system(questions, contexts, k=args.k, verbose=args.verbose)

    # Compare with baseline
    if args.compare:
        baseline_path = Path(args.compare)
        if baseline_path.exists():
            baseline = json.loads(baseline_path.read_text())
            print("\n📊 COMPARISON WITH BASELINE")
            print(f"  Relevance:    {baseline.get('avg_context_relevance', 0):.2f} -> {report.avg_context_relevance:.2f}")
            print(f"  Faithfulness: {baseline.get('avg_faithfulness', 0):.2f} -> {report.avg_faithfulness:.2f}")
            print(f"  Groundedness: {baseline.get('avg_groundedness', 0):.2f} -> {report.avg_groundedness:.2f}")
            print()

    # Output
    if args.json:
        print(json.dumps(asdict(report), indent=2))
    else:
        print(format_report(report))

    # Save to file
    if args.output:
        Path(args.output).write_text(json.dumps(asdict(report), indent=2))
        print(f"\nDetailed report saved to {args.output}")


if __name__ == '__main__':
    main()

Related skills

Setup Matt Pocock SkillsScaffold the per-repo configuration that Matt Pocock’s engineering agent skills rely on so they understand the issue tracker, triage labels, and domain documentation la462k185k

Lark Skill MakerQuickly turn any Lark/Feishu OpenAPI call or multi-step workflow into a reusable agent skill with its own SKILL.md.379k15.8k

CavemanSlash token usage by roughly 75% while keeping every technical detail intact when working with Claude Code, Cursor or similar agents.378k92.5k

Lark AppsConnect Claude, Cursor or custom agents directly to Lark (Feishu) for messaging, document automation, approval workflows and enterprise data access.375k

Running Claude Code Via Litellm CopilotRun Claude Code at a fraction of the cost by routing requests through LiteLLM to the GitHub Copilot Chat API.270k72

Codex PetGenerate a complete Codex Pet spritesheet and metadata from one reference image without needing an OpenAI key or Codex Pro.246k8

How it compares

Pick senior-prompt-engineer over basic prompt tips when RAG evaluation, few-shot design, and agent architecture must be systematic rather than one-off.

FAQ

What tools does senior-prompt-engineer include?

senior-prompt-engineer bundles a Prompt Optimizer, RAG Evaluator, and Agent Orchestration guidance alongside patterns for few-shot design, token analysis, and structured output architecture.

When should developers invoke senior-prompt-engineer?

senior-prompt-engineer should run when developers need to optimize prompts, evaluate LLM or RAG outputs, implement retrieval pipelines, or design reliable multi-step agentic AI workflows with measurable quality gates.

Is Senior Prompt Engineer safe to install?

skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

AI & Agent Buildingagentsllmautomation

About

Senior Prompt Engineer by the numbers

Add your badge

How do you evaluate and optimize RAG prompt pipelines?

Who is it for?

When should I use this skill?

What you get

By the numbers

Files

Senior Prompt Engineer

Operating Rules

Tools (exact CLIs, all stdlib)

1. Prompt Optimizer — scripts/prompt_optimizer.py

2. RAG Evaluator — scripts/rag_evaluator.py

3. Agent Orchestrator — scripts/agent_orchestrator.py

Workflows

Prompt Optimization (eval-gated)

Few-Shot Example Design

Structured Output Design

RAG Tuning Loop

Agent Config Review

References

Related Skills

Agentic System Design

Architectures Index

1. ReAct Pattern

Architecture

Pseudocode

Prompt Template

When to Use

2. Plan-and-Execute

Architecture

Pseudocode

When to Use

3. Tool Use / Function Calling

Tool Definition Schema

Implementation Pattern

Tool Design Best Practices

4. Multi-Agent Collaboration

Orchestration Patterns

Pseudocode: Hierarchical Multi-Agent

Communication Protocols

5. Memory and State Management

Memory Types

Implementation

State Persistence Patterns

6. Agent Design Patterns

Pattern: Reflection

Pattern: Self-Ask

Pattern: Expert Routing

Quick Reference: Pattern Selection

LLM Evaluation Frameworks

Frameworks Index

1. Evaluation Metrics Overview

Metric Categories

Choosing the Right Metric

2. Text Generation Metrics

BLEU (Bilingual Evaluation Understudy)

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

BERTScore

3. RAG-Specific Metrics

Context Relevance

Answer Faithfulness

Retrieval Metrics

4. Human Evaluation Frameworks

Likert Scale Evaluation

Comparative Evaluation (Side-by-Side)

LLM-as-Judge

5. A/B Testing for Prompts

Experiment Design

Metrics to Track

Analysis Framework

6. Benchmark Datasets

General NLP Benchmarks

RAG Benchmarks

Creating Custom Benchmarks

7. Evaluation Pipeline Design

Automated Evaluation Pipeline

Implementation Checklist

Quick Reference: Metric Selection

1. Prompt Optimizer — `scripts/prompt_optimizer.py`

2. RAG Evaluator — `scripts/rag_evaluator.py`

3. Agent Orchestrator — `scripts/agent_orchestrator.py`